Sweep Next-Edit: Open Source Fast Local Code Autocomplete Model
Sweep open-sourced a 1.5B autocomplete model that beats 7B competitors by fixing what others got wrong: poor tokenization choices (like boundary markers that tokenize to 7 tokens each) and chat templates with untrained tokens hurt performance more than extra parameters help.
How Sweep used a genetic algorithm to find the best diff format
- Many ways to represent diffs: unified diff, side-by-side, original/updated blocks, old/new labels, with or without .diff markers.
- Used a genetic algorithm to search this space instead of picking one manually.
The genetic algorithm process
Initialize population: started with 10 prompt format variants, each with different combinations of diff representation choices.
Evaluate fitness: ran inference on a held-out validation set, measured exact-match accuracy.
Selection and breeding: kept top performers, created new variants by combining elements from winners (e.g., diff markers from one format, label style from another).
Mutation: randomly tweaked some formats to explore new variations.
Ran for 3 generations, evaluated ~30 different prompt formats total.
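The search loop described above can be sketched as follows. The format knobs, fitness function, and hyperparameters here are illustrative stand-ins, not Sweep's actual search space or evaluator:

```python
import random

# Hypothetical prompt-format knobs; the real search space is Sweep's.
CHOICES = {
    "diff_style": ["unified", "side_by_side", "original_updated"],
    "labels": ["old/new", "original/updated", "none"],
    "diff_markers": [True, False],
}

def random_format():
    return {k: random.choice(v) for k, v in CHOICES.items()}

def crossover(a, b):
    # Take each field from one of the two parent formats.
    return {k: random.choice([a[k], b[k]]) for k in CHOICES}

def mutate(fmt, rate=0.2):
    # Randomly re-roll some fields to explore new variations.
    return {k: (random.choice(CHOICES[k]) if random.random() < rate else v)
            for k, v in fmt.items()}

def evolve(fitness, pop_size=10, generations=3, keep=4):
    """Keep the top formats each generation, breed and mutate the rest."""
    pop = [random_format() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:keep]
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(pop_size - keep)]
        pop = parents + children
    return max(pop, key=fitness)
```

In the real run, `fitness` was exact-match accuracy from inference on a held-out validation set, which makes each generation expensive; that is why the whole search only covered ~30 formats.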
What won
- original/updated blocks without unified diff syntax, which surprised them at first.
- Of the final winning formats, this was the simplest one.
Why original/updated blocks beat unified diffs for LLMs
They ran a genetic algorithm over ~30 prompt format variants across 3 generations to find the optimal diff representation.
Started with 10 variants, evaluated exact-match accuracy on a validation set, bred top performers, mutated, repeated.
The winner was original/updated blocks without unified diff syntax, which surprised them at first.
Unified diff:
if (params.length >= 8 && params[7] != null) {
- linkState = LinkState.valueOf(params[7].toUpperCase());
+ linkState = LinkState.valueOf(params[7].toUpperCase(Locale.ENGLISH));
}
original/updated blocks:
original:
if (params.length >= 8 && params[7] != null) {
linkState = LinkState.valueOf(params[7].toUpperCase());
}
updated:
if (params.length >= 8 && params[7] != null) {
linkState = LinkState.valueOf(params[7].toUpperCase(Locale.ENGLISH));
}
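The two representations above can be generated from the same edit with the standard library; `difflib.unified_diff` is Python's built-in unified-diff renderer, and the block format is plain string assembly:

```python
import difflib

original = [
    "if (params.length >= 8 && params[7] != null) {",
    "    linkState = LinkState.valueOf(params[7].toUpperCase());",
    "}",
]
updated = [
    "if (params.length >= 8 && params[7] != null) {",
    "    linkState = LinkState.valueOf(params[7].toUpperCase(Locale.ENGLISH));",
    "}",
]

# Unified diff: interleaves context, "-", and "+" lines plus @@ hunk headers.
unified = "\n".join(difflib.unified_diff(original, updated, lineterm=""))

# original/updated blocks: both versions appear whole, with no diff syntax.
blocks = ("original:\n" + "\n".join(original)
          + "\nupdated:\n" + "\n".join(updated))
```

The block format is longer, but every line the model emits is a complete, valid source line; nothing has to be reconstructed from +/- markers.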
How LLMs actually see unified diffs
- Humans read diffs on a 2D screen, so we can visually align the - and + lines and spot what changed.
- LLMs see it as a single stream of tokens with \n characters, which is much harder to parse.
- Qwen's technical report doesn't mention unified diffs in pretraining data, so the format is out of distribution.
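A quick way to see the problem is to print the hunk the way the model receives it, as one flat string. The lines a human reads as vertically adjacent are separated by the full length of the removed line:

```python
unified_hunk = (
    " if (params.length >= 8 && params[7] != null) {\n"
    "-    linkState = LinkState.valueOf(params[7].toUpperCase());\n"
    "+    linkState = LinkState.valueOf(params[7].toUpperCase(Locale.ENGLISH));\n"
    " }"
)

# Make the newlines visible: this is the 1D stream the model sees.
flat = unified_hunk.replace("\n", "\\n")

minus_at = flat.index("\\n-")
plus_at = flat.index("\\n+")
gap = plus_at - minus_at  # characters between the - line and its + pair
```

To compare the two versions of the line, the model has to align content across that gap in the stream, with only `\n-` and `\n+` as anchors.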
Why this matters beyond Sweep
- Claude Sonnet prefers str_replace over unified diffs for the same reason: it's closer to pretraining data and easier to generate.
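A str_replace-style edit shares the key property of original/updated blocks: the model emits whole snippets rather than diff syntax. A minimal sketch (function name and uniqueness rule are the common convention, not any specific tool's API):

```python
def str_replace(source: str, old: str, new: str) -> str:
    """Apply an edit given as an exact old snippet and its replacement.

    Requires the old snippet to occur exactly once, so the edit site is
    unambiguous without line numbers or +/- markers.
    """
    count = source.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    return source.replace(old, new)
```

The model only has to copy the region it wants to change and emit the replacement verbatim, both of which look like ordinary code from pretraining.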
TLDR
• Competitors like Instinct and Zeta fail because they use chat templates with untrained tokens and boundary markers that tokenize to 7 tokens each—causing the model to frequently misplace region boundaries
• Sweep ran a genetic algorithm across 30 prompt formats and found original/updated blocks outperform unified diffs because they're closer to pretraining data distribution
• They use a fixed 21-line sliding window instead of AST-based boundaries—more consistent for training, and with n-gram speculative decoding it completes in sub-100ms
• After SFT on 100k examples, they ran 2000 steps of RL with parse rewards to eliminate unparseable outputs and oversized diffs
• Result: 67.82% accuracy vs 54.09% for Mercury Coder, running locally in under 500ms
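The fixed 21-line window from the bullets above is simple to state in code; the function and parameter names here are illustrative, not Sweep's:

```python
def edit_window(lines, cursor_line, above=10, below=10):
    """Return the fixed editable region around the cursor: up to 10 lines
    above and 10 below (21 total), clamped at file boundaries.

    A fixed window means the model always predicts a bounded, predictable
    region, unlike variable-length AST-based boundaries.
    """
    start = max(0, cursor_line - above)
    end = min(len(lines), cursor_line + below + 1)
    return lines[start:end], start
```

The returned `start` offset lets the caller splice the model's rewritten window back into the file.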
In Detail
Sweep's core insight is that existing next-edit models fail due to fundamental tokenization and formatting mistakes, not insufficient parameters. Instinct uses Qwen2.5-Coder's chat template tokens that were never seen during pretraining, and both Instinct and Zeta use <|editable_region_start|> markers that tokenize to 7 tokens each. This causes the most common failure mode: misplacing the end marker and generating extra or missing trailing lines. They also include unnecessary system prompts like "You are Instinct, developed by Continue" that add noise disproportionately affecting small models.
Sweep's solution involved three key innovations. First, they optimized the diff format by running a genetic algorithm across ~30 prompt variants over 3 generations. The winning format used original/updated blocks instead of unified diffs—counterintuitive but more readable for LLMs since unified diffs require parsing 2D visual alignment in a 1D token stream. Second, they switched from variable-length AST-based boundaries to a fixed 21-line sliding window (10 lines above/below cursor). While AST boundaries seemed elegant, they caused training instability from inconsistent output lengths and forced the model to learn both content generation and boundary prediction simultaneously. Third, they properly leveraged Qwen's pretrained special tokens like <|file_sep|> instead of inventing new ones.
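Qwen2.5-Coder's repo-level pretraining format joins files with the <|file_sep|> special token followed by the file path. A minimal sketch of reusing that convention for context assembly (the overall prompt layout here is an assumption, not Sweep's exact format):

```python
# Special token Qwen2.5-Coder saw during pretraining; reusing it means the
# separator already carries meaning for the base model.
FILE_SEP = "<|file_sep|>"

def build_context(files):
    """Join (path, text) pairs the way Qwen's repo-level pretraining data
    does. Illustrative sketch only: the rest of Sweep's prompt is not shown.
    """
    return "".join(f"{FILE_SEP}{path}\n{text}\n" for path, text in files)
```

Contrast this with inventing a new marker string: a made-up token like `<|editable_region_start|>` splinters into many subword tokens the model has never seen used meaningfully.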
Their training process started with SFT on 100k examples from popular permissively-licensed repos, upsampled to match JetBrains language distribution (Java, Kotlin, C#, PHP, Ruby). Then they ran 2000 steps of on-policy RL with tree-sitter parse rewards and diff size regularization to eliminate the model's tendency to generate unparseable code or oversized changes. The final 1.5B model achieves 67.82% exact-match accuracy versus 54.09% for Mercury Coder (the next best model), while running locally in under 500ms with their custom TensorRT-LLM fork using FP8 quantization and n-gram speculative decoding.
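N-gram speculative decoding works well here because a next-edit output mostly copies its input. A prompt-lookup-style drafting sketch (the production implementation lives in their TensorRT-LLM fork; this only illustrates the idea):

```python
def ngram_propose(tokens, n=3, max_draft=8):
    """Draft tokens by matching the last n tokens earlier in the sequence
    and proposing whatever followed that earlier occurrence.

    In next-edit prediction the output largely repeats the input window,
    so drafts are accepted often and decoding finishes in few model steps.
    """
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # Search right-to-left for the most recent earlier occurrence of the tail.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + max_draft]
    return []
```

The target model then verifies the drafted tokens in a single forward pass, accepting the longest matching prefix, which is how sub-100ms completions become possible on a 1.5B model.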