
Sweep Next-Edit: Open Source Fast Local Code Autocomplete Model

Sweep open-sources a 1.5B parameter code autocomplete model that beats 7B competitors by fixing tokenization mistakes everyone else made—and explains exactly why Zeta and Instinct fail so badly.

Summary

• Other models fail because they apply chat templates using tokens the base model was never trained on, plus boundary markers that tokenize into 7 tokens each, so the model can't learn to place <|editable_region_end|> correctly
• Sweep's format uses original/updated diff blocks instead of unified diffs because LLMs see them as 1D token streams, not 2D visual diffs humans read
• They ran a genetic algorithm over 30 prompt format variants to find the optimal combination, then used RL with parse rewards to fix bad habits like generating unparseable code
• Fixed 21-line sliding windows train better than AST-based boundaries because the model knows exactly how many lines to output instead of learning both content and boundary prediction
• Result: 67.82% accuracy at 1.5B params vs 25-55% for competitors, sub-500ms with custom TensorRT-LLM + n-gram speculative decoding

Sweep's core thesis is that open-source next-edit models fail not from lack of scale, but from suboptimal formatting and tokenization choices. They demonstrate this by showing how Zeta and Instinct make critical mistakes: applying chat templates to Qwen2.5-Coder-Base even though the base model was never trained on those chat tokens, using boundary markers like <|editable_region_start|> that tokenize poorly (7 tokens each), and including excessive instructions that add noise, which disproportionately hurts small models. The most common failure mode is misplacing the end boundary marker, which leads to missing or extraneous trailing lines.
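
To see the tokenization problem concretely, a quick check with the Hugging Face tokenizer (a sketch assuming Qwen/Qwen2.5-Coder-1.5B; the exact sub-token split and counts depend on the tokenizer version) shows why a made-up marker costs several tokens while a pretrained special token costs one:

```python
# A minimal sketch, assuming the Qwen2.5-Coder tokenizer from Hugging Face;
# the exact sub-token split may vary with tokenizer version.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-1.5B")

# A marker the base model never saw fragments into several subword pieces,
# so the model has to reproduce an unfamiliar multi-token sequence exactly.
custom = "<|editable_region_end|>"
print(tok.tokenize(custom))
print(len(tok.encode(custom, add_special_tokens=False)))  # several tokens, not one

# A token that already exists in the pretrained vocabulary stays a single id.
pretrained = "<|fim_middle|>"
print(len(tok.encode(pretrained, add_special_tokens=False)))  # 1
```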

Their solution involves three key innovations. First, they use original/updated diff blocks instead of unified diffs, because models process tokens as a 1D stream: what humans read as an aligned 2D diff becomes incomprehensible token soup to an LLM. They validated this with a genetic algorithm that evaluated roughly 30 prompt format variants over 3 generations. Second, they use fixed 21-line sliding windows instead of AST-based boundaries, because variable-length outputs made training unstable and forced the model to learn both content generation and boundary prediction simultaneously. Third, they provide recent changes in separate file blocks using Qwen's pretrained special tokens rather than custom markers.
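
Pulling those three choices together, here is a hypothetical sketch of how such a prompt might be assembled; the marker strings and layout below are placeholders for illustration, not Sweep's published format:

```python
# A hypothetical prompt assembler reflecting the three choices above; the
# concrete markers and layout are placeholders, not Sweep's actual format.
WINDOW_LINES = 21                 # fixed-size editable window, per the post
FILE_SEP = "<|file_sep|>"         # a Qwen2.5-Coder pretrained special token

def render_change(old: str, new: str) -> str:
    # Original/updated blocks instead of a unified diff: the model reads the
    # before and after text as plain 1D token runs, with no +/- column alignment.
    return f"<<<<<<< original\n{old}\n=======\n{new}\n>>>>>>> updated"

def build_prompt(file_lines: list[str], cursor_line: int,
                 recent_changes: list[tuple[str, str]]) -> str:
    # Fixed 21-line window around the cursor: the model always rewrites exactly
    # this many lines, so it never has to decide where the editable region ends.
    start = max(0, cursor_line - WINDOW_LINES // 2)
    window = "\n".join(file_lines[start:start + WINDOW_LINES])

    # Recent edits go in separate file blocks delimited by a token the base
    # model already knows, rather than by untrained custom markers.
    changes = f"\n{FILE_SEP}\n".join(render_change(o, n) for o, n in recent_changes)

    return f"{changes}\n{FILE_SEP}\n{window}\n"
```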
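
The genetic search over formats can be illustrated with a small sketch as well; the candidate fields, population size, and fitness function here are invented for illustration and are not Sweep's actual search space:

```python
# An illustrative genetic search over prompt-format variants; the fields and
# fitness function are assumptions, not Sweep's actual search space.
import random

FORMAT_CHOICES = {
    "diff_style":     ["unified", "original_updated"],
    "window":         ["ast_node", "fixed_21_lines"],
    "recent_changes": ["inline", "separate_file_blocks"],
    "instructions":   ["verbose", "minimal"],
}

def random_variant():
    return {k: random.choice(v) for k, v in FORMAT_CHOICES.items()}

def crossover(a, b):
    return {k: random.choice([a[k], b[k]]) for k in FORMAT_CHOICES}

def mutate(variant, rate=0.2):
    return {k: random.choice(opts) if random.random() < rate else variant[k]
            for k, opts in FORMAT_CHOICES.items()}

def evolve(fitness, population_size=10, generations=3):
    # fitness(variant) -> held-out accuracy after a short fine-tune, supplied by the caller.
    # 10 variants over 3 generations lands near the ~30 evaluations mentioned in the post.
    population = [random_variant() for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: population_size // 2]
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(population_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)
```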

The training process combines supervised fine-tuning on 100k examples from popular permissively licensed repos (upsampled to match the JetBrains language distribution) followed by 2000 steps of on-policy RL. The RL phase uses tree-sitter parse rewards and regularization to correct the SFT model's bad habits: generating unparseable code and making edits that are larger than necessary. The result is a 1.5B model achieving 67.82% exact-match accuracy versus 25-55% for competitors, running in under 500ms via a custom TensorRT-LLM build with FP8 quantization and n-gram speculative decoding.
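
The parse reward described for the RL phase might look something like the sketch below, which uses the tree-sitter Python bindings (recent API); the penalty weight and the edit-size regularizer are assumptions:

```python
# A minimal sketch of a parse-based reward with an edit-size regularizer,
# using the tree-sitter Python bindings (>= 0.22 API); weights are assumptions.
from tree_sitter import Language, Parser
import tree_sitter_python  # grammar package; any supported language works

PARSER = Parser(Language(tree_sitter_python.language()))

def edit_reward(completion: str, original_window: str) -> float:
    # Reward syntactically valid output: penalize completions whose tree has errors.
    tree = PARSER.parse(completion.encode("utf-8"))
    parse_score = 0.0 if tree.root_node.has_error else 1.0

    # Regularize edit size: discourage rewriting far more lines than necessary
    # (a rough line-level count; zip truncates to the shorter side).
    changed = sum(a != b for a, b in zip(original_window.splitlines(),
                                         completion.splitlines()))
    changed += abs(len(original_window.splitlines()) - len(completion.splitlines()))

    return parse_score - 0.05 * changed
```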
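
On the inference side, n-gram speculative decoding (also called prompt-lookup decoding) drafts tokens by matching the generation's recent suffix against earlier context, which suits code edits where most output lines are copied verbatim from the input. A toy drafter, not Sweep's TensorRT-LLM implementation, might look like:

```python
# A toy n-gram (prompt-lookup) drafter: propose the tokens that followed the most
# recent earlier occurrence of the current suffix; the target model then verifies
# the whole draft in one forward pass. Parameters are illustrative.
def ngram_draft(context_ids: list[int], n: int = 3, max_draft: int = 8) -> list[int]:
    if len(context_ids) <= n:
        return []
    suffix = context_ids[-n:]
    # Scan backwards for an earlier occurrence of the current n-gram suffix.
    for start in range(len(context_ids) - n - 1, -1, -1):
        if context_ids[start:start + n] == suffix:
            # Propose the tokens that followed it as the speculative continuation.
            return context_ids[start + n:start + n + max_draft]
    return []
```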