Next-edit autocomplete differs from standard autocomplete by using your recent edits as context when predicting completions. The model is small enough to run locally while outperforming models 4x its size on both speed and accuracy.
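To make that concrete, here's a rough sketch of how recent edits could be serialized into the prompt alongside the text around the cursor (the event structure and markers are illustrative, not our exact interface):

```python
from dataclasses import dataclass

@dataclass
class EditEvent:
    """One recent edit: the file it touched and the before/after text (illustrative)."""
    path: str
    before: str
    after: str

def build_context(recent_edits: list[EditEvent], prefix: str, suffix: str,
                  max_edits: int = 5) -> str:
    """Prepend the last few edit events to the text surrounding the cursor."""
    parts = []
    for edit in recent_edits[-max_edits:]:  # keep only the most recent edits
        parts.append(f"### edit in {edit.path}\n{edit.before}\n-->\n{edit.after}")
    parts.append(f"### current file\n{prefix}<|cursor|>{suffix}")
    return "\n\n".join(parts)
```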
We tested against Mercury (Inception), Zeta (Zed), and Instinct (Continue) across five benchmarks: next-edit above the cursor, next-edit below the cursor, tab-to-jump for distant changes, standard fill-in-the-middle (FIM), and noisiness. We found exact-match accuracy correlates best with real usability, because code is precise and the solution space is small.
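Concretely, exact match just means the predicted edit string equals the reference; a minimal scoring sketch (the whitespace normalization is an illustrative choice, not necessarily what the benchmarks use):

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the reference edit."""
    def normalize(text: str) -> str:
        # Illustrative normalization: strip trailing whitespace per line.
        return "\n".join(line.rstrip() for line in text.strip().splitlines())

    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references) if references else 0.0
```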
Prompt format turned out to matter more than we expected. We ran a genetic algorithm over 30+ diff formats and found simple `original`/`updated` blocks beat unified diffs. The verbose format is just easier for smaller models to understand.
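To show what we mean, here's the same change rendered both ways (the block markers here are illustrative, not the exact tokens in our prompt):

```python
# One edit rendered as original/updated blocks vs. a unified diff (markers illustrative).

ORIGINAL_UPDATED = """\
<<<original
def greet(name):
    print("Hello " + name)
original>>>
<<<updated
def greet(name: str) -> None:
    print(f"Hello {name}")
updated>>>
"""

UNIFIED_DIFF = """\
--- a/greet.py
+++ b/greet.py
@@ -1,2 +1,2 @@
-def greet(name):
-    print("Hello " + name)
+def greet(name: str) -> None:
+    print(f"Hello {name}")
"""
```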
Training was SFT on ~100k examples from permissively licensed repos (4 hrs on 8xH100), then RL for 2,000 steps with tree-sitter parse checking and size regularization. The RL step fixes edge cases SFT can't, like generating code that doesn't parse or producing overly verbose output.
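As a rough sketch of the reward shape, assuming recent tree-sitter Python bindings plus a grammar package like `tree_sitter_python` (the weights and length penalty are illustrative, not our exact reward):

```python
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

def reward(completion: str, reference: str, length_weight: float = 0.01) -> float:
    """Illustrative reward: penalize code that fails to parse and overly long outputs."""
    tree = parser.parse(completion.encode("utf-8"))
    parse_ok = not tree.root_node.has_error  # tree-sitter flags syntax errors

    # Size regularization: penalize outputs much longer than the reference edit.
    extra_lines = max(0, len(completion.splitlines()) - len(reference.splitlines()))

    base = 1.0 if parse_ok else -1.0
    return base - length_weight * extra_lines
```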
We're open-sourcing the weights so the community can build fast, privacy-preserving autocomplete for any editor. If you're building for VSCode, Neovim, or something else, we'd love to see what you make with it!
plutodev•3h ago
The diff-format insight is especially interesting. Smaller models struggling with unified diffs lines up with what I've seen too: simpler original/updated blocks reduce noise and improve intent capture.
On the infra side, training a 1.5B model in ~4 hours on 8×H100 is impressive. For folks experimenting with similar mid-scale models, we’ve been running comparable workloads on decentralized GPU aggregators (I’ve used io.net) to avoid cloud quota limits and keep costs predictable with the tradeoff that you handle orchestration yourself.
Curious if you saw diminishing returns when including older edits as context? That cutoff seems tricky in larger repos.
kouteiheika•34m ago
It's hard to compare without more details about the training process and the dataset, but... is it? Genuine question, because I had the opposite impression. For example, I recently did a full finetuning run on a 3B model chewing through a 146k-entry dataset (116k of those entries have reasoning traces, so they're not short) in 7 hours on a single RTX 6000.