This data is very valuable if you're trying to create fully automated SWEs; most foundation model providers have probably been scraping together second-hand data to simulate long-horizon engineering work. Cursor probably has far more of this data, and I wonder how Microsoft's own Copilot is doing (and how they share this data with the foundation model providers)...
Open source alternative: https://huggingface.co/SWE-bench/SWE-agent-LM-32B
though I haven't been able to find an MLX quant that wasn't completely broken.
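For anyone who wants to try it, here's a minimal sketch of loading the full-precision weights with Hugging Face transformers instead of a quant (the repo id is real; the prompt and generation settings are just for illustration):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "SWE-bench/SWE-agent-LM-32B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # ~64 GB of weights at bf16
        device_map="auto",           # shard across available GPUs
    )

    prompt = "Fix the failing test in tests/test_parser.py."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    print(tokenizer.decode(out[0], skip_special_tokens=True))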
The definition of vibe coding: trust the process, let it make errors and recover.
Their stated goal is to improve on the frontier models. It's ambitious, but on the other hand they were a model company before they were an IDE company (IIRC), they have a lot of data, and the scope is limited to a model specialized for their specific use case.
At the very least I would expect them to succeed in specializing a frontier model for their use case by feeding it their pipeline of data (whether they should have that data to begin with is another question).
The blog post doesn't say much about the model itself, but there are a few candidates to fine-tune from.
Cynical take: describing yourself as a full-stack AI IDE company sounds very investable in a "what if they're right" kind of way. They could plausibly ask for higher valuations, etc.
Optimistic take: fine-tuning a model for their use case (incomplete code snippets with a very specific data model of context) should work; by their claims, it already has. It certainly sounds plausible that fine-tuning a frontier model would make it better for their needs. Whether it's reasonable to go beyond fine-tuning and consider pre-training etc., I don't know. If I remember correctly they were a model company before Windsurf, so they have the skill set.
Bonus take: doesn't this mean they're basically training on large-scale gathered user data?
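To make "a very specific data model of context" concrete, here is a purely hypothetical shape for one training record (the field names are invented for illustration; Windsurf's actual schema isn't public):

    # Hypothetical record: IDE context + partial snippet -> target edit.
    # Every field name here is invented, not Windsurf's schema.
    record = {
        "context": {
            "open_files": ["src/server.py", "src/routes.py"],
            "cursor": {"file": "src/server.py", "line": 42},
            "diagnostics": ["NameError: name 'Response' is not defined"],
        },
        "snippet": "def handle(req):\n    # TODO",
        "completion": "def handle(req):\n    return Response(req.json())",
    }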
kcorbitt•4h ago
Most likely they built this as a post-train of an open model that is already strong on coding, like Qwen 2.5.
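As a rough illustration, a supervised fine-tuning step on an open coder base might look like this (the model id is real; the serialized example and hyperparameters are assumptions, not Windsurf's actual recipe):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = "Qwen/Qwen2.5-Coder-7B"  # stand-in open base model
    tok = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # One training example: serialized IDE context + snippet -> target edit.
    text = (
        "<context>file: src/server.py, cursor: line 42</context>\n"
        "<snippet>def handle(req):\n    # TODO</snippet>\n"
        "<edit>def handle(req):\n    return Response(req.json())</edit>"
    )
    batch = tok(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss  # causal LM loss
    loss.backward()
    opt.step()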
rfoo•2h ago
It is very puzzling why "wrapper" companies don't (and religiously say they won't ever) do something on this front. The only barrier is talent.
anshumankmr•2h ago
That being said, I'm sure a lot of the so-called wrapper companies are paying insanely well too, but competing with FAANGMULA might be trickier for them.
dyl000•3h ago
For coding you use Anthropic or Google models; I haven't found anyone who swears by OpenAI models for coding. Their reasoning models are either too expensive or hallucinate massively to the point of being useless. I would assume the GPT-4.1 family will be popular with SWEs.
Having a smaller-scope model (agentic coding only) allows for much cheaper inference and lets Windsurf build its own moat (so far agentic IDEs haven't had a moat).
jjani•3h ago
This suggests OpenAI models do have tasks they're better at than the "less rounded" competition, who in turn have tasks they're weaker in. Could you name a single such task (image generation aside, as it's an entirely different use case) that OpenAI models are better at than Gemini 2.5 and Claude 3.7, without costing at least 5x as much?
jstummbillig•5m ago