This data is very valuable if you're trying to create fully automated SWEs; most foundation model providers have probably been scraping together second-hand data to simulate long-horizon engineering work. Cursor probably has far more of this data, and I wonder how Microsoft's own Copilot is doing (and how they share this data with the foundation model providers)...
Open source alternative: https://huggingface.co/SWE-bench/SWE-agent-LM-32B
though I haven't been able to find an MLX quant that wasn't completely broken.
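For anyone who wants to try it, here's a minimal sketch of loading the full-precision weights with Hugging Face transformers instead of a quant (the repo id is real; the prompt and generation settings are just for illustration):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "SWE-bench/SWE-agent-LM-32B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # ~64 GB of weights at bf16
        device_map="auto",           # shard across available GPUs
    )

    prompt = "Fix the failing test in tests/test_parser.py."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    print(tokenizer.decode(out[0], skip_special_tokens=True))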
The definition of vibe coding: trust the process, let it make errors and recover.
Their stated goal is to improve on the frontier models. It's ambitious, but on the other hand they were a model company before they were an IDE company (IIRC), they have a lot of data, and the scope is limited to a model specialized for their specific use case.
At the very least I would expect them to succeed in specializing a frontier model for their use case by feeding it their pipeline of data (whether they should have that data to begin with is another question).
The blog post doesn't say much about the model itself, but there are a few candidates to fine-tune from.
Cynical take: describing yourself as a full-stack AI IDE company sounds very investable in a "what if they're right" kind of way. They could plausibly ask for higher valuations, etc.
Optimistic take: fine-tuning a model for their use case (incomplete code snippets with a very specific data model of context) should work; by their claims, it already has. It certainly sounds plausible that fine-tuning a frontier model would make it better for their needs. Whether it's reasonable to go beyond fine-tuning and consider pre-training etc., I don't know. If I remember correctly they were a model company before Windsurf, so they have the skill set.
Bonus take: doesn't this mean they're basically training on large-scale gathered user data?
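To make "a very specific data model of context" concrete, here is a purely hypothetical shape for one training record (the field names are invented for illustration; Windsurf's actual schema isn't public):

    # Hypothetical record: IDE context + partial snippet -> target edit.
    # Every field name here is invented, not Windsurf's schema.
    record = {
        "context": {
            "open_files": ["src/server.py", "src/routes.py"],
            "cursor": {"file": "src/server.py", "line": 42},
            "diagnostics": ["NameError: name 'Response' is not defined"],
        },
        "snippet": "def handle(req):\n    # TODO",
        "completion": "def handle(req):\n    return Response(req.json())",
    }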
kcorbitt•4h ago
Most likely they built this as a post-train of an open model that is already strong on coding, like Qwen 2.5.
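As a rough illustration, a supervised fine-tuning step on an open coder base might look like this (the model id is real; the serialized example and hyperparameters are assumptions, not Windsurf's actual recipe):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = "Qwen/Qwen2.5-Coder-7B"  # stand-in open base model
    tok = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # One training example: serialized IDE context + snippet -> target edit.
    text = (
        "<context>file: src/server.py, cursor: line 42</context>\n"
        "<snippet>def handle(req):\n    # TODO</snippet>\n"
        "<edit>def handle(req):\n    return Response(req.json())</edit>"
    )
    batch = tok(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss  # causal LM loss
    loss.backward()
    opt.step()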
rfoo•2h ago
It is very puzzling why "wrapper" companies don't (and religiously say they won't ever) do something on this front. The only barrier is talent.
anshumankmr•2h ago
That being said, I'm sure a lot of the so-called wrapper companies are paying insanely well too, but competing with FAANGMULA might be trickier for them.
dyl000•3h ago
For coding you use Anthropic or Google models; I haven't found anyone who swears by OpenAI models for coding. Their reasoning models are either too expensive or hallucinate massively to the point of being useless. I would assume the GPT-4.1 family will be popular with SWEs.
Having a smaller-scope model (agentic coding only) allows for much cheaper inference and lets Windsurf build its own moat (so far agentic IDEs haven't had a moat).
jjani•3h ago
This suggests OpenAI models do have tasks they're better at than the "less rounded" competition, who in turn have tasks they're weaker in. Could you name a single such task (image generation aside, as it's an entirely different use case) that OpenAI models are better at than Gemini 2.5 and Claude 3.7, without costing at least 5x as much?
jstummbillig•5m ago