Benchmarks don't reflect real-world coding ability. So we made real-world coding the benchmark.
Comments
agtestdvn•1h ago
I work at Windsurf and would love to discuss, product-agnostically, any ideas/thoughts people have around how we as a community can evaluate models better. I feel like benchmarks like SWE-bench are all saturated and gamed/trained on. I also feel like online arenas are mostly used by vibecoders. And our arena mode def isn't the final form factor either!
The trick is to get it to be usable within context. What started out as a simple evals concept quickly turned into a lot of debate over how to properly present worktrees in an IDE. Hope to hear your feedback.
agtestdvn•1h ago