Show HN: Arenas suck, here's why we just added one to Windsurf

https://windsurf.com/blog/windsurf-wave-14

4•agtestdvn•1w ago

Benchmarks don't reflect real-world coding ability. So we made real-world coding the benchmark.

Comments

agtestdvn•1w ago

I work at Windsurf and would love to discuss, product-agnostically, any ideas or thoughts people have about how we as a community can evaluate models better. I feel like benchmarks like SWE-bench are all saturated and gamed or trained on. I also feel like online arenas are mostly used by vibecoders. And our arena mode def isn't the final form factor either!

swyx•1w ago

(team member) my comparison matrix of why Product Arenas differ from Global Arenas here: https://x.com/swyx/status/2017342647963431363

the trick is to get it to be usable within context. what started out as a simple evals concept quickly became a lot of debating over how to properly present worktrees in an IDE. hope to hear your feedback.
