CursorBench 3.1

https://cursor.com/evals
33•handfuloflight•2h ago

Comments

o10449366•1h ago
I feel like this benchmark reiterates my disbelief that anyone uses the latest Anthropic models for any productive work. They seem to be the best at burning tokens and spawning unnecessary subagents even for well-defined and tightly scoped tasks.

Can we get a count of people that have had Claude read irrelevant documents or perform unnecessary web searches even when told not to from the beginning?

I'm starting to wonder if this increased token usage is inadvertently bleeding into how Anthropic actually trains their models, especially leading up to an IPO. As older models are deprecated and users are forced onto newer models, if the default is less efficient and more token-expensive, that directly results in higher "profit" for Anthropic in terms of the consumption their users have to tolerate - lest they jump to a competitor.

pbowyer•58m ago
> I feel like this benchmark reiterates my disbelief that anyone uses the latest Anthropic models for any productive work. They seem to be the best at burning tokens and spawning unnecessary subagents even for well-defined and tightly scoped tasks.

I keep Claude around for some specific tasks:

- Linked up to Figma MCP to implement front-end stuff

- Data analysis, in the "Connect AI to a data source and ask questions" way. I've tried both Opus 4.8 high and GPT 5.5 high for this and Opus is stronger because it gets the intent in the question better

I used to keep it around for planning too, but the 4.8 plans have had more holes than Swiss cheese.

anilgulecha•1h ago
Is Composer 2.5 that good at that price point? Seems like the Gemini Flash playbook of trying to get the most bang for the buck.
aabdi•1h ago
Yes, it's very good.
danfritz•1h ago
It's my daily driver: it's fast, affordable, and with a bit of guidance gets the job done.

I only reach for Claude when I need to plan something big or want a sparring partner to fire off some ideas.

I think what a lot of people don't realize is that you don't need a frontier model for 80% of coding tasks. Composer 2.5 is often more than good enough, less token-hungry, and way faster.

shockembopper•39m ago
I have been doing the same for quite a while now. Composer 2.5 is incredible when you’re working in the loop.
uf00lme•1h ago
It's surprisingly usable and cheap enough to run in 'fast' mode when vibing something quick. For simple code I find I prefer the code it writes over the GLM or Gemini families.
fumar•1h ago
It’s fast and affordable.
tekacs•1h ago
I'm pretty baffled by their choice of axes. I would have thought that the left was the cheapest, not the most expensive. I appreciate that this layout means that top right can be best, but it's still unintuitive to have this backwards cost axis IMO.

Putting that aside, I spend all day every day implementing very, very hard things right on the edge of what agents are (barely, sometimes) capable of, and I have had to keep Opus on max for things that need 'real validation' for a while now. And that has felt like 'the only way' to get Opus to perform even close to 5.5 xhigh. I'm only using Opus at all because GPT-5.5 in the subscriptions only has a small (400k, but 258k effective) context window.

The difference is that 5.5 xhigh is extremely fast in most practical cases, both efficiently implementing _overall_, and responding very quickly with great adaptive thinking if you ask it something that it doesn't have to think about. Opus 4.8 Max will needlessly chew on everything and can take hours to implement even simple things, so I can mostly only use it for planning/review.

Fable is much much better at adaptive thinking / responding quickly (although probably still worse than 5.5 xhigh), and... I think folks have said enough elsewhere about its strengths and weaknesses. Sadly still not a reliable implementor for my hard tasks though (that's still GPT's domain) – it tends to leave big, dangerous holes hiding inside implementations unless babied.

pbowyer•1h ago
> I'm only using Opus at all because GPT-5.5 in the subscriptions only has a small (400k, but 258k effective) context window.

Do you find that makes a difference in your work? I've been using 5.5 high/xhigh to optimize and benchmark a C codebase, and just reading the initial code virtually fills the first context window. A session will auto-compact 5-15 times, but it seems to do okay in spite of that because the task is mainly focused on the latest window each time.

I think for programming, GPT's strength over Opus outweighs the smaller context window here.

mdasen•54m ago
I'm a bit skeptical.

Cursor's benchmark finds that Cursor's model (Composer 2.5) is basically as good as Opus 4.8 max and GPT-5.5 xhigh, but at a fraction of the price.

Artificial Analysis' testing shows Composer 2.5 to be pretty far behind: https://artificialanalysis.ai/agents/coding-agents. Look at the DeepSWE benchmark (which is probably the hardest to game at this point): GPT-5.5 xhigh gets a 64, Opus 4.8 max gets 56, and Composer 2.5 gets 16.

I don't doubt that Cursor works well for some people. Composer 2.5 is beating DeepSeek v4 Pro in the DeepSWE benchmark and that's a very capable model. But I'm skeptical of the claims that it's a competitor for Opus 4.8 and GPT-5.5. It just seems convenient that their model does so well on their own benchmark while third-party benchmarks have it far behind. Maybe it's a really great benchmark and a better measure than third-party ones - I'd love for a cheap model to do as well as the expensive ones.

famouswaffles•48m ago
Cursor sessions are pretty much what composer models are RL'd on. This bench and the training data are/should be basically the same distribution.
datadrivenangel•13m ago
For lighter interactive agentic coding, where you type stuff into an IDE and a minute or three later get results back for review, Composer 2.5 is honestly pretty great. The results get notably worse for larger tasks though.
BugsJustFindMe•29m ago
I've used both Composer 2.5 and GPT 5.5 (both in Cursor and in Codex) extensively, and the idea that Composer 2.5 is anywhere close in performance to GPT 5.5 is absolutely farcical. It's faster, but it's nowhere near as good.
__natty__•8m ago
It's hard to believe Composer 2.5 is that good. I tried comparing it with GLM 5.2 and Opus 4.6, and it didn't think through the problem or reason critically. It's great for executing plans made by other models, but even then it does some weird code manipulation that is far from how the surrounding files actually work.
