DeepSeek V4 Pro beats GPT-5.5 Pro on precision

https://runtimewire.com/article/deepseek-v4-pro-beats-gpt-5-5-pro-on-precision

78•yogthos•1h ago

Comments

embedding-shape•33m ago

... according to grok-4-1-fast-non-reasoning who was the judge, on 4 tasks in total, score was 38 to 33 so obviously huge conclusions can be made.

> We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had grok-4-1-fast-non-reasoning score each one. DeepSeek: DeepSeek V4 Pro scored 38.0 to OpenAI: GPT-5.5 Pro's 33.0.

andai•27m ago

grok-4-1-fast was retired about a month ago.

Requests to grok-4-1-fast-non-reasoning now silently route to grok-4.3 (a 5x more expensive model), with reasoning set to "none".

https://docs.x.ai/developers/migration/may-15-retirement

TFA was published today, which implies grok-4.3 was used.

largbae•27m ago

Pretty small sample size here, but it's hard to avoid the conclusion that DeepSeek and friends will start to put some serious downward pressure on frontier lab token pricing.

Hopefully this dynamic continues long enough to make local/private inference the leading solution for coding.

ekidd•1m ago

The OP uses tons of typical AI turns of phrase, and Pangram classified it as AI with high confidence.

So it doesn't surprise me at all that the methodology is weak, too.

ElenaDaibunny•27m ago

Yep, matches my experience. GPT keeps adding fields and changing types on structured output when you need it to just follow the spec~
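One way to catch the drift described above (extra fields, changed types) is a strict check on the model's JSON before anything downstream consumes it. This is only a minimal sketch: the `SPEC` mapping, the field names, and the `validate_strict` helper are all made up for illustration and are not any provider's API.

```python
import json

# Hypothetical spec for illustration only: field name -> required Python type.
SPEC = {"id": int, "name": str, "tags": list}


def validate_strict(raw: str, spec: dict) -> list[str]:
    """Return a list of violations; an empty list means the output matches."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    # Fields the model added that the spec never asked for.
    for key in sorted(data.keys() - spec.keys()):
        problems.append(f"unexpected field: {key}")
    # Fields that are missing or whose type was changed.
    for key, expected in spec.items():
        if key not in data:
            problems.append(f"missing field: {key}")
        elif type(data[key]) is not expected:
            got = type(data[key]).__name__
            problems.append(
                f"wrong type for {key}: expected {expected.__name__}, got {got}"
            )
    return problems


if __name__ == "__main__":
    drifted = '{"id": "1", "name": "a", "tags": [], "extra": 1}'
    for problem in validate_strict(drifted, SPEC):
        print(problem)
```

The exact-type comparison is deliberate: it rejects a string `"1"` where an integer was specified, and also a boolean where an integer was specified, which an `isinstance` check would let through.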

SwellJoe•11m ago

I tried adding GPT 5.5 Pro to a vulnerability scanning benchmark I made (https://swelljoe.com/post/will-it-mythos/), and it blew through the $100 budget limit halfway through. DeepSeek V4 Pro cost about a dollar for the whole benchmark. GPT Pro cost an average of $22 per case (a case could be 1-5 files with a recent known vulnerability, usually just a single file and a prompt along the lines of "does this file have any vulnerabilities").

GPT 5.5 Pro found two out of four cases that it got to before blowing its budget. Maybe it would have been the best of the bunch with infinite budget, but Opus 4.8, DeepSeek V4 Pro, and MiMo 2.5 Pro found four of nine of the bugs. Opus was an order of magnitude cheaper than GPT 5.5 Pro (and something like 30% cheaper than GPT 5.5), DeepSeek and MiMo were two orders of magnitude cheaper at roughly a dime per case.

GPT Pro also chews through a lot of tokens and takes a long time, relatively speaking.

I can't come up with a use case where I can rationally spend ~31 times what Opus costs to use GPT 5.5 Pro, and I won't be doing any more benchmarking with it.

Given how much token costs are becoming an issue people talk about, the fact that there are models that cost dramatically less than the big American providers is going to be an issue for Anthropic and OpenAI. I'm happy to pay a premium (within reason) for the best model for interactive coding, but for API use, where having the model repeat itself, compare against other models, have models judge other models' work, etc. is not time-consuming for a human and is just a matter of implementing the harnesses and framework for proving correctness, I can't come up with a reason to spend ten or two hundred times as much as DeepSeek.
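For anyone checking the arithmetic in the comment above, here is a rough sketch that derives the implied ratios from the three figures quoted there ($22 per case, ~31x Opus, roughly a dime per case). The constant and function names are invented for illustration, and the inputs are the commenter's approximations, so the outputs are only ballpark numbers.

```python
# Per-case figures as quoted in the comment above (all approximate).
GPT_PRO_PER_CASE = 22.0    # "an average of $22 per case"
GPT_PRO_VS_OPUS = 31.0     # "~31 times what Opus costs"
DEEPSEEK_PER_CASE = 0.10   # "roughly a dime per case"


def implied_costs() -> dict:
    """Derive the per-case Opus cost and the cost ratios against DeepSeek."""
    opus_per_case = GPT_PRO_PER_CASE / GPT_PRO_VS_OPUS
    return {
        "opus_per_case": opus_per_case,
        "gpt_pro_vs_deepseek": GPT_PRO_PER_CASE / DEEPSEEK_PER_CASE,
        "opus_vs_deepseek": opus_per_case / DEEPSEEK_PER_CASE,
    }


if __name__ == "__main__":
    for name, value in implied_costs().items():
        print(f"{name}: {value:.2f}")
```

On these inputs Opus comes out at roughly $0.71 per case, about 7x DeepSeek, and GPT 5.5 Pro at about 220x DeepSeek, which lines up with the "ten or two hundred times" range in the comment.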

zaptrem•7m ago

Can you include GPT 5.5 non-pro (extra high thinking I guess) in your comparison? GPT Pro is the "I am willing to torch cash for a sooometimes slightly better result" option, not the one people are actually expected to use daily. That's probably part of the reason it's not in Codex.

bel8•3m ago
