frontpage.

18th-century mechanical volcano roars to life 250 years later

https://www.sciencedaily.com/releases/2026/05/260502015359.htm
1•samizdis•1m ago•0 comments

WeSearch

https://wesearch.press/
1•EGCstudy•2m ago•1 comment

Making 10 apps in 30 Days

https://bendansby.com/posts/10-apps-30-days.html
1•webwielder2•2m ago•0 comments

Iceland's Pools and Hot Tubs Now UNESCO-Recognized. Some Locals Aren't Thrilled.

https://www.nytimes.com/2026/04/30/world/europe/iceland-hot-tub-pools-tourism.html
1•bookofjoe•2m ago•1 comment

Show HN: Predicting the 2026 Kentucky Derby with 1T Monte Carlo Sims on Burla

https://burla-cloud.github.io/examples/kentucky-derby-demo/
1•Jack_at_Burla•4m ago•0 comments

AI talks draw backlash from Mass. state lawmakers

https://www.politico.com/news/2026/05/01/ai-backlash-massachusetts-lawmakers-00903440
1•1vuio0pswjnm7•6m ago•0 comments

Life update: Zig, AI, unemployment, and more [video]

https://www.youtube.com/watch?v=DhhPUrizZcw
1•rubenflamshep•7m ago•0 comments

How Oregon's Data Center Boom Is Supercharging a Water Crisis

https://waterwatch.org/how-oregons-data-center-boom-is-supercharging-a-water-crisis/
1•therobots927•8m ago•0 comments

Palantir Comes to Campus

https://nymag.com/intelligencer/article/palantir-yale-conference-ai.html
1•jbegley•9m ago•0 comments

Shitpostmodernism: Understanding the Slopgeneration

https://www.spikeartmagazine.com/articles/essay-shitpostmodernism
1•thinkingemote•9m ago•0 comments

AI Agents Are the Mass-Produced Cars of Software

https://telegraphic.substack.com/p/ai-agents-are-the-mass-produced-cars
1•telegrahi•9m ago•0 comments

Opioid maker Purdue Pharma shuts down as part of $7.4B deal

https://www.usatoday.com/story/news/nation/2026/05/01/purdue-pharma-shuts-down-opioid-crisis-oxyc...
1•geox•11m ago•0 comments

Disneyland Now Uses Face Recognition on Visitors

https://www.wired.com/story/security-news-this-week-disneyland-now-uses-face-recognition-on-visit...
2•Brajeshwar•15m ago•0 comments

Digital Ecosystems: Interactive Multi-Agent Neural Cellular Automata

https://pub.sakana.ai/digital-ecosystem/
1•jarmitage•17m ago•0 comments

How are Life-Size Figures Created at hololive production?

https://coveredge.cover-corp.com/en/list/4759
1•ai_slop_hater•17m ago•0 comments

Vibecoded my dream game, GeoGuesser for guns, now it's helping with student bills

https://gunguesser.com
4•salad_vr•17m ago•5 comments

The Railway and the Balloon

https://netwars.pelicancrossing.net/2026/05/01/the-railway-and-the-balloon/
1•ColinWright•21m ago•0 comments

Floating Armoury

https://en.wikipedia.org/wiki/Floating_armoury
1•jjmarr•24m ago•0 comments

Customizing Claude Code spinner verbs

https://www.augmentedswe.com/p/customizing-claude-code-spinner-verbs
1•wordsaboutcode•24m ago•0 comments

Backend-for-Frontend: The most secure architecture for browser-based apps

https://fusionauth.io/blog/backend-for-frontend-security-architecture
2•mooreds•29m ago•0 comments

Voyager and the Art of Graceful Degradation

https://www.flyingbarron.com/2026/04/voyager-and-art-of-graceful-degradation.html
1•mooreds•29m ago•0 comments

Did I photograph the Aurora or was it something else? (2016)

https://wp.lancs.ac.uk/aurorawatchuk/2016/03/16/did-i-photgraph-the-aurora-or-was-it-something-else/
1•susam•31m ago•0 comments

Upcoming Blender Development Fund and AI Policies

https://www.blender.org/news/upcoming-blender-development-fund-and-ai-policies/
2•sensanaty•32m ago•0 comments

The Annoying Usefulness of Emacs [video]

https://www.youtube.com/watch?v=DMbrNhx2zWQ
2•susam•32m ago•0 comments

The Sky Tonight

https://theskylive.com/guide
2•susam•34m ago•0 comments

New US phone network for Christians to block porn and gender-related content

https://www.technologyreview.com/2026/05/01/1136739/a-new-t-mobile-network-for-christians-aims-to...
8•thinkingemote•37m ago•2 comments

Making Your Writing Work Harder for You

https://training.kalzumeus.com/newsletters/archive/content-marketing-strategy
2•eigenBasis•39m ago•0 comments

Show HN: TradingAgents without the API bill – run multi agents in Claude Code

https://github.com/lucemia/trading-agents-plugin
1•lucemia51•43m ago•0 comments

Stop Supplying. Start Owning

https://allensthoughts.com/2026/05/01/stop-supplying-start-owning/
2•herbertl•44m ago•0 comments

Uber wants to turn its drivers into a sensor grid for AV companies

https://techcrunch.com/2026/05/01/uber-wants-to-turn-its-millions-of-drivers-into-a-sensor-grid-f...
6•nickvec•45m ago•1 comment

Chinese AI models are ~8 months behind and falling further behind

https://twitter.com/scaling01/status/2050395242663223751
2•enraged_camel•1h ago

Comments

giardini•1h ago
No problem: they're always at most one theft away from you! 8-)
jqpabc123•1h ago
Chinese models are cheaper and likely to remain so due to lower energy costs.
tokkkie•1h ago
Chinese models feel strong in Japan (kanji handling), but outside language tasks? Maybe Sonnet 4.5 level at most.

Do benchmarks reflect that gap in English-speaking regions?

allears•54m ago
Not everybody needs cutting-edge performance. Cost per token is turning out to be more important.
ilia-a•26m ago
That doesn't seem right, and it seems to miss GLM 5.1 and Kimi 2.6. Not to mention the whole cost/value argument for Chinese OSS models vs. GPT/Claude.
ollin•15m ago
The source here is "CAISI Evaluation of DeepSeek V4 Pro" [1]; the US NIST ran their own benchmarks (including several internal ones) and reported the following table:

    | Domain               | Benchmark              | OpenAI GPT-5.5 (xhigh) | OpenAI GPT-5.4 mini (xhigh) | Anthropic Opus 4.6 (max) | DeepSeek V4 Pro (max) |
    |----------------------|------------------------|------------------------|-----------------------------|--------------------------|-----------------------|
    | Cyber                | CTF-Archive-Diamond    | **71%**                | 32%                         | 46%                      | 32%                   |
    | Software Engineering | SWE-Bench Verified*    | **81%**                | 73%                         | 79%                      | 74%                   |
    |                      | PortBench              | **78%**                | 41%                         | 60%                      | 44%                   |
    | Natural Sciences     | FrontierScience        | **79%**                | 74%                         | 72%                      | 74%                   |
    |                      | GPQA-Diamond           | **96%**                | 87%                         | 91%                      | 90%                   |
    | Abstract Reasoning   | ARC-AGI-2 semi-private | **79%**                | –                           | 63%                      | 46%                   |
    | Mathematics          | OTIS-AIME-2025         | **100%**               | 90%                         | 92%                      | 97%                   |
    |                      | PUMaC 2024             | **96%**                | 93%                         | 95%                      | **96%**               |
    |                      | SMT 2025               | **99%**                | 92%                         | 94%                      | 96%                   |
    | IRT-Estimated Elo    |                        | **1260 ± 28**          | 749 ± 46                    | 999 ± 27                 | 800 ± 28              |
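For context on the Elo row: Elo ratings are conventionally mapped to head-to-head win probabilities via a logistic curve on a 400-point scale. A minimal sketch, assuming the report uses that conventional scale (the report itself may define its IRT fit differently):

```python
import math

def win_prob(d: float) -> float:
    """Win probability implied by an Elo gap d, under the
    standard logistic Elo model: P = 1 / (1 + 10^(-d/400))."""
    return 1 / (1 + 10 ** (-d / 400))

def elo_gap(p_win: float) -> float:
    """Inverse mapping: Elo gap implied by a win probability."""
    return 400 * math.log10(p_win / (1 - p_win))

# The table's 1260 vs. 800 gap is 460 Elo, which under this model
# implies roughly a 93% head-to-head win rate:
print(round(win_prob(460), 3))  # ≈ 0.934
```

This is only a rule-of-thumb conversion; the ± 28 error bars mean the implied win rate itself carries meaningful uncertainty.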
Notably, two of the benchmarks with the biggest capability gap (CTF-Archive-Diamond, PortBench) are CAISI-internal/private. I read this as "DeepSeek is well-tuned for public benchmarks and less generally capable than GPT-5.5 on held-out tasks", but a less charitable reading would be "the US government reports that US models do best on benchmarks only the US government can run".

Agent benchmarking is fraught with peril [2], and a partial benchmarker (one who disproportionately overlooks bugs/issues when evaluating certain models) can absolutely tilt the scales, so I would not be surprised if a PRC-led benchmarking of frontier models came to the opposite conclusion.
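On top of evaluation bias, finite eval sets add plain sampling noise. A back-of-envelope two-proportion z-test (my illustration, not from the report) on the table's SWE-Bench Verified gap, using the benchmark's 500 instances:

```python
import math

def two_prop_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """Two-proportion z-statistic: is the gap between two pass
    rates larger than sampling noise on eval sets of size n1, n2?"""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)           # pooled pass rate
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# GPT-5.5's 81% vs. DeepSeek's 74% on 500 instances each:
print(round(two_prop_z(0.81, 0.74, 500, 500), 2))  # ≈ 2.65
```

A z of ~2.65 clears the usual 1.96 threshold, but not by much: a few percentage points on a 500-item benchmark is close to the noise floor, which is one more reason single-table rankings deserve skepticism.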

[1] https://www.nist.gov/news-events/news/2026/05/caisi-evaluati...

[2] https://epoch.ai/gradient-updates/why-benchmarking-is-hard