Arena AI Model ELO History

https://mayerwin.github.io/AI-Arena-History/
28•mayerwin•2h ago
Hi HN,

I built a live tracker to visualize the lifecycle and performance changes of flagship AI models.

We've all experienced the phenomenon where a flagship model feels amazing at launch, but weeks later, it suddenly feels a bit off. I wanted to see if this was just a feeling or a measurable reality, so I built a dashboard to track historical ELO ratings from Arena AI.

Instead of a massive spaghetti chart of every single model variant, the logic plots exactly ONE continuous curve per major AI lab. It dynamically tracks their highest-rated flagship model over time, which makes both the sudden generational jumps and the slow performance decays much easier to see. It took quite a lot of iterations to get the chart to look nice on mobile as well. Optional dark mode included.
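The one-curve-per-lab idea described above can be sketched roughly like this (a minimal sketch with a hypothetical snapshot format and made-up ratings, not the project's actual code): for each lab, keep only its highest-rated model at each date.

```python
from collections import defaultdict

# (date, lab, model, rating) snapshots -- illustrative numbers, not real Arena scores
snapshots = [
    ("2025-01", "LabA", "a-1", 1250), ("2025-01", "LabA", "a-1-mini", 1180),
    ("2025-01", "LabB", "b-1", 1230),
    ("2025-02", "LabA", "a-1", 1240), ("2025-02", "LabB", "b-2", 1290),
]

def flagship_curves(snapshots):
    """For each lab, keep only its top-rated model per date,
    yielding one continuous flagship curve per lab."""
    best = defaultdict(dict)  # lab -> date -> (model, rating)
    for date, lab, model, rating in snapshots:
        cur = best[lab].get(date)
        if cur is None or rating > cur[1]:
            best[lab][date] = (model, rating)
    return {lab: sorted(dates.items()) for lab, dates in best.items()}

curves = flagship_curves(snapshots)
# curves["LabA"] -> [("2025-01", ("a-1", 1250)), ("2025-02", ("a-1", 1240))]
```

This also makes generational jumps visible as points where the flagship model's name changes between consecutive dates (as with LabB's b-1 → b-2 above).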

However, I have a specific data blind spot that I'm hoping this community might have insights on.

Arena AI largely relies on testing API endpoints. But as we know, consumer chat UIs often layer on heavy system prompts, safety wrappers, or silently switch to heavily quantized models under high load to save compute. API benchmarks don't fully capture this "nerfing" that everyday web users experience.

Does anyone know of any historical ELO or evaluation datasets that specifically scrape or test outputs from the consumer web UIs rather than raw APIs?

I'd love to integrate that data for a more accurate picture of the consumer experience. The project is open-source (repo link in the footer), so I'd appreciate any feedback, or pointers to datasets!

Comments

underyx•1h ago
> the slow performance decays

the decays are just other, more capable models entering the population, making all prior models lose more frequently

eis•39m ago
The ELO rating system measures performance relative to the other models. As the other models improve, or rather as newer, better models enter the list, the ELO score of a given existing model will tend to decrease even though there might be no changes whatsoever to the model or its system prompt.

You can't use ELO scores to measure decay of a model's performance in absolute terms. For that you need a fixed harness running over a fixed set of tests.
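A toy simulation of this effect (standard Elo update with K=16; made-up skill numbers, and deterministic fractional scores instead of sampled wins, purely for illustration): a model whose true skill never changes still sheds rating points once a stronger model joins the pool.

```python
K = 16  # standard Elo update step size

def expected(r_a, r_b):
    """Expected score of a against b under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def play(ratings, skills, a, b):
    # Deterministic toy: the actual score equals the win probability implied
    # by true skill, so ratings converge to reflect the skill gaps exactly.
    s_a = expected(skills[a], skills[b])
    e_a = expected(ratings[a], ratings[b])
    ratings[a] += K * (s_a - e_a)
    ratings[b] += K * (e_a - s_a)  # zero-sum update

# "old" and "peer" have identical, unchanging skill; "new" is 300 points stronger
skills = {"old": 1200, "peer": 1200, "new": 1500}
ratings = {"old": 1200.0, "peer": 1200.0, "new": 1200.0}

for _ in range(500):
    play(ratings, skills, "old", "peer")
    play(ratings, skills, "old", "new")
    play(ratings, skills, "peer", "new")

# ratings["old"] settles near 1100: a ~100-point "decay" with zero skill change
```

Total rating points are conserved, so the newcomer's gain is paid for by everyone else, which is exactly what a leaderboard-wide "slow decay" looks like.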

tedsanders•33m ago
FYI, Elo isn't an acronym - it's a person's name. No need to capitalize it as ELO.
SilverElfin•32m ago
You’re right: https://en.wikipedia.org/wiki/Elo_rating_system
tedsanders•31m ago
For what it's worth, I work at OpenAI and I can guarantee you that we don't switch to heavily quantized models or otherwise nerf them when we're under high load. It's true that the product experience can change over time - we're frequently tweaking ChatGPT & Codex with the intention of making them better - but we don't pull any nefarious time-of-day shenanigans or similar. You should get what you pay for.
selcuka•24m ago
> we don't switch to heavily quantized models

That sounded like a press bulletin, so just to give you a chance to clarify: does that mean you may switch to lightly quantized models?

jychang•14m ago
There's almost 0% chance that OpenAI doesn't quantize the model right off the bat.

I am willing to bet large amounts of money that OpenAI would never release a model served as fully BF16 in the year of our lord 2026. That would be insane operationally. They're almost certainly doing QAT to FP4 for FFN, and a similar or slightly larger quant for attention tensors.

selcuka•9m ago
It's ok if they never release a BF16 model, but it's less ok if they release it, win the benchmarks, then quantise it after a few weeks.
Ciph•19m ago
Thank you for your answer. I have a similar question as OP, but in regard to the GPT models in MS Copilot. My experience is that the response quality is much better when calling the API directly or through the web UI.

I know this might be a question that's impossible for you to answer, but if you can shed any light on this matter, I'd be grateful, as I am doing an analysis of which AI solutions might be suitable for my organisation.

refulgentis•13m ago
Is this slop? It has wildly aggressive language that agrees with a subset of pop sentiment, re: models being “nerfed”. It promises to reveal this nerfing. Then, it goes on to…provide an innocuous mapping of LM Arena scores that always go up?

Claude for Small Business

https://www.anthropic.com/news/claude-for-small-business
121•neilfrndes•2h ago•68 comments

Scorched Earth 2000 – Web

http://www.scorch2000.com/web/
200•meshko•5h ago•78 comments

Linux gaming is faster because Windows APIs are becoming Linux kernel features

https://www.xda-developers.com/linux-gaming-is-getting-faster-because-windows-apis-are-becoming-l...
639•haunter•3d ago•423 comments

Cisco workforce reductions

https://blogs.cisco.com/news/our-path-forward
139•ahmedomran8•4h ago•102 comments

Setting up a free *.city.state.us locality domain (2025)

https://fredchan.org/blog/locality-domains-guide/
544•speckx•15h ago•170 comments

A History of IDEs at Google

https://laurent.le-brun.eu/blog/a-history-of-ides-at-google
338•laurentlb•4d ago•228 comments

MacBook Neo Deep Dive: Benchmarks, Wafer Economics, and the 8GB Gamble

https://www.jdhodges.com/blog/macbook-neo-benchmarks-analysis/
169•tosh•11h ago•179 comments

The Emacsification of Software

https://sockpuppet.org/blog/2026/05/12/emacsification/
260•rdslw•23h ago•169 comments

Avoiding and reducing microplastic false positives from dry glove contact

https://pubs.rsc.org/en/content/articlelanding/2026/ay/d5ay01801c
25•efavdb•5h ago•0 comments

Chess puzzle I found in my dad's old book

https://ardoedo.it/kempelen/
137•Eswo•2d ago•39 comments

Show HN: Nibble

https://github.com/glouw/nibble
30•glouwbug•4h ago•2 comments

Twin brothers wipe 96 government databases minutes after being fired

https://arstechnica.com/tech-policy/2026/05/drop-database-what-not-to-do-after-losing-an-it-job/
379•jnord•1d ago•275 comments

delta time

https://www.deltatime.life/
27•mxfh•5h ago•9 comments

Microsoft BitLocker – YellowKey zero-day exploit

https://www.tomshardware.com/tech-industry/cyber-security/microsoft-bitlocker-protected-drives-ca...
101•cookiengineer•3h ago•52 comments

Princeton mandates proctoring for in-person exams, upending 133 year precedent

https://www.dailyprincetonian.com/article/2026/05/princeton-news-adpol-proctoring-in-person-exami...
301•bookofjoe•10h ago•436 comments

The US is winning the AI race where it matters most: commercialization

https://avkcode.github.io/blog/us-winning-ai-race.html
187•akrylov•16h ago•514 comments

Xs of Y – roguelike that names itself every run. Written in 4kLoC

https://github.com/nooga/xsofy
174•andsoitis•4d ago•75 comments

Golden Testing a CAD Library

https://doscienceto.it/blog/posts/2026-04-27-golden-testing-cad.html
10•PaulHoule•2d ago•0 comments

Launch HN: Ardent (YC P26) – Postgres sandboxes in seconds with zero migration

https://www.tryardent.com/
81•vc289•13h ago•33 comments

Extraordinary Ordinals

https://text.marvinborner.de/2026-04-09-17.html
5•marvinborner•2d ago•0 comments

How can Apple deal with the memory shortage?

https://asymco.com/2026/05/11/the-great-memory-panic-of-2026/
81•tambourine_man•2d ago•79 comments

Heritability of human life span is ~50% when heritability is redefined

https://dynomight.net/lifespan/
85•surprisetalk•1d ago•51 comments

'A Four-Eyed World' Review: The Story of Spectacles

https://www.wsj.com/arts-culture/books/a-four-eyed-world-review-the-story-of-spectacles-504334ac
5•Hooke•4d ago•0 comments

Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model

https://github.com/cactus-compute/needle
662•HenryNdubuaku•1d ago•187 comments

Reverting the incremental GC in Python 3.14 and 3.15

https://discuss.python.org/t/reverting-the-incremental-gc-in-python-3-14-and-3-15/107014
216•curiousgal•4d ago•85 comments

The other half of AI safety

https://personalaisafety.com/p/the-other-half-of-ai-safety
64•sofiaqt•5h ago•81 comments

S-100 Virtual Workbench

https://grantmestrength.github.io/S100/
114•rbanffy•14h ago•25 comments

Marco Polo: Finding a friend with only distance and motion

https://www.jackhogan.me/blog/marco-polo
54•jackhogan11•2d ago•7 comments

Mystery Microsoft bug leaker keeps the zero-days coming

https://www.theregister.com/security/2026/05/13/disgruntled-researcher-releases-two-more-microsof...
104•e12e•5h ago•34 comments