Why eval startups fail (2025)

24•jxmorris12•1d ago

Comments

theteapot•1h ago

What's an eval?

choult•1h ago

Evaluations of different implementations of a tech. Kind of like a meta service layer on top of an industry, such as "Which frontier model is best?"

I do agree that the author does not do a good job of introducing the term.

wseqyrku•1h ago

"Which frontier model is best?"

What kind of stupid business is this. Though nothing can beat SEO in that spirit.

thomasliao•1h ago

It's an important question! If you are paying a lot of money to use AI models, you care that you are using the best one for your task. And it turns out that figuring out which AI model is best for your task is not trivial and requires some expertise.

wseqyrku•57m ago

That was too nice of a reply, I apologize. I just can't understand the thought process: what exactly are we optimizing for? If you are paying a lot of money to use AI models, you already have so much overhead that precise ranking in an eval is not gonna make much difference between equally "frontier" models. Especially since models are sensitive to the input. So the eval is just gonna evaluate the eval with very high accuracy. It might be equivalent to the illusion of safety thing applied to financial risk.

moomin•42m ago

It's not just for choice of model, you can use it for your prompting as well (basically anything to do with your setup). And yes, running evals is expensive and mostly of use to people with serious spend.

thomasliao•37m ago

>equally "frontier" models

A key point I want to make is that the notion of "frontier" is somewhat fictive in the sense that a model which dominates all others on a given eval is not guaranteed to be number one on another eval, even if both evals are ostensibly for the same task.

For example, the best publicly-available model (i.e. excluding Claude Mythos and Fable) on DeepSWE[0] is gpt-5.5-xhigh at 67%, which is soundly better than claude-opus-4.8-max at 59%. I would say an 8pp gap on a benchmark is quite large. But on FrontierCode[1], claude-opus-4.8-xhigh is the best, at a score of 13.4% compared to gpt-5.5-medium at 6.3%.

That's quite a significant reversal!
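To make the reversal concrete, here is a small sketch that takes the four scores quoted above and reports the leader and the gap on each benchmark. Grouping the differently configured runs (xhigh, max, medium) under one family name per vendor is my simplification for illustration, and `SCORES`, `leader`, and `gap_pp` are hypothetical names.

```python
# Scores quoted in this comment, in percent. Each entry is the best quoted
# configuration of that model family on that benchmark, so the effort
# settings differ between entries (an assumption made for this sketch).
SCORES = {
    "DeepSWE": {"gpt-5.5": 67.0, "claude-opus-4.8": 59.0},
    "FrontierCode": {"gpt-5.5": 6.3, "claude-opus-4.8": 13.4},
}


def leader(scores: dict[str, float]) -> str:
    """Return the model family with the highest score on one benchmark."""
    return max(scores, key=scores.get)


def gap_pp(scores: dict[str, float], a: str, b: str) -> float:
    """Score of a minus score of b, in percentage points."""
    return round(scores[a] - scores[b], 1)


for bench, scores in SCORES.items():
    gap = gap_pp(scores, "gpt-5.5", "claude-opus-4.8")
    print(f"{bench}: leader={leader(scores)}, gpt minus claude = {gap:+.1f}pp")

# The ranking flips if the two benchmarks disagree about the leader.
reversed_ranking = len({leader(s) for s in SCORES.values()}) > 1
print("ranking reversal:", reversed_ranking)
```

The sign of the gap flips between the two benchmarks, which is the whole point: "best" only has meaning relative to a particular eval.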

Now, one might wish to claim that either DeepSWE or FrontierCode is poorly constructed and that the other is more accurate. But I think you'll find that eval-design considerations affect measurement in this case to a degree that is probably no smaller than the degree to which user-specific considerations affect measurement in general.

[0] https://deepswe.datacurve.ai/ [1] https://cognition.com/blog/frontier-code

thomasliao•1h ago

(Author) It's short for "evaluation", a test for an AI model. Specifically, an AI evaluation comprises (1) a dataset of prompts (questions / tasks / queries), (2) some way to score model performance on each prompt, like a set of correct answers or a grading rubric that you can use with an LLM autograder, and (3) a metric, such as accuracy¹. (If you're already familiar with the term "benchmark", it's the same thing; for some reason "eval" has become the term of art in the past few years.)

For example, a simple eval is a dataset of multiple-choice questions, each with one correct answer, scored by accuracy. An example of this kind of eval is the Massive Multitask Language Understanding benchmark (2020) (https://arxiv.org/abs/2009.03300).
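A minimal sketch of those three parts for the multiple-choice case, assuming a toy two-question dataset and a stub in place of a real model call. `Item`, `score_item`, `accuracy`, and `always_a` are hypothetical names for illustration, not taken from MMLU or any eval library.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Item:
    """One multiple-choice prompt with a single correct answer."""
    question: str
    choices: tuple[str, ...]
    answer: str  # label of the correct choice, e.g. "B"


def score_item(item: Item, prediction: str) -> int:
    """(2) Scoring rule: 1 if the predicted label matches the key, else 0."""
    return int(prediction.strip().upper() == item.answer)


def accuracy(model: Callable[[Item], str], dataset: Sequence[Item]) -> float:
    """(3) Metric: mean per-item score over the dataset."""
    if not dataset:
        raise ValueError("dataset is empty")
    return sum(score_item(item, model(item)) for item in dataset) / len(dataset)


# (1) Dataset: toy items made up for this sketch.
DATASET = [
    Item("2 + 2 = ?", ("A) 3", "B) 4"), "B"),
    Item("Capital of France?", ("A) Paris", "B) Rome"), "A"),
]


def always_a(item: Item) -> str:
    """Stub standing in for a real model call; always answers "A"."""
    return "A"


print(accuracy(always_a, DATASET))
```

Swapping `always_a` for a function that calls an actual model is the only change needed to turn this into a real (if tiny) eval.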

A more complex eval is FrontierCode (2026). Questions in FrontierCode represent coding tasks needed for real-world repos and are evaluated against rubrics scoring for correctness, code quality, cleanliness, and other factors: https://cognition.com/blog/frontier-code

¹Note that this is slightly different from the definition we used in [0], which defined an eval as a fixed set of input-output pairs combined with a metric. What's different from 2021: models are now given more open-ended inputs (prompts like "find the bug" plus a codebase, rather than a simple question), generate freeform output (rather than choosing a fixed answer), and are graded in a more complex manner (e.g. beyond correctness, a coding eval might also grade adherence to coding guidelines, test coverage, etc.).

[0] Liao, T., Taori, R., Raji, I. D., & Schmidt, L. (2021, January). Are we learning yet? a meta review of evaluation failures across machine learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). https://thomasliao.com/are_we_learning_yet.pdf

jorisw•24m ago

Would've started the article out alluding to this, or added a tooltip or something to this effect

rockyj•7m ago

IMHO - In an AI context an "eval" is answering the question - "Is this AI / LLM call helping me, and is it doing the right thing?"

AI is not deterministic like regular code, so imagine you use it for "search" (RAG) or for summarizing or for classifying emails etc. How do you know it is giving you the right results? In this context, AI evals are an important idea and very often neglected.

You can use an initial "dataset" to evaluate your prompt and AI calls + code (think test cases); this dataset will of course be curated by humans. But as the software is used, you want to incorporate real production data as well and run the evaluation pre and post launch. Sounds simple, but it can get complicated, especially since this area is new and, as the post mentioned, there are too many players and options out there (since everyone thought this was a money maker).
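A rough sketch of that workflow for the email classification case, assuming keyword-matching stubs in place of real LLM calls. The cases, labels, and the names `CURATED`, `PRODUCTION`, `pass_rate`, `baseline`, `candidate`, and `safe_to_ship` are all made up for illustration.

```python
from typing import Callable

Case = tuple[str, str]  # (email text, expected label)

# Human-curated starting dataset (think test cases).
CURATED: list[Case] = [
    ("Your invoice for March is attached", "billing"),
    ("Reset my password please", "support"),
]
# Cases pulled in later from real usage, labelled by a human.
PRODUCTION: list[Case] = [
    ("pls refund the double charge", "billing"),
]


def pass_rate(classify: Callable[[str], str], cases: list[Case]) -> float:
    """Fraction of cases where the classifier returns the expected label."""
    hits = sum(classify(text) == expected for text, expected in cases)
    return hits / len(cases)


def baseline(text: str) -> str:
    """Stub for the currently deployed prompt + model call."""
    return "billing" if "invoice" in text.lower() else "support"


def candidate(text: str) -> str:
    """Stub for the changed prompt + model call under evaluation."""
    lowered = text.lower()
    billing_words = ("invoice", "refund", "charge")
    return "billing" if any(w in lowered for w in billing_words) else "support"


def safe_to_ship(cases: list[Case]) -> bool:
    """Pre-launch gate: the candidate must not score below the baseline."""
    return pass_rate(candidate, cases) >= pass_rate(baseline, cases)


cases = CURATED + PRODUCTION
print(f"baseline={pass_rate(baseline, cases):.2f}",
      f"candidate={pass_rate(candidate, cases):.2f}",
      f"ship={safe_to_ship(cases)}")
```

Running the same `pass_rate` over fresh production cases after launch gives the "post" half of the comparison.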

bitlad•1h ago

Everything eventually fails. Nothing is constant, not even evals.

Etheryte•1h ago

Except regex, no matter how technologically advanced your company, somewhere someone is slapping regex on something that has no business being regexed.

Asmod4n•1h ago

And LLMs seeing this keep on repeating that mistake, like trying to parse HTML with regexp.

bryanrasmussen•1h ago

You're in a business, and you think, to improve things I'm going to slap a regex on this. Now you're in two businesses.

GL26•1h ago

The problem with evals is that the information does not change fast enough to keep you coming back for the latest model performance benchmarks. Bloomberg succeeded because it sells info that expires in the next hour.

jdw64•1h ago

If you look at the history of software engineering, the ones that made the most money were usually not the companies that built the applications themselves, but the ones that built the tools to verify, deploy, and build them, such as CI/CD, static analysis tools, and testing frameworks.

Personally, I agree with the Goodhart problem, but isn't the reason eval startups fail that they try to sell an 'evaluation service' rather than a 'verification toolchain'? The problem, it seems, is that AI verification toolchains require a model in the end, because they internalize AI and sell it under the name of a 'harness.'

So an AI verification (eval) toolchain would have to be structurally different. Verifying AI code isn't about whether it compiles. AI code can always be made to compile. The issue involves various semantic criticisms, such as overfitting to existing designs and tests. To catch those issues, you ultimately need to build an AI. But building that AI is expensive. So in the end, AI verification companies depend on external model providers for the core components of their verification engine. I think this is a bad business decision.

whinvik•1h ago

> made the most money

> built the tools to verify, deploy, and build them, such as CI/CD, static analysis tools, and testing frameworks.

Curious. Which company made money with testing frameworks?

jdw64•1h ago

I thought about mentioning Atlassian (Jira) and JetBrains, but come to think of it, they aren't really testing frameworks. They cover the entire development workflow overall. I guess I was thinking too short.

noelwelsh•22m ago

The "shovels for gold miners" analogy is generally a good one. It applies to Nvidia, for example. It doesn't generally apply to developers though. Developer tooling is notoriously difficult to monetize. Developers themselves are a shovel.

wseqyrku•1h ago

> Not enough eval customers

Aha.

torginus•31m ago

Imo it's very simple - AI is a big function inverter. If you have a better cost function than frontier labs, as in, you are better at judging model output quality, then you can use that cost function to RL the next generation of models.

Therefore your knowledge is better used in training than letting users be slightly better at the token casino. Which is mentioned in this post as well, eval startup people either go to work at frontier labs or finetune startups.

jampekka•21m ago

I think there's gonna be (or perhaps already is) a huge demand for evaling individual systems. Many countries are starting to adopt some criteria for LLM usage for public use, and I doubt govs are gonna develop in-house know-how for this. These will likely form some kind of "independent auditor" model, as the system provider has too strong a conflict of interest.

It's probably not gonna be exactly glorious work, but designing expert eval settings and collecting and crunching the data for quality assurance and control is going to be needed.

coldtea•20m ago

Because they operate on untrusted input
