This niche of the field has come a very long way just over the last 12 months, and the tooling is so much better than it used to be. Trying to do this from scratch, beyond a "kinda sorta good enough for now" project, is a full-time engineering project in and of itself.
I'm a maintainer of Opik (https://github.com/comet-ml/opik), but you have plenty of options in the space these days, whatever your particular needs are.
Alternatives to Opik include Braintrust (closed), Promptfoo (open, https://github.com/promptfoo/promptfoo) and Laminar (open, https://github.com/lmnr-ai/lmnr).
Maybe it's obvious to some, but I was hoping that page started off by explaining what the hell an AI eval specifically is.
I can probably guess from context but I'd love to have some validation.
I've appreciated Hamel's thinking on this topic.
> On a related note, unlike traditional unit tests, you don’t necessarily need a 100% pass rate. Your pass rate is a product decision, depending on the failures you are willing to tolerate.
Not sure how I feel about this, given expectations, culture, and tooling around CI. This suggestion seems to blur the line between a score from an eval and the usual idea of a unit test.
P.S. It is also useful to track regressions on a per-test basis.
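To make that concrete, here's a minimal sketch (not from the FAQ) of a CI gate where the pass-rate threshold is a product decision and regressions are tracked per test. The file names and the substring scorer are hypothetical stand-ins for a real eval harness.

```python
# Minimal sketch of a CI gate where the pass rate is a product decision rather
# than 100%, plus per-test regression tracking. File names and the substring
# check are hypothetical stand-ins for a real harness.
import json
import sys
from pathlib import Path

PASS_RATE_THRESHOLD = 0.85                  # product decision, not necessarily 1.0
BASELINE_PATH = Path("eval_baseline.json")  # per-test pass/fail from the last accepted run

def passes(case: dict) -> bool:
    # Stand-in scorer: does the captured output contain the expected answer?
    return case["expected"].lower() in case["output"].lower()

def main(cases: list[dict]) -> int:
    results = {c["id"]: passes(c) for c in cases}
    pass_rate = sum(results.values()) / len(results)

    # Per-test regressions: cases that passed on the baseline but fail now.
    baseline = json.loads(BASELINE_PATH.read_text()) if BASELINE_PATH.exists() else {}
    regressions = [cid for cid, ok in results.items() if baseline.get(cid) and not ok]

    print(f"pass rate: {pass_rate:.1%}  regressions: {regressions}")
    if pass_rate < PASS_RATE_THRESHOLD or regressions:
        return 1  # fail CI: overall quality dropped, or a known-good case broke
    BASELINE_PATH.write_text(json.dumps(results, indent=2))
    return 0

if __name__ == "__main__":
    sys.exit(main(json.loads(Path("eval_cases.json").read_text())))
```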
You can then create datasets out of these traces, and use them to benchmark improvements you make to your application.
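One way that can look in plain Python, with no particular tracing SDK assumed; the JSONL trace format, the app callable, and the exact-match scoring are all hypothetical:

```python
# Sketch of turning logged traces into an eval dataset and benchmarking a
# change against it. The JSONL trace format and scoring rule are hypothetical.
import json
from pathlib import Path
from typing import Callable

def traces_to_dataset(trace_log: Path) -> list[dict]:
    """Keep the fields you need: the real user input and a reference output."""
    dataset = []
    for line in trace_log.read_text().splitlines():
        trace = json.loads(line)
        dataset.append({"input": trace["input"], "reference": trace["output"]})
    return dataset

def benchmark(app: Callable[[str], str], dataset: list[dict]) -> float:
    # Crude exact-match scoring; swap in whatever metric fits your task.
    hits = sum(app(row["input"]).strip() == row["reference"].strip() for row in dataset)
    return hits / len(dataset)

if __name__ == "__main__":
    dataset = traces_to_dataset(Path("traces.jsonl"))
    print("candidate score:", benchmark(lambda q: q, dataset))  # replace lambda with your app
```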
## Evaluation Metrics & Methodology
* What metrics do you use (e.g., BERTScore, ROUGE, F1)? Are similarity metrics still useful?
* Do you use step-by-step evaluations or evaluate full responses?
* How do you evaluate VLM (vision-language model) summarization? Do you sample outputs or extract named entities?
* How do you approach offline (ground truth) vs. online evaluation?
* How do you handle uncertainty or "don’t know" cases? (Temperature settings?)
* How do you evaluate multi-turn conversations?
* A/B comparisons and discrete labels (e.g., good/bad) are easier to interpret (see the judge sketch after this list).
* It’s important to counteract bias toward your own favorite eval questions—ensure a diverse dataset.
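To make the discrete-labels point concrete, here's a sketch of an LLM judge that returns GOOD/BAD instead of a numeric score. It uses the OpenAI Python SDK purely as an example backend; the model name, prompt wording, and label set are arbitrary choices, not recommendations from the FAQ.

```python
# Sketch of an LLM-as-judge that emits a discrete label instead of a numeric
# score. Model name, prompt wording, and label set are illustrative choices.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading an answer to a user question.
Question: {question}
Answer: {answer}
Reply with exactly one word: GOOD or BAD."""

def judge(question: str, answer: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    label = resp.choices[0].message.content.strip().upper()
    return label if label in {"GOOD", "BAD"} else "BAD"  # treat anything unparseable as a failure

if __name__ == "__main__":
    print(judge("What is the capital of France?", "Paris."))
```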
## Prompting & Models
* Do you modify prompts based on the specific app being evaluated?
* Where do you store prompts—text files, Prompty, database, or in code?
* Do you have domain experts edit or review prompts?
* How do you choose which model to use?
## Evaluation Infrastructure
* How do you choose an evaluation framework?
* What platforms do you use to gather domain expert feedback or labels?
* Do domain experts label outputs or also help with prompt design?
## User Feedback & Observability
* Do you collect thumbs up / thumbs down feedback?
* How does observability help identify failure modes?
* Do models tend to favor their own outputs? (There's research on this.)
I personally work on adding evaluation to our most popular Azure RAG samples, and put a Textual CLI interface in this repo that I've found helpful for reviewing the eval results: https://github.com/Azure-Samples/ai-rag-chat-evaluator
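That repo has its own interface, but as an illustration of the idea, here's a minimal Textual sketch (not the actual code from the repo) that loads per-question results from a hypothetical JSONL file and shows them in a scrollable table:

```python
# Minimal Textual sketch for reviewing eval results; the JSONL result format
# is hypothetical and this is not the implementation from ai-rag-chat-evaluator.
import json
from pathlib import Path

from textual.app import App, ComposeResult
from textual.widgets import DataTable

class EvalReviewApp(App):
    """Scrollable table of per-question eval results."""

    def compose(self) -> ComposeResult:
        yield DataTable()

    def on_mount(self) -> None:
        table = self.query_one(DataTable)
        table.add_columns("question", "answer", "score")
        for line in Path("eval_results.jsonl").read_text().splitlines():
            row = json.loads(line)
            table.add_row(row["question"], row["answer"][:80], str(row["score"]))

if __name__ == "__main__":
    EvalReviewApp().run()
```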
I am looking for solid evidence of the efficacy of folk theories about how to make AI perform evaluation.
Seems to me a bunch of people are hoping that AI can test AI, and that it can to some degree. But in the end AI cannot be accountable for such testing, and we can never know all the holes in its judgment, nor can we expect that fixing a hole will not tear open other holes.
Ah! This is horrible advice. Why recommend reinventing the wheel when there is already great open source software available? Just use https://github.com/HumanSignal/label-studio/ or any other open source annotation tool you want to get started. These tools already cover pretty much all the possible use cases, and where they don't, you can build on top of them instead of building from zero.
There’s nothing wrong with starting from scratch or rebuilding an existing tool from the ground up. There’s no reason to blindly build from the status quo.
> There’s nothing wrong with starting from scratch or rebuilding an existing tool from the ground up. There’s no reason to blindly build from the status quo.
Generally speaking, all the options are fine, but not if you want to get something up as fast as you can or if your team is just piloting something. I think the time you'd spend vibe coding it is greater than the time to set up any of those tools.
And BTW, you shouldn't vibe code something that proprietary data flows through. If anything, work with a copilot.
afro88•7h ago
> Q: How much time should I spend on model selection?
> Many developers fixate on model selection as the primary way to improve their LLM applications. Start with error analysis to understand your failure modes before considering model switching. As Hamel noted in office hours, “I suggest not thinking of switching model as the main axes of how to improve your system off the bat without evidence. Does error analysis suggest that your model is the problem?”
If there's a clear jump in evals from one model to the next (e.g., Gemini 2 to 2.5, or Claude 3.7 to 4), that will level up your system pretty easily. Use the best models you can, if you can afford it.
softwaredoug•6h ago
Also, the “if you can afford it” part can be a fairly non-trivial decision.
simonw•6h ago
> I suggest not thinking of switching model as the main axes of how to improve your system off the bat without evidence.
If you try to fix problems by switching from, e.g., Gemini 2.5 Flash to OpenAI o3, but you don't have any evals in place, how will you tell if the model switch actually helped?
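For what it's worth, the comparison itself doesn't need much machinery once you have an eval set. A rough sketch, where `model_a`/`model_b`, the dataset shape, and the scoring rule are hypothetical stand-ins for your own app wired to each model:

```python
# Sketch of answering "did the model switch actually help?" by running the
# same eval set against both models. Callables, dataset, and scorer are
# hypothetical stand-ins.
from typing import Callable

def score(output: str, reference: str) -> float:
    # Stand-in scorer; in practice use a task-specific metric or judge.
    return float(reference.lower() in output.lower())

def compare(model_a: Callable[[str], str], model_b: Callable[[str], str], dataset: list[dict]) -> None:
    a_total = b_total = 0.0
    for row in dataset:
        a = score(model_a(row["input"]), row["reference"])
        b = score(model_b(row["input"]), row["reference"])
        a_total += a
        b_total += b
        if a != b:  # surface per-case disagreements, not just the aggregate
            print(f"{row['input'][:40]!r}: current={a} candidate={b}")
    n = len(dataset)
    print(f"current model: {a_total / n:.1%}  candidate model: {b_total / n:.1%}")
```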
phillipcarter•5h ago
How do you know that their evals match behavior in your application? What if the older, "worse" model actually does some things better, but if you don't have comprehensive enough evals for your own domain, you simply don't know to check the things it's good at?
FWIW I agree that in general, you should start with the most powerful model you can afford, and use that to bootstrap your evals. But I do not think you can rely on generic benchmarks and evals as a proxy for your own domain. I've run into this several times where an ostensibly better model does no better than the previous generation.
shrumm•4h ago
Instead, if a new model only does marginally worse - that’s a strong signal that the new model is indeed better for our use case.
ndr•3h ago
You'll have to adjust a bunch of prompts and measure. And if you didn't have a baseline to begin with, good luck YOLOing your way out of it.