This niche of the field has come a very long way just over the last 12 months, and the tooling is so much better than it used to be. Trying to do this from scratch, beyond a "kinda sorta good enough for now" project, is a full-time engineering project in and of itself.
I'm a maintainer of Opik (https://github.com/comet-ml/opik), but you have plenty of options in the space these days, whatever your particular needs are.
Alternatives to Opik include Braintrust (closed), Promptfoo (open, https://github.com/promptfoo/promptfoo) and Laminar (open, https://github.com/lmnr-ai/lmnr).
Maybe it's obvious to some, but I was hoping that page started off by explaining what the hell an AI eval specifically is.
I can probably guess from context but I'd love to have some validation.
I've appreciated Hamel's thinking on this topic.
> On a related note, unlike traditional unit tests, you don’t necessarily need a 100% pass rate. Your pass rate is a product decision, depending on the failures you are willing to tolerate.
Not sure how I feel about this, given expectations, culture, and tooling around CI. This suggestion seems to blur the line between a score from an eval and the usual idea of a unit test.
P.S. It is also useful to track regressions on a per-test basis.
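To make that concrete, here's a minimal sketch (not from the FAQ) of a CI gate where the pass-rate threshold is a product decision and regressions are tracked per test. The file names and the substring scorer are hypothetical stand-ins for a real eval harness.

```python
# Minimal sketch of a CI gate where the pass rate is a product decision rather
# than 100%, plus per-test regression tracking. File names and the substring
# check are hypothetical stand-ins for a real harness.
import json
import sys
from pathlib import Path

PASS_RATE_THRESHOLD = 0.85                  # product decision, not necessarily 1.0
BASELINE_PATH = Path("eval_baseline.json")  # per-test pass/fail from the last accepted run

def passes(case: dict) -> bool:
    # Stand-in scorer: does the captured output contain the expected answer?
    return case["expected"].lower() in case["output"].lower()

def main(cases: list[dict]) -> int:
    results = {c["id"]: passes(c) for c in cases}
    pass_rate = sum(results.values()) / len(results)

    # Per-test regressions: cases that passed on the baseline but fail now.
    baseline = json.loads(BASELINE_PATH.read_text()) if BASELINE_PATH.exists() else {}
    regressions = [cid for cid, ok in results.items() if baseline.get(cid) and not ok]

    print(f"pass rate: {pass_rate:.1%}  regressions: {regressions}")
    if pass_rate < PASS_RATE_THRESHOLD or regressions:
        return 1  # fail CI: overall quality dropped, or a known-good case broke
    BASELINE_PATH.write_text(json.dumps(results, indent=2))
    return 0

if __name__ == "__main__":
    sys.exit(main(json.loads(Path("eval_cases.json").read_text())))
```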
You can then create datasets out of these traces, and use them to benchmark improvements you make to your application.
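One way that can look in plain Python, with no particular tracing SDK assumed; the JSONL trace format, the app callable, and the exact-match scoring are all hypothetical:

```python
# Sketch of turning logged traces into an eval dataset and benchmarking a
# change against it. The JSONL trace format and scoring rule are hypothetical.
import json
from pathlib import Path
from typing import Callable

def traces_to_dataset(trace_log: Path) -> list[dict]:
    """Keep the fields you need: the real user input and a reference output."""
    dataset = []
    for line in trace_log.read_text().splitlines():
        trace = json.loads(line)
        dataset.append({"input": trace["input"], "reference": trace["output"]})
    return dataset

def benchmark(app: Callable[[str], str], dataset: list[dict]) -> float:
    # Crude exact-match scoring; swap in whatever metric fits your task.
    hits = sum(app(row["input"]).strip() == row["reference"].strip() for row in dataset)
    return hits / len(dataset)

if __name__ == "__main__":
    dataset = traces_to_dataset(Path("traces.jsonl"))
    print("candidate score:", benchmark(lambda q: q, dataset))  # replace lambda with your app
```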
## Evaluation Metrics & Methodology
* What metrics do you use (e.g., BERTScore, ROUGE, F1)? Are similarity metrics still useful?
* Do you use step-by-step evaluations or evaluate full responses?
* How do you evaluate VLM (vision-language model) summarization? Do you sample outputs or extract named entities?
* How do you approach offline (ground truth) vs. online evaluation?
* How do you handle uncertainty or "don’t know" cases? (Temperature settings?)
* How do you evaluate multi-turn conversations?
* A/B comparisons and discrete labels (e.g., good/bad) are easier to interpret (see the judge sketch after this list).
* It’s important to counteract bias toward your own favorite eval questions—ensure a diverse dataset.
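To make the discrete-labels point concrete, here's a sketch of an LLM judge that returns GOOD/BAD instead of a numeric score. It uses the OpenAI Python SDK purely as an example backend; the model name, prompt wording, and label set are arbitrary choices, not recommendations from the FAQ.

```python
# Sketch of an LLM-as-judge that emits a discrete label instead of a numeric
# score. Model name, prompt wording, and label set are illustrative choices.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading an answer to a user question.
Question: {question}
Answer: {answer}
Reply with exactly one word: GOOD or BAD."""

def judge(question: str, answer: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    label = resp.choices[0].message.content.strip().upper()
    return label if label in {"GOOD", "BAD"} else "BAD"  # treat anything unparseable as a failure

if __name__ == "__main__":
    print(judge("What is the capital of France?", "Paris."))
```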
## Prompting & Models
* Do you modify prompts based on the specific app being evaluated?
* Where do you store prompts—text files, Prompty, database, or in code?
* Do you have domain experts edit or review prompts?
* How do you choose which model to use?
## Evaluation Infrastructure
* How do you choose an evaluation framework?
* What platforms do you use to gather domain expert feedback or labels?
* Do domain experts label outputs or also help with prompt design?
## User Feedback & Observability
* Do you collect thumbs up / thumbs down feedback?
* How does observability help identify failure modes?
* Do models tend to favor their own outputs? (There's research on this.)
I personally work on adding evaluation to our most popular Azure RAG samples, and put a Textual CLI interface in this repo that I've found helpful for reviewing the eval results: https://github.com/Azure-Samples/ai-rag-chat-evaluator
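That repo has its own interface, but as an illustration of the idea, here's a minimal Textual sketch (not the actual code from the repo) that loads per-question results from a hypothetical JSONL file and shows them in a scrollable table:

```python
# Minimal Textual sketch for reviewing eval results; the JSONL result format
# is hypothetical and this is not the implementation from ai-rag-chat-evaluator.
import json
from pathlib import Path

from textual.app import App, ComposeResult
from textual.widgets import DataTable

class EvalReviewApp(App):
    """Scrollable table of per-question eval results."""

    def compose(self) -> ComposeResult:
        yield DataTable()

    def on_mount(self) -> None:
        table = self.query_one(DataTable)
        table.add_columns("question", "answer", "score")
        for line in Path("eval_results.jsonl").read_text().splitlines():
            row = json.loads(line)
            table.add_row(row["question"], row["answer"][:80], str(row["score"]))

if __name__ == "__main__":
    EvalReviewApp().run()
```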
I am looking for solid evidence of the efficacy of folk theories about how to make AI perform evaluation.
Seems to me a bunch of people are hoping that AI can test AI, and that it can to some degree. But in the end AI cannot be accountable for such testing, and we can never know all the holes in its judgment, nor can we expect that fixing a hole will not tear open other holes.
Ah! This is horrible advice. Why recommend reinventing the wheel when there is already great open source software available? Just use https://github.com/HumanSignal/label-studio/ or any other open source annotation tool you want to get started. These tools already cover pretty much all the possible use cases, and where they don't, you can build on top of them instead of building from zero.
There’s nothing wrong with starting from scratch or rebuilding an existing tool from the ground up. There’s no reason to blindly build from the status quo.
> There’s nothing wrong with starting from scratch or rebuilding an existing tool from the ground up. There’s no reason to blindly build from the status quo.
Generally speaking, all the options are fine, but not if you want to get something up as fast as you can or if your team is just piloting something. I think the time you'd spend vibe coding it is greater than the time to set up any of those tools.
And BTW, you shouldn't vibe code something that proprietary data flows through. If anything, work with a copilot.
afro88•7h ago
> Q: How much time should I spend on model selection?
> Many developers fixate on model selection as the primary way to improve their LLM applications. Start with error analysis to understand your failure modes before considering model switching. As Hamel noted in office hours, “I suggest not thinking of switching model as the main axes of how to improve your system off the bat without evidence. Does error analysis suggest that your model is the problem?”
If there's a clear jump in evals from one model to the next (e.g., Gemini 2 to 2.5, or Claude 3.7 to 4), that will level up your system pretty easily. Use the best models you can, if you can afford it.
softwaredoug•6h ago
Also, the “if you can afford it” part can be a fairly non-trivial decision.
simonw•6h ago
> I suggest not thinking of switching model as the main axes of how to improve your system off the bat without evidence.
If you try to fix problems by switching from, e.g., Gemini 2.5 Flash to OpenAI o3, but you don't have any evals in place, how will you tell if the model switch actually helped?
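For what it's worth, the comparison itself doesn't need much machinery once you have an eval set. A rough sketch, where `model_a`/`model_b`, the dataset shape, and the scoring rule are hypothetical stand-ins for your own app wired to each model:

```python
# Sketch of answering "did the model switch actually help?" by running the
# same eval set against both models. Callables, dataset, and scorer are
# hypothetical stand-ins.
from typing import Callable

def score(output: str, reference: str) -> float:
    # Stand-in scorer; in practice use a task-specific metric or judge.
    return float(reference.lower() in output.lower())

def compare(model_a: Callable[[str], str], model_b: Callable[[str], str], dataset: list[dict]) -> None:
    a_total = b_total = 0.0
    for row in dataset:
        a = score(model_a(row["input"]), row["reference"])
        b = score(model_b(row["input"]), row["reference"])
        a_total += a
        b_total += b
        if a != b:  # surface per-case disagreements, not just the aggregate
            print(f"{row['input'][:40]!r}: current={a} candidate={b}")
    n = len(dataset)
    print(f"current model: {a_total / n:.1%}  candidate model: {b_total / n:.1%}")
```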
phillipcarter•5h ago
How do you know that their evals match behavior in your application? What if the older, "worse" model actually does some things better, but if you don't have comprehensive enough evals for your own domain, you simply don't know to check the things it's good at?
FWIW I agree that in general, you should start with the most powerful model you can afford, and use that to bootstrap your evals. But I do not think you can rely on generic benchmarks and evals as a proxy for your own domain. I've run into this several times where an ostensibly better model does no better than the previous generation.
shrumm•4h ago
Instead, if a new model only does marginally worse - that’s a strong signal that the new model is indeed better for our use case.
ndr•3h ago
You'll have to adjust a bunch of prompts and measure. And if you didn't have a baseline to begin with, good luck YOLOing your way out of it.