Over and over again, my experience building production AI tools and systems has been that evaluations are vital for improving performance.
I've also seen a lot of people proposing some variation of "LLM as critic" as a solution to this, but I've never seen empirical evidence that it works. Furthermore, I've worked with a pretty well-respected researcher in this space, and in our internal experiments we found that LLMs were not good critics.
Results are always changing, so I'm very open to the possibility that someone has successfully figured out how to use "LLM as critic," but without the foundation of some basic evals to compare against, I remain skeptical.
> Furthermore, I've worked with a pretty well-respected researcher in this space, and in our internal experiments we found that LLMs were not good critics
This is an idea that seems so obvious in retrospect, after using LLMs and getting so many flattering responses telling us we’re right and complimenting our inputs.
For what it’s worth, I’ve heard from some people who said they were getting better results by intentionally using a different model family for the eval portion. Feels like having a model in the same family evaluate its own output triggers too many false positives.
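As a rough illustration of what that cross-family setup can look like (the function names, stubs, and rubric below are made up; you'd swap in whatever clients you actually use):

```python
# Sketch of cross-family judging. call_model_a / call_model_b are stand-ins
# for real client calls to two *different* model families (hypothetical).

def call_model_a(prompt: str) -> str:
    # stand-in for the model under test
    return "stubbed answer from model family A"

def call_model_b(prompt: str) -> str:
    # stand-in for the judge model from a different family/vendor
    return "PASS"

def judge(question: str) -> bool:
    answer = call_model_a(question)
    rubric = (
        "Grade another model's answer strictly.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply with exactly PASS or FAIL plus one sentence of justification."
    )
    verdict = call_model_b(rubric)
    return verdict.strip().upper().startswith("PASS")

if __name__ == "__main__":
    print(judge("Summarize the refund policy in one sentence."))
```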
Now granted, you could say it was “flattering that instruction”, but it sure didn’t flatter me. It absolutely eviscerated my code, calling out numerous security issues (which were real), all manner of code smells and bad architectural decisions, and ended by saying that the codebase appeared to have been thrown together in a rush with no mind toward future maintenance (which was… half true… maybe more true than I’d like to admit).
All this to say that it is far from obvious that LLMs are intrinsically bad critics.
Interestingly enough, we started with hundreds of evals, but after that experience my advice has become: fewer evals, tied more closely to specific features and product ambitions.
By that I mean: some evals should serve as a warning ("uh oh, that eval failed, don't push to prod"), others as a milestone ("woohoo! we got it to work!"), and all should be informed by the product roadmap. You should basically be able to tell where the product is going just by looking over the eval suite.
And, if you don't have evals, you really don't know if you're moving the needle at all. There were multiple situations where a tweak to a prompt passed an initial vibe check, but when run against the full eval suite, clearly performed worse.
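If it helps, here's a toy sketch of what I mean by "gate" vs "milestone" evals and by comparing a prompt tweak against the whole suite. Everything in it (eval names, run_prompt) is illustrative, not a real harness:

```python
from dataclasses import dataclass
from typing import Callable

def run_prompt(prompt_template: str, case_input: str) -> str:
    # stand-in for the real model call
    return f"stubbed output for: {prompt_template.format(input=case_input)}"

@dataclass
class Eval:
    name: str
    kind: str                      # "gate" = don't ship if it fails; "milestone" = aspirational
    case_input: str
    check: Callable[[str], bool]   # did the output pass?

EVALS = [
    Eval("refund_window_is_accurate", "gate",
         "What is the refund window?", lambda out: "30 days" in out),
    Eval("routes_billing_tickets", "milestone",
         "Route this ticket: 'card charged twice'", lambda out: "billing" in out.lower()),
]

def score(prompt_template: str) -> dict:
    results: dict[str, list[bool]] = {"gate": [], "milestone": []}
    for e in EVALS:
        results[e.kind].append(e.check(run_prompt(prompt_template, e.case_input)))
    return {kind: sum(passed) / len(passed) for kind, passed in results.items() if passed}

# Compare a prompt tweak against the full suite, not a single vibe check.
print("baseline:", score("You are a support assistant. {input}"))
print("tweak:   ", score("You are a terse support assistant. {input}"))
```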
The other piece of advice would be: evals don't have to be sophisticated, just repeatable and agnostic to who's running them. Heck, even "vibe checks" can be good evals, if they're written down and there's some consensus among multiple people on whether they passed or not.
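A written-down vibe check can be as simple as a rubric plus recorded votes, something like this (the names and threshold are just illustrative):

```python
# A vibe check only counts if the rubric is written down and the votes are recorded.
VIBE_CHECK = {
    "id": "onboarding_tone",
    "rubric": "Response greets the user, explains the next step, avoids jargon.",
}

def consensus(votes: dict[str, bool], threshold: float = 0.5) -> bool:
    # passes only if a majority of reviewers agree it passed
    return sum(votes.values()) / len(votes) > threshold

votes = {"alice": True, "bob": True, "carol": False}  # recorded per run
print(VIBE_CHECK["id"], "passed:", consensus(votes))
```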
In DSL-style agents, wouldn't giving LLMs info about what structured inputs are needed to call functions, as well as what outputs to expect, probably result in better planning?
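Something like this is what I have in mind: include both the input schema and the expected output schema in the planning prompt. This is a generic JSON-schema-ish sketch, not any particular framework's tool format:

```python
import json

# Hypothetical tool description that exposes inputs AND expected outputs.
TOOLS = [
    {
        "name": "lookup_order",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
        "output_schema": {
            "type": "object",
            "properties": {
                "status": {"type": "string", "enum": ["shipped", "pending", "refunded"]},
                "eta_days": {"type": "integer"},
            },
        },
    }
]

# Giving the planner both schemas lets it reason about what it will get back,
# not just what it has to send.
planner_prompt = (
    "Plan the tool calls needed to answer the user.\n"
    f"Available tools (inputs and outputs):\n{json.dumps(TOOLS, indent=2)}\n"
    "User: Where is order 12345?"
)
print(planner_prompt)
```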
What really resonates is the bit about frustrating behaviors signaling deeper system issues, not just model quirks. In my own experiments, I've had agents stubbornly ignore tools because I forgot to expose the right APIs, and it made me rethink how we treat these as "intelligent" when they're really just following our flawed setups. It pushes us toward more robust orchestration, where humans handle the high-level intentions and AI fills in the execution gaps seamlessly.
This ties into broader ideas on how AI interfaces will evolve as models get smarter. I extrapolate more of this thinking and dive deeper into human–AI interfaces on my blog if anyone’s interested in checking it out: https://henriquegodoy.com/blog/stream-of-consciousness