Principles for production AI agents

https://www.app.build/blog/six-principles-production-ai-agents

53•carlotasoto•6h ago

Comments

carlotasoto•6h ago

Practical lessons from building production agentic systems

roadside_picnic•2h ago

Did we just give up on evaluations these days?

Over and over again, my experience building production AI tools/systems has been that evaluations are vital for improving performance.

I've also seen a lot of people proposing some variation of "LLM as critic" as a solution to this, but I've never seen empirical evidence that this works. Furthermore, I've worked with a pretty well-respected researcher in this space, and in our internal experiments we found that LLMs were not good critics.

Results are always changing, so I'm very open to the possibility that someone has successfully figured out how to use "LLM as critic", but without the foundation of some basic evals to compare against, I remain skeptical.
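A minimal sketch of what "basic evals to compare against" could look like for a critic: score its verdicts against a small human-labeled set before trusting it. Everything here is an assumption for illustration. `LabeledExample`, `score_critic`, the stub `always_approve` critic, and the four examples are hypothetical, and a real critic would be a function that calls a model and returns True when it approves an output.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class LabeledExample:
    output: str     # the model output being judged
    human_ok: bool  # ground-truth label from a human reviewer


def score_critic(critic: Callable[[str], bool],
                 examples: Iterable[LabeledExample]) -> dict:
    """Compare a critic's approve/reject verdicts with human labels."""
    tp = fp = tn = fn = 0
    for ex in examples:
        approved = critic(ex.output)
        if approved and ex.human_ok:
            tp += 1
        elif approved and not ex.human_ok:
            fp += 1  # critic praised something a human rejected
        elif not approved and not ex.human_ok:
            tn += 1
        else:
            fn += 1  # critic rejected something a human accepted
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "false_approval_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_rejection_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


def always_approve(output: str) -> bool:
    """Stub critic that praises everything (hypothetical worst case)."""
    return True


# Hypothetical labeled set: two acceptable outputs, two unacceptable ones.
EXAMPLES = [
    LabeledExample("cursor.execute('SELECT * FROM users WHERE id = ?', (uid,))", True),
    LabeledExample("cursor.execute('SELECT * FROM users WHERE id = ' + uid)", False),
    LabeledExample("return sorted(items)", True),
    LabeledExample("except Exception: pass", False),
]

if __name__ == "__main__":
    print(score_critic(always_approve, EXAMPLES))
```

On a balanced set like this one, a critic that approves everything gets half the labels right but approves every bad output, which is why the false approval rate is reported separately from accuracy.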

Aurornis•2h ago

Evals are a core part of any up-to-date LLM team. If some team is just winging it without robust eval practices, they’re not to be trusted.

> Furthermore, I've worked with a pretty well-respected researcher in this space, and in our internal experiments we found that LLMs were not good critics

This is an idea that seems so obvious in retrospect, after using LLMs and getting so many flattering responses telling us we’re right and complimenting our inputs.

For what it’s worth, I’ve heard from some people who said they were getting better results by intentionally using different LLM models for the eval portion. Feels like having a model in the same family evaluate its own output triggers too many false positives.

Uehreka•1h ago

I once asked Claude Code (Opus 4) to review a codebase I’d built, and threw in at the end of my prompt something like “No need to be nice about it.”

Now granted, you could say it was “flattering that instruction”, but it sure didn’t flatter me. It absolutely eviscerated my code, calling out numerous security issues (which were real), all manner of code smells and bad architectural decisions, and ended by saying that the codebase appeared to have been thrown together in a rush with no mind toward future maintenance (which was… half true… maybe more true than I’d like to admit).

All this to say that it is far from obvious that LLMs are intrinsically bad critics.

Herring•1h ago

I have an idea. What if we used a third LLM to evaluate how good the secondary LLM is at critiquing the primary LLM?

colonCapitalDee•14m ago

The problem isn't that LLMs can't be critical, it's that LLMs don't have taste. It's easy to get an LLM to give praise, and it's easy to get an LLM to give criticism, but getting an LLM to praise good things and criticize bad things is currently impossible for non-trivial inputs. That's not to say that prompting your LLM to generate criticism is useless, it's just that any LLM prompted to generate criticism is going to criticize things that are actually fine, just like how an LLM prompted to generate praise (which is effectively the default behavior) is going to praise things that are deeply not fine.

prats226•2h ago

I see that in tool calling, we usually specify just the inputs to functions and not what typed output is expected from the function.

In DSL-style agents, giving LLMs info about what structured inputs are needed to call functions, as well as what outputs are expected, would probably result in better planning?
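One way to picture the idea, as a sketch under stated assumptions rather than any particular framework's API: a tool description that carries an output schema next to the input schema, so a planner (or a plain type check) can see which tool results can feed which tool arguments. `ToolSpec`, `render_for_planner`, `chainable_fields`, and the two flight tools are hypothetical names made up for this example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    input_schema: dict   # JSON-Schema-like description of the arguments
    output_schema: dict  # JSON-Schema-like description of the result


def render_for_planner(tools: list) -> str:
    """Render each tool with both its input and output types for a prompt."""
    lines = []
    for tool in tools:
        lines.append(f"{tool.name}: {tool.description}")
        lines.append(f"  input:  {json.dumps(tool.input_schema, sort_keys=True)}")
        lines.append(f"  output: {json.dumps(tool.output_schema, sort_keys=True)}")
    return "\n".join(lines)


def chainable_fields(producer: ToolSpec, consumer: ToolSpec) -> list:
    """Consumer arguments that the producer's output supplies with the same type."""
    produced = producer.output_schema.get("properties", {})
    needed = consumer.input_schema.get("properties", {})
    return sorted(
        name for name, spec in needed.items()
        if name in produced and produced[name].get("type") == spec.get("type")
    )


# Hypothetical tools used only to illustrate the idea.
SEARCH_FLIGHTS = ToolSpec(
    name="search_flights",
    description="Find a flight between two airports.",
    input_schema={"type": "object", "properties": {
        "origin": {"type": "string"}, "destination": {"type": "string"}}},
    output_schema={"type": "object", "properties": {
        "flight_id": {"type": "string"}, "price_usd": {"type": "number"}}},
)

BOOK_FLIGHT = ToolSpec(
    name="book_flight",
    description="Book a seat on a flight.",
    input_schema={"type": "object", "properties": {
        "flight_id": {"type": "string"}, "passenger": {"type": "string"}}},
    output_schema={"type": "object", "properties": {
        "confirmation": {"type": "string"}}},
)

if __name__ == "__main__":
    print(render_for_planner([SEARCH_FLIGHTS, BOOK_FLIGHT]))
    print(chainable_fields(SEARCH_FLIGHTS, BOOK_FLIGHT))
```

Whether this actually improves planning is the open question in the comment above; the sketch only shows that the information is cheap to declare and easy to check mechanically.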

SrslyJosh•2h ago

"Don't."

lacoolj•1h ago

Always hard to take an article seriously when it has typos, some of which are repeated ("promt" in the graphic on Principle 2)