What I liked here is how unglamorous it is: tiny prompts, context pulled in only when needed, evaluating against real usage, and resisting multi‑agent stuff until a single boring pipeline is stable. They also pair rule checks with model checks and expect some reward‑hacking. Curious how others keep evals from drifting in production.
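For anyone who hasn't tried the "rule checks + model checks" pairing: a minimal sketch of what I mean (not from the article; `call_judge_model`, the example rules, and the case names are all made up). The useful signal is the disagreement case, where the model judge passes an output that a hard rule fails.

```python
# Minimal sketch (not the article's implementation) of pairing deterministic
# rule checks with a model-based check on the same eval case.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    case_id: str
    rule_pass: bool
    model_pass: bool

    @property
    def suspicious(self) -> bool:
        # Model says "good" while a hard rule fails: classic reward-hacking smell.
        return self.model_pass and not self.rule_pass

def run_case(case_id: str, output: str,
             rules: list[Callable[[str], bool]],
             call_judge_model: Callable[[str], bool]) -> EvalResult:
    rule_pass = all(rule(output) for rule in rules)   # cheap, deterministic
    model_pass = call_judge_model(output)             # slow, fuzzy, gameable
    return EvalResult(case_id, rule_pass, model_pass)

# Hypothetical rules: the kind of thing a model judge tends to wave through.
rules = [
    lambda out: len(out) < 2000,            # no padding to look thorough
    lambda out: "I cannot" not in out,      # no quiet refusals
]

# call_judge_model stubbed out here; in practice it's whatever LLM call you already have.
result = run_case("ticket-123", "some model output", rules,
                  call_judge_model=lambda out: True)
if result.suspicious:
    print(f"{result.case_id}: judge passed it but a rule failed, review manually")
```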
aaronSong•2h ago