We run an LLM system that reads messy medical records and determines clinical trial eligibility.
We tried eval platforms, LLM-as-judge, and automated prompt optimizers. None helped with what actually mattered: hidden domain policies that weren’t explicitly written anywhere.
We ended up building our own annotation UI, prompt integration workflow (via Claude Code SDK), and HTML diff-based experiment reports.
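For anyone curious what the diff-report part can look like, here is a minimal sketch (not our actual tooling) using Python's stdlib `difflib.HtmlDiff` to render a side-by-side HTML comparison of model outputs from two prompt versions. The filenames and sample outputs are made up for illustration.

```python
import difflib

# Hypothetical outputs from two prompt versions on the same patient record
old_outputs = ["eligible: yes", "reason: age 45 within 18-65 range"]
new_outputs = ["eligible: no", "reason: prior chemotherapy excludes patient"]

# HtmlDiff renders a complete HTML page with a side-by-side diff table
report = difflib.HtmlDiff().make_file(
    old_outputs, new_outputs,
    fromdesc="prompt v1", todesc="prompt v2",
)

with open("experiment_report.html", "w") as f:
    f.write(report)
```

Opening the generated file in a browser shows line-level highlights, which is often enough to spot where a prompt change flipped an eligibility decision.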
The biggest lesson: off-the-shelf eval/annotation/prompt-optimization tools are subpar because they can only be generic.
Curious whether others building AI products have reached the same conclusion.
consumer451•1h ago
I will be dealing with something along these lines next month. Thanks for sharing.