What I’m looking for:
Offline evals / scorecards (benchmarks, rubrics, automated tests)
Production monitoring (drift, hallucination detection, latency/cost metrics)
Ability to tag & slice by model version / prompt version / user segment
Integration with product metrics (user success, retention, conversion) and CI/CD gating
Prefer options that are scriptable and support custom metrics/rubrics (rough sketch of what I mean just below). Open-source or SaaS are both fine; privacy/on-prem options are a plus.
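To make "scriptable + custom metrics + CI/CD gating" concrete, here's roughly the shape I'm after. Everything in it is invented for illustration (the metric is a token-overlap stand-in for a real judge, the golden set is a toy, and the answers are hard-coded instead of coming from the model under test); I'm not claiming this is how any particular platform works.

```python
# Illustrative sketch only: a plain Python eval script that computes a custom
# metric, tags every result with model/prompt version for later slicing, and
# exits non-zero so a CI pipeline can gate a deploy.
import json
import sys
from dataclasses import asdict, dataclass

MODEL_VERSION = "model-2024-06"      # illustrative tags, not real identifiers
PROMPT_VERSION = "assistant_v12"

@dataclass
class EvalResult:
    case_id: str
    model_version: str
    prompt_version: str
    groundedness: float   # 0..1, did the answer stay within the retrieved context?
    exact_match: bool

def score_groundedness(answer: str, context: str) -> float:
    """Placeholder metric: fraction of answer tokens present in the context.
    In practice this would be an LLM judge or NLI-based scorer."""
    answer_tokens = answer.lower().split()
    if not answer_tokens:
        return 0.0
    context_tokens = set(context.lower().split())
    return sum(t in context_tokens for t in answer_tokens) / len(answer_tokens)

# Toy golden set; a real one would be a versioned dataset, and "answer"
# would come from calling the model under test rather than being hard-coded.
CASES = [
    {"id": "refund-policy", "context": "Refunds are issued within 30 days.",
     "expected": "within 30 days", "answer": "Refunds are issued within 30 days."},
    {"id": "shipping", "context": "We ship to the EU and US only.",
     "expected": "EU and US", "answer": "We ship worldwide."},
]

def run_eval() -> list[EvalResult]:
    results = []
    for case in CASES:
        results.append(EvalResult(
            case_id=case["id"],
            model_version=MODEL_VERSION,
            prompt_version=PROMPT_VERSION,
            groundedness=score_groundedness(case["answer"], case["context"]),
            exact_match=case["expected"].lower() in case["answer"].lower(),
        ))
    return results

if __name__ == "__main__":
    results = run_eval()
    # Emit JSONL so the same records can later be sliced by model/prompt version.
    for r in results:
        print(json.dumps(asdict(r)))
    mean_groundedness = sum(r.groundedness for r in results) / len(results)
    # CI gate: fail the pipeline if the aggregate metric drops below a threshold.
    if mean_groundedness < 0.8:
        sys.exit(f"groundedness {mean_groundedness:.2f} below 0.80 gate")
```

Anything I can wire up like that (swap in my own metrics, tag results, fail a pipeline on regression) clears the bar; a pure dashboard-only tool doesn't.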
Things I’ve considered (but haven’t committed to): open-source eval frameworks, ML monitoring libs, and a few commercial platforms that claim “LLM evals + monitoring.” I’m not married to any single approach.
Questions for the community:
What tools/platforms have you used for full-stack LLM analytics (evals -> prod monitoring -> product KPI correlation)?
What worked vs what failed at scale? Any gotchas (cost, data volume, latency, false positives in hallucination detection)?
Recommended combos (e.g., offline eval + experiment platform + monitoring tool) that actually worked in production?
Any “must-have” rubrics/metrics you’d recommend for a product team shipping LLM features? (Example of the kind of rubric I mean below.)
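For context, when I say "rubric" I mean something roughly like this: a handful of named criteria with score ranges and gating thresholds, tracked per model/prompt version. The criteria and numbers here are invented just to show the shape, not a recommendation.

```python
# Illustrative rubric shape only; criteria and thresholds are made up.
RUBRIC = {
    "groundedness":          {"scale": (1, 5), "gate_mean": 4.0},   # stays within retrieved context
    "instruction_following": {"scale": (1, 5), "gate_mean": 4.0},
    "tool_call_validity":    {"scale": (0, 1), "gate_mean": 0.95},  # well-formed, schema-valid tool calls
    "tone_and_safety":       {"scale": (1, 5), "gate_mean": 4.5},
}
```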
If you’ve got a short write-up, blog post, or GitHub repo showing your setup, please drop it - I’ll read it and credit you. Happy to share more about my product (multi-turn assistant + retrieval + some tool calls) if that helps.
Thanks!