What I’m looking for:
Offline evals / scorecards (benchmarks, rubrics, automated tests)
Production monitoring (drift, hallucination detection, latency/cost metrics)
Ability to tag & slice by model version / prompt version / user segment
Integration with product metrics (user success, retention, conversion) and CI/CD gating
Prefer options that are scriptable and support custom metrics/rubrics (rough sketch of what I mean just below). Open-source or SaaS are both fine; privacy/on-prem options are a plus.
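To make "scriptable + custom metrics + CI/CD gating" concrete, here's roughly the shape I'm after. Everything in it is invented for illustration (the metric is a token-overlap stand-in for a real judge, the golden set is a toy, and the answers are hard-coded instead of coming from the model under test); I'm not claiming this is how any particular platform works.

```python
# Illustrative sketch only: a plain Python eval script that computes a custom
# metric, tags every result with model/prompt version for later slicing, and
# exits non-zero so a CI pipeline can gate a deploy.
import json
import sys
from dataclasses import asdict, dataclass

MODEL_VERSION = "model-2024-06"      # illustrative tags, not real identifiers
PROMPT_VERSION = "assistant_v12"

@dataclass
class EvalResult:
    case_id: str
    model_version: str
    prompt_version: str
    groundedness: float   # 0..1, did the answer stay within the retrieved context?
    exact_match: bool

def score_groundedness(answer: str, context: str) -> float:
    """Placeholder metric: fraction of answer tokens present in the context.
    In practice this would be an LLM judge or NLI-based scorer."""
    answer_tokens = answer.lower().split()
    if not answer_tokens:
        return 0.0
    context_tokens = set(context.lower().split())
    return sum(t in context_tokens for t in answer_tokens) / len(answer_tokens)

# Toy golden set; a real one would be a versioned dataset, and "answer"
# would come from calling the model under test rather than being hard-coded.
CASES = [
    {"id": "refund-policy", "context": "Refunds are issued within 30 days.",
     "expected": "within 30 days", "answer": "Refunds are issued within 30 days."},
    {"id": "shipping", "context": "We ship to the EU and US only.",
     "expected": "EU and US", "answer": "We ship worldwide."},
]

def run_eval() -> list[EvalResult]:
    results = []
    for case in CASES:
        results.append(EvalResult(
            case_id=case["id"],
            model_version=MODEL_VERSION,
            prompt_version=PROMPT_VERSION,
            groundedness=score_groundedness(case["answer"], case["context"]),
            exact_match=case["expected"].lower() in case["answer"].lower(),
        ))
    return results

if __name__ == "__main__":
    results = run_eval()
    # Emit JSONL so the same records can later be sliced by model/prompt version.
    for r in results:
        print(json.dumps(asdict(r)))
    mean_groundedness = sum(r.groundedness for r in results) / len(results)
    # CI gate: fail the pipeline if the aggregate metric drops below a threshold.
    if mean_groundedness < 0.8:
        sys.exit(f"groundedness {mean_groundedness:.2f} below 0.80 gate")
```

Anything I can wire up like that (swap in my own metrics, tag results, fail a pipeline on regression) clears the bar; a pure dashboard-only tool doesn't.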
Things I’ve considered (but haven’t committed to): open-source eval frameworks, ML monitoring libs, and a few commercial platforms that claim “LLM evals + monitoring.” I’m not married to any single approach.
Questions for the community:
What tools/platforms have you used for full-stack LLM analytics (evals -> prod monitoring -> product KPI correlation)?
What worked vs what failed at scale? Any gotchas (cost, data volume, latency, false positives in hallucination detection)?
Recommended combos (e.g., offline eval + experiment platform + monitoring tool) that actually worked in production?
Any “must-have” rubrics/metrics you’d recommend for a product team shipping LLM features? (Example of the kind of rubric I mean below.)
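For context, when I say "rubric" I mean something roughly like this: a handful of named criteria with score ranges and gating thresholds, tracked per model/prompt version. The criteria and numbers here are invented just to show the shape, not a recommendation.

```python
# Illustrative rubric shape only; criteria and thresholds are made up.
RUBRIC = {
    "groundedness":          {"scale": (1, 5), "gate_mean": 4.0},   # stays within retrieved context
    "instruction_following": {"scale": (1, 5), "gate_mean": 4.0},
    "tool_call_validity":    {"scale": (0, 1), "gate_mean": 0.95},  # well-formed, schema-valid tool calls
    "tone_and_safety":       {"scale": (1, 5), "gate_mean": 4.5},
}
```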
If you’ve got a short write-up, blog post, or GitHub repo showing your setup, please drop it - I’ll read it and credit you. Happy to share more about my product (multi-turn assistant + retrieval + some tool calls) if that helps.
Thanks!