I've been building AI agents at work and the hardest part isn't the prompts or orchestration – it's answering "is this agent actually good?" in production.

Tracing tells you what happened. But I wanted to know how well it happened. So I built Auditi – it captures your LLM traces and spans and automatically evaluates them with LLM-as-a-judge + human annotation workflows.

Two lines to get started:

  auditi.init(api_key="...")
  auditi.instrument()  # monkey-patches OpenAI/Anthropic/Gemini

Every API call is captured with full span trees, token usage, and costs. No code changes to your existing LLM calls.

The interesting technical bit: the SDK monkey-patches client.chat.completions.create() at runtime (similar to how OpenTelemetry auto-instruments HTTP libraries). It wraps streaming responses with proxy iterators that accumulate content and extract usage from the final chunk – so even streamed responses get full cost tracking without the user doing anything.
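To make the patching and stream-wrapping idea concrete, here is a minimal sketch, not Auditi's actual code. It assumes a stand-in `FakeCompletions` client instead of a real provider SDK, and the names `StreamProxy`, `instrument`, and the `text`/`usage` chunk fields are all hypothetical, chosen only for illustration.

```python
from types import SimpleNamespace


class FakeCompletions:
    """Stand-in for a provider client (hypothetical; not a real SDK)."""

    def create(self, *, stream=False, **kwargs):
        if stream:
            # Assumption: usage arrives only on the final chunk.
            return iter([
                SimpleNamespace(text="Hel", usage=None),
                SimpleNamespace(text="lo", usage=None),
                SimpleNamespace(text="", usage={"total_tokens": 7}),
            ])
        return SimpleNamespace(text="Hello", usage={"total_tokens": 7})


class StreamProxy:
    """Passes chunks through unchanged while accumulating text and usage."""

    def __init__(self, inner, on_done):
        self._inner = iter(inner)
        self._on_done = on_done
        self._parts = []
        self._usage = None
        self._finished = False

    def __iter__(self):
        return self

    def __next__(self):
        try:
            chunk = next(self._inner)
        except StopIteration:
            # Report once, when the caller has drained the stream.
            if not self._finished:
                self._finished = True
                self._on_done("".join(self._parts), self._usage)
            raise
        if chunk.text:
            self._parts.append(chunk.text)
        if chunk.usage is not None:
            self._usage = chunk.usage
        return chunk


def instrument(completions, traces):
    """Replace completions.create with a wrapper that records every call."""
    original = completions.create

    def patched(*args, **kwargs):
        result = original(*args, **kwargs)
        if kwargs.get("stream"):
            return StreamProxy(
                result,
                lambda text, usage: traces.append({"text": text, "usage": usage}),
            )
        traces.append({"text": result.text, "usage": result.usage})
        return result

    completions.create = patched


traces = []
client = FakeCompletions()
instrument(client, traces)

# The calling code is unchanged: it still just iterates over chunks.
streamed = "".join(chunk.text for chunk in client.create(stream=True))
```

The caller sees the same chunks it would have seen without instrumentation; the trace is only written once the stream has been fully consumed, which is why usage from the last chunk is available at that point.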

What makes this different from just tracing:

- Built-in evaluators – 7 managed LLM judges (hallucination, relevance, correctness, toxicity, etc.) run automatically on every trace
- Span-level evaluation – scores each step in a multi-step agent, not just the final output
- Human annotation queues – when you need ground truth, not just vibes
- Dataset export – annotated traces export as JSONL/CSV/Parquet for fine-tuning
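As a rough illustration of why span-level scoring matters, here is a small sketch under stated assumptions. The `Span` type, `score_spans`, and the toy `non_empty_judge` are hypothetical and are not Auditi's API; a real judge would call an LLM rather than check for an empty string.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Span:
    """One step in an agent run (hypothetical shape, for illustration)."""
    name: str
    output: str
    children: list["Span"] = field(default_factory=list)


def score_spans(root: Span, judge: Callable[[Span], float]) -> dict[str, float]:
    """Walk the span tree and score every step, not only the root."""
    scores = {}
    stack = [root]
    while stack:
        span = stack.pop()
        scores[span.name] = judge(span)
        stack.extend(span.children)
    return scores


def non_empty_judge(span: Span) -> float:
    # Placeholder for an LLM judge: 1.0 if the step produced anything.
    return 1.0 if span.output.strip() else 0.0


trace = Span(
    name="agent",
    output="Paris is the capital of France.",
    children=[
        Span(name="retrieve", output=""),  # retrieval silently returned nothing
        Span(name="answer", output="Paris is the capital of France."),
    ],
)

scores = score_spans(trace, non_empty_judge)
```

In this toy trace the final output looks fine, but the `retrieve` step scores 0.0, which is the kind of failure that only shows up when each step is evaluated separately.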

Self-host with docker compose up.

I'd love feedback from anyone running AI agents or LLMs in production. What metrics do you actually look at? How do you decide if an agent response is "good enough"?

GitHub: https://github.com/deduu/auditi

Show HN: Auditi – open-source LLM tracing and evaluation platform