So I built Iris. It's an open-source MCP server — not an SDK, not a proxy. Any MCP-compatible agent (Claude Desktop, Cursor, or anything built with the MCP SDK) discovers and uses it automatically. Add it to your MCP config and your agent gains observability without touching your code.
What it does:
- 3 MCP tools: `log_trace` (full execution traces with spans, tool calls, token usage, cost in USD), `evaluate_output` (score output quality against configurable rules), and `get_traces` (query traces with filters and pagination)
- 12 built-in eval rules across 4 categories: completeness (output length, coverage), relevance (keyword overlap, hallucination markers), safety (PII detection for SSN/credit card/phone/email, prompt injection patterns, blocklist), and cost (USD threshold, token efficiency)
- Hierarchical span tree: trace exactly where in an agent's execution chain something went wrong — which tool call failed, which step was slow
- Aggregate cost tracking: the dashboard shows total agent spend across all your agents over any time window, not just per-trace cost. You can finally answer "what are my agents costing me?"
- Web dashboard: dark-mode React UI with summary cards, trace list, span tree view, and eval results with a per-rule breakdown
- SQLite storage: single file, no database server. Back it up, move it, or inspect it with any SQLite tool
- Custom eval rules defined with Zod schemas
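To give a sense of what the span tree buys you, here's a minimal sketch of walking a trace to find the failing step. The `Span` shape and field names here are a simplification for illustration, not Iris's actual schema:

```typescript
// Simplified span shape -- a sketch, not Iris's actual schema.
interface Span {
  name: string;
  status: "ok" | "error";
  durationMs: number;
  children: Span[];
}

// Depth-first walk that returns the path to the deepest failing span,
// i.e. "which tool call in the chain actually went wrong".
function findFailure(span: Span, path: string[] = []): string[] | null {
  const here = [...path, span.name];
  // A failing span whose children all succeeded is the root cause.
  if (span.status === "error" && span.children.every((c) => c.status === "ok")) {
    return here;
  }
  for (const child of span.children) {
    const found = findFailure(child, here);
    if (found) return found;
  }
  return span.status === "error" ? here : null;
}

const trace: Span = {
  name: "agent.run", status: "error", durationMs: 5120,
  children: [
    { name: "llm.plan", status: "ok", durationMs: 900, children: [] },
    {
      name: "tool.web_search", status: "error", durationMs: 4100,
      children: [
        { name: "http.get", status: "error", durationMs: 4000, children: [] },
      ],
    },
  ],
};

console.log(findFailure(trace)?.join(" > "));
// "agent.run > tool.web_search > http.get"
```

Without the hierarchy, all you'd know is "the run failed"; with it, you land directly on the slow, failing HTTP call.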
Security: API key auth, rate limiting (express-rate-limit), helmet headers, CORS, input validation, ReDoS-safe regex for user-supplied patterns, 1MB body limit.
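On the ReDoS point: one common mitigation (a sketch of the general technique, not necessarily what Iris does internally) is to treat user-supplied blocklist entries as literal strings, escaping regex metacharacters before compiling, so an attacker can never smuggle in a catastrophically backtracking pattern:

```typescript
// Escape regex metacharacters so a user-supplied string compiles as a
// literal pattern, never as an arbitrary (potentially ReDoS-prone) regex.
function escapeRegex(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical blocklist entry containing metacharacters.
const entry = "DROP TABLE (users)";
const re = new RegExp(escapeRegex(entry), "i");

console.log(re.test("please drop table (users) now")); // true
console.log(re.test("harmless output"));               // false
```

The `$&` in the replacement re-inserts the matched metacharacter after the backslash, so `a.b` becomes `a\.b`.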
Stack: TypeScript, Express 5, better-sqlite3, @modelcontextprotocol/sdk, Zod, pino.
Iris also exposes MCP resources — your agent can programmatically read iris://dashboard/summary to get aggregate metrics without opening the dashboard. And because every trace is stored in full, you're also building the audit trail that regulations like the EU AI Act will require by August 2026.
```
npm install -g @iris-eval/mcp-server
iris-mcp --transport http --dashboard
```
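For clients that launch servers themselves, registration might look like this in the client's MCP config. This is a sketch: the `"iris"` key is arbitrary, and whether `iris-mcp` defaults to stdio transport when run without flags is an assumption — check the README for the exact config:

```json
{
  "mcpServers": {
    "iris": {
      "command": "iris-mcp"
    }
  }
}
```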
Self-hosted, MIT licensed.
GitHub: https://github.com/iris-eval/mcp-server
npm: https://www.npmjs.com/package/@iris-eval/mcp-server
I'd appreciate feedback on two things specifically:
1. The eval rule system — are these the right 12 rules to ship with? What's missing?
2. The MCP tool API — three tools feels minimal but sufficient. Should trace logging and evaluation be combined or kept separate?
Check the roadmap for what's coming next: https://github.com/iris-eval/mcp-server/blob/main/docs/roadm...