The core problem with agent evals today is that one run tells you nothing: same prompt, same model, same tools, different result every time.
I built agentrial to fix this. It's a pytest-style CLI that runs your agent N times and gives you:
- Wilson confidence intervals on the pass rate
- Step-level failure attribution: a Fisher exact test pinpoints which tool call or reasoning step diverges between passing and failing runs (both statistics are sketched below)
- Real API cost, pulled from response metadata
- A GitHub Action that blocks PRs when reliability drops
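To make the first two bullets concrete, here is a minimal sketch of the underlying statistics, using the standard Wilson score formula and scipy's Fisher exact test. It's illustrative, not agentrial's actual code, and the contingency-table layout is my own framing of step attribution:

```python
# Illustrative sketch, not agentrial's implementation.
from math import sqrt

from scipy.stats import fisher_exact


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; better behaved than the naive
    normal approximation at small N or extreme pass rates."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (max(0.0, center - half), min(1.0, center + half))


# 17 passes out of 20 trials -> roughly (0.64, 0.95): wide enough that a
# single-digit trial count would tell you very little.
print(wilson_interval(17, 20))

# Step-level attribution: did failing runs take a particular divergent step
# more often than passing runs? 2x2 table of [took step, skipped step]
# for failing vs. passing runs (numbers made up).
table = [[9, 1],
         [2, 8]]
odds_ratio, p_value = fisher_exact(table)
print(f"Fisher exact p = {p_value:.4f}")
```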
Usage is minimal: install it, write a YAML config, and run "agentrial run":
pip install agentrial
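To give a rough feel for the config, here is an illustrative sketch; the field names are hypothetical, not agentrial's documented schema:

```yaml
# Hypothetical config, for illustration only -- not the real schema.
agent: my_agent.graph:build_agent   # import path to your agent factory (assumed)
trials: 100                         # how many times to run each case
cases:
  - name: refund-request
    input: "A customer asks for a refund on order #1234"
    pass_if:
      contains: "refund has been issued"
```

The idea is that "agentrial run" then repeats each case N times and reports the pass rate with its interval, the step attribution, and the cost.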
Tested extensively with LangGraph agents. 100 trials cost $0.06.
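For the CI gate, a bare-bones workflow that just installs agentrial and runs the trials on every pull request could look roughly like the sketch below. This is not the shipped GitHub Action; the fail-on-nonzero-exit behavior and the API-key variable are assumptions on my part.

```yaml
# Illustrative workflow, not the bundled GitHub Action.
name: agent-reliability
on: [pull_request]

jobs:
  trials:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install agentrial
      # Assumption: a reliability regression makes this exit nonzero,
      # which fails the job and blocks the PR.
      - run: agentrial run
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}  # whatever key your agent needs
```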
MIT licensed, no telemetry, runs locally.

Looking for feedback on what metrics matter most when you're shipping agents to production.