Existing eval tools treat agents like deterministic functions: run once, check output, done. But agents aren't deterministic - same input, different tool calls, different outputs.
agentrial runs every test N times (default 10) and gives you:
- Pass rate with Wilson confidence intervals (not "72%" but "72%, CI 55-84%") - a sketch of the interval math follows this list
- Step-level failure attribution (which exact step diverged between success and fail runs)
- Real cost tracking from API metadata
- A GitHub Action for CI/CD
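If you're wondering what the interval buys you over a raw percentage, here's the textbook Wilson score interval on its own (this is just the standard formula, not agentrial's internals). At the default N=10, a 7/10 pass rate comes with a very wide range, which is exactly why a single-run "pass" or "fail" tells you so little:

```python
# Standard Wilson score interval for a pass rate -- illustrative only,
# not agentrial's code. Numbers below are made up (7 passes in 10 trials).
from math import sqrt

def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval (z=1.96) for an observed pass rate."""
    p_hat = passes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

lo, hi = wilson_interval(7, 10)
print(f"pass rate 70%, 95% CI {lo:.0%}-{hi:.0%}")  # roughly 40%-89% at N=10
```

The Wilson interval behaves better than the usual normal approximation at small N and near 0% or 100%, which is the regime agent evals actually live in.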
Tested with Claude 3 Haiku on a 3-tool LangGraph agent: 100 trials, $0.06 total, full trajectory capture.
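For context, the test target was along these lines - a hypothetical stand-in using LangGraph's prebuilt ReAct agent and the LangChain Anthropic integration; the tool names and prompts here are illustrative, not the actual benchmark:

```python
# Hypothetical 3-tool LangGraph agent, roughly the shape of the test target.
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

@tool
def search_orders(customer_id: str) -> str:
    """Look up a customer's recent orders."""
    return f"orders for {customer_id}: [#1001, #1002]"

@tool
def get_refund_policy(category: str) -> str:
    """Return the refund policy for a product category."""
    return f"{category}: refundable within 30 days"

@tool
def issue_refund(order_id: str) -> str:
    """Issue a refund for a given order."""
    return f"refund issued for {order_id}"

model = ChatAnthropic(model="claude-3-haiku-20240307")
agent = create_react_agent(model, [search_orders, get_refund_policy, issue_refund])

result = agent.invoke({"messages": [("user", "Refund order #1002 for customer C42")]})
```

Run the same invocation N times and the tool-call sequence won't always match - that trajectory divergence is what the step-level attribution is built to surface.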
pip install agentrial
Open source, MIT licensed. LangGraph is supported today; CrewAI and AutoGen are coming next.