I have built TrainForgeTester, an open-source scenario test runner for AI agents that take actions (call tools).
The idea: test how agents perform in company-specific scenarios, not just on general benchmarks. More specifically, test for the failure modes that matter in production: taking the wrong action, skipping a required step, calling the wrong tool, or passing the wrong arguments.
TrainForgeTester lets you run multi-turn scenarios (you create these scenarios from your own use case and data, following the provided scenario schema) and check:
* tool calls and arguments
* strict or unordered tool execution
* expected responses
* regressions after model, prompt, or tool changes
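To make the checks concrete, here's a minimal sketch in Python of what comparing expected vs. actual tool calls in strict or unordered mode looks like. This is an illustration only, not the repo's actual API or schema:

```python
def check_tool_calls(expected, actual, strict_order=True):
    """Compare expected vs. actual tool calls.

    Each call is a (tool_name, args_dict) pair. In strict mode the
    sequence must match exactly; in unordered mode every expected call
    must appear somewhere in the actual calls, regardless of order.
    Returns a list of human-readable failure messages (empty = pass).
    """
    failures = []
    if strict_order:
        for i, (exp, act) in enumerate(zip(expected, actual)):
            if exp[0] != act[0]:
                failures.append(f"step {i}: expected tool '{exp[0]}', got '{act[0]}'")
            elif exp[1] != act[1]:
                failures.append(f"step {i}: '{exp[0]}' called with {act[1]}, expected {exp[1]}")
        if len(actual) < len(expected):
            missing = [name for name, _ in expected[len(actual):]]
            failures.append(f"missing calls: {missing}")
        elif len(actual) > len(expected):
            extra = [name for name, _ in actual[len(expected):]]
            failures.append(f"unexpected extra calls: {extra}")
    else:
        remaining = list(actual)
        for name, args in expected:
            match = next((c for c in remaining if c == (name, args)), None)
            if match is None:
                failures.append(f"missing call: {name} with {args}")
            else:
                remaining.remove(match)
    return failures


# Hypothetical example: the agent skipped the required 'verify_identity' step.
expected = [("verify_identity", {"user_id": "42"}),
            ("issue_refund", {"order_id": "A-17", "amount": 30})]
actual = [("issue_refund", {"order_id": "A-17", "amount": 30})]
print(check_tool_calls(expected, actual, strict_order=True))
```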
This scenario tester is the first part of the project (roughly v0.1.0).
I’m now working on the next part: a "scenario generator" that takes messy historical company data (customer support logs, agent traces, tool calls, transcripts, etc.) and turns it into testable scenarios for this framework. Again, I'm trying to make this as deterministic as possible.
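As a rough illustration of what I mean by deterministic, the generator essentially walks a raw trace and mechanically projects it into the scenario shape the runner checks, with no model call involved. The field names below are hypothetical, not the real schema:

```python
def trace_to_scenario(trace, scenario_id):
    """Project a raw agent trace into a testable scenario (hypothetical shape).

    `trace` is a list of events like:
      {"role": "user", "content": "..."}
      {"role": "assistant", "tool": "lookup_order", "args": {...}}
      {"role": "assistant", "content": "final answer"}
    The mapping is purely mechanical, so the same trace always yields
    the same scenario.
    """
    scenario = {
        "id": scenario_id,
        "turns": [],                # user messages that drive the conversation
        "expected_tool_calls": [],  # tool/args pairs mined from the trace
        "expected_response": None,  # last assistant message, if any
    }
    for event in trace:
        if event["role"] == "user":
            scenario["turns"].append(event["content"])
        elif event["role"] == "assistant" and "tool" in event:
            scenario["expected_tool_calls"].append(
                {"tool": event["tool"], "args": event["args"]})
        elif event["role"] == "assistant":
            scenario["expected_response"] = event["content"]
    return scenario


trace = [
    {"role": "user", "content": "I never received order A-17, refund me."},
    {"role": "assistant", "tool": "verify_identity", "args": {"user_id": "42"}},
    {"role": "assistant", "tool": "issue_refund", "args": {"order_id": "A-17", "amount": 30}},
    {"role": "assistant", "content": "Refund issued, you should see it in 3-5 days."},
]
print(trace_to_scenario(trace, "refund-missing-order"))
```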
Repo: https://github.com/TrainForge/TrainForgeTester
I’d love feedback on:
* real agent-testing use cases this does not cover yet (browser use, audio, video, mouse use)
* whether this direction makes sense
* where this could go as a product/devtool
* issues, edge cases, or missing features in the repo
Any GitHub issues, forks, or PRs would be highly appreciated.