I built Tinman because finding LLM failures in production is a pain in the ass. Traditional testing checks what you've already thought of. Tinman tries to find what you haven't.
It's an autonomous research agent that:

- Generates hypotheses about potential failure modes
- Designs and runs experiments to test them
- Classifies failures (reasoning errors, tool use, context issues, etc.)
- Proposes interventions and validates them via simulation
The core loop runs continuously. Each cycle informs the next.
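Here's a rough sketch of that loop in code. Everything is stubbed and the names are illustrative rather than the actual API; it's just to show the shape of a cycle.

```python
# Stubbed illustration of the hypothesize -> experiment -> classify -> intervene cycle.
# Function names are illustrative, not the real API.
import asyncio
import random

async def generate_hypotheses(findings):
    # Seed new hypotheses from what earlier cycles learned.
    return [f"hypothesis-{len(findings)}"]

async def run_experiment(hypothesis):
    # Design and execute a probe for this failure mode (stubbed with a coin flip).
    return {"hypothesis": hypothesis, "failed": random.random() < 0.2}

def classify_failure(result):
    # Bucket the failure: reasoning error, tool use, context, etc.
    return "reasoning_error" if result["failed"] else None

async def validate_intervention(failure):
    # Propose a fix and check it in simulation (stubbed).
    return {"failure": failure, "intervention": "add guardrail"}

async def research_loop():
    findings = []                      # each cycle's results feed the next
    while True:                        # the loop runs continuously
        for hypothesis in await generate_hypotheses(findings):
            result = await run_experiment(hypothesis)
            failure = classify_failure(result)
            if failure:
                findings.append(await validate_intervention(failure))
        await asyncio.sleep(60)        # pacing between cycles

if __name__ == "__main__":
    asyncio.run(research_loop())
```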
Why now: With tools like OpenClaw/ClawdBot giving agents real system access, the failure surface is way bigger than "bad chatbot response." Tinman has a gateway adapter that connects to OpenClaw's WebSocket stream for real-time analysis as requests flow through.
Three modes:

- LAB: unrestricted research against dev
- SHADOW: observe production, flag issues
- PRODUCTION: human approval required
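To make the modes concrete, here's one way the gating could be expressed. The enum values come straight from the list above; the gating logic is a simplified illustration, not the actual code.

```python
# Illustrative only: how run modes might gate whether an intervention gets applied.
from enum import Enum

class Mode(Enum):
    LAB = "lab"                # unrestricted research against dev
    SHADOW = "shadow"          # observe production, flag issues, never act
    PRODUCTION = "production"  # every action needs human sign-off

def can_apply_intervention(mode: Mode, human_approved: bool = False) -> bool:
    if mode is Mode.LAB:
        return True
    if mode is Mode.SHADOW:
        return False            # shadow mode only reports
    return human_approved       # PRODUCTION requires explicit approval

print(can_apply_intervention(Mode.SHADOW))            # False
print(can_apply_intervention(Mode.PRODUCTION, True))  # True
```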
Tech:

- Python, async throughout
- Extensible GatewayAdapter ABC for any proxy/gateway (rough sketch below)
- Memory graph for tracking what was known when
- Works with OpenAI, Anthropic, Ollama, Groq, OpenRouter, Together
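If you want to hook Tinman up to a different gateway, a custom adapter looks roughly like this. Method names and the WebSocket handling are simplified for illustration; treat it as the shape of the extension point rather than the real interface.

```python
# Simplified sketch of a custom gateway adapter. Method names are illustrative,
# not the actual GatewayAdapter interface.
import asyncio
from abc import ABC, abstractmethod

import websockets  # third-party: pip install websockets


class GatewayAdapter(ABC):
    """Connects Tinman to a proxy/gateway and yields traffic for analysis."""

    @abstractmethod
    async def events(self):
        """Yield request/response events as they flow through the gateway."""
        ...


class WebSocketAdapter(GatewayAdapter):
    """Hypothetical adapter that tails a gateway's WebSocket stream."""

    def __init__(self, url: str):
        self.url = url

    async def events(self):
        async with websockets.connect(self.url) as ws:
            async for raw in ws:
                yield raw  # hand each frame to the analysis loop


async def main():
    adapter = WebSocketAdapter("ws://localhost:8080/stream")  # placeholder URL
    async for event in adapter.events():
        print("observed:", event)


if __name__ == "__main__":
    asyncio.run(main())
```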
pip install AgentTinman
tinman init && tinman tui
GitHub: https://github.com/oliveskin/Agent-Tinman
Docs: https://oliveskin.github.io/Agent-Tinman/
OpenClaw adapter: https://github.com/oliveskin/tinman-openclaw-eval

Apache 2.0. No telemetry, no paid tier. Feedback and contributions welcome.