Why
Agent runs are stochastic and tool calls fail, which makes failures hard to reproduce, measure, and fix at scale. It's also hard to align agent behavior with goals across output quality/format, cost, and latency. We need a loop that feeds user feedback and LLM evaluators directly back into the agent code (prompts, configs, models, graphs) without overfitting.
How
- Simulation: LLM personas, mocked MCP servers/tools, and synthetic data; can condition on real traces
- Evaluation: code-based and LLM-based evaluators; turn human reviews into optimization-ready benchmarks
- Optimization with Maestro: tune prompts, configs, and even the agent graph for better quality, cost, and latency
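To make the loop above concrete, here is a minimal, self-contained sketch of simulate -> evaluate -> optimize. All names below (run_agent, code_evaluator, llm_evaluator, the persona strings) are illustrative stand-ins, not the relai SDK API; the model and judge calls are mocked so the snippet runs as-is.

import random
import time
from dataclasses import dataclass

@dataclass
class Trace:
    prompt: str
    persona: str
    output: str
    latency_s: float
    cost_usd: float

def run_agent(prompt: str, persona: str) -> Trace:
    # Stand-in for one simulated agent run against an LLM persona (mocked model call).
    start = time.time()
    output = f"[reply to '{persona}' under prompt '{prompt[:20]}...']"
    return Trace(prompt, persona, output, time.time() - start, cost_usd=0.002)

def code_evaluator(trace: Trace) -> float:
    # Code-based check, e.g. output format or required fields.
    return 1.0 if trace.output.startswith("[") else 0.0

def llm_evaluator(trace: Trace) -> float:
    # Stand-in for an LLM judge scoring quality against a task rubric (mocked).
    return random.uniform(0.6, 1.0)

def evaluate(prompt: str, personas: list[str]) -> dict:
    # Run the candidate prompt across simulated personas and aggregate quality/cost/latency.
    traces = [run_agent(prompt, p) for p in personas]
    return {
        "quality": sum(0.5 * code_evaluator(t) + 0.5 * llm_evaluator(t) for t in traces) / len(traces),
        "cost": sum(t.cost_usd for t in traces),
        "latency": max(t.latency_s for t in traces),
    }

if __name__ == "__main__":
    personas = ["impatient power user", "first-time user", "adversarial tester"]
    candidates = [
        "You are a support agent. Answer concisely.",
        "You are a support agent. Ask a clarifying question before answering.",
    ]
    # Pick the candidate with the best quality score; a real optimizer would also
    # trade off cost and latency, and could propose graph-level changes, not just prompts.
    scores = {c: evaluate(c, personas) for c in candidates}
    best = max(scores, key=lambda c: scores[c]["quality"])
    print("best prompt:", best, scores[best])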
Try it
pip install relai
GitHub: https://github.com/relai-ai/relai-sdk
Docs: https://docs.relai.ai/ (2-min overview: https://youtu.be/qKsJUD_KP40)
Looking for feedback on
- Where graph-level suggestions help (beyond prompt tuning)
- Evaluator signals you rely on for reliability (and what we're missing)
- Simulation setups/environments you'd want out of the box
Notes
Founder here. Happy to share internals, tradeoffs, and limitations.
Works with LangGraph / OpenAI Agents / Google ADK / etc. The SDK is Apache-2.0 licensed.