In our experience with large agentic systems, prompt engineering or auto-prompt improvement tools can take accuracy from 0 to 50%, but pushing accuracy toward 100% required working with domain experts. For example, in a legal AI agent, lawyers are needed because law is complex and lawyers have far deeper context than non-lawyers.
Other evaluation tools on the market focus on the developer's experience; we focus on making it as easy as possible for your domain experts to improve agentic systems on their own. Each agent action automatically routes a feedback request to the appropriate domain expert; once they respond, the system pinpoints the responsible agent and applies the necessary change, which the expert can then test.
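To make that loop concrete, here's a minimal sketch of what routing and attribution could look like. This is illustrative only, not our actual API; every name here (AgentAction, EXPERTS, apply_feedback, the prompt-patching strategy) is a hypothetical stand-in:

```python
from dataclasses import dataclass

@dataclass
class AgentAction:
    agent_id: str    # which agent in the pipeline produced this step
    domain: str      # e.g. "contract-law"
    output: str      # what the agent produced

@dataclass
class ExpertFeedback:
    action: AgentAction
    correction: str  # what the expert says the output should have been

# Hypothetical registry mapping domains to expert reviewers.
EXPERTS = {"contract-law": "jane@lawfirm.example"}

def route_feedback_request(action: AgentAction) -> str:
    """Route each agent action to the domain expert who should review it."""
    return EXPERTS[action.domain]

def apply_feedback(feedback: ExpertFeedback, prompts: dict[str, str]) -> dict[str, str]:
    """Pinpoint the responsible agent and patch its prompt with the expert's
    correction, returning a candidate prompt set the expert can re-test."""
    patched = dict(prompts)
    patched[feedback.action.agent_id] += f"\n# Expert correction: {feedback.correction}"
    return patched
```

The candidate prompts would then be re-run against the recorded trace so the expert can verify the fix before it ships.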
We’re borrowing our OSS business model from Supabase, which makes it easy to self-host, reserves some features for enterprise, and offers a paid managed cloud service. Right now, all of our code is available under a permissive license (MIT).
We’re admittedly early, and many features are in the process of being built. We would really appreciate a star and feedback on how we can make it useful to you. Thanks!