Why we built this: We saw teams repeatedly struggle with testing: scattered test cases, unclear or inconsistent metrics, and a lot of manual effort that still missed obvious failures before production. Most tools assume a single developer runs evals alone; in practice, testing tends to involve PMs, domain experts, QA, and engineers. We built Rhesis to make that collaboration straightforward.
What it does: Rhesis is a self-hostable platform (with UI) where teams can create, run, and review tests for conversational AI systems. A few core ideas:
- Test generation: Create and run tests for single turns or full conversations; the platform can also assist with generating both single- and multi-turn scenarios using your domain context.
- Domain context / knowledge: Provide background material to guide test creation so you’re not starting from an empty prompt.
- Collaboration tools: Non-technical teammates can write test cases, leave comments, and review results; developers can dig into failures with detailed traces and outputs.
- Unified metrics: Bring in eval metrics from DeepEval, RAGAS, and similar OSS frameworks without re-implementing them (a small sketch of what we mean follows this list).
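To make the unified-metrics point concrete, here's roughly the kind of metric we pull in, shown with DeepEval's own public API (this is plain DeepEval, nothing Rhesis-specific, and it needs an LLM judge configured, e.g. an OpenAI key):

    # Plain DeepEval example of the kind of metric we integrate rather than re-implement.
    # Requires an LLM judge; DeepEval uses OpenAI by default (set OPENAI_API_KEY).
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
    )

    metric = AnswerRelevancyMetric(threshold=0.7)
    metric.measure(test_case)           # runs the LLM-as-judge evaluation
    print(metric.score, metric.reason)  # 0-1 score plus an explanation

The point of the integration is that teams can review these scores together with the rest of their test results instead of re-running such metrics in separate scripts.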
Current state: Still early. We shipped v0.4.2 last week with a zero-config Docker setup. Core flows work, but there are rough edges. Everything is MIT-licensed; an enterprise edition will come later, but the OSS core will remain free. We’re currently focused on conversational applications because that’s where we saw the biggest pain in evaluation and QA workflows.
Links:
- App: app.rhesis.ai
- GitHub: github.com/rhesis-ai/rhesis
- Docs: docs.rhesis.ai
Happy to hear your thoughts and to answer any questions about platform design, the architecture, or our thinking on collaborative testing workflows.