Why we built this: We saw teams repeatedly struggle with testing: scattered test cases, unclear or inconsistent metrics, and a lot of manual effort that still missed obvious failures before production. Most tools assume a single developer runs evals alone; in practice, testing tends to involve PMs, domain experts, QA, and engineers. We built Rhesis to make that collaboration straightforward.
What it does: Rhesis is a self-hostable platform (with UI) where teams can create, run, and review tests for conversational AI systems. A few core ideas:
- Test generation: Create and run tests for single turns or full conversations; the platform can also assist with generating both single- and multi-turn scenarios using your domain context.
- Domain context / knowledge: Provide background material to guide test creation so you’re not starting from an empty prompt.
- Collaboration tools: Non-technical teammates can write test cases, leave comments, and review results; developers can dig into failures with detailed traces and outputs.
- Unified metrics: Bring in eval metrics from DeepEval, RAGAS, and similar OSS frameworks without re-implementing them (a small sketch of what we mean follows this list).
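To make the unified-metrics point concrete, here's roughly the kind of metric we pull in, shown with DeepEval's own public API (this is plain DeepEval, nothing Rhesis-specific, and it needs an LLM judge configured, e.g. an OpenAI key):

    # Plain DeepEval example of the kind of metric we integrate rather than re-implement.
    # Requires an LLM judge; DeepEval uses OpenAI by default (set OPENAI_API_KEY).
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
    )

    metric = AnswerRelevancyMetric(threshold=0.7)
    metric.measure(test_case)           # runs the LLM-as-judge evaluation
    print(metric.score, metric.reason)  # 0-1 score plus an explanation

The point of the integration is that teams can review these scores together with the rest of their test results instead of re-running such metrics in separate scripts.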
Current state: Still early. We shipped v0.4.2 last week with a zero-config Docker setup. Core flows work, but there are rough edges. Everything is MIT-licensed; an enterprise edition will come later, but the OSS core will remain free. We’re currently focused on conversational applications because that’s where we saw the biggest pain in evaluation and QA workflows.
Links:
- App: app.rhesis.ai
- GitHub: github.com/rhesis-ai/rhesis
- Docs: docs.rhesis.ai
Happy to hear your thoughts and to answer any questions about platform design, the architecture, or our thinking on collaborative testing workflows.