Hey HN, I released SigmaEval, a Python framework to evaluate GenAI applications.
The non-deterministic outputs of LLM-based apps don’t fit pass/fail tests, so teams often ship without confidence. SigmaEval aims to solve this by adopting a statistical evaluation approach, similar to that used in clinical trials. It supports statements such as: “We are 95% confident that our AI will resolve at least 90% of user issues with a quality score of 8/10 or higher.”
It works in three steps:
- Define “good”: You describe the test scenario and desired outcome in plain English (e.g., “when a new user asks about the bot’s capabilities” -> “then the bot lists its main functions”).
- Simulate: An AI user simulator exercises your app repeatedly, switching styles (polite, impatient, verbose) to build a diverse conversation set.
- Judge & analyze: An AI judge scores each conversation against your definition of success. SigmaEval runs binomial and bootstrap tests to decide whether you meet your quality bar at a chosen confidence level.
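To make the “Judge & analyze” step concrete, here is a minimal stdlib-only sketch of the two statistics it relies on (this is illustrative, not SigmaEval’s actual API; the counts and scores are made up): a one-sided binomial test for a pass-rate bar, and a bootstrap percentile interval for the mean judge score.

```python
import math
import random

def binom_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): one-sided p-value for H1: rate > p."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# H0: true pass rate <= 0.90. Suppose 59 of 60 simulated conversations
# scored 8/10 or higher with the judge.
pval = binom_tail(59, 60, 0.90)
print(f"p-value: {pval:.4f}")  # ~0.0138 < 0.05: the 90% bar is met at 95% confidence

# Bootstrap 95% percentile interval for the mean judge score (toy scores).
rng = random.Random(0)
scores = [9, 8, 10, 7, 9, 8, 8, 9, 10, 8, 7, 9]
boot = sorted(
    sum(rng.choices(scores, k=len(scores))) / len(scores) for _ in range(10_000)
)
lo, hi = boot[249], boot[9749]  # 2.5th and 97.5th percentiles
print(f"mean score 95% CI: [{lo:.2f}, {hi:.2f}]")
```

The one-sided test answers “is the true pass rate above the bar?”, while the bootstrap interval quantifies uncertainty in the judge scores without assuming a distribution.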
SigmaEval is agnostic to both LLM provider and testing framework.
Open source (Apache 2.0).
GitHub: https://github.com/Itura-AI/sigmaeval
PyPI: https://pypi.org/project/sigmaeval-framework/
I’m the creator and happy to answer questions.