I built SemanticTest while working on calendar0.app (an AI calendar assistant).
While building it, I noticed a lack of good AI eval frameworks for testing my agent.
SemanticTest uses GPT-4 as a judge to evaluate:
- Text responses (semantic meaning)
- Tool calls (correct tools, right order)
- Multi-turn conversations
It's composable: you build tests as JSON pipelines using custom blocks.
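For a rough idea, here's a minimal sketch of a pipeline (block names and fields are illustrative, not the exact schema):

```jsonc
{
  // Illustrative only: block names and fields are simplified for this example.
  "name": "schedule-event",
  "steps": [
    { "block": "user_message", "text": "Book lunch with Sam tomorrow at noon" },
    { "block": "expect_tool_call", "tool": "create_event" },
    { "block": "judge_response", "criteria": "Confirms the time and attendee of the new event" }
  ]
}
```

The idea is that the same blocks (messages, tool-call checks, judged responses) can be recombined across different test flows.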
Would love feedback. Thank you!