What it does:
Parallel Execution: Send a single prompt to OpenAI, Anthropic, Mistral, and Gemini simultaneously, then compare the outputs, latency, and exact token usage side by side.
Batch Evaluations: Upload a CSV dataset to run bulk tests across multiple models at once.
Manual Diagnostics: Grade outputs manually (Pass/Fail) and assign diagnostic tags (e.g., Hallucination, Format Error) to build a human-verified performance leaderboard.
Local-first: API keys are encrypted via your OS keyring, history is stored in a local SQLite DB, and there is no telemetry.
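
For anyone curious what the parallel execution idea looks like in practice, here is a minimal sketch. It is not the tool's actual code. It assumes each provider is wrapped in an async callable that takes a prompt and returns text. The names `make_fake_provider`, `timed_call`, `fan_out`, and `run_demo` are hypothetical, and the fake providers only sleep and echo instead of calling real APIs.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, List

ProviderCall = Callable[[str], Awaitable[str]]


@dataclass
class RunResult:
    provider: str
    output: str
    latency_s: float


def make_fake_provider(name: str, delay_s: float) -> ProviderCall:
    """Hypothetical stand-in for a real SDK call: sleeps, then echoes the prompt."""
    async def call(prompt: str) -> str:
        await asyncio.sleep(delay_s)
        return f"[{name}] reply to: {prompt}"
    return call


async def timed_call(name: str, call: ProviderCall, prompt: str) -> RunResult:
    """Await one provider and record its wall-clock latency."""
    start = time.perf_counter()
    output = await call(prompt)
    return RunResult(name, output, time.perf_counter() - start)


async def fan_out(prompt: str, providers: Dict[str, ProviderCall]) -> List[RunResult]:
    """Send the same prompt to every provider concurrently."""
    tasks = [timed_call(name, call, prompt) for name, call in providers.items()]
    # gather() returns results in the same order the awaitables were passed in.
    return await asyncio.gather(*tasks)


# Assumed provider set and made-up delays, for illustration only.
PROVIDERS: Dict[str, ProviderCall] = {
    "openai": make_fake_provider("openai", 0.05),
    "anthropic": make_fake_provider("anthropic", 0.03),
    "mistral": make_fake_provider("mistral", 0.04),
    "gemini": make_fake_provider("gemini", 0.02),
}


def run_demo(prompt: str = "Hello") -> List[RunResult]:
    return asyncio.run(fan_out(prompt, PROVIDERS))


if __name__ == "__main__":
    for result in run_demo():
        print(f"{result.provider:10s} {result.latency_s:.3f}s  {result.output}")
```

Since the calls run concurrently, total wall time is roughly that of the slowest provider rather than the sum of all four, which is what makes a side-by-side latency comparison cheap to produce.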
I’m looking for technical feedback. What do you think current LLM testing/evaluation tools get most wrong?