The problem we’re targeting: AI evaluation today is mostly hype, cherry-picked benchmarks, and inconsistent model cards. It’s hard to reason about risk, uncertainty, and missing information before deploying or buying a model.
What Zeus does (MVP v0.1):
- Takes a minimal description of an AI model or AI-powered tool
- Generates standardized ModelCard-style metadata
- Runs a structured multi-expert analysis (performance, safety, systems, UX, innovation)
- Forces explicit disagreement where evidence conflicts
- Scores categories based only on disclosed evidence
- Outputs a threat/misuse model and improvement roadmap
- Produces deterministic, machine-readable JSON (a rough sketch of the shape follows below)
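To make the output concrete, here is a minimal Python sketch of the kind of JSON a report might look like. The field names and values are assumptions for illustration, not the actual schema; the point is that every claim carries its evidence, conflicts are recorded explicitly, and anything not disclosed stays "unknown".

```python
import json

# Hypothetical report shape (field names are illustrative assumptions).
report = {
    "model_card": {
        "name": "example-model",
        "provider": "unknown",          # not disclosed -> explicitly unknown
        "intended_use": "text summarization",
        "training_data": "unknown",
    },
    "expert_analysis": {
        "performance": {"score": "unknown", "evidence": []},
        "safety": {"score": 2, "evidence": ["vendor red-team summary"]},
        "systems": {"score": 3, "evidence": ["published latency figures"]},
    },
    "disagreements": [
        {
            "topic": "robustness",
            "positions": ["safety: insufficient evidence", "performance: adequate"],
        }
    ],
    "threat_model": ["prompt injection", "data exfiltration via tool use"],
    "roadmap": ["disclose training data provenance", "publish eval protocol"],
}

# Deterministic serialization: stable key order, no timestamps or randomness.
print(json.dumps(report, sort_keys=True, indent=2))
```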
Constraints:
- No model execution
- No benchmarks
- No rankings
- Missing info is explicitly marked as “unknown” (see the sketch after this list)
- No assumptions or fabricated facts
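A minimal sketch of how the "no fabrication" rule could be enforced, assuming a guard like the hypothetical helper below: a category only receives a numeric score when at least one piece of disclosed evidence backs it, otherwise it stays "unknown".

```python
from typing import List, Union

def score_category(evidence: List[str], raw_score: int) -> Union[int, str]:
    """Hypothetical guard: return a numeric score only when disclosed
    evidence exists; otherwise mark the category as "unknown"."""
    return raw_score if evidence else "unknown"

print(score_category([], 4))                        # -> unknown
print(score_category(["vendor eval appendix"], 4))  # -> 4
```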
Think of it as a conservative due-diligence engine, not a judge of “best models.”
Questions we’re trying to answer before going further:
- Is evaluation without execution still useful?
- Does forced disagreement increase or decrease trust?
- Where would this actually fit in real workflows?
Brutal criticism welcome.