How it works: (1) Paste any AI response. (2) Extractor identifies factual claims — names, dates, numbers, citations. (3) Each claim gets searched independently via HTTP. (4) Comparator checks search evidence against claims. (5) Reporter scores overall credibility.
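To give a sense of the shape, here is a minimal sketch of that flow. Every name below is an illustrative placeholder, not the project's actual modules or functions.

```python
# Sketch of the five-step flow; all names are placeholders, not the repo's API.
from dataclasses import dataclass

@dataclass
class Verdict:
    claim: str
    supported: bool
    note: str

def extract_claims(text: str) -> list[str]:
    # Placeholder: the real extractor asks Claude for names, dates, numbers, citations.
    return [line.strip() for line in text.splitlines() if line.strip()]

def search_claim(claim: str) -> str:
    # Placeholder: the real step runs an independent HTTP search per claim.
    return ""

def compare_evidence(claim: str, evidence: str) -> Verdict:
    # Placeholder: the real comparator checks whether the evidence supports the claim.
    note = "supported" if evidence else "no evidence found"
    return Verdict(claim, supported=bool(evidence), note=note)

def score_report(verdicts: list[Verdict]) -> float:
    # Placeholder credibility score: fraction of claims with supporting evidence.
    return sum(v.supported for v in verdicts) / max(len(verdicts), 1)

def check(ai_response: str) -> float:
    verdicts = [compare_evidence(c, search_claim(c)) for c in extract_claims(ai_response)]
    return score_report(verdicts)
```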
7 Python modules, ~27KB total. Uses Claude API for extraction/comparison and direct search for verification. Streamlit web UI with color-coded cards per claim.
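The extraction step presumably goes through the Anthropic Messages API. A hedged sketch of what that call could look like; the prompt, the JSON convention, and the parsing are my assumptions, not the repo's code.

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def extract_claims(ai_response: str) -> list[str]:
    # Ask the model to list discrete factual claims as a JSON array of strings.
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "List every factual claim (names, dates, numbers, citations) "
                       "in the following text as a JSON array of strings:\n\n" + ai_response,
        }],
    )
    return json.loads(msg.content[0].text)
```

The comparison step would be a similar call, just with the claim and its search evidence in the prompt.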
The thesis: Hallucination is an architecture problem, not a scale problem. LLMs compute argmax over P(next token | context), i.e. the most likely continuation, not P(true). More parameters refine the guess, but "most likely" ≠ "most true." So instead of making the guesser better, add an independent verification layer that runs on logic, not statistics.
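A toy illustration of that point, with made-up numbers:

```python
# Toy next-token distribution for "The treaty was signed in ____".
# Suppose the true year is 1915 but the training data skews toward 1912.
p_next = {"1912": 0.46, "1915": 0.38, "1920": 0.16}

guess = max(p_next, key=p_next.get)  # argmax P(token | context) -> "1912"
# A bigger model may sharpen these probabilities, but the decision rule stays
# argmax, so the output tracks "most likely in training data", not "true".
# That is why the verification layer sits outside the generator.
```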
The meta-irony: During code review, I had Claude write the code and Gemini review it. Gemini flagged claude-sonnet-4-20250514 as a "fictional model" and issued a critical blocking warning. The model is real; Gemini's training cutoff made it hallucinate about a model name while reviewing a hallucination detector. Then Claude summarized that "all three AIs approved" when only two were involved. A human caught both errors with one sentence each.
Built on a 32KB deductive reasoning engine (9 axioms, fractal-verified across 6 relationship scales). Also open source.
Detector: https://github.com/ZhangXiaowenOpen/hallucination-detector
All projects: https://github.com/ZhangXiaowenOpen
MIT + Heart Clause license. Solo dev + AI collaboration. Happy to answer questions about the architecture or why deductive verification will outlast RAG-based approaches.