Show HN: Assay – Found 250 bugs in LiteLLM, LobeChat via AI code verification

https://github.com/gtsbahamas/hallucination-reversing-system

2•tywellshn•2h ago

Comments

tywellshn•2h ago

Hi HN, I'm Ty. I built Assay because I kept shipping bugs that my AI coding assistant hallucinated into existence.

Three independent papers have proven that LLM hallucination is mathematically inevitable (Xu et al. 2024, Banerjee et al. 2024, Karpowicz 2025). You can't train it away. You can't prompt it away. So I built a verification layer instead.

How it works: Assay extracts every implicit claim the code makes (e.g., "this function handles null input," "this query is injection-safe"), then verifies each one in two stages: first an adversarial LLM pass, then a deterministic formal verifier that can override the LLM's verdict.
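
To make the two-stage flow concrete, here is a minimal sketch of the override rule, not Assay's actual code. The names `Claim`, `verify`, `llm_judge`, and `formal_check` are hypothetical, and I'm assuming the formal stage returns nothing when no deterministic rule applies to a claim.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Claim:
    """One implicit claim extracted from code (hypothetical shape)."""
    text: str   # e.g. "this query is injection-safe"
    code: str   # the snippet the claim is about


def verify(
    claim: Claim,
    llm_judge: Callable[[Claim], str],
    formal_check: Callable[[Claim], Optional[str]],
) -> str:
    """Return "PASS" or "FAIL" for a claim.

    The LLM judge always produces a verdict. The formal check returns
    "PASS"/"FAIL" when a deterministic rule applies, or None when it has
    no opinion. A formal verdict, when present, wins over the LLM.
    """
    llm_verdict = llm_judge(claim)
    formal_verdict = formal_check(claim)
    return formal_verdict if formal_verdict is not None else llm_verdict
```

The design point is just that the deterministic stage gets the last word whenever it has an opinion, and the LLM verdict is the fallback.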

We ran it on 4 popular open-source projects. Live results:

- LiteLLM (18K stars): 1,381 claims, 185 bugs, 30 critical — https://tryassay.ai/reports/0bccf817-1cb6-43ff-b724-866f1453...
- Chatbot UI (28K stars): 476 claims, 41 bugs, 12 critical — https://tryassay.ai/reports/cc8c0c61-9b5a-4774-aed1-f99cc4f6...
- LobeChat (50K stars): 205 claims, 14 bugs, 1 critical — https://tryassay.ai/reports/915dfc1a-64ec-483d-b4b5-effb53a8...
- Open Interpreter (55K stars): 12 claims, 4 bugs, 2 critical — https://tryassay.ai/reports/347aa2bb-4249-468a-a835-12da3472...

"But can't the verifier hallucinate too?" Yes. That's why we added a formal verifier underneath — pure regex/pattern-matching, no LLM, can't hallucinate. On its first production call, the LLM judge said PASS on code with SQL injection. The formal verifier overrode it to FAIL.

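For a feel of what a pure pattern-matching check can look like, here is a rough sketch. It is not Assay's real rule set: the patterns and the name `formal_sql_check` are made up for illustration, and they only catch a few obvious ways of building SQL strings inside an `execute(...)` call.

```python
import re
from typing import Optional

# Illustrative patterns only (an assumption, not Assay's rules):
# each flags SQL text assembled from variables inside execute(...).
SQL_INJECTION_PATTERNS = [
    # f-string passed directly to execute
    re.compile(r"""execute\(\s*f["']"""),
    # string literal followed by + concatenation or % formatting
    re.compile(r"""execute\(\s*["'][^"']*["']\s*(\+|%)"""),
    # string literal followed by .format(
    re.compile(r"""execute\(\s*["'][^"']*["']\s*\.format\("""),
]


def formal_sql_check(code: str) -> Optional[str]:
    """Return "FAIL" if any pattern matches, else None (no opinion)."""
    for pattern in SQL_INJECTION_PATTERNS:
        if pattern.search(code):
            return "FAIL"
    return None
```

A parameterized call such as `cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))` matches none of these patterns, so the check stays silent and the LLM verdict stands. A check this simple will miss plenty (for example, a query built on an earlier line), but it is deterministic.
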
Benchmarks (validated against real test suites, not LLM judgment):

- HumanEval: 86.6% baseline to 100% pass@5 with Assay (164/164 problems)
- SWE-bench: 18.3% baseline to 30.3% with Assay (+65.5% relative)

Try it:

  npx tryassay assess /path/to/your/project

npm: https://www.npmjs.com/package/tryassay

Paper: https://doi.org/10.5281/zenodo.18522644

Drop a repo link in the comments and I'll run it for free.
