The platform I built, FluentLogic.org, is live in beta and serving real families. I'm a high school teacher with a physics and philosophy background and no software engineering experience. Over 10 months I built roughly 350,000 lines of production TypeScript, written entirely with AI assistance. I don't know TS from JS; I know assembler and C++.
No matter how many times I asked the model to audit the same piece of code, I kept finding the same categories—until I forced a completely different angle. New class of bugs. Then a plateau. New angle. New class. Plateau again.
Before this, I tried the obvious: firing hundreds of varied prompts, changing phrasing, and hoping coverage would emerge from volume. I spent hundreds of dollars on this shotgun approach. It doesn’t work. You’re just sampling the same semantic neighborhood from slightly different entry points. Shotgun auditing is same-axis repetition with extra noise.
The fix is almost embarrassingly simple: add one word to your audit prompts—"orthogonal."
Instead of:
"Find bugs in this code" [or any target surface]
Try:
"Audit this surface from the most orthogonal direction to what you just found." (Then fix the bugs, rotate the axis, and repeat until you hit the P2 floor, defined below.)
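To make the loop concrete, here is a minimal TypeScript sketch of the rotation step. Everything in it is illustrative: `Finding`, `nextPrompt`, and `rotate` are names I am inventing, and `runAudit` stands in for whatever LLM call you actually use.

```typescript
// Illustrative sketch only: Finding, nextPrompt, and rotate are invented
// names, and runAudit stands in for whatever LLM call you actually use.
type Finding = { title: string; severity: "P0" | "P1" | "P2" };

// First wave uses the plain probe; every later wave explicitly asks for
// the axis most orthogonal to everything found so far.
function nextPrompt(previous: Finding[]): string {
  if (previous.length === 0) return "Find bugs in this code.";
  const seen = previous.map((f) => f.title).join("; ");
  return (
    "Audit this surface from the most orthogonal direction to what you just found. " +
    `Already-found issues, do not repeat: ${seen}`
  );
}

// Run a fixed number of rotated waves, accumulating findings. The
// fix-between-waves step happens outside this sketch.
function rotate(runAudit: (prompt: string) => Finding[], waves: number): Finding[] {
  const all: Finding[] = [];
  for (let i = 0; i < waves; i++) {
    all.push(...runAudit(nextPrompt(all)));
  }
  return all;
}
```

In practice `runAudit` would wrap your model call and parse its findings; the only point here is that the prompt for wave n is a function of what waves 1 through n-1 already surfaced.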
The models aren't broken. When you ask the same model that generated your code to audit it, you’re sending the auditor back into the same semantic compression manifold the generator already exhausted. Same manifold = same blind spot. I call this Generator-Auditor Symmetry (GAS).
"Orthogonal" routes the model through a genuinely different neighborhood, producing non-overlapping findings consistently.
The P2 Floor: When your false-positive rate crosses ~40% on two consecutive fresh-axis waves with zero new critical bugs, the surface is clean. The FP rate acts as an entropy meter.
Rotation > Diversity: Rotating a single model across 3 orthogonal axes outperformed using 3 different models on the same axis.
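The P2-floor stopping rule is mechanical enough to write down. Here is a sketch, assuming each wave is triaged into a small stats record; the `WaveStats` shape is my own assumption, while the thresholds are the ones stated above.

```typescript
// One wave of audit results, summarized after triage. This record shape
// is an assumption for illustration, not a format from the paper.
type WaveStats = { findings: number; falsePositives: number; newCriticals: number };

// P2 floor per the rule above: FP rate crosses ~40% on two consecutive
// fresh-axis waves while neither wave surfaces a new critical bug.
function hitP2Floor(waves: WaveStats[], fpThreshold = 0.4): boolean {
  if (waves.length < 2) return false;
  return waves.slice(-2).every((w) => {
    // An empty wave is treated as pure noise (FP rate of 1).
    const fpRate = w.findings === 0 ? 1 : w.falsePositives / w.findings;
    return fpRate >= fpThreshold && w.newCriticals === 0;
  });
}
```

The deliberate design choice is that a single new critical bug resets the streak: the floor is declared only when two independent fresh axes both come back mostly noise.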
The Scale of the Test:
Earlier this week, I ran a 36-hour marathon audit across 150+ surfaces.
Yield: 60+ P0 bugs fixed and ~150 P1 bugs catalogued (e.g., OAuth sentinel bypasses, silent cache-invalidation race conditions). Each was invisible to the other probe axes.
Same-axis repetition plateaus at ~20% bug-class discovery yield, while orthogonal rotation hits ~80% (a 4–5× advantage). I took the full 350K-line codebase to a systemic P2 floor, and the app is perceptibly faster for it.
I wrote a short paper formalizing the method and the supporting topological observations. To check this wasn’t just a prompting trick, I ran persistent homology (Vietoris-Rips on Gemini semantic embeddings of 58 production bug classes). It revealed 20 significant β₁ interior loops—evidence that the bug classes form a geometric structure in semantic space that same-axis probing structurally cannot exhaust.
One more regularity I formalized: Confidence-Coverage Divergence (CCD). Same-axis repetition decreases output entropy (rising false certainty) while bug-class coverage stays flat.
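One way to put a number on CCD (my own sketch, not a definition from the paper): per wave, compute the Shannon entropy of the finding-category distribution and the cumulative count of distinct bug classes. Falling entropy alongside a flat coverage curve is the divergence.

```typescript
// Shannon entropy (in bits) of a list of category labels. Low entropy
// means a wave's findings are concentrated in a few categories.
function entropyBits(categories: string[]): number {
  const counts = new Map<string, number>();
  for (const c of categories) counts.set(c, (counts.get(c) ?? 0) + 1);
  let h = 0;
  for (const n of counts.values()) {
    const p = n / categories.length;
    h -= p * Math.log2(p);
  }
  return h;
}

// Coverage curve: cumulative number of distinct bug classes after each wave.
function coverage(waves: string[][]): number[] {
  const seen = new Set<string>();
  return waves.map((wave) => {
    wave.forEach((c) => seen.add(c));
    return seen.size;
  });
}
```

Plotting `entropyBits` per wave against `coverage` makes the divergence visible: same-axis waves should show the first curve dropping while the second one stays flat.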
Preprint (Zenodo): https://doi.org/10.5281/zenodo.19223166
This is a single real-world codebase, not a controlled experiment. The survival curves are strong evidence, not final proof.
What I’m genuinely curious about:
Has anyone else seen meaningfully better LLM bug detection by rotating audit axes?
Does Confidence-Coverage Divergence (CCD) appear in LLM evaluation loops (RLHF, Constitutional AI)?
What does the survival curve look like on a codebase you didn’t build yourself?
(19-year Ontario teacher · M.A. and B.A. in Philosophy · B.Sc. in Physics. Built this for real families.)