Confidence-Coverage Divergence (CCD): same-axis repetition decreases output entropy (rising false certainty) while bug-class coverage stays flat.
P2 Floor: when the false-positive rate crosses ~40% on two consecutive fresh-axis waves with zero new critical bugs, the surface is clean; the FP rate acts as an entropy meter.
Rotation > Diversity: rotating a single model across 3 orthogonal axes outperformed using 3 different models on the same axis.
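The P2-floor stopping rule is mechanical enough to write down. A minimal sketch follows; the wave records, field names, and example data are invented for illustration, not my actual tooling:

```python
# Hedged sketch of the P2-floor stopping rule: stop when `consecutive`
# fresh-axis waves in a row show an FP rate above `fp_threshold` and
# surface zero new critical bugs.

def p2_floor_reached(waves, fp_threshold=0.40, consecutive=2):
    streak = 0
    for wave in waves:
        fp_rate = wave["false_positives"] / max(wave["findings"], 1)
        if wave["fresh_axis"] and fp_rate > fp_threshold and wave["new_criticals"] == 0:
            streak += 1
            if streak >= consecutive:
                return True
        else:
            streak = 0  # any productive wave resets the floor check
    return False

# Invented example: first wave still finds criticals, last two cross the floor.
waves = [
    {"fresh_axis": True, "findings": 10, "false_positives": 2, "new_criticals": 3},
    {"fresh_axis": True, "findings": 10, "false_positives": 5, "new_criticals": 0},
    {"fresh_axis": True, "findings": 8,  "false_positives": 4, "new_criticals": 0},
]
print(p2_floor_reached(waves))  # True
```

The reset branch matters: a single wave that still yields criticals should restart the count, otherwise you declare a floor while the surface is still producing.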
Scale of the test: earlier this week I ran a 36-hour marathon audit across 150+ surfaces, taking the full 350K-line codebase to a systemic P2 floor. Yield: 60+ P0 bugs fixed and ~150 P1 bugs catalogued (e.g., OAuth sentinel bypasses, silent cache-invalidation race conditions), each invisible to the other probe axes. Same-axis repetition plateaued at ~20% bug-class discovery yield, while orthogonal rotation reached ~80%: a 4-5x advantage. The app is perceptibly faster afterward.

I wrote a short paper formalizing the method and the supporting topological observations. To verify this wasn't just a prompting trick, I ran persistent homology (Vietoris-Rips on Gemini semantic embeddings of 58 production bug classes). It revealed 20 significant β₁ interior loops: evidence that the bug classes form geometric structure in semantic space that same-axis probing structurally cannot exhaust. Preprint (Zenodo): https://doi.org/10.5281/zenodo.19223166

Caveats: this is a single real-world codebase, not a controlled experiment. The survival curves are strong evidence, not final proof.

What I'm genuinely curious about:
Has anyone else seen meaningfully better LLM bug detection by rotating audit axes?
Does Confidence-Coverage Divergence (CCD) appear in LLM evaluation loops (RLHF, Constitutional AI)?
What does the survival curve look like on a codebase you didn't build yourself?
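For concreteness, by the coverage side of the curve I mean cumulative distinct bug classes discovered per wave. A toy sketch, with invented wave data (only the plateau-vs-rotation shape mirrors my numbers):

```python
# Cumulative fraction of distinct bug classes discovered after each wave.
# Same-axis waves keep re-finding the same classes; rotated waves add new ones.

def coverage_curve(waves, total_classes):
    seen, curve = set(), []
    for wave in waves:
        seen.update(wave)
        curve.append(len(seen) / total_classes)
    return curve

same_axis = [{"sqli", "xss"}, {"sqli"}, {"xss"}]             # repeats, plateaus
rotated   = [{"sqli", "xss"}, {"race", "cache"}, {"authz"}]  # fresh axes

print(coverage_curve(same_axis, 10))  # [0.2, 0.2, 0.2]
print(coverage_curve(rotated, 10))    # [0.2, 0.4, 0.5]
```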
(19-year Ontario teacher | M.A., B.A. Philosophy · B.Sc. Physics. Built this for real families.)
chunpaiyang•1h ago
But you know, engineers bullshit each other all the time too. The difference is we have a way to verify it: logical chains. You have to build an argument that holds up before anyone buys into it.
So I thought: can I make the AI build its own logical chain? Let it pass its own logic check before telling me the result.
That's how I created my own "think" skill. It's based on Meta's CoT paper: https://arxiv.org/abs/2501.04682
It roughly works like this: 1. FRAME - Challenge the question itself, hidden assumptions.
2. GROUND - Map what you know, what you need, what's missing.
3. ASSOCIATE - Launch multiple independent agents in parallel to generate hypotheses, avoid anchoring bias.
4. VERIFY - Break each hypothesis into atomic claims, verify each independently.
5. CHAIN - Build a logical chain from the survivors.
6. PROVE and LOOP - Walk backwards from conclusion to premises, search for evidence, repair if broken.
7. DELIVER - Start with "I was wrong if ...."
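The loop above can be sketched as plain functions. This is a toy scaffold with stub logic standing in for the actual LLM calls; the function names are illustrative, not my real skill file:

```python
# Stub verifier: a hypothesis "survives" if every atomic claim (split on ";")
# is non-empty. A real skill would dispatch each claim to an agent.
def verify_claims(hypothesis):
    return all(claim.strip() for claim in hypothesis.split(";"))

def think(question, hypotheses):
    notes = [
        f"FRAME: what is really being asked by {question!r}?",
        "GROUND: known facts vs. gaps would be listed here.",
    ]
    survivors = [h for h in hypotheses if verify_claims(h)]  # VERIFY
    chain = " -> ".join(survivors)                           # CHAIN
    # PROVE and LOOP: walk backwards over the chain; here just a re-check.
    assert all(verify_claims(h) for h in survivors)
    notes.append(f"DELIVER: I was wrong if any link fails: {chain}")
    return "\n".join(notes)

# The empty hypothesis is filtered out at the VERIFY step.
print(think("is the fix complete?", ["patch applies; tests pass", "  ; "]))
```

The point of the scaffold is that VERIFY filters before CHAIN builds, so the delivered answer only ever rests on claims that passed the check.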
It helps me a lot. Whenever I need to check whether Claude Opus 4.6 is bullshitting me, I just say "/think verify the above reasoning is correct" or "/think verify the above fix is correct and complete."