Show HN: AI-generated assembly vs GCC -O3 on real codebases (300K fuzz, 0 failures)
Three kernels extracted from real open source projects, optimized with AI-generated x86-64 assembly, verified with 100K differential fuzz iterations each:
Kernel           | AI strategy                     | Speedup  | Verdict
Base64 decode    | SSSE3 pshufb table-free lookup  | 4.8–6.3x | AI wins
LZ4 fast decode  | SSE 16-byte match copy          | ~1.05x   | AI wins (marginal)
Redis SipHash    | Reordered SIPROUND scheduling   | 0.97x    | GCC wins
The base64 win: GCC can't auto-vectorize a 256-byte lookup table (it's a gather pattern). The AI replaces it with a pshufb nibble trick — 16 parallel lookups in one instruction, zero table accesses. 1.8 GB/s → 11.6 GB/s.
The SipHash loss: on pure ALU kernels (adds, rotates, XORs), GCC's scheduler is already near-optimal.
300K total fuzz iterations, zero mismatches. Every result is one command to reproduce.
Comments
cod-e•1h ago
Author here. Some context on how this works and what it doesn't do.
The system doesn't replace the compiler. It sits on top of it. The key insight (which took a few failed experiments to learn) is that AI-generated assembly is dangerous for code with error handling, state, and control flow — but strong on pure computational kernels.
We tried having the AI rewrite an entire packet parser. It shipped two bugs (flag clobbering, unsigned underflow) and was 1.23x slower than GCC. Then we split the architecture: compiler owns all structural code (validation, error paths, bounds checks, state management), AI only optimizes the inner kernel after all checks pass. Same parser, zero bugs on first try, clean performance win.
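Roughly, the shape of that split looks like this (a minimal sketch with illustrative names, not the actual parser from the experiment):

    /* Illustrative only -- not the real parser. The compiler owns every
     * check and error path; the kernel is a pure load/transform/store
     * function that can be swapped for a generated .s file. */
    #include <stddef.h>
    #include <stdint.h>

    /* Pure kernel: no error branches, no state. Two interchangeable builds:
     * portable C compiled by GCC, or the AI-generated assembly. */
    extern void payload_kernel(const uint8_t *in, uint8_t *out, size_t len);

    int parse_packet(const uint8_t *pkt, size_t len, uint8_t *out)
    {
        /* Structural code stays with the compiler: validation, bounds, errors. */
        if (len < 4 || pkt[0] != 0x01)        /* hypothetical header check */
            return -1;

        /* The kernel only ever sees input that already passed every check. */
        payload_kernel(pkt + 4, out, len - 4);
        return (int)(len - 4);
    }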
That's the design principle behind everything here. The compiler guarantees correctness by construction. The AI only touches pure load/transform/store kernels with no branches. Then we verify with 100K differential fuzz — run random inputs through both versions, compare output byte-by-byte.
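A minimal sketch of what that harness looks like (ref_kernel and asm_kernel are placeholder names, not the repo's actual symbols — one stands for the GCC-built C kernel, the other for the AI-generated assembly; link against both builds):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    size_t ref_kernel(const uint8_t *in, size_t n, uint8_t *out);  /* placeholder */
    size_t asm_kernel(const uint8_t *in, size_t n, uint8_t *out);  /* placeholder */

    int main(void)
    {
        static uint8_t in[4096], out_ref[8192], out_asm[8192];
        srand(0xC0DE);                        /* fixed seed: runs are reproducible */

        for (int iter = 0; iter < 100000; iter++) {
            size_t n = (size_t)(rand() % (int)sizeof(in));
            for (size_t i = 0; i < n; i++)
                in[i] = (uint8_t)rand();      /* random input for this iteration */

            size_t r = ref_kernel(in, n, out_ref);
            size_t a = asm_kernel(in, n, out_asm);

            /* Compare lengths, then output byte-by-byte. */
            if (r != a || memcmp(out_ref, out_asm, r) != 0) {
                fprintf(stderr, "mismatch at iteration %d (len=%zu)\n", iter, n);
                return 1;
            }
        }
        puts("100000 iterations, 0 mismatches");
        return 0;
    }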
What the AI is good at: spotting SIMD opportunities GCC misses. The base64 case is textbook — GCC sees a 256-byte lookup table and generates scalar loads. The AI recognizes that base64's alphabet can be decomposed into nibble ranges and uses pshufb to do 16 parallel lookups. That's not a novel technique (simdjson and others use it), but the point is the AI found and applied it automatically.
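For anyone curious what that looks like, here's a minimal SSSE3 sketch of the nibble-range classification step (the same family of trick simdjson uses, not the generated assembly from the repo; input validation and the 6-bit-to-byte packing step are omitted):

    /* ASCII base64 chars -> 6-bit values, 16 at a time.
     * Compile with: gcc -O2 -mssse3 b64_sketch.c */
    #include <stdint.h>
    #include <stdio.h>
    #include <tmmintrin.h>

    static __m128i decode_ascii_to_6bit(__m128i in)
    {
        /* Classify each byte by its high nibble (0x0..0xF). */
        __m128i hi = _mm_and_si128(_mm_srli_epi32(in, 4), _mm_set1_epi8(0x0F));

        /* Offset to add per high nibble: '+' -> +19, '0'-'9' -> +4,
         * 'A'-'Z' -> -65, 'a'-'z' -> -71. */
        const __m128i shift_lut = _mm_setr_epi8(
            0, 0, 19, 4, -65, -65, -71, -71,
            0, 0, 0, 0, 0, 0, 0, 0);
        __m128i shift = _mm_shuffle_epi8(shift_lut, hi);

        /* '/' (0x2F) shares high nibble 0x2 with '+' but needs +16, not +19. */
        __m128i slash = _mm_cmpeq_epi8(in, _mm_set1_epi8(0x2F));
        shift = _mm_sub_epi8(shift, _mm_and_si128(slash, _mm_set1_epi8(3)));

        /* 16 table lookups collapse into one pshufb plus one add. */
        return _mm_add_epi8(in, shift);
    }

    int main(void)
    {
        const char *s = "TWFuTWFuTWFuTWFu";   /* 16 valid base64 characters */
        uint8_t v[16];
        _mm_storeu_si128((__m128i *)v,
                         decode_ascii_to_6bit(_mm_loadu_si128((const __m128i *)s)));
        for (int i = 0; i < 16; i++)
            printf("%u ", v[i]);              /* 6-bit value of each character */
        putchar('\n');
        return 0;
    }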
What the AI is bad at: pure ALU scheduling. SipHash is adds, rotates, and XORs with tight data dependencies. GCC's instruction scheduler already does this near-optimally. The AI tried and lost. The system reports that honestly.
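For reference, this is SIPROUND as it appears in the reference SipHash implementation (not code from the repo): four 64-bit state words updated in two add/rotate/xor chains that cross mid-round, so there's very little scheduling freedom left to find.

    #include <stdint.h>

    #define ROTL(x, b) ((uint64_t)(((x) << (b)) | ((x) >> (64 - (b)))))

    /* Each line depends on the one before it within its chain,
     * and the v0/v1 and v2/v3 chains cross halfway through. */
    #define SIPROUND(v0, v1, v2, v3)                                   \
        do {                                                           \
            v0 += v1; v1 = ROTL(v1, 13); v1 ^= v0; v0 = ROTL(v0, 32);  \
            v2 += v3; v3 = ROTL(v3, 16); v3 ^= v2;                     \
            v0 += v3; v3 = ROTL(v3, 21); v3 ^= v0;                     \
            v2 += v1; v1 = ROTL(v1, 17); v1 ^= v2; v2 = ROTL(v2, 32);  \
        } while (0)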
The verification reports and build scripts are in the repo — every number is one shell command to reproduce. Happy to answer questions about the architecture, the failure cases, or where this goes next.