Show HN: AI-generated assembly vs GCC -O3 on real codebases (300K fuzz, 0 failures)
Three kernels extracted from real open source projects, optimized with AI-generated x86-64 assembly, verified with 100K differential fuzz iterations each:
Kernel           | AI strategy                     | Speedup  | Verdict
Base64 decode    | SSSE3 pshufb table-free lookup  | 4.8–6.3x | AI wins
LZ4 fast decode  | SSE 16-byte match copy          | ~1.05x   | AI wins (marginal)
Redis SipHash    | Reordered SIPROUND scheduling   | 0.97x    | GCC wins
The base64 win: GCC can't auto-vectorize a 256-byte lookup table (it's a gather pattern). The AI replaces it with a pshufb nibble trick — 16 parallel lookups in one instruction, zero table accesses. 1.8 GB/s → 11.6 GB/s.
The SipHash loss: on pure ALU kernels (adds, rotates, XORs), GCC's scheduler is already near-optimal.
300K total fuzz iterations, zero mismatches. Every result is one command to reproduce.
Comments
cod-e•1h ago
Author here. Some context on how this works and what it doesn't do.
The system doesn't replace the compiler. It sits on top of it. The key insight (which took a few failed experiments to learn) is that AI-generated assembly is dangerous for code with error handling, state, and control flow — but strong on pure computational kernels.
We tried having the AI rewrite an entire packet parser. It shipped two bugs (flag clobbering, unsigned underflow) and was 1.23x slower than GCC. Then we split the architecture: compiler owns all structural code (validation, error paths, bounds checks, state management), AI only optimizes the inner kernel after all checks pass. Same parser, zero bugs on first try, clean performance win.
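Roughly, the shape of that split looks like this (a minimal sketch with illustrative names, not the actual parser from the experiment):

    /* Illustrative only -- not the real parser. The compiler owns every
     * check and error path; the kernel is a pure load/transform/store
     * function that can be swapped for a generated .s file. */
    #include <stddef.h>
    #include <stdint.h>

    /* Pure kernel: no error branches, no state. Two interchangeable builds:
     * portable C compiled by GCC, or the AI-generated assembly. */
    extern void payload_kernel(const uint8_t *in, uint8_t *out, size_t len);

    int parse_packet(const uint8_t *pkt, size_t len, uint8_t *out)
    {
        /* Structural code stays with the compiler: validation, bounds, errors. */
        if (len < 4 || pkt[0] != 0x01)        /* hypothetical header check */
            return -1;

        /* The kernel only ever sees input that already passed every check. */
        payload_kernel(pkt + 4, out, len - 4);
        return (int)(len - 4);
    }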
That's the design principle behind everything here. The compiler guarantees correctness by construction. The AI only touches pure load/transform/store kernels with no branches. Then we verify with 100K differential fuzz — run random inputs through both versions, compare output byte-by-byte.
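A minimal sketch of what that harness looks like (ref_kernel and asm_kernel are placeholder names, not the repo's actual symbols — one stands for the GCC-built C kernel, the other for the AI-generated assembly; link against both builds):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    size_t ref_kernel(const uint8_t *in, size_t n, uint8_t *out);  /* placeholder */
    size_t asm_kernel(const uint8_t *in, size_t n, uint8_t *out);  /* placeholder */

    int main(void)
    {
        static uint8_t in[4096], out_ref[8192], out_asm[8192];
        srand(0xC0DE);                        /* fixed seed: runs are reproducible */

        for (int iter = 0; iter < 100000; iter++) {
            size_t n = (size_t)(rand() % (int)sizeof(in));
            for (size_t i = 0; i < n; i++)
                in[i] = (uint8_t)rand();      /* random input for this iteration */

            size_t r = ref_kernel(in, n, out_ref);
            size_t a = asm_kernel(in, n, out_asm);

            /* Compare lengths, then output byte-by-byte. */
            if (r != a || memcmp(out_ref, out_asm, r) != 0) {
                fprintf(stderr, "mismatch at iteration %d (len=%zu)\n", iter, n);
                return 1;
            }
        }
        puts("100000 iterations, 0 mismatches");
        return 0;
    }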
What the AI is good at: spotting SIMD opportunities GCC misses. The base64 case is textbook — GCC sees a 256-byte lookup table and generates scalar loads. The AI recognizes that base64's alphabet can be decomposed into nibble ranges and uses pshufb to do 16 parallel lookups. That's not a novel technique (simdjson and others use it), but the point is the AI found and applied it automatically.
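For anyone curious what that looks like, here's a minimal SSSE3 sketch of the nibble-range classification step (the same family of trick simdjson uses, not the generated assembly from the repo; input validation and the 6-bit-to-byte packing step are omitted):

    /* ASCII base64 chars -> 6-bit values, 16 at a time.
     * Compile with: gcc -O2 -mssse3 b64_sketch.c */
    #include <stdint.h>
    #include <stdio.h>
    #include <tmmintrin.h>

    static __m128i decode_ascii_to_6bit(__m128i in)
    {
        /* Classify each byte by its high nibble (0x0..0xF). */
        __m128i hi = _mm_and_si128(_mm_srli_epi32(in, 4), _mm_set1_epi8(0x0F));

        /* Offset to add per high nibble: '+' -> +19, '0'-'9' -> +4,
         * 'A'-'Z' -> -65, 'a'-'z' -> -71. */
        const __m128i shift_lut = _mm_setr_epi8(
            0, 0, 19, 4, -65, -65, -71, -71,
            0, 0, 0, 0, 0, 0, 0, 0);
        __m128i shift = _mm_shuffle_epi8(shift_lut, hi);

        /* '/' (0x2F) shares high nibble 0x2 with '+' but needs +16, not +19. */
        __m128i slash = _mm_cmpeq_epi8(in, _mm_set1_epi8(0x2F));
        shift = _mm_sub_epi8(shift, _mm_and_si128(slash, _mm_set1_epi8(3)));

        /* 16 table lookups collapse into one pshufb plus one add. */
        return _mm_add_epi8(in, shift);
    }

    int main(void)
    {
        const char *s = "TWFuTWFuTWFuTWFu";   /* 16 valid base64 characters */
        uint8_t v[16];
        _mm_storeu_si128((__m128i *)v,
                         decode_ascii_to_6bit(_mm_loadu_si128((const __m128i *)s)));
        for (int i = 0; i < 16; i++)
            printf("%u ", v[i]);              /* 6-bit value of each character */
        putchar('\n');
        return 0;
    }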
What the AI is bad at: pure ALU scheduling. SipHash is adds, rotates, and XORs with tight data dependencies. GCC's instruction scheduler already does this near-optimally. The AI tried and lost. The system reports that honestly.
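For reference, this is SIPROUND as it appears in the reference SipHash implementation (not code from the repo): four 64-bit state words updated in two add/rotate/xor chains that cross mid-round, so there's very little scheduling freedom left to find.

    #include <stdint.h>

    #define ROTL(x, b) ((uint64_t)(((x) << (b)) | ((x) >> (64 - (b)))))

    /* Each line depends on the one before it within its chain,
     * and the v0/v1 and v2/v3 chains cross halfway through. */
    #define SIPROUND(v0, v1, v2, v3)                                   \
        do {                                                           \
            v0 += v1; v1 = ROTL(v1, 13); v1 ^= v0; v0 = ROTL(v0, 32);  \
            v2 += v3; v3 = ROTL(v3, 16); v3 ^= v2;                     \
            v0 += v3; v3 = ROTL(v3, 21); v3 ^= v0;                     \
            v2 += v1; v1 = ROTL(v1, 17); v1 ^= v2; v2 = ROTL(v2, 32);  \
        } while (0)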
The verification reports and build scripts are in the repo — every number is one shell command to reproduce. Happy to answer questions about the architecture, the failure cases, or where this goes next.