This is impressive microbenchmarking work, but the practical relevance is inverted from what the title suggests.
The 0.5ns lookup times are achieved on a 2^12-entry table that fits entirely in L2 cache. That's about 4KB of data, and the entire working set for the benchmark is maybe 32KB. On a real system with millions of entries you're hitting L3 or main memory, and all those nanosecond savings vanish: a single cache miss costs 200-300ns, while this optimization saves ~5ns on the happy path and adds instruction complexity that hurts branch prediction.
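To make the arithmetic concrete, here's a rough cost model in C. The 5ns saving and 250ns miss penalty are the assumed, order-of-magnitude figures from above, not measurements from the article:

    /* Back-of-envelope: a few ns of fast-path savings vs. DRAM misses.
       Both latency figures are assumptions, not measurements. */
    #include <stdio.h>

    int main(void) {
        const double saving_ns = 5.0;    /* assumed SIMD fast-path win */
        const double miss_ns = 250.0;    /* assumed cache-miss penalty */

        for (double miss_rate = 0.0; miss_rate <= 1.0; miss_rate += 0.25) {
            printf("miss rate %.2f -> ~%6.1f ns/lookup lost vs %.1f ns saved\n",
                   miss_rate, miss_rate * miss_ns, saving_ns);
        }
        return 0;
    }

Even a 2% miss rate costs as much as the fast path saves.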
The 100% load factor claim is similarly misleading. Nobody targets 100% in production; it only looks fine here because the benchmark ends at 2^12 entries. Try 2^28 entries at 100% load and see what happens to your insertion performance. Those "massive entry movements" shown in the videos? They get worse with larger tables (branch predictors can't help you when there are thousands of entries' worth of movement). Real systems target 70-80% for a reason.
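For intuition on why, the classical linear-probing estimates (Knuth) show how expected probe counts blow up as the load factor approaches 1. The article's probing scheme is different, but the shape of the curve is the point:

    /* Expected probes under classical linear-probing analysis:
       successful search ~ (1 + 1/(1-a)) / 2,
       unsuccessful     ~ (1 + 1/(1-a)^2) / 2. */
    #include <stdio.h>

    int main(void) {
        const double loads[] = {0.5, 0.7, 0.8, 0.9, 0.99};
        for (int i = 0; i < 5; i++) {
            double a = loads[i];
            printf("load %.2f: ~%5.1f probes (hit), ~%7.1f (miss)\n",
                   a,
                   0.5 * (1.0 + 1.0 / (1.0 - a)),
                   0.5 * (1.0 + 1.0 / ((1.0 - a) * (1.0 - a))));
        }
        return 0;
    }

At 0.8 load you expect ~3 probes on a hit; at 0.99 it's ~50.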
The SIMD + 32-slot windows trick is genuinely clever. But it's solving a problem that chained hashing basically sidesteps. Python dicts, Go maps, Rust hash maps: they've each settled on designs (bucketed chaining, SwissTable-style probing) because the sweet spot for most real workloads isn't "perfect 0ns average case", it's "predictable worst case that doesn't require constant specialization".
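For readers who haven't seen the trick: the general idea (SwissTable-style, sketched here in C with AVX2; the article's actual Zig layout may differ) is to compare one hash tag against 32 one-byte control slots in a single instruction and get back a bitmask of candidates:

    /* SwissTable-style group probe: compare a 1-byte hash tag against
       32 control bytes at once. Illustrative sketch, not the article's
       implementation; compile with -mavx2. */
    #include <immintrin.h>
    #include <stdint.h>

    /* Bit i of the result is set iff ctrl[i] matches the tag. */
    static uint32_t probe_window32(const uint8_t *ctrl, uint8_t tag) {
        __m256i group = _mm256_loadu_si256((const __m256i *)ctrl);
        __m256i match = _mm256_cmpeq_epi8(group, _mm256_set1_epi8((char)tag));
        return (uint32_t)_mm256_movemask_epi8(match);
    }

One probe filters 32 slots at once, which is exactly why the average case looks so good while the table is cache-resident.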
Also: "modern audiences" in the title but the implementation is in Zig. Zig code is not what modern applications are built in. This would be valuable if there was a production-ready, maintained version people could actually use. Single-file educational implementations don't count.
Great writeup though. The visualizations are perfect.