Measured against unmodified upstream llama.cpp at the same Bonsai/Q2_0 commit, same M4 Max:
- tg128: 309.82 → 442.42 t/s (+42.0%)
- pp512: 4250.32 → 4622.63 t/s (+8.8%)
hhuytho•1h ago
- Against the conventional fusion wisdom of "fuse early, fuse aggressively", the search does the opposite for Q. It fuses K's RMSNorm at K-cache-write time (one norm pass over the whole K matrix), but defers Q's RMSNorm to the attention kernel's prologue.
- The output stage of the Q2_0 kernel was rewritten to process 2 output rows per SIMD lane instead of 1, with nsg=8. This goes against the common Metal advice of maximizing occupancy to keep simdgroups busy. The advantage is that each y vector gets reused across two accumulators, halving DRAM bandwidth for the y operand.
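To make the second bullet concrete, here's a scalar C++ sketch of the 2-rows-per-lane idea (names and structure are ours, not the actual Metal kernel): each loaded element of y feeds two accumulators, so y is streamed from memory half as many times as in the one-row-per-lane version.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: mat-vec with 2-row register blocking.
// One load of y[j] feeds two accumulators, halving traffic on the y operand.
static void matvec_2row(const std::vector<float> &A, const std::vector<float> &y,
                        std::vector<float> &out, size_t rows, size_t cols) {
    for (size_t r = 0; r + 1 < rows; r += 2) {
        const float *a0 = &A[r * cols];
        const float *a1 = &A[(r + 1) * cols];
        float acc0 = 0.0f, acc1 = 0.0f;
        for (size_t j = 0; j < cols; ++j) {
            float yj = y[j];          // single load, reused twice
            acc0 += a0[j] * yj;
            acc1 += a1[j] * yj;
        }
        out[r]     = acc0;
        out[r + 1] = acc1;
    }
    if (rows & 1) {                   // tail row when rows is odd
        float acc = 0.0f;
        for (size_t j = 0; j < cols; ++j) acc += A[(rows - 1) * cols + j] * y[j];
        out[rows - 1] = acc;
    }
}
```

The real kernel does this per SIMD lane over dequantized Q2_0 blocks, but the bandwidth argument is the same.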
We didn't suggest either of these. The agent had the upstream code, a benchmark, and a correctness check.
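For readers curious about the first bullet, a scalar C++ sketch of the asymmetry (all names here are ours, not llama.cpp's): K rows are normalized once when written to the cache and then reread at every step, so that norm amortizes; Q is consumed exactly once per step, so normalizing it in the attention prologue costs the same math but skips a separate pass over memory.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical sketch. RMSNorm over one row: x * 1/sqrt(mean(x^2) + eps).
static void rms_norm(float *x, int n, float eps = 1e-6f) {
    float ss = 0.0f;
    for (int i = 0; i < n; ++i) ss += x[i] * x[i];
    float scale = 1.0f / std::sqrt(ss / n + eps);
    for (int i = 0; i < n; ++i) x[i] *= scale;
}

// Fuse early for K: normalize once, at cache-write time. Every later
// attention step reads the already-normalized row for free.
static void write_k_cache(std::vector<float> &cache, const float *k, int d, int pos) {
    std::vector<float> row(k, k + d);
    rms_norm(row.data(), d);
    std::copy(row.begin(), row.end(), cache.begin() + (size_t)pos * d);
}

// Defer for Q: Q is used once per step, so normalizing it in the
// attention kernel's prologue avoids a separate memory round-trip.
static float attention_score(float *q, const std::vector<float> &cache, int d, int pos) {
    rms_norm(q, d);                    // prologue: normalize Q in-place
    float dot = 0.0f;
    for (int i = 0; i < d; ++i) dot += q[i] * cache[(size_t)pos * d + i];
    return dot;
}
```

Either placement produces the same scores; the win is purely in when and how often the norm touches memory.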