frontpage.

Maple Mono: Smooth your coding flow

https://font.subf.dev/en/
1•signa11•4m ago•0 comments

Sid Meier's System for Real-Time Music Composition and Synthesis

https://patents.google.com/patent/US5496962A/en
1•GaryBluto•12m ago•1 comments

Show HN: Slop News – HN front page now, but it's all slop

https://dosaygo-studio.github.io/hn-front-page-2035/slop-news
3•keepamovin•13m ago•1 comments

Show HN: Empusa – Visual debugger to catch and resume AI agent retry loops

https://github.com/justin55afdfdsf5ds45f4ds5f45ds4/EmpusaAI
1•justinlord•15m ago•0 comments

Show HN: Bitcoin wallet on NXP SE050 secure element, Tor-only open source

https://github.com/0xdeadbeefnetwork/sigil-web
2•sickthecat•18m ago•1 comments

White House Explores Opening Antitrust Probe on Homebuilders

https://www.bloomberg.com/news/articles/2026-02-06/white-house-explores-opening-antitrust-probe-i...
1•petethomas•18m ago•0 comments

Show HN: MindDraft – AI task app with smart actions and auto expense tracking

https://minddraft.ai
2•imthepk•23m ago•0 comments

How do you estimate AI app development costs accurately?

1•insights123•24m ago•0 comments

Going Through Snowden Documents, Part 5

https://libroot.org/posts/going-through-snowden-documents-part-5/
1•goto1•24m ago•0 comments

Show HN: MCP Server for TradeStation

https://github.com/theelderwand/tradestation-mcp
1•theelderwand•27m ago•0 comments

Canada unveils auto industry plan in latest pivot away from US

https://www.bbc.com/news/articles/cvgd2j80klmo
2•breve•28m ago•1 comments

The essential Reinhold Niebuhr: selected essays and addresses

https://archive.org/details/essentialreinhol0000nieb
1•baxtr•31m ago•0 comments

Rentahuman.ai Turns Humans into On-Demand Labor for AI Agents

https://www.forbes.com/sites/ronschmelzer/2026/02/05/when-ai-agents-start-hiring-humans-rentahuma...
1•tempodox•32m ago•0 comments

StovexGlobal – Compliance Gaps to Note

1•ReviewShield•35m ago•1 comments

Show HN: Afelyon – Turns Jira tickets into production-ready PRs (multi-repo)

https://afelyon.com/
1•AbduNebu•36m ago•0 comments

Trump says America should move on from Epstein – it may not be that easy

https://www.bbc.com/news/articles/cy4gj71z0m0o
6•tempodox•37m ago•2 comments

Tiny Clippy – A native Office Assistant built in Rust and egui

https://github.com/salva-imm/tiny-clippy
1•salvadorda656•41m ago•0 comments

LegalArgumentException: From Courtrooms to Clojure – Sen [video]

https://www.youtube.com/watch?v=cmMQbsOTX-o
1•adityaathalye•44m ago•0 comments

US moves to deport 5-year-old detained in Minnesota

https://www.reuters.com/legal/government/us-moves-deport-5-year-old-detained-minnesota-2026-02-06/
8•petethomas•47m ago•2 comments

If you lose your passport in Austria, head for McDonald's Golden Arches

https://www.cbsnews.com/news/us-embassy-mcdonalds-restaurants-austria-hotline-americans-consular-...
1•thunderbong•52m ago•0 comments

Show HN: Mermaid Formatter – CLI and library to auto-format Mermaid diagrams

https://github.com/chenyanchen/mermaid-formatter
1•astm•1h ago•0 comments

RFCs vs. READMEs: The Evolution of Protocols

https://h3manth.com/scribe/rfcs-vs-readmes/
3•init0•1h ago•1 comments

Kanchipuram Saris and Thinking Machines

https://altermag.com/articles/kanchipuram-saris-and-thinking-machines
1•trojanalert•1h ago•0 comments

Chinese chemical supplier causes global baby formula recall

https://www.reuters.com/business/healthcare-pharmaceuticals/nestle-widens-french-infant-formula-r...
2•fkdk•1h ago•0 comments

I've used AI to write 100% of my code for a year as an engineer

https://old.reddit.com/r/ClaudeCode/comments/1qxvobt/ive_used_ai_to_write_100_of_my_code_for_1_ye...
2•ukuina•1h ago•1 comments

Looking for 4 Autistic Co-Founders for AI Startup (Equity-Based)

1•au-ai-aisl•1h ago•1 comments

AI-native capabilities, a new API Catalog, and updated plans and pricing

https://blog.postman.com/new-capabilities-march-2026/
1•thunderbong•1h ago•0 comments

What changed in tech from 2010 to 2020?

https://www.tedsanders.com/what-changed-in-tech-from-2010-to-2020/
3•endorphine•1h ago•0 comments

From Human Ergonomics to Agent Ergonomics

https://wesmckinney.com/blog/agent-ergonomics/
1•Anon84•1h ago•0 comments

Advanced Inertial Reference Sphere

https://en.wikipedia.org/wiki/Advanced_Inertial_Reference_Sphere
1•cyanf•1h ago•0 comments

86 GB/s bitpacking with ARM SIMD (single thread)

https://github.com/ashtonsix/perf-portfolio/tree/main/bytepack
132•ashtonsix•4mo ago

Comments

Retr0id•4mo ago
I tried to run the benchmark on my M1 Pro macbook, but the "baseline" is written with x86 intrinsics and won't compile.

Are the benchmark results in the README real? (The README itself feels very AI-generated)

Looking at the makefile, it tries to link the x86 SSE "baseline" implementation and the NEON version into the same binary. A real headscratcher!

Edit: The SSE impl gets shimmed via simd-everywhere, and the benchmark results do seem legit (aside from being slightly apples-to-oranges, but that's unavoidable)

Asmod4n•4mo ago
Maybe this could help you: https://github.com/simd-everywhere/simde/issues/1099
Retr0id•4mo ago
But this project isn't using simd-everywhere. I'd like to reproduce the results as documented in the README
guipsp•4mo ago
Look at the parent dir. I agree it is a bit confusing
Retr0id•4mo ago
Ah! Yup, that works, I can compile the binary. I get an "Illegal instruction" error when I run it but that's probably just because M1 doesn't support some of the NEON instructions. I retract my implicit AI-slop accusations.
Retr0id•4mo ago
Results from M1 Pro (after setting CPU=native in the makefile): https://gist.github.com/DavidBuchanan314/e3cde76e4dab2758ec4...
ashtonsix•4mo ago
Thank you so much for attempting a reproduction! (I posted this on Reddit and most commenters didn't even click the link)

For the baseline you need SIMDe headers: https://github.com/simd-everywhere/simde/tree/master/simde. These alias x86 intrinsics to ARM intrinsics. The baseline is based on the previous State-of-The-Art (https://arxiv.org/abs/1209.2137) which happens to be x86-based; using SIMDe to compile was the highest-integrity way I could think of to compare with the previous SOTA.

Note: M1 chips specifically have notoriously bad small-shift performance, so the benchmark results will be very bad on your machine. M3 partially fixed this, and M4 fixed it completely. My primary target is server-class rather than consumer-class hardware, so I'm not too worried about this.

The benchmark results were copy-pasted from the terminal. The README prose was AI-generated from my rough notes (I'm confident when communicating with other experts/researchers, but less so when communicating with a general audience).
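
For anyone unfamiliar with SIMDe, the aliasing works roughly like the sketch below (illustrative only, not code from the repo; the function name is made up). With native aliases enabled, unmodified SSE intrinsics compile straight to NEON on AArch64:

  // Sketch of the SIMDe shim: define the alias macro before including the
  // x86 header and the usual _mm_* names become available on ARM via NEON.
  #define SIMDE_ENABLE_NATIVE_ALIASES
  #include "simde/x86/sse2.h"

  __m128i mask_bytes(__m128i v, __m128i m) {
      return _mm_and_si128(v, m);  // lowered to a NEON AND under the hood
  }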

ozgrakkurt•4mo ago
Super cool!

Pretty sure anyone reading this kind of SIMD post would prefer your writing to an LLM's.

deadmutex•4mo ago
Here is a repro using GCE's C4A Axion instances (c4a-highcpu-72). Seems to beat Graviton? Maybe the title of the thread can be updated to a larger number? :) I used the largest instance to avoid noisy neighbor issues.

  $ ./out/bytepack_eval
  Bytepack Bench — 16 KiB, reps=20000 (pinned if available)
  Throughput GB/s

  K  NEON pack   NEON unpack  Baseline pack   Baseline unpack
  1  94.77       84.05        45.01           63.12          
  2  123.63      94.74        52.70           66.63          
  3  94.62       83.89        45.32           68.43          
  4  112.68      77.91        58.10           78.20          
  5  86.96       80.02        44.32           60.77          
  6  93.50       92.08        51.22           67.20          
  7  87.10       79.53        43.94           57.95          
  8  90.49       92.36        68.99           83.88
ashtonsix•4mo ago
Oh nice! Axion C4A and Graviton4 use the same core (Neoverse V2), so the performance difference is due to factors like clock speed and power management.

I used a geometric mean to calculate the top-line "86 GB/s" for NEON pack/unpack; so that's 91 GB/s for the C4A repro. Probably going to leave the title unmodified.
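
For anyone checking the arithmetic, here is a minimal sketch that reproduces that figure from the C4A table above (a hypothetical standalone file, not part of the repo; compile with -lm):

  // Geometric mean of the 16 NEON pack/unpack throughputs reported above.
  #include <math.h>
  #include <stdio.h>

  int main(void) {
      const double gbps[] = {94.77, 84.05, 123.63, 94.74, 94.62, 83.89,
                             112.68, 77.91, 86.96, 80.02, 93.50, 92.08,
                             87.10, 79.53, 90.49, 92.36};
      const int n = sizeof gbps / sizeof gbps[0];
      double log_sum = 0.0;
      for (int i = 0; i < n; i++) log_sum += log(gbps[i]);
      printf("geomean: %.1f GB/s\n", exp(log_sum / n));  // prints ~91 GB/s
      return 0;
  }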

danlark1•4mo ago
Great work!

There's a popular narrative that NEON does not have a movemask alternative. Some time ago I published an article on simulating popular bit-packing use cases with NEON in 1-2 instructions. It does not cover unpacking, but it can be great for real-world applications like compare+find, compare+iterate, and compare+test.

https://community.arm.com/arm-community-blogs/b/servers-and-...

ashtonsix•4mo ago
Nice article! I personally find the ARM ISA far more cohesive than x86's: far fewer historical quirks. I also really appreciate the ubiquity of support for 8-bit elements in ARM and the absence of SMT (both make performance much more predictable).
Sesse__•4mo ago
I never understood why they couldn't just include a movmskb instruction to begin with. It's massively useful for integer tasks, not expensive to implement as far as I know, and the vshrn+mov trick often requires an extra instruction either in front or after (in addition to just being, well, pretty obscure).

NEON in general is a bit sad, really; it's built around the idea of being implementable with a 64-bit ALU, and it shows. And SVE support is pretty much non-existent on the client.
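
For context, the vshrn trick mentioned here is usually written something like the sketch below (illustrative only, assuming a byte-wise compare result of 0x00/0xFF per lane; the function name is made up). It yields a 64-bit mask with one nibble per input byte rather than one bit per byte, which is where the extra fix-up instruction often comes from, e.g. a first-match index needs __builtin_ctzll(mask) >> 2:

  #include <arm_neon.h>

  // Collapse a 128-bit byte-wise compare mask into a 64-bit nibble mask:
  // shift each 16-bit lane right by 4 and narrow, so every input byte
  // contributes one nibble to the result.
  static inline uint64_t neon_movemask_nibbles(uint8x16_t cmp) {
      uint8x8_t narrowed = vshrn_n_u16(vreinterpretq_u16_u8(cmp), 4);
      return vget_lane_u64(vreinterpret_u64_u8(narrowed), 0);
  }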

ashtonsix•4mo ago
Not having (1 << k) - 1 as a single instruction sucks when it HAS to be in a hot loop, but you can usually hoist this to the loop prologue: my stuff uses dummy inline-assembly hints like `asm volatile("" : "+w"(m));` to force compilers to do this.

I personally think calibrating ARM's ISA on smaller VL was a good choice: you get much better IPC. You also have an almost-complete absence of support for 8-bit elements with x86 ISAs, so elements per instruction is tied. And NEON kind-of-ish makes up for its small VL with multi-register TBL/TBX and LDP/STP.

Also: AVX512 is just as non-existent on clients as SVE2, although that's not really relevant for the server-side targets I'm optimising for (mostly OLAP).
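
A minimal sketch of that hoist as described above (the function and buffer names here are made up, not from the repo):

  #include <arm_neon.h>
  #include <stddef.h>
  #include <stdint.h>

  void and_low_bits(uint8_t *dst, const uint8_t *src, size_t n, unsigned k) {
      // Build (1 << k) - 1 once in the loop prologue...
      uint8x16_t m = vdupq_n_u8((uint8_t)((1u << k) - 1));
      // ...and make it opaque to the optimizer so it stays in a register
      // instead of being rematerialized inside the hot loop.
      asm volatile("" : "+w"(m));
      for (size_t i = 0; i + 16 <= n; i += 16) {
          uint8x16_t v = vld1q_u8(src + i);
          vst1q_u8(dst + i, vandq_u8(v, m));
      }
  }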

TinkersW•4mo ago
8-bit is not absent in x86 SIMD; it is slightly less covered than 32 & 16 bit, but you can fully implement all the common 8-bit ops and most are 1 instruction (with AVX2). There are even various horizontal ops on 8-bit values (avg, dot, etc.).

Also, AVX512 is way more common than SVE2; all Zen4 & Zen5 chips support it.

dzaima•4mo ago
More specifically, basically the only absent 8-bit ops that have 32-bit equivalents in AVX2 are shifts and multiplies. Shifts are quite annoying (though with a uniform shift they can be emulated on AVX-512 via GFNI abuse in 1 instr); multiplies are rather rare (though note that there is vpmaddubsw for an 8-bit→16-bit multiply-add). There's even a case of the opposite: saturating add/sub exist for 8-bit and 16-bit ints, but not wider.
namibj•4mo ago
GFNI is distinct from AVX-512; it was merely introduced in cores that also had AVX-512.
Sesse__•4mo ago
Does _any_ SIMD instruction set have (1 << k) - 1 as a single instruction?
camel-cdr•4mo ago
Not sure in which context this is used, but you can do -1 << k in most ISAs; that still requires a bit-not, though. If you want to use the value in a bitwise instruction, there are often variants that can invert the input operand.

E.g. in RVV, instead of vand.vv(a, vadd.vi(vsll.vv(1,k), -1)) you could do vandn.vv(a, vsll.vv(-1,k)).

AVX-512 can do this with any binary or ternary bitwise logic function via vpternlog.
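
NEON has one of those inverted-operand variants too: BIC computes a & ~b, so you can mask to the low k bits without ever materializing (1 << k) - 1. A small sketch (illustrative only, not from the repo):

  #include <arm_neon.h>

  // a & ((1 << k) - 1)  ==  a & ~(~0 << k); BIC gives the AND-NOT in one op.
  uint32x4_t and_low_k_bits(uint32x4_t a, int k) {
      uint32x4_t hi = vshlq_u32(vdupq_n_u32(~0u), vdupq_n_s32(k));  // ~0 << k
      return vbicq_u32(a, hi);                                      // a & ~hi
  }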

Sesse__•4mo ago
I don't know either; I was talking about the lack of PMOVMSKB in NEON, and then the comment I replied to started talking about how not having (1 << k) - 1 as a single instruction sucks. I don't think it's a single instruction in any non-NEON set either.
robert3005•4mo ago
Highly recommend https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf for a comparable algorithm. It generalizes to arbitrary input and output bit widths.
ashtonsix•4mo ago
Good work, wish I'd seen it earlier as it overlaps with a lot of my recent work. I'm actually planning to release new SOTAs on zigzag/delta/delta-of-delta/xor-with-previous coding next week. Some areas the work doesn't give enough attention to (IMO): register pressure, kernel fusion, cache locality (wrt multi-pass). They also fudge a lot of the numbers and comparisons, e.g. pitching themselves against Parquet (storage-optimised) when Arrow (compute-optimised) is the most comparable tech and the obvious target to beat. They definitely improve on the current best work, but only by a modest margin.

I'm also skeptical of the "unified" paradigm: performance improvements are often realised by stepping away from generalisation and exploiting the specifics of a given problem; under a unified paradigm there's definite room for improvement vs Arrow, but that's very unlikely to bring you all the way to theoretically optimal performance.

fzeroff•4mo ago
Very interesting! Uh, hope you don't mind my basic question, but what would you use that for?
ashtonsix•4mo ago
If you have an array of numbers with a known upper bound, such as enums with 8 possible values (representable with 3 bits), and a memory-bound operation on those numbers, e.g. `for (int i = 0; i < n; i++) if (user_category[i] == 0) filtered.push_back(i);`, which is common in data warehouses, using my code can more than 2x performance by allowing more efficient usage of the DRAM<->CPU bus.
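
As a toy illustration of the bandwidth saving (plain scalar code, nothing like the repo's NEON kernels): eight 3-bit category codes fit in 3 bytes, so the filter scan above reads 3 bytes from DRAM where an unpacked array would read 8.

  #include <stdint.h>

  // Pack eight 3-bit values into 3 bytes (24 bits)...
  void pack8x3(const uint8_t in[8], uint8_t out[3]) {
      uint32_t bits = 0;
      for (int i = 0; i < 8; i++)
          bits |= (uint32_t)(in[i] & 0x7) << (3 * i);
      out[0] = bits & 0xFF;
      out[1] = (bits >> 8) & 0xFF;
      out[2] = (bits >> 16) & 0xFF;
  }

  // ...and read value i back out.
  uint8_t unpack3(const uint8_t packed[3], int i) {
      uint32_t bits = packed[0] | ((uint32_t)packed[1] << 8) |
                      ((uint32_t)packed[2] << 16);
      return (bits >> (3 * i)) & 0x7;
  }
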
6r17•4mo ago
I feel like I could learn a lot just studying this; just a curiosity: how do you know if stuff is within L1 cache or not? Are there kernel functions for that, or is it just through benching?
ashtonsix•4mo ago
From the working set size and knowledge of hardware cache behaviour. Whenever you access data from memory that isn't already in cache, it's copied four times: to L3, L2, L1 and to CPU registers. As you access data, the hardware evicts old cache entries to make space for it.

If you loop through an array once and then iterate through it again, you can figure out where it will be cached based on the array size. For example, the 16 KiB buffer used in the benchmark above fits comfortably in Neoverse V2's 64 KiB L1d, so every pass after the first is served from L1.

saati•4mo ago
Does it fit in 32K? Does it have some weird aliasing issue because too many power-of-two sizes caused cache evictions? And if you don't know the answer to these, just check the L1d hit rate with perf.
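(For example, something like perf stat -e L1-dcache-loads,L1-dcache-load-misses ./out/bytepack_eval reports L1d loads and misses, and hence the hit rate; the exact event names available vary by CPU and kernel.)
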
crest•4mo ago
ARMv8-A has nice scalar bit (un)packing instructions. I wonder if NEON is really an improvement over those, given that ARM cores tend to have few SIMD ports and NEON is just 128 bits wide.
ashtonsix•4mo ago
I'm assuming you're referring to BFM/EXTR? NEON absolutely improves here.

The core I developed on (Neoverse V2) has 4 SIMD ports and 6 scalar integer ports; however, only 2 of those scalar ports support multi-cycle integer operations like the insert variant of BFM (essential for scalar packing).

More importantly, NEON progresses 16 elements per instruction instead of 1.