frontpage.

Maple Mono: Smooth your coding flow

https://font.subf.dev/en/
1•signa11•4m ago•0 comments

Sid Meier's System for Real-Time Music Composition and Synthesis

https://patents.google.com/patent/US5496962A/en
1•GaryBluto•12m ago•1 comments

Show HN: Slop News – HN front page now, but it's all slop

https://dosaygo-studio.github.io/hn-front-page-2035/slop-news
3•keepamovin•13m ago•1 comments

Show HN: Empusa – Visual debugger to catch and resume AI agent retry loops

https://github.com/justin55afdfdsf5ds45f4ds5f45ds4/EmpusaAI
1•justinlord•15m ago•0 comments

Show HN: Bitcoin wallet on NXP SE050 secure element, Tor-only open source

https://github.com/0xdeadbeefnetwork/sigil-web
2•sickthecat•18m ago•1 comments

White House Explores Opening Antitrust Probe on Homebuilders

https://www.bloomberg.com/news/articles/2026-02-06/white-house-explores-opening-antitrust-probe-i...
1•petethomas•18m ago•0 comments

Show HN: MindDraft – AI task app with smart actions and auto expense tracking

https://minddraft.ai
2•imthepk•23m ago•0 comments

How do you estimate AI app development costs accurately?

1•insights123•24m ago•0 comments

Going Through Snowden Documents, Part 5

https://libroot.org/posts/going-through-snowden-documents-part-5/
1•goto1•24m ago•0 comments

Show HN: MCP Server for TradeStation

https://github.com/theelderwand/tradestation-mcp
1•theelderwand•27m ago•0 comments

Canada unveils auto industry plan in latest pivot away from US

https://www.bbc.com/news/articles/cvgd2j80klmo
2•breve•28m ago•1 comments

The essential Reinhold Niebuhr: selected essays and addresses

https://archive.org/details/essentialreinhol0000nieb
1•baxtr•31m ago•0 comments

Rentahuman.ai Turns Humans into On-Demand Labor for AI Agents

https://www.forbes.com/sites/ronschmelzer/2026/02/05/when-ai-agents-start-hiring-humans-rentahuma...
1•tempodox•32m ago•0 comments

StovexGlobal – Compliance Gaps to Note

1•ReviewShield•35m ago•1 comments

Show HN: Afelyon – Turns Jira tickets into production-ready PRs (multi-repo)

https://afelyon.com/
1•AbduNebu•36m ago•0 comments

Trump says America should move on from Epstein – it may not be that easy

https://www.bbc.com/news/articles/cy4gj71z0m0o
6•tempodox•37m ago•2 comments

Tiny Clippy – A native Office Assistant built in Rust and egui

https://github.com/salva-imm/tiny-clippy
1•salvadorda656•41m ago•0 comments

LegalArgumentException: From Courtrooms to Clojure – Sen [video]

https://www.youtube.com/watch?v=cmMQbsOTX-o
1•adityaathalye•44m ago•0 comments

US moves to deport 5-year-old detained in Minnesota

https://www.reuters.com/legal/government/us-moves-deport-5-year-old-detained-minnesota-2026-02-06/
8•petethomas•47m ago•2 comments

If you lose your passport in Austria, head for McDonald's Golden Arches

https://www.cbsnews.com/news/us-embassy-mcdonalds-restaurants-austria-hotline-americans-consular-...
1•thunderbong•52m ago•0 comments

Show HN: Mermaid Formatter – CLI and library to auto-format Mermaid diagrams

https://github.com/chenyanchen/mermaid-formatter
1•astm•1h ago•0 comments

RFCs vs. READMEs: The Evolution of Protocols

https://h3manth.com/scribe/rfcs-vs-readmes/
3•init0•1h ago•1 comments

Kanchipuram Saris and Thinking Machines

https://altermag.com/articles/kanchipuram-saris-and-thinking-machines
1•trojanalert•1h ago•0 comments

Chinese chemical supplier causes global baby formula recall

https://www.reuters.com/business/healthcare-pharmaceuticals/nestle-widens-french-infant-formula-r...
2•fkdk•1h ago•0 comments

I've used AI to write 100% of my code for a year as an engineer

https://old.reddit.com/r/ClaudeCode/comments/1qxvobt/ive_used_ai_to_write_100_of_my_code_for_1_ye...
2•ukuina•1h ago•1 comments

Looking for 4 Autistic Co-Founders for AI Startup (Equity-Based)

1•au-ai-aisl•1h ago•1 comments

AI-native capabilities, a new API Catalog, and updated plans and pricing

https://blog.postman.com/new-capabilities-march-2026/
1•thunderbong•1h ago•0 comments

What changed in tech from 2010 to 2020?

https://www.tedsanders.com/what-changed-in-tech-from-2010-to-2020/
3•endorphine•1h ago•0 comments

From Human Ergonomics to Agent Ergonomics

https://wesmckinney.com/blog/agent-ergonomics/
1•Anon84•1h ago•0 comments

Advanced Inertial Reference Sphere

https://en.wikipedia.org/wiki/Advanced_Inertial_Reference_Sphere
1•cyanf•1h ago•0 comments

86 GB/s bitpacking with ARM SIMD (single thread)

https://github.com/ashtonsix/perf-portfolio/tree/main/bytepack
132•ashtonsix•4mo ago

Comments

Retr0id•4mo ago
I tried to run the benchmark on my M1 Pro macbook, but the "baseline" is written with x86 intrinsics and won't compile.

Are the benchmark results in the README real? (The README itself feels very AI-generated)

Looking at the makefile, it tries to link the x86 SSE "baseline" implementation and the NEON version into the same binary. A real headscratcher!

Edit: The SSE impl gets shimmed via simd-everywhere, and the benchmark results do seem legit (aside from being slightly apples-to-oranges, but that's unavoidable)

Asmod4n•4mo ago
Maybe this could help you: https://github.com/simd-everywhere/simde/issues/1099
Retr0id•4mo ago
But this project isn't using simd-everywhere. I'd like to reproduce the results as documented in the README
guipsp•4mo ago
Look at the parent dir. I agree it is a bit confusing
Retr0id•4mo ago
Ah! Yup, that works, I can compile the binary. I get an "Illegal instruction" error when I run it but that's probably just because M1 doesn't support some of the NEON instructions. I retract my implicit AI-slop accusations.
Retr0id•4mo ago
Results from M1 Pro (after setting CPU=native in the makefile): https://gist.github.com/DavidBuchanan314/e3cde76e4dab2758ec4...
ashtonsix•4mo ago
Thank you so much for attempting a reproduction! (I posted this on Reddit and most commenters didn't even click the link)

For the baseline you need SIMDe headers: https://github.com/simd-everywhere/simde/tree/master/simde. These alias x86 intrinsics to ARM intrinsics. The baseline is based on the previous State-of-The-Art (https://arxiv.org/abs/1209.2137) which happens to be x86-based; using SIMDe to compile was the highest-integrity way I could think of to compare with the previous SOTA.

Note: M1 chips specifically have notoriously bad small-shift performance, so the benchmark results will be very bad on your machine. M3 partially fixed this, and M4 fixed it completely. My primary target is server-class rather than consumer-class hardware, so I'm not too worried about this.

The benchmark results were copy-pasted from the terminal. The README prose was AI-generated from my rough notes (I'm confident when communicating with other experts/researchers, but less so when communicating with a general audience).
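
For anyone unfamiliar with SIMDe, the aliasing works roughly like the sketch below (illustrative only, not code from the repo; the function name is made up). With native aliases enabled, unmodified SSE intrinsics compile straight to NEON on AArch64:

  // Sketch of the SIMDe shim: define the alias macro before including the
  // x86 header and the usual _mm_* names become available on ARM via NEON.
  #define SIMDE_ENABLE_NATIVE_ALIASES
  #include "simde/x86/sse2.h"

  __m128i mask_bytes(__m128i v, __m128i m) {
      return _mm_and_si128(v, m);  // lowered to a NEON AND under the hood
  }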

ozgrakkurt•4mo ago
Super cool!

Pretty sure anyone reading this kind of SIMD post would prefer your writing to an LLM's.

deadmutex•4mo ago
Here is a repro using GCE's C4A Axion instances (c4a-highcpu-72). Seems to beat Graviton? Maybe the title of the thread can be updated to a larger number? :) I used the largest instance to avoid noisy neighbor issues.

  $ ./out/bytepack_eval
  Bytepack Bench — 16 KiB, reps=20000 (pinned if available)
  Throughput GB/s

  K  NEON pack   NEON unpack  Baseline pack   Baseline unpack
  1  94.77       84.05        45.01           63.12          
  2  123.63      94.74        52.70           66.63          
  3  94.62       83.89        45.32           68.43          
  4  112.68      77.91        58.10           78.20          
  5  86.96       80.02        44.32           60.77          
  6  93.50       92.08        51.22           67.20          
  7  87.10       79.53        43.94           57.95          
  8  90.49       92.36        68.99           83.88
ashtonsix•4mo ago
Oh nice! Axion C4A and Graviton4 use the same core (Neoverse V2), so the performance difference is due to factors like clock speed and power management.

I used a geometric mean to calculate the top-line "86 GB/s" for NEON pack/unpack; so that's 91 GB/s for the C4A repro. Probably going to leave the title unmodified.
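
For anyone checking the arithmetic, here is a minimal sketch that reproduces that figure from the C4A table above (a hypothetical standalone file, not part of the repo; compile with -lm):

  // Geometric mean of the 16 NEON pack/unpack throughputs reported above.
  #include <math.h>
  #include <stdio.h>

  int main(void) {
      const double gbps[] = {94.77, 84.05, 123.63, 94.74, 94.62, 83.89,
                             112.68, 77.91, 86.96, 80.02, 93.50, 92.08,
                             87.10, 79.53, 90.49, 92.36};
      const int n = sizeof gbps / sizeof gbps[0];
      double log_sum = 0.0;
      for (int i = 0; i < n; i++) log_sum += log(gbps[i]);
      printf("geomean: %.1f GB/s\n", exp(log_sum / n));  // prints ~91 GB/s
      return 0;
  }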

danlark1•4mo ago
Great work!

There's a popular narrative that NEON does not have a movemask alternative. Some time ago I published an article on simulating popular bit-packing use cases with NEON in 1-2 instructions. It does not cover unpacking, but it can be great for real-world applications like compare+find, compare+iterate, and compare+test.

https://community.arm.com/arm-community-blogs/b/servers-and-...

ashtonsix•4mo ago
Nice article! I personally find the ARM ISA far more cohesive than x86's: far fewer historical quirks. I also really appreciate the ubiquity of support for 8-bit elements in ARM and the absence of SMT (both make performance much more predictable).
Sesse__•4mo ago
I never understood why they couldn't just include a movmskb instruction to begin with. It's massively useful for integer tasks, not expensive to implement as far as I know, and the vshrn+mov trick often requires an extra instruction either in front or after (in addition to just being, well, pretty obscure).

NEON in general is a bit sad, really; it's built around the idea of being implementable with a 64-bit ALU, and it shows. And SVE support is pretty much non-existent on the client.
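
For context, the vshrn trick mentioned here is usually written something like the sketch below (illustrative only, assuming a byte-wise compare result of 0x00/0xFF per lane; the function name is made up). It yields a 64-bit mask with one nibble per input byte rather than one bit per byte, which is where the extra fix-up instruction often comes from, e.g. a first-match index needs __builtin_ctzll(mask) >> 2:

  #include <arm_neon.h>

  // Collapse a 128-bit byte-wise compare mask into a 64-bit nibble mask:
  // shift each 16-bit lane right by 4 and narrow, so every input byte
  // contributes one nibble to the result.
  static inline uint64_t neon_movemask_nibbles(uint8x16_t cmp) {
      uint8x8_t narrowed = vshrn_n_u16(vreinterpretq_u16_u8(cmp), 4);
      return vget_lane_u64(vreinterpret_u64_u8(narrowed), 0);
  }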

ashtonsix•4mo ago
Not having (1 << k) - 1 as a single instruction sucks when it HAS to be in a hot loop, but you can usually hoist this to the loop prologue: my stuff uses dummy inline-assembly hints like `asm volatile("" : "+w"(m));` to force compilers to do this.

I personally think calibrating ARM's ISA on smaller VL was a good choice: you get much better IPC. You also have an almost-complete absence of support for 8-bit elements with x86 ISAs, so elements per instruction is tied. And NEON kind-of-ish makes up for its small VL with multi-register TBL/TBX and LDP/STP.

Also: AVX512 is just as non-existent on clients as SVE2, although that's not really relevant for the server-side targets I'm optimising for (mostly OLAP).
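
A minimal sketch of that hoist as described above (the function and buffer names here are made up, not from the repo):

  #include <arm_neon.h>
  #include <stddef.h>
  #include <stdint.h>

  void and_low_bits(uint8_t *dst, const uint8_t *src, size_t n, unsigned k) {
      // Build (1 << k) - 1 once in the loop prologue...
      uint8x16_t m = vdupq_n_u8((uint8_t)((1u << k) - 1));
      // ...and make it opaque to the optimizer so it stays in a register
      // instead of being rematerialized inside the hot loop.
      asm volatile("" : "+w"(m));
      for (size_t i = 0; i + 16 <= n; i += 16) {
          uint8x16_t v = vld1q_u8(src + i);
          vst1q_u8(dst + i, vandq_u8(v, m));
      }
  }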

TinkersW•4mo ago
8-bit is not absent in x86 SIMD; it is slightly less covered than 32 & 16 bit, but you can fully implement all the common 8-bit ops and most are 1 instruction (with AVX2). There are even various horizontal ops on 8-bit values (avg, dot, etc.).

Also, AVX512 is way more common than SVE2; all Zen4 & Zen5 chips support it.

dzaima•4mo ago
More specifically, basically the only absent 8-bit ops that have 32-bit equivalents in AVX2 are shifts and multiplies. Shifts are quite annoying (though with a uniform shift they can be emulated on AVX-512 via GFNI abuse in 1 instr); multiplies are rather rare (though note that there is vpmaddubsw for an 8-bit→16-bit multiply-add). There's even a case of the opposite: saturating add/sub exist for 8-bit and 16-bit ints, but not wider.
namibj•4mo ago
GFNI is distinct from AVX-512; it was merely introduced in cores that also had AVX-512.
Sesse__•4mo ago
Does _any_ SIMD instruction set have (1 << k) - 1 as a single instruction?
camel-cdr•4mo ago
Not sure in which context this is used, but you can do -1 << k in most ISAs; that still requires a bit-not, though. If you want to use the value in a bitwise instruction, there are often variants that can invert the input operand.

E.g. in RVV, instead of vand.vv(a, vadd.vi(vsll.vv(1,k), -1)) you could do vandn.vv(a, vsll.vv(-1,k)).

AVX-512 can do this with any binary or ternary bitwise logic function via vpternlog.
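
NEON has one of those inverted-operand variants too: BIC computes a & ~b, so you can mask to the low k bits without ever materializing (1 << k) - 1. A small sketch (illustrative only, not from the repo):

  #include <arm_neon.h>

  // a & ((1 << k) - 1)  ==  a & ~(~0 << k); BIC gives the AND-NOT in one op.
  uint32x4_t and_low_k_bits(uint32x4_t a, int k) {
      uint32x4_t hi = vshlq_u32(vdupq_n_u32(~0u), vdupq_n_s32(k));  // ~0 << k
      return vbicq_u32(a, hi);                                      // a & ~hi
  }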

Sesse__•4mo ago
I don't know either; I was talking about the lack of PMOVMSKB in NEON, and then the comment I replied to started talking about how not having (1 << k) - 1 as a single instruction sucks. I don't think it's a single instruction in any non-NEON set either.
robert3005•4mo ago
Highly recommend https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf for a comparable algorithm. It generalizes to arbitrary input and output bit widths.
ashtonsix•4mo ago
Good work, wish I'd seen it earlier as it overlaps with a lot of my recent work. I'm actually planning to release new SOTAs on zigzag/delta/delta-of-delta/xor-with-previous coding next week. Some areas the work doesn't give enough attention to (IMO): register pressure, kernel fusion, cache locality (wrt multi-pass). They also fudge a lot of the numbers and comparisons, e.g. pitching themselves against Parquet (storage-optimised) when Arrow (compute-optimised) is the most comparable tech and the obvious target to beat. They definitely improve on the current best work, but only by a modest margin.

I'm also skeptical of the "unified" paradigm: performance improvements are often realised by stepping away from generalisation and exploiting the specifics of a given problem; under a unified paradigm there's definite room for improvement vs Arrow, but that's very unlikely to bring you all the way to theoretically optimal performance.

fzeroff•4mo ago
Very interesting! Uh, hope you don't mind my basic question, but what would you use that for?
ashtonsix•4mo ago
If you have an array of numbers with a known upper bound, such as enums with 8 possible values (representable with 3 bits), and a memory-bound operation on those numbers, e.g. `for (int i = 0; i < n; i++) if (user_category[i] == 0) filtered.push_back(i);`, which is common in data warehouses, using my code can more than 2x performance by allowing more efficient usage of the DRAM<->CPU bus.
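
As a toy illustration of the bandwidth saving (plain scalar code, nothing like the repo's NEON kernels): eight 3-bit category codes fit in 3 bytes, so the filter scan above reads 3 bytes from DRAM where an unpacked array would read 8.

  #include <stdint.h>

  // Pack eight 3-bit values into 3 bytes (24 bits)...
  void pack8x3(const uint8_t in[8], uint8_t out[3]) {
      uint32_t bits = 0;
      for (int i = 0; i < 8; i++)
          bits |= (uint32_t)(in[i] & 0x7) << (3 * i);
      out[0] = bits & 0xFF;
      out[1] = (bits >> 8) & 0xFF;
      out[2] = (bits >> 16) & 0xFF;
  }

  // ...and read value i back out.
  uint8_t unpack3(const uint8_t packed[3], int i) {
      uint32_t bits = packed[0] | ((uint32_t)packed[1] << 8) |
                      ((uint32_t)packed[2] << 16);
      return (bits >> (3 * i)) & 0x7;
  }
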
6r17•4mo ago
I feel like I could learn a lot just studying this; just a curiosity: how do you know if stuff is within L1 cache or not? Are there kernel functions for that, or is it just through benching?
ashtonsix•4mo ago
From the working set size and knowledge of hardware cache behaviour. Whenever you access data from memory that isn't already in cache, it's copied four times: to L3, L2, L1 and to CPU registers. As you access data, the hardware evicts old cache entries to make space for it.

If you loop through an array once and then iterate through it again, you can figure out where it will be cached based on the array size. For example, the 16 KiB buffer used in the benchmark above fits comfortably in Neoverse V2's 64 KiB L1d, so every pass after the first is served from L1.

saati•4mo ago
Does it fit in 32K? Does it have some weird aliasing issue because too many power-of-two sizes caused cache evictions? And if you don't know the answer to these, just check the L1d hit rate with perf.
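(For example, something like perf stat -e L1-dcache-loads,L1-dcache-load-misses ./out/bytepack_eval reports L1d loads and misses, and hence the hit rate; the exact event names available vary by CPU and kernel.)
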
crest•4mo ago
ARMv8-A has nice scalar bit (un)packing instructions. I wonder if NEON is really an improvement over those, given that ARM cores tend to have few SIMD ports and NEON is just 128 bits wide.
ashtonsix•4mo ago
I'm assuming you're referring to BFM/EXTR? NEON absolutely improves here.

The core I developed on (Neoverse V2) has 4 SIMD ports and 6 scalar integer ports; however, only 2 of those scalar ports support multi-cycle integer operations like the insert variant of BFM (essential for scalar packing).

More importantly, NEON progresses 16 elements per instruction instead of 1.