Neywiny•1d ago
This is good work. I would say that there was very little reverse engineering, but that's fine. It's interesting seeing some companies look at ARM's Ethos line as holding them back and others as pulling them forward. I'm not sure if ARM is the best solution, but all these different NPUs feel a bit like the early days of CPU architectures and compilers. Hopefully we can make it through unscathed, so at least we get better error messages, or maybe even compilers that know those kinds of idiosyncrasies well enough to avoid such things.
kvuj•1d ago
Awesome! Finally putting the "Hacker" back in "Hacker News".
doctorpangloss•1d ago
hacker news needs a reprieve from "Problem. The fix? Vibe coding session. Here's the ChatGPT report"
poad4242•7h ago
I understand the frustration with AI-written posts lately, but this was the opposite of that. It took months of hard work and many late nights. While the hardware manual (TRM) is public, it doesn't explain how to handle the strict 4KB memory bank limits. I had to figure out how to shard and tile the model because the hardware won't let you store data across those banks without crashing. It was a long battle with memory constraints to get that 15x speedup.
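To give a flavor of the fix: every tensor got chopped so no chunk spans a bank boundary, and the compute loop walks chunks instead of whole matrices. A simplified NumPy sketch (illustrative names and shapes, not the production code):

```python
import numpy as np

BANK_BYTES = 4096  # the hard per-bank limit from the TRM


def shard_for_banks(tensor: np.ndarray) -> list[np.ndarray]:
    """Split a 2-D tensor into row-chunks that each fit in one 4KB bank."""
    row_bytes = tensor.shape[1] * tensor.dtype.itemsize
    if row_bytes > BANK_BYTES:
        raise ValueError("a single row exceeds one bank; tile columns first")
    rows_per_bank = BANK_BYTES // row_bytes
    return [tensor[i:i + rows_per_bank]
            for i in range(0, tensor.shape[0], rows_per_bank)]


# Example: an int8 weight matrix split into bank-sized shards.
weights = np.zeros((512, 256), dtype=np.int8)  # 256 bytes per row
shards = shard_for_banks(weights)              # 16 rows per bank -> 32 shards
assert all(s.nbytes <= BANK_BYTES for s in shards)
```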
PunchyHamster•1d ago
we need a RISC-V equivalent, but for NPUs; it's become a royal mess over the last few years
Neywiny•1d ago
It's starting. Some designs are moving towards RVV cores with very wide vector lengths (1K, maybe even 2K bits?). So less a giant matrix multiplication unit (I think TI has some parts with what they literally call MMUs, great work guys), more a bunch of DSP-heavy CPUs. In the age of x86 splitting over AVX-512, it's interesting.
poad4242•7h ago
Hello! Author of the post here, happy to answer questions. I have a draft white paper that goes into more detail on the process. Let me know if I should put it up on GitHub or arXiv.
jauntywundrkind•1d ago
For what it's worth, it seems like there's a bunch of open-source NPU work in progress too. There's a Gallium3D layer, "TEFLON", shared by most of these drivers, which TensorFlow Lite can use as a delegate. Then there are hardware drivers for Rockchip (via the ROCKET driver) and Vivante (with their Etnaviv driver). It'd be extra interesting now to see how (or if?) they've dealt with the system constraints (small scratchpad size) here. https://www.phoronix.com/news/Gallium3D-Teflon-Merged https://www.phoronix.com/news/Rockchip-NPU-Linux-Mesa https://www.phoronix.com/news/Two-NPU-Accel-Drivers-2026
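If I understand the plumbing right, Teflon plugs in as a TensorFlow Lite external delegate, so using it should look roughly like this (untested; the .so path depends on where your Mesa build installs it, and the model file is just a placeholder):

```python
import tflite_runtime.interpreter as tflite  # or tf.lite from full TensorFlow

# Load Mesa's Teflon delegate; the exact path is build-dependent (assumption).
teflon = tflite.load_delegate("/usr/lib/libteflon.so")

# Any ops the NPU driver can't handle fall back to the CPU kernels.
interpreter = tflite.Interpreter(
    model_path="mobilenet_v1_quant.tflite",  # hypothetical model file
    experimental_delegates=[teflon],
)
interpreter.allocate_tensors()
```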
poad4242•7h ago
Thanks! I actually tracked the Teflon/ROCKET driver work closely during my initial research (it was the 'Plan B' in my original proposal if the vendor blobs failed entirely).

The main reason I stuck with the closed-source `rknn` stack for this specific project was operator support for Transformers. Teflon is getting great at standard CNN ops (fused ReLU, convs, etc.), but the SigLIP vision encoder relies on massive Transposes and unbounded GELU activations that currently fall off the 'happy path' in the open stack.
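For anyone curious why GELU is the sticking point: the usual workaround is to rewrite it in terms of primitives the hardware does support. Here's a quick NumPy sketch of the standard tanh approximation (illustrative only, not what the rknn compiler actually emits):

```python
import numpy as np

def gelu_tanh_approx(x: np.ndarray) -> np.ndarray:
    """GELU rebuilt from mul/add/tanh, ops most NPUs already have."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4, 4, 9, dtype=np.float32)
print(gelu_tanh_approx(x))  # close to exact GELU, no erf() needed
```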
To your point on the system constraints (small scratchpad): I suspect the current open-source drivers would hit the exact same 32KB SRAM wall I found. The hardware simply refuses to tile large matrices automatically. My 'Nano-Tiling' fix was a software-level patch; porting that logic into the Mesa driver itself would probably be the 'Holy Grail' fix here.
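The 'Nano-Tiling' idea itself isn't exotic: pick tile sizes so that the two input tiles plus the accumulator tile never exceed the scratchpad, then loop. A simplified NumPy sketch of the shape of it (illustrative sizes and names, nothing from the actual patch):

```python
import numpy as np

SRAM_BYTES = 32 * 1024  # the scratchpad ceiling I kept hitting

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = 32) -> np.ndarray:
    """Matmul where each (A-tile, B-tile, C-tile) working set fits in SRAM."""
    working_set = 3 * tile * tile * a.dtype.itemsize
    assert working_set <= SRAM_BYTES, "tiles would overflow the scratchpad"
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                # every slice here is small enough to live in on-chip SRAM
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c

# Sanity check against a plain matmul.
a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)
np.testing.assert_allclose(tiled_matmul(a, b), a @ b, rtol=1e-4)
```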