MilliForth-6502, A Forth For The 6502 CPU

https://github.com/agsb/milliForth-6502
1•droideqa•3m ago•0 comments

Vibe Check: Claude 4 Opus

https://every.to/chain-of-thought/vibe-check-claude-4-sonnet
2•swolpers•4m ago•0 comments

Ask HN: Would a combination of Snapchat and Reddit be interesting?

1•busymom0•5m ago•0 comments

Show HN: Kubiks – Real-Time Service Map for K8s, Cloud, CDN, WAF, and More

https://demo.kubiks.ai/views/servicemap
1•alex_holovach•6m ago•0 comments

We’ll be ending web hosting for your apps on Glitch

https://blog.glitch.com/post/changes-are-coming-to-glitch/
3•js4ever•6m ago•1 comments

Apple Turnaround

https://hypercritical.co/2025/05/20/apple-turnaround
1•k2enemy•8m ago•0 comments

Show HN: FaceAge AI – Guess Your Face, Eye, Wrinkle, and Skin Age from a Photo

https://face-age.ai
1•gamificationpan•9m ago•0 comments

Ask HN: What's your biggest pain point with AI inference costs?

1•cryptolibertus•9m ago•0 comments

Activating AI Safety Level 3 Protections

https://www.anthropic.com/news/activating-asl3-protections
3•Bluestein•10m ago•0 comments

Welcome to the AI Trough of Disillusionment

https://www.economist.com/business/2025/05/21/welcome-to-the-ai-trough-of-disillusionment
3•kgwgk•11m ago•1 comments

A Pyramid-Shaped Career

https://jackdanger.com/pyramid-shaped-career/
1•gpi•12m ago•0 comments

US Mint moves forward with plans to kill the penny

https://apnews.com/article/us-mint-treasury-department-penny-end-production-daf6367d7e8d31d6783720d5d4667115
3•Ozarkian•12m ago•0 comments

North Korean Navy Diver Reacts to US Navy SEALs Training

https://www.youtube.com/watch?v=pf7R5UHmu2Y
1•doener•12m ago•0 comments

Microsoft-backed Builder.ai collapsed after finding potentially bogus sales

https://www.ft.com/content/926f4969-fda7-4e78-b106-4888c8704bda
5•frereubu•13m ago•1 comments

Bankrupt Electric Bus Maker Lion Rescued by Quebec Investors

https://www.bloomberg.com/news/articles/2025-05-22/bankrupt-electric-bus-maker-lion-rescued-by-quebec-investors
3•toomuchtodo•14m ago•1 comments

May 2025 OpenZFS Leadership Meeting [video]

https://www.youtube.com/watch?v=MifloJFCpLU
1•garrettjoecox•17m ago•0 comments

Show HN: Free Prompt Grading Tool

https://www.getbasalt.ai/grade-my-prompt
1•fdefitte•20m ago•1 comments

System Card: Claude Opus 4 and Claude Sonnet 4 [pdf]

https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf
1•Philpax•20m ago•0 comments

Warning Signs Your App Authorization Is a Ticking Time Bomb

https://www.osohq.com/post/app-authorization-warning-signs
3•meghan•20m ago•0 comments

Synopsis of the Language Joy (2001)

https://hypercubed.github.io/joy/html/forth-joy.html
2•droideqa•20m ago•0 comments

R2: Ready for the Wild

https://stories.rivian.com/r2-validation-wild-wraps-05-2025
2•gnabgib•21m ago•1 comments

Is there PFAS in your pint?

https://www.thenewlede.org/2025/05/pfas-in-beer/
1•PaulHoule•21m ago•0 comments

Show HN: DockFlow – Switch between multiple macOS Dock layouts instantly

https://dockflow.appitstudio.com/
9•pugdogdev•22m ago•0 comments

Sending an Alert for Short Wait Time at Disney

https://www.raymondcamden.com/2025/05/16/sending-an-alert-for-short-wait-time-at-disney
1•todsacerdoti•24m ago•0 comments

Cardiac Events in Adults Hospitalized for RSV vs. Covid or Influenza

https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2834362
1•geox•24m ago•0 comments

How two students monitor laundry machines in a college dorm with Grafana

https://grafana.com/blog/2025/05/16/dashboards-and-detergent-how-two-students-monitor-laundry-machines-in-a-college-dorm-with-grafana/
1•codaea•25m ago•0 comments

Dawn Aerospace Begins Taking Orders for Aurora Spaceplane

https://www.dawnaerospace.com/latest-news/product-release-aurora-spaceplane
1•LorenDB•25m ago•0 comments

Irish privacy watchdog OKs Meta to train AI on EU folks' posts

https://www.theregister.com/2025/05/22/irish_data_protection_commission_gives/
1•rntn•26m ago•0 comments

Ask HN: What makes a programming language great for code generation?

2•keithasaurus•28m ago•1 comments

This Kidney Was Frozen for 10 Days. Could Surgeons Transplant It?

https://www.nytimes.com/2025/04/14/health/frozen-kidney-organ-transplant.html
2•Teever•31m ago•1 comments

Improving performance of rav1d video decoder

https://ohadravid.github.io/posts/2025-05-rav1d-faster/
198•todsacerdoti•5h ago

Comments

robertknight•5h ago
Good post! The inefficient code for comparing pairs of 16-bit integers was an interesting find.
ohr•5h ago
Thanks! Would be interesting to see if Rust/LLVM folks can get the compiler to apply this optimization whenever possible, as Rust can be much more accurate w.r.t memory initialization.
Ygg2•4h ago
Would be great, but I wouldn't hold my breath for it. LLVM and rustc can both be kinda slow to stabilize.
pornel•3h ago
It varies. New public APIs or language features may take a long time, but changes to internals and missed optimizations can be fixed in days or weeks, in both LLVM and Rust.
adgjlsfhk1•4h ago
I think Rust may be able to get it by adding a `freeze` intrinsic to the codegen here. That would force LLVM to pick a deterministic value if there was poison, and should thus unblock the optimization (which is fine here because we know the value isn't poison).
kukkamario•2h ago
I think the Rust and C code aren't equivalent in this case, which may have caused this slowdown. The union trick also affects alignment: the C-side struct is 32-bit aligned, but the Rust struct only has 16-bit alignment because it only contains 16-bit fields. In practice the fields are likely aligned to 32 bits anyway, but compiler optimizations may have a hard time verifying that.

Have you tried manually defining the alignment of the Rust struct?
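(For reference, a minimal sketch of what forcing the alignment might look like; the type and field names are made up for illustration and are not taken from rav1d.)

```rust
// Hypothetical two-field struct that, like the C union version, should be
// treated as one 32-bit-aligned unit.
#[derive(Clone, Copy, PartialEq, Eq)]
#[repr(C, align(4))] // raise alignment from 2 to 4 bytes, matching the C side
struct MotionVec {
    y: i16,
    x: i16,
}

fn main() {
    assert_eq!(std::mem::align_of::<MotionVec>(), 4);
    assert_eq!(std::mem::size_of::<MotionVec>(), 4);
}
```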

infogulch•4h ago
You know it's a good post when it starts with a funny meme. Seems related to the recent discussion: $20K Bounty Offered for Optimizing Rust Code in Rav1d AV1 Decoder (memorysafety.org) | 108 comments | https://news.ycombinator.com/item?id=43982238
brookst•4h ago
Title undersells post; it’s actually 2.3% faster with two good optimizations.
ohr•4h ago
I think that since the 1.5% one is only for aarch64, it's a bit unfair to claim the full number; more like half of it, if you consider arm/x86 to be the majority of the (future) deployments.
brookst•4h ago
I suppose that’s fair, but I’d give credit for a 2.3% improvement in the test environment. For all we know it may be a net loss in other environments due to quirks (probably not, admittedly).
mmastrac•4h ago
The associated issue for comparing two u16s is interesting.

https://github.com/rust-lang/rust/issues/140167
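(For context, here is a simplified sketch of the kind of pattern that issue is about, and of the "transmute trick" mentioned elsewhere in the thread; the names are illustrative, not the actual rav1d code.)

```rust
use std::mem::transmute;

#[derive(Clone, Copy)]
#[repr(C, align(4))]
struct Pair {
    a: u16,
    b: u16,
}

// Field-by-field comparison: the optimizer does not always merge this into
// a single 32-bit load-and-compare, which is what the issue tracks.
fn eq_fields(x: Pair, y: Pair) -> bool {
    x.a == y.a && x.b == y.b
}

// The workaround: compare the whole 4-byte representation at once.
fn eq_as_u32(x: Pair, y: Pair) -> bool {
    // SAFETY: Pair is #[repr(C)], exactly 4 bytes with no padding, so every
    // bit pattern of those bytes is a valid u32.
    unsafe { transmute::<Pair, u32>(x) == transmute::<Pair, u32>(y) }
}

fn main() {
    let p = Pair { a: 1, b: 2 };
    let q = Pair { a: 1, b: 2 };
    assert!(eq_fields(p, q) && eq_as_u32(p, q));
}
```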

heybales•3h ago
The thing I like most about this is that the discussion isn't just 14 pages of "I'm having this issue as well" and "Any updates on when this will be fixed?" As a web dev, I find that GitHub issues kinda suck.
eterm•1h ago
It was worse before emoji reactions were added, when 90% of messages were literally just "+1".
heybales•19m ago
+1
tialaramex•4h ago
All else being equal, codecs ought to be in WUFFS† rather than Rust, but I can well imagine that it's a much bigger lift to take something as complicated as dav1d and write the analogous WUFFS than to clean up the c2rust translation; if you said a thousand times harder, I'd have no trouble believing that. I just think it's worth it for us as a civilisation.

† Or an equivalent special purpose language, but WUFFS is right there

IgorPartola•3h ago
WUFFS would be great for parsing container files (Matroska, WebM, MP4), but it does not seem at all suitable for a video decoder. Without dynamic memory allocation it would be challenging to deal with dynamic data. Video codecs are not simply parsing a file to get the data; they require quite a bit of very dynamic state to be managed.
lubesGordi•3h ago
It's not obvious to me that dynamic state is required. At the end of the day you have a fixed number of pixels on the screen. If every single pixel changes from frame to frame, that should constitute the most work your codec has to do, no? I'm not a codec writer, but that's my intuition, based on the assumption that codecs are basically designed to minimize the amount of 'work' being done from frame to frame.
throwawaymaths•3h ago
compression algorithms can get very clever in recursive ways
dylan604•3h ago
Maybe you're not familiar with how long-GOP encoding works with IPB frames? If all frames were I-frames, maybe what you're thinking might work: everything you need is in the one frame to be able to describe every single pixel in that frame. Once you start using P-frames, you have to hold on to data from the I-frame to decode the P-frame. With B-frames, you might need data from frames not yet decoded, as they are bi-directional references.
lubesGordi•1h ago
Still, you don't necessarily need dynamic memory allocations if the number of deltas is bounded. In some codecs I could definitely see those varying in size depending on the amount of change going on in the scene.

I'm not a codec developer; I'm only coming at this from an outside, intuitive perspective. Generally, performance-conscious parties want to minimize heap allocations, so I'm interested in how this applies to codec architecture. Codecs seem so complex to me, with so much inscrutable shit going on, but then heap allocations aren't optimized out? Seems like there has to be a very good reason for this.

zimpenfish•3h ago
> codecs are basically designed to minimize the amount of 'work' being done from frame to frame

But to do that they have to keep state and do computations on that state. If you've got frame 47 being a P frame, that means you need frame 46 to decode it correctly. Or frame 47 might be a B frame in which case you need frame 46 and possibly also frame 48 - which means you're having to unpack frames "ahead" of yourself and then keep them around for the next decode.

I think that all counts as "dynamic state"?

wtallis•49m ago
Memory usage can vary, but video codecs are designed to make it practical to derive bounds on those memory requirements because hardware implementations don't have the freedom to dynamically allocate more silicon.
IgorPartola•3h ago
If you are doing something like a GIF or an MJPEG, sure. If you are doing forwards and backwards keyframes with a variable number of deltas in between, with motion estimation, with grain generation, you start having a very dynamic amount of state. Granted, encoders are more complex than decoders in some of this. But still, you might need to decode between 1 and N frames to get the frame you want, and you don't know how much memory it will consume once it is decoded unless you decode it into bitmaps (at 4K that would be over 8 MB per frame, which very quickly runs out of memory if you want any sort of frame buffer present; rough math below).

I suspect the future of video compression will also include frame generation, like what is currently being done for video games. Essentially you have let's say 12 fps video but your video card can fill in the intermediate frames via what is basically generative AI so you get 120 fps output with smooth motion. I imagine that will never be something that WUFFS is best suited for.
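(For a rough sense of scale behind the "over 8 MB per frame" figure: a 4K frame is about 8.3 million luma samples, so even a simple planar buffer is well past that. A back-of-the-envelope sketch, assuming 4:2:0 chroma subsampling; exact numbers depend on bit depth and padding.)

```rust
fn main() {
    let pixels: u64 = 3840 * 2160;    // 8_294_400 luma samples at 4K
    let bytes_8bit = pixels * 3 / 2;  // ~12.4 MB per frame at 1.5 bytes/pixel
    let bytes_10bit = pixels * 3;     // ~24.9 MB if samples are stored as 2 bytes
    println!("{pixels} px, {bytes_8bit} B (8-bit), {bytes_10bit} B (10-bit)");
}
```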

derf_•2h ago
> But still you might need to decode between 1 and N frames to get the frame you want, and you don't know how much memory it will consume...

All of these things are bounded for actual codecs. AV1 allows storing at most 8 reference frames. The sequence header will specify a maximum allowable resolution for any frame. The number of motion vectors is fixed once you know the resolution. Film grain requires only a single additional buffer. There are "levels" specified which ensure interoperability at common operating points (e.g., 4k) without even relying on the sequence header (you just reject sequences that fall outside the limits). Those are mostly intended for hardware, but there is no reason a software decoder could not take advantage of them. As long as codecs are designed to be implemented in hardware, this will be possible.
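(A sketch of what "bounded, so it can be sized up front" might look like in a decoder. This is illustrative only; the constant echoes the 8-reference-frame limit mentioned above, and nothing here reflects rav1d's actual structure.)

```rust
// AV1 allows at most 8 reference frames to be held at once.
const MAX_REF_FRAMES: usize = 8;

// One preallocated picture buffer (4:2:0, 8-bit: 1.5 bytes per pixel).
struct FrameBuf {
    data: Vec<u8>,
}

struct Decoder {
    // Sized up front from the sequence header / level limits, so
    // steady-state decoding does no further heap allocation.
    refs: Vec<FrameBuf>,
}

impl Decoder {
    fn new(max_width: usize, max_height: usize) -> Self {
        let frame_bytes = max_width * max_height * 3 / 2;
        let refs = (0..=MAX_REF_FRAMES) // 8 references plus the frame being decoded
            .map(|_| FrameBuf { data: vec![0u8; frame_bytes] })
            .collect();
        Decoder { refs }
    }
}

fn main() {
    let d = Decoder::new(3840, 2160);
    let total: usize = d.refs.iter().map(|f| f.data.len()).sum();
    println!("preallocated {} buffers, {} bytes total", d.refs.len(), total);
}
```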

lubesGordi•1h ago
See, this is interesting to me. I understand the desire to dynamically allocate buffers at runtime to capture variable-size deltas. That's cool, but also still maybe technically unnecessary? Because, like you say, at 4K and over 8 MB per frame, you still can't allocate over a limit. So likely a codec would have some boundary set on that anyway. Why not just pre-allocate at compile time? For sure this results in a complex data structure. Functionally it could be the same, and we would elide the cost of dynamic memory allocations. What I'm suggesting is probably complex, I'm sure.

In any case I get what you're saying and I understand why codecs are going to be dynamically allocating memory, so thanks for that.

GuB-42•38m ago
> I suspect the future of video compression will also include frame generation

That's how most video codecs work already. They try to "guess" what the next frame will be, based on past (for P-frames) and future (for B-frames) frames. The difference is that the codec encodes some metadata to help with the process and also the difference between the predicted frame and the real frame.

As for using AI techniques to improve prediction, it is not a new thing at all. Many algorithms optimized for compression ratio use neural nets, but these tend to be too computationally expensive for general use. In fact the Hutter prize considers text compression as an AI/AGI problem.
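(In the simplest terms, the decode step described here amounts to something like the toy sketch below: prediction plus an encoded residual. Real codecs predict per block with motion compensation and a transform, not per pixel.)

```rust
// Reconstruct samples from a prediction and the (signed) residual
// that the bitstream actually encodes.
fn reconstruct(predicted: &[u8], residual: &[i16]) -> Vec<u8> {
    predicted
        .iter()
        .zip(residual.iter())
        .map(|(&p, &r)| (p as i16 + r).clamp(0, 255) as u8)
        .collect()
}

fn main() {
    let predicted = [100u8, 120, 130];
    let residual = [-2i16, 0, 5];
    assert_eq!(reconstruct(&predicted, &residual), vec![98, 120, 135]);
}
```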

lubesGordi•1h ago
Hey maybe we can discuss why I'm being downvoted? This is a technical discussion and I'm contributing. If you disagree then say why. I'm not stating anything as fact that isn't fact. I am getting downvoted for asking a question.
IgorPartola•3h ago
AV1 is an amazing codec. I really hope it replaces proprietary codecs like h264 and h265. It has similar, if not better, performance compared to h265 while being completely free. Currently, on an Intel-based MacBook it is only supported in some browsers; however, it seems that newer video cards from AMD, Nvidia, and Intel do include hardware decoders.
karn97•3h ago
The 9070 XT records gameplay in AV1 by default.
monster_truck•2h ago
RDNA3 cards also have AV1 encode. RDNA 2 only has decode.

With the bitrate set to 100MB/s it happily encodes 2160p or even 3240p, the maximum resolution available when using Virtual Super Resolution (which renders above native res and downsamples; it's awesome for titles without resolution scaling when you don't want to use TAA).

kennyadam•50m ago
Isn't that expected? 4K Blurays only encode up to like 128Mbps, which is 16MB/s. 100MB/s seems like complete overkill.
vlovich123•26m ago
I think OP just didn't type Mbps properly. 100 MB/s, or ~800 Mbps, is way higher than the GPU can even encode at a HW level, I would think.
adzm•2h ago
Isn't VP9 more comparable to h265? AV1 seems to be a ton better than both of them.
dagmx•2h ago
They’re all in the same ballpark of each other and have characteristics that don’t make one an outright winner.
CharlesW•1h ago
AV1 is the outright winner in terms of compression efficiency (until you start comparing against VVC/H.266¹), with the advantage being even starker at high resolutions. The only current notable downside of AV1 is that client hardware support isn't yet universal.

¹ https://www.mdpi.com/2079-9292/13/5/953

senfiaj•2h ago
I think VP9 is more comparable to h264. Also if I'm not mistaken it's not good for live streaming, only for storing data.
toast0•56m ago
VP9 works for live streaming/real time conferencing too.
flashblaze•1h ago
I'm not really well versed with codecs, but is it up to the devices or the providers (where you're uploading them) to handle playback or both? A couple of days ago, I tried to upload an Instagram Reel in AV1 codec, and I was struggling to preview it on my Samsung S20 FE Snapdragon version (before uploading and during preview as well). I then resorted to H.264 and it worked w/o any issues.
kevmo314•1h ago
Instagram (the provider) will transcode for compatibility but likely the preview is before transcoding, the assumption being that the device that uploads the video is able to play it.
ta1243•47m ago
Yes that sounds spot on.

I don't know Instagram, but I would expect any provider to be able to handle almost any container/codec/resolution combination going (they likely use ffmpeg underneath) and generate their different output formats at different bitrates for different playback devices.

Either Instagram won't accept AV1 (seems unlikely) or they just haven't processed it yet, as you infer.

I'd love to know why your comment is greyed out.

sparrc•1h ago
Playback is 100% handled by the device. The primary (and essentially only) benefit of H.264 is that almost every device in the entire world has an H.264 hardware decoder builtin to the chip, even extremely cheap devices.

AV1 hardware decoders are still rare so your device was probably resorting to software decoding, which is not ideal.

aaron695•1h ago
Get The Scene involved.

They shifted to h.264 successfully, but I haven't heard of any more conferences to move forward in over a decade.

Currently "The Last of US S02E06" only has one AV1 - https://thepiratebay.org/search.php?q=The+Last+of+Us+S02E06 same THMT - https://thepiratebay.org/search.php?q=The+Handmaids+Tale+S06... These are low quality at only ~600MB, not really early adopter sizes.

AV1 beats h.265 but not h.266 - https://www.preprints.org/manuscript/202402.0869/v1 - People disagree with this paper on default settings

Things like getting hardware to The Scene for encoding might help, but I'm not sure of the bottleneck, it might be bureaucratic or educational or cultural.

[edit] "Common Side Effects S01E04" AV1 is the strongest torrent, that's cool - https://thepiratebay.org/search.php?q=Common+Side+Effects+S0...

wbl•41m ago
There was a conference?!
mbeavitt•3h ago
Haha I was just thinking to myself "I wonder if anyone made any progress on that rav1d bounty yet?"
lubesGordi•3h ago
Honestly, it's a little surprising that the first optimization he found was something fairly obvious just from using perf. I thought they had discussed the zeroing-buffers issue in the first post? The second optimization was definitely more involved/interesting, but was still pointed at by perf. Don't underestimate that tool!
sounds•2h ago
He came from the aarch64 perspective on an Apple device. I often experience someone spotting an "obvious in hindsight" gap because they come from a different background.
nemothekid•3h ago
Interesting to see this article on the performance advantage of not having to zero buffers after this article from 2 days ago: https://news.ycombinator.com/item?id=44032680
mdf•3h ago
There's something about real optimization stories that I find fascinating – particularly the detailed ones including step-by-step improvements and profiling to show how numbers got better. In some way, they are satisfying to read.

Nicholas Nethercote's "How to speed up the Rust compiler" writings[1] fall into this same category for me.

Any others?

[1] https://nnethercote.github.io/

ohr•3h ago
(Author here) I'm a huge fan of the "How to speed up the Rust compiler" series! I was hoping to capture the same feeling :)
dirtyhippiefree•1h ago
Having your last name be Ravid really is the icing on your cake.

Real is about the only other codec I see that could be a name, but nobody uses that anymore.

anon-3988•3h ago
Is skipping initialization of buffers a hard problem for compilers?
empath75•2h ago
It's easy to not initialize the buffer, the hard part is guaranteeing that it's safe to read something that might not be initialized.
adgjlsfhk1•2h ago
Yeah. Proving that the zero-initialization is useless requires proving that the rest of the program never reads one of the zeroed values. This is really difficult because compilers generally don't track individual array indices (since you often don't even know how big the array is).
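(To make the trade-off concrete, here is roughly what the two options look like at the application level. This is a simplified sketch, not the rav1d code; the actual fix in the post is more involved.)

```rust
use std::mem::MaybeUninit;

// Zero-initialized: safe and simple, but the implicit memset shows up in
// profiles when the buffer is large and fully overwritten right after.
fn make_buffer_zeroed(len: usize) -> Vec<u8> {
    vec![0u8; len]
}

// Uninitialized: skips the memset, but now the caller must guarantee every
// element is written before it is ever read.
fn make_buffer_uninit(len: usize) -> Vec<MaybeUninit<u8>> {
    let mut buf = Vec::with_capacity(len);
    // SAFETY: MaybeUninit<u8> does not require initialization, and the
    // capacity was reserved on the line above.
    unsafe { buf.set_len(len) };
    buf
}

fn main() {
    let a = make_buffer_zeroed(1 << 20);
    let b = make_buffer_uninit(1 << 20);
    println!("{} {}", a.len(), b.len());
}
```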
brigade•56m ago
It’s especially hard to elide the compiler initialization when the intended initialization is by a function written in assembly
jebarker•3h ago
Beautiful work and nice write-up. Profiling and optimization is absolutely my favorite part of software development.
renewiltord•2h ago
Oh this stuff is what’s prompting the ffmpeg Twitter account to make a stand against Rust https://x.com/ffmpeg/status/1924137645988356437?s=46
mmastrac•1h ago
Reading the ffmpeg twitter account is enough to turn me off using ffmpeg. It's a shame there's no real alternative -- the devs seem very toxic.

I mean sure, max performance is great if you control every part of your pipeline, but if you're accepting untrusted data from users-at-large ffmpeg has at least a half-dozen remotely exploitable CVEs a year. Better make sure your sandbox is tight.

https://ffmpeg.org/security.html

I feel like there's a middle ground where everyone works towards a secure and fast solution, rather than whatever position they've staked out here.

tialaramex•1h ago
The healthier response might have been to work on speeding up dav1d. If you refine the Olympic record metrics and force them to retrospectively update previous records so that Bolt's 100m sprint record is revised to 9.64s rather than 9.63s, nobody cares, man, get a life; but if you can run an actual nine-second 100 metre sprint, that people care about.†

† If you're a human. If you're an ostrich this is not impressive, but on the whole ostriches aren't competing in the Olympic 100 metre sprint.

Mr_Eri_Atlov•2h ago
AV1 continues to be the most fascinating development in media encoding.

SVT-AV1-PSY is particularly interesting to read up on as well.

smallpipe•1h ago
This is really fun. Is there anything stopping rustc from performing the transmute trick?

Edit: If I had read the next paragraph, I'd have learned about [1] before commenting.

[1] https://github.com/rust-lang/rust/issues/140167

TinkersW•55m ago
Interesting, but it mostly just sounds like Rust issues, requiring some nonsense to fix problems that shouldn't have existed in the first place.

Leading me to the conclusion that Rust is a dubious choice for highly optimized SIMD code.

adgjlsfhk1•44m ago
Transpiled code is rarely good. Rust is often better than C for SIMD code (it actually has useful SIMD instructions exposed, and aliasing guarantees make it a lot easier for the compiler to figure out obvious optimizations). By transpiling, however, you lose most of the structure of an idiomatic project and generally make a bit of a mess of things.