How many branches can your CPU predict?

https://lemire.me/blog/2026/03/18/how-many-branches-can-your-cpu-predict/

54•ibobev•2h ago

Comments

withinboredom•1h ago

Before I switched a hot code path to a branchless version, I was seeing strangely lower performance on Intel than on AMD under load. Realizing that the branch predictor was the most likely cause was a little surprising.

stephencanon•1h ago

Enlarging a branch predictor requires area and timing tradeoffs. CPU designers have to balance branch predictor improvements against other improvements they could make with the same area and timing resources. What this tells you is that either Intel is more constrained for one reason or another, or Intel's designers think that they net larger wins by deploying those resources elsewhere in the CPU (which might be because they have identified larger opportunities for improvement, or because they are basing their decision making on a different sample of software, or both).

bee_rider•1h ago

I guess the generate_random_value function uses the same seed every time, so the expectation is that the branch predictor should be able to memorize it with perfect accuracy.
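A minimal sketch of why a fixed seed makes the branch pattern repeatable. The thread does not show how generate_random_value is implemented, so Python's random.Random stands in for it here; the name branch_outcomes and the "value is odd" test are assumptions for illustration only:

```python
import random

def branch_outcomes(seed, n):
    """Return n taken/not-taken outcomes for a branch on 'value is odd'.

    random.Random is a stand-in for the benchmark's generate_random_value
    (an assumption; the real generator is not shown in this thread).
    """
    rng = random.Random(seed)
    return [(rng.getrandbits(32) & 1) == 1 for _ in range(n)]

first_run = branch_outcomes(seed=1234, n=30_000)
second_run = branch_outcomes(seed=1234, n=30_000)
print(first_run == second_run)  # True: same seed, same branch pattern
```

Because the sequence is identical on every run, a predictor with enough capacity could in principle replay it from memory rather than guess.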

But the memorization capacity of the branch predictor must be a trade-off, right? This generate_random_value function is presumably impossible to predict using heuristics, so I guess the question is how often we encounter 30k-long branch patterns like that.

Which isn’t to say I have evidence to the contrary. I just have no idea how useful this capacity actually is, haha.

bluGill•48m ago

30k-long patterns are likely rare. However, in the real world there is a lot of code with 30k different branches that we use several times, so the same ability to memorize/predict 30k branches is useful. Even though this particular example isn't realistic, it still looks good.

Of course, we can't generalize this to "Intel bad". This pattern seems unrealistic (at least at a glance; real experts should have real data/statistics on what real code does, not just my semi-educated guess), so perhaps Intel has better prediction algorithms for the real world that miss this example. Not being an expert in the branches real-world code takes, I can't comment.

bee_rider•41m ago

Yeah, I’m also not an expert in this. Just had enough architecture classes to know that all three companies are using cleverer branch predictors than I could come up with, haha.

Another possibility is that the memorization capacity of the branch predictors is a bottleneck, but a bottleneck that they aren’t often hitting. As the design is enhanced, that bottleneck might show up. AMD might just have most recently widened that bottleneck.

Super hand-wavey, but to your point about data, without data we can really only hand-wave anyway.

IcePic•31m ago

https://chromium.googlesource.com/chromiumos/third_party/gcc... has some looong select/case things with lots of ifs in them, but I don't think they would hit 30k.

rayiner•1h ago

Using random values defeats the purpose of the branch predictor. The best branch predictor for this test would be one that always predicts the branch taken or not taken.

dundarious•39m ago

There will be runs of even and runs of odd outputs from the rng. This benchmark tests how well the branch predictor "retrains" to the current run. It is a good test of this adaptability of the predictor.

The benchmark is still narrow in focus, and the results don't unequivocally mean AMD's predictor is overall "the best".

gpderetta•24m ago

The author is running the benchmark multiple times with the same random seed to discover how long a pattern the predictor can learn.

OskarS•49m ago

Hmm, that's interesting. The code as written only has one branch, the if statement (well, two, the while loop exit clause as well). My mental model of the branch predictor was that for each branch, the CPU maintained some internal state like "probably taken/not taken" or "indeterminate", and it "learned" by executing the branch many times.
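A minimal sketch of that per-branch mental model, assuming a single 2-bit saturating counter for the branch. The function name, seed, and pattern length are illustrative choices, not taken from the benchmark:

```python
import random

def two_bit_counter_accuracy(outcomes, passes=5):
    """Replay `outcomes` `passes` times through one 2-bit saturating counter.

    Returns the fraction of correct predictions during the final pass.
    """
    counter = 2  # states 0..1 predict not taken, states 2..3 predict taken
    correct = 0
    for _ in range(passes):
        correct = 0  # only the final pass is reported
        for taken in outcomes:
            predicted_taken = counter >= 2
            if predicted_taken == taken:
                correct += 1
            # Move the counter toward the observed outcome, saturating at 0 and 3.
            counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)

rng = random.Random(1234)  # fixed seed, arbitrary choice
pattern = [rng.getrandbits(1) == 1 for _ in range(30_000)]
print(f"final-pass accuracy: {two_bit_counter_accuracy(pattern):.3f}")
```

On a pseudo-random pattern this model should stay close to 50% however many times the sequence is replayed, because the counter keeps no record of where it is in the sequence.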

But that's clearly not right, because apparently the specific data it's branching off matters too? Like, "test memory location X, and branch at location Y", and it remembers both the specific memory location and which specific branch branches off of it? That's really impressive, I didn't think branch predictors worked like that.

Or does it learn the exact pattern? "After the pattern ...0101101011000 (each 0/1 representing the branch not taken/taken), it's probably 1 next time"?

gpderetta•26m ago

Typical branch predictors can both learn patterns (even very long patterns) and use branch history (the probability of a branch being taken depends on the path taken to reach that branch). They don't normally look at data other than branch addresses (and targets for indirect branches).
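A toy sketch of history-based prediction, assuming a table indexed by the last few branch outcomes where each entry stores the outcome last seen after that history. Real predictors are far more elaborate; the function name, the 12-bit history, and the pattern lengths are illustrative assumptions:

```python
import random

def history_predictor_miss_rate(outcomes, history_bits=12, passes=5):
    """Replay `outcomes` through a table indexed by the last `history_bits` outcomes.

    Each entry remembers the outcome last seen after that history.
    Returns the misprediction rate during the final pass.
    """
    mask = (1 << history_bits) - 1
    table = [False] * (1 << history_bits)
    history = 0
    misses = 0
    for _ in range(passes):
        misses = 0  # only the final pass is reported
        for taken in outcomes:
            if table[history] != taken:
                misses += 1
            table[history] = taken  # learn what followed this history
            history = ((history << 1) | int(taken)) & mask
    return misses / len(outcomes)

rng = random.Random(1234)  # fixed seed, arbitrary choice
for length in (64, 1_000, 30_000):
    pattern = [rng.getrandbits(1) == 1 for _ in range(length)]
    rate = history_predictor_miss_rate(pattern)
    print(f"pattern length {length:>6}: final-pass miss rate {rate:.3f}")
```

In this toy model, a short replayed pattern is learned almost perfectly because nearly every history is followed by a single outcome, while a pattern much longer than the table makes different positions share an entry and the miss rate climbs toward 50%.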

jeffbee•19m ago

They can't. The data that would be needed isn't available at the time the prediction is made.

1718627440•9m ago

Yeah, otherwise you wouldn't need to predict anything.

LPisGood•25m ago

There are many branch prediction algorithms out there. They range from fun architecture papers that try to use machine learning to static predictors that don’t even adapt to the prior outcomes at all.

Night_Thastus•11m ago

AMD CPUs have been killing it lately, but this benchmark feels quite artificial.

It's a tiny, trivial example with 1 branch that behaves in a pseudo-random way (random, but fixed seed). I'm not sure that's a really good example of real world branching.

How would the various branch predictors perform when the branch taken varies from 0% likely to 100% likely, in say, 5% increments?
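As a rough baseline for that sweep, assuming each outcome is independent of the others (so there is no pattern to learn), the best any predictor can do is always guess the likelier direction. The function name is hypothetical, and this is arithmetic rather than a measurement of any real CPU:

```python
def best_history_free_accuracy(p_taken):
    """Best achievable accuracy when each outcome is independent with P(taken) = p_taken.

    With no usable pattern, always guessing the likelier direction is optimal.
    """
    return max(p_taken, 1.0 - p_taken)

for percent in range(0, 101, 5):
    p = percent / 100
    print(f"P(taken) = {percent:3d}%  best accuracy = {best_history_free_accuracy(p):.2f}")
```

Measured results on a replayed fixed-seed sequence could beat this baseline, since the predictor can memorize the sequence instead of relying on the taken probability alone.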

How would they perform when the contents of both paths are very heavy, which involves a lot of pipeline/SE flushing?

How would they perform when many different branches all occur in sequence?

Without info like that, this feels a little pointless.
