Beating the L1 cache with value speculation (2021)

https://mazzo.li/posts/value-speculation.html

22•shoo•4d ago

Comments

signa11•3d ago

https://news.ycombinator.com/item?id=45499965

hshdhdhehd•1h ago

I am new to this low-level stuff, but am I right in understanding that this works because he uses a linked list whose nodes are often contiguous in memory? You guess that the next element is contiguous, and if it is, the branch predictor predicts that you are right, which saves waiting on the cache and stalling the pipeline.

However I imagine you'd also get the same great performance using an array?
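
To make the question concrete, here is a minimal sketch of the guess being described, next to the plain array loop it is compared against. The `Node` layout and the function names are assumptions for illustration and are not taken from the article. An optimizing compiler may notice that both branches assign the same pointer and fold the check away, so this shows the idea, not a guaranteed speedup.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node layout, assumed for illustration. */
typedef struct Node {
    uint64_t value;
    struct Node *next;
} Node;

/* Sum a list while guessing that the next node sits right after this one. */
uint64_t sum_guess_adjacent(Node *node) {
    uint64_t total = 0;
    while (node) {
        total += node->value;
        Node *guess = node + 1;     /* address arithmetic only, no load */
        Node *actual = node->next;  /* the load the next iteration depends on */
        if (actual == guess)
            node = guess;           /* guess confirmed: continue from it */
        else
            node = actual;          /* guess wrong: follow the real pointer */
    }
    return total;
}

/* The same sum over an array, where the next address is always known. */
uint64_t sum_array(const uint64_t *values, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += values[i];
    return total;
}
```

In the array version there is no pointer load to wait for at all, which is why an array is expected to be at least as fast when the data can be stored that way.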

vlovich123•1h ago

Yes. As he noted, the trick is of limited value in practice.

kazinator•31m ago

Consecutive linked list nodes can occur when you bump allocate them, and in particular if you have a copying garbage collector, which ensures that bump allocation takes place from blank-slate heap areas with no gaps.

Idea: what if we implement something that resembles CDR coding but doesn't compact the cells together (so it is not a space-saving device)? When we have two cells A and B such that A->cdr == B and A + 1 == B, we replace A->cdr with a special constant that says the same thing: it indicates that A->cdr is equivalent to A + 1.

Then, I think, we could have a very simple, stable and portable form of the trick in the article:

  while (node) {
    value += node->value;
    if (node->next == NEXT_IS_CONSECUTIVE)
      next = node + 1;
    else
      next = node->next;
    node = next;
  }

The branch predictor can predict that the branch is taken (our bump allocator ensures that is frequently the case) and go straight to next = node + 1. If the load of node->next then completes while that path is executing speculatively and turns out not to equal the magic value, the predicted path is canceled and we get node->next.

This doesn't look like something that can be optimized away, because we are not comparing node->next to node + 1; there is no tautology there.
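
A self-contained sketch of how the pieces could fit together, under some assumptions: the `Node` layout, the `mark_consecutive` pass, and the use of a static dummy node's address as the "special constant" are all hypothetical choices made for illustration. Nothing here has been benchmarked, and the compiler is still free to emit a conditional move instead of a branch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node layout, assumed for illustration. */
typedef struct Node {
    uint64_t value;
    struct Node *next;
} Node;

/* Assumed sentinel: the address of a dummy node that is never linked in. */
static Node consecutive_marker;
#define NEXT_IS_CONSECUTIVE (&consecutive_marker)

/* One-time pass: replace next pointers that point at the adjacent cell. */
void mark_consecutive(Node *node) {
    while (node) {
        Node *next = node->next;
        if (next == node + 1)
            node->next = NEXT_IS_CONSECUTIVE;
        node = next;
    }
}

/* Sum a marked list; the branch compares against the sentinel, not node + 1. */
uint64_t sum_marked(const Node *node) {
    uint64_t value = 0;
    while (node) {
        const Node *next;
        value += node->value;
        if (node->next == NEXT_IS_CONSECUTIVE)
            next = node + 1;    /* common case when nodes were bump allocated */
        else
            next = node->next;  /* gap, or NULL at the end of the list */
        node = next;
    }
    return value;
}
```

One cost of this design is that every other piece of code that walks the list has to understand the sentinel as well, since a marked `next` field is no longer a usable pointer.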

bjornsing•16m ago

Yes. But I don’t think the OP is suggesting this as an alternative to using an array. As I read (or rather skimmed) it, the linked list is just a simplified example. You can use this trick in more complex situations too, e.g. if you’re searching a tree structure and you know that some paths through the tree are much more common than others.

stinkbeetle•45m ago

Data speculation is a CPU technique too, which Apple CPUs are known to implement. Apparently they can do stride detection when predicting address values.

Someone with an M2 or later might try the code and find no speedup with the "improved" version, and that it's already iterating faster than L1 load-to-use latency.

bjornsing•20m ago

But that works on a different level, right? As I understand it, data speculation is about prefetching from memory into cache, whereas this trick is about using the branch predictor as an ultra-fast “L0” cache, you could say.
