However I imagine you'd also get the same great performance using an array?
Idea: what if we implement something that resembles CDR coding, but doesn't compact the cells together (not a space-saving device). The idea as is that when we have two cells A and B such that A->cdr == B, and such that A + 1 == B, then we replace A->cdr with some special constant which says the same thing; indicates that A->cdr is equivalent to A + 1.
Then, I think, we could have a very simple, stable and portable form of the trick in the article:
while (node) {
value += node->value;
if (node->next == NEXT_IS_CONSECUTIVE)
next = node + 1;
else
next = node->next;
node = next;
}
The branch predictor can predict that the branch is taken (our bump allocator ensures that is frequently the case), and go straight to next = node + 1. When in the speculatively executed alternative path, the load of node->next completes and is not equal to the magic value, then the predicted path is canceled and we gret node->next.This doesn't look like something that can be optimized away, because we are not comparing node->next to node + 1; there is no tautology there.
Someone with a M >= 2 might try the code and find no speedup with the "improved" version, and that it's already iterating faster than L1 load-to-use latency.
signa11•3d ago