The XOR trick is not useful on modern CPUs for swapping memory blocks, because on modern CPUs the slowest operations are the memory accesses and the XOR trick needs too many memory accesses.
For swapping memory, the fastest way needs 4 memory accesses: load X, load Y, store X where Y was, store Y where X was. Each "load" and "store" in this sequence may consist of multiple load or store instructions, if multiple registers are used as the intermediate buffer. Ideally, an intermediate register buffer matching the cache line size should be used, with accesses aligned to cache lines.
Hopefully, std::rotate is written in such a way that it is compiled into such a sequence of machine instructions.
You want that, but can be tricky because the from and to regions may have different alignment.
Also, the XOR trick introduces data dependencies. That slows down pipelined CPUs.
To swap B and D, with intervening C (i.e. B C D), what he his doing is individually reversing each of B C, and D (= total N swaps), then reversing the combined B' C' D' (= another N swaps).
As a side note, love how Raymond handled that, no fluff and straight to the point. Beginners mind and all that.
My first read of this was "this seems impossible." You're asked to move bits around without any working space, because you're not allowed to allocate memory. I guess you could interpret this pedantically in C/C++ land and decide that they mean no additional usage of the heap, so there's other places (registers, stack, etc.) to store bits. The title is "in constant memory" so I guess I'm allowed some constant memory, which is vaguely at odds with "can you do this without allocating additional memory?" in the text.
But even with that constraint ... std::rotate allocates memory! It'll throw std::bad_alloc when it can't. It's not using it for the core algorithm (... which only puts values on the heap ... which I guess is not memory ...), but that function can 100% allocate new memory in the right conditions.
It's cool you can do this simply with a couple rotates, but it feels like a party trick.
This feels kinda crazy. Is there a reason why this is the case?
Say you have "A1 A2 B1" and want to rotate (swap) adjacent blocks A1-A2 and B1, where WLOG the smaller of these is B1, and A1 is same size as B1.
What you do is first swap B1 with A1 (putting B1 into it's final place).
B1 A2 A1
Now recurse to swap A2 and A1, giving the final result:
B1 A1 A2
Swapping same-size blocks (which is what this algorithm always chooses to do) is easy since you can just iterate though both swapping corresponding pairs of elements. Each block only gets moved once since it gets put into it's final place.
https://en.cppreference.com/w/cpp/algorithm/rotate.html
I'm wondering if the bad_alloc might be because a single temporary element (of whatever type the iterators point to) is going to be needed to swap each pair of elements, or maybe to allow for an inefficient implementation that chose not to do it in-place?
As for whether std::rotate() uses allocations, I can't say without looking. But I know it could be implemented without allocations. Maybe it's optimal in practice to use extra space. I don't think a method involving reversal of items is generally going to be the fastest. It might be the only practical one in some cases or else better for other reasons.
A B C D E
A C B D E -- after rotate B, C
A D C B E -- after rotate C-B, D
Complexity would seem to be the same as the reverse method, since every element in the original B-D range is getting moved twice.
Anyways, here is what google search AI gave me as an example of how that would work (I don't know this stuff well enough myself):
; Assume ymm0 contains [A, B, C, D] (Q0=A, Q1=B, Q2=C, Q3=D)
; The immediate 0xd8 (11011000 in binary) means:
; - Keep Q0 (index 0)
; - Swap Q1 (index 1) with Q2 (index 2) (110, 011 in binary for bits 1,2)
; - Keep Q3 (index 3)
vpermq ymm0, ymm0, 0xd8
; ymm0 now contains [A, C, B, D]
praptak•1d ago
Also in the same book it was mentioned that the disjoint cycles method (also mentioned in the article) was worse for paging/caching than the three reverses method.
ot•1d ago
praptak•1d ago
taeric•1d ago
jnellis•1d ago
https://github.com/openjdk/jdk/blob/f1e0e0c25ec62a543b9cbfab...