Swapping two blocks of memory inside a larger block, in constant memory

https://devblogs.microsoft.com/oldnewthing/20260101-00/?p=111955

48•paulmooreparks•1mo ago

Comments

praptak•1mo ago

I think this was discussed in Jon Bentley's "Programming Pearls"?

Also in the same book it was mentioned that the disjoint cycles method (also mentioned in the article) was worse for paging/caching than the three reverses method.
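For readers who have not seen the disjoint cycles ("juggling") method, here is a minimal sketch of a left rotation done that way, assuming C++17 for std::gcd. The function name rotate_by_cycles and the choice of std::vector<int> are illustrative only and are not taken from the article or the book.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>   // std::gcd (C++17)
#include <utility>   // std::move
#include <vector>

// Rotate v left by k positions using the disjoint-cycles ("juggling") method.
// Every element is moved exactly once, but consecutive accesses are k apart,
// which is the strided pattern that is unfriendly to caches and paging.
void rotate_by_cycles(std::vector<int>& v, std::size_t k) {
    const std::size_t n = v.size();
    assert(k <= n);  // precondition chosen for this sketch
    if (n == 0 || k == 0 || k == n) return;

    const std::size_t cycles = std::gcd(n, k);
    for (std::size_t start = 0; start < cycles; ++start) {
        int saved = std::move(v[start]);
        std::size_t hole = start;
        for (;;) {
            std::size_t src = hole + k;
            if (src >= n) src -= n;   // wrap around the end of the array
            if (src == start) break;  // cycle closed
            v[hole] = std::move(v[src]);
            hole = src;
        }
        v[hole] = std::move(saved);
    }
}
```

The number of cycles is gcd(n, k), so each cycle visits n / gcd(n, k) positions with stride k.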

ot•1mo ago

That's probably true for small primitive types, but if your objects are expensive to move (like a large struct) it might be beneficial to minimize swaps.

praptak•1mo ago

Yeah, it might be interesting to profile both algorithms and see how they perform depending on the size of the blocks being swapped (which doesn't even have to be equal to the size of the object in the array).

taeric•1mo ago

It is discussed in that book. Very fun read, all told. Highly recommended if folks find this sort of thing fun. I think I should thumb through it again. :D

jnellis•1mo ago

Java chooses to use the cycle method mostly. They also reference Bentley.

https://github.com/openjdk/jdk/blob/f1e0e0c25ec62a543b9cbfab...

j4cobgarby•1mo ago

You can also use the XOR trick, not sure what's faster though.

adrian_b•1mo ago

The XOR trick was sometimes useful in the past, on weird CPUs that had non-equivalent registers and which also lacked register exchange instructions (the Intel/AMD CPUs have non-equivalent registers, but they have a register exchange instruction, so they do not need this trick).

The XOR trick is not useful on modern CPUs for swapping memory blocks, because on modern CPUs the slowest operations are the memory accesses and the XOR trick needs too many memory accesses.

For swapping memory, the fastest way needs 4 memory accesses: load X, load Y, store X where Y was, store Y where X was. Each "load" and "store" in this sequence may consist of multiple load or store instructions, if multiple registers are used as the intermediate buffer. Ideally, an intermediate register buffer matching the cache line size should be used, with accesses aligned to cache lines.
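A rough sketch of that load/load/store/store pattern for two equal-size, non-overlapping blocks might look like the following. The name swap_equal_blocks and the 64-byte chunk size are assumptions made for illustration (64 bytes is a common cache line size, but not universal), and whether a compiler turns the memcpy calls into wide register moves depends on the compiler and its flags.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Swap two non-overlapping byte ranges of equal length n through a small
// fixed-size stack buffer, so no heap allocation is involved.
void swap_equal_blocks(unsigned char* x, unsigned char* y, std::size_t n) {
    assert(x + n <= y || y + n <= x);  // the two ranges must not overlap

    constexpr std::size_t kChunk = 64;  // assumed cache line size
    unsigned char tmp[kChunk];
    while (n > 0) {
        const std::size_t len = std::min(n, kChunk);
        std::memcpy(tmp, x, len);  // load X into the buffer
        std::memcpy(x, y, len);    // load Y and store it where X was
        std::memcpy(y, tmp, len);  // store X where Y was
        x += len;
        y += len;
        n -= len;
    }
}
```

This sketch makes no attempt to align the accesses to cache line boundaries.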

Hopefully, std::rotate is written in such a way that it is compiled into such a sequence of machine instructions.

Someone•1mo ago

> with accesses aligned to cache lines.

You want that, but can be tricky because the from and to regions may have different alignment.

Also, the XOR trick introduces data dependencies. That slows down pipelined CPUs.

SkiFire13•1mo ago

How does that work for swapping two blocks of memory with different sizes (which may require shifting the data in between)?

jhatax•1mo ago

As a commenter noted as well, you can perform the swap using two std::rotate calls vs. three (less than 2N operations). This said, Raymond’s use of reverse is still most efficient at N operations (not considering paging/caching issues).

HarHarVeryFunny•1mo ago

Isn't he also using 2N operations?

To swap B and D, with intervening C (i.e. B C D), what he is doing is individually reversing each of B, C, and D (= total N swaps), then reversing the combined B' C' D' (= another N swaps).
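As a sketch of the reversal approach just described, with the blocks given as index ranges into a std::vector<int> (the function name and index-based interface are illustrative choices, not the article's code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Swap block B = [b, c) with block D = [d, e), keeping C = [c, d) in between.
void swap_blocks_by_reversal(std::vector<int>& v, std::size_t b, std::size_t c,
                             std::size_t d, std::size_t e) {
    assert(b <= c && c <= d && d <= e && e <= v.size());
    auto it = v.begin();
    std::reverse(it + b, it + c);  // B -> B'
    std::reverse(it + c, it + d);  // C -> C'
    std::reverse(it + d, it + e);  // D -> D'
    std::reverse(it + b, it + e);  // B' C' D' -> D C B
}
```

Each element in [b, e) is touched by exactly two reversals: one for its own block and one for the combined range.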

TrainedMonkey•1mo ago

Apparently the trick is two std::rotates : https://devblogs.microsoft.com/oldnewthing/20260101-00/?p=11...

As a side note, love how Raymond handled that, no fluff and straight to the point. Beginner's mind and all that.

trjordan•1mo ago

There's something about this that's unsatisfying to me. Like it's just a trivia trick.

My first read of this was "this seems impossible." You're asked to move bits around without any working space, because you're not allowed to allocate memory. I guess you could interpret this pedantically in C/C++ land and decide that they mean no additional usage of the heap, so there's other places (registers, stack, etc.) to store bits. The title is "in constant memory" so I guess I'm allowed some constant memory, which is vaguely at odds with "can you do this without allocating additional memory?" in the text.

But even with that constraint ... std::rotate allocates memory! It'll throw std::bad_alloc when it can't. It's not using it for the core algorithm (... which only puts values on the heap ... which I guess is not memory ...), but that function can 100% allocate new memory in the right conditions.

It's cool you can do this simply with a couple rotates, but it feels like a party trick.

SkiFire13•1mo ago

> But even with that constraint ... std::rotate allocates memory! It'll throw std::bad_alloc when it can't.

This feels kinda crazy. Is there a reason why this is the case?

quuxplusone•1mo ago

That's only for the parallel overload. The ordinary sequential overload doesn't allocate: the only three ordinary STL algorithms that allocate are stable_sort, stable_partition, and (ironically) inplace_merge.

HarHarVeryFunny•1mo ago

No - std::rotate is just doing this with in-place swaps.

Say you have "A1 A2 B1" and want to rotate (swap) adjacent blocks A1-A2 and B1, where WLOG the smaller of these is B1, and A1 is same size as B1.

What you do is first swap B1 with A1 (putting B1 into its final place).

B1 A2 A1

Now recurse to swap A2 and A1, giving the final result:

B1 A1 A2

Swapping same-size blocks (which is what this algorithm always chooses to do) is easy since you can just iterate through both, swapping corresponding pairs of elements. Each block only gets moved once since it gets put into its final place.
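Here is one way the block-swap rotation described above could be written iteratively. This is only a sketch of the idea, not the code of any particular standard library; the name rotate_by_block_swaps and the index-based interface are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Rotate [first, last) so that the element at mid ends up at first,
// using only swaps of equal-size blocks (no temporary buffer).
void rotate_by_block_swaps(std::vector<int>& v, std::size_t first,
                           std::size_t mid, std::size_t last) {
    assert(first <= mid && mid <= last && last <= v.size());
    while (first != mid && mid != last) {
        const std::size_t left = mid - first;
        const std::size_t right = last - mid;
        const std::size_t k = std::min(left, right);

        // Swap the first k elements of the left part with the first k
        // elements of the right part; those k right-hand elements are
        // now in their final place.
        std::swap_ranges(v.begin() + first, v.begin() + first + k,
                         v.begin() + mid);
        first += k;

        // If the left part was the smaller one, it has moved k places to
        // the right, so the split point moves with it. Otherwise the split
        // point stays where it is and the remaining range shrinks.
        if (left <= right) mid += k;
    }
}
```

The loop is the recursion from the comment unrolled: after each swap the remaining problem is a smaller rotation of the same shape.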

hacker_homie•1mo ago

You are thinking of std::swap, std::rotate does throw bad_alloc

HarHarVeryFunny•1mo ago

I see it says that it may throw bad_alloc, but it's not clear why, since the algorithm itself (e.g see "Possible implementation" below) can easily be done in-place.

https://en.cppreference.com/w/cpp/algorithm/rotate.html

I'm wondering if the bad_alloc might be because a single temporary element (of whatever type the iterators point to) is going to be needed to swap each pair of elements, or maybe to allow for an inefficient implementation that chose not to do it in-place?

taeric•1mo ago

To be fair, it originates from a time when memory was tighter. Is discussed with some motivating text in Programming Pearls. I can't remember the context, but I think it was in a text editor. I can look it up, if folks want some of that context here.

osullivj•1mo ago

Also useful for cache locality, a more recent trend. But I guess that's just another slightly different case of tight memory; this time in the cache rather than RAM generally.

HarHarVeryFunny•1mo ago

I did something similar back in the day to support block-move for an editor running on a memory constrained 8-bit micro (BBC Micro). It had to be done in-place since there was no guarantee you'd have enough spare memory to use a temporary buffer, and also more efficient to move each byte once rather than twice (in/out of temp buffer).

wakawaka28•1mo ago

The problem seems less arbitrary if the chunks being rotated are large enough. Implicit in the problem is that any method that would require additional memory to be allocated would probably require memory proportional to the sizes of stuff being swapped. That could be unmanageable.

As for whether std::rotate() uses allocations, I can't say without looking. But I know it could be implemented without allocations. Maybe it's optimal in practice to use extra space. I don't think a method involving reversal of items is generally going to be the fastest. It might be the only practical one in some cases or else better for other reasons.

HarHarVeryFunny•1mo ago

Couldn't this be done in 2 rotates rather than 3 :

A B C D E

A C B D E -- after rotate B, C

A D C B E -- after rotate C-B, D

Complexity would seem to be the same as the reverse method, since every element in the original B-D range is getting moved twice.
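The two-rotate version can be sketched with std::rotate directly. The function name and the index-based interface are illustrative assumptions; the blocks are B = [b, c), C = [c, d), D = [d, e).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Swap block B = [b, c) with block D = [d, e), keeping C = [c, d) in between,
// using two rotations.
void swap_blocks_by_two_rotates(std::vector<int>& v, std::size_t b,
                                std::size_t c, std::size_t d, std::size_t e) {
    assert(b <= c && c <= d && d <= e && e <= v.size());
    auto it = v.begin();
    std::rotate(it + b, it + c, it + d);  // B C D -> C B D
    std::rotate(it + b, it + d, it + e);  // (C B) D -> D C B
}
```

After the first rotation the combined C-B run still occupies [b, d), which is why the second call can use d as its split point.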

cbsks•1mo ago

On Linux, if the blocks are page aligned, you could use mremap(2) to swap blocks very efficiently without using any additional physical memory.

throwawayk7h•1mo ago

This should also be possible with [XOR swap](https://en.wikipedia.org/wiki/XOR_swap_algorithm), though you need to do three passes.
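For illustration, a three-pass XOR swap of two blocks could look like the sketch below. Note the assumptions: it only handles two blocks of equal size that do not overlap, so on its own it does not cover the different-size case from the article. The function name is made up.

```cpp
#include <cassert>
#include <cstddef>

// Swap two equal-size, non-overlapping byte ranges with the XOR trick.
// Each pass reads both blocks and writes one of them.
void xor_swap_blocks(unsigned char* x, unsigned char* y, std::size_t n) {
    assert(x + n <= y || y + n <= x);  // overlapping bytes would be zeroed
    for (std::size_t i = 0; i < n; ++i)
        x[i] = static_cast<unsigned char>(x[i] ^ y[i]);  // x = x0 ^ y0
    for (std::size_t i = 0; i < n; ++i)
        y[i] = static_cast<unsigned char>(y[i] ^ x[i]);  // y = x0
    for (std::size_t i = 0; i < n; ++i)
        x[i] = static_cast<unsigned char>(x[i] ^ y[i]);  // x = y0
}
```

The three passes could also be fused into a single loop with three XORs per byte; either way it does more memory traffic than a swap through a temporary.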

notepad0x90•1mo ago

I was anticipating a SIMD shuffle/permute instruction like vpermq; it won't allocate more RAM per se.

Anyways, here is what google search AI gave me as an example of how that would work (I don't know this stuff well enough myself):

; Assume ymm0 contains [A, B, C, D] (Q0=A, Q1=B, Q2=C, Q3=D)

; The immediate 0xd8 (11 01 10 00 in binary, two bits per destination qword, Q3 first) means:

; - Keep Q0 (bits 1:0 = 00, source index 0)

; - Swap Q1 with Q2 (bits 3:2 = 10 select source index 2; bits 5:4 = 01 select source index 1)

; - Keep Q3 (index 3)

vpermq ymm0, ymm0, 0xd8

; ymm0 now contains [A, C, B, D]
