Swapping two blocks of memory inside a larger block, in constant memory

https://devblogs.microsoft.com/oldnewthing/20260101-00/?p=111955

48•paulmooreparks•1mo ago

Comments

praptak•1mo ago

I think this was discussed in Jon Bentley's "Programming Pearls"?

Also in the same book it was mentioned that the disjoint cycles method (also mentioned in the article) was worse for paging/caching than the three reverses method.
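For readers who have not seen it, the three-reverses method can be sketched roughly as below. This is only an illustration: the helper name `rotate_by_reversal` and the choice of a `std::vector<int>` are assumptions made for the example, not something taken from the book or the article.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: rotate v left by k positions, so that the element
// originally at index k ends up at index 0, using three reversals.
void rotate_by_reversal(std::vector<int>& v, std::size_t k) {
    assert(k <= v.size());
    std::reverse(v.begin(), v.begin() + k);  // reverse the left block
    std::reverse(v.begin() + k, v.end());    // reverse the right block
    std::reverse(v.begin(), v.end());        // reverse the whole range
}
```

Each pass walks memory sequentially from both ends of a range, which is presumably why it behaves better for paging and caching than a method that jumps around in cycles.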

ot•1mo ago

That's probably true for small primitive types, but if your objects are expensive to move (like a large struct) it might be beneficial to minimize swaps.

praptak•1mo ago

Yeah, it might be interesting to profile both algorithms and see how they perform depending on the size of the blocks being swapped (which doesn't even have to be equal to the size of the object in the array).

taeric•1mo ago

It is discussed in that book. Very fun read, all told. Highly recommended if folks find this sort of thing fun. I think I should thumb through it again. :D

jnellis•1mo ago

Java chooses to use the cycle method mostly. They also reference Bentley.

https://github.com/openjdk/jdk/blob/f1e0e0c25ec62a543b9cbfab...

j4cobgarby•1mo ago

You can also use the XOR trick, not sure what's faster though.

adrian_b•1mo ago

The XOR trick was sometimes useful in the past, on weird CPUs that had non-equivalent registers and which also lacked register exchange instructions (the Intel/AMD CPUs have non-equivalent registers, but they have a register exchange instruction, so they do not need this trick).

The XOR trick is not useful on modern CPUs for swapping memory blocks, because on modern CPUs the slowest operations are the memory accesses and the XOR trick needs too many memory accesses.

For swapping memory, the fastest way needs 4 memory accesses: load X, load Y, store X where Y was, store Y where X was. Each "load" and "store" in this sequence may consist of multiple load or store instructions, if multiple registers are used as the intermediate buffer. Ideally, an intermediate register buffer matching the cache line size should be used, with accesses aligned to cache lines.
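A minimal sketch of that load/load/store/store pattern for two equal-size, non-overlapping blocks might look like the following. The function name `swap_equal_blocks` and the 64-byte chunk size are assumptions for illustration; the sketch does nothing about alignment, and whether a compiler turns the `memcpy` calls into wide register moves depends on the compiler and target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: swap two equal-size blocks through a small
// fixed-size buffer, so the extra memory used is constant.
void swap_equal_blocks(unsigned char* x, unsigned char* y, std::size_t n) {
    constexpr std::size_t kChunk = 64;  // assumed cache line size
    unsigned char bx[kChunk];
    unsigned char by[kChunk];
    assert(x != nullptr && y != nullptr);
    while (n > 0) {
        const std::size_t len = n < kChunk ? n : kChunk;
        std::memcpy(bx, x, len);  // load X
        std::memcpy(by, y, len);  // load Y
        std::memcpy(x, by, len);  // store Y where X was
        std::memcpy(y, bx, len);  // store X where Y was
        x += len;
        y += len;
        n -= len;
    }
}
```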

Hopefully, std::rotate is written in such a way that it is compiled into such a sequence of machine instructions.

Someone•1mo ago

> with accesses aligned to cache lines.

You want that, but can be tricky because the from and to regions may have different alignment.

Also, the XOR trick introduces data dependencies. That slows down pipelined CPUs.

SkiFire13•1mo ago

How does that work for swapping two blocks of memory with different sizes (which may require shifting the data in between)?

jhatax•1mo ago

As a commenter noted as well, you can perform the swap using two std::rotate calls vs. three (less than 2N operations). This said, Raymond’s use of reverse is still most efficient at N operations (not considering paging/caching issues).

HarHarVeryFunny•1mo ago

Isn't he also using 2N operations?

To swap B and D, with intervening C (i.e. B C D), what he is doing is individually reversing each of B, C, and D (= total N swaps), then reversing the combined B' C' D' (= another N swaps).
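As a sketch of the sequence just described (not the article's actual code), the four reversals could be written like this. The name `swap_blocks_by_reversal` and the index-based interface are assumptions for the example: B is `[b, c)`, C is `[c, d)` and D is `[d, e)`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: turn "... B C D ..." into "... D C B ..." in place.
void swap_blocks_by_reversal(std::vector<int>& v, std::size_t b,
                             std::size_t c, std::size_t d, std::size_t e) {
    assert(b <= c && c <= d && d <= e && e <= v.size());
    auto it = v.begin();
    std::reverse(it + b, it + c);  // B -> B'
    std::reverse(it + c, it + d);  // C -> C'
    std::reverse(it + d, it + e);  // D -> D'
    std::reverse(it + b, it + e);  // B' C' D' -> D C B
}
```

Every element in the B-D range is touched by exactly two reversals, which matches the 2N count above if a reversal of N elements is counted as N operations.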

TrainedMonkey•1mo ago

Apparently the trick is two std::rotates : https://devblogs.microsoft.com/oldnewthing/20260101-00/?p=11...

As a side note, love how Raymond handled that, no fluff and straight to the point. Beginner's mind and all that.

trjordan•1mo ago

There's something about this that's unsatisfying to me. Like it's just a trivia trick.

My first read of this was "this seems impossible." You're asked to move bits around without any working space, because you're not allowed to allocate memory. I guess you could interpret this pedantically in C/C++ land and decide that they mean no additional usage of the heap, so there's other places (registers, stack, etc.) to store bits. The title is "in constant memory" so I guess I'm allowed some constant memory, which is vaguely at odds with "can you do this without allocating additional memory?" in the text.

But even with that constraint ... std::rotate allocates memory! It'll throw std::bad_alloc when it can't. It's not using it for the core algorithm (... which only puts values on the heap ... which I guess is not memory ...), but that function can 100% allocate new memory in the right conditions.

It's cool you can do this simply with a couple rotates, but it feels like a party trick.

SkiFire13•1mo ago

> But even with that constraint ... std::rotate allocates memory! It'll throw std::bad_alloc when it can't.

This feels kinda crazy. Is there a reason why this is the case?

quuxplusone•1mo ago

That's only for the parallel overload. The ordinary sequential overload doesn't allocate: the only three ordinary STL algorithms that allocate are stable_sort, stable_partition, and (ironically) inplace_merge.

HarHarVeryFunny•1mo ago

No - std::rotate is just doing this with in-place swaps.

Say you have "A1 A2 B1" and want to rotate (swap) adjacent blocks A1-A2 and B1, where WLOG the smaller of these is B1, and A1 is same size as B1.

What you do is first swap B1 with A1 (putting B1 into its final place).

B1 A2 A1

Now recurse to swap A2 and A1, giving the final result:

B1 A1 A2

Swapping same-size blocks (which is what this algorithm always chooses to do) is easy since you can just iterate through both, swapping corresponding pairs of elements. Each block only gets moved once since it gets put into its final place.
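One way to write that block-swap idea down, as an iterative sketch rather than a claim about how any particular standard library implements std::rotate, is shown below. The name `rotate_by_block_swap` and the index-based interface are assumptions for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: rotate [first, last) so that the element at
// `middle` moves to `first`, using only same-size block swaps.
void rotate_by_block_swap(std::vector<int>& v, std::size_t first,
                          std::size_t middle, std::size_t last) {
    assert(first <= middle && middle <= last && last <= v.size());
    auto it = v.begin();
    while (first != middle && middle != last) {
        const std::size_t left = middle - first;
        const std::size_t right = last - middle;
        if (left >= right) {
            // "A1 A2 B1": swap A1 with B1, which puts B1 in its final place.
            std::swap_ranges(it + first, it + first + right, it + middle);
            first += right;  // what remains is "A2 A1", split at middle
        } else {
            // "A B1 B2": swap A with B2, which puts A in its final place.
            std::swap_ranges(it + first, it + middle, it + (last - left));
            last -= left;    // what remains is "B2 B1", split at middle
        }
    }
}
```

The loop replaces the recursion in the description: after each swap the smaller block is finished and the remaining range is a smaller rotation around the same `middle`.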

hacker_homie•1mo ago

You are thinking of std::swap, std::rotate does throw bad_alloc

HarHarVeryFunny•1mo ago

I see it says that it may throw bad_alloc, but it's not clear why, since the algorithm itself (e.g. see "Possible implementation" below) can easily be done in-place.

https://en.cppreference.com/w/cpp/algorithm/rotate.html

I'm wondering if the bad_alloc might be because a single temporary element (of whatever type the iterators point to) is going to be needed to swap each pair of elements, or maybe to allow for an inefficient implementation that chose not to do it in-place?

taeric•1mo ago

To be fair, it originates from a time when memory was tighter. It is discussed with some motivating text in Programming Pearls. I can't remember the context, but I think it was in a text editor. I can look it up, if folks want some of that context here.

osullivj•1mo ago

Also useful for cache locality, a more recent trend. But I guess that's just another slightly different case of tight memory; this time in the cache rather than RAM generally.

HarHarVeryFunny•1mo ago

I did something similar back in the day to support block-move for an editor running on a memory constrained 8-bit micro (BBC Micro). It had to be done in-place since there was no guarantee you'd have enough spare memory to use a temporary buffer, and also more efficient to move each byte once rather than twice (in/out of temp buffer).

wakawaka28•1mo ago

The problem seems less arbitrary if the chunks being rotated are large enough. Implicit in the problem is that any method that would require additional memory to be allocated would probably require memory proportional to the sizes of stuff being swapped. That could be unmanageable.

As for whether std::rotate() uses allocations, I can't say without looking. But I know it could be implemented without allocations. Maybe it's optimal in practice to use extra space. I don't think a method involving reversal of items is generally going to be the fastest. It might be the only practical one in some cases or else better for other reasons.

HarHarVeryFunny•1mo ago

Couldn't this be done in 2 rotates rather than 3 :

A B C D E

A C B D E -- after rotate B, C

A D C B E -- after rotate C-B, D

Complexity would seem to be the same as the reverse method, since every element in the original B-D range is getting moved twice.
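A short sketch of the two-rotate version, using the same A B C D E layout, could look like this. The function name `swap_blocks_two_rotates` and the index parameters are assumptions for illustration: B is `[b, c)`, C is `[c, d)` and D is `[d, e)`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: turn "A B C D E" into "A D C B E" with two rotates.
void swap_blocks_two_rotates(std::vector<int>& v, std::size_t b,
                             std::size_t c, std::size_t d, std::size_t e) {
    assert(b <= c && c <= d && d <= e && e <= v.size());
    auto it = v.begin();
    std::rotate(it + b, it + c, it + d);  // B C   -> C B
    std::rotate(it + b, it + d, it + e);  // C B D -> D C B
}
```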

cbsks•1mo ago

On Linux, if the blocks are page aligned, you could use mremap(2) to swap blocks very efficiently without using any additional physical memory.

throwawayk7h•1mo ago

This should also be possible with [XOR swap](https://en.wikipedia.org/wiki/XOR_swap_algorithm), though you need to do three passes.
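For reference, a three-pass XOR swap over two blocks might be sketched as follows. It only covers the case of two equal-size, non-overlapping blocks; the name `xor_swap_blocks` is an assumption for the example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: swap two equal-size, non-overlapping byte blocks
// with three XOR passes and no temporary buffer.
void xor_swap_blocks(unsigned char* x, unsigned char* y, std::size_t n) {
    assert(x != y);  // XOR-ing a block with itself would zero it
    for (std::size_t i = 0; i < n; ++i) x[i] ^= y[i];  // x = x0 ^ y0
    for (std::size_t i = 0; i < n; ++i) y[i] ^= x[i];  // y = x0
    for (std::size_t i = 0; i < n; ++i) x[i] ^= y[i];  // x = y0
}
```

Each pass reads both blocks and writes one of them, so it does noticeably more memory traffic than a plain swap through a register or small buffer.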

notepad0x90•1mo ago

I was anticipating a SIMD shuffle/permute instruction like vpermq; it won't allocate more RAM per se.

Anyways, here is what google search AI gave me as an example of how that would work (I don't know this stuff well enough myself):

; Assume ymm0 contains [A, B, C, D] (Q0=A, Q1=B, Q2=C, Q3=D)
; The immediate 0xd8 (11 01 10 00 in binary, one two-bit source index per
; destination quadword, lowest bits first) means:
; - Keep Q0 (index 0)
; - Swap Q1 (index 1) with Q2 (index 2) (fields 10 and 01 select sources 2 and 1)
; - Keep Q3 (index 3)
vpermq ymm0, ymm0, 0xd8
; ymm0 now contains [A, C, B, D]
