XOR'ing a register with itself is the idiom for zeroing it out. Why not sub?

https://devblogs.microsoft.com/oldnewthing/20260421-00/?p=112247

227•ingve•2d ago

Comments

nopurpose•2d ago

It amazes me how entertaining Raymond's writing on most mundane aspects of computing often is.

lynndotpy•1d ago

For as much flack Microsoft gets today, they have some of the best people writing about low-level computing. James Mickens writings managed to make me literally laugh-out-loud on these subjects. Chen described him best as "the funniest man in Microsoft Research" ( https://devblogs.microsoft.com/oldnewthing/20131224-00/?p=22... )

NewCzech•2d ago

The obvious answer is that XOR is faster. To do a subtract, you have to propagate the carry bit from the least-significant bit to the most-significant bit. In XOR you don't have to do that because the output of every bit is independent of the other adjacent bits.

Probably, there are ALU pipeline designs where you don't pay an explicit penalty. But not all, and so XOR is faster.

Surely, someone as awesome as Raymond Chen knows that. The answer is so obvious and basic I must be missing something myself?

mikequinlan•2d ago

As TFA says, on x86 `sub eax, eax` encodes to the same number of bytes and executes in the same number of cycles.

whizzter•2d ago

On modern ones, x86 has quite a history and the idiom might carry on from an even older machine.

Edit: Looked at comments, seems like x86 and the major 8bit cpu's had the same speed, pondering in this might be a remnant from the 4-bit ALU times.

abainbridge•1d ago

> seems like x86 and the major 8bit cpu's had the same speed, pondering in this might be a remnant from the 4-bit ALU times.

I think that era of CPUs used a single circuit capable of doing add, sub, xor etc. They'd have 8 of them and the signals propagate through them in a row. I think this page explains the situation on the 6502: https://c74project.com/card-b-alu-cu/

And this one for the ARM 1: https://daveshacks.blogspot.com/2015/12/inside-alu-of-armv1-...

But I'm a software engineer speculating about how hardware works. You might want to ask a hardware engineer instead.

adrian_b•1d ago

Nope.

In any ALU the speed is determined by the slowest operation, so XOR is never faster. It does not matter which is the width of the ALU, all that matters is that an ALU does many kinds of operations, including XOR and subtraction, where the operation done by an ALU is selected by some control bits.

I have explained in another comment that the only CPUs where XOR can be faster than subtraction are the so-called superpipelined CPUs. Superpipelined CPUs have been made only after 1990 and there were very few such CPUs. Even if in superpipelined CPUs it is possible for XOR to be faster than subtraction, it is very unlikely that this feature has been implemented in anyone of the few superpipelined CPU models that have ever been made, because it would not have been worthwhile.

For general-purpose computers, there have never been "4-bit ALU times".

The first monolithic general-purpose processor was Intel 8008 (i.e. the monolithic version of Datapoint 2200), with an 8-bit ISA.

Intel claims that Intel 4004 was the first "microprocessor" (in order to move its priority earlier by one year), but that was not a processor for a general-purpose computer, but a calculator IC. Its only historical relevance for the history of personal computers is that the Intel team which designed 4004 gained a lot of experience with it and they established a logic design methodology with PMOS transistors, which they used for designing the Intel 8008 processor.

Intel 4004, its successors and similar 4-bit processors introduced later by Rockwell, TI and others, were suitable only for calculators or for industrial controllers, never for general-purpose computers.

The first computers with monolithic processors, a.k.a. microcomputers, used 8-bit processors, and then 16-bit processors, and so on.

For cost reduction, it is possible for an 8-bit ISA to use a 4-bit ALU or even just a serial 1-bit ALU, but this is transparent for the programmer and for general-purpose computers there never were 4-bit instruction sets.

deathanatos•1d ago

> In any ALU the speed is determined by the slowest operation, so XOR is never faster.

On a 386, a reg/reg ADD is 2 cycles. An r32 IMUL is "9-38" cycles.

If what you stated were true, you'd be locking XOR's speed to that of DIV. (Or you do not consider MUL/DIV "arithmetic", or something.)

https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/op...

> I have explained in another comment that the only CPUs where XOR can be faster than subtraction are the so-called superpipelined CPUs. Superpipelined CPUs have been made only after 1990 and there were very few such CPUs.

(And I'm choosing 386 to avoid it being "a superpipelined CPU".)

hcs•1d ago

> Or you do not consider MUL/DIV "arithmetic", or something.

Multiplier and divider are usually not considered part of the ALU, yes. Not uncommon for those to be shared between execution threads while there's an ALU for each.

adrian_b•1d ago

386 is a microprogrammed CPU where a multiplication is dome by a long sequence of microinstructions, including a loop that is executed a variable number of times, hence its long and variable execution time.

A register-register operation required 2 microinstructions, presumably for an ALU operation and for writing back into the register file.

Unlike the later 80486 which had execution pipelines that allowed consecutive ALU operations to be executed back-to-back, so the throughput was 1 ALU operation per clock cycle, in 80386 there was only some pipelining of the overall instruction execution, i.e. instruction fetching and decoding was overlapped with microinstruction execution, but there was no pipelining at a lower level, so it was not possible to execute ALU operations back to back. The fastest instructions required 2 clock cycles and most instructions required more clock cycles.

In 80386, the ALU itself required the same 1 clock cycle for executing either XOR or SUB, but in order to complete 1 instruction the minimum time was 2 clock cycles.

Moreover, this time of 2 clock cycles was optimistic, it assumed that the processor had succeeded to fetch and decode the instruction before the previous instruction was completed. This was not always true, so a XOR or a SUB could randomly require more than 2 clock cycles, when it needed to finish instruction decoding or fetching before doing the ALU operation.

In very old or very cheap processors there are no dedicated multipliers and dividers, so a multiplication or division is done by a sequence of ALU operations. In any high performance processor, multiplications are done by dedicated multipliers and there are also dedicated division/square root devices with their own sequencers. The dividers may share some circuits with the multipliers, or not. When the dividers share some circuits with the multipliers, divisions and multiplications cannot be done concurrently.

In many CPUs, the dedicated multipliers may share some surrounding circuits with an ALU, i.e. they may be connected to the same buses and they may be fed by the same scheduler port, so while a multiplication is executed the associated ALU cannot be used. Nevertheless the core multiplier and ALU remain distinct, because a multiplier and an ALU have very distinct structures. An ALU is built around an adder by adding a lot of control gates that allow the execution of related arithmetic operations, e.g. subtraction/comparison/increment/decrement and of bitwise operations. In cheaper CPUs the ALU can also do shifts and rotations, while in more performant CPUs there may be a dedicated shifter separated from the ALU.

The term ALU can be used with 2 different senses. The strict sense is that an ALU is a digital adder augmented with control gates that allow the selection of any operation from a small set, typically of 8 or 16 or 32 operations, which are simple arithmetic or bitwise operations. Before the monolithic processors, computers were made using separate ALU circuits, like TI SN74181+SN74182 or circuits combining an ALU with registers, e.g. AMD 2901/2903.

In the wide sense, ALU may be used to designate an execution unit of a processor, which may include many subunits, which may be ALUs in the strict sense, shifters, multipliers, dividers, shufflers etc.

An ALU in the strict sense is the minimal kind of execution unit required by a processor. The modern high-performance processors have much more complex execution units.

rep_lodsb•1d ago

Most of mul/div was implemented in hardware since the 80186 (and the more or less compatible NEC V30 too). The microcode only loaded the operands into internal ALU registers, and did some final adjustment at the end. But it was still done as a sequence of single bit shifts with add/sub, taking one clock cycle per bit.

FarmerPotato•1d ago

> For general-purpose computers, there have never been "4-bit ALU times".

Well, consider minicomputers made from bit-slices. Those would be 4-bit ALUs with CLA.

What drives me crazy about the 8-bit era is the lack of orthogonality. We're having this whole discussion because they didn't have a ZERO or ONES opcode. In 1972's 74181 chip those were just cases among 48 modes.

adrian_b•1d ago

The minicomputers made with bit-slices had 16-bit ALUs or 32-bit ALUs.

Those 16-bit or 32-bit ALUs were made from 2-bit, 4-bit or 8-bit slices, but this did not matter for the programmer, and it did not matter even for the micro-programmer who implemented the instruction set architecture by writing microcode.

The size of the slices mattered a little for the schematic designer who had to draw the corresponding slices and their interconnections an it mattered a lot for the PCB designer, because each RALU slice (RALU = registers + ALU) was a separate integrated circuit package.

Intel made 2-bit RALU slices (the Intel 3000 series), AMD made 4-bit RALU slices (the 2900 series), which were the most successful on the market. There were a few other 4-bit RALU slices, e.g. the faster ECL 10800 series from Motorola, Later, there were a few 8-bit RALU slices, e.g. from Fairchild and from TI, but by that time the monolithic processors became quickly dominant, so the bit-sliced designs were abandoned.

The width of the slices mattered for cost, size and power consumption, but it did not matter for the architecture of the processor, because the slices were made to be chained into ALUs of any width that was a multiple of the slice width.

arka2147483647•2d ago

> The answer is so obvious

A tangent, but what is Obvious depends on what you know.

Often experts don't explain the things they think are Obvious, but those things are only Obvious to them, because they are the expert.

We should all kind, and explain also the Obvious things those who do not know.

akie•2d ago

"The proof is left as an exercise for the reader" comes to mind

themafia•2d ago

XOR and SUB have had identical cycle counts and latencies since the 8088. That's because you can "look ahead" when doing carries in binary. It's just a matter of how much floorspace on the chip you want to use.

https://en.wikipedia.org/wiki/Carry-lookahead_adder

The only minor difference between the two on x86, really, is SUB sets OF and CF according to the result while XOR always clears them.

asQuirreL•2d ago

A carry lookahead adder makes your circuit depth logarithmic in the width of the inputs vs linear for a ripple carry adder, but that is still asymptotically worse than XORs constant depth.

(But this does not discount the fact that basically all CPUs treat them both as one cycle)

bonzini•1d ago

OF/CF/AF are always cleared anyway by SUB r,r. So there's absolutely no difference.

themafia•1d ago

The point is OF/CF are sometimes dependent on the inputs for SUB. They never are for XOR.

bonzini•1d ago

Ah, you mean in terms of complexity of the calculation. Thanks for clarifying.

In practice AF and CF can be computed from the carry out vector which is already available, and OF is a single XOR (of the two most significant bits of the carry out vector). The same circuitry works for XOR and SUB if the carry out vector of XOR is simply all zeroes.

themafia•1d ago

It also clears any dependence on the state of those flags. Which is probably not useful in practice.

svnt•2d ago

His point is that in x86 there is no performance difference but everyone except his colleague/friend uses xor, while sub actually leaves cleaner flags behind. So he suspects its some kind of social convention selected at random and then propagated via spurious arguments in support (or that it “looks cooler” as a bit of a term of art).

It could also be as a result of most people working in assembly being aware of the properties of logic gates, so they carry the understanding that under the hood it might somehow be better.

3form•2d ago

I think an even more likely explanation would be that x86 assembly programmers often were, or learned from other-architecture assembly programmers. Maybe there's a place where it makes more sense and it can be so attributed. 6502 and 68k being first places I would look at.

richrichardsson•2d ago

For 68k depending on the size you're interested in then it mostly doesn't matter.

.b and .w -> clr eor sub are all identical

for .l moveq #0 is the winner

bonzini•1d ago

6502 doesn't even have register-to-register ALU operations, there's no alternative to LDA #0.

8080/Z80 is probably where XOR A got a lead over SUB A, but they are also the same number of cycles.

zahlman•1d ago

GP seems to think it strange that "x86" would actually not have a performance difference here.

I think this might just be due to not realizing just how far back in CPU history this goes.

wongarsu•1d ago

In a clockless cpu design you'd indeed expect xor to be faster. But in a regular CPU with a clock you either waste a bit of xor performance by making xor and sub both take the same number of ticks, or you speed up the clock enough that the speed difference between xor and sub justifies sub being at least a full tick slower

The former just seems way more practical

dbdr•1d ago

Even if they take the same number of ticks, shouldn't xor fundamentally needing less work also mean it can be performed while drawing less power/heating less, which is just as much an improvement in the long run?

MBCook•1d ago

That wasn’t much of a concern in the 70s and 80s.

phire•2d ago

I'm not actually aware of any CPUs that preform a XOR faster than a SUB. And more importantly, they have identical timings on the 8086, which is where this pattern comes from.

FarmerPotato•1d ago

I'm studying 4-bit-slice processors from the 1970s. This is all tangent to the x86 discussion. Minicomputer processors!

I have two bit-slice machines from TI based on the 74S481 (4-bit slice x 4).

Just like with the 74181, all ALU operations go through the same path, there are just extra gates that make the difference between logical or arithmetic. For instance, for each bit in the slice, the carry path is masked out if logical, but used if arithmetic.

* The XOR operation (logical) is accomplished with A+B but no bits carry. If carry is not masked, you get arithmetic ADD.

* The ZERO or CLEAR operation is (A+A without carry). With carry, A+A is a shift-left.

* The ONES operation forces all the carry chain to 1 (ignoring operand) (you can do a ONES+1 to get arithmetic 0, but why?)

* In the simpler 74181 (4 years earlier) there are 16 operations with 48 logical/arithmetic outcomes. Pick 12 or so for your instruction set. There are some weirdos.

The crazy thing here is that in the TM990/1481 implementation, the microinstruction clock is 15 MHz, and each has a field for number of micro-wait states. This is faster than the '481s max!

Theoretically, if 66ns is sufficient to settle the ALU, a logical operation doesn't need a micro-wait-state. While arithmetic needs one, only because of carry-look-ahead. If I/O buses are activated, then micro-instructions account for setup/hold times. I could be wrong about the details, but that field is there!

It's the only architecture I know of with short and long microinstructions! (The others are like a fixed 4-stage cycle: input valid, ALU valid, store)

phire•1d ago

Thanks, I suspected there might be something from the minicomputer era.

I've only really looked at a single AM2900 implementation (and it was far from optimal). Guess I need to dig deeper at some point.

> The ONES operation forces all the carry chain to 1 (ignoring operand) (you can do a ONES+1 to get arithmetic 0, but why?)

Forcing all carries to 1 inverts the output.

If I'm understanding the ALU correctly, (the datasheet doesn't show that part) it only implements OR and XOR. When combined with the ability to invert both inputs, AND can be implemented as !(!A OR !B), NAND is (!A OR !B) and so on.

Or maybe the ALU implements NOR and XNOR, and all the carry logic is physically inverted from what the documentation says.

FarmerPotato•13h ago

I'll have to rethink what goes on in the ONES operation. The gate level schematic for the 74181 I found in a databook.

feverzsj•2d ago

It's like 0.5 cycles vs 0.9 cycles. So both are 1 cycle, considering synchronization.

pishpash•2d ago

But energy consumption could be different for this hypothetical 0.5 and 0.9.

scheme271•2d ago

Energy consumption wasn't really a concern when the idiom developed. I don't think people really cared about the energy consumption of instructions until well into the x86-64 era.

allenrb•1d ago

Not sure why this is being downvoted, but it’s absolutely correct. For most of the history of computing, people were happy that it worked at all. Being concerned about energy efficiency is a recent byproduct of mobile devices and, even more recently, giant amounts of compute adding up to gigawatts.

pishpash•1d ago

This take is anachronistic. Thermal issues were evident by the late 1990's. Of course by that time not many were working in x86 assembly but embedded systems sure cared about power.

People forget embedded predated mobile by a good 20 years.

imtringued•1d ago

Nintendo's original Game Boy lasted 40 hours on two AA batteries in 1989. You can't reach those numbers without engineering for energy efficiency.

Tepix•2d ago

From TFA:

> It encodes to the same number of bytes, executes in the same number of cycles.

abainbridge•1d ago

Those aren't the only resources. I could imagine XOR takes less energy because using it might activate less circuitry than SUB.

zahlman•1d ago

I'm not aware of any stories in the historical record of "real programmers" optimizing for power use, only for speed or code size.

abainbridge•1d ago

For a few years I worked in the team that wrote software for an embedded audio DSP. The power draw to do something was normally more important than the speed. Eg when decoding MP3 or SBC you probably had enough MIPS to keep up with the stream rate, so the main thing the customers cared about was battery life. Mostly the techniques to optimize for speed were the same as those for power. But I remember being told that add/sub used less power than multiply even though both were single cycle. And that for loops with fewer than 16 instructions used less power because there was a simple 16 instruction program memory cache that saved the energy required to fetch instructions from RAM or ROM. (The RAM and ROM access was generally single cycle too).

Nowadays, I expect optimizations that minimize energy consumption are an important target for LLM hosts.

toast0•1d ago

Sibling posted a good example. But I know of (without details) things where you have to insert nops to keep peak power down, so the system doesn't brown out (in my experience, the 68hc11 won't take conditional branches if the power supply voltage dips too far; but I didn't work around that, I just made sure to use fresh batteries when my code started acting up). Especially during early boot.

Apple got in a lot of trouble for reducing peak power without telling people, to avoid overloading dying batteries.

ranger_danger•1d ago

Aerospace.

jojobas•2d ago

The non-obvious bit is why there isn't an even faster and shorter "mov <register>,0" instructions - the processors started short-circuiting xor <register>,<register> much later.

drob518•1d ago

Right, like a “set reg to zero” instruction. One byte. Just encodes the operation and the reg to zero. I’m surprised we didn’t have it on those old processors. Maybe the thinking was that it was already there: xor reg,reg.

bonzini•1d ago

One byte instructions, with 8 registers as in the 8086, waste 8 opcodes which is 3% of the total. There are just five: "INC reg", "DEC reg", "PUSH reg", "POP reg", "XCHG AX, reg" (which is 7 wasted opcodes instead of 8, because "XCHG AX, AX" doubles as NOP).

One-byte INC/DEC was dropped with x86-64, and PUSH/POP are almost obsolete in APX due to its addition of PUSH2/POP2, leaving only the least useful of the five in the most recent incantation of the instruction set.

drob518•1d ago

I’m not sure I understand what you mean by “waste 8 opcodes.”

bonzini•1d ago

They occupy 8 of the possible 256 byte values. Together, those five cases used about 15% of the space.

Though I was forgetting one important case: MOV r,imm also used one-byte opcodes with the register index embedded. And it came in byte and word variants, so it used a further 16 opcodes bytes for a total of 56 one byte opcodes with register encoding.

drob518•1d ago

Gotcha, thanks for clarifying. I was reacting to the word “waste” I guess. Surely, as you say, it consumes that opcode encoding space. Whether that’s a waste or not depends on a lot of other things, I suppose. I wasn’t necessarily thinking x86-specific in my original comment. But yea, if you try to zero every possible register and half-word register you would definitely consume lots of encoding space.

LegionMammal978•1d ago

Traditionally in x86, only the first byte is the opcode used to select the instruction, and any further bytes contain only operands. Thus, since there exist 256 possible values for the initial byte, there are at most 256 possible opcodes to represent different instructions.

So if you add a 1-byte instruction for each register to zero its value, that consumes 8 of the possible 256 opcodes, since there are 8 registers. Traditional x86 did have several groups of 1-byte instructions for common operations, but most of them were later replaced with multibyte encodings to free up space for other instructions.

GuB-42•1d ago

There are only 256 1-byte opcodes or prefixes available, if you take 8 of these to zero registers, they won't be available for other instruction, and unless you consider zeroing to be so important that they really need their 1-byte opcodes, it is redundant since you can use the 2-byte "xor reg,reg" instead, hence the "waste'.

In addition, you would need 16 opcodes, not 8, if you also wanted to cover 8 bit registers (AH/AL,...).

Special shout-out to the undocumented SALC instruction, which puts the carry flag into AL. If you know that the carry will be 0, it is a nice sizecoding trick to zero AL in 1 byte.

gpderetta•1d ago

special mov 0 instruction times 8 registers. The opcode space, especially 1 byte opcode space, is precious so encoding redundant operations is wasteful.

flohofwoe•1d ago

Instruction slots are extremely valuable in 8-bit instruction sets. The Z80 has some free slots left in the ED-prefixed instruction subset, but being prefix-instructions means they could at best run at half speed of one-byte instructions (8 vs 4 clock cycles).

HarHarVeryFunny•1d ago

A number of the RISC processors have a special zero register, giving you a "mov reg, zero" instruction.

Of course many of the RISC processors also have fixed length instructions, with small literal values being encoded as part of the instruction, so "mov reg, #0" and "mov reg, zero" would both be same length.

defmacr0•1d ago

In x86, a basic immediate instruction with a 1 Byte immediate value is encoded like this:

<op> (1 Byte opcode), <Registers> (1 Byte), <immediate value> (1 Byte)

While xor eax, eax only uses 2 bytes. Since there are only 8 registers, meaning they can be encoded with 3 bits, you can pack two values into the <Registers> field (ModR/M).

Making mov eax, 0 only take two bytes would require significant changes of the ISA to allow immediate values in the ModR/M byte (or similar) but there would be little benefit since zeroing can already be done in 2 bytes and I doubt that other cases are even close to frequent enough for this to be any significant benefit overall. An actual improvement would be if there was a dedicated 1 Byte set-rax-to-0 instruction, but obviously that comes at a tradeoff where we have to encode another operation differently (probably with more bytes) again (and you can't zero anything else with it).

https://wiki.osdev.org/X86-64_Instruction_Encoding

https://pyokagan.name/blog/2019-09-20-x86encoding/

rep_lodsb•1d ago

Some other architectures like PDP-11 and 680x0 had a dedicated "clear register" instruction.

It could have been added to x86, even as a group of single-byte opcodes with the register encoded in three bits (as with PUSH, POP, and INC/DEC outside of long mode). But the XOR idiom was already established on the 8080 by that point.

virexene•2d ago

The operation is slightly more complex yes, but has there ever been an x86 CPU where SUB or XOR takes more than a single CPU cycle?

praptak•2d ago

I wonder if you could measure the difference in power consumption.

I mean, not for zeroing because we know from the TFA that it's special-cased anyway. But maybe if you test on different registers?

flohofwoe•2d ago

That comment is not very useful without pointing to realworld CPUs where SUB is more expensive than XOR ;)

E.g. on Z80 and 6502 both have the same cycle count.

brigade•2d ago

Cortex A8 vsub reads the second source register a cycle earlier than veor, so that can add one cycle latency

Not scalar, but still sub vs xor. Though you’d use vmov immediate for zeroing anyway.

GoblinSlayer•1d ago

Harvard Mark I? Not sure why people think programming started with Z80.

flohofwoe•1d ago

My WW2-era assembly is a bit rusty, but I don't think the Harvard Mark 1 had bitwise logical operations?

bonzini•1d ago

The article is about x86, and x86 assembly is mostly a superset of 8080 (which is why machine language numbers registers as AX/CX/DX/BX, matching roughly the function of A/BC/DE/HL on the 8080—in particular with respect to BX and HL being last).

GoblinSlayer•21h ago

So you say x86 wasn't made ex nihilo, but evolved from previous designs? When this evolution began? 8080 followed 8008, code for which was written in macro-11 https://en.wikipedia.org/wiki/PDP-11_architecture#Example_co...

HarHarVeryFunny•1d ago

The 6502 doesn't support XOR A or SUB A, and in fact doesn't have a SUB opcode at all, only SBC (subtract with carry, requiring an extra opcode to set the carry flag beforehand).

flohofwoe•1d ago

I was handwaving over the details, SBC is identical to SUB when the carry flag is clear, so it's understandable why the 6502 designers didn't waste an instruction slot.

EOR and SBC still have the same cycle counts though.

HarHarVeryFunny•1d ago

Sure, in some contexts you would know that the carry flag was set or clear (depending on what you needed), and it was common to take advantage of that and not add an explicit clc or sec, although you better comment the assumption/dependency on the preceding code.

However the 6502 doesn't support reg-reg ALU operations, only reg-mem, so there simply is no xor a,a or sbc a,a support. You'd either have to do the explicit lda #0, or maybe use txa/tya if there was a free zero to be had.

em3rgent0rdr•1d ago

With more bits, then SUB is going to be more and more expensive to fit in the same number of clocks as XOR. So with an 8-bit CPU like Z80, it probably makes design sense to have XOR and SUB both take one cycle. But if for instance a CPU uses 128-bit registers, then the propagate-and-carry logic for ADD/SUB might take way much longer than XOR that the designers might not try to fit ADD/SUB into the same single clock cycle as XOR, and so might instead do multi-cycle pipelined ADD/SUB.

A real-world CPU example is the Cray-1, where S-Register Scalar Operations (64-bit) take 3 cycles for ADD/SUB but still only 1 cycle for XOR. [1]

[1] https://ed-thelen.org/comp-hist/CRAY-1-HardRefMan/CRAY-1-HRM...

billpg•2d ago

I had a similar reaction when learning 8086 assembly and finding the correct way to do `if x==y` was a CMP instruction which performed a subtraction and set only the flags. (The book had a section with all the branch instructions to use for a variety of comparison operators.) I think I spent a few minutes experimenting with XOR to see if I could fashion a compare-two-values-and-branch macro that avoided any subtraction.

rep_lodsb•1d ago

Comparing for equality can use either SUB or XOR: it sets the zero flag if (and only if) the two values are equal. That's why JE/JNE (jump if equal/not equal) is an alias for JZ/JNZ (jump if zero/not zero).

There's also the TEST instruction, which does a logical AND but without storing the result (like CMP does for SUB). This can be used to test specific bits.

Testing a single register for zero can be done in several ways, in addition to CMP with 0:

    TEST AX,AX
    AND  AX,AX
    OR   AX,AX
    INC  AX    followed by DEC AX (or the other way around)

The 8080/Z80 didn't have TEST, but the other three were all in common use. Particularly INC/DEC, since it worked with all registers instead of just the accumulator.

Also any arithmetic operation sets those flags, so you may not even need an explicit test. MOV doesn't set flags however, at least on x86 -- it does on some other architectures.

defmacr0•2d ago

I would be surprised if modern CPUs didn't decode "xor eax, eax" into a set of micro-ops that simply moves from an externally invisible dedicated 0 register. These days the x86 ISA is more of an API contract than an actual representation of what the hardware internals do.

brigade•2d ago

Zero micro ops to be precise, that’s handled entirely at the register rename stage with no data movement.

defrost•2d ago

From TFA:

  The predominance of these idioms as a way to zero out a register led Intel to add special xor r, r-detection and sub r, r-detection in the instruction decoding front-end and rename the destination to an internal zero register, bypassing the execution of the instruction entirely. You can imagine that the instruction, in some sense, “takes zero cycles to execute”.

rasz•1d ago

"rename the destination to an internal zero register"

That would be quite late then, 1997 Pentium 2 for general population.

bahmboo•2d ago

Because he is explicitly talking about x86 - maybe you missed that.

adrian_b•1d ago

XOR is faster when you do that alone in an FPGA or in an ASIC.

When you do XOR together with many other operations in an ALU (arithmetic-logical unit), the speed is determined by the slowest operation, so the speed of any faster operation does not matter.

This means that in almost all CPUs XOR and addition and subtraction have the same speed, despite the fact that XOR could be done faster.

In a modern pipelined CPU, the clock frequency is normally chosen so that a 64-bit addition can be done in 1 clock cycle, when including all the overheads caused by registers, multiplexers and other circuitry outside the ALU stages.

Operations more complex than 64-bit addition/subtraction have a latency greater than 1 clock cycle, even if one such operation can be initiated every clock cycle in one of the execution pipelines.

The operations less complex than 64-bit addition/subtraction, like XOR, are still executed in 1 clock cycle, so they do not have any speed advantage.

There have existed so-called superpipelined CPUs, where the clock frequency is increased, so that even addition/subtraction has a latency of 2 or more clock cycles.

Only in superpipelined CPUs it would be possible to have a XOR instruction that is faster than subtraction, but I do not know if this has ever been implemented in a real superpipelined CPU, because it could complicate the execution pipeline for negligible performance improvements.

Initially superpipelining was promoted by DEC as a supposedly better alternative to the superscalar processors promoted by IBM. However, later superpipelining was abandoned, because the superscalar approach provides better energy efficiency for the same performance. (I.e. even if for a few years it was thought that a Speed Demon beats a Brainiac, eventually it was proven that a Brainiac beats a Speed Demon, like shown in the Apple CPUs)

While mainstream CPUs do not use superpipelining, there have been some relatively recent IBM POWER CPUs that were superpipelined, but for a different reason than originally proposed. Those POWER CPUs were intended for having good performance only in multi-threaded workloads when using SMT, and not in single-thread applications. So by running simultaneous threads on the same ALU the multi-cycle latency of addition/subtraction was masked. This technique allowed IBM a simpler implementation of a CPU intended to run at 5 GHz or more, by degrading only the single-thread performance, without affecting the SMT performance. Because this would not have provided any advantage when using SMT, I assume that in those POWER CPUs XOR was not made faster than subtraction, even if this would have theoretically been possible.

imtringued•1d ago

Superpipelining doesn't work in practice because you can only save the timing slack left over in the pipelined architecture. If you're running the CPU twice as fast but basic operations now take twice as long, all you've done is double the book keeping cost, which is the energy intensive part of a CPU, while having gained a small performance increase in the few cases where a quick 1 cycle instruction finishes faster than a slow 1 cycle instruction.

Energy efficiency is usually better. There are countless ways to translate energy efficiency into higher performance.

bialpio•1d ago

From TFA:

The predominance of these idioms as a way to zero out a register led Intel to add special xor r, r-detection and sub r, r-detection in the instruction decoding front-end and rename the destination to an internal zero register, bypassing the execution of the instruction entirely.

TacticalCoder•1d ago

> The obvious answer is that XOR is faster.

It used to be not only faster but also smaller. And back then this mattered.

Say you had a computer running at 33 Mhz, you had 33 million cycles per second to do your stuff. A 60 Hz game? 33 million / 60 and suddenly you only have about 500 000 cycles per frame. 200 scanlines? Suddenly you're left with only 2500 cycles per scanline to do your stuff. And 2500 cycles really isn't that much.

So every cycle counted back then. We'd use the official doc and see how many cycles each instruction would take. And we'd then verify by code that this was correct too. And memory mattered too.

XOR was both faster and smaller (less bytes) then a MOV ..., 0.

Full stop.

And when those CPU first began having cache, the cache were really tiny at first: literally caching ridiculously low number of CPU instructions. We could actually count the size of the cache manually (for example by filling with a few NOP instructions then modifying them to, say, add one, and checking which result we got at the end).

XOR, due to being smaller, allowed to put more instructions in the cache too.

Now people may lament that it persisted way long after our x86 CPUs weren't even real x86 CPUs anymore and that is another topic.

But there's a reason XOR was used and people should deal with it.

We zero with XOR EAX,EAX and that's it.

zahlman•1d ago

The context was comparison to SUB EAX,EAX, not to a MOV.

drob518•1d ago

Yea, that’s what immediately went through my head, too. XOR is ALWAYS going to be single cycle because it’s bit-parallel.

Sharlin•1d ago

And SUB is also always a single cycle on any practically useful architecture since the 70s. Theoretical archs where SUB might be slower than XOR don't matter.

Someone•1d ago

> To do a subtract, you have to propagate the carry bit from the least-significant bit to the most-significant bit.

Yes, but that need not scale linearly with the number of bits. https://en.wikipedia.org/wiki/Carry-lookahead_adder:

“A carry-lookahead adder (CLA) or fast adder is a type of electronics adder used in digital logic. A carry-lookahead adder […] can be contrasted with the simpler, but usually slower, ripple-carry adder (RCA), for which the carry bit is calculated alongside the sum bit, and each stage must wait until the previous carry bit has been calculated to begin calculating its own sum bit and carry bit. The carry-lookahead adder calculates one or more carry bits before the sum, which reduces the wait time to calculate the result of the larger-value bits of the adder.

[…]

Already in the mid-1800s, Charles Babbage recognized the performance penalty imposed by the ripple-carry used in his difference engine, and subsequently designed mechanisms for anticipating carriage for his never-built analytical engine.[1][2] Konrad Zuse is thought to have implemented the first carry-lookahead adder in his 1930s binary mechanical computer, the Zuse Z1.”

I think most, if not all, current ALUs implement such adders.

dreamcompiler•1d ago

Carry lookahead is definitely faster than ripple carry but it's not free. It requires high-fan-in gates that take up a fair amount of silicon. That silicon saves time though, so as you say almost nobody uses ripple carry any more.

Symmetry•1d ago

There's a structure called a carry-bypass adder[1] that lets you add two numbers in O(√n) time for only O(n) gates. That or a similar structure is what modern CPUs use and they allow you two add two numbers in a single clock cycle which is all you care about from a software perspective.

There are also tree adders which add in O(log(n)) time but use O(n^2) gates if you really need the speed, but AFAIK nobody actually does need to.

[1]https://en.wikipedia.org/wiki/Carry-skip_adder

anematode•2d ago

My favorite (admittedly not super useful) trick in this domain is that sbb eax, eax breaks the dependency on the previous value of eax (just like xor and sub) and only depends on the carry flag. arm64 is less obtuse and just gives you csetm (special case of csinv) for this purpose.

not_a_bijection•1d ago

That's even more useful because of x86's braindamanged "setcc", which only affects the lowest byte of the destination, AFAIR, and so always has to be combined with a zeroing idiom before the setcc or a zero extension after it in practice.

defrost•2d ago

  Once an instruction has an edge, even if only extremely slight, that’s enough to tip the scales and rally everyone to that side.

And this, interestingly, is why life on earth uses left-handed amino acids and right-handed sugars .. and why left handed sugar is perfect for diet sodas.

praptak•2d ago

You still need to explain why this case creates a positive feedback loop rather than a negative one. I mean left/right fuel intakes in cars and male/female ratios somehow tend to balance at 50/50.

defrost•2d ago

Wrt amino acids and sugars I personally don't have to explain as a good many others have already.

eg: For one, Isaac Asimov in the 1970s wrote at length on this in his role as a non fiction science writer with a Chemistry Phd

> male/female ratios somehow tend to balance at 50/50.

This is different to the case of actual right handed dominance in humans and to L- Vs R- dominance in chirality ...

( Men and women aren't actual mirror images of each other ... )

phenol•1d ago

products of an asymmetric reaction performed without enantiomeric control can selectively catalyse the formation of more products with the same handedness -- this is called autocatalysis. so the first full reaction might produce a left-handed product (by chance) but that left-handed product will then cause future products to be preferentially left-handed. see the [Soai reaction](https://en.wikipedia.org/wiki/Soai_reaction?wprov=sfla1) for an example of this.

as mentioned by others this is conjectural but it is a popular (if somewhat unfalsifiable) explanation for homochirality

ben_w•1d ago

Regarding gender ratios: https://en.wikipedia.org/wiki/Fisher's_principle

There's exceptions, but they tend to be colonial animals in the broadest sense e.g. how clownfish males are famously able to become female but each group has one breeding male and one breeding female at any given time*, or bees where the males (drones) are functionally flying sperm and there's only one fertile female in any given colony; or some reptiles which have a temperature-dependent sex determination that may have been 50/50 before we started causing rapid climate change but in many cases isn't now: https://en.wikipedia.org/wiki/Temperature-dependent_sex_dete...

* Wolves, despite being where nomenclature of "alpha" comes from, are not this. The researcher who coined the term realised they made a mistake and what he thought of as the "alpha" pair were simply the parents of the others in that specific situation: https://davemech.org/wolf-news-and-information/

bonzini•1d ago

Temperature-dependent sex determination may not be at equilibrium now but is not an exception to Fisher's principle. The temperature at which sex determination switches is variable based on the parent's genes, and it will try to re-equilibrate with the environment temperature to obtain 1:1 ratios just like in other animals.

ben_w•1d ago

Indeed, that is why I wrote "may have been 50/50 before we started causing rapid climate change".

bonzini•1d ago

It's still not a violation of Fisher's principle, long term we would see natural selection move the threshold temperature upwards.

tbrownaw•1d ago

> left/right fuel intakes in cars

Are I believe chosen by intelligent humans who are deliberately trying to keep the lines at gas stations balanced.

NetMageSCW•1d ago

As someone with a right side fuel intake, that’s certainly isn’t true in the US. Left side fuel intake dominates completely and when the 8 pump station I prefer is busy, I only ever see left hand intake cars being fueled from the “wrong” side.

JuniperMesos•2d ago

This is a hypothesis about why the chirality of life on earth is what it is, but I don't think there's enough evidence to state that this (or any competing hypothesis) is definitely the correct explanation.

defrost•1d ago

Well "definitely correct" has no real place in probabilistic arguments almost by ipso factum absurdum :-)

The chirality argument made is more akin to dynamic systems balance; yes, you can balance a pencil on its point .. but given a bit of random tilt one way or the other it's going to tend to keep going and end near flat on the table.

saati•1d ago

> and why left handed sugar is perfect for diet sodas

If you want to get diarrhea.

Sweepi•2d ago

"Bonus bonus chatter: The xor trick doesn’t work for Itanium because mathematical operations don’t reset the NaT bit. Fortunately, Itanium also has a dedicated zero register, so you don’t need this trick. You can just move zero into your desired destination."

Will remember for the next time I write asm for Itanium!

shawn_w•2d ago

Quite a few architectures have a dedicated 0 register.

signa11•2d ago

indeed. riscv for instance. also, afaik, xor’ing is faster. i would assume that someone like mr. raymond would know…

IshKebab•2d ago

> afaik, xor’ing is faster

Even tiny tiny CPUs can do sub in one cycle, so I doubt that. On super-scalar CPUs xor and sub are normally issued to the same execution units so it wouldn't make a difference there either.

tliltocatl•2d ago

On superscalars running xor trick as is would be significantly slower because it implies a data dependency where there isn't one. But all OOO x86's optimize it away internally.

IshKebab•1d ago

Sub has the same false data dependency.

pif•2d ago

Which part of "mathematical operations don’t reset the NaT bit" did you not understand?

lynguist•2d ago

Indeed!!

MIPS - $zero

RISC-V - x0

SPARC - %g0

ARM64 - XZR

classichasclass•1d ago

PowerPC: "r0 occasionally" (with certain instructions like addi, though this might be better considered an edge case of encoding)

matja•1d ago

Alpha: r31, f31

Findecanor•1d ago

On 64-bit ARM, the same register number is XZR in some instructions and the stack pointer in others.

repelsteeltje•2d ago

Yep. The XOR trick - relying on special use of opcode rather than special register - is probably related to limited number of (general purpose) registers in typical '70 era CPU design (8080, 6502, Z80, 8086).

bonzini•1d ago

A move on SPARC is technically an OR of the source with the zero register. "move %l0, %l1" is assembled as "or %g0, %l0, %l1". So if you want to zero a register you OR %g0 with itself.

classichasclass•1d ago

Unfortunately, 6502 can't XOR the accumulator with itself. I don't recall if the Z80 can, and loading an immediate 0 would be most efficient on those anyway.

bonzini•1d ago

The Z80 can do either LD A,0 or SUB A or XOR A, but the LD is slower due to the extra memory cycle to load the second byte of the instruction.

blywi•1d ago

XOR A absolutely works on Z80 and it's of course faster and shorter than loading a zero value with LD A,0. LD A,0 is encoded to 2 bytes while XOR A is encoded as a single opcode. XOR A has the additional benefit to also clear all the flags to 0. Sub A will clear the accumulator, but it will always set the N flag on Z80.

classichasclass•1d ago

Ah, thanks, I couldn't recall off the top of my head.

eichin•1d ago

Yeah, the article seems to have missed the likely biggest reason that this is the popular x86 idiom - that it was already the popular 8080/Z80 idiom from the CP/M era, and there's a direct line (and a bunch of early 8086 DOS applications were mechanically translated assembly code, so while they are "different" architectures they're still solidly related.)

dmitrygr•1d ago

should set Z too

repelsteeltje•1d ago

You're absolutely right, I stand corrected.

The 6502 gets by doing immediate load: 2 clock cycles, 2 bytes (frequently followed by single byte register transfer instruction). Out of curiosity I did a quick scan of the MOS 1.20 rom of the BBC micro:

  LDY #0 (a0 00): 38 hits
  LDX #0 (a2 00): 28 hits
  LDA #0 (a9 00): 48 hits

tom_•1d ago

Are you sure you're not an LLM? There is no way anybody writing 6502 would do anything else, because there's no other way to do it.

(You can squeeze in a cheeky Txx instruction afterwards to get a 2-or-more-for-1, if that would be what you need - but this only saves bytes. Every instruction on the 6502 takes 2+ cycles! You could have done repeated immediate loads. The cycle count would be the same and the code would be more general.)

repelsteeltje•19h ago

> Are you sure you're not an LLM?

Hard to tell, but I don't think so ;-)

I suppose using Txx instructions rather than LDx is more of an idiom than intended to conserve space. Also, could an LDx #0 potentially be 3 cycles in the edge case where the PC crosses a page boundary? (I'm probably confused? Red herring?)

tom_•8h ago

I don't know how the 6502's PC increment actually worked, but it was an exception to the general rule of page crossings (or the possibility thereof) incurring a penalty, or, as was also sometimes the case, just ignored entirely. (One big advantage of the latter approach: doing nothing does take 0 cycles.)

The full 16 bits would be incremented after each instruction byte fetched, and it didn't cost any extra if there was a carry out of the MSB.

wongarsu•1d ago

And [as mentioned in the article] even modern x86 implementations have a zero register. So you have this weird special opcode that (when called with identical source and destination) only triggers register renaming

monocasa•1d ago

Very few architectures have a NAT bit though.

dlcarrier•1d ago

It would probably run really fast, considering that Itanium's downfall was the difficulty in compiling. (Including translating x86 instructions into Itanium instructions)

tliltocatl•1d ago

Not really. Itanium was a result of some people at Intel being obsessed by LINPACK benchmarks and forgetting everything else. It sucked for random memory access, and hence everything that's not floating-point number-crunching. Compiler can't hide memory access latency because it's fundamentally unpredictable. VLIW does magic for floating-point latency (which is predictable), but

- As transistors got smaller, FP performance increased, memory latency stayed the same (or even increased).

- If you are doing a lot of floating point, you are probably doing array processing, so might as well go for a GPU or at least SIMD).

- Low instruction density is bad for I-cache. Yes, RISC fans, density matters! And VLIW is an absolute disaster in that regard. Again, this is less visible in number-crunching loads where the processor executes relatively small loops many times over.

fjjfnrnr•1d ago

Naive question: shouldn't vliw be beneficial to memory access, since each instruction does quite a lot of work, thus giving the memory time to fetch the next instruction?

tliltocatl•1d ago

- Even each instruction does a lot of work, it is supposed to do it in parallel, so time available to fetch the next instruction is (supposed to be) the same.

- Not everything is parallelisable so most of instructions words end up full of NOPs.

- The real problem are data reads. Instruction fetches are fairly predictable (and when they aren't OOO suck just as much), data reads aren't. An OOO can do something else until the data comes in. VLIV, or any in-order architecture, must stall as soon as a new instruction depends on the result of the read.

dlcarrier•1d ago

Short loops and lots of math is what I'm talking about; the kind of thing that hand-written assembly language helps with on even modern hardware.

tliltocatl•2d ago

It might be because XOR is rarely (in terms of static count, dynamically it surely appears a lot in some hot loops) used for anything else, so it is easier to spot and identify as "special" if you are writing manual assembly.

stingraycharles•2d ago

And helps with SMT

Edit: this is apparently not the case, see @tliltocatl's comment down the thread

tliltocatl•2d ago

What's SMT in this context?

recursivecaveat•2d ago

Simultaneous Multi-Threading (hyper-threading as Intel calls it). I'm not a cpu guy, but I think the ALU used for subtraction would be a more valuable resource to leave available to the other thread than whatever implements a xor. Hence you prefer to use the xor for zeroing and conserve the ALU for other threads to use.

tliltocatl•2d ago

I don't think that's how it works.

- Normally ALU implements all "light" operations (i. e. add/sub/and/or/xor) in a single block, separating them would result in far more interconnect overhead. Often, CPUs have specialized adder-only units for address generation, but never a xor-specialized block.

- All CPUs that implement hyper-threading also optimize a XOR EAX,EAX into MOV EAX,ZERO/SET FLAGS (where ZERO is an invisible zero register just like on Itanium and RISCs). This helps register renaming and eliminates a spurious dependency.

- The XOR trick is about as old as 8086 if not older.

Symmetry•1d ago

Right. Keeping down the number of slots the scheduler and bypass network need to worry about is an important design pressure.

fredoralive•2d ago

By the time you get to a CPU complex enough to be to have SMT it is likely to detect these “clear register” patterns and special case them.

XOR would also be handled by the ALU, the L is for logic.

IshKebab•1d ago

Most CPU use the same ALU for xor and sub.

kunley•2d ago

XOR appears a lot in any code touching encryption.

PS. What is static vs dynamic count?

tliltocatl•2d ago

Static count - how many times an instruction appears in a binary (or assembly source).

Dynamic count - how many times an opcode gets executed.

I. e. an instruction that doesn't appear often in code, but comes up in some hot loops (like encryption) would have low static and high dynamic.

bonzini•1d ago

Indeed this is the best explanation!

rasz•2d ago

Looking at some random 1989 Zenith 386SX bios written in assembly so purely programmer preferences:

8 'sub al, al', 14 'sub ah, ah', 3 'sub ax, ax'

26 'xor al, al', 43 'xor ah, ah', 3 'xor ax, ax'

edit: checked a 2010 bios and not a single 'sub x, x'

pishpash•2d ago

Could be used to express 1 bit of information in some non-obvious convention.

adrian_b•1d ago

The x86-64 ISA provides a lot of alternative encodings for the same instruction or for instructions that are equivalent.

It has already been suggested to use these for steganography, i.e. for embedding a hidden message in a binary executable file, by encoding 1 or more bits in the choice of the instruction encoding among alternatives, for every instruction for which alternatives exist.

ralferoo•21h ago

The shareware assembler a86 used to use this to fingerprint its output so the author could check whether random programs to see if they were assembled using it without having paid the shareware fee.

empiricus•2d ago

The hw implementation of xor is simpler than sub, so it should consume slightly less energy. Wondering how much energy was saved in the whole world by using xor instead of sub.

flohofwoe•2d ago

I doubt any of that is measurable, since all ALU operations are usually implemented with the same logic (e.g. see https://www.righto.com/2013/09/the-z-80-has-4-bit-alu-heres-...)

croes•1d ago

I guess everything what was saved was burned by the first useless image created per AI

Symmetry•1d ago

For a 32 bit number you're looking at going from using 256 to ~1800 transistors in the operation itself. A modern core will have roughly 1,000,000,000 transistors. Some of those are for vector operations that aren't involved in a xor or sub, but most of them are for allowing the core to extract more parallelism from the instruction stream. It's really just a dust mote compared to the power reduction you could get by, e.g., targeting a 10 MHz lower clock rate.

jhoechtl•2d ago

Back in the stone ages XOR ing was just 1 byte of opcode. Habbits stick. In effect XORing is no longer faster since a long time.

dragontamer•2d ago

The XOR trick is implemented as a (malloc from register file) on modern processors, implemented in the decoder and it won't even issue a uOp to the execution pipelines.

Its basically free today. Of course, mov RAX, 0 is also free and does the same thing. But CPUs have limited decoder lengths per clock tick, so the more instructions you fit in a given size, the more parallel a modern CPU can potentially execute.

So.... definitely still use XOR trick today. But really, let the compiler handle it. Its pretty good at keeping track of these things in practice.

-----------

I'm not sure if "sub" is hard-coded to be recognized in the decoder as a zero'd out allocation from the register file. There's only certain instructions that have been guaranteed to do this by Intel/AMD.

mmozeiko•1d ago

sub is also recognized as zeroing idiom for register file. Intel documents these in "3.5.1.7 Clearing Registers and Dependency Breaking Idioms" from Optimization Reference Manual: https://www.intel.com/content/www/us/en/developer/articles/t...

Here's html version: https://zzqcn.github.io/perf/intel_opt_manual/3.html#clearin...

AMD has similar list in "2.9.2 Idioms for Dependency removal" from "Software Optimization Guide for the AMD Zen5 Microarchitecture" document: https://docs.amd.com/v/u/en-US/58455_1.00

flohofwoe•2d ago

Depending on what's stone-age for you, a SUB with a register was also only one byte, and was the same cost as XOR, at least in the Intel/Zilog lineage all the way back to the 70s ;)

NetMageSCW•1d ago

The article’s point is about why XOR is preferred over SUB, both being one byte.

MOV is right out.

drfuchs•2d ago

Relatedly, there's a steganographic opportunity to hide info in machine code by using "XOR rax,rax" for a "zero" and "SUB rax,rax" for a "one" in your executable. Shouldn't be too hard to add a compiler feature to allow you to specify the string you want encoded into its output.

gynvael•1d ago

This sounds like a Paged Out article ;)

defmacr0•1d ago

https://www.cs.columbia.edu/~angelos/Papers/hydan.pdf

Here's some more prior art

not_a_bijection•1d ago

You can do better. X86 has both "op [mem], reg" and "op reg, [mem]" variants of most instructions, where "[mem]" can be a register too. So you have two ways to encode "xor eax, eax", differing by which of the operands is in the "possible memory operand" slot, the source or the destination.

mpeg•1d ago

This one would be a fun challenge in a ctf, or maybe more appropriate for a puzzle hunt – most people would look at the dissassembly and not at the actual bytes and completely miss the binary encoding

zzo38computer•1d ago

Some disassembly listings will also include the actual bytes (there are multiple reasons why you will want this).

EvanAnderson•1d ago

That could be a style metric, too. Time spent reversing MS-DOS viruses in my youth showed me assembler programmers very clearly have styles to their code. It's too weak for definitive attribution but it was interesting to see "rhymes" between, for example, the viruses written by The Dark Avenger.

enduku•2d ago

I ran into this rabbithole while writing an x86-64 asm rewriter.

xor was the default zeroing idiom.I onkly did sub reg,reg when I actually want its flags result. Otherwise the main rule is: do not touch either form unless flags liveness makes the rewrite obviously safe. Had about 40 such idioms for the passes.

b1temy•2d ago

Back when I was in university, one of the units touching Assembly[0] required students to use subtraction to zero out the register instead of using the move instruction (which also worked), as it used fewer cycles.

I looked it up afterwards and xor was also a valid instruction in that architecture to zero out a register, and used even fewer cycles than the subtraction method; but it was not listed in the subset of the assembly language instructions we were allowed to use for that unit. I suspect that it was deemed a bit off-topic, since you would need to explain what the mathematical XOR operation was (if you didn't already learn about it in other units), when the unit was about something else entirely- but everyone knows what subtraction is, and that subtracting a number by itself leads to zero.

[0] Not x86, I do not recall the exact architecture.

adrian_b•1d ago

It should be noted that XOR is just (bitwise) subtraction modulo 2.

There are many kinds of SUB instructions in the x86-64 ISA, which do subtraction modulo 2^64, modulo 2^32, modulo 2^16 or modulo 2^8.

To produce a null result, any kind of subtraction can be used, and XOR is just a particular case of subtraction, it is not a different kind of operation.

Unlike for bigger moduli, when operations are done modulo 2 addition and subtraction are the same, so XOR can be used for either addition modulo 2 or subtraction modulo 2.

gblargg•1d ago

> XOR is just a particular case of subtraction, it is not a different kind of operation.

It's different in that there's no carry propagation.

adrian_b•1d ago

That is not a property specific to XOR.

Whenever you do addition/subtraction modulo some power of two, the carry does not propagate over the boundaries that correspond to the size of the modulus.

For instance, you can make the 128-bit register XMM1 to be zero in one of the following ways:

  PXOR  XMM1, XMM1   ; Subtraction modulo 2^1
  PSUBB XMM1, XMM1   ; Subtraction modulo 2^8
  PSUBW XMM1, XMM1   ; Subtraction modulo 2^16
  PSUBD XMM1, XMM1   ; Subtraction modulo 2^32
  PSUBQ XMM1, XMM1   ; Subtraction modulo 2^64

In all these 5 instructions, the carry propagates inside chunks corresponding to the size of the modulus and the carry does not propagate between chunks.

For XOR, i.e. subtraction modulo 2^1, the size of a chunk is just 1 bit, so the propagation of the carry inside the chunk happens to do nothing.

There are no special rules for XOR, its behavior is the same as for any other subtraction, any behavior that seems special is caused by the facts that the numbers 1 (size in bits of the integer residue) and 0 (number of carry propagations inside a number having the size of the residue) are somewhat more special numbers than the other cardinal numbers.

When you do not do those 5 operations inside a single ALU, but with separate adders, the shorter is the number of bits over which the carry must propagate, the faster is the logic device. But when a single ALU does all 5, the speed of the ALU is a little slower than the slowest of those 5 (a little slower because there are additional control gates for selecting the desired operation).

The other bitwise operations are also just particular cases of more general vector operations. Each of the 3 most important bitwise operations is the 1-bit limit of 2 operations which are distinct for numbers with sizes greater than 1 bit, but which are equivalent for 1-bit numbers. While XOR is just addition or subtraction of 1-bit numbers, AND is just minimum or multiplication of 1-bit numbers, and OR is just maximum of 1-bit numbers or the 1-bit version of the function that gives the probability for 1 of 2 events to happen (i.e. difference between sum and product).

gpderetta•1d ago

And in practice it is very likely that XOR and the variously sized vector ADDs and SUBs are implemented exactly by the same ALU circuitry, parameterized by a bitmasks of the carry lines to enable (none for XOR, all except the vector size boundaries for the vector operations).

Suzuran•1d ago

On some of IBM's smaller processors, such as channel controllers and the CSP used in the midrange line prior to the System/38, the xor instruction had a special feature when used with identical source and destination - It would inhibit parity and/or ECC error checking on the read cycle, which meant that xor could be used to clear a register or memory location that had been stored with bad parity without taking a machine check or processor check.

rep_lodsb•1d ago

Interesting, since the general culture at IBM seems to have preferred SUB over XOR -- their earlier business-oriented machines didn't even have a XOR instruction, and even on later ones the use of SUB has persisted, including in the IBM PC and AT BIOS.

(There was another, now deleted, comment somewhere in this thread that mentioned IBM's preference for SUB. Source of that statement was Claude, but it seems very likely to be correct. The BIOS code I've checked myself, lots of 'SUB AX,AX', no XOR)

Suzuran•1d ago

You may not be looking for the right thing. On the aforementioned CSP, the instruction that performed XOR was called "XR" and not "XOR". My source is firsthand knowledge; I was a CE and performed service calls on the System/34, System/36, 370, and 390.

In any case, I am describing equipment built mostly in late 60s through the late 70s at IBM Rochester and Poughkeepsie. The IBM PC was developed by an entirely different team at IBM Boca Raton, and IBM didn't design its CPU.

rep_lodsb•1d ago

I don't doubt that this specific processor special-cased XOR (regardless of how it was called in the assembly language)!

Merely pointing out that where both operations were available, there seems to have been a preference to use SUB instead, with some continuity from early business-oriented mainframes, to the 360, to the PC.

fweimer•1d ago

You probably would prefer to use SUB with fault-checking to clear registers in general-purpose code, and only use XOR in early startup (and perhaps fault handlers), where error checking has to be suppressed. So both observations seem to align well?

Suzuran•1d ago

Another thing I should point out is that the CSP instruction set was not documented to the customer. The CSP software was called "Microcode" and the customer was not told about the CSP's design or how it worked. The documented instruction set for the System/34 and System/36 is that of the Main Storage Processor or MSP, which was an evolution of the IBM System/3.

zahlman•1d ago

> but xor took a slightly lead due to some fluke, perhaps because it felt more “clever”.

Absolutely. But I can also imagine that it feels more like something that should be more efficient, because it's "a bit hack" rather than arithmetic. After all, it avoids all the "data dependencies" (carries, never mind the ALU is clocked to allow time for that regardless)!

I imagine that a similar feeling is behind XOR swap.

> Once an instruction has an edge, even if only extremely slight, that’s enough to tip the scales and rally everyone to that side.

Network effects are much older than social media, then....

dreamcompiler•1d ago

I vaguely remember we used the XOR trick on processors other than Intel, so it may not be Intel-specific.

In principle, sub requires 4 steps:

1. Move both operands to the ALU

2. Invert second operand (twos complement convert)

3. Add (which internally is just XOR plus carry propagate)

4. Move result to proper result register.

This is absolutely not how modern processors do it in practice; there are many shortcuts, but at least with pure XOR you don't need twos complement conversion or carry propagation.

Source: Wrote microcode at work a million years ago when designing a GPU.

rep_lodsb•1d ago

These two steps usually run in parallel though, with transistors to enable them depending on what operation should be performed.

bonzini•1d ago

You don't do twos complement negation for sub in an integer ALU. You do ones complement (A + ~B) and set the input carry to 1. The difference is that you don't need two carry propagations and therefore you can just add a fancy A + ~B function to the ALU.

Floating point is different because what matters is same sign or different sign (for same sign you cannot have cancellation and the exponent will always be the same or one than the largest input's. So the FP mantissa tends to use sign magnitude representation.

RiverCrochet•1d ago

XOR is a simple logic-gate operation. SUB would have to be an ALU operation.

A one-bit adder (which is subtraction in reverse) makes signals pass through two gates.

See https://en.wikipedia.org/wiki/Adder_(electronics)

You need the 2 gates for adding/subtracting because you care about carry. So if you're adding/subtracting 8 bits, 16 bits, or more, you're connecting multiples of these together, and that carry has to ripple through all the rest of the gates one-by-one. It can't be paralellized without extra circuitry, which increases your costs in other ways.

Without the AND gate needed for carry, all the XORs can fire off at the same time. If you added the extra circuitry for a parallelizable add/subtract to make it as fast as XOR, your actual parallel XOR would consume less power.

commandlinefan•1d ago

It's still the same number of clock cycles, though, isn't it? You're using some extra circuitry during the SUB, but during the XOR, that circuitry is just sitting idle anyway, so it's still six of one/half a dozen of the other.

RiverCrochet•1d ago

XOR can do everything in 1 cycle (which is hopefully far, far less than the clock). SUB-if done the simple way-has to take n cycles where n is the number of bits subtracted.

jonathrg•1d ago

What do you mean by cycles? A ripple-carry adder needs to wait for the carry bits to ripple through yes, but there's no clock cycle involved.

toast0•1d ago

Maybe they mean gate delays?

anvuong•1d ago

That's just not true. You think subtracting/adding 64-bit numbers actually take 64 cycles?

There is sequential implementation of ripple carry adder that uses clock and register, this will add 1-bit per cycle, but no body uses this for obvious reason, it's just a toy example for education. A normal ripple carry adder will have some delay in propagation time before the output is valid, but that is much less a clock cycle. You can also design a customized adder circuit for 4-bit 8-bit 16-bit etc separately that would greatly minimizes the propagation delay to only 2 or 3 levels of gates, instead of n gates like in the ripple carry adder.

valleyer•1d ago

Right. In other words, the clock cycle is already made to be long enough to allow a word-sized SUB to settle. An XOR-with-self surely settles faster, but it still has to wait for that same clock cycle before proceeding.

vintermann•1d ago

> but no body uses this for obvious reason, it's just a toy example for education.

SERV has entered the chat!

It has one upside besides education, and that is that it can be implemented with fewer gates. If you for some reason need parallelism on the core level rather than the bit level, you can cram in more cores with bit-serial ALUs in the same space.

monocasa•13h ago

SERV also implements xor bit serially too though.

DSMan195276•1d ago

It all depends on the CPU architecture, if it supports something like out-of-order execution then both parts of the CPU could be in use at the same time to execute different instructions. Realistically any CPU with that level of complexity doesn't care about SUB vs XOR though.

monocasa•1d ago

Also, because SUB is implemented internally with XOR, so it's normally the same gates, with different signals selecting a different function.

saati•1d ago

In an OoO CPU it won't even hit an execution unit because it's handled as a dependency chain break.

Symmetry•1d ago

That's all true, but on any modern x86 processor both the single pair of gates for the xor and the 10 or so for a carry-bypass 64 bit wide subtraction both happen with a single clock cycle of latency so from a programmer's perspective they're the same in that sense. There's still an energy difference but its tiny compared to what even the register file and bypass network for the operation use, let along the OoO structures.

dataflow•1d ago

The question isn't whether they both take a clock cycle, but rather whether any future implementation of the ISA might ostensibly find some sort of performance advantage, even if none do right now. From that standpoint, xor seems like a safer bet.

vablings•1d ago

Defacto standard, Compilers optimize for the CPU, CPU uarch is now optimizing for compilers

Symmetry•1d ago

There's been a lot of churn over the years but additions being done in the same timeframe as XORs has been pretty constant. The Pentium 4 double pumped its ALU but both XORs and ADDs could happen in a half cycle latency. The POWER 6 cut the FO4s of latency in stage from 16 to 10 and kept that parity as well. When you need 2 FO4s for latching between stages and 2 to handle clock jitter at high frequencies the difference between what a XOR needs and what an ADD need start looking smaller, particularly when you include the circuitry to move the data and select the instruction. Maybe if we move to asynchronous circuits?

cma•1d ago

There's also heat and thermal throttling, xor without having to forward compute carries could use less power I would think.

idontwantthis•1d ago

The blog post is about why this is idiomatic not whether it needs to be done that way today. It’s idiomatic because once upon a time none of that existed and xor gates did. The author apparently never took intro to digital logic.

masklinn•1d ago

The question is why one idiom won over the other, which happened a long time ago.

Because as the article notes on "any modern x86 processor" both xor r, r and sub r, r are handled by the frontend and have essentially no cost.

fjjfnrnr•1d ago

Because of encoding size of the machine code, not because of any runtime cost

masklinn•1d ago

> It encodes to the same number of bytes

0-_-0•22h ago

So then the question is: which pipeline is used less? Bit or add?

z500•1d ago

That was what I guessed too, but according to the article Intel detected both.

monocasa•1d ago

Except ALUs share hardware with logic functions.

Internally the adder (which is also used as a subtractor by ones complementing one of the inputs and inverting the initial carry in) uses xor, and you can implement the XOR logic op with the same gates.

Also, modern ALUs don't use ripple carries really any more, but instead stuff like a Kogge-Stone adder (or really, typically a hierarchical set of different techniques). https://en.wikipedia.org/wiki/Kogge%E2%80%93Stone_adder

matja•1d ago

SUB has higher latency than XOR on some Intel CPUs:

latency (L) and throughput (T) measurements from the InstLatx64 project (https://github.com/InstLatx64/InstLatx64) :

  | GenuineIntel | ArrowLake_08_LC | SUB r64, r64 | L: 0.26ns=  1.00c  | T:   0.03ns=   0.135c |
  | GenuineIntel | ArrowLake_08_LC | XOR r64, r64 | L: 0.03ns=  0.13c  | T:   0.03ns=   0.133c |
  | GenuineIntel | GoldmontPlus    | SUB r64, r64 | L: 0.67ns=  1.0 c  | T:   0.22ns=   0.33 c |
  | GenuineIntel | GoldmontPlus    | XOR r64, r64 | L: 0.22ns=  0.3 c  | T:   0.22ns=   0.33 c |
  | GenuineIntel | Denverton       | SUB r64, r64 | L: 0.50ns=  1.0 c  | T:   0.17ns=   0.33 c |
  | GenuineIntel | Denverton       | XOR r64, r64 | L: 0.17ns=  0.3 c  | T:   0.17ns=   0.33 c |

I couldn't find any AMD chips where the same is true.

Symmetry•1d ago

.03ns is a frequency of 33 GHz. The chip doesn't actually clock that fast. What I think you're seeing is the front end detecting the idiom and directing the renamer to zero that register and just remove that instruction from the stream hitting the execution resources.

adrian_b•1d ago

SUB does not have higher latency than XOR on any Intel CPU, when those operations are really performed, e.g. when their operands are distinct registers.

The weird values among those listed by you, i.e. those where the latency is less than 1 clock cycle, are when the operations have not been executed.

There are various special cases that are detected and such operations are not executed in an ALU. For instance, when the operands of XOR/SUB are the same the operation is not done and a null result is produced. On certain CPUs, the cases when one operand is a small constant are also detected and that operation is done by special circuits at the register renamer stage, so such operations do not reach the schedulers for the execution units.

To understand the meaning of the values, we must see the actual loop that has been used for measuring the latency.

In reality, the latency measured between truly dependent instructions cannot be less than 1 clock cycle. If a latency-measuring loop provides a time that when divided by the number of instructions is less than 1, that is because some of those instructions have been skipped. So that XOR-latency measuring loop must have included XORs between identical operands, which were bypassed.

butterisgood•1d ago

I recall thinking about these things quite a bit when reading Michael Abrash back in the 90s.

How much of that advice applies to anything these days is questionable. Back then we used to squeeze as much as possible from every clock cycle.

And cache misses weren’t great but the “front side bus” vs CPU clock difference wasn’t so insane either. RAM is “far away” now.

So the stuff you optimize for has changed a bit.

Always measure!

NanoWar•1d ago

XORing just feels more like xxxxing out the register. SUB feels like a calculation or mistaken use of a register.

billforsternz•1d ago

Back in the early 1980s I leveled up my self taught Z80 assembly skills by reading a book that attempted to disassemble and explain the Sinclair Spectrum ROM.

I remember the very first ROM instruction was XOR A and this was already a revelation to me as I'd never considered doing anything other than LD A,0 to clear the accumulator.

fortran77•1d ago

SUB may have side effects of setting flags that XOR may not depending on the CPU.

lochnessduck•1d ago

I use the carry flag in a lot of z80 assembly for communicating a status of an operation. XOR doesn’t mess with the carry flag, I think it’s another point in favor of xor. (Though I don’t remember even considering using sub)

Jare•1d ago

This is the exact reason I remember from back in the 80's. Perform arithmetic, clear register, CF is still valid.

kevin_thibedeau•1d ago

XOR is faster than SUB on bit slice processors. I'd posit that that created established idioms as micros came on the scene.

aforwardslash•1d ago

Afaik xor reg,reg is optimized by the cpu as zero-out reg; sub reg, reg is quite more difficult to optimize this way; this seems to be quite important in modern cpus, where cisc is translated to micro-ops; in superscalar archs, this is probably optimized away instead of causing a stall.

michaelcampbell•17h ago

> I don’t know why xor won the battle, but I suspect it was just a case of swarming.

> In my hypothetical history, xor and sub started out with roughly similar popularity, but xor took a slightly lead due to some fluke, perhaps because it felt more “clever”.

SO MUCH ink and "odd" code has been spilled over these 2 sentences over the past few decades...

DeepSeek v4

Composition Shouldn't be this Hard

Why I Write (1946)

An update on recent Claude Code quality reports

GPT-5.5

US special forces soldier arrested after allegedly winning $400k on Maduro raid

Bitwarden CLI compromised in ongoing Checkmarx supply chain campaign

Spinel: Ruby AOT Native Compiler

Show HN: Gova – The declarative GUI framework for Go

Show HN: How LLMs Work – Interactive visual guide based on Karpathy's lecture

Habitual coffee intake shapes the microbiome, modifies physiology and cognition

MeshCore development team splits over trademark dispute and AI-generated code

Meta tells staff it will cut 10% of jobs

Why Not Venus?

Show HN: Tolaria – Open-source macOS app to manage Markdown knowledge bases

Familiarity is the enemy: On why Enterprise systems have failed for 60 years

TorchTPU: Running PyTorch Natively on TPUs at Google Scale

Using the internet like it's 1999

UK Biobank health data keeps ending up on GitHub

Ubuntu 26.04

My phone replaced a brass plug

Show HN: Agent Vault – Open-source credential proxy and vault for agents

Show HN: Honker – Postgres NOTIFY/LISTEN Semantics for SQLite

A programmable watch you can actually wear

Astronomers find the edge of the Milky Way

Alberta startup sells no-tech tractors for half price

Used La Marzocco machines are coveted by cafe owners and collectors

Incident with multple GitHub services

Writing a C Compiler, in Zig (2025)

I am building a cloud

DeepSeek v4

Composition Shouldn't be this Hard

Why I Write (1946)

An update on recent Claude Code quality reports

GPT-5.5

US special forces soldier arrested after allegedly winning $400k on Maduro raid

Bitwarden CLI compromised in ongoing Checkmarx supply chain campaign

Spinel: Ruby AOT Native Compiler

Show HN: Gova – The declarative GUI framework for Go

Show HN: How LLMs Work – Interactive visual guide based on Karpathy's lecture

Habitual coffee intake shapes the microbiome, modifies physiology and cognition

MeshCore development team splits over trademark dispute and AI-generated code

Meta tells staff it will cut 10% of jobs

Why Not Venus?

Show HN: Tolaria – Open-source macOS app to manage Markdown knowledge bases

Familiarity is the enemy: On why Enterprise systems have failed for 60 years

TorchTPU: Running PyTorch Natively on TPUs at Google Scale

Using the internet like it's 1999

UK Biobank health data keeps ending up on GitHub

Ubuntu 26.04

My phone replaced a brass plug

Show HN: Agent Vault – Open-source credential proxy and vault for agents

Show HN: Honker – Postgres NOTIFY/LISTEN Semantics for SQLite

A programmable watch you can actually wear

Astronomers find the edge of the Milky Way

Alberta startup sells no-tech tractors for half price

Used La Marzocco machines are coveted by cafe owners and collectors

Incident with multple GitHub services

Writing a C Compiler, in Zig (2025)

I am building a cloud

XOR'ing a register with itself is the idiom for zeroing it out. Why not sub?

Comments