* for `add` it is 0.20 (i.e. 5 per cycle)
* for `adc` it is 0.50 (i.e. 2 per cycle)
so it does seem correct.
This seems to be a consequence of `add` being available on ports 0, 1, 5, 6, & B, whereas `adc` is only available on ports 0 & 6.
So yes, as an individual instruction it's no worse, but even non-dependent instructions will fare worse under OoO execution (which is a more realistic view than considering a single instruction in isolation).
And with compare & jump being adjacent, they can be fused together into one µop, which Intel, AMD, and Apple Silicon all do.
I've been working on my own Entity Component System in C# and basically had to start from the ground up and test every assumption possible. There have only really been a few instances where my gut was correct, more often than not there are so many surprising gotchas hidden everywhere.
With C++ the existence of such "extended integer" types is implementation-defined. Clang at least supports the same `_BitInt` construct for C++ too; GCC seems not to.
So, for the 256 case on x86_64, both clang and gcc seem to only generate the simple adc ripple version: https://gcc.godbolt.org/z/nxoEda3q5 https://gcc.godbolt.org/z/bYf4bor3f
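For reference, the ripple version those compilers generate corresponds to something like this portable C (my own sketch, using the GCC/Clang `__builtin_add_overflow` builtin):

```c
#include <stdint.h>

/* 256-bit a += b as a plain carry ripple, limbs little-endian.
   Each limb's result depends on the previous limb's carry out. */
static void add256_ripple(uint64_t a[4], const uint64_t b[4]) {
    unsigned carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t t;
        unsigned c1 = __builtin_add_overflow(a[i], b[i], &t);
        unsigned c2 = __builtin_add_overflow(t, (uint64_t)carry, &a[i]);
        carry = c1 | c2;   /* at most one of c1, c2 can be set */
    }
}
```

On x86_64 this kind of chain typically lowers to one `add` followed by three `adc`s.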
even the one you're so enthusiastic about
s0 += a0;
s1 += a1;
s2 += a2;
s3 += a3;
c0 = s0 < a0; // RISC-V `sltu`
c1 = s1 < a1;
c2 = s2 < a2;
if (s1 == -1) goto propagate0; // executes 1 time in 18,446,744,073,709,551,616
check_s2:
if (s2 == -1) goto propagate1; // ditto
add_carries:
s1 += c0;
s2 += c1;
s3 += c2;
goto done;
propagate0: c1 = c0; goto check_s2;
propagate1: c2 = c1; goto add_carries;
done:
The key insight here is that unless the sum at a particular limb position is all 1s, the carry out from that position DOES NOT DEPEND on the carry in to that limb position, but only on whether the original add in that position produces a carry. If the sum is all 1s, the carry out is the same as the carry in.

If you express this with a conditional branch which is overwhelmingly predicted as not taken, then the code should execute each block of instructions entirely in parallel, provided that multiple conditional branches can be predicted as not-taken in the same clock cycle.
One time in 2^64 it will execute very slowly.
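Fleshed out as a self-contained C function (my own transcription of the pseudocode above, with `uint64_t` limbs so the all-ones test is against `UINT64_MAX`):

```c
#include <stdint.h>

/* 256-bit s += a; per-limb carries are computed independently, and the
   rare all-ones "propagate" cases are handled by branches the predictor
   will almost always guess as not taken */
static void add256_predict(uint64_t s[4], const uint64_t a[4]) {
    uint64_t s0 = s[0] + a[0];
    uint64_t s1 = s[1] + a[1];
    uint64_t s2 = s[2] + a[2];
    uint64_t s3 = s[3] + a[3];
    uint64_t c0 = s0 < a[0];         /* unsigned compare, i.e. sltu */
    uint64_t c1 = s1 < a[1];
    uint64_t c2 = s2 < a[2];
    if (s1 == UINT64_MAX) c1 = c0;   /* taken ~1 time in 2^64 */
    if (s2 == UINT64_MAX) c2 = c1;
    s[0] = s0;
    s[1] = s1 + c0;
    s[2] = s2 + c1;
    s[3] = s3 + c2;
}
```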
With 4 limb numbers on a 4-wide machine this doesn't offer an advantage over `adc` as there are also 4 code blocks. But on, say, an 8-wide machine with 8 limb numbers you're really starting to gain.
It's probably not going to help on current x86_64, but might well do on Apple's M* series, where even the M1 is 8-wide, though it might be tricky to work around the Arm ISA.
When the 8-wide RISC-V Ascalon processor from Tenstorrent arrives, hopefully late this year or early 2026, we will really see. And others such as Ventana, Rivos, and XiangShan.
This will work even better with wide SIMD, if you have a fast 1-lane shift (called a slideup on RISC-V).
if (s1 == -1)
c1 = c0;
if (s2 == -1)
c2 = c1;
These can become conditional moves on x86. I've often thought RISC-V should have implemented an IF instruction instead of compare-and-branch. IF would cause the next instruction to be executed conditionally while not needing a flag register at the ISA level. They could have required only branch and jump to be conditional, but it turns out conditional mov, load, and store are all very useful in real code.

The entire point of what I did is that the two conditional branches will be predicted not taken, so the CPU will 99.9999999999999999946% of the time not even see the `c1 = c0` and `c2 = c1` instructions that introduce the sequential dependencies.
Modern branch predictors are very good and most branches are very predictable.
If you can substitute a cmov without control flow then it's probably safer, e.g. `c1 |= c0 & seq(s1, -1)` or so, so long as you can make sure the compiler won't turn it into a branch.
It does add a data dependency though ...
A `cmov` will have the same serialisation problem as `adc` but on machines without carry it might still leave you better off than the obvious `add s,a,b; sltu co,s,a; add s,s,ci; sltu t,s,ci; or co,co,t`.
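A branchless variant of the 4-limb add along those lines (my own sketch; the `seq(s1,-1)` becomes a plain C comparison, which the compiler is free to lower to a cmov or to masking):

```c
#include <stdint.h>

/* 256-bit s += a with branchless carry resolution: the all-ones
   "propagate" case is folded in with AND/OR instead of a branch */
static void add256_branchless(uint64_t s[4], const uint64_t a[4]) {
    uint64_t s0 = s[0] + a[0];
    uint64_t s1 = s[1] + a[1];
    uint64_t s2 = s[2] + a[2];
    uint64_t s3 = s[3] + a[3];
    uint64_t c0 = s0 < a[0];             /* generate at limb 0 */
    uint64_t c1 = s1 < a[1];
    uint64_t c2 = s2 < a[2];
    c1 |= c0 & (s1 == UINT64_MAX);       /* propagate through all-ones */
    c2 |= c1 & (s2 == UINT64_MAX);
    s[0] = s0;
    s[1] = s1 + c0;
    s[2] = s2 + c1;
    s[3] = s3 + c2;
}
```

Note that this does reintroduce the sequential c0 -> c1 -> c2 data dependency discussed above, but without any misprediction risk.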
Even at that time in 2021 I argued that serialising through a carry flag is limiting on wide machines, but there was very little RISC-V hardware available at the time and also GMP was not yet ported to RISC-V.
That has changed a bit now, and almost two months ago I tried the GMP project's own gmpbench on a few RISC-V boards.
I found that when comparing similar µarch at similar clock speed, in dual-issue in-order SiFive's U74 is very comparable to Arm's A53, and in small 3-wide OoO SiFive's P550 is significantly better than Arm's A72.
And that's not even using the kind of technique discussed in this post, but the full multi-instruction carry flag emulation criticised by Granlund.
https://www.reddit.com/r/RISCV/comments/1jsnbdr/gnu_mp_bignu...
It's going to be very interesting when the 8-wide OoO RISC-V cores come out, probably starting with Tenstorrent's Ascalon core which they expect to tape out in Q3 and they have said they want to get into as many hands as possible to accelerate RISC-V development, including in laptops, not only in servers or the like.
Neither of the 2 multi-word addition algorithms can replace the other, both have their use cases, so ADC/SBB instructions are included in any decent ISA, because the cost of adding them is negligible. A dedicated flag register is not necessary, some ISAs store the carry/borrow flags in general-purpose registers, when used.
Not having carry is by far not the worst feature of RISC-V. Much worse is not having an integer overflow flag, because the software workaround for detecting integer overflow, which is mandatory for any program that claims to be written in a safe way, lowers the attainable performance much more than the workarounds for not having carry.
That's absurd. A better way is to ensure that your algorithms don't overflow. Detecting an overflow just means your code has to STOP, which is usually not safe. It'd be insane to have conditionally executed code trying to figure out how to handle an overflow anywhere in code. Another problem is that flags are not even accessible from any language higher level than ASM. From a C perspective there are no flags.
While it'd be nice to have a formal proof that every single `a+b`, `a-b`, `a*b` in every codebase doesn't overflow, I'm sure you understand that that is rather impractical. (and really, it'd be nice! I've thought about having some compile-time-bounded-size integers where each addition increases the size, but multiplication is much less suitable for that, and it also means you can't have a loop adding to an accumulator. It's a rather non-trivial problem really - you might think that it'd be fine to have a loop over a list of objects and sum their sizes, but that can relatively easily overflow if the list references the same massive object many times, so can't even really abstract that)
This idea has wider applicability than operations on long integers.
With double the number of additions this allows log(bits) propagation time (versus linear).
... Which likely isn't that bad to code up.
For as long as radix=2, you either have a carry or you don't.
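One way to sketch that log-time propagation (my own illustration, not from the comment above): pack one generate bit and one propagate bit per limb into machine words, then combine them with doubling strides, parallel-prefix (Kogge-Stone) style:

```c
#include <stdint.h>

/* given per-limb generate (g) and propagate (p) bitmasks, resolve all
   carries in log2(limbs) doubling steps; the returned mask has bit i
   set if there is a carry *into* limb i */
static uint64_t resolve_carries(uint64_t g, uint64_t p) {
    for (int s = 1; s < 64; s <<= 1) {
        g |= p & (g << s);   /* G = g | (p & G_from_below) */
        p &= p << s;         /* P survives only through runs of p */
    }
    return g << 1;           /* carry into limb i is carry out of i-1 */
}
```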
But - you still have to split the input numbers into sets of 5 registers in the first place, right? So doesn't that need to be parallelizable somehow as well in order for this to be a net win?
[1] a left shift, a right shift, and an OR. Or just one 2->1 funnel shift instruction if your ISA has that, e.g. arm64's `extr`. And then ANDing with a 51-bit mask.
and out0, in0, 0x7FFFFFFFFFFFF
extr out1, in1, in0, #51
extr out2, in2, in1, #38
extr out3, in3, in2, #25
lsr out4, in3, #12
and out1, out1, 0x7FFFFFFFFFFFF
and out2, out2, 0x7FFFFFFFFFFFF
and out3, out3, 0x7FFFFFFFFFFFF
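The same split in portable C (my own transcription of the asm above; words little-endian, with the top limb keeping the remaining 52 bits):

```c
#include <stdint.h>

#define MASK51 ((1ULL << 51) - 1)

/* split a 256-bit little-endian number (4 x 64-bit words) into
   5 limbs of 51 bits each, except the top limb which gets 52 */
static void split_radix51(uint64_t out[5], const uint64_t in[4]) {
    out[0] = in[0] & MASK51;
    out[1] = ((in[0] >> 51) | (in[1] << 13)) & MASK51;
    out[2] = ((in[1] >> 38) | (in[2] << 26)) & MASK51;
    out[3] = ((in[2] >> 25) | (in[3] << 39)) & MASK51;
    out[4] = in[3] >> 12;
}
```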
The radix 2^51 trick - https://news.ycombinator.com/item?id=33706153 - Nov 2022 (6 comments)
The radix 2^51 trick (2017) - https://news.ycombinator.com/item?id=23351007 - May 2020 (83 comments)
Okasaki's book 'Purely Functional Data Structures' has some nice examples.
A convolution can be done with an FFT, pointwise multiply, and inverse FFT, which is O(n log n) rather than O(n^2) for traditional multiplication.
The bits in each limb can be quite small though as there are lots of carries and it depends on how many digits you have and how accurate your floating point is.
Some kind of FFT is how all large multiplications are done.
I had a lot of fun learning about this in relation to GIMPS (the Great Internet Mersenne Prime Search), where you use an FFT variant called a DWT over an irrational base, which gives you a free mod 2^n-1: exactly what you want for primality testing Mersenne prime candidates using the Lucas-Lehmer test.
It looks more or less like this:
__m256i s = _mm256_add_epi64(a, b);
const __m256i all_ones = _mm256_set1_epi64x(~0);
int g = _mm256_cmpgt_epu64_mask(a, s);
int p = _mm256_cmpeq_epu64_mask(s, all_ones);
int carries = ((g << 1) + p) ^ p;
__m256i ret = _mm256_mask_sub_epi64(s, carries, s, all_ones);
The throughput even seems to be better: https://godbolt.org/z/e7zETe8xY

It's trivial to change this to do 512-bit addition, where the improvement will be even more significant.
https://stackoverflow.com/questions/56852812/simd-instructio...
This isn't correct. AVX-512 provides a bunch of extra instructions, zmm (512-bit) registers, and an extra 16 (for a total of 32) vector registers. The downclocking only happens if you use 512-bit registers (not just AVX-512 instructions). The difference here matters a lot, since there are a bunch of really useful instructions (e.g. 64-bit integer multiply) added by AVX-512 that are pure upside.
Also, none of this is an issue on Zen 4 or Zen 5, since they use much more sensible downclocking: they only downclock if you've used enough instructions in a row for power/temperature to start spiking.
General idea was just to highlight some of the dangers of vector registers. I believe the same is true of ymm (256) to a lesser extent.
I was trying to encode and decode some buffers into an arbitrary base, and I eventually came to the conclusion (after far too long) that a carry could ripple all the way down the buffer, which dramatically slows down the algorithm.
Actually, the eventual solution I came up with might have some stuff in common with this trick too. I did eventually chunk up the buffer, leaving some unused headroom to 'handle carries'. Not exactly though; I just have some wasted bits, which uses a tiny bit more storage or network bandwidth but saves on compute. I wonder if I could instead pool up the carries like this and 'resolve' them in a later step. Have my cake and eat it too? Wishful thinking.
"The radix 2^51 trick to adding 64-bit integers on *some* x86 architectures in parallel without slowing the pipeline due to dependencies on carry"
addaon•1d ago
Why not give the top limb 64 bits and the other four limbs 48 bits each, then? You can accumulate more additions before normalization, you can take advantage of word alignment during splitting and normalization if your instruction set has anything useful there, and your overflow properties are identical, no?
vitus•20h ago
That said, when you're dealing with 256-bit integers, you're almost assuredly not working with signed arithmetic.
phkahler•18h ago
I think one goal is to use five 64-bit registers to do 256-bit math. That means using 256/5 = 51.2 bits of each word. That's probably some kind of ideal if you want 256-bit math, but not optimal if you're writing a generic big-int library. In the old days you'd want to use exactly one byte for the carry(s), because we didn't have barrel shifters to do arbitrary bit shifts efficiently. In that case I'd use 56 bits of the 64 to get nice byte alignment.
This is all quite relevant for RISC-V since the ISA does not have flags.
Thorrez•17h ago
Why must each word have the same amount? Why not 64 bits on the top word, and 48 bits on the other 4 words?