* for `add` it is 0.20 (i.e. 5 per cycle)
* for `adc` it is 0.50 (i.e. 2 per cycle)
so it does seem correct.
This seems to be a consequence of `add` being available on ports 0, 1, 5, 6, & B, whereas `adc` is only available on ports 0 & 6.
So yes, as an individual instruction it's no worse, but even non-dependent instructions will fare worse under OoO execution (which is a more realistic view than considering a single instruction in isolation).
And with compare & jump being adjacent, they can be fused together into one µop, which Intel, AMD, and Apple Silicon all do.
I've been working on my own Entity Component System in C# and basically had to start from the ground up and test every assumption possible. There have only really been a few instances where my gut was correct, more often than not there are so many surprising gotchas hidden everywhere.
With C++ the existence of such "extended integer" types is implementation-defined. Clang at least supports the same `_BitInt` construct for C++ too; GCC seems not to.
So, for the 256 case on x86_64, both clang and gcc seem to only generate the simple adc ripple version: https://gcc.godbolt.org/z/nxoEda3q5 https://gcc.godbolt.org/z/bYf4bor3f
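For reference, the ripple version those compilers generate corresponds to something like this portable C (my own sketch, using the GCC/Clang `__builtin_add_overflow` builtin):

```c
#include <stdint.h>

/* 256-bit a += b as a plain carry ripple, limbs little-endian.
   Each limb's result depends on the previous limb's carry out. */
static void add256_ripple(uint64_t a[4], const uint64_t b[4]) {
    unsigned carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t t;
        unsigned c1 = __builtin_add_overflow(a[i], b[i], &t);
        unsigned c2 = __builtin_add_overflow(t, (uint64_t)carry, &a[i]);
        carry = c1 | c2;   /* at most one of c1, c2 can be set */
    }
}
```

On x86_64 this kind of chain typically lowers to one `add` followed by three `adc`s.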
even the one you're so enthusiastic about
s0 += a0;
s1 += a1;
s2 += a2;
s3 += a3;
c0 = s0 < a0; // RISC-V `sltu`
c1 = s1 < a1;
c2 = s2 < a2;
if (s1 == -1) goto propagate0; // executes 1 time in 18,446,744,073,709,551,616
check_s2:
if (s2 == -1) goto propagate1; // ditto
add_carries:
s1 += c0;
s2 += c1;
s3 += c2;
goto done;
propagate0: c1 = c0; goto check_s2;
propagate1: c2 = c1; goto add_carries;
done:
The key insight here is that unless the sum at a particular limb position is all 1s, the carry out from that position DOES NOT DEPEND on the carry in to that limb position, but only on whether the original add in that position produces a carry. If the sum is all 1s, the carry out is the same as the carry in.

If you express this with a conditional branch which is overwhelmingly predicted as not taken, then the code should execute each block of instructions entirely in parallel, provided that multiple conditional branches can be predicted as not-taken in the same clock cycle.
One time in 2^64 it will execute very slowly.
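Fleshed out as a self-contained C function (my own transcription of the pseudocode above, with `uint64_t` limbs so the all-ones test is against `UINT64_MAX`):

```c
#include <stdint.h>

/* 256-bit s += a; per-limb carries are computed independently, and the
   rare all-ones "propagate" cases are handled by branches the predictor
   will almost always guess as not taken */
static void add256_predict(uint64_t s[4], const uint64_t a[4]) {
    uint64_t s0 = s[0] + a[0];
    uint64_t s1 = s[1] + a[1];
    uint64_t s2 = s[2] + a[2];
    uint64_t s3 = s[3] + a[3];
    uint64_t c0 = s0 < a[0];         /* unsigned compare, i.e. sltu */
    uint64_t c1 = s1 < a[1];
    uint64_t c2 = s2 < a[2];
    if (s1 == UINT64_MAX) c1 = c0;   /* taken ~1 time in 2^64 */
    if (s2 == UINT64_MAX) c2 = c1;
    s[0] = s0;
    s[1] = s1 + c0;
    s[2] = s2 + c1;
    s[3] = s3 + c2;
}
```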
With 4 limb numbers on a 4-wide machine this doesn't offer an advantage over `adc` as there are also 4 code blocks. But on, say, an 8-wide machine with 8 limb numbers you're really starting to gain.
It's probably not going to help on current x86_64, but might well do on Apple's M* series, where even the M1 is 8-wide, though it might be tricky to work around the Arm ISA.
When the 8-wide RISC-V Ascalon processor from Tenstorrent arrives, hopefully late this year or early 2026, we will really see. And others such as Ventana, Rivos, and XiangShan.
This will work even better with wide SIMD, if you have a fast 1-lane shift (called a slideup on RISC-V).
if (s1 == -1)
c1 = c0;
if (s2 == -1)
c2 = c1;
These can become conditional moves on x86. I've often thought RISC-V should have implemented an IF instruction instead of compare-and-branch. IF would cause the next instruction to be executed conditionally while not needing a flag register at the ISA level. They could have required only branch and jump to be conditional, but it turns out conditional mov, load, and store are all very useful in real code.

The entire point of what I did is that the two conditional branches will be predicted not taken, so the CPU will 99.9999999999999999946% of the time not even see the `c1 = c0` and `c2 = c1` instructions that introduce the sequential dependencies.
Modern branch predictors are very good and most branches are very predictable.
If you can substitute a cmov without control flow then it's probably safer, e.g. `c1 |= c0 & seq(s1, -1)` or so, so long as you can make sure the compiler won't turn it into a branch.
It does add a data dependency though ...
A `cmov` will have the same serialisation problem as `adc` but on machines without carry it might still leave you better off than the obvious `add s,a,b; sltu co,s,a; add s,s,ci; sltu t,s,ci; or co,co,t`.
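A branchless variant of the 4-limb add along those lines (my own sketch; the `seq(s1,-1)` becomes a plain C comparison, which the compiler is free to lower to a cmov or to masking):

```c
#include <stdint.h>

/* 256-bit s += a with branchless carry resolution: the all-ones
   "propagate" case is folded in with AND/OR instead of a branch */
static void add256_branchless(uint64_t s[4], const uint64_t a[4]) {
    uint64_t s0 = s[0] + a[0];
    uint64_t s1 = s[1] + a[1];
    uint64_t s2 = s[2] + a[2];
    uint64_t s3 = s[3] + a[3];
    uint64_t c0 = s0 < a[0];             /* generate at limb 0 */
    uint64_t c1 = s1 < a[1];
    uint64_t c2 = s2 < a[2];
    c1 |= c0 & (s1 == UINT64_MAX);       /* propagate through all-ones */
    c2 |= c1 & (s2 == UINT64_MAX);
    s[0] = s0;
    s[1] = s1 + c0;
    s[2] = s2 + c1;
    s[3] = s3 + c2;
}
```

Note that this does reintroduce the sequential c0 -> c1 -> c2 data dependency discussed above, but without any misprediction risk.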
Even at that time in 2021 I argued that serialising through a carry flag is limiting on wide machines, but there was very little RISC-V hardware available at the time and also GMP was not yet ported to RISC-V.
That has changed a bit now, and almost two months ago I tried the GMP project's own gmpbench on a few RISC-V boards.
I found that when comparing similar µarch at similar clock speed, in dual-issue in-order SiFive's U74 is very comparable to Arm's A53, and in small 3-wide OoO SiFive's P550 is significantly better than Arm's A72.
And that's not even using the kind of technique discussed in this post, but the full multi-instruction carry flag emulation criticised by Granlund.
https://www.reddit.com/r/RISCV/comments/1jsnbdr/gnu_mp_bignu...
It's going to be very interesting when the 8-wide OoO RISC-V cores come out, probably starting with Tenstorrent's Ascalon core which they expect to tape out in Q3 and they have said they want to get into as many hands as possible to accelerate RISC-V development, including in laptops, not only in servers or the like.
Neither of the 2 multi-word addition algorithms can replace the other, both have their use cases, so ADC/SBB instructions are included in any decent ISA, because the cost of adding them is negligible. A dedicated flag register is not necessary, some ISAs store the carry/borrow flags in general-purpose registers, when used.
Not having carry is by far not the worst feature of RISC-V. Much worse is not having an integer overflow flag, because the software workaround for detecting integer overflow, which is mandatory for any program that claims to be written in a safe way, lowers the attainable performance much more than the workarounds for not having carry.
That's absurd. A better way is to ensure that your algorithms don't overflow. Detecting an overflow just means your code has to STOP, which is usually not safe. It'd be insane to have conditionally executed code trying to figure out how to handle an overflow anywhere in code. Another problem is that flags are not even accessible from any language higher level than ASM. From a C perspective there are no flags.
While it'd be nice to have a formal proof that every single `a+b`, `a-b`, `a*b` in every codebase doesn't overflow, I'm sure you understand that that is rather impractical. (and really, it'd be nice! I've thought about having some compile-time-bounded-size integers where each addition increases the size, but multiplication is much less suitable for that, and it also means you can't have a loop adding to an accumulator. It's a rather non-trivial problem really - you might think that it'd be fine to have a loop over a list of objects and sum their sizes, but that can relatively easily overflow if the list references the same massive object many times, so can't even really abstract that)
This idea has wider applicability than operations on long integers.
With double the number of additions this allows log(bits) propagation time (versus linear).
... Which likely isn't that bad to code up.
For as long as radix=2, you either have a carry or you don't.
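One way to sketch that log-time propagation (my own illustration, not from the comment above): pack one generate bit and one propagate bit per limb into machine words, then combine them with doubling strides, parallel-prefix (Kogge-Stone) style:

```c
#include <stdint.h>

/* given per-limb generate (g) and propagate (p) bitmasks, resolve all
   carries in log2(limbs) doubling steps; the returned mask has bit i
   set if there is a carry *into* limb i */
static uint64_t resolve_carries(uint64_t g, uint64_t p) {
    for (int s = 1; s < 64; s <<= 1) {
        g |= p & (g << s);   /* G = g | (p & G_from_below) */
        p &= p << s;         /* P survives only through runs of p */
    }
    return g << 1;           /* carry into limb i is carry out of i-1 */
}
```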
But - you still have to split the input numbers into sets of 5 registers in the first place, right? So doesn't that need to be parallelizable somehow as well in order for this to be a net win?
[1] a left shift, a right shift, and an OR. Or just one 2->1 funnel shift instruction if your ISA has that, e.g. arm64's `extr`. And then ANDing with a 51-bit mask.
and out0, in0, 0x7FFFFFFFFFFFF
extr out1, in1, in0, #51
extr out2, in2, in1, #38
extr out3, in3, in2, #25
lsr out4, in3, #12
and out1, out1, 0x7FFFFFFFFFFFF
and out2, out2, 0x7FFFFFFFFFFFF
and out3, out3, 0x7FFFFFFFFFFFF
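The same split in portable C (my own transcription of the asm above; words little-endian, with the top limb keeping the remaining 52 bits):

```c
#include <stdint.h>

#define MASK51 ((1ULL << 51) - 1)

/* split a 256-bit little-endian number (4 x 64-bit words) into
   5 limbs of 51 bits each, except the top limb which gets 52 */
static void split_radix51(uint64_t out[5], const uint64_t in[4]) {
    out[0] = in[0] & MASK51;
    out[1] = ((in[0] >> 51) | (in[1] << 13)) & MASK51;
    out[2] = ((in[1] >> 38) | (in[2] << 26)) & MASK51;
    out[3] = ((in[2] >> 25) | (in[3] << 39)) & MASK51;
    out[4] = in[3] >> 12;
}
```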
The radix 2^51 trick - https://news.ycombinator.com/item?id=33706153 - Nov 2022 (6 comments)
The radix 2^51 trick (2017) - https://news.ycombinator.com/item?id=23351007 - May 2020 (83 comments)
Okasaki's book 'Purely Functional Data Structures' has some nice examples.
A convolution can be done with an FFT, pointwise multiply, and inverse FFT, which is O(n log n) rather than O(n^2) for traditional multiplication.
The bits in each limb can be quite small though as there are lots of carries and it depends on how many digits you have and how accurate your floating point is.
Some kind of FFT is how all large multiplications are done.
I had a lot of fun learning about this in relation to GIMPS (the Great Internet Mersenne Prime Search), where you use an FFT variant called a DWT over an irrational base, which gives you a free mod 2^n-1: exactly what you want for primality testing Mersenne prime candidates using the Lucas-Lehmer test.
It looks more or less like this:
__m256i s = _mm256_add_epi64(a, b);
const __m256i all_ones = _mm256_set1_epi64x(~0);
int g = _mm256_cmpgt_epu64_mask(a, s);
int p = _mm256_cmpeq_epu64_mask(s, all_ones);
int carries = ((g << 1) + p) ^ p;
__m256i ret = _mm256_mask_sub_epi64(s, carries, s, all_ones);
The throughput even seems to be better: https://godbolt.org/z/e7zETe8xY

It's trivial to change this to do 512-bit addition, where the improvement will be even more significant.
https://stackoverflow.com/questions/56852812/simd-instructio...
This isn't correct. AVX-512 provides a bunch of extra instructions, zmm (512-bit) registers, and an extra 16 (for a total of 32) vector registers. The downclocking only happens if you use 512-bit registers (not just AVX-512 instructions). The difference here matters a lot, since there are a bunch of really useful instructions (e.g. 64-bit integer multiply) added by AVX-512 that are pure upside.
Also, none of this is an issue on Zen 4 or Zen 5, since they use much more sensible downclocking: they only downclock if you've used enough instructions in a row for power/temperature to start spiking.
General idea was just to highlight some of the dangers of vector registers. I believe the same is true of ymm (256) to a lesser extent.
I was trying to encode and decode some buffers into an arbitrary base, and I eventually came to the conclusion (after far too long) that a carry could ripple all the way down the buffer, which dramatically slows down the algorithm.
Actually, the eventual solution I came up with might have some stuff in common with this trick too. I did eventually chunk up the buffer, leaving some unused headroom to 'handle carries'. Not exactly though; I just have some wasted bits, which uses a tiny bit more storage or network bandwidth but saves on compute. I wonder if I could instead pool up the carries like this and 'resolve' them in a later step. Have my cake and eat it too? Wishful thinking.
"The radix 2^51 trick to adding 64-bit integers on *some* x86 architectures in parallel without slowing the pipeline due to dependencies on carry"
addaon•1d ago
Why not give the top limb 64 bits and the other four limbs 48 bits each, then? You can accumulate more additions before normalization, you can take advantage of word alignment during splitting and normalization if your instruction set has anything useful there, and your overflow properties are identical, no?
vitus•20h ago
That said, when you're dealing with 256-bit integers, you're almost assuredly not working with signed arithmetic.
phkahler•18h ago
I think one goal is to use five 64-bit registers to do 256-bit math. That means using 256/5 = 51.2 bits of each word. That's probably some kind of ideal if you want 256-bit math, but not optimal if you're writing a generic big-int library. In the old days you'd want to use exactly one byte for the carry(s), because we didn't have barrel shifters to do arbitrary bit shifts efficiently. In that case I'd use 56 bits of the 64 to get nice byte alignment.
This is all quite relevant for RISC-V since the ISA does not have flags.
Thorrez•17h ago
Why must each word have the same amount? Why not 64 bits on the top word, and 48 bits on the other 4 words?