There's more going on here too: mvtnorm::rmvnorm() can also use an eigendecomposition to generate the numbers, but it post-processes the eigenvectors to eliminate the effect of the sign flips, so you don't see this reproducibility issue. mvtnorm::rmvnorm() also supports a second method (Cholesky decomposition) that is uniquely defined and avoids eigenvectors entirely, so it's more stable. And there's a discussion of why condition numbers don't really matter for this problem: it turns out you can't describe every possible floating-point problem a matrix could have with a single number.
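Roughly, in NumPy terms rather than mvtnorm's actual R code, the ambiguity looks like this: an eigendecomposition is only defined up to the sign of each eigenvector, so two equally valid factors of the same covariance turn the same random draws into different samples, while the Cholesky factor is unique.

```python
import numpy as np

sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Eigendecomposition: flipping the sign of an eigenvector column still gives
# a valid "square root" of sigma, but it changes the generated samples.
vals, vecs = np.linalg.eigh(sigma)
factor_a = vecs * np.sqrt(vals)                         # one valid factor
factor_b = (vecs * np.array([1, -1])) * np.sqrt(vals)   # sign-flipped column, equally valid

z = np.random.default_rng(0).standard_normal(2)
print(factor_a @ z)   # two different "reproducible" samples from the
print(factor_b @ z)   # same seed, depending on which signs you got

# Cholesky: unique for a positive-definite sigma, so no sign ambiguity.
L = np.linalg.cholesky(sigma)
print(L @ z)
```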
In most languages and libraries, hashing is still only deterministic within a given run of the program. Authors generally have no qualms about "fixing" implementations over time, and some systems intentionally salt the hash to help protect web programmers from denial-of-service "CVEs".
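Python is a concrete example of both points: str hashes are salted per process (unless PYTHONHASHSEED is set) precisely to blunt hash-flooding DoS attacks, so hash() on a string is only stable within a single run. A quick demonstration:

```python
import os
import subprocess
import sys

code = 'print(hash("reproducibility"))'

# Two fresh interpreters, two salts, (almost certainly) two different hashes.
for _ in range(2):
    out = subprocess.run([sys.executable, "-c", code],
                         capture_output=True, text=True)
    print(out.stdout.strip())

# Pinning the salt restores cross-run determinism (for this interpreter version).
env = dict(os.environ, PYTHONHASHSEED="0")
for _ in range(2):
    out = subprocess.run([sys.executable, "-c", code],
                         capture_output=True, text=True, env=env)
    print(out.stdout.strip())
```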
If you want reproducible RNG, you need to write it yourself.
Systems based on JAX's notion of splitting a PRNG are often nice. If your splitting function takes inputs (salts, seeds, whatever you want to call them), you gain the property that sub-branches are reproducible even if you change the overall input to have more or fewer components.
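A rough sketch of that property in Python (not JAX's actual API): derive each branch's seed from (root seed, branch name) with a stable hash, so adding or removing other branches never disturbs this branch's stream.

```python
import hashlib
import numpy as np

def split(root_seed: int, name: str) -> np.random.Generator:
    # Stable keyed derivation: the child seed depends only on the root seed
    # and this branch's name, not on how many other branches exist.
    digest = hashlib.blake2b(f"{root_seed}/{name}".encode(),
                             digest_size=8).digest()
    return np.random.default_rng(int.from_bytes(digest, "little"))

init_rng = split(42, "init")        # unchanged even if you later add,
dropout_rng = split(42, "dropout")  # remove, or reorder other branches
print(init_rng.standard_normal(3))
print(dropout_rng.standard_normal(3))
```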
But it’s true that if the algorithm isn’t specified then the implementation is free to change it, and probably should change it sometimes, to ensure calls don’t depend on it being fixed.
(Reproducibility is normally about what happens when you use the same version of everything.)
Also, PRNG and hash functions are extremely similar, and there's no reason why one should be deterministic and the other not. They're both built out of simple integer and bit operations. If a certain PRNG is implementation-defined, that's a detail of the specific (possibly standard) library you've chosen; it's nothing fundamental.
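SplitMix64 is a nice illustration: a handful of shifts, xors, and multiplies that acts as a tiny PRNG when fed a counter and as a 64-bit mixer/hash when fed arbitrary input, and it's exactly reproducible anywhere 64-bit integer arithmetic behaves the same. A Python version:

```python
MASK = (1 << 64) - 1

def splitmix64(x: int) -> int:
    # The standard SplitMix64 finalizer: pure integer/bit operations.
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9 & MASK
    x = (x ^ (x >> 27)) * 0x94D049BB133111EB & MASK
    return (x ^ (x >> 31)) & MASK

def prng(seed: int):
    state = seed & MASK
    while True:
        state = (state + 0x9E3779B97F4A7C15) & MASK  # advance the counter
        yield splitmix64(state)                      # then mix it

g = prng(1)
print([next(g) for _ in range(3)])   # same seed, same stream, any platform
print(splitmix64(123456789))         # the same core doubles as a hash
```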
AlotOfReading•1mo ago
It was 20 years ago, but that's not the case today. The vast majority of hardware today implements IEEE 754 reproducibly if you're willing to stick to a few basic principles:
1. same inputs
2. same operations, in the same order
3. no "special functions", denormals, or NaNs.
If you accept these restrictions (and realistically you weren't handling NaNs or denormals properly anyway), you can get practical reproducibility on modern hardware at minimal (or no) performance cost, provided your toolchain cooperates. Sadly, toolchains don't prioritize this, because it's easy to get wrong across the scope of a modern language and users don't know that it's possible.
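A tiny demonstration of why rule 2 matters: floating-point addition isn't associative, so the same inputs summed in a different order can give a different answer, while a fixed order is bit-identical on anything that performs IEEE 754 double addition correctly.

```python
xs = [1e16, 1.0, -1e16, 1.0]

# Same inputs, different association, different results.
left_to_right = ((xs[0] + xs[1]) + xs[2]) + xs[3]   # 1.0: the first 1.0 is absorbed into 1e16
reassociated  = (xs[0] + xs[2]) + (xs[1] + xs[3])   # 2.0: the big terms cancel first
print(left_to_right, reassociated)
```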
craigacp•1mo ago
I've done a lot of work on reproducibility in machine learning systems, and it's really, really hard. Even the JVM got me by changing some functions in `java.lang.Math` between versions and platforms (while staying within their documented 2 ulp error bounds).
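One way that kind of drift gets caught is to snapshot the exact bit patterns of a few canary evaluations and diff them across runtime versions and platforms. A minimal Python sketch (the inputs and functions here are arbitrary placeholders):

```python
import math
import struct

def bits(x: float) -> str:
    # Exact bit pattern of the double, so even 1-ulp drift shows up.
    return struct.pack("<d", x).hex()

canaries = {name: bits(fn(0.12345)) for name, fn in
            [("sin", math.sin), ("log", math.log), ("exp", math.exp)]}
print(canaries)  # store as a golden file and diff it in CI on each platform
```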
AlotOfReading•1mo ago
Relatively few programs truly need to operate that way, though, and they're usually written by people who know the cost of their choices.
saagarjha•1mo ago
aragilar•1mo ago
saagarjha•1mo ago
Dylan16807•1mo ago
"parallel decomposition that's the same as the serial one" would be difficult in many ways, but only needed when you can't afford a one-time change.
dzaima•1mo ago
Irrational stdlib functions (trig/log/exp; though not sqrt, for whatever reason) should really be basically the only source of non-reproducibility in typical programs (assuming they don't explicitly do different things depending on uncontrolled properties, don't use libraries that do, which is also a non-trivial ask, and aren't subject to an overly aggressive optimizer making incorrect transformations).
I'd hope that languages/libraries providing seeded random sources with a guarantee of equal behavior across systems would explicitly note which operations aren't reproducible, though; otherwise seeding is rather pointless. No clue if R does that.
AlotOfReading•1mo ago
NaN handling is ambiguous in two different ways. First, binary operations taking two NaNs are specified to return one of them in IEEE-754. Which one (if they're different) isn't defined and different hardware makes different choices. Second, I'm not aware of any language where NaN propagation has simple and defined semantics in practice. LLVM essentially says "you'll get one of 4 randomly chosen behaviors in each function", for example.
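You can observe the first ambiguity even from Python: build two quiet NaNs with different payloads, combine them, and see which payload survives. Which one you get depends on the hardware and compiler underneath, which is exactly the hazard being described. A sketch:

```python
import struct

def make_nan(payload: int) -> float:
    # Quiet NaN: exponent all ones, quiet bit set, plus a chosen payload.
    return struct.unpack("<d", struct.pack("<Q",
        0x7FF8000000000000 | payload))[0]

def bits(x: float) -> int:
    return struct.unpack("<Q", struct.pack("<d", x))[0]

a, b = make_nan(0x1), make_nan(0x2)
print(hex(bits(a + b)))   # which payload survives is hardware/compiler-dependent
print(hex(bits(b + a)))   # may or may not match the line above
```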
dzaima•1mo ago
NaNs indeed can get different bit patterns, but my point was that, given that you need to explicitly bit-reinterpret the NaN for that to affect anything, it can't really affect most code. I guess perhaps blindly hashing a NaN's bits? (which of course falls under your "you weren't handling NaNs […] properly anyway")
AlotOfReading•1mo ago
For NaNs, if we're being strict then you don't need to reach that far. Here's one example from LLVM:
https://github.com/llvm/llvm-project/issues/43070
This can occasionally result in visible bugs, like: https://github.com/rust-lang/rust/issues/110174
You have to be pretty nitpicky to care about this stuff, and it's not an issue in practice, but it's nonetheless a reproducibility violation because it deviates from the standard.
dzaima•1mo ago
Those LLVM issues are unrelated afaict - the former is about high-level IR optimizations that merely produce non-reproducible NaN bit patterns (and the parts that would be more than just bit patterns I can't even repro on the reported LLVM 8.0, or on any other version I tested).
The latter, though, is indeed a proper compiler bug; min/max are an awful mess... I didn't know that ARM's fminnm/fmaxnm were that funky, caring about sNaN vs qNaN - that makes them essentially unusable for compilers (not even useful in known-not-NaN contexts, since there fmin/fmax are equivalent).
thaumasiotes•1mo ago
Let's say I write a paper that says "in this statistical model, with random seed 1495268404, I get Important Result Y", and you criticize me on the grounds that when you run the model with random seed 282086400, Important Result Y does not hold. Doesn't this entire argument fail to be conceptually coherent?
AlotOfReading•1mo ago
thaumasiotes•1mo ago
saagarjha•1mo ago
dzaima•1mo ago
And that hints at the possibility of using seeds specifically for noting rare occurrences - communicating weird/unexpected(/buggy?) examples to collaborators, running for X hours to gather statistics without having to store every intermediate state of every single attempt, so you can later do extra analysis on cases the existing code didn't analyze sufficiently, etc.
hansvm•1mo ago
- Many algorithms are vastly easier to implement stochastically than deterministically. If you want to replay the system (e.g., to locally debug some production issue), you need those "stochastic" behaviors to nonetheless be deterministic.
- If you're a little careful with how you implement deterministic randomness, you can start to ask counterfactual questions -- how would the system have behaved had I made this change -- and actually compare apples to apples when examining an experimental run.
Even in your counterexample, the random seeds being reproducible and published is still important. With the seed and source published, now anyone can cheaply verify the issue with the simulation, you can debug it, you can investigate the proportion of "bad" seeds and suss out the error bounds of the simulation, etc.
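As a hypothetical sketch (toy simulation, not anyone's actual model): with the seed and code in hand you can replay the published run exactly, and you can also sweep other seeds to see how unusual that seed was.

```python
import numpy as np

def simulate(seed: int) -> float:
    # Toy stand-in for "the statistical model"; returns a summary statistic.
    rng = np.random.default_rng(seed)
    sample = rng.normal(loc=0.1, scale=1.0, size=100)
    return sample.mean()

published_seed = 1495268404
print(simulate(published_seed))   # exact replication of the published run

# How unusual was that seed? Just sweep a bunch of others.
effects = [simulate(s) for s in range(1000)]
print(np.mean([e > 0.2 for e in effects]))  # fraction of seeds showing a "large" effect
```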
thaumasiotes•1mo ago
Why does this matter? If people consistently fail to get results similar to yours by using different seeds, your results are invalid whether or not you made them up. And if people consistently do get results similar to yours with different seeds, your results are valid whether or not you made them up. No?
> Even in your counterexample, the random seeds being reproducible and published is still important.
> you can investigate the proportion of "bad" seeds and suss out the error bounds of the simulation
How does publishing seeds help with that? You do that by examining different seeds. If you know what seeds were used in the paper, this process doesn't change in any way.
aragilar•1mo ago
AlotOfReading•1mo ago
To give a practical example, I work on autonomous vehicles. My employer runs a lot of simulations against the vehicle code to make sure it meets reasonable quality standards. The hardware the code will run on in the field isn't necessarily available in large quantities in the datacenter, though, and realistically I'm not going to be able to apply numerical analysis to thousands of functions and millions of test cases to determine what the "correct" answer is in every case. Instead, I can privilege whatever the current result is as "correct enough", ensure the vehicle hardware always gets the same results as the datacenter hardware, and then rely on the many individual developers merging changes to review their own simulation results, knowing they'll hold on-road.

This is a pretty universal need for most software projects. If I'm a frontend dev, I want to know that my website will render the same on my computer as on the client's. If I'm a game engine dev, I want to ensure that replays don't diverge from the in-game experience. If I'm a VFX programmer, I want to ensure that the results generated on the artist's workstation look the same as what comes out of the render pipeline at the end. Etc.
All of these applications can still benefit from things like stability, but the benefits are orthogonal to reproducibility.
aragilar•1mo ago
To me it's the same as whether your default random number generator is a CSPRNG or just a PRNG, and generally it's safer for all involved if it's the former. The latter should exist, just with lots of warnings and guidance.
hansvm•1mo ago
5. All floating-point optimizations must be hand-rolled. That's implied by (2) and "toolchain cooperates", but it's worth calling out explicitly. Consider, e.g., summing a few thousand floats. On many modern machines, a near-optimal solution is to keep four vector accumulators, stride through the data in chunks four vectors wide, add into each accumulator, add the accumulators together at the end, and then horizontally sum the result (handling any straggling elements when the length isn't a multiple of the chunk width is left as an exercise for the reader; optimal behavior for those varies wildly). However, this gives different results depending on whether you use SSE, AVX, or AVX-512 if you want to use the hardware to its full potential. You need to make a choice (usually a bias toward wider vector types beats narrower ones, especially on AMD chips, but this is problem-specific), and whichever choice you make, you can't let the compiler reorder it.
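A scalar Python emulation of that pattern (no actual SIMD, just the accumulation order it implies): four lanes striding across the data plus a horizontal add at the end generally don't match a plain left-to-right sum, and a different lane count changes the answer again, which is why the width can't be left up to the compiler.

```python
import random

random.seed(0)
xs = [random.uniform(-1e8, 1e8) for _ in range(4096)]

def sequential_sum(data):
    total = 0.0
    for x in data:
        total += x
    return total

def strided_sum(data, lanes=4):
    acc = [0.0] * lanes              # one accumulator per "lane"
    for i, x in enumerate(data):
        acc[i % lanes] += x          # element i goes to lane i mod lanes
    total = 0.0
    for a in acc:                    # "horizontal add" at the end
        total += a
    return total

print(sequential_sum(xs))
print(strided_sum(xs, lanes=4))      # usually not bit-identical to the above
print(strided_sum(xs, lanes=8))      # and typically different again
```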
AlotOfReading•1mo ago
Indeed, you can't do that kind of hsum. Still, I haven't run into that behavior in practice (and given that I maintain an FP reproducibility test suite, I've tried). I'd be open to any suggestion for how to replicate it with the usual GCC/Clang etc.