There's more going on here too: mvtnorm::rmvnorm() can also use an eigendecomposition to generate the numbers, but it post-processes the eigenvectors to eliminate the effect of the sign flips, so you don't see this reproducibility issue. mvtnorm::rmvnorm() also supports a second method (Cholesky decomposition) that is uniquely defined and avoids eigenvectors entirely, so it's more stable. And there's a discussion of why condition numbers don't really matter for this problem: it turns out you can't describe every possible floating-point problem a matrix could have with a single number.
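Roughly, in NumPy terms rather than mvtnorm's actual R code, the ambiguity looks like this: an eigendecomposition is only defined up to the sign of each eigenvector, so two equally valid factors of the same covariance turn the same random draws into different samples, while the Cholesky factor is unique.

```python
import numpy as np

sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Eigendecomposition: flipping the sign of an eigenvector column still gives
# a valid "square root" of sigma, but it changes the generated samples.
vals, vecs = np.linalg.eigh(sigma)
factor_a = vecs * np.sqrt(vals)                         # one valid factor
factor_b = (vecs * np.array([1, -1])) * np.sqrt(vals)   # sign-flipped column, equally valid

z = np.random.default_rng(0).standard_normal(2)
print(factor_a @ z)   # two different "reproducible" samples from the
print(factor_b @ z)   # same seed, depending on which signs you got

# Cholesky: unique for a positive-definite sigma, so no sign ambiguity.
L = np.linalg.cholesky(sigma)
print(L @ z)
```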
In most languages and libraries, hashing is still only deterministic within a given run of the program. Authors generally have no qualms about "fixing" implementations over time, and some systems intentionally salt the hash to help protect web programmers from denial-of-service "CVEs".
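Python is a concrete example of both points: str hashes are salted per process (unless PYTHONHASHSEED is set) precisely to blunt hash-flooding DoS attacks, so hash() on a string is only stable within a single run. A quick demonstration:

```python
import os
import subprocess
import sys

code = 'print(hash("reproducibility"))'

# Two fresh interpreters, two salts, (almost certainly) two different hashes.
for _ in range(2):
    out = subprocess.run([sys.executable, "-c", code],
                         capture_output=True, text=True)
    print(out.stdout.strip())

# Pinning the salt restores cross-run determinism (for this interpreter version).
env = dict(os.environ, PYTHONHASHSEED="0")
for _ in range(2):
    out = subprocess.run([sys.executable, "-c", code],
                         capture_output=True, text=True, env=env)
    print(out.stdout.strip())
```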
If you want reproducible RNG, you need to write it yourself.
Systems based on JAX's notion of splitting a PRNG are often nice. If your splitting function takes inputs (salts, seeds, whatever you want to call them), you gain the property that sub-branches are reproducible even if you change the overall input to have more or fewer components.
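A rough sketch of that property in Python (not JAX's actual API): derive each branch's seed from (root seed, branch name) with a stable hash, so adding or removing other branches never disturbs this branch's stream.

```python
import hashlib
import numpy as np

def split(root_seed: int, name: str) -> np.random.Generator:
    # Stable keyed derivation: the child seed depends only on the root seed
    # and this branch's name, not on how many other branches exist.
    digest = hashlib.blake2b(f"{root_seed}/{name}".encode(),
                             digest_size=8).digest()
    return np.random.default_rng(int.from_bytes(digest, "little"))

init_rng = split(42, "init")        # unchanged even if you later add,
dropout_rng = split(42, "dropout")  # remove, or reorder other branches
print(init_rng.standard_normal(3))
print(dropout_rng.standard_normal(3))
```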
But it’s true that if the algorithm isn’t specified then the implementation is free to change it, and probably should change it sometimes, to ensure calls don’t depend on it being fixed.
(Reproducibility is normally about what happens when you use the same version of everything.)
Also, PRNG and hash functions are extremely similar, and there's no reason why one should be deterministic and the other not. They're both built out of simple integer and bit operations. If a certain PRNG is implementation-defined, that's a detail of the specific (possibly standard) library you've chosen; it's nothing fundamental.
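SplitMix64 is a nice illustration: a handful of shifts, xors, and multiplies that acts as a tiny PRNG when fed a counter and as a 64-bit mixer/hash when fed arbitrary input, and it's exactly reproducible anywhere 64-bit integer arithmetic behaves the same. A Python version:

```python
MASK = (1 << 64) - 1

def splitmix64(x: int) -> int:
    # The standard SplitMix64 finalizer: pure integer/bit operations.
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9 & MASK
    x = (x ^ (x >> 27)) * 0x94D049BB133111EB & MASK
    return (x ^ (x >> 31)) & MASK

def prng(seed: int):
    state = seed & MASK
    while True:
        state = (state + 0x9E3779B97F4A7C15) & MASK  # advance the counter
        yield splitmix64(state)                      # then mix it

g = prng(1)
print([next(g) for _ in range(3)])   # same seed, same stream, any platform
print(splitmix64(123456789))         # the same core doubles as a hash
```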
AlotOfReading•1mo ago
It was 20 years ago, but that's not the case today. The vast majority of hardware today implements IEEE 754 reproducibly if you're willing to stick to a few basic principles:
1. same inputs
2. same operations, in the same order
3. no "special functions", denormals, or NaNs.
If you accept these restrictions (and realistically you weren't handling NaNs or denormals properly anyway), you can get practical reproducibility on modern hardware at minimal (or no) performance cost, provided your toolchain cooperates. Sadly, toolchains don't prioritize this, because it's easy to get wrong across the scope of a modern language and users don't know that it's possible.
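A tiny demonstration of why rule 2 matters: floating-point addition isn't associative, so the same inputs summed in a different order can give a different answer, while a fixed order is bit-identical on anything that performs IEEE 754 double addition correctly.

```python
xs = [1e16, 1.0, -1e16, 1.0]

# Same inputs, different association, different results.
left_to_right = ((xs[0] + xs[1]) + xs[2]) + xs[3]   # 1.0: the first 1.0 is absorbed into 1e16
reassociated  = (xs[0] + xs[2]) + (xs[1] + xs[3])   # 2.0: the big terms cancel first
print(left_to_right, reassociated)
```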
craigacp•1mo ago
I've done a lot of work on reproducibility in machine learning systems, and it's really, really hard. Even the JVM got me by changing some functions in `java.lang.Math` between versions and platforms (while staying within their documented 2 ulp error bounds).
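One way that kind of drift gets caught is to snapshot the exact bit patterns of a few canary evaluations and diff them across runtime versions and platforms. A minimal Python sketch (the inputs and functions here are arbitrary placeholders):

```python
import math
import struct

def bits(x: float) -> str:
    # Exact bit pattern of the double, so even 1-ulp drift shows up.
    return struct.pack("<d", x).hex()

canaries = {name: bits(fn(0.12345)) for name, fn in
            [("sin", math.sin), ("log", math.log), ("exp", math.exp)]}
print(canaries)  # store as a golden file and diff it in CI on each platform
```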
AlotOfReading•1mo ago
Relatively few programs truly need to operate that way, though, and they're usually written by people who know the cost of their choices.
saagarjha•1mo ago
aragilar•1mo ago
saagarjha•1mo ago
Dylan16807•1mo ago
"parallel decomposition that's the same as the serial one" would be difficult in many ways, but only needed when you can't afford a one-time change.
dzaima•1mo ago
Irrational stdlib functions (trig/log/exp; though not sqrt, for whatever reason) should really be basically the only source of non-reproducibility in typical programs (assuming they don't explicitly do different things depending on uncontrolled properties, don't use libraries that do, which is also a non-trivial ask, and aren't subject to an overly aggressive optimizer making incorrect transformations).
I'd hope that languages/libraries providing seeded random sources with a guarantee of equal behavior across systems would explicitly note which operations aren't reproducible, though; otherwise seeding is rather pointless. No clue if R does that.
AlotOfReading•1mo ago
NaN handling is ambiguous in two different ways. First, binary operations taking two NaNs are specified to return one of them in IEEE-754. Which one (if they're different) isn't defined and different hardware makes different choices. Second, I'm not aware of any language where NaN propagation has simple and defined semantics in practice. LLVM essentially says "you'll get one of 4 randomly chosen behaviors in each function", for example.
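You can observe the first ambiguity even from Python: build two quiet NaNs with different payloads, combine them, and see which payload survives. Which one you get depends on the hardware and compiler underneath, which is exactly the hazard being described. A sketch:

```python
import struct

def make_nan(payload: int) -> float:
    # Quiet NaN: exponent all ones, quiet bit set, plus a chosen payload.
    return struct.unpack("<d", struct.pack("<Q",
        0x7FF8000000000000 | payload))[0]

def bits(x: float) -> int:
    return struct.unpack("<Q", struct.pack("<d", x))[0]

a, b = make_nan(0x1), make_nan(0x2)
print(hex(bits(a + b)))   # which payload survives is hardware/compiler-dependent
print(hex(bits(b + a)))   # may or may not match the line above
```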
dzaima•1mo ago
NaNs indeed can get different bit patterns, but my point was that, given that you need to explicitly bit-reinterpret the NaN for that to affect anything, it can't really affect most code. I guess perhaps blindly hashing a NaN's bits? (which of course falls under your "you weren't handling NaNs […] properly anyway")
AlotOfReading•1mo ago
For NaNs, if we're being strict then you don't need to reach that far. Here's one example from LLVM:
https://github.com/llvm/llvm-project/issues/43070
This can occasionally result in visible bugs, like: https://github.com/rust-lang/rust/issues/110174
You have to be pretty nitpicky to care about this stuff, and it's not an issue in practice, but it's nonetheless a reproducibility violation because it deviates from the standard.
dzaima•1mo ago
Those LLVM issues are unrelated afaict - the former is about high-level IR optimizations that merely produce non-reproducible NaN bit patterns (and the parts that would be more than just bit patterns I can't even repro on the reported LLVM 8.0, or on any other version I tested).
The latter, though, is indeed a proper compiler bug; min/max are an awful mess... I didn't know that ARM's fminnm/fmaxnm were that funky, caring about sNaN vs qNaN - that makes them essentially unusable for compilers (not even useful in known-not-NaN contexts, since there fmin/fmax are equivalent).
thaumasiotes•1mo ago
Let's say I write a paper that says "in this statistical model, with random seed 1495268404, I get Important Result Y", and you criticize me on the grounds that when you run the model with random seed 282086400, Important Result Y does not hold. Doesn't this entire argument fail to be conceptually coherent?
AlotOfReading•1mo ago
thaumasiotes•1mo ago
saagarjha•1mo ago
dzaima•1mo ago
And that hints at the possibility of using seeds specifically for noting rare occurrences - communicating weird/unexpected(/buggy?) examples to collaborators, running for X hours to gather statistics without having to store every intermediate state of every single attempt, so you can later do extra analysis on cases the existing code didn't analyze sufficiently, etc.
hansvm•1mo ago
- Many algorithms are vastly easier to implement stochastically than deterministically. If you want to replay the system (e.g., to locally debug some production issue), you need those "stochastic" behaviors to nonetheless be deterministic.
- If you're a little careful with how you implement deterministic randomness, you can start to ask counterfactual questions -- how would the system have behaved had I made this change -- and actually compare apples to apples when examining an experimental run.
Even in your counterexample, the random seeds being reproducible and published is still important. With the seed and source published, now anyone can cheaply verify the issue with the simulation, you can debug it, you can investigate the proportion of "bad" seeds and suss out the error bounds of the simulation, etc.
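As a hypothetical sketch (toy simulation, not anyone's actual model): with the seed and code in hand you can replay the published run exactly, and you can also sweep other seeds to see how unusual that seed was.

```python
import numpy as np

def simulate(seed: int) -> float:
    # Toy stand-in for "the statistical model"; returns a summary statistic.
    rng = np.random.default_rng(seed)
    sample = rng.normal(loc=0.1, scale=1.0, size=100)
    return sample.mean()

published_seed = 1495268404
print(simulate(published_seed))   # exact replication of the published run

# How unusual was that seed? Just sweep a bunch of others.
effects = [simulate(s) for s in range(1000)]
print(np.mean([e > 0.2 for e in effects]))  # fraction of seeds showing a "large" effect
```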
thaumasiotes•1mo ago
Why does this matter? If people consistently fail to get results similar to yours by using different seeds, your results are invalid whether or not you made them up. And if people consistently do get results similar to yours with different seeds, your results are valid whether or not you made them up. No?
> Even in your counterexample, the random seeds being reproducible and published is still important.
> you can investigate the proportion of "bad" seeds and suss out the error bounds of the simulation
How does publishing seeds help with that? You do that by examining different seeds. If you know what seeds were used in the paper, this process doesn't change in any way.
aragilar•1mo ago
AlotOfReading•1mo ago
To give a practical example, I work on autonomous vehicles. My employer runs a lot of simulations against the vehicle code to make sure it meets reasonable quality standards. The hardware the code will run on in the field isn't necessarily available in large quantities in the datacenter, though, and realistically I'm not going to be able to apply numerical analysis to thousands of functions and millions of test cases to determine what the "correct" answer is in every case. Instead, I can privilege whatever the current result is as "correct enough", ensure the vehicle hardware always gets the same results as the datacenter hardware, and then rely on the many individual developers merging changes to review their own simulation results, knowing they'll hold on-road.

This is a pretty universal need for most software projects. If I'm a frontend dev, I want to know that my website will render the same on my computer as on the client's. If I'm a game engine dev, I want to ensure that replays don't diverge from the in-game experience. If I'm a VFX programmer, I want to ensure that the results generated on the artist's workstation look the same as what comes out of the render pipeline at the end. Etc.
All of these applications can still benefit from things like stability, but the benefits are orthogonal to reproducibility.
aragilar•1mo ago
To me it's the same as whether your default random number generator is a CSPRNG or just a PRNG, and generally it's safer for all involved if it's the former. The latter should exist, just with lots of warnings and guidance.
hansvm•1mo ago
5. All floating-point optimizations must be hand-rolled. That's implied by (2) and "toolchain cooperates", but it's worth calling out explicitly. Consider, e.g., summing a few thousand floats. On many modern machines, a near-optimal solution is to keep four vector accumulators, stride through the data in chunks four vectors wide, add into each accumulator, add the accumulators together at the end, and then horizontally sum the result (handling any straggling elements when the length isn't a multiple of the chunk width is left as an exercise for the reader; optimal behavior for those varies wildly). However, this gives different results depending on whether you use SSE, AVX, or AVX-512 if you want to use the hardware to its full potential. You need to make a choice (usually a bias toward wider vector types beats narrower ones, especially on AMD chips, but this is problem-specific), and whichever choice you make, you can't let the compiler reorder it.
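A scalar Python emulation of that pattern (no actual SIMD, just the accumulation order it implies): four lanes striding across the data plus a horizontal add at the end generally don't match a plain left-to-right sum, and a different lane count changes the answer again, which is why the width can't be left up to the compiler.

```python
import random

random.seed(0)
xs = [random.uniform(-1e8, 1e8) for _ in range(4096)]

def sequential_sum(data):
    total = 0.0
    for x in data:
        total += x
    return total

def strided_sum(data, lanes=4):
    acc = [0.0] * lanes              # one accumulator per "lane"
    for i, x in enumerate(data):
        acc[i % lanes] += x          # element i goes to lane i mod lanes
    total = 0.0
    for a in acc:                    # "horizontal add" at the end
        total += a
    return total

print(sequential_sum(xs))
print(strided_sum(xs, lanes=4))      # usually not bit-identical to the above
print(strided_sum(xs, lanes=8))      # and typically different again
```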
AlotOfReading•1mo ago
Indeed, you can't do that kind of hsum. Still, I haven't run into that behavior in practice (and given that I maintain an FP reproducibility test suite, I've tried). I'd be open to any suggestion for how to replicate it with the usual GCC/Clang etc.