Often it doesn't matter, like when you ask for a GEMM and you get whatever order of operations BLAS, AVX-whatever, and OpenMP conspire to give you, so it is more-or-less random.
But if it does matter, the ability to define it is there.
Until it isn't. I used to play the CodeWeavers port of Kohan, and while the game allowed crossplay between Windows and Linux, differences in how the two OSes rounded floats would cause the game to desynchronize after 15-20 minutes of play or so. Some unit's pathfinding algorithm would zig on Windows and zag on Linux, causing the state to diverge and eventually kicking off one of the players.
Or, since it was a port, maybe they were compiled with different optimizations.
There are a lot of things happening under the hood but most of them should be deterministic.
For cross-OS issues, the most likely culprit is that Windows and Linux are using different libm implementations, which means that the results of functions like sin or atan2 are going to be slightly different.
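If you want to see this for yourself, a small check like the following (my own sketch, not anything from the game) prints the exact bit pattern of a libm result. The value is correct to within about a ulp everywhere, but the low bit or two may differ between libm implementations, which is already enough to desync a lockstep simulation:

    #include <cmath>
    #include <cstdint>
    #include <cstring>
    #include <cstdio>

    int main() {
        double r = std::sin(1e16);             // large argument stresses libm's range reduction
        std::uint64_t bits;
        std::memcpy(&bits, &r, sizeof bits);   // reinterpret the double as raw bits
        std::printf("sin(1e16) = %.17g  bits = 0x%016llx\n",
                    r, (unsigned long long)bits);
        return 0;
    }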
My recollection is fuzzy, but IIRC the legacy x87 control word is always set to extended precision on Linux, while it is set to double precision on Windows, and this affects normal float and double computations as well: the conversion to float and double is only done when storing to and from memory, while intermediate in-register operations are always performed at the maximum enabled precision. Changing the precision before each operation is expensive, so it is not done.
This is one of the causes of x87's apparent nondeterminism, as it depends on the compiler unpredictably spilling FP registers [1]: unless you always use the maximum enabled precision, computations might not be reproducible from one build to the other, even in the same environment.
[1] Eventually GCC added compilation modes with deterministic behavior, but that was well after x87 was obsolete. In the meantime people had to make do with -ffloat-store and/or volatile. See https://gcc.gnu.org/wiki/FloatingPointMath.
edit: but you know this as you mentioned it elsethread.
(And I have heard of the 80-bit internal register thing described in https://news.ycombinator.com/item?id=44888692 causing real problems for people before. And -ffast-math is basically spooky action at a distance considering how it bleeds into the entire program; see e.g. https://moyix.blogspot.com/2022/09/someones-been-messing-wit....)
This is especially useful when writing branchless or SIMD code. Adding a branch for checking against zero can have bad performance implications and it isn't even necessary in many cases.
Especially in graphics code I often see a zero check before a division and a fallback for the zero case. This often practically means "wait until numerical precision artifacts arise and then do something else". Often you could just choose the better of the two options you have instead of checking for zero.
Case in point: choosing the axes of your shadow map projection matrix. You have two options (world x axis or z axis), choose the better one (larger angle with viewing direction). Don't wait until the division goes to inf and then fall back to the other.
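A minimal sketch of that idea, with made-up names (Vec3, chooseUpAxis) and assuming a normalized view direction: pick whichever world axis is least aligned with the view direction up front, instead of dividing and falling back when the result blows up:

    #include <cmath>

    struct Vec3 { float x, y, z; };

    static float dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

    // Pick the world axis that makes the larger angle with the view direction,
    // i.e. the smaller |dot|, so the basis built from it stays well-conditioned.
    Vec3 chooseUpAxis(Vec3 viewDir) {
        const Vec3 worldX{1.0f, 0.0f, 0.0f};
        const Vec3 worldZ{0.0f, 0.0f, 1.0f};
        return std::fabs(dot(viewDir, worldX)) < std::fabs(dot(viewDir, worldZ))
                   ? worldX : worldZ;
    }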
Operations on those sentinel values are also defined. This can affect when checking needs to be done in optimized code.
Personally, I am lazy, so I don’t check the mxcsr register before I start running my programs. Maybe gcc does something by default, I don’t know. IMO legitimate division by zero is rare but not impossible, so if you do it, the onus is on you to make sure the flags are set up right.
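For what it's worth, you don't have to poke MXCSR directly; here's a sketch using the standard <cfenv> flag functions (my illustration; strictly speaking FENV_ACCESS should be enabled, which not every compiler honors):

    #include <cfenv>
    #include <cstdio>

    int main() {
        std::feclearexcept(FE_ALL_EXCEPT);   // start with clean sticky flags
        volatile double a = 1.0, b = 0.0;    // volatile: keep the division at run time
        volatile double t = a / b;           // sets FE_DIVBYZERO, yields +inf
        if (std::fetestexcept(FE_DIVBYZERO))
            std::printf("division by zero happened, t = %g\n", (double)t);
        return 0;
    }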
It's well-defined is all that really matters AFAIC.
I'll give two examples from a recent project where I very intentionally divided by zero. The first one was about solving for a zero of a derivative and checking whether it falls in the 0..1 range. This exploits the fact that (x < NaN) is always false and comparisons with +/- inf behave as expected.
    float t = a / b; // might divide by zero: NaN if both a and b == 0.0, +/- inf if only b == 0.0
    if (t > 0.0 && t < 1.0) {
        // we don't get here if t is +/- inf or nan
        split_at(t); // do the thing
    }
The second one was similar, but clamping to the 0..1 range using branchless SIMD min/max:

    f32x8 x = a / b;                        // might divide by zero
    return simd_min(simd_max(0.0, x), 1.0); // returns 0 for -inf or nan, 1 for +inf
In both of these cases, explicitly checking for division by zero or isinf/isnan would've been (worse than) useless, because just using the inf/NaN values gave the correct answer for what comes next.

    >>> 1.0/0.0
    ZeroDivisionError
    >>> np.float64(1)/np.float64(0)
    inf

I'm so used to writing such zero divisions in other languages like C/C++ that this Python quirk still trips me up.

It is implementation-dependent. It is not obligatory for an implementation to respect IEEE 754.
However, the story about non-determinism is no myth. The intel processors have a separate math coprocessor that supports 80bit floats (https://en.wikipedia.org/wiki/Extended_precision#x86_extende...). Moving a float from a register in this coprocessor to memory truncates the float. Repeated math can be done inside this coprocessor to achieve higher precision so hot loops generally don't move floats outside of these registers. Non-determinism occurs in programs running on intel with floats when threads are interrupted and the math coprocessor flushed. The non-determinism isn't intrinsic to the floating point arithmetic but to the non-determinism of when this truncation may occur. This is more relevant for fields where chaotic dynamics occur. So the same program with the same inputs can produce different results.
NaN is an error. If you take the square root of a negative number you get a NaN. This is just a type error; use complex numbers to overcome this one. But then you get 0. / 0. and that's a NaN, or Inf - Inf, and a whole slew of other things that produce out-of-bounds results. Whether it is expected or not is another story, but it does mean that you are unable to represent the value with a float, and that is a type error.
That's ridiculous. No OS in its right mind would flush FPU regs to 64 bits only, because that would break many things, most obviously "real" 80-bit FP, which is still a thing and the only reason x87 instructions still work. It would even break plain equality comparisons, making all FP useless.
For 64 bit FP most compilers prefer SSE rather than x87 instructions these days.
Never, for sure.
FTFY. They even changed some of the more obscure handling between the 8087, 80287, and 80387. So much hoop-jumping if you cared about binary reproducibility.
Seems to be largely fixed with targeting SSE even for scalar code now.
> The intel processors have a separate math coprocessor that supports 80bit floats
x86 processors have two FPU units, the x87 unit (which you're describing) and the SSE unit. Anyone compiling for x86-64 uses the SSE unit by default, and most x86-32 compilers default to SSE these days anyway.
> Moving a float from a register in this coprocessor to memory truncates the float.
No it doesn't. The x87 unit has load and store instructions for 32-bit, 64-bit, and 80-bit floats. If you want to spill 80-bit values as 80-bit values, you can do so.
> Repeated math can be done inside this coprocessor to achieve higher precision so hot loops generally don't move floats outside of these registers.
Hot loops these days use the SSE stuff because they're so much faster than x87. Friends don't let friends use long double without good reason!
> Non-determinism occurs in programs running on intel with floats when threads are interrupted and the math coprocessor flushed.
Lol, nope. You'll spill the x87 register stack on thread context switch with FSAVE or FXSAVE or XSAVE, all of which will store the registers as 80-bit values without loss of precision.
That said, there was a problem with programs that use the x87 unit, but it has absolutely nothing to do with what you're describing. The x87 unit doesn't have arithmetic for 32-bit and 64-bit values, only 80-bit values. Many compilers, though, just pretended that the x87 unit supported arithmetic on 32-bit and 64-bit values, so that FADD would simultaneously be a 32-bit addition, a 64-bit addition, and an 80-bit addition. If the compiler needed to spill a floating-point register, they would spill the value as a 32-bit value (if float) or 64-bit value (if double), and register spills are pretty unpredictable for user code. That's the nondeterminism you're referring to, and it's considered a bug in every compiler I'm aware of. (See https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p37... for a more thorough description of the problem).
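A minimal sketch of that spill behavior (my own, heavily hedged): built for 32-bit x87 without -ffloat-store or -fexcess-precision=standard, whether this prints 1 can depend on whether the compiler happens to keep the quotient in an 80-bit register or spills it to a 64-bit slot; built for SSE it reliably prints 1:

    #include <cstdio>

    int main() {
        volatile double a = 1.0, b = 3.0;   // volatile: force the division to happen at run time
        double q = a / b;                   // may be rounded to 64 bits on spill, or kept at 80 bits
        std::printf("%d\n", q == a / b);    // 1 if both sides see the same precision, possibly 0 otherwise
        return 0;
    }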
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p37...
Summarized as,
> Most users cannot be expected to know all of the ways that their floating-point code is not reproducible.
Glad to know that the situation is defaulting to SSE2 nowadays though.
I passed through the highly-irreproducible eras described in the section you link, and that you summarize in your last paragraph. There was so much different FP hardware, and so many different niche compilers, that my takeaway became “you can’t rely on reproducibility across any hardware/os version/compiler/library difference”.
There are still issues with libraries and compilers, summarized farther down (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p37...). And these can be observed in the wild when changing compilers on the same underlying hardware.
But your point is that irreproducibility at the level of interrupts or processor scheduling is not a thing on contemporary mainstream hardware. That’s important and I hadn’t realized that.
If NaN is invalid input for the next step, then sure why not treat it as an error? But that's a design decision not an imperative that everybody must follow. (I picture Mel Brooks' 15 commandments, #11-15, that fell and broke. This is not like that.)
The design decision isn't, "Was this a type error?" but, "What do I need to do about this type error?"
THEODOTUS (outraged). How!
CAESAR (recovering his self-possession). Pardon him, Theodotus: he is a barbarian, and thinks that the customs of his tribe and island are the definition of the float type.
The article is right that sometimes you can use simple equality on floating point values. But if you have an (almost) blanket rule to use fuzzy comparison, or (even better) just avoid any sort of equality-like comparison altogether, then your code might be simpler to understand and you might be less likely to make a mistake about when you really can safely do that.
It's still sensible to understand that it's possible though. Just that you should try to avoid writing tricky code that depends on it.
Updating with a concrete example: The Fortran standard defines the real MOD and MODULO intrinsic functions as being equivalent to the straightforward sequence of a division, conversion to integer, multiplication, and subtraction. This formula can round, obviously. But MOD can be (and thus should be) implemented exactly by other means, and most Fortran compilers do so instead. This leaves Fortran implementors in a bit of a pickle -- conform to the standard, or produce good results?
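To make the pickle concrete, here is a rough illustration in C++ rather than Fortran (inputs chosen by me): the naive divide/truncate/multiply/subtract sequence rounds along the way, while fmod computes the remainder without intermediate rounding:

    #include <cmath>
    #include <cstdio>

    int main() {
        double a = 1.0e16, p = 3.0;                  // 1e16 is exactly representable; 1e16 mod 3 == 1
        double naive = a - std::trunc(a / p) * p;    // each step can round; gives 0 here under round-to-nearest
        double exact = std::fmod(a, p);              // no intermediate rounding: gives 1
        std::printf("naive = %.17g, fmod = %.17g\n", naive, exact);
        return 0;
    }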
They are approximations, and even if your current floating point approximation of a value seems to be accurate as an integer value...
There is no guarantee, if you take two of those floating point approximations that appear completely accurate, that the result of an operation between them will also be completely accurate.
Floating point numbers stand in contrast to fixed point numbers, where the point (between the integer part and the fractional part) is fixed; that is why they are called floating. The nature of the real numbers is such that they can in general only be approximated, regardless of whether you use fixed point numbers, floating point numbers, or a fancier computable-number representation.
We can't even represent all the natural numbers: that would require infinite memory. Now you want to represent the real numbers, equinumerous with the power set of natural numbers?
We can represent all real numbers to the same extent that we can represent all natural numbers. They're just infinite strings, with each additional letter in the string mattering less and less as it goes along. The "Type Two Effectivity" model for modelling computations with infinite strings doesn't limit you to using just the computable real numbers. An uncomputable real number can be produced by outputting an infinite string letter by letter with each letter getting picked at random. The TTE model does OTOH limit you to computing only with the computable functions on those real numbers.
The TTE model basically uses (Python-like) generators to output an endless sequence of interval approximations to a real number.
This certainly outputs an uncomputable real number, given that you have true random input and not a PRNG.
But random numbers from a seed and an algorithm are by definition computable.
And if you can't compute the number, you have no basis for the claim that it's the right number.
Also, good luck computing with arbitrarily long, computed-on-demand digit strings. Consider: how do you know how many digits you need from the input in order to be sure of N correct digits in the output? This might be possible in general, but it certainly doesn't sound like my idea of fun.
Representable numbers are countable, they have measure zero in the reals.
Because anyone who's had any experience knows you absolutely can't.
I am explicitly excluding big decimal type functionality here. What we are talking about explicitly are floats and doubles.
https://wingolog.org/archives/2011/05/18/value-representatio...
It's not exactly a myth; as the article mentions, they're only exact for certain ranges of values.
> NaN and INF are indication of an error
This is somewhat semantic, but dividing by zero does raise the divide-by-zero exception flag in hardware. However, the exception is masked by default, so it's all handled behind the scenes and you get "Inf" as the result.
You can make it so dividing by zero is explicitly an error; see the -ftrapping-math flag.
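As far as I know, -ftrapping-math only tells the compiler to assume floating-point operations may trap; the exception itself is unmasked at run time, e.g. with glibc's feenableexcept (a GNU extension, not standard C/C++). A sketch:

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE            // for feenableexcept (glibc extension)
    #endif
    #include <fenv.h>
    #include <cstdio>

    int main() {
        feenableexcept(FE_DIVBYZERO);     // unmask the divide-by-zero exception
        volatile double a = 1.0, b = 0.0;
        double t = a / b;                 // now delivers SIGFPE instead of quietly returning +inf
        std::printf("%g\n", t);           // not reached if the trap fires
        return 0;
    }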
Some operations involve rounding, notably conversion from decimal, but the (rounded) result is an exact number that can be stored and recovered without loss of precision and equality will work as expected. Converting from floating point binary to decimal is exact though, given enough digits (can be >1000 for very small or very large numbers).
They are though. All arithmetic operations involve rounding, so e.g. (7.0 / 1234 + 0.5) * 1234 is not equal to 7.0 + 617 (it differs by 1 ULP). On the other hand, (9.0 / 1234 + 0.5) * 1234 is equal to 9.0 + 617, so the end result is sometimes exact and sometimes not. How can you know beforehand which one applies in your specific case? Generally, you can't: any arithmetic operation can potentially give you 1 ULP of error, and it can (and likely will) slowly accumulate.
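If you want to check the parent's specific numbers yourself, a quick test (expected to print 0 then 1 per the claim above):

    #include <cstdio>

    int main() {
        double x = (7.0 / 1234 + 0.5) * 1234;   // claimed to be off by 1 ULP
        double y = (9.0 / 1234 + 0.5) * 1234;   // claimed to be exact
        std::printf("%d\n", x == 7.0 + 617);    // expected 0
        std::printf("%d\n", y == 9.0 + 617);    // expected 1
        return 0;
    }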
I.e., two floats that are _identical_ to each other (even when it's _the same_ variable, on the same memory address) can be not _equal_ to each other, specifically if it's NaN. This is dictated by IEEE-754, and this is true for all programming languages I know, and to this day, this makes zero sense to me, but apparently this is useful for some reason.
It lets you easily test whether a value is NaN without needing a library function call. (Even if the language wanted to provide a NaN literal, there's more than one NaN value, so that wouldn't actually work!)
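In code, the self-comparison test looks like this (a trivial sketch):

    #include <cmath>
    #include <cstdio>
    #include <limits>

    int main() {
        double x = std::numeric_limits<double>::quiet_NaN();
        bool via_compare = (x != x);   // true only for NaN: no library call needed
        std::printf("%d %d\n", (int)via_compare, (int)std::isnan(x));   // prints 1 1
        return 0;
    }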
Breaking a basic programming expectation for a specific class of values of a specific type is not a good thing, in my opinion.
Amusingly, pocket calculators became affordable while I was in high school, and anybody who was interested in math learned the foibles of floating point almost instantly. Now it's an advanced topic. A difference may have been that our calculators had very few significant digits, so the errors introduced by successive calculations became noticeable more quickly. Also, you had to think about what you were doing because, even if you trusted the calculator, you didn't trust your fingers.
> the largest integer value that can be represented exactly is 2^53
I am confused as to why it is not 2^52, given that there are 52 bits of mantissa, so the relative accuracy is 2^-52, which translates to an absolute accuracy larger than 1 beyond 2^52. Compare this to the table there saying "Next value after 1 = 1 + 2^-52".
Because of the hidden bit, the significand effectively has 53 bits: its largest value is 1{hidden bit} + (1 - 2^-52){mantissa with all ones}. So the relative accuracy, corresponding to the absolute accuracy of a single bit in the significand, is about 2^-53, and integers up to 2^53 are exact. The hidden bit is easy to forget about...
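A quick demonstration of that 2^53 boundary (my own snippet):

    #include <cstdio>

    int main() {
        double big = 9007199254740992.0;        // 2^53
        std::printf("%.0f\n", big - 1.0);       // 9007199254740991: every integer below 2^53 is exact
        std::printf("%d\n", big + 1.0 == big);  // 1: 2^53 + 1 rounds back to 2^53
        std::printf("%.0f\n", big + 2.0);       // 9007199254740994: spacing above 2^53 is 2
        return 0;
    }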
A few more resources on this topic:
https://randomascii.wordpress.com/category/floating-point/ - a series of articles about floating point, with deeper details.
https://float.exposed/ - explorer of floating point values by Bartosz Ciechanowski.
Obvious: An array containing NaNs will almost certainly be sorted incorrectly with std::sort.
Non-obvious: sorting an array containing NaNs with std::sort from libstdc++ (gcc's implementation) could lead to a buffer overflow (see the NaN-aware comparator sketch below).
Non-obvious: using MMX instructions without clearing the FPU state leads to incorrect results of floating-point instructions in other parts of the code base: https://github.com/simdjson/simdjson/issues/169
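A sketch (mine, not from the thread) of one way to sort in the presence of NaNs without breaking std::sort's preconditions: use a comparator that is still a strict weak ordering when NaNs are present, pushing them all to the end:

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<double> v{3.0, NAN, 1.0, 2.0, NAN};
        // Order by (is-NaN, value): a valid strict weak ordering even with NaNs,
        // so std::sort's preconditions hold and NaNs all end up at the back.
        std::sort(v.begin(), v.end(), [](double a, double b) {
            if (std::isnan(a)) return false;   // NaN is never "less than" anything
            if (std::isnan(b)) return true;    // any number is less than NaN
            return a < b;
        });
        for (double x : v) std::printf("%g ", x);   // 1 2 3 nan nan
        std::printf("\n");
        return 0;
    }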
This awesome article drives home the point very succinctly: https://fabiensanglard.net/floating_point_visually_explained...
AshamedCaptain•5mo ago
Also, compiler optimizations should not affect the result, especially without "fast-math/O3" (which arguably many people stupidly use nowadays, and then complain).
zokier•5mo ago
-ffp-contract is an annoying exception to that principle.
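A small illustration of what contraction changes (my own example): uncontracted, a*b + c rounds twice; contracted into an FMA it rounds once, like std::fma does:

    #include <cmath>
    #include <cstdio>

    int main() {
        volatile double eps = 1.0 / (1 << 29);  // 2^-29, exactly representable
        double a = 1.0 + eps, b = 1.0 - eps, c = -1.0;
        double separate = a * b + c;            // with contraction off: a*b rounds to 1.0, so this is 0
        double fused    = std::fma(a, b, c);    // single rounding: -2^-58, about -3.5e-18
        // Whether 'separate' matches 'fused' depends on the -ffp-contract setting.
        std::printf("%g vs %g\n", separate, fused);
        return 0;
    }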
nspattak•5mo ago
A couple of decades ago, when I was still a lowly applied mathematics (and aspiring programmer) student, I proposed to a professor that we change the compiler options from O2 to O3 to gain performance (reducing run time by a significant percentage, like 50% or more...), only to be dismissed with "no, compiler optimizations break numerical accuracy".