I had no problem getting deterministic LLM outputs when I experimented with this 6 months ago.
Run two of these with the same prompts and same seed and you get the same results.
Obviously in GPU clusters with different hardware things get more complicated.
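Something like this minimal sketch is all it took (assuming llama-cpp-python and a local GGUF model; the path and parameters are placeholders):

```python
from llama_cpp import Llama

def run_once(prompt: str) -> str:
    # Fixed seed plus temperature 0 (greedy decoding) on a single local model.
    llm = Llama(model_path="./model.gguf", seed=42, n_ctx=2048, verbose=False)
    out = llm(prompt, max_tokens=64, temperature=0.0)
    return out["choices"][0]["text"]

prompt = "Write a function that reverses a linked list."
print(run_once(prompt) == run_once(prompt))  # same machine, same build: True in my runs
```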
"I had no problem getting deterministic LLM outputs when I experimented with this 6 months ago" looks like you're using llama-cpp in that repo. This is about vllm serving many requests at once, at long sequence lengths.
> As it turns out, our request’s output does depend on the parallel user requests. Not because we’re somehow leaking information across batches — instead, it’s because our forward pass lacks “batch invariance”, causing our request’s output to depend on the batch size of our forward pass.
Your situation isn't really comparable.
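The batch-size dependence is easy to reproduce in isolation, without sampling or MoE in the picture. A rough PyTorch sketch (results depend on the GPU, dtype, and which kernels get picked; on some setups the two results may come out equal):

```python
import torch

torch.manual_seed(0)
dev = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if dev == "cuda" else torch.float32

# One linear layer; the same input row run alone vs. packed into a larger batch.
W = torch.randn(4096, 4096, device=dev, dtype=dtype)
x = torch.randn(4096, device=dev, dtype=dtype)

alone = (x.unsqueeze(0) @ W)[0]                 # batch size 1
batched = (x.unsqueeze(0).repeat(8, 1) @ W)[0]  # same row, batch size 8

# Different batch sizes can dispatch to different kernels and reduction orders,
# so the "same" computation need not be bitwise identical.
print(torch.equal(alone, batched))
print((alone - batched).abs().max().item())
```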
Until those are addressed, eliminating closed-system nondeterminism doesn't really help except in cases where a lookup table would do just as well. You can't use "correct" unit tests or evaluation sets to prove anything about inputs you haven't tested.
Nondeterminism is what currently keeps me from working with other developers.
As I wrote in "Prompt Coding" [1], these days I am not looking for good code. I am looking for prompts that create good code. But how do you share prompts among developers when they produce different code every time? You cannot simply state "Here, I found a prompt that makes gpt-5-2025-08-07 output a solution with all the desired attributes".
It's similar with images. At the moment, for most image models, you cannot outsource the task of writing prompts that create the desired images, because most image models will not create the same image when given the same prompt and parameters.
I'm hoping it becomes more useful as models improve and become more reliable at producing working code (though determinism would be great for improving prompts).
For many applications, non-determinism implies "useless". This has been a long-standing issue with LDA topic models. In particular, in the legal, financial, and regulatory domains, if a method is not deterministic, it may be illegal to use, or it may trigger follow-on requirements that one does not want (e.g. all screens shown to humans must be preserved so that one can go back and reconstruct exactly what happened to a particular user in a particular second).
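LDA makes the point concretely: two fits of the same corpus with different seeds generally give different topics, and pinning the seed is what makes the method defensible in that kind of environment. A toy sketch with scikit-learn (corpus and parameters are purely illustrative):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["loan credit risk default", "contract clause liability breach",
        "loan interest rate credit", "regulation compliance audit clause"]
X = CountVectorizer().fit_transform(docs)

def top_words(seed: int) -> np.ndarray:
    lda = LatentDirichletAllocation(n_components=2, random_state=seed).fit(X)
    return lda.components_.argsort(axis=1)[:, -3:]  # top-3 term indices per topic

print(np.array_equal(top_words(0), top_words(0)))  # same seed: reproducible
print(np.array_equal(top_words(0), top_words(1)))  # different seed: topics can differ
```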
> For example, you might observe that asking ChatGPT the same question multiple times provides different results.
Even with 0.0 temperature, due to MoE models routing at a batch level, you're very unlikely to get a deterministic batch.
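This is easy to check empirically. A minimal sketch against the OpenAI Python client (the model name is just an example; the `seed` parameter is documented as best-effort only):

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask() -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any hosted model works
        messages=[{"role": "user", "content": "Name three prime numbers above 100."}],
        temperature=0.0,
        seed=1234,            # best-effort determinism, not a guarantee
    )
    return resp.choices[0].message.content

counts = Counter(ask() for _ in range(10))
print(f"{len(counts)} distinct completions out of 10 identical requests")
```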
> Not because we’re somehow leaking information across batches — instead, it’s because our forward pass lacks “batch invariance”, causing our request’s output to depend on the batch size of our forward pass.
The router also leaks batch-level information across sequences.
I don’t think this is correct. MoE routing happens on a per-token basis. It can be non-deterministic and batch-related if you try to balance out your expert load within a batch, but that’s a performance optimization (just like everything else in the blog post) and not the way the models are trained to work.
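Both things can be true at once: the router scores each token independently, but a per-batch capacity limit makes the final assignment depend on the rest of the batch. A toy illustration of that interaction (not any particular model's implementation):

```python
import torch

torch.manual_seed(0)
n_experts, capacity = 4, 2  # each expert accepts at most 2 tokens per batch
router = torch.nn.Linear(16, n_experts)

def route(tokens):
    # Per-token top-1 choice: this part depends only on each token itself.
    choice = router(tokens).argmax(dim=-1).tolist()
    used = [0] * n_experts
    assignment = []
    for expert in choice:
        if used[expert] < capacity:
            used[expert] += 1
            assignment.append(expert)
        else:
            assignment.append(None)  # dropped or rerouted: depends on the rest of the batch
    return assignment

x = torch.randn(8, 16)
print(route(x[-1:]))  # last token routed alone: always within capacity
print(route(x)[-1])   # same token inside the full batch: may now be dropped
```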
This is literally one of the most knowledgeable people on the topic. I think you are the one who hasn’t peeled back enough layers to connect with what they are saying.
reliabilityguy•1h ago
Valid point. Floating point summation is not always associative.
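Concretely, the same three doubles summed with two different groupings:

```python
a, b, c = 1e20, -1e20, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0: the 1.0 is absorbed before the large terms cancel
```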
AlotOfReading•1h ago
With revisions, you're trying to ensure a consistent floating point environment where the operations used are deterministic, and used in the same order with the same inputs. The best way to do that is to use operations that adhere to a mostly deterministic standard like IEEE-754.
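For reductions specifically, one option is a correctly rounded sum, which gives the same answer no matter how the inputs are ordered. A quick sketch:

```python
import math
import random

random.seed(0)
xs = [random.uniform(-1e10, 1e10) for _ in range(100_000)] + [1e-6] * 1000
ys = xs[:]
random.shuffle(ys)

print(sum(xs) == sum(ys))              # often False: left-to-right order matters
print(math.fsum(xs) == math.fsum(ys))  # True: correctly rounded, so order-independent
```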
saagarjha•1h ago
Compilers can also reorder operations, but in practice this is rarely an issue because kernels typically synchronize frequently, which limits the compiler’s ability to reorder things. This isn’t to say it doesn’t happen, but if it does, it’s likely because the compiler itself changed; the code a given compiler generates is generally run-to-run identical.