I run safety-critical math. "Close enough" doesn't cut it.
I ran the exact same workload on my local M1 and my H100. The numbers didn't match. Standard libraries (NumPy, PyTorch) drift because floating-point math is not associative, and GPU kernels fuse and reorder ops (fused multiply-adds, parallel reductions) differently than CPU code paths do.
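You don't need a GPU to see the reordering effect. Here is a minimal sketch in plain Python (standard library only) that sums the same four float64 values in two different orders. The function names and the input values are my own picks for illustration; this is not code from NumPy, PyTorch, or LuxiEdge.

```python
import math

def sum_left_to_right(values):
    """Accumulate in index order, like a simple serial loop."""
    total = 0.0
    for v in values:
        total += v
    return total

def sum_pairwise(values):
    """Split in half and add the partial sums, like a tree reduction."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return sum_pairwise(values[:mid]) + sum_pairwise(values[mid:])

# Illustrative inputs: the mathematically exact sum is 2.0.
values = [1e16, 1.0, -1e16, 1.0]

serial = sum_left_to_right(values)  # 1e16 absorbs the first 1.0, then cancels
tree = sum_pairwise(values)         # each half rounds to +/-1e16, then cancels
exact = math.fsum(values)           # correctly rounded reference sum

print(serial, tree, exact)  # 1.0 0.0 2.0
```

Same inputs, same precision, three different answers. Any backend that changes the reduction order is free to land on a different one.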
I built my own engine (LuxiEdge) to fix it. Now I get bit-exact matches across both architectures. 0.00% drift.
Here is the repo and the reference hash. If you think your standard tools can do this, go ahead and try to match it. You can't.
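If you do try, note that "match the hash" only means something when the hash is taken over raw bits, not over printed decimals. Below is a rough sketch of that kind of check, assuming the result is a flat list of float64 values serialized little-endian and hashed with SHA-256. The name `bit_exact_digest` and that layout are assumptions for illustration, not the LuxiEdge hashing code.

```python
import hashlib
import struct

def bit_exact_digest(values):
    """Hash the IEEE-754 float64 bit patterns of `values` (hypothetical helper).

    Assumes a flat sequence of Python floats, packed little-endian so the
    digest does not depend on the host byte order.
    """
    payload = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(payload).hexdigest()

# Two results that differ only in the last bit produce different digests.
a = bit_exact_digest([0.1 + 0.2])
b = bit_exact_digest([0.3])
print(a == b)  # False

# 0.0 == -0.0 is True in Python, but the bit patterns differ, so the digests do too.
print(bit_exact_digest([0.0]) == bit_exact_digest([-0.0]))  # False
```

A check like this is deliberately stricter than `==` or any tolerance test: a single flipped bit, a signed zero, or a different NaN payload changes the digest.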