Bit-exact SW-emulated FP64 on Metal, 5-11x faster than CPU HW-accelerated FP64.
I was learning about RandomX and wanted to play with the algorithm on a Mac, then discovered that Metal has no FP64 math. I also discovered that this has been a frustration for a lot of people in ML, science, and gaming.
I went down a rabbit hole. The naive implementation had ~10% of the throughput of hardware CPU FP64 on the same machine. After obsessively squeezing every bit of juice out of the GPU, the final version is 5–11× faster than a 14-thread CPU hardware-FP64 baseline on arithmetic, and 10–35× faster on conversions and comparisons (M4 Pro, 20 GPU cores).
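To give a flavor of what "software-emulated FP64" means: each double is carried around as a raw 64-bit integer, and every operation is rebuilt from integer instructions. Below is a minimal Python sketch of one of the simpler cases, a bit-exact less-than comparison. It is only an illustration of the general technique, not the Metal code from this project; the names `bits` and `soft_lt` are made up for the example.

```python
import struct

MASK64 = (1 << 64) - 1
SIGN = 1 << 63
EXP_MASK = 0x7FF0000000000000
FRAC_MASK = 0x000FFFFFFFFFFFFF


def bits(x: float) -> int:
    """Reinterpret a hardware double as its 64-bit IEEE 754 pattern."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def is_nan(u: int) -> bool:
    # NaN: exponent field all ones and a non-zero fraction.
    return (u & EXP_MASK) == EXP_MASK and (u & FRAC_MASK) != 0


def soft_lt(a: int, b: int) -> bool:
    """a < b on FP64 bit patterns, using integer operations only."""
    # Any comparison involving NaN is false.
    if is_nan(a) or is_nan(b):
        return False
    # +0.0 and -0.0 compare equal, so neither is less than the other.
    if ((a | b) & (MASK64 ^ SIGN)) == 0:
        return False

    # Map each pattern to an unsigned key that sorts in numeric order:
    # negative values have all bits flipped, non-negative values get
    # the sign bit set.
    def key(u: int) -> int:
        return (~u & MASK64) if (u & SIGN) else (u | SIGN)

    return key(a) < key(b)
```

A quick check against the hardware result, for example `soft_lt(bits(-0.0), bits(0.0)) == (-0.0 < 0.0)`, is an easy way to confirm the edge cases (signed zeros, NaN, infinities, subnormals) agree. Arithmetic such as add and multiply follows the same pattern but needs explicit exponent alignment, normalization, and rounding.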
guyfischman•1h ago
I hope you find this useful.