Note that the NVIDIA container uses CUDA with cuBLAS 13.0.2, whose release notes cite "Improved performance on NVIDIA DGX Spark for FP16/BF16 and FP8 GEMMs", which seems to be your use case.
In general, I would suspect that it mostly comes down to the versions of the libraries.
Interestingly, there is a cuBLAS 13.1 wheel on PyPI; not sure what that does.
riomus•3mo ago
I did a shallow check on PyTorch (it reports version 2.9.0), and it is different from the 2.9.0 on the PyTorch index; the differences are in code parts from months before 2.9.0 was out, which is why I assume NVIDIA is using their own fork. For cuBLAS, I see natively that it is available (libcublas.so.13.1.0.3) in the same version as in the container.
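For anyone wanting to repeat that check: the cuBLAS version can be read straight off the fully-versioned soname (e.g. libcublas.so.13.1.0.3). A minimal stdlib-only sketch, assuming the usual `lib.so.MAJOR.MINOR.PATCH.BUILD` naming (the directory you scan is up to you, e.g. the container's or host's library path):

```python
from pathlib import Path

def soname_version(name: str) -> tuple[int, ...]:
    """Parse a versioned soname like 'libcublas.so.13.1.0.3' into (13, 1, 0, 3)."""
    suffix = name.split(".so.", 1)[1]
    return tuple(int(part) for part in suffix.split("."))

def find_cublas(lib_dir: str) -> dict[str, tuple[int, ...]]:
    """Map each fully-versioned libcublas soname under lib_dir to its version tuple."""
    return {
        p.name: soname_version(p.name)
        for p in Path(lib_dir).glob("libcublas.so.*.*")
        # keep only names whose suffix is purely numeric, e.g. '13.1.0.3'
        if p.name.split(".so.", 1)[1].replace(".", "").isdigit()
    }

# e.g. soname_version("libcublas.so.13.1.0.3") -> (13, 1, 0, 3)
```

Running `find_cublas` on the same directory inside and outside the container makes the version comparison explicit.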
riomus•3mo ago