Starting with the 4090, NVIDIA limits tensor core performance on gaming cards, specifically for ops likely to be used in training. FP8 and FP16 matmuls run at full speed when accumulating in FP16 (I've never seen anyone use this), but at only half speed when accumulating in FP32. The restriction does not apply to lower-precision matmuls like FP4, and it is absent from workstation-class cards like the RTX Pro 6000.
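To make the distinction concrete: the accumulation precision is what you select through the compute type in cuBLAS, so the fast and slow paths described above look like this (a minimal sketch; error checking and timing omitted, enum names are standard cuBLAS):

```cuda
// Same FP16 GEMM twice, differing only in accumulate precision.
// Per the comment above: on GeForce parts the 32F-accumulate call
// runs tensor cores at half the rate of the 16F-accumulate call.
// Build: nvcc fp16_gemm.cu -lcublas
#include <cublas_v2.h>
#include <cuda_fp16.h>

int main() {
    const int n = 4096;
    __half *A, *B, *C;
    cudaMalloc(&A, n * n * sizeof(__half));
    cudaMalloc(&B, n * n * sizeof(__half));
    cudaMalloc(&C, n * n * sizeof(__half));

    cublasHandle_t h;
    cublasCreate(&h);

    // Full-rate path on gaming cards: accumulate in FP16.
    const __half alpha16 = __float2half(1.f), beta16 = __float2half(0.f);
    cublasGemmEx(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 &alpha16, A, CUDA_R_16F, n, B, CUDA_R_16F, n,
                 &beta16, C, CUDA_R_16F, n,
                 CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT);

    // Half-rate path on gaming cards: accumulate in FP32,
    // which is what training code almost always wants.
    const float alpha32 = 1.f, beta32 = 0.f;
    cublasGemmEx(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 &alpha32, A, CUDA_R_16F, n, B, CUDA_R_16F, n,
                 &beta32, C, CUDA_R_16F, n,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);

    cudaDeviceSynchronize();
    cublasDestroy(h);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Timing the two calls back to back on a 4090 versus an RTX Pro 6000 should show the ~2x gap only on the gaming card, if the claim holds.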
It no longer seems worth using NVIDIA gaming cards as a "cheaper FLOPs" alternative (e.g. diffusion models could have been cheaper to run on a 4090 than on an H100). They are generous with memory bandwidth, though: nearly 2 TB/s is amazing!
qeternity•2h ago
> The main objective is to learn writing attention in CUDA C++, since many features are not available in Triton, such as MXFP8 / NVFP4 MMA for sm120.
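For anyone wondering what that looks like at the lowest level, here is an untested sketch of a single warp-level MMA issued from CUDA C++ via inline PTX. It uses the plain FP16 m16n8k16 instruction rather than the block-scaled MXFP8/NVFP4 sm120 variants the post targets (those follow the same code shape with extra scale operands); the kernel and helper names are made up for illustration:

```cuda
// Untested sketch: one warp computes D(16x8) = A(16x16) * B(16x8)
// with a single mma.sync, FP16 inputs, FP32 accumulation.
// A is row-major, B is column-major (K x N), per the .row.col variant.
// Build: nvcc -arch=sm_80 mma_sketch.cu
#include <cuda_fp16.h>
#include <cstdio>

__device__ unsigned pack2(__half lo, __half hi) {
    // Pack two halves into one .f16x2 register (lo in the low 16 bits).
    __half2 h2 = __halves2half2(lo, hi);
    return *reinterpret_cast<unsigned*>(&h2);
}

__global__ void mma_16x8x16(const __half* A, const __half* B, float* D) {
    const int lane = threadIdx.x & 31;
    const int g = lane >> 2;  // which of the 8 row groups this lane covers
    const int t = lane & 3;   // position within the group (column selector)

    // Per-lane register fragments, indexed per the PTX ISA layout for
    // mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32.
    unsigned a[4], b[2];
    a[0] = pack2(A[(g + 0) * 16 + 2 * t + 0], A[(g + 0) * 16 + 2 * t + 1]);
    a[1] = pack2(A[(g + 8) * 16 + 2 * t + 0], A[(g + 8) * 16 + 2 * t + 1]);
    a[2] = pack2(A[(g + 0) * 16 + 2 * t + 8], A[(g + 0) * 16 + 2 * t + 9]);
    a[3] = pack2(A[(g + 8) * 16 + 2 * t + 8], A[(g + 8) * 16 + 2 * t + 9]);
    b[0] = pack2(B[g * 16 + 2 * t + 0], B[g * 16 + 2 * t + 1]);
    b[1] = pack2(B[g * 16 + 2 * t + 8], B[g * 16 + 2 * t + 9]);

    float c[4] = {0.f, 0.f, 0.f, 0.f};  // FP32 accumulators
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
        : "+f"(c[0]), "+f"(c[1]), "+f"(c[2]), "+f"(c[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]));

    // Scatter the 16x8 FP32 result, row-major.
    D[(g + 0) * 8 + 2 * t + 0] = c[0];
    D[(g + 0) * 8 + 2 * t + 1] = c[1];
    D[(g + 8) * 8 + 2 * t + 0] = c[2];
    D[(g + 8) * 8 + 2 * t + 1] = c[3];
}

int main() {
    __half *A, *B; float *D;
    cudaMallocManaged(&A, 16 * 16 * sizeof(__half));
    cudaMallocManaged(&B, 16 * 8 * sizeof(__half));
    cudaMallocManaged(&D, 16 * 8 * sizeof(float));
    for (int i = 0; i < 16 * 16; i++) A[i] = __float2half(1.f);
    for (int i = 0; i < 16 * 8; i++)  B[i] = __float2half(1.f);
    mma_16x8x16<<<1, 32>>>(A, B, D);
    cudaDeviceSynchronize();
    printf("D[0] = %f (expect 16, the sum over K=16 ones)\n", D[0]);
    return 0;
}
```

Hand-placing fragments like this (or via ldmatrix in real kernels) is exactly the kind of control Triton doesn't expose for the newer instruction variants.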