MFU is indeed very useful. Today we found that, while scaling Karpathy’s nanoGPT to multiple H100 nodes, the MFU calculation itself was dragging MFU down![1]
Commenting it out improved per-iteration performance by almost 30%.
MFU is probably the best metric, but computing it requires application logic. You can also export metrics at the infra level, like SM efficiency. We explain a bit about how we used it to do some optimization.
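To make the "requires application logic" part concrete, here is a rough sketch of an MFU estimate in the style of nanoGPT's `estimate_mfu` (the PaLM-paper approximation of about 6N + 12·L·H·Q·T FLOPs per token). The function name, argument names, and the peak-FLOPS figure in the usage lines are my own assumptions for illustration; plug in the vendor number for your GPU and dtype.

```python
def estimate_mfu(n_params, n_layer, n_head, head_dim, seq_len,
                 seqs_per_iter, dt, peak_flops):
    """Rough model FLOPs utilization for a decoder-only transformer.

    n_params:      parameter count (N)
    seqs_per_iter: sequences processed per iteration (batch size x grad-accum steps)
    dt:            measured wall-clock seconds per iteration
    peak_flops:    hardware peak FLOP/s (assumption: taken from the vendor spec sheet)
    """
    # Approximate forward+backward FLOPs per token: weights term plus attention term.
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * seq_len
    flops_per_iter = flops_per_token * seq_len * seqs_per_iter
    flops_achieved = flops_per_iter / dt  # FLOP/s actually sustained
    return flops_achieved / peak_flops


# Hypothetical usage: a GPT-2-small-sized model on one GPU.
# 989e12 is an assumed dense BF16 peak for an H100 SXM; check your hardware's spec.
mfu = estimate_mfu(n_params=124e6, n_layer=12, n_head=12, head_dim=64,
                   seq_len=1024, seqs_per_iter=12, dt=0.05, peak_flops=989e12)
print(f"MFU: {mfu * 100:.1f}%")
```

The point is that everything except `dt` and `peak_flops` comes from the model and training config, which is why this has to live in the training script rather than in infra-level monitoring.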
thundergolfer•8mo ago
1. https://github.com/modal-labs/multinode-training-guide/blob/...