We use snapshot-based loading to pull model states from local NVMe RAIDs directly into VRAM. When running benchmarks to compare A100 (PCIe Gen4) vs H100 (PCIe Gen5), we hit a performance cliff on the A100s.
Throughput results (loading 70GB+ snapshots):
| Configuration | A100 (Gen4) | H100 (Gen5) |
|---------------|-------------|-------------|
| 1 GPU Load | ~1.71 GiB/s | ~1.57 GiB/s |
| 2 GPU Load | ~0.22 GiB/s | ~1.33 GiB/s |
| 4 GPU Load | ~0.21 GiB/s | ~2.20 GiB/s |
| 8 GPU Load | ~0.25 GiB/s | ~1.12 GiB/s |
On the A100 setup, as soon as we parallelize the load across 2+ GPUs, random-read throughput collapses to roughly 0.2 GiB/s. The H100 setup degrades far more gracefully: it sustains over 1 GiB/s at every GPU count and peaks at ~2.20 GiB/s with 4 GPUs.
Our working theory is that the PCIe Gen4 links on the A100 host are being saturated by the concurrent DMA traffic and interrupt load from multiple GPUs requesting pages simultaneously. We initially suspected a software lock in our runtime, but the same code scaling cleanly on the H100/Gen5 host suggests a physical bandwidth or interrupt-handling limit rather than a software one.
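One way to take the GPUs out of the picture is to reproduce just the storage side of the load. The sketch below is not our actual loader, just a minimal stand-in: N concurrent readers stream a file and we report aggregate throughput, so you can compare 1/2/4/8 readers on your own Gen4 array against the table above. The demo file size and reader counts are placeholders.

```python
# Minimal stand-in for the multi-GPU snapshot load: N concurrent readers
# streaming the same file, reporting aggregate throughput. NOT our runtime's
# loader -- just a check on whether the storage path alone shows the collapse.
import os
import time
import tempfile
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20  # 1 MiB reads; file reads release the GIL, so threads suffice


def read_all(path: str) -> int:
    """Sequentially read the whole file; return bytes read."""
    total = 0
    with open(path, "rb", buffering=0) as f:
        while chunk := f.read(CHUNK):
            total += len(chunk)
    return total


def bench(path: str, readers: int) -> float:
    """Aggregate GiB/s with `readers` concurrent full-file reads."""
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=readers) as ex:
        total = sum(ex.map(read_all, [path] * readers))
    return total / (time.perf_counter() - t0) / (1 << 30)


if __name__ == "__main__":
    # Demo on a small temp file. On a real run, point `path` at a snapshot
    # shard on the NVMe array and drop the page cache first (e.g. via
    # /proc/sys/vm/drop_caches), or the numbers measure RAM, not the disks.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(os.urandom(64 << 20))
        path = f.name
    try:
        for n in (1, 2, 4):
            print(f"{n} readers: {bench(path, n):.2f} GiB/s")
    finally:
        os.unlink(path)
```

If the aggregate number collapses the same way with plain reads, the problem lives in the storage/PCIe path; if it doesn't, the GPU DMA interaction is back on the suspect list.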
Has anyone else building high-density inference rigs seen this specific degradation on Gen4 NVMe arrays?