I trained Andrej Karpathy’s nanoGPT model on his Shakespeare corpus on four separate 1x GPU nodes (four distinct chip models) from my spot aggregator. I found that for this (very small) training run, the A100 40GB SXM was nearly 2x as cost-efficient as the H100 SXM, the SXM variants beat PCIe, and the consumer chips that are cheap per hour were actually the most expensive per training run of all, because of their long training times.
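(The run itself is just nanoGPT's stock recipe, i.e. python train.py config/train_shakespeare_char.py, assuming you use the character-level Shakespeare config.) The point is that the metric that matters is cost per completed run, not cost per hour. A minimal sketch of that arithmetic, with made-up placeholder prices and times rather than my actual measurements:

    # cost per run = spot $/hr * wall-clock hours to finish training
    # all numbers below are hypothetical placeholders, not real spot prices
    hourly_rate = {"A100 40GB SXM": 1.20, "H100 SXM": 2.50, "RTX 4090": 0.40}   # $/hr
    train_hours = {"A100 40GB SXM": 0.20, "H100 SXM": 0.15, "RTX 4090": 0.90}   # hours
    for gpu, rate in hourly_rate.items():
        print(f"{gpu}: ${rate * train_hours[gpu]:.2f} per run")

With numbers shaped like these, the consumer card is cheapest per hour but priciest per run, which is the same pattern I saw.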