https://chatgpt.com/share/685dea79-26ec-8002-bd62-7ed83aedf4...
Interested in lending PyTorch some compute? :)
torchft can handle much larger scales, but for a public multi-day demonstration run this is what we had available. The point of this blog was to demonstrate the correctness of the quorum algorithm and recovery with a stock PyTorch stack, not so much peak FLOPs.
Stay tuned though -- planning on doing some much larger demos on B200s!
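To make "quorum + recovery" concrete, here's a toy, self-contained sketch of the control flow (plain Python, no torchft; every name here is made up for illustration): healthy replicas keep stepping, a failed replica drops out of the quorum, and it later rejoins by copying state from a live peer instead of restarting the whole job.

    import copy

    # Illustrative only: a toy model of per-step quorum and peer recovery.
    # The real coordination happens in torchft's Rust/gRPC service; these
    # classes and functions are stand-ins, not the torchft API.

    class Replica:
        def __init__(self, rid):
            self.rid = rid
            self.step = 0
            self.state = {"weights": 0.0}   # stand-in for a model state_dict
            self.alive = True

        def train_step(self):
            # stand-in for forward/backward + allreduce within the healthy quorum
            self.state["weights"] += 1.0
            self.step += 1

    def quorum(replicas):
        """Only live replicas participate in the next step."""
        return [r for r in replicas if r.alive]

    def recover(joiner, peers):
        """A rejoining replica copies state from a live peer instead of restarting the job."""
        donor = peers[0]
        joiner.state = copy.deepcopy(donor.state)
        joiner.step = donor.step
        joiner.alive = True

    replicas = [Replica(i) for i in range(3)]
    for step in range(10):
        if step == 4:
            replicas[2].alive = False       # simulate a host failure mid-run
        live = quorum(replicas)
        for r in live:
            r.train_step()                  # training continues on the surviving quorum
        if step == 7:
            recover(replicas[2], [r for r in live if r.rid != 2])  # replica rejoins

    print([(r.rid, r.step) for r in replicas])  # all replicas end up at the same step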
These chips are $30m-$100m projects a pop. After the embarrassingly brutal failure of Barefoot nobody is going to do ASICs.
For specific types of failures, check out the "Reliability and Operational Challenges" section of the Llama 3 paper: https://ai.meta.com/research/publications/the-llama-3-herd-o...
d4l3k•7mo ago
I'm the primary author, so happy to answer any questions you might have!
bwfan123•7mo ago
d4l3k•7mo ago
Historically it's been limited to areas like federated learning for low-power/low-network training, but with the massive increase in the number of GPUs it's becoming relevant even for training in datacenters.
It is another variable ML researchers have to tune, so it does add some complexity, and I expect most folks just aren't familiar with it yet.
On "typed language": all of torchft is typed! The coordination/quorum layers are written in Rust with gRPC, and the front-end is typed Python (checked with Pyre), since it has to interact with PyTorch and model code.
bwfan123•7mo ago
[1] https://github.com/pytorch-labs/monarch/issues/175#issuecomm...
d4l3k•7mo ago