Monarch is a distributed programming framework for PyTorch based on scalable actor messaging. It provides:
- Remote actors with scalable messaging: Actors are grouped into collections called meshes, and messages can be broadcast to all members.
- Fault tolerance through supervision trees: Actors and processes form a tree, and failures propagate up the tree, providing good default error behavior and enabling fine-grained fault recovery.
- Point-to-point RDMA transfers: cheap registration of any GPU or CPU memory in a process, with one-sided transfers based on libibverbs
- Distributed tensors: actors can work with tensor objects sharded across processes
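To give a flavor of the actor-mesh model described above, here's a rough sketch: the names (`this_host`, `spawn_procs`, `@endpoint`, `.call()`, the `monarch.actor` module path) are my recollection of Monarch's examples and may not match the library exactly, so treat this as illustrative pseudocode rather than the real API:

```python
# Illustrative sketch of Monarch's actor-mesh style.
# NOTE: import path and method names are assumptions from memory of the docs.
from monarch.actor import Actor, endpoint, this_host

class Counter(Actor):
    def __init__(self, start: int):
        self.value = start

    @endpoint
    def increment(self) -> None:
        # Runs on every actor that receives the broadcast message.
        self.value += 1

    @endpoint
    def read(self) -> int:
        return self.value

# Spawn a mesh of processes on this host, then a Counter actor in each one.
procs = this_host().spawn_procs(per_host={"procs": 4})
counters = procs.spawn("counters", Counter, 0)

# A single call fans out to all members of the mesh ("scalable messaging")
# and gathers their results.
counters.increment.call().get()
print(counters.read.call().get())  # e.g. one value per mesh member: [1, 1, 1, 1]
```

The point is that you address the whole mesh as one object instead of looping over individual workers, which is what distinguishes this from a plain per-actor RPC model.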
It seems like the goal of Monarch is to do what Ray does, but more tightly integrated with the Deep Learning/distributed training ecosystem?