Also, it has RDMA. Last I checked, Ray did not support RDMA.
There are probably other differences as well, but the lack of RDMA immediately splits the world into things you can do with Ray and things you cannot.
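To make the distinction concrete, here is a minimal sketch of the shape of a one-sided RDMA read: a reader pulls bytes from a registered remote buffer without the owner's CPU running any handler per transfer. All names here are hypothetical illustrations, not any real Monarch or Ray API.

```python
# Illustrative sketch only: mimics the shape of a one-sided RDMA read.
# In a real RDMA setup the buffer would be pinned memory registered with
# the NIC, and the read would go NIC-to-NIC, bypassing the owner's CPU.

class RegisteredBuffer:
    """Stands in for a pinned, RDMA-registered memory region."""

    def __init__(self, data: bytearray):
        self.data = data  # real RDMA: pinned memory exposed via an rkey

    def read(self, offset: int, length: int) -> bytes:
        # One-sided read: the owning process runs no handler code here.
        return bytes(self.data[offset:offset + length])

buf = RegisteredBuffer(bytearray(b"gradient-shard-0"))
chunk = buf.read(0, 8)
print(chunk)  # b'gradient'
```

The contrast with a message-passing system is that the owner never has to schedule a task or copy data to serve the read, which is what makes RDMA attractive for moving large tensors.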
https://pytorch.org/blog/pytorch-foundation-welcomes-ray-to-...
As far as things that might be a performance loss here, one thing I'm wondering is whether custom kernels are supported. I'm also wondering how much fine-grained control there is over communication between different actors calling a function. Overall, I really like this project and hope to see it used over multi-controller setups.
Yeah, you might end up needing some changes to remote worker initialization, but you can generally bake in whatever kernels and other system code you need.
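A sketch of what "baking in" custom kernels at worker initialization could look like. The actor and registry names below are hypothetical, not Monarch's real API; in practice the setup step might call `torch.utils.cpp_extension.load` or load a prebuilt `.so` instead of the plain-Python stand-in used here.

```python
# Hedged sketch: custom kernels loaded once during actor setup, so every
# remote worker has them available before its first training step.
# KERNELS / register_kernel / TrainerActor are illustrative names only.

KERNELS = {}

def register_kernel(name):
    """Decorator standing in for compiling/loading a custom kernel."""
    def deco(fn):
        KERNELS[name] = fn
        return fn
    return deco

class TrainerActor:
    def setup(self):
        # Runs once per worker at spawn time, analogous to remote worker
        # initialization hooks; a real version would JIT-compile CUDA here.
        @register_kernel("fused_add_relu")
        def fused_add_relu(x, y):
            # placeholder for a real fused CUDA kernel
            return [max(a + b, 0) for a, b in zip(x, y)]

    def step(self, x, y):
        return KERNELS["fused_add_relu"](x, y)

actor = TrainerActor()
actor.setup()
print(actor.step([1, -5], [2, 1]))  # [3, 0]
```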
In case someone who can fix this is reading here:
Found a few typos. The em dash makes me suspect an LLM was involved in proofreading.
- Is this similar to Open MPI?
- How is a mesh established? Do they need to be on the same host?
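On the mesh question, the general idea (sketched below with hypothetical names, not Monarch's actual API) is that a mesh is a logical n-dimensional grid of processes addressed by coordinates like (host, gpu), so it is explicitly designed to span multiple hosts rather than being limited to one machine.

```python
# Illustrative sketch only: a "mesh" as a logical grid of processes.
# ProcMesh and slice() are hypothetical names; the real runtime would
# resolve each coordinate to a live process reachable over the network.

from itertools import product

class ProcMesh:
    def __init__(self, hosts: int, gpus_per_host: int):
        self.shape = (hosts, gpus_per_host)
        # each coordinate maps to a process; hosts need not be the same machine
        self.procs = {(h, g): f"host{h}/gpu{g}"
                      for h, g in product(range(hosts), range(gpus_per_host))}

    def host_slice(self, host: int):
        """Address a sub-mesh, e.g., all GPUs on one host."""
        return [p for (h, _), p in sorted(self.procs.items()) if h == host]

mesh = ProcMesh(hosts=2, gpus_per_host=4)
print(mesh.shape)         # (2, 4)
print(mesh.host_slice(1))  # ['host1/gpu0', 'host1/gpu1', 'host1/gpu2', 'host1/gpu3']
```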
pjmlp•3h ago
> Monarch is split into a Python-based frontend, and a backend implemented in Rust.
Other than that, it looks like quite an interesting project.
galangalalgol•2h ago
gaogao•1h ago
dhrt12327•1h ago
It's a pity they don't do a complete rewrite with a functional language as the driver.
gaogao•1h ago
It's open source, so seeing such an extension would be quite cool. There's much that could be done with native Rust actors and code that might get at what you want, but nothing precludes mixing PyTorch and other backends.
For example, you could wrap a C++ inference engine as part of one of the actors generating data for other actors doing distributed training.
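A minimal sketch of that pattern: one actor wraps an (imagined) C++ inference engine and streams generated samples to a training actor. The engine binding is faked with a plain function; a real setup might use pybind11 bindings or a subprocess. All names here are illustrative, not any real API.

```python
# Hedged sketch: a generator actor wrapping a C++ inference engine feeds
# a trainer actor through a queue. cpp_engine_generate stands in for a
# pybind11-bound C++ call; threads stand in for remote actors.

import queue
import threading

def cpp_engine_generate(prompt: str) -> str:
    # stand-in for a C++ inference engine call
    return f"completion-for:{prompt}"

def generator_actor(prompts, out_q):
    for p in prompts:
        out_q.put(cpp_engine_generate(p))
    out_q.put(None)  # sentinel: generation finished

def trainer_actor(in_q):
    seen = []
    while (item := in_q.get()) is not None:
        seen.append(item)  # a real trainer would run a training step here
    return seen

q = queue.Queue()
t = threading.Thread(target=generator_actor, args=(["a", "b"], q))
t.start()
batch = trainer_actor(q)
t.join()
print(batch)  # ['completion-for:a', 'completion-for:b']
```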
pjmlp•1h ago