Happy to see this opinion expressed here, too. The more math skeptics there are out there, the longer I get to keep my job. :)
Very good presentation. Projected gradient methods were popular during the convex optimization craze two decades ago (a bare-bones refresher is sketched below), so the ideas advanced here have precedent and seem sensible to me. My concern is whether it helps much. Figure 6b shows a marginal increase in test accuracy, tolerance of much higher learning rates, and a gentler transition into the overfitting regime, suggesting the regularization is working. The higher learning rate did not translate into a speedup, though: "Manifold Muon increased the wall clock time per step compared to AdamW..."
More fundamentally, I am a bit skeptical that minimizing test error is the right goal in LLMs, because statistical learning theory does not adequately model the macro-behavior of very large models.
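For anyone who missed that era, the classical recipe is just "take a gradient step, then project back onto the constraint set." A bare-bones toy in NumPy (my own example, nothing from the post), projecting onto an L2 ball:

```python
import numpy as np

def project_l2_ball(w, r=1.0):
    # Euclidean projection onto {w : ||w||_2 <= r}
    n = np.linalg.norm(w)
    return w if n <= r else w * (r / n)

def pgd(grad_f, w0, lr=0.1, steps=100, r=1.0):
    # projected gradient descent: step in the ambient space, then project back
    w = project_l2_ball(w0, r)
    for _ in range(steps):
        w = project_l2_ball(w - lr * grad_f(w), r)
    return w

# toy quadratic whose unconstrained minimizer lies outside the unit ball;
# PGD converges to the boundary point closest to it
target = np.array([2.0, 0.0])
w_star = pgd(lambda w: w - target, np.zeros(2))
print(w_star)  # approximately [1., 0.]
```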
What is novel about the approach in the blog post? Serious question, I really can't tell after reading the post.
Just wanted to say the blog post design looks super nice. Beautifully laid out, very readable typography, clear graphics, approachable design with a welcoming UX, footnotes in the margin, etc.
Anybody know how this is designed / styled? (I can see three.js being used, along with katex.js, but I don't know more details.)
Thanks
TL;DR: The OP notes that we currently use all sorts of tricks of the trade, including normalization layers, to keep unit values in DNNs from getting too large or too small during training. Keeping unit values in range prevents numerical underflow/overflow and also speeds up learning by keeping update magnitudes small relative to the weights. The OP proposes that we should instead constrain the weights at each layer to lie on sub-manifolds with unit condition number[a], and modify/design SGD algorithms to work well within those manifolds (a toy sketch follows the footnote).
--
[a] https://en.wikipedia.org/wiki/Condition_number
--
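To make that concrete, here is a minimal sketch in NumPy of the general flavor of the idea: take an ordinary gradient step, then retract the weight matrix back onto the set of matrices whose singular values are all 1 (condition number 1). The function names are mine, and this is not the blog post's actual manifold Muon update, just the textbook project/retract pattern it builds on.

```python
import numpy as np

def retract_unit_condition(W):
    # Snap W to the nearest matrix with all singular values = 1,
    # i.e. condition number exactly 1.
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

def projected_sgd_step(W, grad, lr=1e-2):
    W = W - lr * grad                  # ordinary gradient step in ambient space
    return retract_unit_condition(W)   # retract back onto the manifold

# toy usage
rng = np.random.default_rng(0)
W = retract_unit_condition(rng.standard_normal((64, 32)))
grad = rng.standard_normal((64, 32))
W = projected_sgd_step(W, grad)
print(np.linalg.cond(W))  # ~1.0 after the retraction
```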
EDIT: On the other hand, yesterday I saw a paper about doing basically the opposite, letting unit values in DNNs get as big or small as they need to get... by mapping them to complex logarithms and keeping them in that domain: https://openreview.net/forum?id=SUuzb0SOGu . I also found this opposite idea oddly compelling, but again, I don't know how well it works, because it hasn't been tested in real applications.
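For flavor, a toy illustration (mine, not anything from the linked paper) of why a complex-log representation sidesteps overflow: magnitudes live in log space, the sign rides on the imaginary part, and products become sums.

```python
import numpy as np

def to_clog(x):
    # signed real -> complex log: log|x|, plus i*pi if x is negative
    return np.log(abs(x)) + (1j * np.pi if x < 0 else 0)

def from_clog(z):
    return np.exp(z).real

a, b = -1e200, 3e180              # a * b overflows float64 (becomes -inf)
big = to_clog(a) + to_clog(b)     # finite in log space: ~876.1 + i*pi
print(big)
print(from_clog(to_clog(2.0) + to_clog(-4.0)))  # ~ -8.0, sign carried by i*pi
```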
jasonjmcghee•1h ago
Has DAWNBench been done with manifold Muon (with a more appropriate architecture)?
snake_doc•1h ago
jasonjmcghee•20m ago
Here's the top model on DAWNBench - https://github.com/apple/ml-cifar-10-faster/blob/main/fast_c...
Trains for 15 epochs and, like all the others, it is a 9-layer ResNet.
srean•2m ago
In fact, beating SOTA is often the least interesting part of an interesting paper, and SOTA-blind reviewers often use it as a gatekeeping device.
Jackson__•1h ago
pooooooooooooop•19m ago