Happy to see this opinion expressed here, too. The more math skeptics there are out there, the longer I get to keep my job. :)
Very good presentation. Projected gradient methods were popular during the convex optimization craze two decades ago (a bare-bones refresher is sketched below), so the ideas advanced here have precedent and seem sensible to me. My concern is whether it helps much. Figure 6b shows a marginal increase in test accuracy, tolerance of much higher learning rates, and a gentler transition into the overfitting regime, suggesting the regularization is working. The higher learning rate did not translate into a speedup, though: "Manifold Muon increased the wall clock time per step compared to AdamW..."
More fundamentally, I am a bit skeptical that minimizing test error is the right goal in LLMs, because statistical learning theory does not adequately model the macro-behavior of very large models.
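For anyone who missed that era, the classical recipe is just "take a gradient step, then project back onto the constraint set." A bare-bones toy in NumPy (my own example, nothing from the post), projecting onto an L2 ball:

```python
import numpy as np

def project_l2_ball(w, r=1.0):
    # Euclidean projection onto {w : ||w||_2 <= r}
    n = np.linalg.norm(w)
    return w if n <= r else w * (r / n)

def pgd(grad_f, w0, lr=0.1, steps=100, r=1.0):
    # projected gradient descent: step in the ambient space, then project back
    w = project_l2_ball(w0, r)
    for _ in range(steps):
        w = project_l2_ball(w - lr * grad_f(w), r)
    return w

# toy quadratic whose unconstrained minimizer lies outside the unit ball;
# PGD converges to the boundary point closest to it
target = np.array([2.0, 0.0])
w_star = pgd(lambda w: w - target, np.zeros(2))
print(w_star)  # approximately [1., 0.]
```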
What is novel about the approach in the blog post? Serious question, I really can't tell after reading the post.
Just wanted to say the blog post design looks super nice. Beautifully laid out, very readable typography, clear graphics, approachable design with a welcoming UX, footnotes in the margin, etc.
Anybody know how this is designed / styled? (I can see three.js being used, along with katex.js, but I don't know more details.)
Thanks
TL;DR: The OP notes that we currently use all sorts of tricks of the trade, including normalization layers, to keep unit values in DNNs from getting too large or too small during training. Keeping unit values in range prevents numerical underflow/overflow and also speeds up learning by keeping update magnitudes small relative to the weights. The OP proposes that we should instead constrain the weights at each layer to lie on sub-manifolds with unit condition number[a], and modify/design SGD algorithms to work well within those manifolds (a toy sketch follows the footnote).
--
[a] https://en.wikipedia.org/wiki/Condition_number
--
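To make that concrete, here is a minimal sketch in NumPy of the general flavor of the idea: take an ordinary gradient step, then retract the weight matrix back onto the set of matrices whose singular values are all 1 (condition number 1). The function names are mine, and this is not the blog post's actual manifold Muon update, just the textbook project/retract pattern it builds on.

```python
import numpy as np

def retract_unit_condition(W):
    # Snap W to the nearest matrix with all singular values = 1,
    # i.e. condition number exactly 1.
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

def projected_sgd_step(W, grad, lr=1e-2):
    W = W - lr * grad                  # ordinary gradient step in ambient space
    return retract_unit_condition(W)   # retract back onto the manifold

# toy usage
rng = np.random.default_rng(0)
W = retract_unit_condition(rng.standard_normal((64, 32)))
grad = rng.standard_normal((64, 32))
W = projected_sgd_step(W, grad)
print(np.linalg.cond(W))  # ~1.0 after the retraction
```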
EDIT: On the other hand, yesterday I saw a paper about doing basically the opposite, letting unit values in DNNs get as big or small as they need to get... by mapping them to complex logarithms and keeping them in that domain: https://openreview.net/forum?id=SUuzb0SOGu . I also found this opposite idea oddly compelling, but again, I don't know how well it works, because it hasn't been tested in real applications.
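For flavor, a toy illustration (mine, not anything from the linked paper) of why a complex-log representation sidesteps overflow: magnitudes live in log space, the sign rides on the imaginary part, and products become sums.

```python
import numpy as np

def to_clog(x):
    # signed real -> complex log: log|x|, plus i*pi if x is negative
    return np.log(abs(x)) + (1j * np.pi if x < 0 else 0)

def from_clog(z):
    return np.exp(z).real

a, b = -1e200, 3e180              # a * b overflows float64 (becomes -inf)
big = to_clog(a) + to_clog(b)     # finite in log space: ~876.1 + i*pi
print(big)
print(from_clog(to_clog(2.0) + to_clog(-4.0)))  # ~ -8.0, sign carried by i*pi
```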
jasonjmcghee•1h ago
Has DAWNBench been done with manifold Muon (with a more appropriate architecture)?
snake_doc•1h ago
jasonjmcghee•20m ago
Here's the top model on DAWNBench - https://github.com/apple/ml-cifar-10-faster/blob/main/fast_c...
Trains for 15 epochs and, like all the others, it is a 9-layer ResNet.
srean•2m ago
In fact, beating SOTA is often the least interesting part of an interesting paper, and SOTA-blind reviewers often use it as a gatekeeping device.
Jackson__•1h ago
pooooooooooooop•19m ago