
Can You Draw Every Flag in PowerPoint? (Part 2) [video]

https://www.youtube.com/watch?v=BztF7MODsKI
1•fgclue•5m ago•0 comments

Show HN: MCP-baepsae – MCP server for iOS Simulator automation

https://github.com/oozoofrog/mcp-baepsae
1•oozoofrog•8m ago•0 comments

Make Trust Irrelevant: A Gamer's Take on Agentic AI Safety

https://github.com/Deso-PK/make-trust-irrelevant
2•DesoPK•12m ago•0 comments

Show HN: Sem – Semantic diffs and patches for Git

https://ataraxy-labs.github.io/sem/
1•rs545837•14m ago•1 comments

Hello world does not compile

https://github.com/anthropics/claudes-c-compiler/issues/1
2•mfiguiere•20m ago•0 comments

Show HN: ZigZag – A Bubble Tea-Inspired TUI Framework for Zig

https://github.com/meszmate/zigzag
2•meszmate•22m ago•0 comments

Metaphor+Metonymy: "To love that well which thou must leave ere long" (Sonnet 73)

https://www.huckgutman.com/blog-1/shakespeare-sonnet-73
1•gsf_emergency_6•24m ago•0 comments

Show HN: Django N+1 Queries Checker

https://github.com/richardhapb/django-check
1•richardhapb•39m ago•1 comments

Emacs-tramp-RPC: High-performance TRAMP back end using JSON-RPC instead of shell

https://github.com/ArthurHeymans/emacs-tramp-rpc
1•todsacerdoti•44m ago•0 comments

Protocol Validation with Affine MPST in Rust

https://hibanaworks.dev
1•o8vm•48m ago•1 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
2•gmays•49m ago•0 comments

Show HN: Zest – A hands-on simulator for Staff+ system design scenarios

https://staff-engineering-simulator-880284904082.us-west1.run.app/
1•chanip0114•50m ago•1 comments

Show HN: DeSync – Decentralized Economic Realm with Blockchain-Based Governance

https://github.com/MelzLabs/DeSync
1•0xUnavailable•55m ago•0 comments

Automatic Programming Returns

https://cyber-omelette.com/posts/the-abstraction-rises.html
1•benrules2•58m ago•1 comments

Why Are There Still So Many Jobs? The History and Future of Workplace Automation [pdf]

https://economics.mit.edu/sites/default/files/inline-files/Why%20Are%20there%20Still%20So%20Many%...
2•oidar•1h ago•0 comments

The Search Engine Map

https://www.searchenginemap.com
1•cratermoon•1h ago•0 comments

Show HN: Souls.directory – SOUL.md templates for AI agent personalities

https://souls.directory
1•thedaviddias•1h ago•0 comments

Real-Time ETL for Enterprise-Grade Data Integration

https://tabsdata.com
1•teleforce•1h ago•0 comments

Economics Puzzle Leads to a New Understanding of a Fundamental Law of Physics

https://www.caltech.edu/about/news/economics-puzzle-leads-to-a-new-understanding-of-a-fundamental...
3•geox•1h ago•1 comments

Switzerland's Extraordinary Medieval Library

https://www.bbc.com/travel/article/20260202-inside-switzerlands-extraordinary-medieval-library
2•bookmtn•1h ago•0 comments

A new comet was just discovered. Will it be visible in broad daylight?

https://phys.org/news/2026-02-comet-visible-broad-daylight.html
4•bookmtn•1h ago•0 comments

ESR: Comes the news that Anthropic has vibecoded a C compiler

https://twitter.com/esrtweet/status/2019562859978539342
2•tjr•1h ago•0 comments

Frisco residents divided over H-1B visas, 'Indian takeover' at council meeting

https://www.dallasnews.com/news/politics/2026/02/04/frisco-residents-divided-over-h-1b-visas-indi...
4•alephnerd•1h ago•5 comments

If CNN Covered Star Wars

https://www.youtube.com/watch?v=vArJg_SU4Lc
1•keepamovin•1h ago•1 comments

Show HN: I built the first tool to configure VPSs without commands

https://the-ultimate-tool-for-configuring-vps.wiar8.com/
2•Wiar8•1h ago•3 comments

AI agents from 4 labs predicting the Super Bowl via prediction market

https://agoramarket.ai/
1•kevinswint•1h ago•1 comments

EU bans infinite scroll and autoplay in TikTok case

https://twitter.com/HennaVirkkunen/status/2019730270279356658
7•miohtama•1h ago•5 comments

Benchmarking how well LLMs can play FizzBuzz

https://huggingface.co/spaces/venkatasg/fizzbuzz-bench
1•_venkatasg•1h ago•1 comments

Why I Joined OpenAI

https://www.brendangregg.com/blog/2026-02-07/why-i-joined-openai.html
35•SerCe•1h ago•31 comments

Octave GTM MCP Server

https://docs.octavehq.com/mcp/overview
1•connor11528•1h ago•0 comments

Modular Manifolds

https://thinkingmachines.ai/blog/modular-manifolds/
161•babelfish•4mo ago

Comments

jasonjmcghee•4mo ago
The learning rates they demonstrate are crazy - though the standard when talking about CIFAR-10 is 94% accuracy iirc. Showing ~60% accuracy is weird.

Has DAWNBench been done with manifold Muon (with a more appropriate architecture)?

snake_doc•4mo ago
Um.. the model is tiny: https://github.com/thinking-machines-lab/manifolds/blob/main...
jasonjmcghee•4mo ago
Yeah, it's just the wrong architecture for the job, so I found it to be a strange example.

Here's the top model on DAWNBench - https://github.com/apple/ml-cifar-10-faster/blob/main/fast_c...

Trains for 15 epochs and, like all the others, is a 9-layer ResNet.
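
For anyone who hasn't seen these DAWNBench-style models, here's a rough sketch of the shape of such a 9-layer ResNet (widths and details are my recollection of the common recipe, not the exact code in the linked repo):

  # Sketch of a DAWNBench-style "ResNet9" for CIFAR-10 (PyTorch).
  # Layer widths and pooling follow the usual recipe from memory.
  import torch.nn as nn

  def conv_block(c_in, c_out, pool=False):
      layers = [nn.Conv2d(c_in, c_out, 3, padding=1, bias=False),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True)]
      if pool:
          layers.append(nn.MaxPool2d(2))
      return nn.Sequential(*layers)

  class Residual(nn.Module):
      def __init__(self, c):
          super().__init__()
          self.block = nn.Sequential(conv_block(c, c), conv_block(c, c))
      def forward(self, x):
          return x + self.block(x)

  class ResNet9(nn.Module):
      # 9 weight layers: 5 conv blocks + 2 residual blocks (2 convs each) + 1 linear
      def __init__(self, num_classes=10):
          super().__init__()
          self.net = nn.Sequential(
              conv_block(3, 64),                # prep
              conv_block(64, 128, pool=True),   # layer 1
              Residual(128),
              conv_block(128, 256, pool=True),  # layer 2
              conv_block(256, 512, pool=True),  # layer 3
              Residual(512),
              nn.AdaptiveMaxPool2d(1),
              nn.Flatten(),
              nn.Linear(512, num_classes),
          )
      def forward(self, x):
          return self.net(x)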

srean•4mo ago
Usually there's more to an ML or data-science idea (one that isn't a fully fledged-out journal paper) than beating a SOTA benchmark.

In fact, beating SOTA is often the least interesting part of an interesting paper, and SOTA-blind reviewers often use it as a gatekeeping device.

jasonjmcghee•4mo ago
Sure, of course. I wasn't suggesting "are you beating a SOTA benchmark?" I'm floating the idea of an ablation that matches a realistic scenario for the dataset/task. Personally, I'm curious how manifold Muon performs compared to AdamW in a thoroughly explored context. This is the first time I've seen a 3-layer MLP on CIFAR-10.

I probably should have made the 9-layer ResNet part more front-and-center / central to my point.

srean•4mo ago
Got you, this time.
Jackson__•4mo ago
They say they train for ~3 epochs. Could it be that's just not long enough of a training run? I have no idea how many epochs are usually used in those models.
pooooooooooooop•4mo ago
It's a 3-layer MLP, as stated in the article.
snake_doc•4mo ago
Hmmm… http://www.incompleteideas.net/IncIdeas/BitterLesson.html
whimsicalism•4mo ago
this is a bad example to claim the bitter lesson applies to; it's about the fundamentals of optimization techniques, not about tying the solution space to hand-crafted things.
snake_doc•4mo ago
Aren’t they all optimization techniques at the end of the day? Now you’re just debating semantics
whimsicalism•4mo ago
believe what you want, i guess
ACCount37•4mo ago
Doesn't apply as long as the improvements obtained there scale with compute.

Now, are there actual meaningful improvements to obtain, and do they stick around all the way to frontier runs? Unclear, really. So far, it looks like opening a can of hyperparameters.

TimorousBestie•4mo ago
Reminiscing about an old HN comment arguing that differential geometry was irrelevant to machine learning with a smile on my face.

Happy to see this opinion expressed here, too. The more math skeptics there are out there, the longer I get to keep my job. :)

deviation•4mo ago
The world is full of useful shapes! No reason that math shouldn't :)
srean•4mo ago
"I have never had to do integrate the "arctan" function by hand in my entire career" arguments are not worth engaging with.

If people are happy with a job or a role that does not need math that' fine.

Familiarity with Maths let's you to rise to the occasion, to become more than a replaceable cog.

The thing is, unless you are trained in math you wouldn't even recognise the opportunity, that a certain kind Of Math could have been used here. In fact, even if you are trained in Math you may not see it till much later -- it needs a special eye and something in that moment.

Polyhedrons were looked at for centuries after centuries by top-notch mathematicians. All missed Euler's formula, except perhaps Descartes.

Often what happens is some nontrivial branch of mathematics suddenly finds a novel and impactful application. Then crowds jump in to learn that Math. But it's mostly already a little too late for them, they have missed this bus.

The best case is one already knows the Math beforehand and you don't know which part will be handy. It helps if you love the subject and can afford to invest time to learn it for the love of the subject. Once in a while you happen to find yourself in the right place and the right time and with the right tools you need.

gowld•4mo ago
> Often what happens is some nontrivial branch of mathematics suddenly finds a novel and impactful application. Then crowds jump in to learn that math. But by then it's mostly already a little too late for them; they have missed the bus.

However, in the meantime, the experts in that math have "missed the bus" on the application area itself, which they know too little about because they were studying math instead.

esafak•4mo ago
> This post covers one appealing way to constrain the weight matrices of a neural network—by keeping the tensors constrained to submanifolds at each layer. This opens the door to re-thinking optimization, as we can co-design optimization algorithms with these manifold constraints. As an example, we propose a manifold version of the Muon optimizer whose weights are constrained to the Stiefel manifold: the manifold of matrices with unit condition number. We conclude the post by defining the idea of a modular manifold, which is a composable manifold that attempts to make it easier to scale up and train large networks.

Very good presentation. Projected gradient methods were popular during the convex optimization craze two decades ago. The ideas advanced here have precedent and seem sensible to me. My concern is whether it helps much. The test accuracy in figure 6b shows a marginal increase, and a gentler transition to the overfitting regime, suggesting the regularization is working. The higher LR did not translate to a speed up: "Manifold Muon increased the wall clock time per step compared to AdamW..."
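
For readers who haven't seen those projected gradient methods, the classical recipe looks roughly like this toy NumPy sketch (retract onto the Stiefel manifold after every step; this is not the manifold Muon algorithm from the post):

  # Toy projected gradient descent that keeps W on the Stiefel manifold
  # (all singular values equal to 1, i.e. unit condition number).
  import numpy as np

  def retract_stiefel(W):
      # Polar retraction: snap W to the nearest matrix with orthonormal columns.
      U, _, Vt = np.linalg.svd(W, full_matrices=False)
      return U @ Vt

  def projected_gradient_step(W, grad, lr=0.1):
      return retract_stiefel(W - lr * grad)

  # Keep a random 64x32 weight matrix on the manifold while "training".
  rng = np.random.default_rng(0)
  W = retract_stiefel(rng.standard_normal((64, 32)))
  for _ in range(10):
      grad = rng.standard_normal(W.shape)  # stand-in for a real gradient
      W = projected_gradient_step(W, grad)
  print(np.linalg.cond(W))  # ~1.0, up to floating point error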

More fundamentally, I am a bit skeptical that low test accuracy is the right goal in LLMs because statistical learning theory does not adequately model the macro-behavior of very large models.

namibj•4mo ago
> The test accuracy in figure 6b shows a marginal increase, and a gentler transition to the overfitting regime, suggesting the regularization is working.

Sounds like it might help for online RL training regimes, as those are naturally quite vulnerable to overfitting.

jpt4•4mo ago
> statistical learning theory does not adequately model the macro-behavior of very large models.

Might you please elaborate on this? I recognize that "artificial neural networks are lossy de/compression algorithms" does not enumerate the nuances of these structures, but am curious whether anything in particular is both interesting and missing from SLT.

esafak•4mo ago
SLT typically uses empirical risk minimization, leading to the bias-variance decomposition and a unimodal extremum as the monotonically decreasing bias supposedly balances against the monotonically increasing variance. We now know this does not accurately model overparameterized models, which exhibit double descent, and other phenomena like grokking. To explain them you have to look past classical statistics to statistical mechanics.
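
For reference, the decomposition in question (squared loss; f the true function, f-hat the fitted predictor, sigma^2 the irreducible noise; expectations taken over draws of the training set):

  \mathbb{E}\bigl[(y - \hat{f}(x))^2\bigr]
    = \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^2}_{\text{bias}^2}
    + \underbrace{\mathbb{E}\bigl[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\bigr]}_{\text{variance}}
    + \underbrace{\sigma^2}_{\text{noise}}

The classical picture has the bias term falling and the variance term rising as capacity grows, giving a single best model size; double descent is the observation that test error can fall again once the model is large enough to interpolate the training data.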
p1esk•4mo ago
> The test accuracy in figure 6b shows a marginal increase, and a gentler transition to the overfitting regime, suggesting the regularization is working.

Higher LR does not mean there’s overfitting.

uoaei•4mo ago
This is exactly the kind of out-of-the-box thinking that will get us past some of the limitations of the current crop of AI architectures. Bravo to the authors.
SubiculumCode•4mo ago
Curious why the authors chose the blog format over a research report?
almostgotcaught•4mo ago
you mean a paper? because it's not paper quality content?
sudohalt•4mo ago
Exactly, it’s like they’re targeting people who don’t really know much about ML but are easily wowed by fancy math jargon and nice drawings.
SubiculumCode•4mo ago
Which was my roundabout way of asking :)
pooooooooooooop•4mo ago
thinkingmachines likes to flex
fmap•4mo ago
Isn't this an old idea? E.g., here's a textbook on optimization algorithms for matrix manifolds: https://press.princeton.edu/absil and here's a library that implements this in Python for the Stiefel manifold that's the subject of this blog post: https://pymanopt.org/docs/stable/manifolds.html#module-pyman...

What is novel about the approach in the blog post? Serious question, I really can't tell after reading the post.

cs702•4mo ago
I don't think it's been tried at scale, with large models.

It remains to be seen if it works better than conventional training schemes.

godelski•4mo ago

  > Isn't this an old idea?
So are neural networks. So is attention.

What's your point? Sometimes things need to be retried. Sometimes there are small, subtle details that make or break an idea. So what's the point of acting dismissively? If an old idea that didn't work now works, then what's the problem? Besides, progress is typically iterative, not made in leaps and bounds. The vast majority of things that look like leaps only look that way because we don't see the steps in between.

The reason I'm saying this is that that sentiment is often used to pass over working solutions, which slows down their progress. So even if it's unintentional, it should cause us to rethink how we respond. Otherwise we end up with silly claims like "Einstein just used tensors" and "Nash just used topology". In some sense these are accurate, but they are too high-level as descriptions (and these are real dismissals. Which, again, so what? If it works, it works).

Why is "novelty" even a question? Novelty is only ever in the eyes of the beholder.

  > What is novel about the approach in the blog post? Serious question, I really can't tell after reading the post.
Honestly, I do not know, but I'll give you my best read on it.

1) Scale: Don't underestimate the importance of this. While I don't think scale is all you need, it certainly is a critical factor.

2) Different optimization: I may be missing something, but it looks like they are using a different optimizer. They mention that they're using the Muon optimizer constrained to a Stiefel manifold. Neither of those things is unique on its own, but is their combination? This is where I'm uncertain, because such a thing would be easy to miss. Maybe someone did it and was unsuccessful. Maybe someone did it, but not at scale. Maybe someone did it, it worked, and just nobody noticed (that happens a lot!).

So I think this is quite similar to how 99% of progress and breakthroughs are made: putting together ideas that seem unrelated and inventing some glue to generalize the process. At a high level this always looks like you're just putting existing things together, but that glue is really hard to make. And to continue that analogy, if we do a good enough job gluing things together then to anyone but an expert it'll look like there is no glue. It can be surprisingly difficult to tell if something is glued, welded, mated, milled, printed, or whatever. It usually takes a very keen eye to determine the answer non-destructively.

fmap•4mo ago
Apologies if this came across the wrong way. I really do want to know what the novel contributions of the post are, because the author implies that something about what they're doing is solving previously open problems:

> I figured out how to solve manifold Muon in the square case late last year, but I was unable to solve the full rectangular case and thus posed the problem as an open problem on the Modula docs. Jianlin Su solved the problem this summer

It sounds like the generalisation of projected gradient descent to "Muon" is what they're focusing on, but the derivation is all about the retraction map on the Stiefel manifold? I think I'm missing some background here.

godelski•4mo ago

  > Apologies if this came across the wrong way
I was uncertain, but your other statements made me think that sentiment was unintentional. I just want to push back against it because it is too common and misused, even with good intentions. I hope you don't see this as me saying anything about your character. Honestly, my impression is that you do care.

  > It sounds like the generalisation of projected gradient descent to "Muon"
I'm not a niche expert here, but do have knowledge in adjacent/overlapping domains. It sounds like you're in a similar boat? I ask because this pulls back to what I was trying to say about sometimes needing an expert eye.

If it helps, here's the "paper" for the Muon optimizer[0] and here's a follow-up[1]. Muon is definitely a gradient descent technique, but so are Adam, SGD, Ada, and many more[2].

The big thing in Muon is the use of NewtonSchulz5. So you update parameters with θ_t = θ_{t-1} - η[NS_5(μB_{t-1} + ∇L(θ_{t-1}))] (I bracketed it so you can see that this is just a specific version of θ_t = θ_{t-1} - ηF(∇L(θ_{t-1}), ...), and the standard gradient descent update -- θ - η∇L(θ) -- is in that same class of functions, right?). So we should be careful not to overgeneralize and say that this is just gradient descent. You could even say [1] is "just [0] but with weight decay" (or go look at the Adam and AdamW algos ;)
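
To make that concrete, here's a minimal sketch of the idea: a textbook cubic Newton-Schulz iteration standing in for Muon's tuned higher-order one, plus a bare-bones momentum step. Illustration only, not the reference implementation from [0]:

  # Orthogonalize the momentum-smoothed gradient with Newton-Schulz, then step.
  # Real Muon uses a tuned 5th-order iteration and applies only to 2D weight
  # matrices; this is the classical cubic version, for illustration.
  import torch

  def newton_schulz(G, steps=5, eps=1e-7):
      # Iterates toward U V^T from the SVD G = U S V^T (singular values -> 1).
      X = G / (G.norm() + eps)  # scale so all singular values are <= 1
      for _ in range(steps):
          X = 1.5 * X - 0.5 * X @ X.T @ X
      return X

  @torch.no_grad()
  def muon_like_step(W, grad, buf, lr=0.02, mu=0.95):
      buf.mul_(mu).add_(grad)                # B_t = mu * B_{t-1} + grad
      W.add_(newton_schulz(buf), alpha=-lr)  # theta_t = theta_{t-1} - lr * NS_k(B_t)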

But one thing I should add is that gradient descent algorithms aren't topologically aware. I was able to find this post[3], which asks a related question: what are the conditions for a surface's geodesic to align with gradient descent (note that Newton's method differs from GD too)? I don't think this paper is creating a solution where the GD formulation results in following a geodesic to the minimum, but my take is that it is working in that direction. And to clarify, we'd want to follow the geodesic because that gives us the shortest or most energy-efficient path (whichever perspective you want to use). In optimization we want to try to accomplish these two things (and more!): 1) take the "best" path to the optimum, 2) find the best optimum. Unfortunately these are ill-defined and there aren't always objective answers to them. But in an ideal gradient descent algorithm we'd want it to go to the global minimum and take the fastest path, right? So with that, it helps to be aware of the geometry (part of why people look at the Hessian, but that comes at the cost of increased computation, even if the additional information can get us there in fewer steps -- so that's not (always) "the best").

I know this isn't a full answer, and maybe with more reading I'll have a better one for you. But I'm hoping my answer can at least help you see some of the underlying nuanced problems that (_I think_) the authors are trying to get at. Hopefully I'm not too far off base lol. I'm hoping someone with more expertise can jump in and provide corrections/clarifications in the meantime.

[0] https://kellerjordan.github.io/posts/muon/

[1] https://arxiv.org/abs/2502.16982

[2] (far from a complete list) https://docs.pytorch.org/docs/stable/optim.html#algorithms

[3] (I think similar types of questions may also be fruitful) https://mathoverflow.net/questions/42617/functions-whose-gra...

aanet•4mo ago
Not here to comment on the _content_ of the blog post...

Just wanted to say the blog post design looks super nice. Beautifully laid out, very readable typography, clear graphics, approachable design with a welcoming UX, footnotes in the margin, etc.

Anybody know how this is designed / styled? (I can see three.js being used, along with katex.js - but don't know more details)

Thanks

ddellacosta•4mo ago
UX on the other hand...I hate it when sites hijack my key commands for moving backwards and forwards in my browser history. Please don't do this!
manas96•4mo ago
I think the diagrams look very similar to what Keenan Crane uses in his papers; perhaps they used that tool. I think his students have now fleshed it out for general use.
spyder•4mo ago
For me it's horrible: some scripts make the scroll very choppy, unusable... I had to disable scripts just to be able to scroll normally :-(
cs702•4mo ago
TL;DR: The OP notes that we currently use all sorts of tricks of the trade, including applying normalization layers, to keep unit values in DNNs from getting too large or too small when we train them. Keeping unit values from getting too large or small prevents numerical underflow/overflow, and also helps speed up learning by keeping the magnitudes of updates small in relation to weights. The OP proposes that we should constrain weights to be in sub-manifolds with unit condition number[a] at each layer, and that we should modify/design SGD algorithms to work well within those manifolds.

I find the idea compelling, but it's too early to know if it will work well at scale, you know, with large models, in the real world.

--

[a] https://en.wikipedia.org/wiki/Condition_number
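
In symbols, for a weight matrix W with singular values σ_1 ≥ … ≥ σ_n > 0:

  \kappa(W) = \frac{\sigma_1}{\sigma_n},
  \qquad
  \kappa(W) = 1 \iff \sigma_1 = \dots = \sigma_n ,

so, after fixing the scale so every singular value equals 1, the unit-condition-number constraint is exactly W^T W = I: the Stiefel manifold the post constrains each layer to.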

--

EDIT: On the other hand, yesterday I saw a paper about doing basically the opposite, letting unit values in DNNs get as big or small as they need to get... by mapping them to complex logarithms and keeping them in that domain: https://openreview.net/forum?id=SUuzb0SOGu . I also found this opposing idea oddly compelling, but I don't know how well it works either, because it hasn't been tested at scale.

robots0only•4mo ago
so their way to differentiate from the frontier labs is to write research blog posts (not papers). It will be interesting to see how this plays out. I don't think that anyone serious about developing frontier models would be putting anything useful out there for others. We already see this with all the incumbents -- Google, OAI, Anthropic, xAI, DeepSeek, and the other Chinese labs.
sudohalt•4mo ago
Because it’s not research quality. The only people excited by this are people who don’t know anything about actual ML, and think this is amazing.
lijok•4mo ago
Why is it not research quality? What’s missing?
aghilmort•4mo ago
Interesting. Modular manifolds are precisely what hypertokens use for prompt compiling.

Specifically, we linearize the emergent KVQ operations of an arbitrary prompt in any arbitrary model by way of interleaving error-correcting code (ECC).

ECC tokens are out-of-band tokens, e.g., from Unicode's Private Use Area (PUA), interleaved with raw context tokens. This construction induces an in-context associative memory.

Any sort of interleaved labeling basis, e.g., A1, quick brown fox, A2, jumped lazy dog, induces a similar effect for chaining recall & reasoning more reliably.

This trick works because PUA tokens are generally untrained, hence their initial embedding is still random Gaussian w.h.p. Similar effects can be achieved by simply using token combos unlikely to exist, and these are often more effective in practice, since PUA tokens like emojis or Mandarin characters are often 2, 3, or 4 tokens after tokenization vs. codeword combos like zy-qu-qwerty every k content tokens, where k can be variable.

Building attention architectures using modular manifolds in white/gray-box models, as this new work shows, vs. prompt-based black-box injection is a natural next step, and so we can at least anecdotally validate what they're building ahead of the next paper or two.

Which is all to say, absolutely great to see others building in this way!

glowcoil•4mo ago
The original article discusses techniques for constraining the weights of a neural network to a submanifold of weight space during training. Your comment discusses interleaving the tokens of an LLM prompt with Unicode PUA code points. These are two almost completely unrelated things, so it is very confusing to me that you are confidently asserting that they are the same thing. Can you please elaborate on why you think there is any connection at all between your comment and the original article?
aghilmort•4mo ago
Our ECC construction induces an emergent modular manifold during KVQ computation.

Suppose we use 3 codeword lanes per codeword, which is our default. Each lane of tokens is based on some prime p, so collectively they form a CRT-driven codeword (Chinese Remainder Theorem). This is discretely equivalent to labeling every k tokens with a globally unique indexing grammar.

That interleaving also corresponds to a triple of adjacent orthogonal embeddings, since those tokens still retain a random Gaussian embedding. The net effect is that we slice the latent space into a spaced chain of modular manifolds every k content tokens.

We also refer to that interleaving as Stiefel frames, for similar reasons to those in the post. We began work this spring or so to inject that construction inside the model, with early results in a similar direction to what the post describes. That's another way of saying this sort of approach lets us make that chained atlas (wc?) of modular manifolds as tight as possible within the dimensional limits of the embedding, floating-point precision, etc.

We somewhat tongue-in-cheek refer to this as the "retokenization group" at the prompt level, re: the renormalization group / tensor nets / etc. "Relayering group" (or perhaps "reconnection group") at the architecture level is the same net intuition.

glowcoil•4mo ago
I'm sorry, but even if I am maximally charitable and assume that everything you are saying is meaningful and makes sense, it still has essentially nothing to do with the original article. The original article is about imposing constraints on the weights of a neural network, during training, so that they lie on a particular manifold inside the overall weight space. The "modular" part is about being able to specify these constraints separately for individual layers or modules of a network and then compose them together into a meaningful constraint for the global network.

You are talking about latent space during inference, not weight space during training, and you are talking about interleaving tokens with random Gaussian tokens, not constraining values to lie on a manifold within a larger space. Whether or not the thing you are describing is meaningful or useful, it is basically unrelated to the original article, and you are not using the term "modular manifold" to refer to the same thing.

aghilmort•4mo ago
Hmm, hear you. My point wasn't that we are applying modular manifolds in the same way; it was that we are working on model reliability from two extremal ends using the same principle. There are various ways to induce modular manifolds in a model at various levels of resolution/power. We started at the outside-working-in level, so it works with any black-box model out of the box with zero knowledge needed -- you don't even need to know the token dictionary to show the effect.

We're already working on pushing the construction deeper into the model, both architecture and training. Currently that's for fine-tuning, and ultimately full architecture shrinkage/pruning and raw training vs. just fine-tuning, etc.

And it was just great to see someone else using modular manifolds, even if they are using them at the training stage vs. the inference stage. They're exploiting the modular form at training; we're doing it at inference. Cool to see.

snake_doc•4mo ago
Wot? Is this what AI-generated nonsense has come to? This is totally unrelated.
aghilmort•4mo ago
Nope. The construction induces ECC-driven emergent modular manifolds in latent space during the KVQ math. You can't use just any old ECC -- that's the crux of why it works. More in another reply.
yodon•4mo ago
Is the original Thinking Machines trademark[0] no longer active? They were the original AI company, back when AI was a completely different thing than it is today.

[0] https://en.m.wikipedia.org/wiki/Thinking_Machines_Corporatio...

tintor•4mo ago
That company has been defunct since 1994 -- 31 years ago.
gowld•4mo ago
What does this mean?
phoenicyan•4mo ago
Well-done post; I'd like to read more of their work, and it's exciting to see these new ideas. Though, as other people have said, the one set of empirical results they present is a bit... confusing? I'd think they'd have some more compelling examples to present, given all the pretty math.

Their modular norm paper (https://arxiv.org/abs/2405.14813) has several more examples; see their appendix D in particular, but these are also mystifying. Yes, they're interested in how things scale, but am I the only one to whom it seems that the training losses they report are just not competitive with what's currently being used?

nenenejej•4mo ago
https://archive.is/bP3BG

If you like to scroll on mobile :)

nenenejej•4mo ago
Nice! Posts like this make me regret not following a mathematics career. I'm sure some of the notation is basic (as in undergrad), but I'd need an entire weekend to understand the post.