This sounds similar to Kimi's mixture-of-experts architecture, if I understood it correctly (likely I have not). Can you comment on this?
MoE (mixture of experts) is an architecture that forces sparsity (not all 'neurons' are active during the forward pass).
This is pretty much orthogonal to that; it works with dense and MoE models, by repeating 'vertical' sections of the transformer stack.
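A toy contrast, just to illustrate (my own stand-in code, nothing from the post): MoE gates width, picking which expert runs per token, while this repeats depth, running every token through a span of layers again.

    import torch

    class ToyMoE(torch.nn.Module):
        # Width sparsity: each token is routed to a single expert.
        def __init__(self, d, n_experts=4):
            super().__init__()
            self.experts = torch.nn.ModuleList(torch.nn.Linear(d, d) for _ in range(n_experts))
            self.gate = torch.nn.Linear(d, n_experts)

        def forward(self, x):                     # x: (tokens, d)
            choice = self.gate(x).argmax(dim=-1)  # pick one expert per token
            out = torch.empty_like(x)
            for e, expert in enumerate(self.experts):
                mask = choice == e
                out[mask] = expert(x[mask])
            return out

    # Depth repetition: every token passes through the same span again.
    def repeat_span(span, x, times=2):
        for _ in range(times):
            for layer in span:
                x = layer(x)
        return x

    x = torch.randn(10, 64)
    print(ToyMoE(64)(x).shape)                            # one expert ran per token
    print(repeat_span([torch.nn.Linear(64, 64)], x).shape)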
I think these models have to learn to use their parameters efficiently, and the best way to do that is to 'evolve' (yes, a bad word for it) structures over pretraining time. Unfortunately, they don't have a way to access these structures 'from the inside'. I hope this new approach lets us boost performance in a more experimentally rigorous way.
I will make another post if the topic is popular; it's pretty geeky though, even more than my usual blog posts...
First pass runs your input through, second pass runs its output as input?
Just, a double check presumably runs the entire stack, while you're trying to skip the translation steps and only double-check the logic?
They are of course open to interpretation, but it suggests to me that the models develop 'organs' for processing different types of data, and without duplicating the 'whole organ' you don't get the benefits.
This is quite different from what you usually see via layer ablation experiments. Thoughts?
Hopefully the cost per GPU will come down soon and we'll see people properly play, but frankly the "middle section" layers, 2(ish) to (n-1)(ish), of a model can be shuffled up/down and left/right and still perform well.
The fun one will be an LLM router for LLM layers, to apply the best reasoning to the input so far, but frankly that would need the years and years of training that the author hints at.
The one that's still out of grasp is how to combine/manipulate per-layer k,v caches into a globally coherent state. I.e. if layers can be moved up/down, why can't the cached k,v be swapped/combined with different projections? Global k,v caches work, but they have to be _huge_ to prevent model collapse even on something as simple as owt.
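Concretely, the kind of thing I mean (toy shapes; the projections P_k/P_v would have to be learned somehow, this is just the plumbing):

    import torch

    heads, seq, d = 8, 128, 64
    k_i = torch.randn(heads, seq, d)          # layer i's cached keys
    v_i = torch.randn(heads, seq, d)          # layer i's cached values

    # Hypothetical basis-change maps from layer i's k,v space to layer j's.
    P_k = torch.nn.Linear(d, d, bias=False)
    P_v = torch.nn.Linear(d, d, bias=False)
    k_j, v_j = P_k(k_i), P_v(v_i)             # reuse layer i's cache inside layer j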
The author is right about the base64 part. It does seem weird that it can decode and understand it at the same time. And I guess what makes it weird is that we just sorta accept that this works for, say, English and German (i.e. normal use), but when it's framed as base64 it suddenly stops feeling intuitive.
Do you think karpathy's autoresearch would be useful here?
dnhkng•3h ago
The weird finding: single-layer duplication does nothing. Too few layers, nothing. Too many, it gets worse. Only circuit-sized blocks of ~7 layers work. This suggests pretraining carves out discrete functional circuits in the layer stack that only work when preserved whole.
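In code, the surgery is roughly this (a sketch only; the model id and the start/width indices are placeholders, not my exact recipe):

    import copy
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")  # placeholder id
    layers = model.model.layers              # LLaMA/Qwen-style decoder stack
    start, width = 8, 7                      # a circuit-sized block; indices are guesses

    # Deep-copy the block and splice the copies in right after the originals.
    block = [copy.deepcopy(layers[i]) for i in range(start, start + width)]
    new_stack = list(layers[:start + width]) + block + list(layers[start + width:])
    for i, layer in enumerate(new_stack):
        layer.self_attn.layer_idx = i        # keep KV-cache indexing consistent
    model.model.layers = torch.nn.ModuleList(new_stack)
    model.config.num_hidden_layers = len(new_stack)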
The whole thing was developed on 2x RTX 4090s in my basement. I'm now running current models (GLM-4.7, Qwen3.5, MiniMax M2.5) on a dual GH200 rig (see my other post). Code and new models coming soon.
Happy to answer questions.
rapatel0•2h ago
Have you tried a simple inline loop over the duplicated layers? Would be interesting to see the performance. Also, it would be interesting to compare with an MoE model. See if these layers are acting like different agreeing "experts" or if there is reasoning happening in the latent space.
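By "inline loop" I mean something like this (a weight-tied sketch, span size made up):

    import torch

    class LoopedSpan(torch.nn.Module):
        # Re-run the same layers N times instead of materializing copies.
        def __init__(self, layers, n_loops=2):
            super().__init__()
            self.layers = torch.nn.ModuleList(layers)
            self.n_loops = n_loops

        def forward(self, h):
            for _ in range(self.n_loops):
                for layer in self.layers:
                    h = layer(h)
            return h

    d = 64
    span = LoopedSpan([torch.nn.Linear(d, d) for _ in range(7)], n_loops=2)
    print(span(torch.randn(3, d)).shape)      # torch.Size([3, 64])

If the copies sit right after the originals and nothing is fine-tuned, this should compute exactly the same function as duplication; the two only diverge once the copies get trained apart.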
dnhkng•50m ago
I think this hasn't been tried before because it's totally unintuitive that feeding the output from later layers into previous ones would actually do anything. And in fact, it usually is detrimental. I guess it takes really bored hobbyists with too much compute to check this stuff.
This works with MoE, and yes, I would be interested in looking into this in more detail. But my wife might disagree with this time sink...
naasking•1h ago
https://ouro-llm.github.io/
Pretty cool though. LLM brain surgery.
dnhkng•1h ago
I really think, from the experiments, that 'organs' (not sure what to term this) develop during massive pretraining. This also means that looping the entire model is maybe not actually efficient. Maybe a better way is [linear input section -> loop 1 -> linear section -> loop 2 -> linear section -> ... -> loop n -> linear output]?
This would give 'organs' space to develop.
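Schematically, something like this (toy linear layers standing in for transformer blocks; span sizes and loop counts made up):

    import torch

    d = 64
    mk = lambda n: torch.nn.Sequential(*[torch.nn.Linear(d, d) for _ in range(n)])

    def forward(h, linear_spans, loop_spans, n_loops=2):
        # alternate: linear -> loop -> linear -> loop -> ... -> linear
        for lin, loop in zip(linear_spans[:-1], loop_spans):
            h = lin(h)
            for _ in range(n_loops):
                h = loop(h)
        return linear_spans[-1](h)

    lins = [mk(2), mk(2), mk(2)]              # n+1 run-once sections
    loops = [mk(5), mk(5)]                    # n looped 'organ' spans
    print(forward(torch.randn(3, d), lins, loops).shape)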
digdugdirk•1h ago
[1] https://weightwatcher.ai/
dnhkng•1h ago
It's less a 'tool' than an assorted set of scripts tailored to my unusual hardware setup. But it should be easy to extend; I would have released this earlier, but I had the (stupid) idea to 'write a paper' on this. Aiming for that delayed this a year. Blogs are the way to go (for me).
Balinares•58m ago
MoE notwithstanding, a model trained on the whole Internet and a few hundred thousand stolen books carries way more knowledge than is actually needed for any given workflow. It would be great if we could ship slimmed-down models into which we'd plug the knowledge banks useful for today's work, and only those.
It would also mean that you could keep a model's knowledge fresh without retraining the whole of it.
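The closest existing plumbing is probably hot-swappable LoRA adapters; a sketch (all ids and paths are placeholders):

    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("some-slim-base")   # placeholder
    model = PeftModel.from_pretrained(base, "adapters/chemistry")   # today's knowledge bank
    model.load_adapter("adapters/law", adapter_name="law")          # another bank on standby
    model.set_adapter("law")                                        # swap without retraining the base

Not real knowledge banks, of course, but it hints at keeping a model fresh by swapping small modules instead of retraining the whole thing.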
user_7832•13m ago
Extra thanks for writing it in a readable and approachable way! I don't have much of a background in this topic, but still managed to understand about 70-80% of it :) You're a good writer.