
Show HN: How I Topped the HuggingFace Open LLM Leaderboard on Two Gaming GPUs

https://dnhkng.github.io/posts/rys/
94•dnhkng•3h ago

Comments

dnhkng•3h ago
Author here. I found that duplicating a specific block of 7 middle layers in Qwen2-72B, without modifying any weights, improved performance across all Open LLM Leaderboard benchmarks and took the #1 spot. As of 2026, the top 4 models on that leaderboard are still descendants of it.

The weird finding: single-layer duplication does nothing. Too few layers, nothing. Too many, it gets worse. Only circuit-sized blocks of ~7 layers work. This suggests pretraining carves out discrete functional circuits in the layer stack that only work when preserved whole.
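The index bookkeeping for this kind of block duplication is simple; here is a minimal stdlib-only sketch (names are illustrative, the author's code isn't released yet):

```python
def duplicated_layer_order(n_layers, block_start, block_len):
    """Layer execution order after repeating one contiguous block in place.

    The duplicated copies share weights with the originals; nothing is
    modified or retrained, the block is simply run twice.
    """
    order = list(range(n_layers))
    end = block_start + block_len
    # original prefix including the block, then the block again, then the rest
    return order[:end] + order[block_start:end] + order[end:]

# e.g. a 10-layer stack with layers 3-4 duplicated:
# duplicated_layer_order(10, 3, 2) -> [0, 1, 2, 3, 4, 3, 4, 5, 6, 7, 8, 9]
```

In a transformers-style checkpoint you could then rebuild the decoder stack from such an order (e.g. `nn.ModuleList([model.model.layers[i] for i in order])`), with the duplicated entries pointing at the same weight tensors.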

The whole thing was developed on 2x RTX 4090s in my basement. I'm now running current models (GLM-4.7, Qwen3.5, MiniMax M2.5) on a dual GH200 rig (see my other post). Code and new models coming soon.

Happy to answer questions.

rapatel0•2h ago
I think you may have cracked latent space reasoning. I've had a hunch that something like this would work, but couldn't figure out how the training would backpropagate. But you've shown that you just need to duplicate existing layers.

Have you tried a simple inline loop over the duplicated layers? Would be interesting to see performance. Also, would be interesting to compare with a MOE model. See if these layers are acting like different agreeing "experts" or if there is reasoning happening in the latent space.

dnhkng•50m ago
Yes, I've tried looping the duplicated layers, but it's rarely useful.

I think this hasn't been tried before because it's totally unintuitive that feeding the output from later layers into previous ones would actually do anything. And in fact, it usually is detrimental. I guess it takes really bored hobbyists with too much compute to check this stuff.

This works with MoE, and yes, I would be interested in looking into this in more detail. But my wife might disagree with this time sink...

skerit•9m ago
This is kind of what LoopLM is doing, no? https://arxiv.org/abs/2510.25741
naasking•1h ago
This layer duplication strikes me as a bit of "poor man's" version of looped language models:

https://ouro-llm.github.io/

Pretty cool though. LLM brain surgery.

dnhkng•1h ago
Agreed, but one thing to note:

I really think, based on the experiments, that 'organs' (not sure what else to term this) develop during massive pretraining. This also means that looping the entire model is maybe not efficient. Maybe a better way is [linear input section -> loop 1 -> linear section -> loop 2 -> linear section -> ... -> loop n -> linear output]?

This would give 'organs' space to develop.
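The proposed [linear -> loop -> linear] pattern can be sketched as a schedule builder (a toy illustration of the idea above, not the author's code; names are hypothetical):

```python
def build_schedule(segments):
    """Flatten [linear -> loop -> linear -> ...] into a layer execution order.

    segments: list of (layer_indices, repeat_count). A repeat_count of 1 is a
    plain linear section; >1 loops that whole 'organ' in place.
    """
    schedule = []
    for layers, repeats in segments:
        for _ in range(repeats):
            schedule.extend(layers)
    return schedule

# linear [0, 1], loop the 'organ' [2, 3, 4] twice, then linear [5]:
# build_schedule([([0, 1], 1), ([2, 3, 4], 2), ([5], 1)])
# -> [0, 1, 2, 3, 4, 2, 3, 4, 5]
```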

jauntywundrkind•1h ago
The dual GH200 build was amazing. Awesome to see someone with such talent & flair in one area also doing great in another area. Thanks for noting that that was you. https://news.ycombinator.com/item?id=46222237
digdugdirk•1h ago
Super cool! Do you do any analysis or have any tools that help you identify these circuits? I came across this [1] recently, and wanted to try to identify specifically strong "circuits" in what seems to be a similar way to what you did.

[1] https://weightwatcher.ai/

dnhkng•1h ago
I build my own analysis tools. I'm just finishing up running the current generation of LLMs (MiniMax M2.5 and the Qwen3.5 family), and then I will put it all on GitHub.

It's less a 'tool' than an assorted set of scripts tailored to my unusual hardware setup, but it should be easy to extend. I would have released this earlier, but I had the (stupid) idea to 'write a paper' on this; aiming for that delayed things by a year. Blogs are the way to go (for me).

Balinares•58m ago
The idea that there may be a cognitive lingua franca hiding in the layers is fascinating and gives me hope for a neat idea: pluggable knowledge banks.

MoE notwithstanding, a model trained on the whole Internet and a few hundred thousand stolen books carries way more knowledge than is actually needed for any given workflow. It would be great if we could ship slimmed-down models into which we'd plug the knowledge banks useful for today's work, and only those.

It would also mean that you could keep a model's knowledge fresh without retraining the whole of it.

afpx•15m ago
Thank you so much for sharing this in a delightful blog post. One of the more enjoyable things I've read in a while. Very motivating!
user_7832•13m ago
Thanks for the post, really cool stuff you did!

Extra thanks for making it written in a readable and approachable way! I don't have much of a background in this topic, but still managed to understand about 70-80% of it :) You're a good writer

blourvim•3h ago
I am not really an ML dev, so I don't understand most of it. It does sound ridiculous that it would even work. Brilliant work and a great article; I enjoyed reading it.

This sounds similar to Kimi's mixture-of-experts architecture, if I understood it correctly (likely I have not). Can you comment on this?

dnhkng•1h ago
No worries, happy to discuss anyway :)

MoE (mixture of experts) is an architecture that forces sparsity (not all 'neurons' are active during the forward pass).

This is pretty much orthogonal to that; it works with dense and MoE models, by repeating 'vertical' sections of the transformer stack.
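To illustrate the sparsity point: an MoE layer routes each token to only a few experts via a gate, something like this stdlib-only toy (real routers, e.g. in Mixtral or Qwen-MoE, work per token inside each MoE layer; this is just the gating idea):

```python
import math

def top_k_gate(logits, k=2):
    """Softmax gate that keeps only the top-k experts; the rest stay inactive.

    Returns {expert_index: weight}, with the chosen weights summing to 1.
    """
    # pick the k experts with the highest gate logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # softmax restricted to the selected experts
    exps = {i: math.exp(logits[i]) for i in top}
    total = sum(exps.values())
    return {i: exps[i] / total for i in top}
```

Layer duplication is orthogonal to this: it repeats whole 'vertical' sections of the stack, whether those layers are dense or MoE.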

tgw43279w•2h ago
That was a fun read! The base64 decoding and encoding is quite interesting.

A parallel: these models are surprisingly robust to heavy word mangling. Back in 2023 people used this trick to jailbreak models very often, but what was more surprising is that they even understand it. I always thought of it this way: there must be some circuitry in the model that maps these almost unrecognizable words/sentences onto their rectified versions. What your base64 result also shows is that they can encode them back as well! (However, models are known to be unable to produce mangled output that looks convincingly random. I think the base64 transformation is more mechanical in this regard, and hence easier for them to reverse.)

So your layer circuit hypothesis aligns pretty well with my mental model of how these models work, based on the interpretability work I am familiar with. I also really like the way you used the heatmaps as a tool to derive layer insights, very intuitive! But it's really surprising that you can simply duplicate layers and achieve better results that generalize. This is research-grade effort; I'm confident you could publish this at NeurIPS or ICML if you put it into a paper. I'm quite impressed! Great work!
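For reference, the round-trip the model is reproducing in-weights is, mechanically, just this:

```python
import base64

text = "Hello world"
# encode: bytes -> base64 alphabet; decode: base64 -> original bytes
encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
decoded = base64.b64decode(encoded).decode("utf-8")
assert decoded == text
# encoded == "SGVsbG8gd29ybGQ="
```

What's notable is that a model has no access to this algorithm at inference time; it has to have learned the character-level mapping from pretraining data.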
WithinReason•1h ago
Here is a paper that made a similar observation recently:

https://www.alphaxiv.org/abs/2512.19941

tgw43279w•1h ago
Very cool, thanks for sharing! Recovering 96% using just two blocks on IMN-1k, wow!
dnhkng•1h ago
Thanks for the link!

I think these models have to learn to use their parameters efficiently, and the best way to do that is to 'evolve' (yes, a bad word for it) structures over pretraining time. Unfortunately, they don't have a way to access these structures 'from the inside'. I hope this new approach lets us boost performance in a more experimentally rigorous way.

WithinReason•1h ago
I think the recurrence works as a consequence of the residual connections; it seems like they make the representation stay consistent across layers.
seeknotfind•1h ago
Did you ever try multiple copies?
dnhkng•1h ago
I did, but the combinatorics are mad. I have also tried training a meta-model that predicts the outputs of the combinations.

I will make another post if the topic is popular; it's pretty geeky though, even more than my usual blog posts...

tjwei•1h ago
Really interesting discovery, especially the part about base64. Reminds me of this: Transformer Layers as Painters https://arxiv.org/abs/2407.09298
cootsnuck•1h ago
Super cool. Love seeing these writeups of hobbyists getting their hands dirty, breaking things, and then coming out on the other side of it with something interesting.
goodmythical•1h ago
Isn't this similar to models that "double-check the answer"?

First pass runs your input through; second pass runs its output as input?

Just, in double-checking it presumably runs the entire stack, while you're trying to skip the translation steps and only re-run the logic?

dnhkng•57m ago
Maybe, but the interesting thing for me is that this only works with specific 'chunks' of the transformer layer stack. More or fewer layers than the optimum leads to worse performance.
sva_•38m ago
I don't think it's mathematically equivalent, or even close, because the context/logprobs will be very different, since you only produce one token per pass. I'd say the token itself carries a lot less information than the signal propagating through the residual stream of transformer blocks.
patchnull•1h ago
This lines up with what I have seen doing CKA (centered kernel alignment) analysis on transformer internals. The middle layers in most large models have surprisingly similar representations to their neighbors, so duplicating them is basically giving the model extra compute cycles in a region where it is already doing useful refinement without messing up the input/output encoding stages. Curious whether picking layers by representation similarity instead of just a contiguous block would do even better.
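For readers unfamiliar with CKA: linear CKA between two activation matrices (rows = examples, columns = features) reduces to a few lines. A stdlib-only sketch of the standard formula (not patchnull's tooling):

```python
def linear_cka(X, Y):
    """Linear CKA: ||Xc^T Yc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F),
    where Xc, Yc are column-centered. 1.0 means the representations match
    up to rotation and isotropic scaling."""
    def center(M):
        n = len(M)
        means = [sum(row[j] for row in M) / n for j in range(len(M[0]))]
        return [[row[j] - means[j] for j in range(len(M[0]))] for row in M]

    def cross_frob_sq(A, B):
        # squared Frobenius norm of A^T B
        total = 0.0
        for i in range(len(A[0])):
            for j in range(len(B[0])):
                dot = sum(A[r][i] * B[r][j] for r in range(len(A)))
                total += dot * dot
        return total

    Xc, Yc = center(X), center(Y)
    denom = cross_frob_sq(Xc, Xc) ** 0.5 * cross_frob_sq(Yc, Yc) ** 0.5
    return cross_frob_sq(Xc, Yc) / denom
```

Comparing each layer's activations to its neighbors' with this would show the kind of blocky middle-layer similarity described above.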
dnhkng•58m ago
Have a look at the boundaries in the heatmaps.

They are of course open to interpretation, but they suggest to me that the models develop 'organs' for processing different types of data, and without duplicating the 'whole organ' you don't get the benefits.

This is quite different from what you usually see via layer ablation experiments. Thoughts?

doctorpangloss•9m ago
Maybe you are observing artifacts of Qwen's training procedure. Perhaps they initialized further layers with the weights of previous ones as part of the training curriculum. But it's fun to imagine something more exotic.
GaggiX•1h ago
This reminds me when people were doing crazy stuff to improve the first Stable Diffusion model by swapping layers, interpolating weights, documenting which layer was most responsible for the quality of the hands etc. At the end the final models had dozens of different ancestors.
hmokiguess•54m ago
I really enjoyed reading this. I feel like generalists intuitively experience this exact thing throughout their lives, because they must have the kind of neuroanatomy you describe. There's a certain geometry to knowledge that makes this orthogonal movement possible, and it is really fascinating to me. Thank you for publishing this, you made my day!
dnhkng•42m ago
Thanks!
rob_c•47m ago
Very awesome writeup; glad to see someone with access to hardware actually playing with this.

Hopefully the cost per GPU will come down soon and we'll see people properly play, but frankly the "middle section" layers 2(ish) to (n-1)(ish) of a model can be shuffled up/down and left/right and still perform well.

The fun one will be an LLM router for LLM layers, applying the best reasoning to the best input so far, but frankly that would need the years and years of training the author hints at.

The one that's still out of grasp is how to combine/manipulate per-layer k,v caches into a globally coherent state. I.e., if layers can be moved up/down, why can't the cached k,v be swapped/combined with different projections? Global k,v caches work, but they have to be _huge_ to prevent model collapse, even on something as simple as owt.

Havoc•41m ago
Crazy writeup.

Author is right about the base64 part. It does seem weird that it can decode and understand it at the same time. And I guess what makes it weird is that we just sort of accept this working for, say, English and German (i.e. normal use), but when framed as base64 it suddenly stops feeling intuitive.

dinobones•6m ago
why tho? it's just an alternate alphabet/set of symbols.
lordmathis•41m ago
That's cool. I tried the b64 thing on my local qwen3.5 27b without access to tools and it did it.
priowise•32m ago
Very cool build. I’m always curious with experiments like this — was the biggest bottleneck compute, data curation, or evaluation methodology?
user_7832•13m ago
A 5 hour old account with a standard chatgpt reply? Seriously, try harder.
kovek•16m ago
Is this similar to send 48656c6c6f2c20686f772061726520796f753f in the prompt? As done here: https://youtu.be/GiaNp0u_swU?si=m7-LZ7EYxJCw0k1-
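That hex string is the same kind of mechanical mapping; decoded with the stdlib:

```python
# the hex payload from the comment above, decoded to plain ASCII
payload = "48656c6c6f2c20686f772061726520796f753f"
print(bytes.fromhex(payload).decode("ascii"))  # -> Hello, how are you?
```

So it's an analogous test: the model has to apply a learned byte-level encoding in its head rather than running any decoder.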
Aditya_Garg•8m ago
Wild stuff and great read

Do you think karpathy's autoresearch would be useful here?
