frontpage.

Show HN: How I Topped the HuggingFace Open LLM Leaderboard on Two Gaming GPUs

https://dnhkng.github.io/posts/rys/
11•dnhkng•1h ago

Comments

dnhkng•1h ago
Author here. I found that duplicating a specific block of 7 middle layers in Qwen2-72B, without modifying any weights, improved performance across all Open LLM Leaderboard benchmarks and took the #1 spot. As of 2026, the top 4 models on that leaderboard are still descendants of it.

The weird finding: single-layer duplication does nothing. Too few layers, nothing. Too many, it gets worse. Only circuit-sized blocks of ~7 layers work. This suggests pretraining carves out discrete functional circuits in the layer stack that only work when preserved whole.

The whole thing was developed on 2x RTX 4090s in my basement. I'm now running current models (GLM-4.7, Qwen3.5, MiniMax M2.5) on a dual GH200 rig (see my other post). Code and new models coming soon.
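For anyone wondering what "duplicating a block of layers" means mechanically: a decoder-only transformer's layers form a sequential list, and the trick is to splice a copy of a contiguous block back into that list so those layers run twice per forward pass, with the copies sharing the originals' weights. A toy stand-in sketch with a plain Python list (the real edit would operate on the checkpoint's layer stack; the indices below are illustrative, not the ones from the post):

```python
# Toy stand-in: an 80-layer stack represented as a list of layer ids.
# Duplicating the block [40, 47) re-inserts those 7 layers right after
# themselves -- no weights are modified, the copies reuse the originals.
layers = list(range(80))   # pretend these are Qwen2-72B's decoder layers
start, end = 40, 47        # illustrative 7-layer middle block

expanded = layers[:end] + layers[start:end] + layers[end:]

print(len(expanded))       # 87 layers after duplication
print(expanded[40:54])     # the block 40..46 now appears twice in a row
```

In practice this slice-and-concatenate is, to my understanding, what passthrough-style merge tools (e.g. mergekit) do at the checkpoint level, but treat that as an assumption rather than a description of the author's exact pipeline.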

Happy to answer questions.

rapatel0•2m ago
I think you may have cracked latent-space reasoning. I've had a hunch that something like this would work, but couldn't figure out how the training would backpropagate. But you've shown that you just need to duplicate existing layers.

Have you tried a simple inline loop over the duplicated layers? Would be interesting to see performance. Also, it would be interesting to compare with an MoE model, to see whether these layers are acting like different agreeing "experts" or whether there is reasoning happening in the latent space.
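To make the "inline loop" idea concrete: instead of materializing copies of the block, you would re-run the same layers k times during the forward pass (weight tying taken to its limit). A toy functional sketch, with simple residual-style functions standing in for transformer layers — the names and looping scheme here are my guess at what's being proposed, not anything from the post:

```python
# Toy sketch: layers as functions on a scalar "hidden state".
def make_layer(w):
    return lambda h: h + w          # residual-style stand-in for a layer

stack = [make_layer(w) for w in range(1, 6)]   # a 5-layer toy model

def forward(h, layers, loop_block=(1, 4), loops=2):
    """Run the stack, repeating layers[lo:hi] `loops` times inline."""
    lo, hi = loop_block
    for layer in layers[:lo]:       # layers before the looped block
        h = layer(h)
    for _ in range(loops):          # inline loop over the shared block
        for layer in layers[lo:hi]:
            h = layer(h)
    for layer in layers[hi:]:       # layers after the looped block
        h = layer(h)
    return h

print(forward(0, stack))  # 1 + (2+3+4)*2 + 5 = 24
```

With `loops=1` this is just the original stack; with `loops=2` it matches the duplicated-block topology without ever copying weights.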

blourvim•1h ago
I am not really an ML dev, so I don't understand most of it. It does sound ridiculous that it would even work. Brilliant work and a great article; I enjoyed reading it.

This sounds similar to Kimi's mixture-of-experts architecture, if I understood it correctly (likely I have not). Can you comment on this?

tgw43279w•54m ago
That was a fun read! The base64 decoding and encoding is quite interesting. A parallel: these models are surprisingly robust to heavy word mangling. Back in 2023 people used this trick to jailbreak models very often, but what was more surprising is that they even understand it. I always thought of it this way: there must be some circuitry in the model that maps these almost unrecognizable words/sentences onto their rectified versions. What your base64 result also shows is that they can encode them back as well! (However, models are known to be unable to produce mangled output that looks convincingly random. I think the base64 transformation is more mechanical in this regard, and hence easier for them to reverse.)

So your layer-circuit hypothesis aligns pretty well with my mental model of how these models work, based on the interpretability work I am familiar with. I also really like the way you used the heatmaps as a tool to derive layer insights, very intuitive! But it's really surprising that you can simply duplicate layers and achieve better results that generalize. This is research-grade effort; I'm confident you could publish this at NeurIPS or ICML if you put it into a paper. I'm quite impressed, great work!

Tools I found that make using Claude Code easier on your phone

https://zilliz.com/blog/3-easiest-ways-to-use-claude-code-on-your-mobile-phone
1•Fendy•14s ago•0 comments

Show HN: Svglib, an SVG parser and renderer for Windows

https://github.com/bibhas2/svglib
1•leopoldj•2m ago•0 comments

The ugly history of regime change

https://www.profgmedia.com/p/this-time-is-different
2•shimm723•3m ago•0 comments

What software knowledge will stay relevant?

https://www.natemeyvis.com/what-software-knowledge-will-stay-relevant/
1•speckx•4m ago•0 comments

Show HN: Base Layer – Open-source behavioral compression from any text

https://www.base-layer.ai/
1•agulaya24•4m ago•0 comments

Para-biathlete wins silver using ChatGPT as his coach

https://www.theguardian.com/sport/2026/mar/09/ukraine-winter-paralympics-chat-gpt-artificial-inte...
1•defly•5m ago•0 comments

Amazon is holding a mandatory meeting about AI breaking its systems

https://twitter.com/lukolejnik/status/2031257644724342957
2•lwhsiao•5m ago•0 comments

Show HN: Claude Tuner – Monitor your Claude usage and find the right plan

https://claudetuner.com
1•xlos21•7m ago•1 comment

CragCLI – a new calculator for the command line

https://cragcli.info
3•librasteve•7m ago•1 comment

Show HN: Jottit – Reviving the Original from 2007

https://jottit.org
1•simonbc•7m ago•0 comments

Stripe: Billing for LLM Tokens

https://docs.stripe.com/billing/token-billing
1•tosh•7m ago•0 comments

Unlocked SaaS, file source as truth?

1•abmmgb•8m ago•1 comment

Understanding OBD2 codes (past, present, future)

https://crewchief.cc/blog/understanding-obd2-codes
1•meandave•8m ago•0 comments

Ask HN: What Happened to Llama Models?

1•elpakal•8m ago•0 comments

Meta to Acquire Moltbook

https://www.bloomberg.com/news/articles/2026-03-10/meta-to-acquire-moltbook-viral-social-network-...
2•marc__1•9m ago•0 comments

Disorder Drives One of Nature's Most Complex Machines

https://www.quantamagazine.org/disorder-drives-one-of-natures-most-complex-machines-20260309/
2•Brajeshwar•12m ago•0 comments

Spacecraft's impact changed asteroid's orbit in a save-the-Earth test

https://apnews.com/article/asteroid-nasa-draft-dimorphos-9abccd32d4cb532a66249dd6145685cb
2•Brajeshwar•13m ago•0 comments

Volkswagen to cut 50k jobs as profits drop

https://www.bbc.com/news/articles/c4gqyyly9v8o
1•gehwartzen•13m ago•0 comments

Microsoft 365 confirms new premium tier, stuffed with AI and few discounts

https://www.theregister.com/2026/03/09/microsoft_adds_a_premium_tier/
1•Brajeshwar•13m ago•0 comments

Smol AI WorldCup: What Small LLMs Can Do

https://huggingface.co/blog/FINAL-Bench/smol-worldcup
3•seawolf2357•13m ago•0 comments

Debian decides not to decide on AI-generated contributions

https://lwn.net/SubscriberLink/1061544/125f911834966dd0/
11•jwilk•13m ago•1 comment

License Laundering and the Death of Clean Room (The Chardet Saga)

https://shiftmag.dev/license-laundering-and-the-death-of-clean-room-8528/
1•allixsenos•13m ago•0 comments

We are building data breach machines and nobody cares

https://idealloc.me/posts/we-are-building-data-breach-machines-and-nobody-cares/
2•idealloc_haris•16m ago•0 comments

Turing Award winner and former Oxford professor Tony Hoare passed away

https://blog.computationalcomplexity.org/2026/03/tony-hoare-1934-2026.html
32•speckx•16m ago•2 comments

Non-blocking SQLite for Node.js. Ported 100% of better-sqlite3 tests

https://www.npmjs.com/package/better-sqlite3-pool
1•dilipvamsi•17m ago•1 comment

AI Agent hacked McKinsey's chatbot and gained full read-write access in 2 hours

https://www.theregister.com/2026/03/09/mckinsey_ai_chatbot_hacked/
1•smurda•17m ago•0 comments

Forward to Hell?

https://labs.ripe.net/author/mkoch/forward-to-hell-on-misusing-transparent-dns-forwarders-for-amp...
2•jruohonen•17m ago•0 comments

Elements of AI Agents

https://academy.dair.ai/courses/elements-of-ai-agents
1•omarsar•18m ago•0 comments

Portable Secret is now open source

https://blog.alcazarsec.com/tech/posts/portable-secret-is-now-opensource
1•alcazar•19m ago•0 comments

Why $100 Oil Isn't Going to Spark a New Shale Boom – Oilprice.com

https://oilprice.com/Energy/Crude-Oil/Why-100-Oil-Isnt-Going-to-Spark-a-New-Shale-Boom.html
1•bilsbie•20m ago•0 comments