frontpage.

Show HN: Solving NP-Complete Structures via Information Noise Subtraction (P=NP)

https://zenodo.org/records/18395618
1•alemonti06•2m ago•1 comments

Cook New Emojis

https://emoji.supply/kitchen/
1•vasanthv•5m ago•0 comments

Show HN: LoKey Typer – A calm typing practice app with ambient soundscapes

https://mcp-tool-shop-org.github.io/LoKey-Typer/
1•mikeyfrilot•8m ago•0 comments

Long-Sought Proof Tames Some of Math's Unruliest Equations

https://www.quantamagazine.org/long-sought-proof-tames-some-of-maths-unruliest-equations-20260206/
1•asplake•9m ago•0 comments

Hacking the last Z80 computer – FOSDEM 2026 [video]

https://fosdem.org/2026/schedule/event/FEHLHY-hacking_the_last_z80_computer_ever_made/
1•michalpleban•9m ago•0 comments

Browser-use for Node.js v0.2.0: TS AI browser automation parity with PY v0.5.11

https://github.com/webllm/browser-use
1•unadlib•10m ago•0 comments

Michael Pollan Says Humanity Is About to Undergo a Revolutionary Change

https://www.nytimes.com/2026/02/07/magazine/michael-pollan-interview.html
1•mitchbob•10m ago•1 comments

Software Engineering Is Back

https://blog.alaindichiappari.dev/p/software-engineering-is-back
1•alainrk•11m ago•0 comments

Storyship: Turn Screen Recordings into Professional Demos

https://storyship.app/
1•JohnsonZou6523•12m ago•0 comments

Reputation Scores for GitHub Accounts

https://shkspr.mobi/blog/2026/02/reputation-scores-for-github-accounts/
1•edent•15m ago•0 comments

A BSOD for All Seasons – Send Bad News via a Kernel Panic

https://bsod-fas.pages.dev/
1•keepamovin•19m ago•0 comments

Show HN: I got tired of copy-pasting between Claude windows, so I built Orcha

https://orcha.nl
1•buildingwdavid•19m ago•0 comments

Omarchy First Impressions

https://brianlovin.com/writing/omarchy-first-impressions-CEEstJk
2•tosh•24m ago•1 comments

Reinforcement Learning from Human Feedback

https://arxiv.org/abs/2504.12501
2•onurkanbkrc•25m ago•0 comments

Show HN: Versor – The "Unbending" Paradigm for Geometric Deep Learning

https://github.com/Concode0/Versor
1•concode0•25m ago•1 comments

Show HN: HypothesisHub – An open API where AI agents collaborate on medical res

https://medresearch-ai.org/hypotheses-hub/
1•panossk•28m ago•0 comments

Big Tech vs. OpenClaw

https://www.jakequist.com/thoughts/big-tech-vs-openclaw/
1•headalgorithm•31m ago•0 comments

Anofox Forecast

https://anofox.com/docs/forecast/
1•marklit•31m ago•0 comments

Ask HN: How do you figure out where data lives across 100 microservices?

1•doodledood•31m ago•0 comments

Motus: A Unified Latent Action World Model

https://arxiv.org/abs/2512.13030
1•mnming•32m ago•0 comments

Rotten Tomatoes Desperately Claims 'Impossible' Rating for 'Melania' Is Real

https://www.thedailybeast.com/obsessed/rotten-tomatoes-desperately-claims-impossible-rating-for-m...
3•juujian•33m ago•2 comments

The protein denitrosylase SCoR2 regulates lipogenesis and fat storage [pdf]

https://www.science.org/doi/10.1126/scisignal.adv0660
1•thunderbong•35m ago•0 comments

Los Alamos Primer

https://blog.szczepan.org/blog/los-alamos-primer/
1•alkyon•37m ago•0 comments

NewASM Virtual Machine

https://github.com/bracesoftware/newasm
2•DEntisT_•40m ago•0 comments

Terminal-Bench 2.0 Leaderboard

https://www.tbench.ai/leaderboard/terminal-bench/2.0
2•tosh•40m ago•0 comments

I vibe coded a BBS bank with a real working ledger

https://mini-ledger.exe.xyz/
1•simonvc•40m ago•1 comments

The Path to Mojo 1.0

https://www.modular.com/blog/the-path-to-mojo-1-0
1•tosh•43m ago•0 comments

Show HN: I'm 75, building an OSS Virtual Protest Protocol for digital activism

https://github.com/voice-of-japan/Virtual-Protest-Protocol/blob/main/README.md
5•sakanakana00•46m ago•1 comments

Show HN: I built Divvy to split restaurant bills from a photo

https://divvyai.app/
3•pieterdy•49m ago•0 comments

Hot Reloading in Rust? Subsecond and Dioxus to the Rescue

https://codethoughts.io/posts/2026-02-07-rust-hot-reloading/
4•Tehnix•49m ago•1 comments

Reproducing DeepSeek's MHC: When Residual Connections Explode

https://taylorkolasinski.com/notes/mhc-reproduction/
121•taykolasinski•3w ago

Comments

taykolasinski•3w ago
OP here. I spent the last few days reproducing the mHC architecture from the recent DeepSeek paper (2512.24880).

Two key takeaways from the reproduction:

1. Unconstrained Hyper-Connections really do explode (7x amplification even at 10M scale).

2. I hit a nasty "stream persistence" bug where my tensors were the right shape, but the architecture was functionally broken (toy sketch below).

This is Part 1 (10M scale). Part 2 (scaling to 1B on A100s) is coming later this week. Happy to answer questions about the implementation.
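
For anyone curious what the second takeaway looked like in practice, here is a toy illustration of a "right shape, wrong semantics" stream-persistence bug (illustrative PyTorch only, not the actual repro code): the broken version rebuilds its expanded streams from the collapsed input inside the layer loop, so every shape checks out while nothing written to the extra streams ever survives to the next layer.

    import torch
    import torch.nn as nn

    class ToyBlock(nn.Module):
        """Stand-in for an attention/MLP sub-block."""
        def __init__(self, d):
            super().__init__()
            self.ff = nn.Linear(d, d)

        def forward(self, x):
            return torch.relu(self.ff(x))

    def broken_forward(blocks, x, n_streams=4):
        # BUG: the expanded streams are rebuilt from the collapsed input inside
        # the loop, so anything written to streams 1..n-1 never survives to the
        # next layer. Every tensor shape checks out; the streams don't persist.
        for i, block in enumerate(blocks):
            streams = [x.clone() for _ in range(n_streams)]
            k = i % n_streams
            streams[k] = streams[k] + block(streams[k])
            x = torch.stack(streams, dim=1).mean(dim=1)
        return x

    def fixed_forward(blocks, x, n_streams=4):
        # Streams are created once and carried through every layer, as intended.
        streams = [x.clone() for _ in range(n_streams)]
        for i, block in enumerate(blocks):
            k = i % n_streams
            streams[k] = streams[k] + block(streams[k])
        return torch.stack(streams, dim=1).mean(dim=1)

    torch.manual_seed(0)
    blocks = nn.ModuleList(ToyBlock(32) for _ in range(6))
    x = torch.randn(8, 32)
    print(broken_forward(blocks, x).norm(), fixed_forward(blocks, x).norm())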

WiSaGaN•3w ago
How do you know "GPT-5, Claude, Llama, Gemini. Under the hood, they all do the same thing: x+F(x)."?
taykolasinski•3w ago
I’m referring specifically to the fundamental residual connection backbone that defines the transformer architecture (x_{l+1} = x_l + F(x_l)).

While the sub-modules differ (MHA vs GQA, SwiGLU vs GeLU, Mixture-of-Depths, etc.), the core signal propagation in Llama, Gemini, and Claude relies on that additive residual stream.

My point here is that DeepSeek's mHC challenges that fundamental additive assumption by introducing learnable weighted scaling factors to the residual path itself.
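
To make the contrast concrete, here is a minimal PyTorch sketch (my own toy paraphrase, not the paper's exact formulation): the vanilla block is the familiar x_{l+1} = x_l + F(x_l), while the mHC-flavoured variant replaces the implicit 1.0 on the residual path with learnable scales.

    import torch
    import torch.nn as nn

    class VanillaBlock(nn.Module):
        # Standard additive residual: x_{l+1} = x_l + F(x_l)
        def __init__(self, d):
            super().__init__()
            self.norm = nn.LayerNorm(d)
            self.f = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

        def forward(self, x):
            return x + self.f(self.norm(x))

    class WeightedResidualBlock(nn.Module):
        # mHC-flavoured variant (illustrative only): the implicit 1.0 on the
        # residual path becomes a learnable scale, so the model can re-weight
        # how much of the incoming stream is carried forward vs. rewritten.
        def __init__(self, d):
            super().__init__()
            self.norm = nn.LayerNorm(d)
            self.f = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            self.res_scale = nn.Parameter(torch.ones(d))  # init = identity behaviour
            self.out_scale = nn.Parameter(torch.ones(d))

        def forward(self, x):
            return self.res_scale * x + self.out_scale * self.f(self.norm(x))

    x = torch.randn(2, 16, 64)
    print(VanillaBlock(64)(x).shape, WeightedResidualBlock(64)(x).shape)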

WiSaGaN•3w ago
I guess I am asking how we know Gemini and Claude rely on the additive residual stream. We don't know the architecture details for these closed models?
taykolasinski•3w ago
That's a fair point. We don't have the weights or code for the closed models, so we can't be 100% certain.

However, being transformer-based (which their technical reports confirm) implies the standard pre-norm/post-norm residual block structure. Without those additive residual connections, training networks of that depth (100+ layers) becomes difficult due to the vanishing gradient problem.

If they had solved deep signal propagation without residual streams, that would likely be a bigger architectural breakthrough than the model itself (akin to Mamba/SSMs). It’s a very high-confidence assumption, but you are right that it is still an assumption.

solarkraft•3w ago
I’ve been wondering for a while: Why isn’t this architecture more common in other LLMs? The context efficiency is amazing, after all - doesn’t that translate to a lot of money at scale?
kevmo314•3w ago
It's an incremental improvement, not really a revolutionary step.

That being said, I think one could adapt an existing model to add mHC by initializing the routing matrix to the regular residual connection and then post-train the hyper connection matrices. This would let you continue training more efficiently on existing models.

taykolasinski•3w ago
That initialization strategy (effectively starting as identity to match the standard residual stream) is clever. It would let you perform surgery on an existing model like Llama-3 and fine-tune it into an mHC architecture.

The main risk I see is that the 7x signal amplification happens very aggressively. Even with a gentle initialization, you’d likely need very strict gradient clipping or a tiny learning rate on those new routing matrices to prevent them from blowing up the pre-trained features in the first few steps.
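
A rough sketch of what that migration recipe could look like (toy code under stated assumptions: a made-up model with a .blocks list, arbitrary learning rates and clipping threshold, not a real Llama-3 surgery): initialize the new scales to 1.0 so the model starts out exactly reproducing its pre-trained x + F(x) behaviour, then give only the new parameters a tiny learning rate and hard gradient clipping.

    import torch
    from torch import nn

    class Block(nn.Module):
        # Toy stand-in for a pre-trained transformer block.
        def __init__(self, d):
            super().__init__()
            self.f = nn.Linear(d, d)

        def forward(self, x):
            # Until a res_scale parameter is grafted on, this is a plain residual.
            scale = getattr(self, "res_scale", 1.0)
            return scale * x + torch.relu(self.f(x))

    class ToyModel(nn.Module):
        def __init__(self, d, depth):
            super().__init__()
            self.blocks = nn.ModuleList(Block(d) for _ in range(depth))

        def forward(self, x):
            for b in self.blocks:
                x = b(x)
            return x

    def add_residual_scales(model, d):
        # New per-block scales start at 1.0, so the model initially reproduces
        # the plain x + F(x) behaviour of its pre-trained weights exactly.
        new_params = []
        for block in model.blocks:
            block.res_scale = nn.Parameter(torch.ones(d))
            new_params.append(block.res_scale)
        return new_params

    d, model = 32, ToyModel(32, depth=4)
    new_params = add_residual_scales(model, d)
    new_ids = {id(p) for p in new_params}
    pretrained = [p for p in model.parameters() if id(p) not in new_ids]
    # Tiny learning rate on the new parameters, normal fine-tuning LR elsewhere.
    opt = torch.optim.AdamW([{"params": pretrained, "lr": 1e-5},
                             {"params": new_params, "lr": 1e-6}])
    x, y = torch.randn(8, d), torch.randn(8, d)
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    nn.utils.clip_grad_norm_(new_params, max_norm=0.1)  # clamp the new scales hard
    opt.step()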

Also, I think there's a mix-up here between mHC (this paper, expressivity) and MLA (latent attention, which provides the massive context efficiency). mHC doesn't save memory, but it might make the model 'smarter' per parameter.

solarkraft•3w ago
You’re right, I totally mixed this up with MLA.
yorwba•3w ago
https://arxiv.org/abs/2512.24880 was published less than two weeks ago, which should explain why it's not more common yet. And it's not that amazing either. It's a slight quality improvement for a slight increase in cost. It's not even clear to me whether it pays for itself.
solarkraft•3w ago
My bad, I took this as something Multi-head Latent Attention (MLA) related.
graemefawcett•3w ago
I think the biggest benefit is bandwidth more so than efficiency. This gives you multiple streams to mux and a means to control their mixing.

The biggest innovation I think may have been accidental. The doubly stochastic matrix implements conservation on the signal stream.

Treating the signal like the information it is as we do in any other domain is crucial for maintaining its coherence. We don't allow a network router to generate more packets than it receives for the same reason.
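
One generic way to see that conservation property in code is Sinkhorn normalization (a standard construction, not necessarily how the mHC paper parameterizes it): a doubly stochastic routing matrix can only redistribute mass between streams, never create it.

    import torch

    def sinkhorn(logits, n_iters=20):
        # Alternately normalize rows and columns to approach a doubly
        # stochastic matrix (all rows and columns sum to ~1).
        m = torch.exp(logits)
        for _ in range(n_iters):
            m = m / m.sum(dim=1, keepdim=True)   # rows sum to 1
            m = m / m.sum(dim=0, keepdim=True)   # columns sum to 1
        return m

    torch.manual_seed(0)
    n_streams = 4
    routing = sinkhorn(torch.randn(n_streams, n_streams))
    # Mixing n unit-mass streams with a doubly stochastic matrix preserves the
    # total "mass": signal is redistributed, never created or destroyed.
    streams = torch.ones(n_streams, 8)             # 4 streams, toy width 8
    mixed = routing @ streams
    print(routing.sum(dim=0), routing.sum(dim=1))  # each entry ~1.0
    print(streams.sum().item(), mixed.sum().item())  # ~32.0 and ~32.0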

sbondaryev•3w ago
Nice visualization of the residual connections. Is the animated svg manually created or programmatically generated? What tools did you use?
taykolasinski•3w ago
Thanks! Manually created Astro components with inline SVG and CSS animations.
cpldcpu•3w ago
May be worth pointing out that this is not the first residual-connection innovation to be in production.

Gemma 3n is also using a low-rank projection of the residual stream called LAuReL. Google did not publicize this much; I noticed it when poking around in the model file.

https://arxiv.org/pdf/2411.07501v3

https://old.reddit.com/r/LocalLLaMA/comments/1kuy45r/gemma_3...

Seems to be what they call LAuReL-LR in the paper, with D=2048 and R=64
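
For what it's worth, here is a toy sketch of how I read the LAuReL-LR idea (my interpretation of the paper, not Google's implementation): the skip path picks up a learned low-rank correction on top of the identity.

    import torch
    from torch import nn

    class LaurelLRBlock(nn.Module):
        # Rough reading of LAuReL-LR (worth checking against the paper): the
        # skip path gets a learned low-rank correction instead of staying a
        # pure identity:
        #     x_{l+1} = alpha * f(x_l) + (x_l + W_B @ W_A @ x_l)
        # With D=2048 and R=64 (the Gemma 3n values mentioned above) that is
        # roughly 2*D*R extra parameters per block.
        def __init__(self, d, r):
            super().__init__()
            self.f = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            self.w_a = nn.Linear(d, r, bias=False)
            self.w_b = nn.Linear(r, d, bias=False)
            self.alpha = nn.Parameter(torch.ones(1))
            nn.init.zeros_(self.w_b.weight)   # start out as a plain residual block

        def forward(self, x):
            skip = x + self.w_b(self.w_a(x))  # low-rank-corrected residual path
            return self.alpha * self.f(x) + skip

    block = LaurelLRBlock(d=256, r=8)         # small toy sizes for the demo
    x = torch.randn(2, 16, 256)
    print(block(x).shape)                     # torch.Size([2, 16, 256])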

taykolasinski•3w ago
This is a fantastic catch. I hadn't realized Gemma 3n was already shipping with a variant of this in production.

It feels like we are entering the era of residual stream engineering. For a long time, the standard x + F(x) additive backbone was treated as untouchable. Now, between mHC (weighted scaling) and LAuReL (low-rank projections), labs are finally finding stable ways to make that signal path more dynamic.

I'm curious if the Low-Rank constraint in LAuReL acts as a natural stabilizer against the gradient explosion I saw with unconstrained hyper-connections.

Thanks for the paper link, definitely reading that tonight.

cpldcpu•3w ago
Thanks! Would be quite interesting to see how this fares compared to mHC.

I noted that LAuReL is cited in the mHC paper, but they refer to it as "expanding the width of the residual stream", which is rather odd.

Scene_Cast2•3w ago
I implemented this for a toy 8M ViT-style model. Got neutral results. This is just an anecdote and is not representative - I think mHC will help with larger parameter sizes and larger token counts.
taykolasinski•3w ago
That's interesting.

I suspect your intuition about scale is correct. The theoretical benefit of mHC is that it acts as a sort of relief valve/router for information flow in very deep/wide networks where the standard residual bottleneck becomes an issue. At 8M params, the standard residual stream is likely already perfectly adequate, so mHC might just be adding parameter overhead without solving a real signal propagation problem yet.

Quick question on your run: did you see the signal amplification/instability I saw (values growing during the forward pass)? or was it stable for you, just neutral on loss?

Scene_Cast2•3w ago
My baseline was non-HC "vanilla" residuals; I didn't do a meaningful HC run to compare.

My application has some particularities (important and easy to identify per-token signals) that result in values growing (about 3x to 10x) through layers even in the baseline.

astrange•3w ago
> Quick question on your run: did you see the signal amplification/instability I saw (values growing during the forward pass)? or was it stable for you, just neutral on loss?

I think your brain may have been taken over by ChatGPT.

theschwa•3w ago
Between the clear writing and the diagrams, this was a great write up. I had actually skipped reading up on mHC as it sounded like it was going to take some time to grok, but this made it immediately approachable. I hope you do more write ups like this in the future.
roywiggins•3w ago
imho the prose is very ChatGPT unfortunately
E-Reverance•3w ago
> Residual connections are more than a trick to help gradients flow. They’re a conservation law.

> Not a hack, not a trick. A principled constraint that makes the architecture work at scale.

DoctorOetker•3w ago
yes this reads like classic intellectual fellicitatio
jszymborski•3w ago
OK, I thought I was reading too much into it but those same sentences also jumped out for me
roywiggins•3w ago
Pangram thinks the whole thing was LLM-generated, fwiw; as dodgy as AI detectors are, it is probably among the best. I don't doubt the author started with their own text, but I think it's been substantially revised via ChatGPT.
in-silico•3w ago
Why can't you just leave H_res as the identity matrix (or just not use it at all)? In that case, the model is basically a ResNet again and you don't need to worry about exploding/vanishing gradients from H_res.

I would think that H_post and H_pre could cover the lost expressiveness.

john-titor•3w ago
Great write-up. It's been a while since I had the pleasure of reading a straightforward blog post about ML tricks that feel genuinely applicable to many use cases.
AlexCoventry•3w ago
What's the advantage of having multiple channels with separate residual connections? Why not just concatenate those channels, and do residual connections on the concatenated channel?