TimeCapsuleLLM: LLM trained only on data from 1800-1875

https://github.com/haykgrigo3/TimeCapsuleLLM
78•admp•52m ago•37 comments

LLVM: The Bad Parts

https://www.npopov.com/2026/01/11/LLVM-The-bad-parts.html
123•vitaut•2h ago•16 comments

Date is out, Temporal is in

https://piccalil.li/blog/date-is-out-and-temporal-is-in/
59•alexanderameye•1h ago•19 comments

Floppy disks turn out to be the greatest TV remote for kids

https://blog.smartere.dk/2026/01/floppy-disks-the-best-tv-remote-for-kids/
241•mchro•3h ago•148 comments

The struggle of resizing windows on macOS Tahoe

https://noheger.at/blog/2026/01/11/the-struggle-of-resizing-windows-on-macos-tahoe/
2312•happosai•20h ago•975 comments

Reproducing DeepSeek's MHC: When Residual Connections Explode

https://taylorkolasinski.com/notes/mhc-reproduction/
54•taykolasinski•2h ago•17 comments

2025 marked a record-breaking year for Apple services

https://www.apple.com/newsroom/2026/01/2025-marked-a-record-breaking-year-for-apple-services/
34•soheilpro•2h ago•35 comments

How problematic is resampling audio from 44.1 to 48 kHz?

https://kevinboone.me/sample48.html
13•brewmarche•3d ago•10 comments

Telegram recovery model allows permanent lockout after phishing

https://bugs.telegram.org/c/58477
3•saloed•13m ago•1 comment

Launch a Debugging Terminal into GitHub Actions

https://blog.gripdev.xyz/2026/01/10/actions-terminal-on-failure-for-debugging/
78•martinpeck•4h ago•23 comments

Lightpanda migrates DOM implementation to Zig

https://lightpanda.io/blog/posts/migrating-our-dom-to-zig
146•gearnode•7h ago•77 comments

Ai, Japanese chimpanzee who counted and painted dies at 49

https://www.bbc.com/news/articles/cj9r3zl2ywyo
110•reconnecting•7h ago•36 comments

CLI agents make self-hosting on a home server easier and fun

https://fulghum.io/self-hosting
688•websku•19h ago•457 comments

JRR Tolkien reads from The Hobbit for 30 Minutes (1952)

https://www.openculture.com/2026/01/j-r-r-tolkien-reads-from-the-hobbit-for-30-minutes-1952.html
228•bookofjoe•5d ago•84 comments

Personal thoughts/notes from working on Zootopia 2

https://blog.yiningkarlli.com/2025/12/zootopia-2.html
136•pantalaimon•5d ago•10 comments

The Manchester Garbage Collector and purple-garden's runtime

https://xnacly.me/posts/2026/manchester-garbage-collector/
11•xnacly•4d ago•0 comments

Apple picks Google's Gemini to power Siri

https://www.cnbc.com/2026/01/12/apple-google-ai-siri-gemini.html
139•stygiansonic•1h ago•94 comments

Windows 8 Desktop Environment for Linux

https://github.com/er-bharat/Win8DE
131•edent•3h ago•120 comments

39c3: In-house electronics manufacturing from scratch: How hard can it be? [video]

https://media.ccc.de/v/39c3-in-house-electronics-manufacturing-from-scratch-how-hard-can-it-be
208•fried-gluttony•3d ago•95 comments

Ireland fast tracks Bill to criminalise harmful voice or image misuse

https://www.irishtimes.com/ireland/2026/01/07/call-to-fast-track-bill-targeting-ai-deepfakes-and-...
71•mooreds•3h ago•50 comments

Zen-C: Write like a high-level language, run like C

https://github.com/z-libs/Zen-C
72•simonpure•3h ago•58 comments

iCloud Photos Downloader

https://github.com/icloud-photos-downloader/icloud_photos_downloader
576•reconnecting•21h ago•221 comments

This game is a single 13 KiB file that runs on Windows, Linux and in the Browser

https://iczelia.net/posts/snake-polyglot/
271•snoofydude•18h ago•68 comments

Keychron's Nape Pro turns your keyboard into a laptop‑style trackball rig

https://www.yankodesign.com/2026/01/08/keychrons-nape-pro-turns-your-mechanical-keyboard-into-a-l...
48•tortilla•2h ago•17 comments

XMPP and Metadata

https://blog.mathieui.net/xmpp-and-metadata.html
58•todsacerdoti•5d ago•16 comments

Ozempic reduced grocery spending by an average of 5.3% in the US

https://news.cornell.edu/stories/2025/12/ozempic-changing-foods-americans-buy
254•giuliomagnifico•4h ago•395 comments

Conbini Wars – Map of Japanese convenience store ratios

https://conbini.kikkia.dev/
112•zdw•5d ago•43 comments

Show HN: 30k IKEA items in flat text

https://huggingface.co/datasets/tsazan/ikea-us-commercetxt
48•tsazan•5d ago•33 comments

The next two years of software engineering

https://addyosmani.com/blog/next-two-years/
255•napolux•18h ago•272 comments

I'm making a game engine based on dynamic signed distance fields (SDFs) [video]

https://www.youtube.com/watch?v=il-TXbn5iMA
416•imagiro•4d ago•67 comments

Reproducing DeepSeek's MHC: When Residual Connections Explode

https://taylorkolasinski.com/notes/mhc-reproduction/
54•taykolasinski•2h ago

Comments

taykolasinski•2h ago
OP here. I spent the last few days reproducing the mHC architecture from the recent DeepSeek paper (2512.24880).

Two key takeaways from the reproduction:

Unconstrained Hyper-Connections really do explode (7x amplification even at 10M scale).

I hit a nasty "stream persistence" bug where my tensors were the right shape, but the architecture was functionally broken.

This is Part 1 (10M scale). Part 2 (scaling to 1B on A100s) is coming later this week. Happy to answer questions about the implementation.
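
For anyone who wants the gist without reading the whole post: the core idea is to replace the single additive residual stream with several parallel streams mixed by learnable weights. Below is a rough, illustrative PyTorch sketch of an unconstrained hyper-connection block. It is not my actual reproduction code, it is not the paper's exact formulation, and the parameter names are made up.

    import torch
    import torch.nn as nn

    class UnconstrainedHC(nn.Module):
        # Illustrative sketch only: n parallel residual streams,
        # mixed by learnable read/write weights.
        def __init__(self, d_model, n_streams=4):
            super().__init__()
            self.read = nn.Parameter(torch.randn(n_streams) / n_streams)
            self.write = nn.Parameter(torch.randn(n_streams, n_streams) / n_streams)
            self.gate = nn.Parameter(torch.randn(n_streams) / n_streams)
            # Stand-in for the attention/MLP sub-block F.
            self.f = nn.Sequential(
                nn.Linear(d_model, 4 * d_model), nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )

        def forward(self, streams):  # streams: (n_streams, batch, seq, d_model)
            x = torch.einsum("n,nbsd->bsd", self.read, streams)  # mix streams into one layer input
            y = self.f(x)
            # Unconstrained write-back: nothing keeps these weights near identity,
            # so a per-layer gain above 1 compounds across depth.
            streams = torch.einsum("nm,mbsd->nbsd", self.write, streams)
            return streams + self.gate.view(-1, 1, 1, 1) * y

    # Printing the stream norm after each block is enough to see the compounding gain:
    blocks = nn.ModuleList(UnconstrainedHC(64) for _ in range(8))
    streams = torch.randn(4, 2, 16, 64)
    for blk in blocks:
        streams = blk(streams)
        print(streams.norm().item())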

WiSaGaN•2h ago
How do you know "GPT-5, Claude, Llama, Gemini. Under the hood, they all do the same thing: x+F(x)."?
taykolasinski•2h ago
I’m referring specifically to the fundamental residual connection backbone that defines the transformer architecture (x_{l+1} = x_l + F(x_l)).

While the sub-modules differ (MHA vs GQA, SwiGLU vs GeLU, Mixture-of-Depths, etc.), the core signal propagation in Llama, Gemini, and Claude relies on that additive residual stream.

My point here is that DeepSeek's mHC challenges that fundamental additive assumption by introducing learnable weighted scaling factors to the residual path itself.

WiSaGaN•2h ago
I guess I am asking how we know Gemini and Claude rely on the additive residual stream. We don't know the architecture details for these closed models, do we?
taykolasinski•1h ago
That's a fair point. We don't have the weights or code for the closed models, so we can't be 100% certain.

However, being transformer-based (which their technical reports confirm) implies the standard pre-norm/post-norm residual block structure. Without those additive residual connections, training networks of that depth (100+ layers) becomes difficult due to the vanishing gradient problem.

If they had solved deep signal propagation without residual streams, that would likely be a bigger architectural breakthrough than the model itself (akin to Mamba/SSMs). It’s a very high-confidence assumption, but you are right that it is still an assumption.

solarkraft•2h ago
I’ve been wondering for a while: Why isn’t this architecture more common in other LLMs? The context efficiency is amazing, after all - doesn’t that translate to a lot of money at scale?
kevmo314•2h ago
It's an incremental improvement, not really a revolutionary step.

That being said, I think one could adapt an existing model to add mHC by initializing the routing matrix to the regular residual connection and then post-train the hyper connection matrices. This would let you continue training more efficiently on existing models.
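
Something along these lines, as a deliberately simplified sketch: scalar routing weights stand in for the real mixing matrices, and it is only meant to show the initialization idea.

    import torch
    import torch.nn as nn

    class RoutedResidual(nn.Module):
        # Sketch only: scalar routing weights stand in for the mHC mixing matrices.
        def __init__(self, f: nn.Module):
            super().__init__()
            self.f = f  # the existing, pre-trained attention/MLP sub-block
            # Initialized so forward() is exactly x + f(x) at step 0; post-training
            # then lets the routing drift away from the plain residual connection.
            self.skip_scale = nn.Parameter(torch.tensor(1.0))
            self.branch_scale = nn.Parameter(torch.tensor(1.0))

        def forward(self, x):
            return self.skip_scale * x + self.branch_scale * self.f(x)

With that starting point the adapted model reproduces the original model's outputs, so continued training begins from a known-good state.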

taykolasinski•1h ago
That initialization strategy (effectively starting as identity to match the standard residual stream) is clever. It would let you perform surgery on an existing model like Llama-3 and fine-tune it into an mHC architecture.

The main risk I see is that the 7x signal amplification happens very aggressively. Even with a gentle initialization, you’d likely need very strict gradient clipping or a tiny learning rate on those new routing matrices to prevent them from blowing up the pre-trained features in the first few steps.

Also, I think there's a mix-up here between mHC (this paper, expressivity) and MLA (latent attention, which provides the massive context efficiency). mHC doesn't save memory, but it might make the model 'smarter' per parameter.
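
To make that concrete, the kind of setup I have in mind is separate optimizer parameter groups plus hard clipping on just the new weights. Sketch only; the "routing" name filter is an assumption about how the new parameters would be tagged.

    import torch
    import torch.nn as nn

    def make_cautious_optimizer(model: nn.Module, routing_key: str = "routing"):
        # Sketch: give the newly added mixing/routing parameters their own tiny
        # learning rate; the pre-trained backbone trains at a normal fine-tuning rate.
        routing = [p for n, p in model.named_parameters() if routing_key in n]
        backbone = [p for n, p in model.named_parameters() if routing_key not in n]
        optimizer = torch.optim.AdamW([
            {"params": backbone, "lr": 1e-5},
            {"params": routing, "lr": 1e-6},
        ])
        return optimizer, routing

    # In the training step, after loss.backward(), clip only the new parameters
    # so they can't blow up the pre-trained features in the first few updates:
    #   torch.nn.utils.clip_grad_norm_(routing, max_norm=0.1)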

yorwba•1h ago
https://arxiv.org/abs/2512.24880 was published less than two weeks ago, which should explain why it's not more common yet. And it's not that amazing either. It's a slight quality improvement for a slight increase in cost. It's not even clear to me whether it pays for itself.
solarkraft•20m ago
My bad, I took this as something related to Multi-head Latent Attention (MLA).
sbondaryev•1h ago
Nice visualization of the residual connections. Is the animated SVG manually created or programmatically generated? What tools did you use?
taykolasinski•1h ago
Thanks! Manually created Astro components with inline SVG and CSS animations.
cpldcpu•1h ago
May be worth pointing out that this is not the first residual connection innovation to be in production.

Gemma 3n is also using a low-rank projection of the residual stream called LAuReL. Google did not publicize this too much; I noted it when poking around in the model file.

https://arxiv.org/pdf/2411.07501v3

https://old.reddit.com/r/LocalLLaMA/comments/1kuy45r/gemma_3...

Seems to be what they call LAuReL-LR in the paper, with D=2048 and R=64
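
As I read the paper (so the exact parameterization may be off), LAuReL-LR keeps the identity skip but adds a learned low-rank correction on the residual path. A rough sketch:

    import torch
    import torch.nn as nn

    class LaurelLRSkip(nn.Module):
        # Sketch of a LAuReL-LR style skip: identity plus a learned low-rank
        # correction; the d_model=2048, rank=64 defaults match the D/R above.
        def __init__(self, d_model: int = 2048, rank: int = 64):
            super().__init__()
            self.down = nn.Linear(d_model, rank, bias=False)   # D -> R
            self.up = nn.Linear(rank, d_model, bias=False)     # R -> D
            self.alpha = nn.Parameter(torch.tensor(1.0))       # scale on the sub-block output

        def forward(self, x, fx):
            # Plain residual would be x + fx; here the skip itself gets a
            # low-rank learned adjustment before the sub-block output is added.
            return x + self.up(self.down(x)) + self.alpha * fx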

taykolasinski•40m ago
This is a fantastic catch. I hadn't realized Gemma 3n was already shipping with a variant of this in production.

It feels like we are entering the era of residual stream engineering. For a long time, the standard x + F(x) additive backbone was treated as untouchable. Now, between mHC (weighted scaling) and LAuReL (low-rank projections), labs are finally finding stable ways to make that signal path more dynamic.

I'm curious if the Low-Rank constraint in LAuReL acts as a natural stabilizer against the gradient explosion I saw with unconstrained hyper-connections.

Thanks for the paper link, definitely reading that tonight.

Scene_Cast2•34m ago
I implemented this for a toy 8M ViT-style model. Got neutral results. This is just an anecdote and is not representative - I think mHC will help with larger parameter sizes and larger token counts.
taykolasinski•28m ago
That's interesting.

I suspect your intuition about scale is correct. The theoretical benefit of mHC is that it acts as a sort of relief valve/router for information flow in very deep/wide networks where the standard residual bottleneck becomes an issue. At 8M params, the standard residual stream is likely already perfectly adequate, so mHC might just be adding parameter overhead without solving a real signal propagation problem yet.

Quick question on your run: did you see the signal amplification/instability I saw (values growing during the forward pass)? Or was it stable for you, just neutral on loss?
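
(For reference, the check I mean is just logging the hidden-state norm after each block over one forward pass. Rough sketch below; `blocks` is whatever list of transformer blocks your implementation exposes and `run_forward` is a closure that runs the model on a sample batch.)

    import torch

    def residual_growth(blocks, run_forward):
        # Sketch: hook each block, run one forward pass, and report how the
        # hidden-state norm grows layer over layer.
        norms = []

        def log_norm(module, inputs, output):
            h = output[0] if isinstance(output, tuple) else output
            norms.append(h.norm().item())

        handles = [blk.register_forward_hook(log_norm) for blk in blocks]
        with torch.no_grad():
            run_forward()
        for h in handles:
            h.remove()
        # Layer-over-layer ratios; a steep, compounding ratio is the kind of
        # amplification I saw with the unconstrained variant.
        return [b / a for a, b in zip(norms, norms[1:])]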