
Attention Residuals

https://github.com/MoonshotAI/Attention-Residuals
46•GaggiX•2h ago

Comments

jszymborski•1h ago
This reminds me of the input gates of an LSTM.
jjcm•1h ago
Two things stand out to me with this:

1. Drops compute required for training by ~20%. This approach won't just help with the ever-escalating model sizes larger companies are pushing for; it also means things like autoresearch can iterate on new model architectures faster.

2. WAY lower bandwidth requirements for inference. Means with approaches like this it should run on consumer hardware far better. It apparently requires 1/6th the memory bandwidth of a traditional approach for better results.

This is a big improvement if it can be generalized. They're claiming it's a drop-in replacement, so it seems like it can be.

dvt•1h ago
> Drops compute required for training by ~20%.

This is not true. Authors claim that w.r.t. training, their method adds negligible overhead for AttnRes with no memory impact (but is way more complicated for Block AttnRes since we need to use pipelining for larger models).

> WAY lower bandwidth requirements for inference.

Also not true. Paper has nothing to do with inference, apart from the benchmarks. If you're looking at the graph about "compute advantage," it's about training compute. They do some interpolation to get to the 1.25x number, basically answering the question "if non-AttnRes architecture were trained, how much compute would it take to get to the same loss as AttnRes?" (The answer being ~20% more compute.) It's an interesting claim, but there's all kinds of weird and unexpected convergence that can happen, so take it with a grain of salt.
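A back-of-the-envelope sketch of the interpolation described above: fit loss-vs-compute curves for both architectures, then read off how much baseline compute is needed to match the AttnRes final loss. The power-law curves below are synthetic placeholders, not the paper's data.

```python
import numpy as np

# Synthetic loss-vs-compute curves (made-up power laws, NOT from the paper).
compute = np.array([1.0, 2.0, 4.0, 8.0, 16.0])   # training FLOPs, arbitrary units
attnres_loss = 2.9 * compute ** -0.15            # AttnRes: slightly better at equal compute
target = attnres_loss[-1]                        # final AttnRes loss at C = 16

# Baseline curve on a wider grid so the interpolation covers the target loss.
base_c = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
baseline_loss = 3.0 * base_c ** -0.15

# Interpolate in log-log space: what baseline compute reaches the target loss?
# np.interp needs ascending x, so reverse the (descending) loss curve.
log_c = np.interp(np.log(target),
                  np.log(baseline_loss[::-1]),
                  np.log(base_c[::-1]))
matched_compute = np.exp(log_c)
advantage = matched_compute / compute[-1]
print(f"baseline needs ~{advantage:.2f}x the compute to match the AttnRes loss")
```

With these particular synthetic curves the ratio comes out near the ~1.25x figure mentioned above, but the point is the procedure, not the number.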

com2kid•1h ago
> 2. WAY lower bandwidth requirements for inference. Means with approaches like this it should run on consumer hardware far better. It apparently requires 1/6th the memory bandwidth of a traditional approach for better results.

That should be the headline right there. Giant size-60-font headline.

Some people have PhDs in burying the lede!

talloaktrees•27m ago
except it's not true
westurner•1h ago
ScholarlyArticle: "Attention Residuals" (2026) https://arxiv.org/abs/2603.15031 :

> Abstract: Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. [...]
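A minimal numpy sketch of the idea in the abstract, under simplifying assumptions (the query projection, scoring rule, and shapes here are illustrative, not the paper's exact formulation): instead of summing all preceding layer outputs with fixed unit weights, the layer softmax-attends over the stack of preceding outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, L = 16, 6                        # hidden size, layers so far (toy sizes)
outputs = rng.normal(size=(L, d))   # stacked outputs of layers 0..L-1

# Standard PreNorm residual stream: uniform accumulation; norm grows with depth.
h_standard = outputs.sum(axis=0)

# AttnRes-style aggregation (illustrative): a query scores each preceding
# layer output, and the residual is their softmax-weighted (convex) sum,
# so its norm stays bounded by the largest individual output.
W_q = rng.normal(size=(d, d)) / np.sqrt(d)   # hypothetical learned projection
query = outputs[-1] @ W_q
scores = outputs @ query / np.sqrt(d)        # one score per preceding layer
weights = softmax(scores)                    # learned, input-dependent weights
h_attnres = weights @ outputs

print("standard residual norm:", np.linalg.norm(h_standard))
print("attnres  residual norm:", np.linalg.norm(h_attnres))
```

The contrast the abstract draws falls out directly: the uniform sum's norm grows with depth, while the attention-weighted version is a convex combination with bounded norm.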

jryio•40m ago
This is the key piece

> Full AttnRes is straightforward but requires O(Ld) memory at scale. Block AttnRes partitions layers into N blocks, accumulates within each block via standard residuals, and applies attention only over block-level representations. With ~8 blocks, it recovers most of Full AttnRes's gains while serving as a practical drop-in replacement with marginal overhead.
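The quoted trade-off can be sketched in a few lines (toy shapes and scoring rule are assumptions, not the paper's): accumulate normally within each block, then attend only over the N block-level representations, so the resident state is O(Nd) rather than O(Ld).

```python
import numpy as np

rng = np.random.default_rng(1)
L, d, N = 24, 16, 8                # layers, hidden size, blocks (toy sizes)
outputs = rng.normal(size=(L, d))  # per-layer outputs

# Standard residual accumulation within each block of L/N layers.
blocks = outputs.reshape(N, L // N, d).sum(axis=1)   # (N, d) block reps

# Attention over the N block representations only (illustrative scoring).
query = blocks[-1]
scores = blocks @ query / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
h = weights @ blocks

print(f"representations kept: {blocks.shape[0]} blocks instead of {L} layers")
```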

Murfalo•14m ago
Amazingly, the first author is a high school student! https://nathanchen.me/public/About%20me.html

France's aircraft carrier located in real time by Le Monde through fitness app

https://www.lemonde.fr/en/international/article/2026/03/20/stravaleaks-france-s-aircraft-carrier-...
358•MrDresden•7h ago•317 comments

Attention Residuals

https://github.com/MoonshotAI/Attention-Residuals
46•GaggiX•2h ago•8 comments

VisiCalc Reconstructed

https://zserge.com/posts/visicalc/
133•ingve•3d ago•54 comments

The Los Angeles Aqueduct Is Wild

https://practical.engineering/blog/2026/3/17/the-los-angeles-aqueduct-is-wild
234•michaefe•3d ago•128 comments

Parallel Perl – autoparallelizing interpreter with JIT

https://perl.petamem.com/gpw2026/perl-mit-ai-gpw2026.html#/4/1/1
77•bmn__•2d ago•30 comments

Delve – Fake Compliance as a Service

https://deepdelver.substack.com/p/delve-fake-compliance-as-a-service
377•freddykruger•1d ago•129 comments

Entso-E final report on Iberian 2025 blackout

https://www.entsoe.eu/publications/blackout/28-april-2025-iberian-blackout/
157•Rygian•9h ago•59 comments

Launch HN: Sitefire (YC W26) – Automating actions to improve AI visibility

24•vincko•3h ago•20 comments

The Social Smolnet

https://ploum.net/2026-03-20-social-smolnet.html
89•aebtebeten•7h ago•11 comments

The worst volume control UI in the world

https://uxdesign.cc/the-worst-volume-control-ui-in-the-world-60713dc86950
21•andsoitis•2d ago•12 comments

Super Micro Shares Plunge 25% After Co-Founder Charged in $2.5B Smuggling Plot

https://www.forbes.com/sites/tylerroush/2026/03/20/super-micro-shares-plunge-25-after-co-founder-...
250•pera•6h ago•116 comments

Video Encoding and Decoding with Vulkan Compute Shaders in FFmpeg

https://www.khronos.org/blog/video-encoding-and-decoding-with-vulkan-compute-shaders-in-ffmpeg
132•y1n0•3d ago•47 comments

Flash-KMeans: Fast and Memory-Efficient Exact K-Means

https://arxiv.org/abs/2603.09229
151•matt_d•3d ago•12 comments

Regex Blaster

https://mdp.github.io/regex-blaster/
119•mdp•2d ago•48 comments

Just Put It on a Map

https://progressandpoverty.substack.com/p/just-put-it-on-a-map
119•surprisetalk•4d ago•61 comments

Java is fast, code might not be

https://jvogel.me/posts/2026/java-is-fast-your-code-might-not-be/
155•siegers•7h ago•157 comments

ArXiv declares independence from Cornell

https://www.science.org/content/article/arxiv-pioneering-preprint-server-declares-independence-co...
682•bookstore-romeo•16h ago•233 comments

HP trialed mandatory 15-minute support call wait times (2025)

https://arstechnica.com/gadgets/2025/02/misguided-hp-customer-support-approach-included-forced-15...
281•felineflock•7h ago•181 comments

90% of crypto's Illinois primary spending failed to achieve its objective

https://www.mollywhite.net/micro/entry/202603172318
111•speckx•4h ago•77 comments

Too Much Color

https://www.keithcirkel.co.uk/too-much-color/
88•maguay•2d ago•50 comments

FSF statement on copyright infringement lawsuit Bartz v. Anthropic

https://www.fsf.org/blogs/licensing/2026-anthropic-settlement
224•m463•4d ago•107 comments

Randomization in Controlled Experiments

https://queue.acm.org/detail.cfm?id=3778029
16•pykq•3d ago•2 comments

The Soul of a Pedicab Driver

https://www.sheldonbrown.com/pedicab.html
123•haritha-j•11h ago•35 comments

Chuck Norris has died

https://variety.com/2026/film/news/chuck-norris-dead-walker-texas-ranger-dies-1236694953/
641•mp3il•6h ago•385 comments

The bespoke software revolution? I'm not buying it

https://world.hey.com/jason/the-bespoke-software-revolution-i-m-not-buying-it-4bfad9ec
8•FireBy2024•30m ago•0 comments

Show HN: An open-source safety net for home hemodialysis

https://safehemo.com/
12•qweliantanner•3d ago•5 comments

Full Disclosure: A Third (and Fourth) Azure Sign-In Log Bypass Found

https://trustedsec.com/blog/full-disclosure-a-third-and-fourth-azure-sign-in-log-bypass-found
277•nyxgeek•19h ago•88 comments

Drugwars for the TI-82/83/83 Calculators (2011)

https://gist.github.com/mattmanning/1002653/b7a1e88479a10eaae3bd5298b8b2c86e16fb4404
259•robotnikman•20h ago•73 comments

Drawvg Filter for FFmpeg

https://ayosec.github.io/ffmpeg-drawvg/
163•nolta•3d ago•26 comments

MacBook M5 Pro and Qwen3.5 = Local AI Security System

https://www.sharpai.org/benchmark/
141•aegis_camera•4h ago•133 comments