TransMLA: Multi-head latent attention is all you need

https://arxiv.org/abs/2502.07864
123•ocean_moist•9mo ago

Comments

olq_plo•9mo ago
Very cool idea. Can't wait for converted models on HF.
MichaelMoser123•9mo ago
DeepSeek-V2, V3, and R1 all use multi-head latent attention (MLA).
kavalg•9mo ago
My (possibly wrong) TLDR: TransMLA is a method to "compress" an already trained GQA model, with the additional option to fine-tune it further. It should make inference faster.
freeqaz•9mo ago
Also makes models smarter ("expressive")
yorwba•9mo ago
It is not a method to compress a Grouped-Query Attention model, but to expand it into an equivalent Multi-head Latent Attention model with the same key-value cache size but larger effective key/value vectors and a correspondingly larger number of trainable parameters. With additional training, you can then obtain a better model that only uses a little bit more memory.
kavalg•9mo ago
Thanks for the clarification.
wiz21c•9mo ago
Not quite related, but are the Mamba models gaining ground?

Answering my own question: https://www.reddit.com/r/MachineLearning/comments/1hpg91o/d_...

EGreg•9mo ago
All you need is to stop posting titles like that!
jbellis•9mo ago
[abstract] This approach significantly reduces the KV cache size relative to traditional multi-head attention

[3.3] For saving the KV cache, only the intermediate latent representations need to be stored: [latex] where r is much smaller than n_h · d_h

[background] In traditional multi-head attention you must cache full key and value matrices of size T x (n_h · d_h), where T is the token length, n_h is the number of attention heads, and d_h is the dimensionality of each individual head

sounds like a big win for memory constrained environments like local inference
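
To put rough numbers on the quoted sizes, here is a minimal sketch that counts entries in one cached matrix of shape T x width. The concrete values of T, n_h, d_h and r are hypothetical and not taken from the paper, and the sketch assumes the same count applies to whichever matrices end up being cached.

```python
def cache_entries(tokens: int, width: int) -> int:
    """Number of entries in one cached matrix of shape tokens x width."""
    return tokens * width


# Hypothetical sizes chosen for illustration only (not from the paper).
T = 4096    # token length
n_h = 32    # number of attention heads
d_h = 128   # dimensionality of each head
r = 512     # latent width, assumed much smaller than n_h * d_h

full = cache_entries(T, n_h * d_h)   # full-width cache: T x (n_h * d_h)
latent = cache_entries(T, r)         # latent cache: T x r
ratio = full / latent                # how many times smaller the latent cache is

print(full, latent, ratio)
```

With these made-up numbers the latent cache is a factor of n_h * d_h / r smaller, which is where the appeal for memory-constrained local inference comes from.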

magicalhippo•9mo ago
I'm just following the field from the sidelines, but this looks interesting to me. Especially the increase in expressiveness that the new model allows for over GQA, at the cost of just ~10% more memory, and the fact that you can convert existing GQA models like LLaMA, Qwen etc with just a bit of fine-tuning.

Perhaps a trivial insight, but I feel a lot of progress comes in the form of generalizations, where existing approaches can be seen as special cases. Here the authors show that Grouped-Query Attention (GQA) and Multi-Query Attention (MQA) fall out as special cases of their new model.

edit:

Adding my own summary, as I understand it.

The key to what they're doing, no pun intended, is to rely on the fact that large, high-dimensional matrices may contain a lot of redundant information. Thus one may be able to find a good approximation which has less redundant information, by going through an intermediary stage which has fewer dimensions.

An n-by-m matrix A takes n-dimensional vectors and transforms them to m-dimensional vectors. The trick here is to replace matrix A by two matrices, L and R, which are n-by-r and r-by-m respectively, where r is smaller than n and m. This is called a low-rank approximation.

In a sense you're "straining the matrix", by forcing the information to pass through an intermediary, low-dimensional vector.

The memory savings come from the fact that matrix A has n*m entries, while L and R have n*r and r*m entries respectively. Say n = m = 100 and r = 20, that means A has 100*100 = 10k entries, while L and R have just 100*20 + 20*100 = 4k entries in total.
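
The arithmetic above can be checked with a small sketch. The function names are my own, and the break-even rank is something I derived from the same entry counts rather than a figure from the paper.

```python
def dense_entries(n: int, m: int) -> int:
    """Entries in a full n-by-m matrix A."""
    return n * m


def factored_entries(n: int, m: int, r: int) -> int:
    """Entries in L (n-by-r) plus R (r-by-m)."""
    return n * r + r * m


n, m, r = 100, 100, 20

dense = dense_entries(n, m)            # 100*100
factored = factored_entries(n, m, r)   # 100*20 + 20*100

# The factorization only saves memory while n*r + r*m < n*m,
# i.e. while r stays below n*m / (n + m).
break_even_rank = (n * m) / (n + m)

print(dense, factored, break_even_rank)
```

For n = m = 100 the break-even rank is 50, so r = 20 sits comfortably on the saving side.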

The trick itself is not new; for example, it is also used in LoRA, where an additional low-rank approximation matrix is used to tweak the output of an existing model. The low rank means there are far fewer matrix entries, aka parameters, to train than if one had used a regular fully dense matrix.
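
As a rough illustration of the LoRA idea, here is a toy sketch in plain Python. The names W, A and B and all the numbers are my own choices, and the zero initialization of B is one common convention assumed here so that the adapted model starts out identical to the frozen one.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add(a, b):
    """Elementwise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Frozen 3-by-3 weight matrix (hypothetical values).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [3.0, 0.0, 1.0]]

# Trainable low-rank pair with rank r = 1: A is 3-by-1, B is 1-by-3.
A = [[0.5], [-1.0], [2.0]]
B = [[0.0, 0.0, 0.0]]      # assumed zero init: the update starts as a no-op

x = [[1.0, 2.0, 3.0]]      # one input row vector

base = matmul(x, W)                            # output of the frozen model
adapted = add(base, matmul(matmul(x, A), B))   # x W + (x A) B

frozen_params = 3 * 3
trainable_params = 3 * 1 + 1 * 3

print(base, adapted, frozen_params, trainable_params)
```

Only A and B would be updated during fine-tuning, so here 6 entries are trained instead of 9; the gap grows quickly for realistic matrix sizes.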

The extra expressiveness of MLA comes from the fact that in GQA, in order to save memory, some of the matrices are actually built by gluing copies of a narrower matrix together. This means the information in the glued-up matrices is very redundant and fixed in a certain way, and thus they are restricted in how they can transform the inputs.

By using the low-rank approximation instead, the information in the full, reconstructed matrices is not fixed in the same way as in the glued-up result. Thus the inputs can be transformed in a less restrictive way, leading to the increase in expressiveness.

The GQA method saves a bit more memory compared to MLA as the narrower matrices are even smaller than the low-rank matrices in MLA, but at the cost of expressiveness.
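
The gluing picture can be made concrete with a toy sketch, as I understand it. Gluing two copies of a narrow matrix W side by side gives the same result as multiplying W by a fixed matrix made of identity blocks, so the glued matrix is already a low-rank factorization with a frozen right factor. The matrices and names below are hypothetical, and the real method works on attention projections rather than 3-by-2 toys.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def identity(d):
    """d-by-d identity matrix."""
    return [[1 if i == j else 0 for j in range(d)] for i in range(d)]


def hconcat(*blocks):
    """Glue matrices with equal row counts side by side."""
    return [sum((blk[i] for blk in blocks), []) for i in range(len(blocks[0]))]


W = [[1, 2],
     [3, 4],
     [5, 6]]                      # narrow 3-by-2 matrix

glued = hconcat(W, W)             # GQA-style: two copies glued into 3-by-4

# Same result as a factorization W @ R with a fixed R = [I | I].
R_fixed = hconcat(identity(2), identity(2))
factored = matmul(W, R_fixed)

# If R is allowed to be any 2-by-4 matrix, the second block no longer
# has to be a copy of the first.
R_free = [[1, 0, 2, -1],
          [0, 1, 1, 3]]
free = matmul(W, R_free)

print(glued)
print(factored)
print(free)
```

Both products pass through the same 2-dimensional bottleneck, but only the free version can make the second block differ from the first, which is the extra expressiveness described above.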

killerstorm•9mo ago
Another paper related to attention distillation, although doing something far more radical: transformer attention is distilled into an RWKV-like model: https://huggingface.co/papers/2505.03005
karmakaze•9mo ago
I'm not "in the field" though I like to read about and use LLMs. This video "How DeepSeek Rewrote the Transformer [MLA]"[0] is really good at explaining MHA, MQA, GQA, and MLA with clear visuals/animations and how DeepSeek MLA is 57x more efficient.

[0] https://www.youtube.com/watch?v=0VLAoVGf_74&t=960s