Do transformers need three projections? Systematic study of QKV variants

https://arxiv.org/abs/2606.04032
58•Anon84•1h ago

Comments

xiaoyu2006•1h ago
It would be great and amusing if it actually turned out that we have been making transformers overly complex. The code repo is missing, though...
ares623•1h ago
Gets the juices flowing though..
amluto•59m ago
Hint for authors: when discussing linear algebra (or really most other kinds of math), follow normal conventions. In this case, the convention would be that - (the minus sign) means subtraction. It does not mean "and also", especially when you sandwich it between two variables that represent matrices.

I read the paper with much head scratching all the way through sections 1 and 2 and part of 3 before I figured out that, no, really, the description "Q-K=V" does not mean "Q minus K equals V" (the head scratching was because a bunch of their descriptions and symmetry comments really make little sense if you think "Q minus K equals V"). If you want to say that "K equals V", please spell it "K=V" :)

I am curious whether it makes any sense at all to enforce a more general linear constraint on the query, key and value attention matrices along the lines of Q-K=V.

It is an entertaining paper. I admit I'm surprised that K=V appears to work as well as it does -- it seems like it's almost enforcing a sort of model where the query is a guess as to what the value is and the attention head returns a (softmaxed) value that is closest to the query's guess. Maybe it works because the sequences are short and the dimension is high and there's plenty of room for interesting results to fit in the merged key/value space.
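To make that intuition concrete, here is a minimal sketch of single-head attention where keys and values share one projection. It is not the paper's code (the repo is missing); the names `tied_kv_attention`, `W_q` and `W_kv` are made up for illustration, and it skips causal masking, multiple heads and the output projection.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def tied_kv_attention(X, W_q, W_kv):
    # Hypothetical K=V variant: one projection serves as both key and value.
    Q = [matvec(W_q, x) for x in X]
    KV = [matvec(W_kv, x) for x in X]
    d = len(KV[0])
    out = []
    for q in Q:
        # Score each position by how close the query's "guess" is to its key/value.
        scores = [sum(a * b for a, b in zip(q, kv)) / math.sqrt(d) for kv in KV]
        weights = softmax(scores)
        # Return the softmax-weighted mix of the same vectors that were scored.
        out.append([sum(w * kv[j] for w, kv in zip(weights, KV)) for j in range(d)])
    return out

identity = [[1.0, 0.0], [0.0, 1.0]]
X = [[10.0, 0.0], [0.0, 10.0]]
result = tied_kv_attention(X, identity, identity)
```

With identity projections and two well-separated inputs, each query lands almost entirely on the vector it already matches, which is the "return the value closest to the guess" behavior described above.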

xiaoyu2006•52m ago
Yeah, the weird notation confused me too. Their own Limitations section also says their experiments are too small. I am quite curious how it will play out at larger scale, but unironically I cannot afford the hardware lol.
amemi•19m ago
> Maybe it works because the sequences are short and the dimension is high and there's plenty of room for interesting results to fit in the merged key/value space.

In fact, on the second-to-last page of the paper, they discuss this very problem. Performance clearly correlates with sequence length for the Q-K=V model. Although limited to just three sequence lengths (512, 1024, and 2048), the degradation shrinks from 5.4% to 2.2% as context increases, suggesting that short sequences are unlikely to be the reason K=V performs acceptably.

in-silico•58m ago
These types of ablation studies are always good. However, I'm not sure how generalizable the language model findings here are.

Their 1.2B model was trained on only 10B tokens, which is less than half of the Chinchilla compute-optimal number. Modern overtrained 1B LLMs are trained on the order of 10T tokens (1000x more).
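For anyone who wants to check the arithmetic, here is a rough sketch. It assumes the commonly cited rule of thumb of about 20 training tokens per parameter for Chinchilla-optimal training; that ratio is an approximation rather than an exact constant, and the variable names are just for illustration.

```python
PARAMS = 1.2e9            # model size mentioned above
TRAIN_TOKENS = 10e9       # tokens the model was trained on
TOKENS_PER_PARAM = 20     # assumption: approximate Chinchilla rule of thumb
OVERTRAINED_TOKENS = 10e12  # order of magnitude for modern overtrained ~1B models

# Compute-optimal token budget under the assumed ratio.
optimal_tokens = PARAMS * TOKENS_PER_PARAM

# Fraction of that budget actually used, and the gap to overtrained models.
fraction_of_optimal = TRAIN_TOKENS / optimal_tokens
overtrain_ratio = OVERTRAINED_TOKENS / TRAIN_TOKENS

print(f"optimal tokens: {optimal_tokens:.3g}")
print(f"fraction of optimal: {fraction_of_optimal:.2f}")
print(f"overtrained / this run: {overtrain_ratio:.0f}x")
```

Under that assumption the optimal budget is roughly 24B tokens, so 10B is about 42% of it, and 10T tokens is 1000x the training run discussed here.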

This is important because, from my own experience, simplifications and alternatives to standard attention can look fine in the under-trained regime but lag after over-training. This happens because attention has very little out-of-the-gate inductive bias, so it takes a lot of training for the expressiveness to really shine through.

I can't fault the authors since longer training runs cost money, but it warrants pointing out.

I'm also disappointed that they didn't report reasoning benchmark results for the Q=K-V case, since that is by far the most theoretically interesting case (in my eyes).

jephs•14m ago
I'm terribly sorry, but scaling curves or GTFO. Any random pile of linear algebra works fine-ish at small scales. Very few random piles of linear algebra push the Pareto envelope at large scales.
Lerc•8m ago
I can see why QKV gets used, but I can't help but think that there's got to be a better mechanism for turning a pair of vectors into a new vector and a significance field.

Geometrically, I imagine the process of attention like picking up a bunch of vectors and spinning and squishing them in many-D until you can find a crack where you can see all the way through, then leveraging that crack to separate what you want.

I doubt that's strictly accurate, but it might be close enough that it makes me think that if you were doing that with a bunch of bananas, it would be much easier to find the way through if you could also bend the bunch so they were all straight.

It's always the trade-off of a smart complex operation against an absolute crapload of dumb ones.

brianjmingus•2m ago
See also:

Transformer Golf – The Unrolled Transformer https://github.com/mingusb/transformer-golf
