Do transformers need three projections? Systematic study of QKV variants

https://arxiv.org/abs/2606.04032
56•Anon84•1h ago

Comments

xiaoyu2006•56m ago
It would be great and amusing if it actually turns out that we have been making transformers overly complex. The code repo is missing, though...
ares623•53m ago
Gets the juices flowing though..
amluto•43m ago
Hint for authors: when discussing linear algebra (or really most other kinds of math), follow normal conventions. In this case, the convention would be that - (the minus sign) means subtraction. It does not mean "and also", especially when you sandwich it between two variables that represent matrices.

I read the paper with much head scratching all the way through sections 1 and 2 and part of 3 before I figured out that, no, really, the description "Q-K=V" does not mean "Q minus K equals V" (the head scratching was because a bunch of their descriptions and symmetry comments really make little sense if you think "Q minus K equals V"). If you want to say that "K equals V", please spell it "K=V" :)

I am curious whether it makes any sense at all to enforce a more general linear constraint on the query, key and value attention matrices, along the lines of a literal Q-K=V (with the minus sign meaning subtraction).

It is an entertaining paper. I admit I'm surprised that K=V appears to work as well as it does -- it seems like it's almost enforcing a sort of model where the query is a guess as to what the value is and the attention head returns a (softmaxed) value that is closest to the query's guess. Maybe it works because the sequences are short and the dimension is high and there's plenty of room for interesting results to fit in the merged key/value space.
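
To make the "query is a guess at the value" reading concrete, here is a minimal sketch of single-head attention in which keys and values come from one shared projection. This is only an illustration of the K=V idea as described above, not the paper's code (the repo is not available); the function name `shared_kv_attention`, the weight names `w_q` and `w_kv`, and the toy inputs are all assumptions, and the causal mask and output projection are left out.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def shared_kv_attention(x, w_q, w_kv):
    """Single-head attention where keys and values share one projection (K = V).

    x is n x d_model; w_q and w_kv are d_model x d_head (hypothetical names).
    No causal mask and no output projection: illustration only.
    """
    q = matmul(x, w_q)
    kv = matmul(x, w_kv)                      # one projection plays both roles
    d_head = len(w_q[0])
    kv_t = [list(col) for col in zip(*kv)]    # transpose so scores = q @ kv^T
    scores = matmul(q, kv_t)
    weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
    # Each output row is a convex combination of the same vectors the query
    # was scored against, weighted towards the ones it matched best.
    return matmul(weights, kv), weights

# Toy input: two one-hot tokens, an identity shared projection, and a scaled
# identity query projection so each query points strongly at its own token.
x = [[1.0, 0.0], [0.0, 1.0]]
w_q = [[4.0, 0.0], [0.0, 4.0]]
w_kv = [[1.0, 0.0], [0.0, 1.0]]
out, weights = shared_kv_attention(x, w_q, w_kv)
```

With these toy weights, each output row lands close to the key/value vector its query was aimed at, which is the "return the stored vector nearest to the guess" behaviour described above.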

xiaoyu2006•36m ago
Yeah, the weird notation confused me too. Their own Limitations section also says their experiments are too small. I am quite curious how it will play out at scale, but unironically I cannot afford the hardware lol.
amemi•3m ago
> Maybe it works because the sequences are short and the dimension is high and there's plenty of room for interesting results to fit in the merged key/value space.

In fact, on the second-to-last page of the paper, they discuss this very problem. There seems to be a linear correlation between performance and sequence length for the Q-K=V model. While limited to a small n=3 sample of 512, 1024, and 2048 lengths, this does suggest that short sequences are unlikely to be the reason K=V performs acceptably.

in-silico•42m ago
These types of ablation studies are always good. However, I'm not sure how generalizable the language model findings here are.

Their 1.2B model was trained on only 10B tokens, which is less than half of the Chinchilla compute-optimal number. Modern overtrained 1B LLMs are trained on the order of 10T tokens (1000x more).
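
For anyone who wants to check the arithmetic, here is a rough sketch. It assumes the common rule of thumb of about 20 training tokens per parameter as the Chinchilla compute-optimal ratio; that ratio and the name `compute_optimal_tokens` are assumptions for illustration, not numbers or code taken from the paper.

```python
# Assumption: roughly 20 training tokens per parameter is the usual shorthand
# for the Chinchilla compute-optimal ratio. It is a rule of thumb, not exact.
TOKENS_PER_PARAM = 20

def compute_optimal_tokens(n_params, tokens_per_param=TOKENS_PER_PARAM):
    """Approximate compute-optimal token budget for a model of n_params."""
    return n_params * tokens_per_param

params = 1_200_000_000                  # 1.2B parameters
trained_tokens = 10_000_000_000         # 10B tokens used in the paper
overtrained_tokens = 10_000_000_000_000 # ~10T tokens for modern 1B models

optimal = compute_optimal_tokens(params)                 # 24B tokens
fraction_of_optimal = trained_tokens / optimal           # about 0.42
overtraining_ratio = overtrained_tokens / trained_tokens # 1000x

print(f"compute-optimal budget: {optimal / 1e9:.0f}B tokens")
print(f"paper's budget as a fraction of that: {fraction_of_optimal:.2f}")
print(f"modern overtrained budget vs paper: {overtraining_ratio:.0f}x")
```

Under that assumed ratio, 10B tokens is a bit over 40% of the compute-optimal budget, which matches the "less than half" figure.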

This is important because, from my own experience, simplifications and alternatives to standard attention can look fine in the under-trained regime but lag after over-training. This happens because attention has very little out-of-the-gate inductive bias, so it takes a lot of training for the expressiveness to really shine through.

I can't fault the authors since longer training runs cost money, but it warrants pointing out.

I'm also disappointed that they didn't report reasoning benchmark results for the Q=K-V case, since that is by far the most theoretically interesting case (in my eyes).

Show HN: ControllerTest-test gamepads,stick drift and polling rate by browser

https://controllertestonline.com/
1•zylics•3m ago•0 comments

QUIC (Quick UDP Internet Connections) Transport Layer Network Protocol

https://en.wikipedia.org/wiki/QUIC
1•Brysonbw•4m ago•0 comments

Only favors Indian-origin candidates: Chinese-American professor sues Texas univ

https://www.msn.com/en-in/news/world/only-favors-indian-origin-candidates-chinese-american-profes...
2•alecco•10m ago•1 comments

Lessons from building Claude Code: How we use skills

https://claude.com/blog/lessons-from-building-claude-code-how-we-use-skills
1•geoffbp•16m ago•0 comments

Jared Diamond's Collapse, Chaco Canyon, Why Agency Matters to Understand History

https://greysidewalk.substack.com/p/jared-diamonds-collapse-chaco-canyon
1•pseudolus•18m ago•0 comments

Disprove – a backtest auditor that rejected my own strategy

https://github.com/866y4tb8hc-coder/disprove
1•movadims•19m ago•0 comments

Alibaba/Open-Code-Review

https://github.com/alibaba/open-code-review
2•geoffbp•20m ago•0 comments

Bricks and Minifigs Parts Ways with Franchise Owners

https://bricksandminifigs.com/blog/blog/2026/06/04/bricks-and-minifigs-salem-joshua-johnson-brand...
1•cheschire•24m ago•1 comments

Build your own college curriculum with OpenLibrary

https://openlibrary.org/search/howto/more
1•d0able•25m ago•0 comments

Using AI to Ship a Real Product Without Losing the Plot

https://mckerlie.com/posts/building-calledup-using-ai-to-ship-a-real-product-without-losing-the-p...
1•silent1mezzo•25m ago•0 comments

Found: Milky Way black hole's missing wind

https://news.northwestern.edu/stories/2026/06/found-milky-way-black-holes-missing-wind
1•wglb•26m ago•1 comments

Congress rejects Khanna's attempt to stop deeper US/Israel military integration

https://twitter.com/DropSiteNews/status/2062613067859706230
1•sosomoxie•26m ago•0 comments

Local 'Little Red Dots' stay eerily steady for up to 15 years

https://sciencex.com/news/2026-06-local-red-dots-stay-eerily.html
1•wglb•28m ago•1 comments

S&P Global keeps fast index entry rules unchanged as SpaceX listing looms

https://www.reuters.com/business/finance/sp-global-keeps-fast-entry-proposal-unchanged-spacex-lis...
7•JumpCrisscross•29m ago•2 comments

Transformer Golf – The Unrolled Transformer

https://github.com/mingusb/transformer-golf
1•brianjmingus•33m ago•0 comments

Starcloud hits $1.1B valuation to build space-based data centers

https://www.geekwire.com/2026/orbital-ai-seattle-area-startup-starcloud-hits-1-1b-valuation-to-bu...
1•bko•33m ago•0 comments

New light-powered chip could accelerate AI and quantum computing

https://www.sciencedaily.com/releases/2026/06/260601025343.htm
1•pcael•34m ago•0 comments

Ask HN: Is Everyone an Engineer Now?

3•piratesAndSons•35m ago•2 comments

S&P will not change its rules to get SpaceX in early

https://www.axios.com/2026/06/04/musk-spacex-ipo-sp-investors
2•boguscoder•37m ago•0 comments

South Korean Forums Will Need to Scan Every Images with AI Censorship Tools

https://discuss.privacyguides.net/t/south-korean-online-communities-will-need-to-scan-every-image...
1•Cider9986•39m ago•1 comments

The impact of linguistic features on CTR in Instagram ads

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0338313
1•PaulHoule•39m ago•0 comments

Tilling the Garden: Use AI differently to make interesting and useful apps

https://mikecaulfield.substack.com/p/tilling-the-garden-a-different-way
1•lemming•40m ago•0 comments

Senior U.S. Officials Eye Government Shares in AI Giants

https://www.notus.org/technology/trump-ai-stake-openai
1•spenvo•42m ago•0 comments

What was your "Oh Shit" moment with GenAI?

1•andrehacker•42m ago•2 comments

China's solar majors charge into batteries as panel sales falter

https://www.reuters.com/business/energy/chinas-solar-majors-charge-into-batteries-panel-sales-fal...
1•JumpCrisscross•44m ago•0 comments

Flesh-eating screwworm confirmed in Texas calf as parasite crosses from Mexico

https://www.reuters.com/business/healthcare-pharmaceuticals/unconfirmed-us-case-flesh-eating-scre...
1•JumpCrisscross•45m ago•0 comments

SpaceX IPO Website

https://www.spacexipo.com
2•malshe•45m ago•0 comments

Microsoft to tighten human rights measures after inquiry into Israel deals

https://www.theguardian.com/technology/2026/jun/04/microsoft-to-tighten-human-rights-measures-aft...
1•lorecore•48m ago•0 comments

S&P Dow Jones Indices Consultation on Treatment of MegaCap Companies

https://press.spglobal.com/2026-06-04-S-P-Dow-Jones-Indices-Consultation-on-Treatment-of-MegaCap-...
2•pradn•49m ago•0 comments

ProlificTea

1•TheProfitKing•50m ago•0 comments