frontpage.

nextTick but for React.js

https://suhaotian.github.io/use-next-tick/
1•jeremy_su•1m ago•0 comments

Show HN: I Built an AI-Powered Pull Request Review Tool

https://github.com/HighGarden-Studio/HighReview
1•highgarden•1m ago•0 comments

Git-am applies commit message diffs

https://lore.kernel.org/git/bcqvh7ahjjgzpgxwnr4kh3hfkksfruf54refyry3ha7qk7dldf@fij5calmscvm/
1•rkta•4m ago•0 comments

ClawEmail: 1min setup for OpenClaw agents with Gmail, Docs

https://clawemail.com
1•aleks5678•11m ago•1 comments

UnAutomating the Economy: More Labor but at What Cost?

https://www.greshm.org/blog/unautomating-the-economy/
1•Suncho•17m ago•1 comments

Show HN: Gettorr – Stream magnet links in the browser via WebRTC (no install)

https://gettorr.com/
1•BenaouidateMed•18m ago•0 comments

Statin drugs safer than previously thought

https://www.semafor.com/article/02/06/2026/statin-drugs-safer-than-previously-thought
1•stareatgoats•20m ago•0 comments

Handy when you just want to distract yourself for a moment

https://d6.h5go.life/
1•TrendSpotterPro•22m ago•0 comments

More States Are Taking Aim at a Controversial Early Reading Method

https://www.edweek.org/teaching-learning/more-states-are-taking-aim-at-a-controversial-early-read...
1•lelanthran•23m ago•0 comments

AI will not save developer productivity

https://www.infoworld.com/article/4125409/ai-will-not-save-developer-productivity.html
1•indentit•28m ago•0 comments

How I do and don't use agents

https://twitter.com/jessfraz/status/2019975917863661760
1•tosh•34m ago•0 comments

BTDUex Safe? The Back End Withdrawal Anomalies

1•aoijfoqfw•37m ago•0 comments

Show HN: Compile-Time Vibe Coding

https://github.com/Michael-JB/vibecode
5•michaelchicory•39m ago•1 comments

Show HN: Ensemble – macOS App to Manage Claude Code Skills, MCPs, and Claude.md

https://github.com/O0000-code/Ensemble
1•IO0oI•43m ago•1 comments

PR to support XMPP channels in OpenClaw

https://github.com/openclaw/openclaw/pull/9741
1•mickael•43m ago•0 comments

Twenty: A Modern Alternative to Salesforce

https://github.com/twentyhq/twenty
1•tosh•45m ago•0 comments

Raspberry Pi: More memory-driven price rises

https://www.raspberrypi.com/news/more-memory-driven-price-rises/
2•calcifer•50m ago•0 comments

Level Up Your Gaming

https://d4.h5go.life/
1•LinkLens•54m ago•1 comments

Di.day is a movement to encourage people to ditch Big Tech

https://itsfoss.com/news/di-day-celebration/
3•MilnerRoute•56m ago•0 comments

Show HN: AI generated personal affirmations playing when your phone is locked

https://MyAffirmations.Guru
4•alaserm•56m ago•3 comments

Show HN: GTM MCP Server – Let AI Manage Your Google Tag Manager Containers

https://github.com/paolobietolini/gtm-mcp-server
1•paolobietolini•58m ago•0 comments

Launch of X (Twitter) API Pay-per-Use Pricing

https://devcommunity.x.com/t/announcing-the-launch-of-x-api-pay-per-use-pricing/256476
1•thinkingemote•58m ago•0 comments

Facebook seemingly randomly bans tons of users

https://old.reddit.com/r/facebookdisabledme/
1•dirteater_•59m ago•1 comments

Global Bird Count Event

https://www.birdcount.org/
1•downboots•1h ago•0 comments

What Is Ruliology?

https://writings.stephenwolfram.com/2026/01/what-is-ruliology/
2•soheilpro•1h ago•0 comments

Jon Stewart – One of My Favorite People – What Now? with Trevor Noah Podcast [video]

https://www.youtube.com/watch?v=44uC12g9ZVk
2•consumer451•1h ago•0 comments

P2P crypto exchange development company

1•sonniya•1h ago•0 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
2•jesperordrup•1h ago•0 comments

Write for Your Readers Even If They Are Agents

https://commonsware.com/blog/2026/02/06/write-for-your-readers-even-if-they-are-agents.html
1•ingve•1h ago•0 comments

Knowledge-Creating LLMs

https://tecunningham.github.io/posts/2026-01-29-knowledge-creating-llms.html
1•salkahfi•1h ago•0 comments

How attention sinks keep language models stable

https://hanlab.mit.edu/blog/streamingllm
219•pr337h4m•6mo ago

Comments

Havoc•6mo ago
> The first few tokens often carried minimal semantic information—sometimes just a start-of-sequence marker or common words like "the" or "a."

I wonder if it makes sense to use the first word as a title of sorts rather than going straight into a grammatically correct sentence when prompting.

optimalsolver•6mo ago
"Magnets. How do they work?"
gjm11•6mo ago
The heuristic doesn't work quite so well when applied to the actual original version of that line.
xg15•6mo ago
Some people start their prompts with "Hello" or "Please" or something similar, out of some habitual sense of politeness, I think. It would be hilarious if those prompts really work better because the model can use those words as attention sinks.
CamperBob2•6mo ago
One point that Karpathy has made in some of his videos is that using additional tokens in the prompt can facilitate computation. If you ask a transformer to do some basic math, it will be more likely to get the right answer (or at least a better approximation) with a more verbose prompt. To me, this backs up the use of more conversational language ("Please," etc.) when prompting.

However, that seems to be contradicted by what was shown recently with the successful International Math Olympiad effort. Their prompts, such as https://github.com/aw31/openai-imo-2025-proofs/blob/main/pro... , were very terse. It's hard to tell where the prompt stops and the CoT response starts, in fact.

So there is probably some interplay between the need for attention sinks and the use of step-by-step reasoning. It might not be too surprising if the latter works because it's an indirect way to optimize the former.

xg15•6mo ago
I wonder if the model could also just make its own sink tokens if the prompt doesn't have any. E.g. if the model first emits some "fluff" like "The answer to this question is:" before starting with the actual answer, it could use those tokens as attention sinks. Same with "thinking tokens" that don't directly contribute to the answer or invisible formatting tokens, etc.
CamperBob2•6mo ago
True, along with "You're absolutely right! What an insightful observation. You're going places, bro," yadda yadda yadda.

It would be amusing if all that gratuitous sycophancy actually helped with inference accuracy. It would also be worth treating that as a bug to be fixed, of course.

blackbear_•6mo ago
Good thought, that indeed works: https://arxiv.org/abs/2310.02226
yorwba•6mo ago
> It's hard to tell where the prompt stops and the CoT response starts, in fact.

That's because you're looking at the final output, which includes neither the prompt nor the intermediate chain of thought.

CamperBob2•6mo ago
Good point -- I can see that, but it all ends up in the same context, anyway. Point being, the model seems to prefer to conserve tokens.

That said, now I'm wondering if all those dashes it spews out are more than just window dressing.

Calavar•6mo ago
> Researchers had observed similar patterns in BERT, where "a surprisingly large amount of attention focuses on the delimiter token [SEP] and periods," which they argued was used by the model as a sort of no-op. The same summer at Meta, researchers studying vision transformers found similar behavior, observing that models would repurpose uninformative background patches as computational scratchpads.

This seems to go beyond just transformers. For example, I recall reading a paper a while ago that showed a similar effect in an image-to-image model with a GAN/U-Net architecture [1].

[1] https://arxiv.org/abs/1712.02950

SpaceManNabs•6mo ago
I miss GANs. I understand that they are much harder to train than transformers for the same performance, even in high-data, high-parameter regimes, but so much good optimization research and so many tricks came out of them.

The work on the capacity of discriminators was super cool.

godelski•6mo ago

  > much harder to train than transformers
There are plenty of GANs that use transformers. PWC seems to be redirecting to GitHub at the moment, but IIRC about half of the top scores on FFHQ256 were GANs with transformers in them. I know the number 2 was; I saw it at CVPR. It was a lot smaller and had higher throughput than the diffusion models it was outperforming.

Though the main reason diffusion took over was its ability to encode more diversity. I still think there's a place for GANs and that we overcorrected by putting too much focus on diffusion, but diffusion does have a lot of fundamental advantages. They aren't strictly better, though; there's no global optimum for solution spaces this large. I think the ML community (maybe CS in general) has a tendency to take an all-or-nothing approach, and I don't think that's a really good strategy...

SpaceManNabs•6mo ago
Thanks! Got any links, if you can spare the time? I think the info on GANs using transformers might be enough. Wasn't aware!
godelski•6mo ago
Sure. This was the paper [0]. Here are a few more you might find interesting: Google's Transformer GAN [1] (not a transformer at all resolutions) and Diffusion-GAN [2], a hybrid architecture. Remember that technically the GAN process can use any underlying architecture; arguably, some of the training steps in LLMs are GANs. And I think this one is interesting in a similar respect [3]. Before PWC went down, StyleSAN [4] was the SOTA on FFHQ, but IIRC it doesn't change the architecture, so it should probably work on all the other architectures too (it comes with compute costs, but I think only at training time; it's been a bit since I read it).

[0] https://arxiv.org/abs/2211.05770

[1] https://arxiv.org/abs/2106.07631

[2] https://arxiv.org/abs/2206.02262

[3] https://arxiv.org/abs/2212.04473

[4] https://arxiv.org/abs/2301.12811

SpaceManNabs•5mo ago
Thank you for spending your time to answer my questions.
derbOac•6mo ago
> This seems to go beyond just transformers

And beyond that. Sometimes I feel like AI research is reinventing wheels that exist elsewhere. Maybe just the wheels, but still.

am17an•6mo ago
This is nice and useful because the new GPT-OSS model uses this technique. Kudos to the original authors!
diggan•6mo ago
And, as always, the FOSS ecosystem moves quickly: llama.cpp already fully supports them! https://github.com/ggml-org/llama.cpp/pull/15157
esafak•6mo ago
> Barbero et al. have shown that attention sinks serve as "pressure valves" preventing what researchers call "over-mixing"—a pathological state where deep models processing long sequences blur important distinctions between tokens. The presence of a sink draws attention away from other tokens, limiting the spread of information (and noise) and resulting in more stable embeddings.

This sounds like it is working for the wrong reasons. Surely the right behavior is for the right neurons to receive attention rather than the first handful. Jamming everything there is the complementary sin to blurring. I would investigate attention equalization paired with a sparsity prior or something similar to prevent blurring.

yorwba•6mo ago
The point is that there's not always a right token to attend to. If the information you're looking for is not there, no clever attention scheme will find it. The best you can hope for when that happens is that the value returned in the "not found" case is distinguishable from the "found" case. Having an attention sink serve as a fixed "not found" value is one way to do this.
esafak•6mo ago
Good point. Does that make them mitigate hallucinations?
yorwba•6mo ago
In a sense? As the article notes, models trained using standard attention develop attention sinks naturally and removing them makes the model deteriorate completely, so the hallucinations you're thinking of were most likely output by a model that had already mitigated them in this way.
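
To make the article's recipe concrete, here is a minimal sketch of a StreamingLLM-style KV-cache policy: keep the first few positions (the sinks) plus a sliding window of recent tokens, instead of a plain sliding window. The function name and sizes below are invented for illustration.

  # Keep the first `n_sink` cached positions forever plus a window of the
  # most recent `window` positions; evict everything in between.
  def evict(cache, n_sink=4, window=1020):
      if len(cache) <= n_sink + window:
          return cache
      return cache[:n_sink] + cache[-window:]

  # Stand-ins for per-position (key, value) pairs.
  cache = [f"kv_{i}" for i in range(2000)]
  cache = evict(cache)
  assert cache[:4] == ["kv_0", "kv_1", "kv_2", "kv_3"]   # sink positions kept
  assert cache[-1] == "kv_1999"                          # recent window kept
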
krackers•6mo ago
Maybe another analogy (or at least the way I intuitively understood it) is that humans sometimes skip over tokens we know to be fluff/filler. Without a sink, models have no way to "skip" over a token: that token _will_ attend to all previous tokens and be incorporated into the residual stream. It's easy to see that for filler tokens this will tend to hurt quality more than help it, since you're more likely to pull in noise than if you could somehow "skip" that token entirely.
yorwba•6mo ago
Not quite. If some values are filler and some are not, and the corresponding keys are linearly separable, it's not difficult to find a query where standard attention gives low attention scores to filler and high attention scores to non-filler. Attention sinks deal with the problem when everything is filler, so there's no non-filler to allocate attention to.
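
A small numpy illustration of that "nothing worth attending to" case, with made-up numbers: against filler-only keys, plain softmax spreads the query's weight roughly uniformly (mixing in noise), while a high-scoring sink position absorbs most of it.

  import numpy as np

  def softmax(x):
      e = np.exp(x - x.max())
      return e / e.sum()

  # One query's scores against 5 "filler" keys it doesn't really match.
  filler_scores = np.array([0.1, -0.2, 0.0, 0.3, -0.1])
  print(softmax(filler_scores))        # roughly uniform: noise gets mixed in

  # Same scores plus a sink position with a high (learned) score.
  with_sink = np.append(filler_scores, 4.0)
  print(softmax(with_sink))            # ~90% of the mass lands on the sink
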
canjobear•6mo ago
Seems like this was a better solution to the same problem: https://www.evanmiller.org/attention-is-off-by-one.html
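
For reference, the "off by one" in that post is just an extra 1 in the softmax denominator, which lets a head assign (nearly) zero total attention instead of being forced to put its weight somewhere. A quick sketch of the difference:

  import numpy as np

  def softmax(x):
      e = np.exp(x - x.max())
      return e / e.sum()

  def softmax_one(x):
      # "quiet softmax": exp(x_i) / (1 + sum_j exp(x_j)).
      # No max-subtraction here so the +1 keeps its meaning; fine for a
      # sketch, not numerically robust for large scores.
      e = np.exp(x)
      return e / (1.0 + e.sum())

  scores = np.array([-4.0, -5.0, -3.0])   # a head that "wants" to attend to nothing
  print(softmax(scores).sum())            # 1.0 -- forced to attend somewhere
  print(softmax_one(scores).sum())        # ~0.07 -- can effectively opt out
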
markisus•6mo ago
Did this end up working? It sounds plausible but it needs some empirical validation.
serialx•6mo ago
Yeah, attention sinks were applied to gpt-oss
Maxious•6mo ago
There was skepticism last time this was posted: https://news.ycombinator.com/item?id=37740932

The implementation for gpt-oss this week showed 2-3x improvements: https://github.com/ggml-org/llama.cpp/pull/15157 https://www.reddit.com/r/LocalLLaMA/comments/1mkowrw/llamacp...

microtonal•6mo ago
The attention sink as used in gpt-oss is similar to your link. But rather than adding one to the denominator, they add a trainable 'logit' (a different logit for each head).
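
A rough numpy sketch of what "a trainable logit per head" could look like (variable names invented; the actual gpt-oss / llama.cpp code differs): the learned sink score takes part in the softmax normalization but has no value vector, so whatever attention lands on it simply goes nowhere.

  import numpy as np

  def attend_with_sink(scores, values, sink_logit):
      # scores: (n_tokens,) attention logits for one head and one query.
      # sink_logit: learned scalar for this head; it competes in the softmax
      # but contributes no value, so its share of attention is discarded.
      logits = np.append(scores, sink_logit)
      e = np.exp(logits - logits.max())
      probs = e / e.sum()
      token_probs, sink_prob = probs[:-1], probs[-1]
      return token_probs @ values, sink_prob

  values = np.eye(3)                       # 3 tokens, toy 3-dim values
  out, dumped = attend_with_sink(np.array([0.1, 0.0, -0.1]), values, sink_logit=3.0)
  print(out, dumped)                       # most of the weight went to the sink

Setting sink_logit to 0 recovers the fixed "+1 in the denominator" variant from the post linked above; making it trainable per head just lets the model learn how much attention each head is allowed to throw away.
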
scotty79•6mo ago
It would be super funny if it were sufficient.

This all reminds me of the bias term of a perceptron.

With transformers we started out without one, and the network repurposed one of the inputs for that role. That annoyed some people, because dropping this particular input now affects the whole thing unreasonably, and it annoyed others because the weight on that input was unreasonably high, since it sort of balanced all the others.

So initially people (from hanlab) tried to pin this input in place so it doesn't get dropped. Then others (from OpenAI this time) decided to skip the input altogether by providing a learnable bias inside the network (doing the thing the classical perceptron does), and now this guy proposes a further optimization of just setting the bias to 1 everywhere. That might work perfectly fine, since we don't really care about absolute values: ultimately we just pick the largest one and don't care what it was, so in training all the other weights simply get scaled so that the bias can be 1. It's a little like doing physics calculations with the speed of light set to 1.

If you have a simple feed-forward network of perceptrons where in the end you just pick the largest output and don't care about absolute values, then maybe you'd also be fine with setting all the perceptron bias terms to 1 and excluding them from learning.

Is the bias learnable in biological neurons? Doesn't the activation potential threshold (or whatever it's called) rely on some chemistry, and isn't it the same for all neurons?

Scene_Cast2•6mo ago
I found a fairly large improvement in my toy transformer model when I added a "global" token akin to the CLS token in ViT.

Another approach I've seen is the "Diff transformer" from MS Research (https://github.com/microsoft/unilm/tree/master/Diff-Transfor...).
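
For anyone who hasn't seen the trick: a "global" token just means prepending a learned embedding that every position can attend to, similar in spirit to ViT's CLS or register tokens. A minimal torch sketch, with all names and sizes invented:

  import torch
  import torch.nn as nn

  class WithGlobalTokens(nn.Module):
      """Prepends learned 'global' tokens before the usual transformer blocks."""
      def __init__(self, d_model=64, n_global=1):
          super().__init__()
          self.global_tokens = nn.Parameter(torch.randn(1, n_global, d_model) * 0.02)

      def forward(self, x):                  # x: (batch, seq, d_model)
          g = self.global_tokens.expand(x.size(0), -1, -1)
          return torch.cat([g, x], dim=1)    # (batch, n_global + seq, d_model)

  x = torch.randn(2, 10, 64)
  print(WithGlobalTokens()(x).shape)         # torch.Size([2, 11, 64])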

innerlee•6mo ago
The singular defects (or high-norm tokens) [1] may be related to attention sinks. It is interesting that all the high-norm tokens share the same direction. Maybe the theory behind it is not very complex and the issue can be fixed cleverly during training.

[1] https://openreview.net/pdf?id=4yBnUokU2v

sanj•6mo ago
Is there a way to hint in the prompting what information should be retained in the attention sinks?
smaddox•6mo ago
> though we did not delve into the observation

Oh, the irony.

flimflamm•6mo ago
Hah, they kind of found the "NULL" pointer in LLMs.