frontpage.

"The problem is Sam Altman": OpenAI Insiders don't trust CEO

https://arstechnica.com/tech-policy/2026/04/the-problem-is-sam-altman-openai-insiders-dont-trust-...
1•randycupertino•1m ago•0 comments

The Limits of Integration (2021)

https://muse.jhu.edu/verify?url=%2Fpub%2F17%2Farticle%2F804858&r=2536187
1•Tomte•1m ago•0 comments

Delivery Deadlines Are a Mistake

https://waldo.jaquith.org/blog/2026/04/delivery-deadlines/
1•speckx•2m ago•0 comments

AI Agent Sandboxes Got Security Wrong

https://multikernel.io/2026/04/03/ai-agent-sandboxes-got-security-wrong/
1•wang_cong•2m ago•0 comments

Show HN: Hazmat – I made unrestricted Claude Code safe on macOS

https://github.com/dredozubov/hazmat
1•dredozubov•3m ago•0 comments

You not always need an autonomous team

https://cientifico.net/task-teams-vs-autonomous-teams/
1•cientifico•3m ago•0 comments

Why Japan has such good railways

https://worksinprogress.co/issue/why-japan-has-such-good-railways/
1•zdw•4m ago•0 comments

Impeaching Donald J. Trump for High Crimes and Misdemeanors [pdf]

https://www.congress.gov/119/bills/hres1155/BILLS-119hres1155ih.pdf
3•dmm•5m ago•0 comments

NASA Artemis II Multimedia Resources

https://www.nasa.gov/artemis-ii-multimedia/
1•kklisura•5m ago•0 comments

The Downfall and Enshittification of Microsoft in 2026

https://caio.ca/blog/the-downfall-and-enshittification-of-microsoft.html
1•stock_toaster•6m ago•0 comments

Stop Vibecoding [video]

https://www.youtube.com/shorts/cgD-bTTe20I
1•mooreds•8m ago•0 comments

Ask HN: What's everyone's API billing stack?

2•sigpwned•8m ago•0 comments

Show HN: Ace Influence – Turn your brand into stories people watch

https://aceinfluence.ai
1•SecularVan•10m ago•0 comments

Why NASA flight director Gene Kranz is the gold standard for incident commanders

https://greatcircle.com/blog/2026/04/07/gene-kranz-gold-standard/
1•mooreds•10m ago•0 comments

US Senator Calls Chinese Cars a 'Cancer,' Vowing Stricter Ban

https://insideevs.com/news/791834/chinese-cars-cancer-republican-senator/
1•akyuu•10m ago•0 comments

What History Can Teach Us About Sleep and Dreams

https://cswr.hds.harvard.edu/news/2026/03/30/what-history-can-teach-us-about-sleep-and-dreams
1•apollinaire•11m ago•0 comments

The Hard Problems Nobody Has Solved

https://www.nibzard.com/hard-problems
2•nkko•11m ago•0 comments

Show HN: Managarr – A TUI and CLI for managing *ARR servers, built in Rust

https://github.com/Dark-Alex-17/managarr
1•Dark-Alex-17•12m ago•0 comments

Code Is an Afterthought

https://blog.ignaciobrasca.com/systems/2026/03/01/code-is-an-afterthought.html
1•warkanlock•14m ago•0 comments

MikroORM 7: Unchained

https://mikro-orm.io/blog/mikro-orm-7-released
1•bundie•14m ago•0 comments

Show HN: Bucket Delta – Compute differences between two S3-compatible buckets

https://github.com/nutanix/bucket-delta-cli
1•drstrange14•17m ago•0 comments

Show HN: I built an AI that forgets things when people leave the room

https://takt.chat/
1•mtrifonov•17m ago•1 comments

Local-heatmap-tile-server v1

https://nuxx.net/blog/2026/04/07/local-heatmap-tile-server-v1/
2•c0nsumer•18m ago•0 comments

AI Agent Guardrails: Pre-LLM and Post-LLM Best Practices

https://www.arthur.ai/blog/best-practices-for-building-agents-guardrails
1•pevals•18m ago•0 comments

Same LLM but different output: we built a CI specialist

https://www.mendral.com/blog/same-llm-different-agent
1•shad42•19m ago•0 comments

Good Taste the Only Real Moat Left

https://rajnandan.com/posts/taste-in-the-age-of-ai-and-llms/
15•speckx•19m ago•6 comments

I tried building a home lab using Illumos distros

https://www.xda-developers.com/i-tried-building-a-home-lab-using-illumos-distros/
1•herecomesthepre•19m ago•0 comments

Show HN: Seek – Context-aware terminal search TUI

https://github.com/vishruthb/seek
1•vishruthbharath•21m ago•0 comments

Show HN: NeverDue – turn emails into calendar events, exports as ICS

https://neverdue.ca
1•vicnas•21m ago•0 comments

"Amazing Refresh" - A Malicious Chrome Extension

https://scotthelme.co.uk/amazing-refresh-a-malicious-chrome-extension-running-malware-in-the-brow...
2•moebrowne•22m ago•0 comments

Show HN: RBF-Attention – Trading dot-products for Euclidean distance

https://www.pisoni.ai/posts/scaled-rbf-attention/
1•4rtemi5•2h ago

Comments

4rtemi5•2h ago
I recently asked myself: what would happen if we replaced the standard dot-product in self-attention with a distance-based score, e.g. an RBF kernel?

The standard dot-product attention has this quirk where a key vector can trick the softmax simply by having a massive magnitude. A random key that points in roughly the right direction but is huge will easily outscore a perfectly aligned but shorter key. Distance-based (RBF) attention could fix this. To get a high attention score, Q and K actually have to be close to each other in high-dimensional space. You can't cheat by just being large.
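The magnitude-cheating quirk is easy to see in a toy example. This is a minimal sketch (my own numbers, not from the post): a key that is perfectly aligned with the query but short, versus a key that points somewhat off-direction but has a huge norm.

```python
import torch

# Toy query plus two candidate keys.
q = torch.tensor([1.0, 0.0])
k_aligned = torch.tensor([1.0, 0.0])   # same direction, unit norm
k_huge = torch.tensor([8.0, 6.0])      # ~37 degrees off, but norm 10

# Dot-product scoring: the huge key wins purely on magnitude.
dot_aligned = q @ k_aligned            # 1.0
dot_huge = q @ k_huge                  # 8.0

# Distance-based (RBF-style) scoring: negative squared Euclidean distance.
dist_aligned = -((q - k_aligned) ** 2).sum()  # 0.0, a perfect match
dist_huge = -((q - k_huge) ** 2).sum()        # -85.0, far away

print(dot_huge > dot_aligned)    # magnitude cheats the softmax
print(dist_aligned > dist_huge)  # distance scoring is immune
```

Under dot-product scoring the misaligned giant outscores the exact match 8 to 1; under distance scoring it cannot, because being large in Euclidean space means being far away.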

What started as a 10-minute PyTorch experiment turned into a rabbit hole and a reminder of how deeply the dot-product is hardcoded into the entire ML stack. Changing one core operation triggered a massive domino effect. :D

Here is the chain of things that broke, and how I had to fix them just to get a model to train reasonably well:

Instant OOMs: If you naively compute pairwise Euclidean distances with torch.cdist (without the matmul trick), it materializes the full N x N distance matrix in memory, so you instantly OOM at any decent context length. Luckily, with a little high-school algebra you can expand the squared distance: -||Q - K||² = -||Q||² - ||K||² + 2(Q · K). Since the softmax is shift-invariant, the query norm is just a constant for that specific query and we can throw it in the trash. You're left with 2(Q · K) - ||K||². It turns out that RBF attention is mathematically just standard dot-product attention with a built-in squared-L2 penalty on the keys.
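The expansion is easy to sanity-check numerically. A small sketch of my own (not the repo's code): the cdist-based logits and the expanded 2(Q · K) - ||K||² logits differ only by a per-query constant, so their softmaxes agree exactly.

```python
import torch

torch.manual_seed(0)
N, d = 64, 32
Q, K = torch.randn(N, d), torch.randn(N, d)

# Naive: materialize the full pairwise squared-distance matrix via cdist.
naive_logits = -torch.cdist(Q, K) ** 2

# Trick: expand -||q - k||^2 = -||q||^2 - ||k||^2 + 2 q.k and drop the
# per-query constant -||q||^2, which the softmax ignores anyway.
trick_logits = 2 * (Q @ K.T) - (K ** 2).sum(-1)  # (N, N) - (N,) broadcasts per key

# Identical attention weights, no cdist needed.
print(torch.allclose(naive_logits.softmax(-1), trick_logits.softmax(-1), atol=1e-4))
```

The broadcast subtracts each key's squared norm along the column axis, which is exactly the "built-in L2 penalty on the keys" reading of RBF attention.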

Custom kernel: Even with that math trick, PyTorch's native scaled dot-product attention (SDPA) doesn't let you subtract an arbitrary key-norm penalty inside its fused loop. You can hack around it by padding your tensors with dummy dimensions, but that's clunky and moves unnecessary memory, so I gave up and wrote a custom Triton kernel. It mirrors the tiling logic of FlashAttention but computes the squared L2 norms of the keys on the fly in SRAM and subtracts them right before the softmax, so the whole thing only uses linear memory.
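The actual kernel lives in the repo; purely to illustrate the tiling idea, here is a plain-PyTorch sketch of mine (not the Triton code) that streams over key blocks with an online softmax and folds the key-norm penalty in per tile, so the full N x N logit matrix never exists at once:

```python
import torch

def rbf_attention_streamed(Q, K, V, block=128):
    """Illustrative FlashAttention-style loop: process keys block by block,
    subtracting each tile's squared key norms before the (online) softmax."""
    N = Q.shape[0]
    out = torch.zeros(N, V.shape[1])
    m = torch.full((N, 1), -float("inf"))  # running row max, for stability
    l = torch.zeros(N, 1)                  # running softmax normalizer
    for s in range(0, K.shape[0], block):
        Kb, Vb = K[s:s + block], V[s:s + block]
        # RBF logits for this tile: 2 Q.K - ||k||^2 (query norm dropped).
        logits = 2 * (Q @ Kb.T) - (Kb ** 2).sum(-1)
        m_new = torch.maximum(m, logits.max(-1, keepdim=True).values)
        scale = torch.exp(m - m_new)       # rescale old accumulators
        p = torch.exp(logits - m_new)
        out = out * scale + p @ Vb
        l = l * scale + p.sum(-1, keepdim=True)
        m = m_new
    return out / l

# Matches the dense reference computation:
torch.manual_seed(0)
Q, K, V = torch.randn(256, 64), torch.randn(256, 64), torch.randn(256, 64)
dense = torch.softmax(2 * (Q @ K.T) - (K ** 2).sum(-1), -1) @ V
print(torch.allclose(rbf_attention_streamed(Q, K, V), dense, atol=1e-4))
```

Per tile, memory is O(N x block) instead of O(N²); a real kernel would also fuse the causal mask and keep the accumulators in SRAM, which this sketch omits.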

Attention sinks: It turns out that models sometimes actually need magnitude bullying to create attention sinks: they scale up useless tokens (like <BOS>) so queries have a place to dump their attention mass when they don't care about the context. But in distance math, a massive vector means infinite distance and therefore zero probability; to be a universal sink in Euclidean space, a key must sit exactly at the origin. I resolved that with register tokens: I prepended learnable dummy vectors to the sequence and initialized them to zero. Whenever a query doesn't find anything useful, it naturally falls back to the register tokens, safely dumping its attention into the blank registers without corrupting actual tokens.
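A register-token layer is only a few lines. This is a hypothetical sketch of mine (names and shapes are my own, not the repo's): learnable registers, zero-initialized so they start exactly at the Euclidean origin, prepended along the sequence axis.

```python
import torch
import torch.nn as nn

class RegisterTokens(nn.Module):
    """Hypothetical sketch: zero-initialized learnable registers that act as
    attention sinks under distance-based scoring."""
    def __init__(self, n_registers, d_model):
        super().__init__()
        self.registers = nn.Parameter(torch.zeros(n_registers, d_model))

    def forward(self, x):
        # x: (batch, seq, d_model) -> prepend registers to every sequence.
        regs = self.registers.unsqueeze(0).expand(x.shape[0], -1, -1)
        return torch.cat([regs, x], dim=1)

x = torch.randn(2, 16, 32)
y = RegisterTokens(4, 32)(x)
print(y.shape)  # torch.Size([2, 20, 32])
```

After attention you would slice the first n_registers positions back off the output; they exist only as a dumping ground for attention mass, not as real tokens.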

RoPE makes zero sense anymore: Modern models use RoPE, which explicitly rotates vectors. That's mathematically elegant for dot-products (relative angles), but rotating vectors before measuring their absolute Euclidean distance completely destroys the geometry and makes no sense. So I ripped out RoPE entirely and swapped in SuSiE (Subspace Sinusoidal Embeddings), which just adds cached, unrotated sinusoids directly to the vectors. Because it's additive, positional offset explicitly acts as a distance penalty in Euclidean space.
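The "additive position becomes a distance penalty" point can be demonstrated with plain classic sinusoidal embeddings; this is my own stand-in sketch, not the SuSiE scheme from the repo, but the additive mechanism is the same: two copies of an identical token at different positions end up a Euclidean distance apart that grows with their positional offset.

```python
import torch

def sinusoidal_embeddings(n_pos, d_model):
    """Classic additive sinusoids (a stand-in for the post's SuSiE scheme):
    sin on even dims, cos on odd dims, geometric frequency ladder."""
    pos = torch.arange(n_pos, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angles = pos / (10000 ** (i / d_model))
    emb = torch.zeros(n_pos, d_model)
    emb[:, 0::2] = torch.sin(angles)
    emb[:, 1::2] = torch.cos(angles)
    return emb

emb = sinusoidal_embeddings(128, 64)
x = torch.randn(64)  # one token vector, imagined at two different positions
d_near = torch.dist(x + emb[10], x + emb[11])
d_far = torch.dist(x + emb[10], x + emb[60])
print(d_near < d_far)  # nearby positions stay closer in Euclidean space
```

Because the token vector x cancels out of the difference, the distance between the two copies is exactly the distance between their position embeddings, which is precisely the behavior a rotation-based scheme like RoPE does not give you under Euclidean scoring.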

Did it actually work? Hmm, kind of... I trained a tiny causal model on the minuscule TinyStories dataset. It converged slightly faster than a standard SDPA baseline. That may have to do with the distance math, since the pre-softmax logits are capped at 0, preventing early gradient spikes. But who knows...?

Is it going to replace FlashAttention in big models anytime soon? Nope. GPUs and the whole ML stack are super optimized for pure dot-products, and the industry solved magnitude bullying with QK-Norm instead. But it was a fun engineering exercise in breaking and rebuilding a part of the ML stack.

I went through all of it so you don't have to. Here is the code: https://github.com/4rtemi5/rbf_attention