
Hybrid Attention

27•JohannaAlmeida•2h ago
TLDR: Forked PyTorch and Triton internals. Changed attention so the first layer is linear, the middle layer quadratic, and the last layer linear. Inference got much faster with a low perplexity hit in tests.

Full attention O(n²): 17.96s / 5.6 tok/s

HybridAttention O(n·W + n·D): 0.35s / 286.6 tok/s

I have been building a small Rust-focused language model from scratch in PyTorch. This is not a finetune. It is byte level, trained from random initialization on a Rust-heavy corpus assembled here: https://codeberg.org/JohannaJuntos/Sisyphus

Model and training setup

The model has 25.6M parameters with a 512 context length. It uses a byte level vocabulary of 256, with 8 layers, 8 heads, and 512 dimensional embeddings. Positional embeddings are learned and the embedding and LM head weights are tied.
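The configuration above can be captured in a small sketch. Field names here are illustrative, not taken from the repo:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Values quoted in the post; names are hypothetical.
    vocab_size: int = 256       # byte-level vocabulary
    context_len: int = 512
    n_layers: int = 8
    n_heads: int = 8
    d_model: int = 512
    tied_embeddings: bool = True  # LM head shares the embedding matrix

    @property
    def head_dim(self) -> int:
        return self.d_model // self.n_heads

cfg = ModelConfig()
# Weight tying reuses the 256x512 embedding table as the LM head,
# saving vocab_size * d_model = 131,072 parameters.
```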

Training ran for 30k steps on a 173.5M byte Rust corpus using a single RTX 4060 Ti 8GB.

Final metrics were a train loss of 0.5834, validation loss of 0.8217, and perplexity of 2.15. The best validation loss occurred around step 18.5k, which suggests some late overfitting or plateau.

Architecture

The model is a GPT style decoder, but replaces standard full attention with a HybridAttention block in each layer. This combines local windowed causal attention with a GRU like recurrent state path, along with a learned gate that mixes the two.

The local path handles short range syntax, while the recurrent path carries compressed long range state. The gate bias is initialized to favor local attention early in training.
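The mixing described above can be sketched in plain numpy. This is a toy single-head reconstruction from the description, not the repo's Triton implementation; all weight names are made up:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def hybrid_attention(x, W=4, gate_bias=2.0, rng=np.random.default_rng(0)):
    """Toy sketch: local windowed causal attention mixed with a GRU-like
    recurrent state by a learned sigmoid gate. x has shape (T, D)."""
    T, D = x.shape
    wq, wk, wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
    wz = rng.standard_normal((D, D)) / np.sqrt(D)   # recurrence update weights
    wg = rng.standard_normal(D) / np.sqrt(D)        # gate projection

    # Local path: each token attends over a causal window of at most W tokens.
    q, k, v = x @ wq, x @ wk, x @ wv
    local = np.zeros_like(x)
    for t in range(T):
        lo = max(0, t - W + 1)
        att = softmax(q[t] @ k[lo:t + 1].T / np.sqrt(D))
        local[t] = att @ v[lo:t + 1]

    # Recurrent path: GRU-like leaky integration carrying long-range state.
    recur = np.zeros_like(x)
    h = np.zeros(D)
    for t in range(T):
        z = 1.0 / (1.0 + np.exp(-(x[t] @ wz)))      # update gate
        h = z * h + (1.0 - z) * x[t]
        recur[t] = h

    # Positive gate bias -> sigmoid near 1 -> favors the local path early.
    g = 1.0 / (1.0 + np.exp(-(x @ wg + gate_bias)))[:, None]
    return g * local + (1.0 - g) * recur
```

The positive `gate_bias` is the one detail the post states explicitly: at initialization the sigmoid gate sits close to 1, so the model leans on local attention until training teaches it when to consult the recurrent state.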

Inference uses Triton kernels and custom torch.library ops.

Corpus

The biggest gain came from corpus expansion.

The run started with about 31MB from Rust official sources and major projects such as rustc, cargo, rust-analyzer, tokio, serde, ripgrep, clap, and axum. The corpus was expanded to 173.5M bytes by cloning the top 500 crates, with 461 successful clones.

This expansion had more impact than any architectural change.

Inference performance

Full attention runs at about 5.6 tokens per second, while HybridAttention with KV cache reaches 286.6 tokens per second. This is about a 51x speedup with no visible quality loss.
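A back-of-envelope way to see where the asymptotic win comes from: with a KV cache, full attention still scores token t against all t previous positions, while the hybrid path scores against at most a fixed window W (the recurrent path is O(1) score-wise per token). This toy count only illustrates the scaling; the measured 51x also reflects Triton kernels, memory traffic, and cache compression:

```python
# Illustrative attention-score counts for generating n tokens with a KV cache.
def full_attention_scores(n):
    # token t attends to all t previous positions
    return sum(t for t in range(1, n + 1))

def hybrid_scores(n, W=64):
    # token t attends to at most W positions; recurrent path is O(1) per token
    return sum(min(t, W) for t in range(1, n + 1))

# The ratio grows roughly linearly with sequence length n.
ratio_512 = full_attention_scores(512) / hybrid_scores(512)
ratio_1024 = full_attention_scores(1024) / hybrid_scores(1024)
```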

The KV cache uses a hot window of 64 tokens in VRAM, while older tokens are compressed to 8 bit magnitude and angle and can be selectively promoted back to full precision. This changes the effective complexity from quadratic to near linear for this setup.
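One plausible reading of "8 bit magnitude and angle" is storing each cached KV vector as its norm (magnitude) plus its unit direction ("angle") quantized to int8. This is a guess at the scheme, not the repo's code:

```python
import numpy as np

def compress_kv(v):
    """Hypothetical compression: norm as a scalar, direction as int8."""
    mag = float(np.linalg.norm(v))
    if mag == 0.0:
        return 0.0, np.zeros(v.shape, dtype=np.int8)
    direction = v / mag                              # entries in [-1, 1]
    q = np.clip(np.round(direction * 127), -127, 127).astype(np.int8)
    return mag, q

def promote(mag, q):
    """Selective promotion back to full precision (lossy round trip)."""
    d = q.astype(np.float32) / 127.0
    n = np.linalg.norm(d)
    return mag * d / n if n > 0 else d

v = np.array([0.5, -1.0, 2.0, 0.25], dtype=np.float32)
mag, q = compress_kv(v)
v_hat = promote(mag, q)   # close to v, at 8 bits per element plus one scalar
```

Under this scheme "promotion" is just the dequantize step; hot-window tokens skip it entirely because they are never compressed.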

Quality

Surface Rust syntax looks decent, and imports and function signatures are often plausible. Semantics are still weak, and repetition and recursive patterns are common. It looks like Rust, but does not reason well yet.

What seems interesting

This project combines byte level Rust only pretraining from scratch, a hybrid local attention and recurrent architecture, large scale corpus expansion across the Rust ecosystem, and a practical KV cache paging strategy that delivers large speedups on consumer GPUs.

Next steps

I plan to run ablations comparing hybrid attention against local only and recurrent only variants, evaluate checkpoints around 18.5k versus the final model, and add syntax level validation such as parsing and compiling generated code. I also want to explore scaling context length from 256 up to 2048 and test whether switching from byte level to BPE becomes worthwhile now that the corpus is larger.
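Syntax-level validation could start very cheaply before shelling out to rustc. A crude delimiter-balance pre-filter (a sketch of one possible check, not the author's plan) already rejects many degenerate samples:

```python
def balanced_delims(src: str) -> bool:
    """Crude pre-filter for generated Rust: checks (), [], {} nesting while
    skipping string literals and // line comments. Char literals and
    lifetimes are ignored, so this is only a heuristic; a real check would
    invoke rustc on the sample."""
    pairs = {')': '(', ']': '[', '}': '{'}
    stack, i, n = [], 0, len(src)
    while i < n:
        c = src[i]
        if src[i:i + 2] == '//':                 # skip line comment
            i = src.find('\n', i)
            if i == -1:
                break
            continue
        if c == '"':                             # skip string literal
            i += 1
            while i < n and src[i] != '"':
                i += 2 if src[i] == '\\' else 1  # honor escape sequences
            i += 1
            continue
        if c in '([{':
            stack.append(c)
        elif c in ')]}':
            if not stack or stack.pop() != pairs[c]:
                return False
        i += 1
    return not stack
```

A stronger second stage would compile each sample with rustc and count the fraction that type-checks, which gives a cheap corpus-level quality metric beyond perplexity.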

Questions

For small code models, which evaluations have been most useful beyond perplexity?

Has anyone seen hybrid local plus recurrent attention work well for code generation?

Given this setup, would you prioritize more tokens, longer context, or clean ablations first?

Comments

JohannaAlmeida•2h ago
Full attention O(n²): 17.96s / 5.6 tok/s

HybridAttention O(n·W + n·D): 0.35s / 286.6 tok/s

empath75•1h ago
Is this just for, like, autocomplete? Because you are not going to get anything very useful out of a code-only training set.
JohannaAlmeida•1h ago
Yeah, autocomplete is an amazing use case. I needed a small model that used transformers and could fit on my weak consumer GPU.

So I needed to make fundamental architecture changes and do some KV cache tricks.

And then prove the new architecture was faster with benchmarks and that the perplexity was acceptable.

altruios•40m ago
I think it's more a proof of concept: locally trained. It would take lots of resources/time to train something non-trivial.
woodson•55m ago
Look into RWKV.
JohannaAlmeida•42m ago
Yeah, RWKV is definitely related in spirit (recurrent state for long context). Here I’m combining local windowed attention with a gated recurrent path plus KV cache compression, so it’s more hybrid than fully replacing attention.
