On my RX 7800 XT gaming GPU, using less than 3 GB of VRAM for the buffer, I have built an architecture that can process a 10-million-token context.
This is not a joke. You can run it in a Google Colab notebook, on a free T4, and prove it to yourself right now:
The Proteus Playground https://colab.research.google.com/github/Zen-Sherbert/Proteus-Attention/blob/main/TinyPlayground.ipynb
It runs flawlessly on both CUDA and ROCm. It works. With the proof of concept out of the way, here are the three core ideas that got me here.
1. DNA - Tokens have value.
My journey started with a simple idea: tokens mean something. They have value. So why don't we use it?
I built a system called DNA, where each attention "gate" learns a "taste" for certain tokens and pulls them in like gravity. The crazy part? On a raw, untrained model, I found that 334 out of 500 tokens were already being caught by this system. It's a natural, emergent behavior.
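Here's the shape of the idea as a minimal PyTorch sketch. This is not the Proteus code; the gate count, taste parameterization, and threshold are all made up for illustration. Each gate holds a learned "taste" vector, and a token is caught the moment its embedding aligns with any taste past a threshold:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DNAGate(nn.Module):
    """Toy DNA-style gate: tokens that align with a learned taste get caught."""
    def __init__(self, dim: int, num_gates: int = 8, threshold: float = 0.1):
        super().__init__()
        # One learned "taste" vector per gate (hypothetical parameterization).
        self.tastes = nn.Parameter(torch.randn(num_gates, dim) / dim ** 0.5)
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, dim). Affinity = cosine similarity, token vs. taste.
        affinity = F.normalize(x, dim=-1) @ F.normalize(self.tastes, dim=-1).T
        # A token is "caught" if any gate's affinity clears the threshold.
        return (affinity > self.threshold).any(dim=-1)

gate = DNAGate(dim=64)         # completely untrained
tokens = torch.randn(500, 64)  # 500 random token embeddings
print(int(gate(tokens).sum()), "of 500 tokens caught")
```

Even with random weights, a large fraction of tokens clears the bar, which is the same flavor of emergent capture described above.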
2. The Alpha Slider - "Why can't I just change my model?"
I hated that I couldn't just switch my model from dense to sparse to linear whenever I wanted. So I built a custom Triton kernel to do exactly that.
The result is a single knob called alpha (toy sketch below):
Dense, high-fidelity? alpha = 0.0.
Balanced sub-quadratic? alpha = 0.3.
Screaming-fast linear time? alpha = 1.0 and the attention mechanism goes brrrrrr.
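Here's a toy PyTorch version of what the knob means. The real thing is a fused Triton kernel that I'm not reproducing here; the elu+1 feature map on the linear path is my stand-in choice, and this naive blend still pays the dense cost at intermediate alpha, where the fused kernel is what actually buys the sub-quadratic compute:

```python
import torch
import torch.nn.functional as F

def attention_with_alpha(q, k, v, alpha: float = 0.3):
    # q, k, v: (seq_len, dim)
    # Dense path: standard O(n^2) softmax attention.
    dense = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1) @ v
    # Linear path: positive feature map (elu + 1) gives O(n) attention.
    qf, kf = F.elu(q) + 1, F.elu(k) + 1
    linear = (qf @ (kf.T @ v)) / (qf @ kf.sum(0, keepdim=True).T)
    # alpha = 0.0 -> pure dense, alpha = 1.0 -> pure linear.
    return (1 - alpha) * dense + alpha * linear

q, k, v = (torch.randn(128, 64) for _ in range(3))
out = attention_with_alpha(q, k, v, alpha=0.3)  # (128, 64)
```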
3. Chunking & RoPE - "So I got rid of it."
My new systems got me far, but the VRAM bottleneck was still a headache. So I got rid of it.
The idea is simple: chunking. Break a massive context into small pieces, shunt them to system RAM, and use a tiny VRAM buffer for only the most important tokens.
DNA tells us what's important. As a Hail Mary, I added RoPE so every token also remembers where it came from. This combination creates contextual teleportation: the model builds a "highlight reel" and reasons over it as if critical facts, separated by thousands of pages, were sitting right next to each other. It's your own little wormhole across data space. A sketch of the gather step is below.
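A back-of-the-napkin sketch of that gather step, reusing the DNA taste scoring from the first sketch (build_highlight_reel, buffer_size, and the scoring function are illustrative names and choices, not the repo's API): chunks stay in system RAM, every token gets scored, and only the winners are copied into the small VRAM buffer along with their original positions.

```python
import torch
import torch.nn.functional as F

def build_highlight_reel(chunks, taste, buffer_size=1024, device="cuda"):
    # chunks: list of (chunk_len, dim) CPU tensors holding the full context.
    # taste:  (dim,) learned DNA taste vector used to score importance.
    scores, positions = [], []
    offset = 0
    for chunk in chunks:
        # Score every token in system RAM -- no VRAM touched yet.
        scores.append(F.cosine_similarity(chunk, taste, dim=-1))
        positions.append(torch.arange(offset, offset + len(chunk)))
        offset += len(chunk)
    scores, positions = torch.cat(scores), torch.cat(positions)
    # Keep only the tokens that fit in the tiny VRAM buffer.
    top = scores.topk(min(buffer_size, len(scores))).indices
    reel = torch.cat(chunks)[top].to(device)  # highlight reel -> VRAM
    pos = positions[top].to(device)           # ORIGINAL positions, for RoPE
    return reel, pos
```

The trick is feeding pos (not 0..buffer_size-1) into RoPE: tokens sit next to each other in the buffer, but each one still carries the position it teleported in from.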
TL;DR: I built an extreme-context system that costs less than Minecraft to run. Would love feedback, as I'm still exploring how far it can go.
GitHub: https://github.com/Zen-Sherbert/Proteus-Attention/tree/main
Zen_Sherbert•4h ago
This whole thing started with me trying to implement sparsity and getting it totally wrong. The DNA idea came to me in the middle of the night during my shift as an asset protection officer. The rest of it was just fumbling from one discovery to the next, mostly ignoring the "right" way to do things.
I'm an 8-year veteran, a father of three, and I just finished my bachelor's. I am not an AI researcher. If I can build this, you can do something infinitely better.
Please, try the Colab. Break it. Play with it. I implore you to tell me how it breaks. I'm excited to see what the community thinks.