Author here. The core claim: RWKV-7 (2.9B params, RNN) scores 72.8% avg across
standard benchmarks vs LLaMA 3.2's 69.7% — trained on 3.1T tokens vs ~9T.
Same parameter count, one-third the data.
The more interesting result is architectural: under standard complexity
conjectures, RWKV-7 exceeds TC⁰, the complexity class bounding standard
Transformers (Merrill & Sabharwal's proof in the paper). It solves
state-tracking problems that fixed-depth attention provably cannot.
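For anyone unfamiliar with "state tracking", here is a minimal sketch of the kind of problem meant: composing a stream of permutations of 5 elements (the S5 word problem, a standard NC¹-complete example). This is an illustration only, not code from the paper; `compose` and `track_state` are hypothetical names. A recurrent model can in principle carry the running permutation in its state, one update per token.

```python
def compose(p, q):
    """Apply permutation p first, then q. A permutation is a tuple: index -> image."""
    return tuple(q[p[i]] for i in range(len(p)))


def track_state(perms, n=5):
    """Fold a sequence of permutations into one running state of fixed size n."""
    state = tuple(range(n))  # identity permutation
    for p in perms:
        state = compose(state, p)  # one sequential update per input token
    return state


swap01 = (1, 0, 2, 3, 4)  # swap elements 0 and 1
cycle = (1, 2, 3, 4, 0)   # rotate all five elements

print(track_state([swap01, cycle]))
print(track_state([cycle] * 5))
```

The answer depends on the order of every element in the sequence, which is what makes it hard for a fixed-depth parallel circuit and easy for a sequential update.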
Inference memory is O(1) in context length: the recurrent state has a fixed
size, so there is no KV cache. The hybrid variant (RWKV-X) hits 99.8% passkey
retrieval at 64K and 1.37x Flash Attention v3 throughput at 128K.
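To make the "no KV cache" point concrete, here is a rough sketch of a fixed-size recurrent state updated with the classic delta rule, S <- S + beta * (v - S k) kᵀ. This is the textbook rule, not RWKV-7's generalized update, and `delta_rule_step` and `read` are hypothetical names used only for illustration.

```python
def delta_rule_step(S, k, v, beta):
    """One classic delta-rule update: S <- S + beta * (v - S k) k^T.

    S is a d_v x d_k matrix (list of lists); k and v are plain lists.
    """
    pred = [sum(row[j] * k[j] for j in range(len(k))) for row in S]  # S k
    return [
        [S[i][j] + beta * (v[i] - pred[i]) * k[j] for j in range(len(k))]
        for i in range(len(v))
    ]


def read(S, q):
    """Query the state: returns S q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


S = [[0.0, 0.0], [0.0, 0.0]]
S = delta_rule_step(S, k=[1.0, 0.0], v=[3.0, 4.0], beta=1.0)  # write a value
S = delta_rule_step(S, k=[1.0, 0.0], v=[5.0, 6.0], beta=1.0)  # overwrite it
print(read(S, [1.0, 0.0]))

# The state stays 2 x 2 no matter how many tokens are processed.
for _ in range(1000):
    S = delta_rule_step(S, k=[0.0, 1.0], v=[1.0, 1.0], beta=0.5)
print(len(S), len(S[0]))
```

A Transformer would instead append one key/value pair per token, so its cache grows linearly with context length.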
Happy to discuss the delta rule generalization, the TC⁰ proof, or the
benchmark methodology. I went through 36 sources digging into the caveats.
xml•1h ago
> Specifically, we collected new data created after January 2025, including: [...] new fiction on Archive of Our Own (Various, 2025),
Not sure how to feel about this. From a researcher's point of view, reproducibility is important, but the last time someone publicly collected data from AO3, the community was not very fond of that.
Aedelon•1h ago
Paper: https://arxiv.org/abs/2503.14456 (COLM 2025, peer-reviewed)
Weights: https://huggingface.co/collections/RWKV/rwkv-v7-67d43835efa2...
Code: https://github.com/BlinkDL/RWKV-LM (Apache 2.0)
xml•1h ago
https://huggingface.co/datasets/nyuuzyou/archiveofourown/dis...