frontpage.

Brute Force Colors (2022)

https://arnaud-carre.github.io/2022-12-30-amiga-ham/
1•erickhill•1m ago•0 comments

Google Translate apparently vulnerable to prompt injection

https://www.lesswrong.com/posts/tAh2keDNEEHMXvLvz/prompt-injection-in-google-translate-reveals-ba...
1•julkali•1m ago•0 comments

(Bsky thread) "This turns the maintainer into an unwitting vibe coder"

https://bsky.app/profile/fullmoon.id/post/3meadfaulhk2s
1•todsacerdoti•2m ago•0 comments

Software development is undergoing a Renaissance in front of our eyes

https://twitter.com/gdb/status/2019566641491963946
1•tosh•2m ago•0 comments

Can you beat ensloppification? I made a quiz for Wikipedia's Signs of AI Writing

https://tryward.app/aiquiz
1•bennydog224•3m ago•1 comments

Spec-Driven Design with Kiro: Lessons from Seddle

https://medium.com/@dustin_44710/spec-driven-design-with-kiro-lessons-from-seddle-9320ef18a61f
1•nslog•3m ago•0 comments

Agents need good developer experience too

https://modal.com/blog/agents-devex
1•birdculture•5m ago•0 comments

The Dark Factory

https://twitter.com/i/status/2020161285376082326
1•Ozzie_osman•5m ago•0 comments

Free data transfer out to internet when moving out of AWS (2024)

https://aws.amazon.com/blogs/aws/free-data-transfer-out-to-internet-when-moving-out-of-aws/
1•tosh•6m ago•0 comments

Interop 2025: A Year of Convergence

https://webkit.org/blog/17808/interop-2025-review/
1•alwillis•7m ago•0 comments

Prejudice Against Leprosy

https://text.npr.org/g-s1-108321
1•hi41•8m ago•0 comments

Slint: Cross Platform UI Library

https://slint.dev/
1•Palmik•12m ago•0 comments

AI and Education: Generative AI and the Future of Critical Thinking

https://www.youtube.com/watch?v=k7PvscqGD24
1•nyc111•12m ago•0 comments

Maple Mono: Smooth your coding flow

https://font.subf.dev/en/
1•signa11•13m ago•0 comments

Moltbook isn't real but it can still hurt you

https://12gramsofcarbon.com/p/tech-things-moltbook-isnt-real-but
1•theahura•17m ago•0 comments

Take Back the Em Dash–and Your Voice

https://spin.atomicobject.com/take-back-em-dash/
1•ingve•17m ago•0 comments

Show HN: 289x speedup over MLP using Spectral Graphs

https://zenodo.org/login/?next=%2Fme%2Fuploads%3Fq%3D%26f%3Dshared_with_me%25253Afalse%26l%3Dlist...
1•andrespi•18m ago•0 comments

Teaching Mathematics

https://www.karlin.mff.cuni.cz/~spurny/doc/articles/arnold.htm
2•samuel246•21m ago•0 comments

3D Printed Microfluidic Multiplexing [video]

https://www.youtube.com/watch?v=VZ2ZcOzLnGg
2•downboots•21m ago•0 comments

Abstractions Are in the Eye of the Beholder

https://software.rajivprab.com/2019/08/29/abstractions-are-in-the-eye-of-the-beholder/
2•whack•21m ago•0 comments

Show HN: Routed Attention – 75-99% savings by routing between O(N) and O(N²)

https://zenodo.org/records/18518956
1•MikeBee•21m ago•0 comments

We didn't ask for this internet – Ezra Klein show [video]

https://www.youtube.com/shorts/ve02F0gyfjY
1•softwaredoug•22m ago•0 comments

The Real AI Talent War Is for Plumbers and Electricians

https://www.wired.com/story/why-there-arent-enough-electricians-and-plumbers-to-build-ai-data-cen...
2•geox•25m ago•0 comments

Show HN: MimiClaw, OpenClaw (Clawdbot) on $5 Chips

https://github.com/memovai/mimiclaw
1•ssslvky1•25m ago•0 comments

How I Maintain My Blog in the Age of Agents

https://www.jerpint.io/blog/2026-02-07-how-i-maintain-my-blog-in-the-age-of-agents/
3•jerpint•26m ago•0 comments

The Fall of the Nerds

https://www.noahpinion.blog/p/the-fall-of-the-nerds
1•otoolep•27m ago•0 comments

Show HN: I'm 15 and built a free tool for reading ancient texts.

https://the-lexicon-project.netlify.app/
5•breadwithjam•30m ago•1 comments

How close is AI to taking my job?

https://epoch.ai/gradient-updates/how-close-is-ai-to-taking-my-job
1•cjbarber•30m ago•0 comments

You are the reason I am not reviewing this PR

https://github.com/NixOS/nixpkgs/pull/479442
2•midzer•32m ago•1 comments

Show HN: FamilyMemories.video – Turn static old photos into 5s AI videos

https://familymemories.video
1•tareq_•34m ago•0 comments

Were RNNs all we needed? A GPU programming perspective

https://dhruvmsheth.github.io/projects/gpu_pogramming_curnn/
107•omegablues•4mo ago

Comments

DoctorOetker•4mo ago
Is it not much simpler to parallelize by having different "readers" (using the same model parameters/weights) process different parts of the corpus in parallel? Reader A is reading book A while reader B is reading book B, etc.?

Is there a deeper reason why the more complicated parallelization in the OP, or in the article it references, is more desirable?

jsharf•4mo ago
If you have independent copies of the network learning gradients, then you're effectively making the batch size smaller, unless you're doing an all-gather and syncing them, in which case there's a lot of overhead.

When you take a batch and calculate gradients, you're effectively calculating a direction the weights should move in, and then taking a step in that direction. You can do more steps at once by doing what you describe, but they might not all be exactly in the right direction, so overall efficiency is hard to compare.

I am not an expert, but if I understand correctly, I think this is the answer.

immibis•4mo ago
Batch size is just averaging the gradients from multiple calculations.
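
A minimal NumPy sketch of that point (illustrative names and shapes, not from the article): averaging the gradients from K equal shards gives exactly the same update direction as one K-times-larger batch.

    import numpy as np

    rng = np.random.default_rng(0)
    w = np.zeros(4)                                # shared model parameters
    X, y = rng.normal(size=(64, 4)), rng.normal(size=64)

    def grad(w, Xb, yb):
        # mean-squared-error gradient of a linear model on one mini-batch
        return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

    shards = np.array_split(np.arange(64), 4)      # four independent "readers"
    g_readers = np.mean([grad(w, X[s], y[s]) for s in shards], axis=0)
    g_big_batch = grad(w, X, y)                    # one reader, batch of 64

    assert np.allclose(g_readers, g_big_batch)     # same direction, same step

The overhead mentioned above is the synchronization needed to form that average across devices on every step.
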
zozbot234•4mo ago
AIUI, the thinking when developing transformers might have been that "reading text A vs. text B" just isn't parallel enough for truly large-scale training. The problem was to somehow also parallelize the learning of very long range dependencies within a single sequence, and transformers managed to do that.
TimorousBestie•4mo ago
The single-thread performance of the parallel prefix sum that they use is O(N log N), so the improvement from that to O(log N) on N threads is not as surprising.

The way the headline is written, it sounds like Amdahl’s law was violated. It wasn’t, of course.

casta•4mo ago
How's the prefix sum on a single thread O(N log(N))? Isn't it trivially O(N)? It's just a for loop.
TimorousBestie•4mo ago
Yes, but the for loop comes with data dependencies that prevent it from being parallelized trivially.

The algorithm with fewer data dependencies is O(N log N).

This is covered in more detail in the article.

gyrovagueGeist•4mo ago
It comes from the depth of the computation, not the work.
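
For anyone following along, a small Python sketch of the two scans being compared here (sequential vs. a Hillis-Steele-style parallel scan; illustrative only):

    import numpy as np

    def scan_sequential(x):
        # O(N) work, but O(N) depth: step t needs the result of step t-1
        out, acc = np.empty_like(x), 0.0
        for t, v in enumerate(x):
            acc += v
            out[t] = acc
        return out

    def scan_parallel_style(x):
        # O(N log N) work, but only O(log N) dependent rounds: every addition
        # within a round is independent, so a GPU can run them all at once
        out = x.astype(float)
        shift = 1
        while shift < len(out):
            out[shift:] = out[shift:] + out[:-shift]
            shift *= 2
        return out

    x = np.arange(1, 9, dtype=float)
    assert np.allclose(scan_sequential(x), scan_parallel_style(x))

The extra work (N log N vs. N) is the price paid for cutting the dependency chain from N steps down to log N.
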
akst•4mo ago
I think the author may have linked to the wrong paper in their opening paragraph (hopefully they see this and update the link): https://arxiv.org/abs/2410.01201

I opened it wondering what RNNs were; for anyone else wondering, they are apparently Recurrent Neural Networks. You do find that out if you continue reading, though the lack of an upfront definition kind of stopped me in my tracks.

jszymborski•4mo ago
If you are looking to learn about LSTMs and RNNs, I can't recommend this post enough:

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Another great one (just on RNNs) is:

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

akst•4mo ago
thanks!
omegablues•4mo ago
sorry about that - it should be fixed!
akst•4mo ago
No worries
lettergram•4mo ago
Back in 2016-2018, my work at Capital One resulted in a modified C-RNN-style architecture that was producing GPT-2-level results. Using that model, we were able to build a general-purpose system that could generate data for any dataset (with minimal training, from scratch):

https://medium.com/capital-one-tech/why-you-dont-necessarily...

At the time it was clear to everyone on the team that RNNs, just like transformers later on, are general-purpose frameworks that really only require more data and scale to function. In the 2018-2020 era, and probably today, they were slower to train. They were also less prone to certain pitfalls, but overall had the same characteristics.

By 2019-2020 I was convinced that transformers would give way to a better architecture. The RNNs in particular trained faster and required less data, especially when combined with several architectural components I won't get into. I believe that's still true today, though I haven't worked on it in the last 2-3 years.

That said, transformers "won" because they are better overall building blocks and don't come with the nuances RNNs do. Combined with the compute optimizations now in place, I don't see that changing in the near term. Folks are even working on converting transformers to RNNs:

https://medium.com/@techsachin/supra-technique-for-linearizi...

There are also RNN-based models beating Qwen 3 8B on certain benchmarks:

https://www.rwkv.com/

I suspect that, over time, the other methods my team explored and other types of networks and nodes will continue to push state-of-the-art LLMs beyond transformers.

algo_trader•4mo ago
> RNN based models beating Qwen 3 8B
> https://www.rwkv.com/

Counter consensus is where the alpha is...

Do you think RNNs/RWKV have an edge in verifiable domains with tree-search inference? You can use cheaper GPUs and sample multiple times.

(But of course, it's hard to beat the sunk cost of a foundation model.)

marcosdumay•4mo ago
Well, as to the title: our brain seems to be equivalent to an RNN... so yeah, possibly.

Anyway, claiming that they are equivalent to transformers, when RNNs are Turing-complete and forward-only NNs are not, is such a strange take.

imtringued•4mo ago
It's a much stranger take to associate the brain with RNNs. It's far more likely that the brain does something similar to a path-independent spiking equilibrium model trying to find a fixed point, because those models are inherently robust to noise and adversarial attacks, do not require more than one layer, inherently contain feedback loops, and tend to generalize well to out-of-distribution data. Of course, in practice they end up somewhere between 2x and 3x slower than a finite-layer transformer for the same in-distribution performance.
RaftPeople•4mo ago
> It's a much stranger take to associate the brain with RNNs

That seems like too strong of a position. The equilibrium model seems to be a good candidate for some activity in the brain, but it doesn't seem like it applies to everything.

For example, the inter-layer recurrence in vision/object detection processing seems to be something different.

zozbot234•4mo ago
My understanding is that RNNs and LSTMs (potentially augmented with some bespoke attention mechanism) have the potential to be more training- and inference-efficient than the transformer models that are more common today, but transformers were adopted because of their unique ability to greatly scale up and parallelize training in a way that just isn't feasible with the older models. So transformers can get better outcomes from their allowed scale of compute despite possibly being less compute-efficient overall.
spwa4•4mo ago
I wonder about that. I mean, attention is sort of obvious, isn't it? Just calculate across all the data available, looking at every moment in time simultaneously. I know the point of attention is to select the correct data and process that, but in order to select the correct data we do a truly massive matrix multiplication. That this was going to work was, I believe, not a mystery even in 1960. It just wasn't possible, and even in 2016 it was not really feasible to train with such a principle outside of the FAANGs.

(not trying to say it wasn't an incredible accomplishment for the authors. There are quite a few details to get right in order to get to the "obvious" advance)

Even today it's pretty obvious how such a thing might be further extended. Create a network with an input so big that it just contains its entire dataset and does attention across all of it. Such a network would only need a basic understanding of language and would not hallucinate. It would also be obvious where anything came from, as the attention vectors would show what data was used by which specific part of the network. But this is a theoretical exercise, as all the compute power in the world can't do that for any decently sized dataset.

RNNs and LSTMs are more training- and inference-efficient because they don't do this. They do compute for every token and then more or less just add together, sequentially, every thought any part of the network had.

We need to go in the opposite direction from attention. It has to be the case that attention is extremely inefficient.
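
A bare-bones sketch of the "truly massive matrix multiplication" being described (single-head scaled dot-product attention in NumPy; sizes are illustrative):

    import numpy as np

    def attention(Q, K, V):
        # scores is an N x N matrix: compute and memory grow quadratically
        # with sequence length, which is the cost being complained about here
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
        return weights @ V                               # mix of every position

    N, d = 1024, 64
    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(3, N, d))                 # toy queries/keys/values
    out = attention(Q, K, V)                             # scores alone: 1024 x 1024

An RNN or LSTM touches each token once and carries a fixed-size state instead, which is where the efficiency claim above comes from.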

zozbot234•4mo ago
> I know the point of attention is to select the correct data and process that

In a way, the point of attention mechanisms is to bias the model towards long-range dependencies (as seen e.g. in language sentence structure) and away from the very short-term ones that a plain RNN would tend to focus on. LSTM is sort-of in the middle; it manages to persist information beyond the very short run (so "paying attention" to it in a very fuzzy sense) but not as long-term as attention does.

bob1029•4mo ago
> You simply cannot compute the gates for the entire sequence in one shot, because each step requires the output from the one before it. This forces a sequential loop, which is notoriously inefficient on parallel hardware like GPUs.

> The crux of the paper is to remove this direct dependency. The simplified models, minGRU and minLSTM, redefine the gates to depend only on the current input

The entire hypothesis of my machine learning experiments has been that we should embrace the time domain and causal dependencies. I really think biology got these elements correct. Now the question remains: which kind of computer system is best suited to running a very branchy and recursive workload?

Constantly adapting our experiments to satisfy the constraints of a single kind of compute vendor is probably not healthy for science over the long term.
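
A toy NumPy sketch of the dependency the quoted passage describes and the minGRU-style workaround (simplified; not the paper's exact equations):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # GRU-style step: the gate needs h_{t-1}, which forces a sequential loop.
    def gru_like(xs, Wz, Uz, Wh, Uh):
        h, out = np.zeros(Wz.shape[0]), []
        for x_t in xs:
            z_t = sigmoid(Wz @ x_t + Uz @ h)         # depends on previous state
            h_tilde = np.tanh(Wh @ x_t + Uh @ h)     # also depends on it
            h = (1.0 - z_t) * h + z_t * h_tilde
            out.append(h)
        return np.stack(out)

    # minGRU-style reformulation: gates and candidates depend only on x_t, so
    # they can be computed for the whole sequence in one batched matmul; what
    # remains is h_t = (1 - z_t) * h_{t-1} + z_t * htilde_t, a parallel scan.
    def min_gru_gates(xs, Wz, Wh):
        z = sigmoid(xs @ Wz.T)                       # (T, d_h): all gates at once
        h_tilde = xs @ Wh.T                          # (T, d_h): all candidates at once
        return z, h_tilde

    T, d_in, d_h = 128, 16, 32
    rng = np.random.default_rng(0)
    xs = rng.normal(size=(T, d_in))
    Wz, Uz, Wh, Uh = (rng.normal(size=s) * 0.1 for s in
                      [(d_h, d_in), (d_h, d_h), (d_h, d_in), (d_h, d_h)])
    hs_loop = gru_like(xs, Wz, Uz, Wh, Uh)      # must be computed step by step
    z, h_tilde = min_gru_gates(xs, Wz, Wh)      # computed for all t in one go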

jstanley•4mo ago
> Which kind of computer system is best suited to running a very branchy and recursive workload?

An analogue one, possibly?

tripplyons•4mo ago
I think of analogue computing as more continuous than branchy from what I have heard about it. I don't know much about it though.
inciampati•4mo ago
It turns out you can use a fused Triton kernel for a true RNN (GRU) and run just as fast as the minGRU model in training. Yeah, it doesn't work for very long contexts, but neither does minGRU (activation memory...).
nickpsecurity•4mo ago
Analog or FPGA. Cerebras' wafer-scale technology could help.
fennecbutt•4mo ago
Absolutely. I remember an article from ages ago about a self-learning algorithm implemented on an FPGA (I think) that could modify its own makeup at the hardware level.

It ended up optimising in a way that wasn't obvious at first, but turned out to exploit the noise of one part interacting with another.

Aha, here's the paper: https://osmarks.net/assets/misc/evolved-circuit.pdf

And a fluff article: https://www.damninteresting.com/on-the-origin-of-circuits

And as per usual, Google was hopeless at finding the article from a rough description. No chance at all. ChatGPT thought for 10s and delivered the correct result first time.

tripplyons•4mo ago
The output of the recurrence is still dependent on previous tokens, but it is usually less expressive within the recurrence in order to make parallelism possible. In minGRU the main operation used to share information between tokens is addition (with a simple weighting).

You could imagine that after one layer of the recurrence the tokens already have some information about each other, so the input to the following layers is dependent on previous tokens, although the dependence is indirect compared to a traditional RNN.
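
Concretely, the "addition with a simple weighting" can be written as an associative combine, which is what makes the parallel scan legal (illustrative Python, not the post's kernel):

    import numpy as np

    # h_t = (1 - z_t) * h_{t-1} + z_t * htilde_t  has the form  h_t = a_t * h_{t-1} + b_t.
    # Pairs (a, b) compose associatively, so any scan order gives the same result.
    def combine(left, right):
        a1, b1 = left
        a2, b2 = right
        return a1 * a2, a2 * b1 + b2        # h -> a2 * (a1 * h + b1) + b2

    def min_gru_states(z, h_tilde, h0):
        a, b = 1.0 - z, z * h_tilde
        acc = (np.ones_like(h0), np.zeros_like(h0))   # identity element
        out = []
        for t in range(len(z)):               # sequential here for clarity; the
            acc = combine(acc, (a[t], b[t]))  # associativity is what lets a GPU
            out.append(acc[0] * h0 + acc[1])  # evaluate this as a tree instead
        return np.stack(out)

    # sanity check against the naive loop
    rng = np.random.default_rng(0)
    z, h_tilde, h0 = rng.uniform(0, 1, (64, 8)), rng.normal(size=(64, 8)), np.zeros(8)
    h, ref = h0, []
    for t in range(64):
        h = (1 - z[t]) * h + z[t] * h_tilde[t]
        ref.append(h)
    assert np.allclose(min_gru_states(z, h_tilde, h0), np.stack(ref))

Each h_t ends up as a weighted sum of all earlier candidates, which is the indirect dependence on previous tokens described above.
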

vitus•4mo ago
> The GPU implementation's logarithmic scaling becomes evident at longer sequence lengths.

I don't see logarithmic scaling, actually. From the table of GRU performance, going from 16384 -> 65536 (i.e., increasing the input by 4x) gives roughly a 4x increase in time whether you look at CPU-scan or GPU-scan. Okay, maybe the inputs need to be bigger. Looking at the next plot, which goes up to 524288, we see the same behavior: the delta between CPU-scan and GPU-scan doubles as we double the input. That's a constant multiplicative factor. The same holds for LSTM performance.

Is this an artifact of the benchmark setup? Are we actually measuring the amount of time needed to load the full context into RAM? Or perhaps we're bottlenecked on memory bandwidth?

> Success: The gate extraction kernel, which was a huge bottleneck, now only takes 8% of the total time and is memory-bandwidth bound, saturating L2 bandwidth at 1.9 TB/s. This is a good place to be.

Sounds like that might be the case.

tripplyons•4mo ago
Typically, the fastest approaches for associative RNNs combine the advantages of the parallel O(n log n) algorithm with the recurrent, non-parallel O(n) approach by computing results for subsequence chunks in parallel and moving to the next chunk in a recurrent manner. This blog post explains the method (the chunkwise parallel algorithm) well: https://sustcsonglin.github.io/blog/2024/deltanet-2/
tripplyons•4mo ago
Have you explored chunkwise parallel approaches? They use the O(n log n) parallel algorithm within a subsequence and update between chunks recurrently, like the O(n) recurrent algorithm. These are usually the fastest kernels for these kinds of RNNs. Here is the best explanation I have seen: https://sustcsonglin.github.io/blog/2024/deltanet-2/
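
A toy version of the chunkwise idea (plain NumPy; the kernels in the linked post work in log space and fuse this on the GPU, so treat this only as a sketch of the control flow):

    import numpy as np

    def chunkwise_scan(a, b, chunk=64):
        # Solves h_t = a_t * h_{t-1} + b_t. Within a chunk the cumulative
        # products/sums are scan-parallelizable; across chunks only the last
        # state is handed forward recurrently, as described above.
        h = np.zeros_like(b[0])
        out = []
        for start in range(0, len(a), chunk):
            a_c, b_c = a[start:start + chunk], b[start:start + chunk]
            A = np.cumprod(a_c, axis=0)               # within-chunk, parallelizable
            prefix = np.cumsum(b_c / A, axis=0) * A   # (unstable trick, fine for a demo)
            h_block = A * h + prefix
            out.append(h_block)
            h = h_block[-1]                           # recurrent chunk hand-off
        return np.concatenate(out, axis=0)

    # quick check against the plain sequential recurrence
    rng = np.random.default_rng(0)
    T, d = 256, 8
    a, b = rng.uniform(0.5, 1.0, size=(T, d)), rng.normal(size=(T, d))
    h, ref = np.zeros(d), []
    for t in range(T):
        h = a[t] * h + b[t]
        ref.append(h)
    assert np.allclose(chunkwise_scan(a, b), np.stack(ref))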