How does the target model validate the draft tokens without running the inference as normal?
Because if it is doing just that, I don't get the point as you can't trust the draft tokens before they are validated, so you're still stuck waiting for the target model.
So your draft model can decode N new tokens, then the real model does one inference pass to score the N new drafted tokens.
Prefill is computation bound whereas decode is bandwidth bound, so in practice doing one prefill over N tokens is cheaper than doing N decode passes.
Say the model so far has "The capital of France". The small model generates "is Paris.", which let's say is 5 tokens.
You feed the large model "The capital of France is Paris." to validate all 5 of those tokens in a single forward pass.
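To make that concrete, here is a rough sketch of how the target can score all the drafted tokens in a single forward pass. It is not any particular library's API: `target_model`, `prompt_ids`, and `draft_ids` are placeholders, and `target_model` is assumed to return next-token logits for every input position. It also uses greedy acceptance for simplicity; real speculative decoding uses a rejection-sampling rule so the output distribution matches the target model exactly.

    # Sketch only, assuming a causal LM `target_model` that returns logits of
    # shape [batch, seq_len, vocab]. Greedy acceptance for simplicity.
    import torch

    def verify_drafts(target_model, prompt_ids, draft_ids):
        # One forward pass over prompt + all drafted tokens, like a prefill.
        input_ids = torch.cat([prompt_ids, draft_ids]).unsqueeze(0)  # [1, T]
        logits = target_model(input_ids)[0]                          # [T, vocab]

        accepted = []
        start = prompt_ids.shape[0] - 1   # logits[start + i] predict draft_ids[i]
        for i, draft_tok in enumerate(draft_ids):
            target_tok = logits[start + i].argmax()
            if target_tok == draft_tok:
                accepted.append(draft_tok)    # target agrees, keep the draft token
            else:
                accepted.append(target_tok)   # disagreement: take the target's token
                break                         # everything drafted after this is invalid
        return accepted

Either way you only pay for one pass of the large model per round of drafting, and even on a mismatch that pass still gives you the target's own next token, so the work is not wasted.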
Also, if the small model were sufficiently more often "correct" than "wrong", wouldn't it be more efficient to get rid of the large model at that point?
It is about keeping the large model's quality while getting the small model's speed most of the time. The tradeoff is that you consume more memory from having two models loaded instead of just one of them.
If you only cared about speed, then it would make sense to reduce memory usage and just run the smaller model.
Unsurprisingly, gpt-oss has both larger and smaller models that behave very similarly! Because the two are so alike, even getting a few draft tokens wrong doesn't slow things down to the point of merely matching the speed of the larger model alone (which is the worst case with this setup). We want the speed of the smaller model as much of the time as possible. That is all.
Or is this a scenario where computation is expensive but validation is cheap?
Running f2(f1(x)) serially with the big model takes 2 seconds, assuming 1 second for every pass.
What I instead do is kick off f1(x) in another thread, and then run f2(g1(x)) where g1 is one pass through GPT-nano.
This takes 0.1 + 1 seconds, assuming GPT-nano takes 0.1s for every pass. Within this 1.1 seconds, the f1(x) that we kicked off in the 2nd thread has already finished (it takes 1 second).
So in 1.1 seconds we have f1(x) and f2(g1(x)) available to us, and we store the intermediate g1(x) as well.
We compare g1(x) and f1(x).
If they are equal, i.e. g1(x) = f1(x), then we have our answer f2(g1(x)) in just 1.1s.
If they are not, we compute f2 of the f1(x) output from the 2nd thread, which takes one further second, bringing our total to 2.1s.
If the small model matches the big model in, say, 2/3 of cases, you will spend 2/3 * 1.1 + 1/3 * 2.1 ≈ 1.43s on average for this computation. Without speculative decoding, it is always 2s.
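Here's a toy simulation of that two-thread trick, with time.sleep standing in for the compute: 1s per big-model pass (f1, f2) and 0.1s for the GPT-nano pass (g1). The functions themselves are made up just so there is something to compute.

    import time
    from concurrent.futures import ThreadPoolExecutor

    def f1(x): time.sleep(1.0); return x + 1   # big model, first pass
    def f2(x): time.sleep(1.0); return x * 2   # big model, second pass
    def g1(x): time.sleep(0.1); return x + 1   # GPT-nano's guess at f1(x)

    def speculative(x):
        with ThreadPoolExecutor(max_workers=1) as pool:
            f1_future = pool.submit(f1, x)   # kick off f1(x) on another thread
            guess = g1(x)                    # 0.1s
            spec = f2(guess)                 # 1.0s, overlaps with f1(x)
            real_f1 = f1_future.result()     # already finished by now
        if real_f1 == guess:
            return spec                      # ~1.1s total
        return f2(real_f1)                   # mispredicted: ~2.1s total

    t0 = time.time()
    print(speculative(3), f"in {time.time() - t0:.1f}s")   # g1 guesses right here, ~1.1s

In real speculative decoding the "guess" is a drafted token and the equality check is against the target model's own prediction, but the latency arithmetic is exactly the one above.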
Now I see the obvious thing they were pointing out, which is to predict multiple tokens ahead, not just two as in your example.
It does run the inference as normal, just in parallel with the other inferences
> if it is doing just that, I don't get the point
Running inferences in parallel allows you to read the model weights out of memory only once for N parallel inferences, as opposed to reading them out of memory N times for N serial inferences. Inference is massively bottlenecked by memory bandwidth, to the tune of one or two orders of magnitude compared to compute, so this helps a lot.
Nitpick: it's only bottlenecked by memory bandwidth if the batch size is too low (that is: if you don't have many users calling the same model in parallel).
Speculative decoding is just a way of running a single query as if it was parallel queries.
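A back-of-the-envelope model of that bandwidth argument (all numbers below are made up for illustration, not measurements): each pass costs roughly the max of streaming the weights once and doing the matmuls, so scoring N draft tokens in one pass is about N times cheaper than N serial decode steps as long as you stay bandwidth-bound.

    # Illustrative numbers only: ~7B model at fp16, 1 TB/s of memory bandwidth,
    # 100 TFLOP/s of compute, ~2 * params FLOPs per token.
    WEIGHT_BYTES  = 14e9
    BANDWIDTH     = 1.0e12
    FLOPS         = 100e12
    FLOP_PER_TOK  = 2 * 7e9

    def pass_time(tokens_in_pass):
        mem_time  = WEIGHT_BYTES / BANDWIDTH                # weights streamed once per pass
        comp_time = tokens_in_pass * FLOP_PER_TOK / FLOPS   # grows with tokens per pass
        return max(mem_time, comp_time)

    n = 5
    serial  = n * pass_time(1)   # N decode steps: weights streamed N times (~70 ms here)
    batched = pass_time(n)       # one verify pass over N drafts (~14 ms here)
    print(f"serial: {serial*1e3:.0f} ms, batched verify: {batched*1e3:.0f} ms")

It also shows the nitpick above: once tokens_in_pass (from genuinely batched users) gets large enough that the compute term dominates, the free lunch from speculation shrinks.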
For home use, Gemma27B QAT is king. It's almost as good as Deepseek R1.