How does the target model validate the draft tokens without running the inference as normal?
Because if it is doing just that, I don't get the point as you can't trust the draft tokens before they are validated, so you're still stuck waiting for the target model.
So your draft model can decode N new tokens, then the real model does one inference pass to score the N new drafted tokens.
Prefill is computation bound whereas decode is bandwidth bound, so in practice doing one prefill over N tokens is cheaper than doing N decode passes.
Say the model so far has "The capital of France". The small model generates "is Paris.", which let's say is 5 tokens.
You feed the large model "The capital of France is Paris." to validate all 5 of those tokens in a single forward pass.
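For the curious, a minimal sketch of the greedy accept/reject loop in Python; `draft_next` and `target_argmax` are hypothetical stand-ins for the two models (not any particular library's API), and `target_argmax` is assumed to return the target model's argmax prediction for every position in one forward pass:

```python
def speculative_step(tokens, draft_next, target_argmax, k=5):
    """One round of draft-then-verify, greedy decoding only."""
    # 1. The small model drafts k tokens autoregressively (cheap passes).
    draft = []
    for _ in range(k):
        draft.append(draft_next(tokens + draft))

    # 2. The big model scores prompt + draft in a SINGLE forward pass.
    #    preds[i] = target's prediction for the token following position i.
    preds = target_argmax(tokens + draft)

    # 3. Accept drafted tokens until the first disagreement.
    accepted = []
    for i, tok in enumerate(draft):
        if tok == preds[len(tokens) + i - 1]:
            accepted.append(tok)
        else:
            accepted.append(preds[len(tokens) + i - 1])  # target's own token is still usable
            break
    else:
        accepted.append(preds[-1])  # all k matched: we even get one bonus token
    return tokens + accepted
```

Either way the big model only ran once, and in the worst case you still get one valid token out of that pass.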
Also, if the small model were sufficiently more "correct" than "wrong", wouldn't it be more efficient to get rid of the large model at this point?
It is about improving quality while allowing for faster speed most of the time. The tradeoff is that you consume more memory from having two models loaded vs one of them exclusively.
If you just focus on one then it would make sense to reduce memory usage by just running the smaller model.
Unsurprisingly, gpt-oss comes in both a larger and a smaller size that behave very similarly! Because the two are so alike, even when the draft gets a few tokens wrong, the slowdown never brings you all the way back down to the speed of running the larger model alone (which is the worst case with this setup). We want the speed of the smaller model as much of the time as possible. That is all.
The post-training fine-tuning costs (low thousands of dollars) are the main reason speculative decoding is relatively unpopular. The most effective speculative decoding strategies require you to train multiple prediction heads à la Medusa (or whatever succeeded it). If you don't do any fine-tuning, the probability of the small model being useful is slim. Using a random model as your draft model will probably give you very disappointing results.
or is this a scenario where computation is expensive but validation is cheap?
EDIT: thanks, people, for educating me! very insightful :)
If the small model predicts some tokens correctly, you save some passes, at the expense of doing some extra computations when the tokens were not correct.
In any case, each forward pass will give at least one new token.
Computing f2(f1(x)) the normal, serial way takes 2 seconds, assuming 1 second for every pass.
What I instead do is kick off f1(x) in another thread, and then run f2(g1(x)) where g1 is one pass through GPT-nano.
This takes 1 + 0.1 seconds, assuming GPT-nano takes 0.1 s per pass. In those 1.1 seconds, the f1(x) we kicked off in the second thread will have finished (it takes 1 second).
So in 1.1 seconds we have available to us f1(x), f2(g1(x)), and we store the intermediate g1(x) as well
We compare g1(x) and f1(x)
If they were equal, i.e g1(x) = f1(x), then we have our answer = f2(g1(x)) in just 1.1s.
If they were not, we compute f2(output of f1(x) from 2nd thread) which takes 1 further second, bringing our total to 2.1s.
If the small model matches the big model in, say, 2/3 of cases, you will spend 2/3 × 1.1 + 1/3 × 2.1 ≈ 1.433 s on average for this computation. Without speculative decoding, it is always 2 s.
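Putting that arithmetic in a couple of lines (same assumed timings as above):

```python
t_big, t_small = 1.0, 0.1   # seconds per pass, as assumed above
p_match = 2 / 3             # how often g1(x) == f1(x)

t_hit  = t_big + t_small            # 1.1 s: f2(g1(x)) overlaps with f1(x)
t_miss = t_big + t_small + t_big    # 2.1 s: have to redo f2 on the real f1(x)

expected = p_match * t_hit + (1 - p_match) * t_miss
print(expected)   # ~1.433 s, vs a flat 2.0 s without speculation
```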
Now I see they tried to point out the obvious thing which is to predict multiple tokens ahead, not just two as in your example.
It does run the inference as normal, just in parallel with the other inferences
> if it is doing just that, I don't get the point
Running inferences in parallel allows you to read the model weights out of memory only once for N parallel inferences, as opposed to reading them out of memory N times for N serial inferences. Inference is massively bottlenecked by memory bandwidth, to the tune of one or two orders of magnitude compared to compute, so this helps a lot.
Nitpick: it's only bottlenecked by memory bandwidth if the batch size is too low (that is: if you don't have many users calling the same model in parallel).
Speculative decoding is just a way of running a single query as if it was parallel queries.
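To make the bandwidth point concrete, a back-of-the-envelope sketch; the numbers are illustrative assumptions, not measurements:

```python
weights_gb   = 140    # e.g. a 70B-parameter dense model at fp16
bandwidth_gb = 3000   # ~3 TB/s of HBM on a datacenter-class GPU

# Every decode step streams all the weights once, so a single sequence
# can never exceed roughly:
single_stream_tok_s = bandwidth_gb / weights_gb    # ~21 tok/s ceiling

# N parallel sequences (a bigger batch, or verifying N drafted tokens at once)
# share that same weight read, so aggregate throughput scales with N until
# compute becomes the limit:
n = 8
parallel_tok_s = n * bandwidth_gb / weights_gb     # ~170 tok/s aggregate
print(single_stream_tok_s, parallel_tok_s)
```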
I think this answer is as good as any of the human-generated ones in the thread so far, but the real power is that you can ask it follow-up questions. https://chatgpt.com/share/6894504f-4458-8008-a8c9-f371588259...
For home use, Gemma 27B QAT is king. It's almost as good as DeepSeek R1.
The 120B is running at 20 tokens/sec on my 5060Ti 16GB with 64GB of system ram. Now personally I find 20 tokens/sec quite usable, but for some maybe it's not enough.
Flash attention works with GPT-OSS + llama.cpp (tested on 1d72c8418) and other Blackwell card (RTX Pro 6000) so I think it should work on 5090 as well, it's the same architecture after all.
It really is amazing what ggerganov and the llama.cpp team have done to democratize LLMs for individuals that can't afford a massive GPU farm worth more than the average annual salary.
How many tokens/s do you get for DeepSeek-R1?
R1 starts at about 10t/s on an empty context but quickly falls off. I'd say the majority of my tokens are generating around 6t/s.
Some of the other big MoE models can be quite a bit faster.
I'm mostly using QwenCoder 480b at Q8 these days for 9t/s average. I've found I get better real-world results out of it than K2, R1 or GLM4.5.
* Gigabyte MZ73-LM1 with two AMD EPYC GENOA 9334 QS 64c/128t
* 24 sticks of M321R4GA3BB6-CQK 32GB DDR5-4800 RDIMM PC5-38400R
* 24GB A5000
Note that the RAM price almost doubled since Jan 2024.

I wonder what makes it work so well on yours! My CPU isn't much slower and my GPU probably faster.
Speed and ease of use is one thing, but it shouldn't be at the cost of accuracy.
I am in the early phases of collecting my thoughts on this topic so bear with me, but is this a bad thing?
AI models will have a world view. I think I prefer them having a western world view, as that has built our modern society and has proven to be most successful in making the lives of people better.
At the very minimum I would want a model to document its world view, and be aligned to it so that it does not try to socially engineer me to surreptitiously change mine.
I think the worry is that there’s no fixed definitions here, so the executive can use this to exert partisan or ideological pressure on model providers.
Every four years the models get RLHF’d to switch between thinking guns are amazing vs thinking guns are terrible.
I may be naive, but on this specific case, I am hoping that an AI could lead us to a somewhat objective truth. There seems to be enough data to draw some conclusions here. For example, most/all countries in Europe have less gun violence than the US, but there are at least two EU countries with high gun ownership (Finland and Austria) that also have low gun violence. The gun ownership issue is so polarized these days, I don't think we can trust most people to make reason-based arguments about it. Maybe an AI could help us synthesize and interpret the data dispassionately.
What worries me is that the current "western world view" of America is not the same as the western world view we've shared with them since the cold war. The trend is towards the same kind of values and behaviour we see in the Islamic Republic and the Russian Federation. If that sort of "western world view" gets baked into the intelligent infrastructure, it may be very hard to change course in the future. For example dissidence and wrongthink is going to get harder and harder.
Highly debatable, and most people anywhere would probably say the same thing about whatever world view they hold.
Even then, there is an important difference between de-facto and de-jure rules. Fun fact: even North Korea has a constitutional guarantee of freedom of speech and the right to vote*. They don't do these things as we would understand any of those words, but they have them right there in the constitution.
So: does the USA, as it exists today, represent the values you want? Can you honestly say, hand on heart, that Alligator Alcatraz should be a thing your AI has been trained to support? Or that it's fine for Qatar to donate a 747 that becomes part of the library of the current president, not the office of the president, when his term in office comes to an end?
I won't list everything, this isn't the place for that, but even if we wind the clock back a few years, do you (/we) want an AI aligned with a political circus of kayfabe that distracts us from the real political machinations?
Of course, this is still USA-focused.
I'd say that what really made a difference to our quality of life wasn't even the American political system: there were massive improvements to human existence starting with the first industrial revolution in the UK in the 1760s, but the social and political nature of the world back then was so bleak that communism got invented a century later and introduced what were at the time controversial ideas like "women are not property" and "universal free education is good", and the USA's systems have changed substantially several times since then (at a minimum the Civil War, the New Deal, and the Civil Rights movement).
The "meta system" that allows change can be considered good, but not uniquely so if you compare this to the Russian Revolution getting rid of the Tzars and a 40 years later they were in orbit (and this despite the Holodomor and WW2) and then threw off these shackles with Glasnost and the fall of the USSR (and note there that in Russia specifically, not all the former soviet countries but specifically Russia, the freedom gained failed to bring material improvements and the lives of those living through it were, in aggregate, made worse despite that freedom), and similar stories with the Chinese starting with dangerous incompetence (Four Pests campaign) and now in a position where "which is more powerful, them or the USA?" is a matter of which measure you use rather than it being obvious.
* https://en.wikipedia.org/wiki/Constitution_of_North_Korea#Ch...
> TensorRT-LLM
It is usually the hardest to set up correctly and is often out of date with respect to the relevant architectures. It also requires compiling the model on the exact same hardware-drivers-libraries stack as your production environment, which is a great pain in the rear end, to say the least. Multimodal setups have also been a disaster - at least for a while - when it was near-impossible to make them work even for mainstream models like the multimodal Llamas. The big question is whether it's worth it, since running GPT-OSS-120B on an H100 using vLLM is flawless in comparison - and the throughput stays at 130-140 t/s for a single H100. (The title is also somewhat clickbait - I was expecting to see 500 t/s on a single GPU, when in fact it's a tensor-parallel setup.)
It's also funny that they went for a separate release of TRT-LLM just to make sure that gpt-oss would work correctly. TRT-LLM is a mess.
But for the kind of traffic we are trying to serve -- high volume and latency sensitive -- it consistently wins head-to-head in our benchmarking and we have invested a ton of dev work in the tooling around it.
A few things I noticed:

- It's only fast with small context windows and small total token counts; once past ~10k tokens you're basically queueing everything for a long time.
- MCPs/web search/URL fetch have already become a very important part of interacting with LLMs; when they're not available the LLM's utility is greatly diminished.
- A lot of CLI/TUI coding tools (e.g., opencode) were not working reliably offline at this time with the model, despite being set up prior to being offline.
That’s in addition to the other quirks others have noted with the OSS models.
I seriously doubt it's the throughput of memory during inference that's the bottleneck here.
(A memory-bound workload like token gen wouldn't usually run into the CPU's thermal or power limits, so there would be little or no gain from offloading work to the iGPU/NPU in that phase.)
Even though LM Studio uses llama.cpp as a runtime, the performance differs between them. With LM Studio 0.3.22 Build 2 with CUDA Llama.cpp (Linux) v1.45.0 runtime I get ~86 tok/s on a RTX Pro 6000, while with llama.cpp compiled from 1d72c841888 (Aug 7 10:53:21 2025) I get ~180 tok/s, almost 100 more per second, both running lmstudio-community/gpt-oss-120b-GGUF.
I think 99% of web searches lead to the same 100-1k websites. I assume it would only take a few GBs to keep a copy of those locally, though that raises copyright concerns.
LLMs call out to external websites when something isn’t commonly represented in training data, like specific project documentation or news events.
Maybe it's better to have the AI only "reason", and somehow instantly access precise data.
I'm spec'ing out a Mac Studio with 512GB of RAM because I can window shop and wish, but I think the trend for local LLMs is getting really good.
Do we know WHY openAI even released them?
Regulations, and trying to earn the goodwill of developers using local LLMs, something that has been slowly eroding since they last released weights to the public quite a while ago (GPT-2, 2019).
Enterprises can now deploy them on AWS and GCP.
Yes, all this has been known since the M4 came out. The memory bandwidth is too low.
Try using it for real tasks like Cline or opencode: the contexts get long and processing them is too slow to be practical.
The M4 Max with 128GB of RAM (the part used in the comment) has over 500GB/sec of memory bandwidth.
[1] https://alverstokeaviation.blogspot.com/2016/03/
This page also has a rendered image of the generator:
https://aviation.stackexchange.com/questions/43490/how-much-...
Just looked in the parts drawer at home and don't seem to have a $25,000 GPU, for some inexplicable reason.
adjective: available
able to be used or obtained; at someone's disposal
There should be a quicker way to differentiate between "consumer-grade hardware that is mainly meant for gaming and can also run LLM inference in a limited way" and "business-grade hardware whose main purpose is AI training or running inference for LLMs".
Defining a GPU as "can output a contemporary display connector signal and is more than just a RAMDAC/framebuffer-to-cable translator", starting with even just some 2D blitting acceleration.
I think it will also make sense to replace "H" with a brand number, sort of like they already do for consumer GPUs.
So then maybe one day we'll have a math coprocessor called "Nvidia 80287".
"Accelerator card" makes a lot of sense to me.
Maybe renaming the device to an MPU, where the M stands for "matrix/math/mips" would make it more semantically correct?
I looked around briefly and could find no evidence that it's been renamed. Do you have a source?
Last consumer GPU with NVLink was the RTX 3090. Even the workstation-grade GPUs lost it.
https://forums.developer.nvidia.com/t/rtx-a6000-ada-no-more-...
Unless you’re running it 24/7 for multiple years, it’s not going to be cost effective to buy the GPU instead of renting a hosted one.
For personal use you wouldn’t get a recent generation data center card anyway. You’d get something like a Mac Studio or Strix Halo and deal with the slower speed.
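A rough break-even sketch; the card price and hosted rate are assumptions, and real hosted prices vary a lot by provider:

```python
gpu_price      = 25_000   # datacenter-class card, USD (assumed)
hosted_per_hr  = 2.50     # assumed hourly rate for a comparable hosted GPU
hours_per_year = 24 * 365

break_even_years = gpu_price / (hosted_per_hr * hours_per_year)
print(break_even_years)   # ~1.1 years of continuous 24/7 use; much longer at partial utilization
```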
So I wonder what I could be doing wrong. In the end I just use RTX 5080 as my models fit neatly in the available RAM.
* by not working at all, I mean the scripts worked, but results were wrong. As if H100 couldn't do maths properly.
It just means you CAN buy one if you want, as in they're in stock and "available", not that you can necessarily afford one.
Baseten: 592.6 tps
Groq: 784.6 tps
Cerebras: 4,245 tps
still impressive work
That said, we are serving the model at its full 131K context window, and they are serving 33K max, which could matter for some edge case prompts.
Additionally, NVIDIA hardware is much more widely available if you are scaling a high-traffic application.
Do you guys know a website that clearly shows which OS LLM models run on / fit into a specific GPU(setup)?
The best heuristic I could find for the necessary VRAM is number of parameters × (precision / 8) × 1.2, from here [0].
[0] https://medium.com/@lmpo/a-guide-to-estimating-vram-for-llms...
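That heuristic as a tiny helper; the 1.2 overhead factor is a rule of thumb, not a guarantee (a reply below suggests ~2x if you need headroom for concurrent traffic):

```python
def estimate_vram_gb(params_billions, precision_bits, overhead=1.2):
    return params_billions * (precision_bits / 8) * overhead

print(estimate_vram_gb(120, 4))   # ~72 GB for a 120B model at ~4-bit
print(estimate_vram_gb(27, 8))    # ~32 GB for a 27B model at 8-bit
```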
So in the end, trying to actually run them seems to be the only fool-proof way of knowing for sure :)
https://huggingface.co/settings/local-apps
Then on the model pages, it will show you whether you can use it.
Also, most of the time they are split up and, sometimes, you'll get an indicator on the splits.
It’s still a work in progress to check all hardware and model format compatibility but it’s a great start until GGUF becomes the standard.
Your equation is roughly correct, but I tend to multiply by a factor of 2, not 1.2, to allow for highly concurrent traffic.
While it is seemingly hard to calculate it, maybe one should just make a database website that tracks specific setups (model, exact variant / quantisation, runner, hardware) where users can report, which combination they got running (or not) along with metrics like tokens/s.
Visitors could then specify their runner and hardware and filter for a list of models that would run on that.
If anyone is working on training or inference in Rust, I'm currently working on adding fp8 and fp4 support to cudarc[0] and candle[1]. This is being done so I can support these models in our inference engine for Mixlayer[2].
[0] https://github.com/coreylowman/cudarc/pull/449 [1] https://github.com/huggingface/candle/pull/2989 [2] https://mixlayer.com
I'm just trying to figure out how wide the datastream through this is, in particular, the actual data (not the weights) that flow through all of it. The width of the output stream. Just how big is a token at the output, prior to reducing it with "temperature" to a few bytes?
Assume infinitely fast compute in a magic black box, but you have to send the output through gigabit ethernet... what's the maximum number of tokens per second?
I'm just trying to calculate the actual bandwidth required for the full output of the model, not just a token to be handed off to the user.
I need this so I can compute just what bandwidth a fully FPGA (later ASIC) based implementation of the model would result in.
Edit/Append: I asked GPT-5, and it estimated:
Total bytes = 50,000 tokens × 4 bytes/token = 200,000 bytes
Which sounds about right to me. This yields a maximum of about 500 logits/second on Gigabit ethernet. The actual compute of the model is peanuts compared to just shuffling the data around.
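For anyone redoing the arithmetic, the same estimate in a few lines; the 50k vocabulary and fp32 logits are the assumptions from the GPT-5 answer above:

```python
vocab_size      = 50_000
bytes_per_logit = 4                               # fp32
bytes_per_token = vocab_size * bytes_per_logit    # 200,000 bytes per logit vector

gbe_bytes_per_s = 1e9 / 8                         # gigabit ethernet, ~125 MB/s payload
print(gbe_bytes_per_s / bytes_per_token)          # ~625 vectors/s raw, ~500 with protocol overhead
```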
That’s 2880 values (so multiply by dtype)
It's sad that MLPerf takes a long time to catch up to SOTA models.
Yeah, according to the architecture it doesn't seem like a snowflake, but they also decided to invent a new prompting/conversation format (https://github.com/openai/harmony), which definitely makes it a bit of a snowflake today: you can't just use what worked a couple of days ago; everyone needs to add proper support for it.