How does the target model validate the draft tokens without running the inference as normal?
Because if it is doing just that, I don't get the point as you can't trust the draft tokens before they are validated, so you're still stuck waiting for the target model.
So your draft model can decode N new tokens, then the real model does one inference pass to score the N new drafted tokens.
Prefill is computation bound whereas decode is bandwidth bound, so in practice doing one prefill over N tokens is cheaper than doing N decode passes.
Say the model so far has "The capital of France". The small model generates "is Paris.", which let's say is 5 tokens.
You feed the large model "The capital of France is Paris." to validate all 5 of those tokens in a single forward pass.
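For the curious, a minimal sketch of the greedy accept/reject loop in Python; `draft_next` and `target_argmax` are hypothetical stand-ins for the two models (not any particular library's API), and `target_argmax` is assumed to return the target model's argmax prediction for every position in one forward pass:

```python
def speculative_step(tokens, draft_next, target_argmax, k=5):
    """One round of draft-then-verify, greedy decoding only."""
    # 1. The small model drafts k tokens autoregressively (cheap passes).
    draft = []
    for _ in range(k):
        draft.append(draft_next(tokens + draft))

    # 2. The big model scores prompt + draft in a SINGLE forward pass.
    #    preds[i] = target's prediction for the token following position i.
    preds = target_argmax(tokens + draft)

    # 3. Accept drafted tokens until the first disagreement.
    accepted = []
    for i, tok in enumerate(draft):
        if tok == preds[len(tokens) + i - 1]:
            accepted.append(tok)
        else:
            accepted.append(preds[len(tokens) + i - 1])  # target's own token is still usable
            break
    else:
        accepted.append(preds[-1])  # all k matched: we even get one bonus token
    return tokens + accepted
```

Either way the big model only ran once, and in the worst case you still get one valid token out of that pass.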
Also, if the small model were sufficiently more "correct" than "wrong", wouldn't it be more efficient to get rid of the large model at this point?
It is about improving quality while allowing for faster speed most of the time. The tradeoff is that you consume more memory from having two models loaded vs one of them exclusively.
If you just focus on one then it would make sense to reduce memory usage by just running the smaller model.
Unsurprisingly, gpt-oss comes in both a larger and a smaller size that behave very similarly! Because the two are so alike, even when the draft gets a few tokens wrong, the slowdown never brings you all the way back down to the speed of running the larger model alone (which is the worst case with this setup). We want the speed of the smaller model as much of the time as possible. That is all.
The post-training fine-tuning costs (low thousands of dollars) are the main reason speculative decoding is relatively unpopular. The most effective speculative decoding strategies require you to train multiple prediction heads à la Medusa (or whatever succeeded it). If you don't do any fine-tuning, the probability of the small model being useful is slim. Using a random model as your draft model will probably give you very disappointing results.
or is this a scenario where computation is expensive but validation is cheap?
EDIT: thanks, people, for educating me! very insightful :)
If the small model predicts some tokens correctly, you save some passes, at the expense of doing some extra computations when the tokens were not correct.
In any case, each forward pass will give at least one new token.
Computing f2(f1(x)) the normal, serial way takes 2 seconds, assuming 1 second for every pass.
What I instead do is kick off f1(x) in another thread, and then run f2(g1(x)) where g1 is one pass through GPT-nano.
This takes 1 + 0.1 seconds, assuming GPT-nano takes 0.1 s per pass. In those 1.1 seconds, the f1(x) we kicked off in the second thread will have finished (it takes 1 second).
So in 1.1 seconds we have available to us f1(x), f2(g1(x)), and we store the intermediate g1(x) as well
We compare g1(x) and f1(x)
If they were equal, i.e g1(x) = f1(x), then we have our answer = f2(g1(x)) in just 1.1s.
If they were not, we compute f2(output of f1(x) from 2nd thread) which takes 1 further second, bringing our total to 2.1s.
If the small model matches the big model in, say, 2/3 of cases, you will spend 2/3 × 1.1 + 1/3 × 2.1 ≈ 1.433 s on average for this computation. Without speculative decoding, it is always 2 s.
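Putting that arithmetic in a couple of lines (same assumed timings as above):

```python
t_big, t_small = 1.0, 0.1   # seconds per pass, as assumed above
p_match = 2 / 3             # how often g1(x) == f1(x)

t_hit  = t_big + t_small            # 1.1 s: f2(g1(x)) overlaps with f1(x)
t_miss = t_big + t_small + t_big    # 2.1 s: have to redo f2 on the real f1(x)

expected = p_match * t_hit + (1 - p_match) * t_miss
print(expected)   # ~1.433 s, vs a flat 2.0 s without speculation
```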
Now I see they tried to point out the obvious thing which is to predict multiple tokens ahead, not just two as in your example.
It does run the inference as normal, just in parallel with the other inferences
> if it is doing just that, I don't get the point
Running inferences in parallel allows you to read the model weights out of memory only once for N parallel inferences, as opposed to reading them out of memory N times for N serial inferences. Inference is massively bottlenecked by memory bandwidth, to the tune of one or two orders of magnitude compared to compute, so this helps a lot.
Nitpick: it's only bottlenecked by memory bandwidth if the batch size is too low (that is: if you don't have many users calling the same model in parallel).
Speculative decoding is just a way of running a single query as if it was parallel queries.
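To make the bandwidth point concrete, a back-of-the-envelope sketch; the numbers are illustrative assumptions, not measurements:

```python
weights_gb   = 140    # e.g. a 70B-parameter dense model at fp16
bandwidth_gb = 3000   # ~3 TB/s of HBM on a datacenter-class GPU

# Every decode step streams all the weights once, so a single sequence
# can never exceed roughly:
single_stream_tok_s = bandwidth_gb / weights_gb    # ~21 tok/s ceiling

# N parallel sequences (a bigger batch, or verifying N drafted tokens at once)
# share that same weight read, so aggregate throughput scales with N until
# compute becomes the limit:
n = 8
parallel_tok_s = n * bandwidth_gb / weights_gb     # ~170 tok/s aggregate
print(single_stream_tok_s, parallel_tok_s)
```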
I think this answer is as good as any of the human-generated ones in the thread so far, but the real power is that you can ask it follow-up questions. https://chatgpt.com/share/6894504f-4458-8008-a8c9-f371588259...
For home use, Gemma 27B QAT is king. It's almost as good as DeepSeek R1.
The 120B is running at 20 tokens/sec on my 5060Ti 16GB with 64GB of system ram. Now personally I find 20 tokens/sec quite usable, but for some maybe it's not enough.
Flash attention works with GPT-OSS + llama.cpp (tested on 1d72c8418) and other Blackwell card (RTX Pro 6000) so I think it should work on 5090 as well, it's the same architecture after all.
It really is amazing what ggerganov and the llama.cpp team have done to democratize LLMs for individuals that can't afford a massive GPU farm worth more than the average annual salary.
How many tokens/s do you get for DeepSeek-R1?
R1 starts at about 10t/s on an empty context but quickly falls off. I'd say the majority of my tokens are generating around 6t/s.
Some of the other big MoE models can be quite a bit faster.
I'm mostly using QwenCoder 480b at Q8 these days for 9t/s average. I've found I get better real-world results out of it than K2, R1 or GLM4.5.
* Gigabyte MZ73-LM1 with two AMD EPYC GENOA 9334 QS 64c/128t
* 24 sticks of M321R4GA3BB6-CQK 32GB DDR5-4800 RDIMM PC5-38400R
* 24GB A5000
Note that the RAM price almost doubled since Jan 2024.

I wonder what makes it work so well on yours! My CPU isn't much slower and my GPU probably faster.
Speed and ease of use is one thing, but it shouldn't be at the cost of accuracy.
I am in the early phases of collecting my thoughts on this topic so bear with me, but is this a bad thing?
AI models will have a world view. I think I prefer them having a western world view, as that has built our modern society and has proven to be most successful in making the lives of people better.
At the very minimum I would want a model to document its world view, and be aligned to it so that it does not try to socially engineer me to surreptitiously change mine.
I think the worry is that there’s no fixed definitions here, so the executive can use this to exert partisan or ideological pressure on model providers.
Every four years the models get RLHF’d to switch between thinking guns are amazing vs thinking guns are terrible.
I may be naive, but on this specific case, I am hoping that an AI could lead us to a somewhat objective truth. There seems to be enough data to draw some conclusions here. For example, most/all countries in Europe have less gun violence than the US, but there are at least two EU countries with high gun ownership (Finland and Austria) that also have low gun violence. The gun ownership issue is so polarized these days, I don't think we can trust most people to make reason-based arguments about it. Maybe an AI could help us synthesize and interpret the data dispassionately.
What worries me is that the current "western world view" of America is not the same as the western world view we've shared with them since the cold war. The trend is towards the same kind of values and behaviour we see in the Islamic Republic and the Russian Federation. If that sort of "western world view" gets baked into the intelligent infrastructure, it may be very hard to change course in the future. For example dissidence and wrongthink is going to get harder and harder.
Highly debatable, and most people anywhere would probably say the same thing about whatever world view they hold.
Even then, there is an important difference between de-facto and de-jure rules. Fun fact: even North Korea has a constitutional guarantee of freedom of speech and the right to vote*. They don't do these things as we would understand any of those words, but they have them right there in the constitution.
So: does the USA, as it exists today, represent the values you want? Can you honestly say, hand on heart, that Alligator Alcatraz should be a thing your AI has been trained to support? Or that it's fine for Qatar to donate a 747 that becomes part of the library of the current president, not the office of the president, when his term in office comes to an end?
I won't list everything, this isn't the place for that, but even if we wind the clock back a few years, do you (/we) want an AI aligned with a political circus of kayfabe that distracts us from the real political machinations?
Of course, this is still USA-focused.
I'd say that what really made a difference to our quality of life wasn't even the American political system: there were massive improvements to human existence starting with the first industrial revolution in the UK in the 1760s, but the social and political nature of the world back then was so bleak that communism got invented a century later and introduced what were at the time controversial ideas like "women are not property" and "universal free education is good", and the USA's systems have changed substantially several times since then (at a minimum the Civil War, the New Deal, and the Civil Rights movement).
The "meta system" that allows change can be considered good, but not uniquely so if you compare this to the Russian Revolution getting rid of the Tzars and a 40 years later they were in orbit (and this despite the Holodomor and WW2) and then threw off these shackles with Glasnost and the fall of the USSR (and note there that in Russia specifically, not all the former soviet countries but specifically Russia, the freedom gained failed to bring material improvements and the lives of those living through it were, in aggregate, made worse despite that freedom), and similar stories with the Chinese starting with dangerous incompetence (Four Pests campaign) and now in a position where "which is more powerful, them or the USA?" is a matter of which measure you use rather than it being obvious.
* https://en.wikipedia.org/wiki/Constitution_of_North_Korea#Ch...
> TensorRT-LLM
It is usually the hardest to set up correctly and is often out of date with respect to the relevant architectures. It also requires compiling the model on the exact same hardware-drivers-libraries stack as your production environment, which is a great pain in the rear end, to say the least. Multimodal setups have also been a disaster - at least for a while - when it was near-impossible to make them work even for mainstream models like the multimodal Llamas. The big question is whether it's worth it, since running GPT-OSS-120B on an H100 using vLLM is flawless in comparison - and the throughput stays at 130-140 t/s for a single H100. (The title is also somewhat clickbait - I was expecting to see 500 t/s on a single GPU, when in fact it's a tensor-parallel setup.)
It's also funny that they went for a separate release of TRT-LLM just to make sure that gpt-oss would work correctly. TRT-LLM is a mess.
But for the kind of traffic we are trying to serve -- high volume and latency sensitive -- it consistently wins head-to-head in our benchmarking and we have invested a ton of dev work in the tooling around it.
A few things I noticed:

- It's only fast with small context windows and small total token counts; once past ~10k tokens you're basically queueing everything for a long time.
- MCPs/web search/URL fetch have already become a very important part of interacting with LLMs; when they're not available the LLM's utility is greatly diminished.
- A lot of CLI/TUI coding tools (e.g., opencode) were not working reliably offline at this time with the model, despite being set up prior to being offline.
That’s in addition to the other quirks others have noted with the OSS models.
I seriously doubt it's the throughput of memory during inference that's the bottleneck here.
(A memory-bound workload like token gen wouldn't usually run into the CPU's thermal or power limits, so there would be little or no gain from offloading work to the iGPU/NPU in that phase.)
Even though LM Studio uses llama.cpp as a runtime, the performance differs between them. With LM Studio 0.3.22 Build 2 with CUDA Llama.cpp (Linux) v1.45.0 runtime I get ~86 tok/s on a RTX Pro 6000, while with llama.cpp compiled from 1d72c841888 (Aug 7 10:53:21 2025) I get ~180 tok/s, almost 100 more per second, both running lmstudio-community/gpt-oss-120b-GGUF.
I think 99% of web searches lead to the same 100-1k websites. I assume it would only take a few GBs to keep a copy of those locally, though that raises copyright concerns.
LLMs call out to external websites when something isn’t commonly represented in training data, like specific project documentation or news events.
Maybe it's better to have the AI only "reason", and somehow instantly access precise data.
I'm spec'ing out a Mac Studio with 512GB of RAM because I can window shop and wish, but I think the trend for local LLMs is getting really good.
Do we know WHY openAI even released them?
Regulations, and trying to earn the goodwill of developers using local LLMs, something that has been slowly eroding since they last released weights to the public quite a while ago (GPT-2, 2019).
Enterprises can now deploy them on AWS and GCP.
Yes, all this has been known since the M4 came out. The memory bandwidth is too low.
Try using it for real tasks like Cline or opencode: the contexts get long and processing them is too slow to be practical.
The M4 Max with 128GB of RAM (the part used in the comment) has over 500GB/sec of memory bandwidth.
[1] https://alverstokeaviation.blogspot.com/2016/03/
This page also has a rendered image of the generator:
https://aviation.stackexchange.com/questions/43490/how-much-...
Just looked in the parts drawer at home and don't seem to have a $25,000 GPU, for some inexplicable reason.
adjective: available
able to be used or obtained; at someone's disposal
There should be a quicker way to differentiate between "consumer-grade hardware that is mainly meant for gaming and can also run LLM inference in a limited way" and "business-grade hardware whose main purpose is AI training or running inference for LLMs".
Defining a GPU as "can output a contemporary display connector signal and is more than just a RAMDAC/framebuffer-to-cable translator", starting with even just some 2D blitting acceleration.
I think it will also make sense to replace "H" with a brand number, sort of like they already do for consumer GPUs.
So then maybe one day we'll have a math coprocessor called "Nvidia 80287".
"Accelerator card" makes a lot of sense to me.
Maybe renaming the device to an MPU, where the M stands for "matrix/math/mips" would make it more semantically correct?
I looked around briefly and could find no evidence that it's been renamed. Do you have a source?
Last consumer GPU with NVLink was the RTX 3090. Even the workstation-grade GPUs lost it.
https://forums.developer.nvidia.com/t/rtx-a6000-ada-no-more-...
Unless you’re running it 24/7 for multiple years, it’s not going to be cost effective to buy the GPU instead of renting a hosted one.
For personal use you wouldn’t get a recent generation data center card anyway. You’d get something like a Mac Studio or Strix Halo and deal with the slower speed.
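A rough break-even sketch; the card price and hosted rate are assumptions, and real hosted prices vary a lot by provider:

```python
gpu_price      = 25_000   # datacenter-class card, USD (assumed)
hosted_per_hr  = 2.50     # assumed hourly rate for a comparable hosted GPU
hours_per_year = 24 * 365

break_even_years = gpu_price / (hosted_per_hr * hours_per_year)
print(break_even_years)   # ~1.1 years of continuous 24/7 use; much longer at partial utilization
```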
So I wonder what I could be doing wrong. In the end I just use RTX 5080 as my models fit neatly in the available RAM.
* by not working at all, I mean the scripts worked, but results were wrong. As if H100 couldn't do maths properly.
It just means you CAN buy one if you want, as in they're in stock and "available", not that you can necessarily afford one.
Baseten: 592.6 tps
Groq: 784.6 tps
Cerebras: 4,245 tps
still impressive work
That said, we are serving the model at its full 131K context window, and they are serving 33K max, which could matter for some edge case prompts.
Additionally, NVIDIA hardware is much more widely available if you are scaling a high-traffic application.
Do you guys know a website that clearly shows which OS LLM models run on / fit into a specific GPU(setup)?
The best heuristic I could find for the necessary VRAM is number of parameters × (precision / 8) × 1.2, from here [0].
[0] https://medium.com/@lmpo/a-guide-to-estimating-vram-for-llms...
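That heuristic as a tiny helper; the 1.2 overhead factor is a rule of thumb, not a guarantee (a reply below suggests ~2x if you need headroom for concurrent traffic):

```python
def estimate_vram_gb(params_billions, precision_bits, overhead=1.2):
    return params_billions * (precision_bits / 8) * overhead

print(estimate_vram_gb(120, 4))   # ~72 GB for a 120B model at ~4-bit
print(estimate_vram_gb(27, 8))    # ~32 GB for a 27B model at 8-bit
```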
So in the end, trying to actually run them seems to be the only fool-proof way of knowing for sure :)
https://huggingface.co/settings/local-apps
Then on the model pages, it will show you whether you can use it.
Also, most of the time they are split up and, sometimes, you'll get an indicator on the splits.
It’s still a work in progress to check all hardware and model format compatibility but it’s a great start until GGUF becomes the standard.
Your equation is roughly correct, but I tend to multiply by a factor of 2, not 1.2, to allow for highly concurrent traffic.
While it is seemingly hard to calculate it, maybe one should just make a database website that tracks specific setups (model, exact variant / quantisation, runner, hardware) where users can report, which combination they got running (or not) along with metrics like tokens/s.
Visitors could then specify their runner and hardware and filter for a list of models that would run on that.
If anyone is working on training or inference in Rust, I'm currently working on adding fp8 and fp4 support to cudarc[0] and candle[1]. This is being done so I can support these models in our inference engine for Mixlayer[2].
[0] https://github.com/coreylowman/cudarc/pull/449 [1] https://github.com/huggingface/candle/pull/2989 [2] https://mixlayer.com
I'm just trying to figure out how wide the datastream through this is, in particular, the actual data (not the weights) that flow through all of it. The width of the output stream. Just how big is a token at the output, prior to reducing it with "temperature" to a few bytes?
Assume infinitely fast compute in a magic black box, but you have to send the output through gigabit ethernet... what's the maximum number of tokens per second?
I'm just trying to calculate the actual bandwidth required for the full output of the model, not just a token to be handed off to the user.
I need this so I can compute just what bandwidth a fully FPGA (later ASIC) based implementation of the model would result in.
Edit/Append: I asked GPT-5, and it estimated:
Total bytes = 50,000 tokens × 4 bytes/token = 200,000 bytes
Which sounds about right to me. This yields a maximum of about 500 logits/second on Gigabit ethernet. The actual compute of the model is peanuts compared to just shuffling the data around.
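For anyone redoing the arithmetic, the same estimate in a few lines; the 50k vocabulary and fp32 logits are the assumptions from the GPT-5 answer above:

```python
vocab_size      = 50_000
bytes_per_logit = 4                               # fp32
bytes_per_token = vocab_size * bytes_per_logit    # 200,000 bytes per logit vector

gbe_bytes_per_s = 1e9 / 8                         # gigabit ethernet, ~125 MB/s payload
print(gbe_bytes_per_s / bytes_per_token)          # ~625 vectors/s raw, ~500 with protocol overhead
```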
That’s 2880 values (so multiply by dtype)
It's sad that MLPerf takes a long time to catch up to SOTA models.
Yeah, according to the architecture it doesn't seem like a snowflake, but they also decided to invent a new prompting/conversation format (https://github.com/openai/harmony), which definitely makes it a bit of a snowflake today: you can't just use what worked a couple of days ago; everyone needs to add proper support for it.