frontpage.

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
411•klaussilveira•5h ago•93 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
765•xnx•10h ago•464 comments

Why I Joined OpenAI

https://www.brendangregg.com/blog/2026-02-07/why-i-joined-openai.html
29•SerCe•1h ago•24 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
136•isitcontent•5h ago•14 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
128•dmpetrov•6h ago•53 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
35•quibono•4d ago•2 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
240•vecti•7h ago•114 comments

A century of hair samples proves leaded gas ban worked

https://arstechnica.com/science/2026/02/a-century-of-hair-samples-proves-leaded-gas-ban-worked/
61•jnord•3d ago•4 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
307•aktau•12h ago•152 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
308•ostacke•11h ago•84 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
167•eljojo•8h ago•123 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
385•todsacerdoti•13h ago•217 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
313•lstoll•11h ago•230 comments

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

https://github.com/phreda4/r3
47•phreda4•5h ago•8 comments

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

https://infisical.com/blog/devops-to-solutions-engineering
103•vmatsiiako•10h ago•34 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
177•i5heu•8h ago•128 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
13•gfortaine•3h ago•0 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
231•surprisetalk•3d ago•30 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
968•cdrnsf•15h ago•414 comments

Learning from context is harder than we thought

https://hy.tencent.com/research/100025?langVersion=en
139•limoce•3d ago•79 comments

FORTH? Really!?

https://rescrv.net/w/2026/02/06/associative
39•rescrv•13h ago•17 comments

Evaluating and mitigating the growing risk of LLM-discovered 0-days

https://red.anthropic.com/2026/zero-days/
34•lebovic•1d ago•11 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
7•kmm•4d ago•0 comments

Show HN: Smooth CLI – Token-efficient browser for AI agents

https://docs.smooth.sh/cli/overview
76•antves•1d ago•56 comments

I'm going to cure my girlfriend's brain tumor

https://andrewjrod.substack.com/p/im-going-to-cure-my-girlfriends-brain
34•ray__•2h ago•10 comments

The Oklahoma Architect Who Turned Kitsch into Art

https://www.bloomberg.com/news/features/2026-01-31/oklahoma-architect-bruce-goff-s-wild-home-desi...
17•MarlonPro•3d ago•3 comments

Show HN: Slack CLI for Agents

https://github.com/stablyai/agent-slack
38•nwparker•1d ago•8 comments

Claude Composer

https://www.josh.ing/blog/claude-composer
101•coloneltcb•2d ago•69 comments

How virtual textures work

https://www.shlom.dev/articles/how-virtual-textures-really-work/
25•betamark•12h ago•23 comments

The Beauty of Slag

https://mag.uchicago.edu/science-medicine/beauty-slag
31•sohkamyung•3d ago•3 comments

vLLM large scale serving: DeepSeek 2.2k tok/s/H200 with wide-EP

https://blog.vllm.ai/2025/12/17/large-scale-serving.html
147•robertnishihara•3w ago

Comments

kingstnap•3w ago
Impressive performance work. It's interesting that we still see 40+% perf gains like this.

Makes you think the cost of a fixed level of "intelligence" will keep dropping.

whoevercares•3w ago
Absolutely. LLM inference is still a greenfield — things like overlap scheduling and JIT CUDA kernels are very recent. We’re just getting started optimizing for modern LLM architectures, so cost/perf will keep improving fast.
davidhyde•3w ago
vLLM needs to perform similar operations to an operating system. If you write an operating system in Python you will have scope for many 40% improvements all over the place and in the end it won’t be Python anymore, at least under the hood it won’t be.
menaerus•3w ago
It's not about Python at all. The optimization techniques are on a completely different level: at the level of the chip and/or hardware platform, and finding ways to utilize them maximally by exploiting the intrinsic details of their limitations.
danielhanchen•3w ago
Love vLLM!
androiddrew•3w ago
Now all we need is better support for AMD GPUs, both CDNA and RDNA types.
mappu•3w ago
ZLUDA implements CUDA on top of AMD ROCm - they are explicitly targeting vLLM as their PyTorch compatibility test: https://vosen.github.io/ZLUDA/blog/zluda-update-q4-2025/#pyt...

(PyTorch does also support ROCm generally, it shows up as a CUDA device.)
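
A quick way to check this, assuming a ROCm build of PyTorch and a supported AMD GPU (a sketch, not ZLUDA-specific):

    import torch

    # On ROCm builds the AMD GPU is exposed through the regular torch.cuda API,
    # so most CUDA-targeting code runs unchanged.
    print(torch.cuda.is_available())      # True on a supported AMD GPU
    print(torch.version.hip)              # HIP/ROCm version string (None on CUDA builds)
    print(torch.cuda.get_device_name(0))  # reports the AMD device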

ikari_pl•3w ago
I feel like these technologies are named by Polish speakers at these companies. "CUDA" means "WONDERS" and "ZŁUDA" would be an "ILLUSION".
Gracana•3w ago
ZLUDA was definitely intentional: https://github.com/vosen/ZLUDA/discussions/192
sofixa•3w ago
You can run vLLM with AMD GPUs supported by ROCm: https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/infer...

However from experience with an AMD Strix Halo, a couple of caveats: it's drastically slower than Ollama (tested over a few weeks, always using the official AMD vLLM nightly releases), and not all GPUs were supported for all models (but that has been fixed).

bildung•3w ago
vLLM usually only shows its strength when serving multiple users in parallel, in contrast to llama.cpp (Ollama is a wrapper around llama.cpp).

If you want more performance, you could try running llama.cpp directly or use the prebuilt lemonade nightlies.
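
To illustrate why: vLLM's throughput comes from batching many requests together, which a single sequential chat session never exercises. A minimal sketch of the offline batched API (the model name is just a placeholder):

    from vllm import LLM, SamplingParams

    # 256 prompts submitted at once - vLLM's scheduler batches them continuously,
    # which is where the throughput advantage over single-stream serving comes from.
    prompts = [f"Write one sentence about the number {i}." for i in range(256)]
    sampling = SamplingParams(temperature=0.7, max_tokens=64)

    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model
    outputs = llm.generate(prompts, sampling)
    for out in outputs[:3]:
        print(out.outputs[0].text)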

sofixa•3w ago
But vLLM was half the t/s of Ollama, so something was obviously not ok.
vessenes•3w ago
As a user of a lot of coding tokens, I'm most interested in latency - these numbers are presumably for heavily batched workloads. I dearly wish Claude had a Cerebras endpoint.

I'm sure I'd use more tokens because I'd get more revs, but I don't think token usage would increase linearly with speed: I need time to think about what I want to do and what's happened or is proposed. But I feel like I would be able to stay in flow state if the responses were faster, and that's super appealing.

snakepit•3w ago
Still have to update it for snakepit 0.11.0, but I did start a vLLM wrapper for Elixir

https://hex.pm/packages/vllm

behnamoh•3w ago
I couldn't care less, tbh. This speed is ridiculously high, to the point where tool calls, not inference, become the bottleneck.

Also, I'd rather run a large model at slower speeds than a smaller one at insanely high speeds.

spiderfarmer•3w ago
You care enough to comment, so you could in fact have cared even less.

Also, the entire industry profits from the work that’s done at the bleeding edge. That’s the case in every industry.

menaerus•3w ago
Well, the thing is that usage of these models is only increasing, so getting the most out of your hardware becomes a particularly interesting optimization point for companies doing inference at massive scale.
est•3w ago
Have you considered parallel processing? I always have 2-3 Cursor IDE windows open because I don't like waiting either.
bob1029•3w ago
Parallel tool calls do not work for my scenario. I can't ask a copy of my agent a question about something until a dependent call has resolved.

Tool use that changes the mode of the environment is a good example where you cannot go parallel. I've built a recursive agent that can run a Unity editor and I can't just blindly run whatever it wants in parallel or combos like SwitchScene -> GetSceneOverview won't interleave correctly. You'll wind up with 15 calls that loop over every scene and then you grab the overview from the last scene you switched to 15 times.

There are ways to hack around it a bit, but at some level the underlying narrative does need to be serialized or you'll be wasting an incredible amount of resources.

Depth-first search doesn't guarantee the best solution, but on average it finds some solution faster than breadth-first search. It's worth waiting for those dependent calls and going super deep if you want a reasonable answer quickly.
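
A minimal sketch of the ordering problem (the tool names are hypothetical, not the actual Unity agent API):

    import asyncio

    # Hypothetical tool stubs for illustration only.
    async def switch_scene(name: str) -> None: ...      # mutates global editor state
    async def get_scene_overview() -> str: ...          # reads whatever scene is active

    async def survey_scenes(scene_names: list[str]) -> dict[str, str]:
        overviews = {}
        # The environment is stateful: each overview depends on the preceding
        # switch, so the pairs have to run strictly in order.
        for name in scene_names:
            await switch_scene(name)
            overviews[name] = await get_scene_overview()
        return overviews

    # Firing these with asyncio.gather() instead would interleave switches and
    # reads, and every overview could end up describing the last scene switched to.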

dust42•3w ago
If I followed the links correctly this benchmark was made on a 16xH200. At current prices I'd assume that is a system price of around $750,000.

The year has 86400*365 = 31,536,000 seconds, so at roughly 2,000 tok/s about 63,072,000,000 tokens can be generated. As pricing is usually given per 1M tokens generated, that's 63,072 such packages.

Now let's write off the investment over 3 years: 250,000/63,072 = 3.96, so almost $4 per 1M tokens generated, with prompt processing included.

The model was a DeepSeek 671B MoE (~37B active parameters).

Looks to me like $20 for a month of coding is not very sustainable - let's enjoy the party while VCs are financing it! And keep an eye on your consumption...

Electricity costs seem negligible at ~$10,000 per year at 10 cents per kWh, but the overall cost would be ~10% higher if electricity is more like 30 cents, as it is in parts of Europe.

Edit: as other commenters point out, it is 2,200 tok/s per single GPU, so the result needs to be divided by 16: $4/16 = $0.25. This actually somewhat matches the DeepSeek API pricing.
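
The same arithmetic as a tiny script, with the per-GPU correction applied (all figures are the assumptions above):

    # Back-of-the-envelope hardware depreciation per 1M output tokens.
    system_price_usd = 750_000        # 16x H200 system (assumed)
    writeoff_years = 3
    tok_per_s_per_gpu = 2_200
    num_gpus = 16

    seconds_per_year = 86_400 * 365                      # 31,536,000
    tokens_per_year = tok_per_s_per_gpu * num_gpus * seconds_per_year
    million_token_units = tokens_per_year / 1e6          # ~1.11 million per year

    usd_per_1m = (system_price_usd / writeoff_years) / million_token_units
    print(f"~${usd_per_1m:.2f} per 1M tokens")           # ~$0.23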

yorwba•3w ago
It's 2.2k tokens per second per GPU, so you have to multiply the token output by 16, and the price per million tokens works out to 22.5 cents.
aurareturn•3w ago
I think they're also running this at a 16-bit quant. If they lower it to 8-bit, they might double their output, which might come out to about 11 cents per million tokens.

Now take into account that modern LLMs tend to use 4-bit inference, and that Blackwell is significantly more optimized for 4-bit, and we could see much less than 11 cents. Maybe a 5x speedup using 4-bit on Blackwell vs 8-bit on an H100?

So we're looking at potentially 2.2 cents per million tokens.

kicks66•3w ago
I think you missed something here - it's 2.2k tokens _per_ GPU.

So if you work that through, it's $0.225 per 1M output tokens.

edf13•3w ago
> let's enjoy the party while VCs are financing it!

The VC money is there until they can solve the optimization problems

supermatt•3w ago
> more like 30cts like it is in Europe

Nope - I live in one of the most expensive areas, and even the residential price has averaged 18c/kWh delivered, including taxes. Businesses get a lower base rate and also don't pay the VAT, so it works out to around 13c/kWh for them.

https://data.nordpoolgroup.com/auction/day-ahead/prices?deli...

t0mas88•3w ago
That's excluding tax, net prices around 0.20-0.30 EUR / Kwh we common.
supermatt•3w ago
I updated my comment to include my personal delivered rate including VAT - also note that businesses (like a data center) don't pay the VAT and have substantially reduced delivery fees at high voltage.
ffsm8•3w ago
Then you're living in one of the cheapest areas for electricity prices in Europe, the opposite of what you said.

https://ec.europa.eu/eurostat/statistics-explained/index.php...

Scroll a little down and you see a breakdown by country

E.g.

https://ec.europa.eu/eurostat/statistics-explained/index.php...

supermatt•3w ago
I am in Lithuania, which has one of the highest wholesale energy prices in Europe (as per nord pool): https://data.nordpoolgroup.com/auction/day-ahead/prices?deli...

That this doesn't translate into a higher cost to the consumer (as evidenced in your link) likely reflects other costs being incurred by the "average" consumer in those countries with a higher domestic rate - like massive markups from users being tied into inflated contracts after the 2022 shock, when rates across Europe were more than double what they are now.

Also, these are residential prices - business prices are usually much lower (wholesale discounts, subsidies, no VAT, lower delivery charges).

As per my response to the initial comment - there is no way a datacentre in Europe is paying 30c/kWh

menaerus•3w ago
Some countries also employ progressive electricity pricing, such that higher energy consumption leads to elevated kWh rates, incentivizing conservation. This is also not visible in the stats above. I also think that business kWh rates are actually higher than household rates in some instances.
supermatt•3w ago
Yeah, strictly business vs residential isn't a good comparison either really, as the lower transmission fees for medium (10kV+) and higher voltages are where a lot of the savings are - and obviously a lot of businesses don't use such power.
ffsm8•3w ago
Business prices should be figure 6 in my link. While the difference is a lot smaller, Lithuania is definitely one of the cheaper countries, beating the EU average slightly.

> As per my response to the initial comment - there is no way a datacentre in Europe is paying 30c/kWh

Hetzner prices it at 33c/kWh as of last year, I believe; previously it was 40c (after the pipeline was destroyed).

But Germany is pretty much among the 3 most expensive countries in the EU for electricity - both consumer and commercial pricing.

supermatt•3w ago
> Lithuania is definitely one of the cheaper countries

And yet has one of the highest wholesale rates...

> Hetzner prices it at...

Hetzner are reselling. They make a profit on energy resale. Their rate also includes a substantial buffer on the actual rate to account for volatility. Their rate is most likely less than half of what they are passing on for colo.

For reference, last year German industrial energy prices were around 10c/kWh INCLUDING taxes and network fees - and the government are looking to subsidize that further to target 5c/kWh: https://www.gleisslutz.com/en/know-how/germany-cuts-costs-el...

ffsm8•3w ago
You're talking about select industries which are being supported via subsidies; data centers are not included. If you pay attention to the wording in your cited article, they say so as well.

And Hetzner does not have a large markup on their energy prices; they're pretty much passing on the price as-is according to their own statements (from the large increase to 40c).

Almost all commercial applications need to pay the quoted prices around what's shown in figure 6

supermatt•3w ago
Ok, it seems I was mistaken that this subsidy applies to datacenters (apparently there are ongoing discussions to include them for this reason).

That said - I 100% don't believe that Hetzner are simply passing on the price for their colo clients. Where did you read that they are not making a profit off electricity resale?

Here is another link discussing industrial energy prices WITHOUT reductions: https://www.smard.de/page/en/topic-article/213922/216044

So less than 17c/kWh in 2024, and likely another 2c when adjusted for current wholesale prices and network fees.

ffsm8•3w ago
> That said - I 100% don't believe that Hetzner are simply passing on the price for their colo clients. Where did you read that they are not making a profit off electricity resale?

That's indeed probably untrue, you're most likely correct there.

The statement was wrt the increase (they're passing on the increase in cost, not that they're mirroring the cost of the energy provider!)

And after thinking about it some more, they absolutely have to make a significant upcharge, as they need to pay for wiring to the rented devices, large battery banks for temporary power failover, and finally diesel generators if power is down for an extended period of time (that has all been demoed by YouTubers like derBauer).

bonoboTP•3w ago
Net is excluding tax, you mean gross.
glemion43•3w ago
Private/end-consumer price in Germany is ~34c/kWh.
menaerus•3w ago
How did you arrive at the $10,000 electricity cost figure?

8x H200 enclosed in a DGX H200 system has a power draw of ~14 kW at peak (CTS) configuration/utilization. Over one year, and assuming maximum utilization, this is 123,480 kWh per single DGX H200 unit. We need 2x such units for the 16x H200 configuration discussed here, so it's 246,960 kWh/year. This is ~$25,000 at 10 cents per kWh and ~$74,000 at 30 cents per kWh. At ~1,110,000 1M-token batches per year, this gives: (1) ~$0.02-$0.07 per 1M tokens of energy cost and (2) ~$0.25 per 1M tokens assuming the same HW depreciation rate. In total, this is ~$0.30 per 1M tokens.

Seems sustainable?
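
A quick sanity check of those numbers (peak draw and full utilization are assumptions):

    # Energy cost per 1M tokens for 2x DGX H200 (~14 kW each) at full utilization.
    kwh_per_year = 2 * 14 * 24 * 365                         # ~245,000 kWh
    million_token_units = 2_200 * 16 * 86_400 * 365 / 1e6    # ~1.11M per year
    for usd_per_kwh in (0.10, 0.30):
        print(usd_per_kwh, round(kwh_per_year * usd_per_kwh / million_token_units, 3))
    # ~$0.02 and ~$0.07 per 1M tokens - small next to the ~$0.25 of HW depreciation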

dust42•3w ago
I used 700W per H200 = 11.2 kW for 16 GPUs. I didn't include the CPUs or the rest of the rack, so yours is a better approximation.

One has to keep in mind that the benchmark is synthetic. This makes sense because it's reproducible, but real-world usage may differ - e.g. in the amount of context and the number of concurrent users. Also, there are use cases where smaller models or smaller quants will do.

The key takeaway for me from this type of back-of-the-envelope calculation is a good idea of where we stand long term, i.e. when VC money stops subsidizing.

So for me, $0.30 per 1M tokens for a decent model looks pretty good too. Seeing that the OpenAI API charges $21 per 1M input tokens and $168 per 1M output tokens for GPT-5.2 pro, I was wondering what the real sustainable pricing is.

Palmik•3w ago
APIs are usually very profitable. As for subscriptions, it would depend on how many tokens the average subscriber uses per month. Do we have some source of info on this?

Some notes:

- # Input tokens & # output tokens per request matters a lot.

- KV Cache hit rate matters a lot.

- vLLM is not necessarily the most efficient engine.

- You are looking at API cost for DeepSeek V3.2, which is much cheaper than DeepSeek R1 / V3 / V3.1. DeepSeek V3.2 is a different architecture (sparse attention) that is much more efficient. The cheapest DeepSeek V3 option (fp8) tends to be ~$1/mil output tokens while R1 tends to be ~$2.5/mil (note that, for example, Together AI charges a whopping $7/mil output tokens for R1!)

As for the cost: You can also get H200s for ~ $1.6/hr and H100s for ~ $1.2/hr. That somewhat simplifies the calculations :)

Ignoring the caveats and assuming H200s, with their setup, each hour you will (rough numbers sketched below):

- Process 403,200,000 input tokens.

- Generate 126,720,000 output tokens.

- Spend $25.60.

- On Together with DS R1 it would cost you $3 * 403.2 + $7 * 126.7 = ~$2,096. Together does not even offer a discount for KV cache hits (what a joke :)).

- On NovitaAI with DS R1 it would cost you $0.7 * 403.2 + $2.5 * 126.7 = ~$600 (with perfect cache hit rate, which gives 50% discount on input tokens here, it would be ~$458).
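
The same hourly comparison as a rough sketch (the $1.6/hr H200 rental and the per-token prices are the assumptions above; the caveats about cache hits and engine choice still apply):

    # Hourly cost: renting 16x H200 vs. buying the same token volume from APIs.
    num_gpus, rental_per_gpu_hr = 16, 1.60
    out_mtok_per_hr = 2_200 * num_gpus * 3_600 / 1e6   # ~126.7M output tokens/hour
    in_mtok_per_hr = 403_200_000 / 1e6                 # input volume from the numbers above

    self_hosted = num_gpus * rental_per_gpu_hr                     # ~$25.60/hr
    together_r1 = 3.0 * in_mtok_per_hr + 7.0 * out_mtok_per_hr     # ~$2,096/hr
    novita_r1 = 0.7 * in_mtok_per_hr + 2.5 * out_mtok_per_hr       # ~$599/hr
    print(self_hosted, round(together_r1), round(novita_r1))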

rbanffy•3w ago
Very impressive numbers - I'd expect 2K tok/s on Cerebras hardware, not H200s.
mzl•3w ago
I don't think it would be economically viable to serve the full DeepSeek models on Cerebras hardware.
rbanffy•3w ago
I'm a huge fan of their hardware - we've been promised wafer-scale integration since the 1980s and they delivered it. It'd be a shame if their tech ended up a dead end.

On the bright side, they haven't started exploring stacking chips on top of their wafers to increase local memory, and every process change will bring increased bandwidth in and out of their "pizza". I really hope they succeed.

LTL_FTC•3w ago
Well, it looks like as you were typing your comment, a press release was going out announcing OpenAI's $10B investment in Cerebras.
mzl•3w ago
Oh, I'm also a fan. It is really cool to see what they've done. However, in the current systems they have available, they would (as far as I've understood it) just need way too many racks to serve the full DeepSeek model for the economics to work out. The main limiting factor is the amount of SRAM available per wafer.
mycelia•3w ago
Hey HN! I’m Seiji Eicher from Anyscale, one of the authors of this post :) Feel free to ask questions here.
menaerus•3w ago
Do you use agentic AI yet for this type of optimization work or no?
mycelia•3w ago
For my work personally, agentic AI usage is pretty standard SWE fare (Cursor/CC). Even within the engine, optimizations are often centered around things like increasing communication/compute overlap (this is called Dual-Batch Overlap in vLLM).

Probably there are more interesting/easily verifiable agent loops you could try for kernel optimizations. At this point, the best are still written by hand, though. Ex: DeepEP kernels https://github.com/deepseek-ai/DeepEP

Palmik•3w ago
Great work! What optimizations are you most excited about for 2026?
mycelia•3w ago
Lot of cool stuff coming up! As a Ray developer, I focus more on the orchestration layer, so I'm excited about things like Elastic Expert Parallelism, posttraining enhancements like colocated trainer/engines, and deploying DSV4 (rumors are the architecture will be complex). vLLM roadmap is here for reference: http://roadmap.vllm.ai/
aurareturn•3w ago
Are you using 16bit for inference? How many tokens/second if you use 8bit?

Given that SOTA models now use 4bit inference, can you do an estimation for 4bit + Blackwell?

mycelia•3w ago
Hi! This benchmarking was done w/ DeepSeek-V3's published FP8 weights. And Blackwell performance is still being optimized. SGLang hit 14k/s/B200 though, pretty cool writeup here: https://lmsys.org/blog/2025-09-25-gb200-part-2/
Palmik•3w ago
I wish there were more open benchmarks comparing different setups and different engines. There are so many knobs to tune (TP / DP / PP / PD / spec. decoding / etc.) and while the optimal setup will be highly dependent on the model, the environment and the traffic, it's likely some useful conclusions could be drawn.

It almost feels like in the past year there has been some unwritten agreement between the 3 main open-source engines (vLLM, SGLang, TRT-LLM) not to compare against each other directly :) They used to publish benchmarks comparing against each other quite regularly.