A node of 8 H100s will run you $31.40/hr on AWS, so for all 96 you're looking at $376.80/hr. With 188 million input tokens/hr and 80 million output tokens/hr, that comes out to around $2/million input tokens, and $4.70/million output tokens.
This is actually a lot more than DeepSeek R1's rates of $0.10-$0.60/million input and $2/million output, but I'm sure major providers are not paying AWS p5 on-demand pricing.
Inference is more profitable than I thought.
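The arithmetic in the parent comment can be sketched as a quick back-of-envelope check (the rates and token throughputs are the commenter's figures; note this attributes the full cluster cost to input and output tokens independently, matching the comment's framing):

```python
# Back-of-envelope check of the per-token cost figures above.
# Assumed inputs (from the comment): $31.40/hr per 8xH100 p5 node,
# 96 GPUs total, 188M input and 80M output tokens per hour.
node_rate = 31.40                   # $/hr for one 8xH100 node
nodes = 96 // 8                     # 12 nodes for 96 GPUs
cluster_rate = node_rate * nodes    # total $/hr for the cluster

input_tokens_per_hr = 188e6
output_tokens_per_hr = 80e6

cost_per_m_input = cluster_rate / (input_tokens_per_hr / 1e6)
cost_per_m_output = cluster_rate / (output_tokens_per_hr / 1e6)

print(f"${cluster_rate:.2f}/hr cluster")           # $376.80/hr
print(f"${cost_per_m_input:.2f}/M input tokens")   # $2.00/M
print(f"${cost_per_m_output:.2f}/M output tokens") # $4.71/M
```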
34679•1h ago
Is that just the cost of electricity, or does it include the cost of the GPUs spread out over their predicted lifetime?
dragonslayer56•1h ago
Maybe the cost of renting?
mwcz•10m ago
That's silly, but the idea that "local" is not the opposite of remote is even sillier.
ffsm8•5m ago
Lots of people were advocating for running their k8s on bare-metal servers to maximize the performance of their containers.
Now, how that applies to your conversation... I've no clue, too little context ( 。 ŏ ﹏ ŏ )