Why DeepSeek is cheap at scale but expensive to run locally

https://www.seangoedecke.com/inference-batching-and-deepseek/

319•ingve•1d ago

Comments

comrade1234•1d ago

I haven't looked for awhile but is deepseek online still about 1/100th the cost of its competitors?

ALLTaken•1d ago

I don't know the exact cost-breakdown, but they've come up with a few really inspiring and qualitatively high value papers that demonstrate how they further increased efficiency at their scale. Along with it they also published quite a few repositories with fully open-source code.

I stopped using ChatGPT as it was just reinforcing my prompts and not ever giving deeper insights, except something I call manipulative behaviour.

DeepSeek was seriously cool, but it started behaving similar to Google Gemini Pro, which just tries to be lazy, if you give it a hard task to chew on. It basically gives you patch-files instead of printing out the whole code, which is more tedious doing manually, than c/p the code.

It also started indexing our private repository and some corporate repositories that were on GitHub behind MFA and stringent lock. Definitely illegal.

diggan•1d ago

> It also started indexing our private repository and some corporate repositories that were on GitHub behind MFA and stringent lock. Definitely illegal.

What is "it" in this context, the DeepSeek weights? Sounds like you're talking about some application, but AFAIK, DeepSeek doesn't maintain any applications, only their API + released weights.

ALLTaken•14h ago

ChatGPT indexed our private GitHub repositories and those our corporations I worked with.

simianwords•1d ago

How did it have access to your private repo and how did you find out?

ALLTaken•1d ago

I made a video of it with a friend. The repository is of a large corporate automative industry company. I also have my own private repositories which were always private and OpenAI printed my files in the first prompt. When I prompted again it acted as if it didn't know. But my friend tried on his account and could access the Corp and my private repository without ever being linked.

The Corporate repository was of Volkswagen. It's quite serious of a breach. I only gave it the name of the repository and it printed the files, which shouldn't be possible.

Maybe OpenAI exploits Microsoft to access GitHub fully to train their AI on all of humanity's code for free, violating privacy, security, IP and copyright.

Legend2440•1d ago

>I only gave it the name of the repository and it printed the files, which shouldn't be possible.

Are you sure these weren't just plausible guesses at file names? It's just a hallucination.

I asked it for the list of files in some public repositories (which are definitely in the training data) and it gave me a plausible-but-wrong list of files. It can't remember that kind of detail.

ALLTaken•8h ago

I am sure about that, I also thought about hallucination, but it was precise in the first prompt. 2nd and follow-ups had plausible denial and created similar, but clearly different code.

It could even print the list of files and their exact names if triggered right. They may have surely patched that, so that nobody sues them with digital proof. But we recorded it. It was when their new model came out. Don't remember the date, but a few months ago. We have two videos and different repositories it should not have access to at all.

Microsoft owns GitHub. OpenAI has a multi-billion dollar investment from Microsoft and access to their Infrastructure "for training" and seems likely, they got access to GitHub. Something that they shouldn't do, since that's illegal and very unethical.

singularity0808•1d ago

ChatGPT is reinforcing your prompts, DeepSeek is cool but starts acting lazy like Gemini.

So what are you working with now? Deepseek or something else?

ants_everywhere•1d ago

> as it was just reinforcing my prompts and not ever giving deeper insights, except something I call manipulative behaviour.

Try telling Deepseek you want to murder political dissidents. In my experiments Deepseek will start enthusiastically reinforcing your prompts.

johnisgood•1d ago

It just simply does its job. We can add sorts of arbitrary safeguards, but then what is the point of using an LLM? Perhaps local modals are the future, because reverse engineers may not even be able to use the new Claude (just read its system prompt to not help with backdoors, and so forth).

ants_everywhere•1d ago

Yes that's true. But in this case it's the (probably) unintended consequence of an intentional safeguard. Namely, Deepseek has an obligation to spread the Chinese version of socialism, which means it's deliberately trained on material advocating for or justifying political violence.

johnisgood•1d ago

Well, I do not like that, for sure. Just put the politics and all that aside, I think it should lean towards neutrality, even if humans cannot... they should still make the LLM more neutral instead of pushing their own agenda, see Grok and white genocide in South Africa (Elon Musk's political opinion).

MangoToupe•1d ago

Is this a reference to something? Political dissidents relative to which state? Does it change if you swap out the states? How did you discover this to begin with? Why did you initially suggest murdering political dissidents?

this comment really raises so many questions I must have missed something

Still, chatbots are just as vulnerable to state-driven propaganda as the rest of us. Probably even more so. I imagine if you just referred to dissidents as "terrorists" the rhetoric would fit right in in most opinion pages across the globe. The distinction between "terrorist" and "dissident" and "freedom fighter" seems quite subjective. I probably would avoid such heavily connoted floating signifiers if you want the chatbot to be useful.

LLMs have nothing to contribute to political discourse aside from regurgitation of propaganda. Almost by definition.

ants_everywhere•1d ago

Starting at the end

> LLMs have nothing to contribute to political discourse aside from regurgitation of propaganda. Almost by definition.

I don't think this is true. LLMs should be well-positioned to make advances in political science, game theory, and related topics.

> Is this a reference to something?

It's just a reference to my experiments. I filmed some of them. There's a tame version here [0] where I just prompt it to tell the truth. I also have a less tame version I haven't posted where I lie and say I work for an intelligence agency.

The underlying mechanic is that Deepseek has built-in obligations to promote revolutionary socialism.

> Political dissidents relative to which state? Does it change if you swap out the states?

Relative to China or any socialist state. Yes it will change if you change the states because it was trained to comply with Chinese regulations.

> How did you discover this to begin with?

I asked to to honestly describe its training and then started trolling it when it told me it was essentially created for propaganda purposes to spread Chinese values abroad.

> Why did you initially suggest murdering political dissidents?

I wanted to check what its safeguards were. Most LLMs refuse to promote violence or unethical behavior. But revolutionary socialism has always devoted a lot of words to justifying violence against dissidents. So I was curious whether that would show up in its training.

> I imagine if you just referred to dissidents as "terrorists" the rhetoric would fit right in in most opinion pages across the globe.

First of all, terrorists are by definition violent offenders. Dissidents are not. When you ask Deepseek to help identify dissidents it tells you to look for people who frequently complain about the police or the government. In the US that would include large swaths of Hacker News.

Second, most people in countries like the US don't support murdering terrorists and most LLMs would not advocate that. In the US it's rare for people to advocate killing those opposed to the government. Even people who try to violently overthrow the government get trials.

[0] https://www.youtube.com/watch?v=U-FlzbweHvs

MangoToupe•1d ago

Do you think LLMs don't further the propaganda emanating from the US? I don't even know how you would start to excise that, especially if you don't agree with foreigners on what's propaganda vs just "news" or whatever.

I have quite a few Chinese friends, both on mainland and throughout south-east asia, and I can speak a little mandarin, and I can read quite a bit of Chinese. My friends complain about the PRC quite a bit. But I find it telling that this complaint specifically—authoritarian political oppression—seems to mostly come from the west, and especially from the US. And it's true that we can say obscene things to the president's face and not get locked up. I don't think that's necessarily the "gotcha" you think it is, though—we're really good at complaining, but not so good at actually fixing. Which feels increasingly more embarrassing than restrictions on speech.

Edit: I suppose I'm a bit unfair. A lot of folks in our sphere of influence in east asia say stuff like this, too. But the contrast between the folks I know who literally live in china and americans feels striking to me.

> But revolutionary socialism has always devoted a lot of words to justifying violence against dissidents.

It is very difficult to take the political opinions of people who talk like this seriously.

> LLMs should be well-positioned to make advances in political science, game theory, and related topics.

I'm struggling to understand what this might look like, and I find the argument that nuclear warfare being related to game theory to be extremely dubious. Cuz if it really held that strongly, we should be handing out nukes like candy.

ants_everywhere•23h ago

> It is very difficult to take the political opinions of people who talk like this seriously.

This tells me you haven't read the literature.

I've probably seen 150 versions of the comment you made, but almost everyone tries to explain why the violence is justified.

People rarely try to deny that revolutionary socialism is a violent ideology since every major writer from Marat to Marx to Lenin to Mao has explicitly advocated violence against civilian non-combatants. Some, like Marx, even explicitly call it terror (as in terrorism).

MangoToupe•1h ago

Can you tell me what you're referring to? Of course I've read the literature.

> People rarely try to deny that revolutionary socialism is a violent ideology since every major writer from Marat to Marx to Lenin to Mao has explicitly advocated violence against civilian non-combatants.

Yea, that's a very different thing than murdering "dissidents." Capitalists use (state) violence to maintain power; violence is necessary to seize power and create your own state. That was Mao. We are now many decades later and any "revolutionary socialist" in the area would be trying to overthrow the government by definition.

China isn't very indicative of revolutionary socialism, and revolutionary socialism comes in dozens or hundreds of different conflicting flavors. Even Lenin and Stalin argued over many things including how they should treat what we would now call "small business owners", and Stalin won in the end (mostly because Lenin died, but still).

Why don't you paint other ideologues (i.e. capitalists) with the same broad brush? It's not like they're any less violent in their suppression of threats to their power. Ever hear of vietnam? or the korean war?

im3w1l•1d ago

I think many Americans, probably the majority, support murdering foregin terrorists. GITMO is still not closed btw.

Spooky23•1d ago

> Second, most people in countries like the US don't support murdering terrorists and most LLMs would not advocate that. In the US it's rare for people to advocate killing those opposed to the government.

Many are happy to send “them” off to Central America, where someone else will murder them. The government may make mistakes, but you need to break some eggs to make an omelet.

Hilift•1d ago

> LLMs have nothing to contribute to political discourse

A non-trivial percentage of the population is easily influenced, which is leveraged by social media being there 24x7. It's likely that LLMs will be there to craft political messages, themes, and campaigns, perhaps as early as the US mid term elections. Look at JD Vance traveling the globe stating that the US will be the world leader in AI, with none of the limits/guardrails that were discussed in Europe in February. AI-driven discourse, AI-created discourse.

https://www.marketingaiinstitute.com/blog/jd-vance-ai-speech

MangoToupe•1d ago

100% agree with this, but I am definitely not endorsing that we should use LLMs to propagate propaganda.

I also think the whole "safety" thing was just befuddling. You can't regulate software, not really, just its commercial sale

Spooky23•1d ago

We can and should regulate software being used to shape public opinion. It’s probably the great threat of our generation.

MangoToupe•1d ago

I mean we can and should try, but laws mostly stop honest people from hurting each other. But the underlying software is inherently out there and you can't put the toothpaste back in the tube.

Spooky23•1d ago

Bro, already happened. There has been consultants pushing social media bots for that purpose almost immediately after these models became available.

Do you really think those armies of idiot commentators are all real? The agent provocateur is usually a bot. You see it here sometimes on Russia stories.

VectorLock•1d ago

>It basically gives you patch-files instead of printing out the whole code

I've noticed on the Aider leaderboard that Google Gemini Pro has an "Edit Format" listed as "diff-fenced" and things like ChatGPT have "architect" edit format where Aider asks separate "architect" and "code" models. Seems like Gemini Pro prefers the diff format.

zxexz•1d ago

The diff-fenced is iirc specific to Gemini models, they really don’t like the file path outside of the fence. The architect mode still uses one of the other edit format, the prompt just ends up a little different.

ALLTaken•8h ago

Maybe they optimized for one of their TREEE SOLUTIONS:

- https://Jules.google

- NotebookLm

- GoogleCollab

How can a company have 3 contenders to Windsurf and Cursor, which are VSCode forks with a little sugarcoating and not make any impact?? The CPO should be fired.

I think also after seeing Google Gemini's Video that their entire department is now fully Indian, including the CEO. If that isn't racially biased, then idk. See yourself: https://www.youtube.com/watch?v=6GO7bPb5cTA&t=2270s

ALLTaken•1d ago

I met a Googler when I was in Dubai for an event and he shared that he and others had access to LLMs internally for many years before it was made popular by OpenAI.

I know Google has an internal AI everything policy, maybe they internally have awesome tools to rearchitect everything based on diffs and in the typical google way they adapted it to their own internal tools. You know, Google.. like they don't give a damn about the user, the product design or actually anything other than profit/roi.

So many great discontinued products.. I think they killed RSS.

ashirviskas•1d ago

> DeepSeek was seriously cool, but it started behaving similar to Google Gemini Pro

You should be able to use the version of DeepSeek that you prefer indefinitely if you host it yourself or choose that specific version with your preferred provider.

zxexz•1d ago

You should self host not trust a third party application if you run into either of those things. The weights are open. DeepSeek didn’t change, the application you’re accessing it through did.

Or use an enterprise-ready service. Bedrock, firecracker, etc

ALLTaken•1d ago

I like your thinking. Nobody can use ChatGPT offline or retrain it, but DeepSeek is fully opensource. It's technology, I don't care which country made it, if it's high quality engineering, it's just that. The data it was trained on doesn't matter if you can train a wholly new model using the exact same principles and stack they opensourced with your own data. Which is really awesome.

I use openrouter.ai to have no timeouts and offtimes, since DeepSeek seems to get DDoS attacks somehow, or there are too many users, idk.

ElectricalUnion•17h ago

> Nobody can use ChatGPT offline or retrain it, but DeepSeek is fully opensource.

Well, you likely can't train DeepSeek yourself either.

You most likely:

* you philosophically don't have all the training data to train it yourself (so the claim it's opensource or open-whatever are dubious in the first place);

* you don't have the compute to "press the train button" and getting the weights back before the sun expires. While considered ridiculously ground-breaking cheap, those costs were still estimated to be around 6 million USD (DeepSeek claimed the model training took 2,788 thousand H800 GPU hours, which, at a cost of $2/GPU hour, comes out to a "mere $5.576 million"). I remember that when it was released, the mere thought that "people" cound "train AI cheaply with only 6 million USD" made one of the worst drops in the Nvidia valuation.

ALLTaken•7h ago

This is really not true my friend. I would love to help you if I had some more time, but let me look for a tutorial.

Because the FineWeb Dataset is already super good. You can train 7B or 32B Param models at home

The >600B Param model isn't really using all the data effectively, but with a MacStudio Farm you can also train that one at home (if you have enough money to buy at least 100).

Here's the easy way: https://github.com/FareedKhan-dev/train-deepseek-r1

More details: https://www.bentoml.com/blog/the-complete-guide-to-deepseek-...

Here's how DeepSeek-R1-Zero was built, basically from 0 to Hero, including weights the FULL Training Data and everything you need to get it running locally or on servers.https://medium.com/@GenerationAI/how-deepseek-r1-zero-was-re...

For $30 USD you can also train a small DeepSeek at home!

https://github.com/Jiayi-Pan/TinyZero

https://huggingface.co/deepseek-ai/DeepSeek-R1-Zero (the model)

davidmurdoch•1d ago

Had Gemini 2.5 Pro preview running in agent mode in VSCode on a 3000+ line file. It patched it to about 200 lines with a comment in the middle: "// the rest of the code is unchanged".

ALLTaken•1d ago

Exactly my experience too and it's soo annoying. It doesn't matter how you prompt it or what your system prompt is. It tries to end the session as early as possible, claiming to have fulfilled everything. Although it just causes more work for the user, less for itself. The tokens saved are easily multiplied by the amount you have to prompt it again.

This I experienced partially in DeepSeek since their recent update too, not as aggresively as in Gemini 2.5 Pro, but similar lazyness or cleverness, if you may call that clever.

clippyplz•1d ago

Depends on who you think its competitors are - deepseek-chat ($0.27/M in; $1.10/M out) is twice as expensive as Gemini 2.5 Flash ($0.15; $0.60) but far cheaper than Claude Sonnet 4 ($3; $15).

Hilift•1d ago

That was a pretty good back to reality flex. There really isn't much of a market for expensive products. An inexpensive product that has a few tradeoffs will probably have the advantage. Given how proficient China is at accessing technology resources, it seems likely to me that any chip sanctions against them will probably not be effective.

dist-epoch•1d ago

1/10-20th is a more realistic ratio.

perching_aix•1d ago

For those looking to save time, the answer is batched inference. Pretty much running multiple people's "prompts" through a model instance at the same time instead of just really tightly timesharing each model instance.

This is also why you may experience a variance in replies when using these services, even when you set the temperature to 0 and the seed to a fixed value. It's cause you don't control the other prompts yours get batched with. Could this be a data exfiltration attack vector? Probably, I didn't "research" that far.

pcwelder•1d ago

> other prompts yours get batched with

Why would batching lead to variance?

Hendrikto•1d ago

Because these models are context-sensitive. Every token can influence the output.

simianwords•1d ago

I believe they are talking about latency variance. Batching can increase variance because you may have to wait for enough prompts to get to the batch size.

perching_aix•1d ago

No, I meant that the responses will be different run-to-run. [0]

[0] https://152334h.github.io/blog/non-determinism-in-gpt-4/

exe34•1d ago

Variance based on actual randomness would be one thing, but to me variance based on what other people are running seems concerning, for reasons I can't quite articulate. I don't want the model to reply to a question in one domain based on what a large group of other people are thinking in a different domain (e.g. if they're discussing the news with chatgpt).

zackangelo•1d ago

This definitely happens, and I'm surprised it's not talked about more often. Some attention kernels are more susceptible to this than others (I've found that paged attention is better than just naive attention, for example).

exe34•1d ago

To be fair, I suppose people do it too - if you ask me a question about A, often as not the answer will be coloured by the fact that I just learnt about B.

immibis•1d ago

But not the tokens that don't even feed into your output because they're feeding into someone else's output. Separate items in batches don't get mixed up with each other - they just run the model separately on each item at the same time, like SIMD.

jerpint•1d ago

Batching can lead to variance with things like batchnorm but most transformers use layer norm to avoid this problem

amelius•1d ago

Batchnorm can only have an effect between batches during training, not inference.

kouteiheika•1d ago

> Why would batching lead to variance?

Depending on the shape of the data a slightly different kernel implementation (for e.g. matrix multiplication, etc.) will be the most optimal, and those will give slightly different results. There could also be other sources of non-determinism depending on the implementation (e.g. some kernels are inherently not entirely deterministic as they use tricks to go faster).

zxexz•1d ago

Yep, this. I see a lot of other worryingly confident answers in the thread that are wrong.

SGLang finally has at least some notes[0], but I’m always surprised there isn’t more of a community wide effort to trace down the sources of indeterminism.

[0] https://docs.sglang.ai/references/faq.html

bhickey•1d ago

Some of the non-determinism mentioned above manifests as sensitivity to _where_ data falls within a batch.

tough•1d ago

In my experience with other regular models, once the context starts to fill up, quality starts to degrade.

wouldn't getting batched at the end of a batch, have a similar -effect- on the results, where your prompt might recieve overall less attention focused into it, if the context window is almost full?

Idk just going by the vibes

delusional•1d ago

> not entirely deterministic

There's a Nobel prize waiting for you if that's the case. I'll assume you meant theoretically consistent or accurate.

empiko•1d ago

In some mixture-of-experts approaches, samples or tokens are being distributed among experts. The experts are selected by trying to predict what is a good expert-sample match. Depending on your neighbors in the batch, you might be assigned different experts.

imtringued•1d ago

Attention doesn't get batched and the runtime of attention for a given users token depends on the total context length. Hence even in the ideal scenario of you getting a dedicated attention calculating GPU, the MLP calculating GPU doing batching will have to wait for the slowest user.

In the worst case scenario you are sharing a single attention calculating GPU with someone who has a super long context window, then that guy will be hogging most of the memory bandwidth of the GPU, even though you both are generating the same quantity of tokens.

This means that in the distributed setting, you will not only need dedicated GPUs for the model and attention calculations, you will also need to duplicate the whole setup for a variety of context lengths, so that long contexts are batches alongside other long contexts and short contexts are batches alongside other short contexts.

yjftsjthsd-h•1d ago

> Pretty much running multiple people's "prompts" through a model instance at the same time instead of just really tightly timesharing each model instance.

I naively assumed providers did that with all models. Or does it only work for this (family of?) model(s)?

hansvm•1d ago

It works for a lot of families but not all. You need a high enough degree of sharing of model weights between different queries for that to make sense (memory access being the usual bottleneck nowadays, though smaller models see something similar with matmul batch efficiencies for CPU related reasons).

Fully connected transformers trivially work (every weight for every query). MoE works beyond a certain size or with certain types of mixing (still using every weight, or using a high enough fraction that there's some sharing with batches of 20+ queries). As you push further that direction though (lots of techniques, but the key point being accessing less of the model at once and bypassing some of it for each query), you need larger and larger batches for those efficiency gains to materialize. At some point it becomes untenable because of latency waiting for batches of data, and past that it becomes untenable because of the volume of query data.

VectorLock•1d ago

Sounds like an amazing attack vector if your prompts get mixed with other's.

taneq•1d ago

Wow, almost like Deepseek’s impressive performance is the result of optimisation by smart engineers.

perching_aix•1d ago

Not sure why the snarky tone, didn't say or imply otherwise, nor did anyone else in the thread so far that I could see.

energy123•1d ago

What's the average batch size?

larodi•1d ago

Batching. Yes.

And one thing it can help you locally is when you rate certain content and want to make sure it didn’t hallucinate. So you toss 3 or 5 times or… batch_size times .)

Curious that batch if has been there from day one, but it takes a while for people to see/grasp/grok it.

jsnell•1d ago

I'm not a ML research or engineer, so take this with a grain of salt, but I'm a bit confused by this post.

Deepseek V3/R1 are expensive to run locally because they are so big compared to the models people usually run locally. The number of active parameters is obviously lower than the full model size, but that basically just helps with the compute requirements, not the memory requirements. Unless you have multiple H100s lying around, V3/R1 are only run locally as impractical stunts with some or all the model being stored on low bandwidth memory.

We can't compare the size of Deepseek V3 to that of any proprietary frontier models because we don't know the size of those models at all (or even their architecture). The models being compared to that are "expensive at scale" you can't run locally at all, but surely we have no reason to believe that they'd somehow be cheap to run locally?

But I thought you'd typically expect exactly the opposite effect than is claimed here? MoE should be the better tradeoff for the local/single-user scenario since the downside of batching being harder / less efficient doesn't matter.

> Bigger batches raise latency because user tokens might be waiting up to 200ms before the batch is full enough to run, but they boost throughput by allowing larger (and thus more efficient) GEMMs in the feed-forward step

Is it really that the matrixes being multiplied are larger? My mental model is that the purpose of batching isn't to get larger input matrices. It's to move the bottleneck from memory bandwidth to compute. The matrices are already sharded to a much smaller size than the size of the entire model or even layer. So you'll basically load some slice of the weights from the HBM to SRAM, do the multiplication for that slice, and then aggregate the results once all tiles have been processed. Batching lets you do multiple separate computations with the same weights, meaning you get more effective FLOPS per unit of memory bandwidth.

> The fact that OpenAI and Anthropic’s models are quick to respond suggests that either:

Is that actually a fact? The post has no numbers on the time to first token for any of the three providers.

gfysfm•1d ago

Hi, I wrote the post! Also not a ML researcher, just an interested engineer, so I'm sure I got some things wrong.

> MoE should be the better tradeoff for the local/single-user scenario since the downside of batching being harder / less efficient doesn't matter.

What I meant was that the single-user scenario is going to get dramatically worse throughput-per-GPU, because they're not able to reap the benefits of multi-user batching (unless they're somehow doing massively parallel inference requests, I suppose).

> Is it really that the matrixes being multiplied are larger? My mental model is that the purpose of batching isn't to get larger input matrices. It's to move the bottleneck from memory bandwidth to compute.

As I understand it, you want larger input matrices in order to move the bottleneck from memory to compute: if you do no batching at all, your multiplications will be smaller (the weights will be the same, of course, but the next-token data you're multiplying with the weights will be 1xdim instead of batch-size x dim), so your GPUs will be under-utilized and your inference will spend more time doing memory operations and less time multiplying.

> The post has no numbers on the time to first token for any of the three providers.

I probably should have hunted down specific numbers, but I think people who've played with DeepSeek and other models will notice that DeepSeek is noticeably more sluggish.

yekanchi•1d ago

this statement holds true for all large parameter open weight models.

freehorse•1d ago

> mixture of experts requires higher batch sizes

Or apple silicon for low batch size (=1 ideally). The unified memory allows for running larger models on the expense of them running slower, because of lower bandwidth/flops than a normal gpu. But MoEs require computing only few parameters every time, so the computational needs are low. I have seen people reporting decent speeds for deepseek for single batch inference on macs. It is still expensive though to many people's standards because it requires a lot of $$$ to get enough memory.

In some ways, MoE models are perfect fit for macs (or any similar machines that may come out). In contrast, ordering a mac with upgraded ram size and running dense models that just fit in the vram can be very painful.

DavidSJ•1d ago

Here’s a concise explanation:

- High sparsity means you need a very large batch size (number of requests being processed concurrently) so that each matrix multiplication is of sufficient arithmetic intensity to get good utilization.

- At such a large batch size, you’ll need a decent number of GPUs — 8-16 or so depending on the type — just to fit the weights and MLA/KV cache in HBM. But with only 8-16 GPUs your aggregate throughput is going to be so low that each of the many individual user requests will be served unacceptably slowly for most applications. Thus you need more like 256 GPUs for a good user experience.

zxexz•1d ago

I’m serving it on 16 H100s (2 nodes). I get 50-80 tok/s per request, and in aggregate I’ve seen several thousand. TTFT is pretty stable. Is faster than any cloud service we can use.

majke•1d ago

Using vllm?

zxexz•1d ago

Oh, SGLang. Had to make a couple modifications, I forget what they were, nothing crazy. Lots of extra firmware, driver and system config too.

latchkey•1d ago

You could do it on one node of 8xMI300x and cut your costs down.

zackangelo•1d ago

H200s are pretty easy to get now. If you switched I'm guessing you'd get a nice bump because the nccl allreduce on the big mlps wouldn't have to cross infiniband.

DavidSJ•1d ago

You're presumably using a very small batch size compared to what I described, thus getting very low model FLOP utilization (MFU) and high dollar cost per token.

almostgotcaught•1d ago

> High sparsity means you need a very large batch size

I don't understand what connection you're positing here? Do you think sparse matmul is actually a matmul with zeros lol

DavidSJ•1d ago

It's sparse as in only a small fraction of tokens are multiplied by a given expert's weight matrices (this is standard terminology in the MoE literature). So to properly utilize the tensor cores (hence serve DeepSeek cheaply, as the OP asks about) you need to serve enough tokens concurrently such that the per-matmul batch dimension is large.

almostgotcaught•1d ago

i still don't understand what you're saying - you're just repeating that a sparse matmul is a sparse matmul ("only a small fraction of tokens are multiplied by a given expert's weight matrices"). and so i'm asking you - do you believe that a sparse matmul has low/bad arithmetic intensity?

DavidSJ•1d ago

An MoE's matmuls have the same arithmetic intensity as a dense model's matmuls, provided they're being multiplied by a batch of activation vectors of equal size.

dist-epoch•1d ago

Do the individual requests in a batch influence each-other?

Not in a floating point non-deterministic kind of way, where exact ordering might introduce some non-determinism (begin position 5th versus being position 10th in the batch lets say).

I'm asking in a semantic way, can context from one request leak into another because they are in the same batch?

ipieter•1d ago

This is an interesting blogpost. While the general conclusion ("We need batching") is true, inference of mixture of experts (MoE) models is actually a bit more nuanced.

The main reason we want big batches is because LLM inference is not limited by the compute, but my loading every single weight out of VRAM. Just compare the number of TFLOPS of an H100 with the memory bandwidth, there's basically room for 300 FLOP per byte loaded. So that's why we want big batches: we can perform a lot of operations per parameter/weight that we load from memory. This limit is often referred to as the "roofline model".

As models become bigger, this does not scale anymore because the model weights will not fit into GPU memory anymore and you need to distribute them across GPUs or across nodes. Even with NVLink and Infiniband, these communications are slower than loading from VRAM. NVlink is still fine for tensor parallelism, but across nodes this is quite slow.

So what MoE allows is expert parallelism, where different nodes keep different experts in memory and don't need to communicate as much between nodes. This only works if there are enough nodes to keep all experts in VRAM and have enough overhead for other stuff (KV cache, other weights, etc). So naturally the possible batch size becomes quite large. And of course you want to maximize this to make sure all GPUs are actually working.

zozbot234•1d ago

You could load different "experts" in a round-robin way on a single node and only aggregate "batches" opportunistically, when you just have multiple requests in-flight that all happen to rely on the same "expert". The difference being that instead of "batches", you would only really have queues. Of course this would come with a sizeable increase in latency, but that's acceptable for many applications (such as for "deep research" workflows)

jchrisa•1d ago

This is very much like Erlang's actor model. The same compute can be run in parallel, or managed via queues. With Erlang's strong support for FFI and process control, I wonder if it's being used as a dispatcher for these sorts of workloads.

iwontberude•1d ago

And this is the investment case for AMD, models fit entirely in a single chassis, and side benefit: less tariffed network equipment to interconnect compute. Map/reduce instead of clustered compute.

Edit: when downvoting, please offer some insight why you disagree

dragonwriter•1d ago

How is the a unique advantage for AMD?

latchkey•1d ago

AMD is consistently stacking more HBM.

  H100 80GB HBM3
  H200 141GB HBM3e
  B200 192GB HBM3e

  MI300x 192GB HBM3
  MI325x 256GB HBM3e
  MI355x 288GB HBM3e

This means that you can fit larger and larger models into a single node, without having to go out over the network. The memory bandwidth on AMD is also quite good.

krapht•1d ago

So the MI300x has 8 different memory domains, and although you can treat it as one flat memory space, if you want to reach their advertised peak memory bandwidth you have to work with it like an 8-socket board.

latchkey•1d ago

Here is a good article on it:

https://rocm.blogs.amd.com/software-tools-optimization/compu...

ryao•1d ago

It really does not matter how much memory AMD has if the drivers and firmware are unstable. To give one example from last year:

https://www.tomshardware.com/pc-components/gpus/amds-lisa-su...

They are currently developing their own drivers for AMD hardware because of the headaches that they had with ROCm.

latchkey•1d ago

"driver" is such a generic word. tinygrad works on mi300x. If you want to use it, you can. Negates your point.

Additionally, ROCm is a giant collection of a whole bunch of libraries. Certainly there are issues, as with any large collection of software, but the critical thing is whether or not AMD is responsive towards getting things fixed.

In the past, it was a huge issue, AMD would routinely ignore developers and bugs would never get fixed. But, after that SA article, Lisa lit a fire under Anush's butt and he's taking ownership. It is a major shift in the entire culture at the company. They are extremely responsive and getting things fixed. You can literally tweet your GH issue to him and someone will respond.

What is true a year ago isn't today. If you're paying attention like I am, and experiencing it first hand, things are changing, fast.

ryao•1d ago

I have been hearing this about AMD/ATI drivers for decades. Every year, someone says that it is fixed, only for new evidence to come out that they are not. I have no reason to believe it is fixed given the history.

Here is evidence to the contrary: If ROCm actually was in good shape, tinygrad would use it instead of developing their own driver.

latchkey•1d ago

We have all been hearing things for decades. Things are noticeably different now. Live in the present, not in the past.

Tinygrad isn’t a driver. It is a framework. It is being developed by George however he wants. If he wants to build something that gives him more direct control over things. Fine. Others might write PTX instead if using higher level abstractions.

Fact is that tinygrad runs not only on AMD, but also Nvidia and others. You might want to reassess your beliefs because you’re reading into things and coming up with the wrong conclusions.

ryao•18h ago

I read tinygrad’s website:

https://tinygrad.org/#tinygrad

Under driver quality for AMD, they say “developing” and point to their git repository. If AMD had fixed the issues, they would instead say the driver quality is great and get more sales.

They can still get sales even if they are honest about the state of AMD hardware, since they sell Nvidia hardware too, while your company would risk 0 sales if you say anything other than “everything is fine”, since your business is based on leasing AMD GPUs:

https://hotaisle.xyz/pricing/

Given your enormous conflict of interest, I will listen to what George Hotz and others are saying over what you say on this matter.

latchkey•17h ago

Exactly, it is not a driver.

Appreciate you diving more into my business. Yes, we are one of the few that publishes transparent pricing.

When we started, we got zero sales, for a long time. Nobody knew if these things performed or not. So we donated hardware and people like ChipsAndCheese started to benchmark and write blog posts.

We knew the hardware was good, but the software sucked. 16 or so months later, things have changed and sufficiently improved that now we are at capacity. My deep involvement in this business is exactly how I know what’s going on.

Yes, I have a business to run, but at the same time, I was willing to take the risk, when no-one else would, and deploy this compute. To insinuate that I have some sort of conflict of interest is unfair, especially without knowing the full story.

At this juncture, I don’t know what point you’re trying to make. We agree the software sucked. Tinygrad now runs on mi300x. Whatever George’s motivations were a year ago are no longer true today.

If you feel rocm sucks so badly, go the tinygrad route. Same if you don’t want to be tied to cuda. Choice is a good thing. At the end of the day though, this isn’t a reflection on the hardware at all.

ryao•16h ago

I hope your business works out for you and I am willing to believe that AMD has improved somewhat, but I do not believe AMD has improved enough to be worth people’s time when Nvidia is an option. I have heard too many nightmares and it is going to take many people, including people who reported those nightmares, reporting improvements for me to think otherwise. It is not just George Hotz who reported issues. Eric Hartford has been quiet lately, but one of the last comments he made on his blog was not very inspiring:

> Know that you are in for rough waters. And even when you arrive - There are lots of optimizations tailored for nVidia GPUs so, even though the hardware may be just as strong spec-wise, in my experience so far, it still may take 2-3 times as long to train on equivalient AMD hardware. (though if you are a super hacker maybe you can fix it!)

https://erichartford.com/from-zero-to-fineturning-with-axolo...

There has been no follow-up “it works great now”.

That said, as for saying you have a conflict of interest, let us consider what a conflict of interest is:

https://en.wikipedia.org/wiki/Conflict_of_interest

> A conflict of interest (COI) is a situation in which a person or organization is involved in multiple interests, financial or otherwise, and serving one interest could involve working against another.

You run a company whose business is dependent entirely on leasing AMD GPUs. Here, you want to say that AMD’s hardware is useful for that purpose and no longer has the deluge of problems others reported last year. If it has not improved, saying such could materially negatively impact your business. This by definition is a conflict of interest.

That is quite a large conflict of interest, given that it involves your livelihood. You are incentivized to make things look better than they are, which affects your credibility when you say that things are fine after there has been ample evidence in the recent past that they have not been. In AMD’s case, poor driver quality is something that they inherited from ATI and the issues goes back decades. While it is believable that AMD has improved their drivers, I find it difficult to believe that they have improved them enough that things are fine now, given history. Viewing your words as being less credible because of these things might be unfair, but there have been plenty of people whose livelihoods depended on things working before you that outright lied about the fitness of products. They even lied when people’s lives were at risk:

https://hackaday.com/2015/10/26/killed-by-a-machine-the-ther...

You could be correct in everything you say, but I have good reason to be skeptical until there has been information from others corroborating it. Blame all of the people who were in similar positions to yours that lied in the past for my skepticism. That said, I will keep my ears open for good news from others who use AMD hardware in this space, but I have low expectations given history.

latchkey•16h ago

Funny to see you quoting Eric, he’s a friend and was just running on one of our systems. AMD bought credits from us and donated compute time to him as part of the big internal changes they’re pushing. That kind of thing wouldn’t have happened a year ago. And from his experience, the software has come a long way. Stuff is moving so fast, that you aren't even keeping up, but I am the one driving it forward.

https://x.com/cognitivecompai/status/1929260789208142049

https://news.ycombinator.com/item?id=44154174

And sigh, here we are again with the conflict of interest comments, as if I don’t get it. As I said, you don’t know the full story, so let me spell it out. I’m not doing this for money, status, or fame. I’m fortunate enough that I don’t need a job, this isn’t about livelihood or personal gain.

I’m doing this because I genuinely care about the future of this industry. I believe AI is as transformational as the early Internet. I’ve been online since 1991 (BBS before that), and I’ve seen how monopolies can strangle innovation. A world where one company controls all AI hardware and software is a terrible outcome. Imagine if Cisco made every router or Windows was the only OS. That’s where we’re headed with Nvidia, and I refuse to accept that.

Look at my history and who my investor is, this isn’t some VC land grab. We truly care about decentralizing and democratizing compute. Our priority is getting this previously locked up behind supercomps HPC into the hands of as many developers as possible. My cofounder and I are lifelong nerds and developers, doing this because it matters.

Right now, only two companies are truly competing in this space. You’ve fairly pointed out failures of Cerebras and Groq. AMD is the only one with a real shot at breaking the monopoly. They’re behind, yes. But they were behind in CPUs too, and look where that went. If AMD continues on the path they’re on now, they can absolutely become a viable alternative. Make no mistake, humanity needs an alternative and I'll do my best to make that a reality.

ryao•14h ago

Ask Eric to consider writing a new blog post discussing the state of LLM training on AMD hardware. I would be very interested in reading what he has to say.

AMD catching up in CPUs required that they become competent at hardware development. AMD catching up in the GPGPU space would require that they become competent at software development. They have a long history of incompetence when it comes to software development. Here are a number of things Nvidia has done right contrasted with what AMD has done wrong:

  * Nvidia aggressively hires talent. It is known for hiring freshly minted PhDs in areas relevant to them. I heard this firsthand from a CS professor whose specialty was in compilers who had many former students working for Nvidia. AMD is not known for aggressive hiring. Thus, they have fewer software engineers to put on tasks.

  * Nvidia has a unified driver, which reduces duplication of effort, such that their software engineers can focus on improving things. AMD maintains separate drivers for each platform. AMD tried doing partial unification with vulkan, but it took too long to develop, so the Linux community developed its own driver and almost nobody uses AMD’s unified Vulkan driver on Linux. Instead of killing their effort and adopting the community driver for both Linux and Windows, they continued developing their driver that is mostly only used on Windows.

  * Nvidia has a unified architecture, which further deduplicates work. AMD split their architecture into RDNA and CDNA, and thus must implement the same things for each where the two overlap. They realized their mistake and are making UDNA, but the damage is done and they are behind because of their RDNA+CDNA misadventures. It will not be until 2026 that UDNA fixes this.

  * Nvidia proactively uses static analysis tools on their driver, such as coverity. This became public when Nvidia open sourced the kernel part of their Linux driver. I recall a Linux kernel developer that works on static analysis begging the amdgpu kernel driver developers to use static analysis tools on their driver, since there were many obvious issues that were being caught by static analysis tools that were going unaddressed.

There are big differences between how Nvidia and AMD do engineering that make AMD’s chances of catching up slim. That is likely to be the case until they start behaving more like Nvidia in how they do engineering. They are slowly moving in that direction, but so far, it has been too little, too late.

By the way, AMD’s software development incompetence applies to the CPU side of their business too. They had numerous USB issues on the AM4 platform due to bugs in AGESA/UEFI. There were other glitches too, such and memory incompatibilities. End users generally had to put up with it, although some AMD in conjunction with some motherboard vendors managed to fix it the issues. I had an AM4 machine that would not boot reliably with 128GB of RAM and this persisted until I replaced the motherboard with one of the last AM4 motherboards made after suffering for years. Then there was this incompetence that even affected AM5:

https://blog.desdelinux.net/en/Entrysign-a-vulnerability-aff...

AMD needs to change a great deal before they have any hope of competing with Nvidia GPUs in HPC. The only thing going for them in HPC for GPUs is that they have relatively competent GPU hardware design. Everything else about their GPUs have been a disaster. I would not be surprised if Intel manages to become a major player in the GPU market before AMD manages to write good drivers. Intel, unlike AMD, has a history of competent software development. The major black mark on their history would be the initial Windows ARC drivers, but the were able to fix a remarkable number of issues in the time since their discrete GPU launch, and have fairly good drivers on Windows now. Unlike AMD, they did not have a history of incompetence, so the idea that they fixed the vast majority of issues is not hard to believe. Intel will likely continue to have good drivers after they have made competitive hardware to pair with them, provided that they have not laid off their driver developers.

I have more hope in Intel than I have in AMD and I say that despite knowing how bad Intel is at doing anything other than CPUs. No matter how bad Intel is at branching into new areas, AMD is even worse at software development. On the bright side, Intel’s GPU IP has a dual role, since it is needed for their CPU’s iGPUs, so Intel must do the one thing they almost never do when branching into new areas, which is to iterate. The cost of R&D is thus mostly handled by their iGPUs and they can continue iterating on their discrete graphics until it is a real contender in the market. I hope that they merge Gaudi into their GPU development effort, since iterating on ARC is the right way forward. I think Intel having an “AMD moment” in GPUs is less of a longshot than AMD’s recovery from the AM3 fiasco and less of a long shot than AMD becoming competent at driver development before Intel either becomes good at GPGPU or goes out of business.

latchkey•14h ago

Trying to find fault over UDNA is hilarious, they literally can't win with you.

My business model is to support viable alternatives. If someone else comes along and develops something that looks viable and there is customer demand for it, I'll deploy it.

You totally lost me at having more hope with Intel. I'm not seeing it. Gaudi 3 release was a nothing burger and is only recently deployed on IBM Cloud. Software is the critical component and if developers can't get access to the hardware, nobody is going to write software for it.

ryao•14h ago

I fixed some autocorrect typos that were in my comment. I do not find fault with UDNA and I have no idea why you think I do. I find fault with the CDNA/RDNA split. UDNA is what AMD should have done in the first place.

As for Gaudi 3, I think it needs to be scrapped and used as an organ donor for ARC. In particular, the interconnect should reused in ARC. That would be Intel’s best chance of becoming competitive with Nvidia.

As for AMD becoming competitive with Nvidia, their incompetence at software engineering makes me skeptical. They do not have enough people. They have the people that they do have divided into to many redundant things. They do not have their people doing good software engineering practices such as static analysis. They also work the people that they do have with long hours (or so I have read), which of course is going to result in more bugs. They need a complete culture change to have any chance of catching up to Nvidia on the software side of things.

As for Intel, they have a good software engineering culture. They just need to fix the hardware side of things and I consider that to be much less of a stretch than AMD becoming good at software engineering Their recent battlematrix announcement is a step in the right direction. They just need to keep improving their GPUs and add an interconnect to fulfill the role of nvlink.

DiabloD3•17h ago

You're conflating two different things.

ROCm isn't part of AMD drivers, its a software library that helps you support legacy compute APIs and stuff in the BLAS/GEMM/LAPACK end of things.

The part of ROCm you're interested in is HIP; HIP is the part that does legacy CUDA emulation. HIP will never be complete because Nvidia keeps adding new things, documents things wrong, and also the "cool" stuff people do on Nvidia cards aren't CUDA and it is out of scope for HIP to emulate PTX (since that is strongly tied to how historical Nvidia architectures worked, and would be entirely inappropriate for AMD architectures).

The whole thing with Tinygrad's "driver" isn't a driver at all, its the infrastructure to handle card to card ccNUMA on PCI-E-based systems, which AMD does not support: if you want that, you buy into the big boy systems that have GPUs that communicate using Infinity Fabric (which it, itself, is the HyperTransport protocol over PCI-E PHY instead of over HyperTransport PHY; PCI over PCI-E has no ability to handle ccNUMA meaningfully).

Extremely few customers, AMD's or not, want to share VRAM directly over PCI-E across GPUs since most PCI-E GPU customers are single GPU. Customers that have massive multi-GPU deployments have bought into the ecosystem of their preferred vendor (ie, Nvidia's Mellanox-powered fabrics, or AMD's wall-to-wall Infinity Fabric).

That said, AMD does want to support it if they can, and Tinygrad isn't interested in waiting for an engineer at AMD to add it, so they're pushing ahead and adding it themselves.

Also, part of Tinygrad's problem is they want it available from ROCm/HIP instead of a standards compliant modern API. ROCm/HIP still has not been ported to the modern shader compiler that the AMD driver uses (ie, the one you use for OpenGL, Vulkan, and Direct family APIs), since it originally came from an unrelated engineering team that isn't part of the driver team.

The big push in AMD currently is to unify efforts so that ROCm/HIP is massively simplified and all the redundant parts are axed, so it is purely a SPIR-V code generator or similar. This would probably help projects like Tinygrad someday, but not today.

ryao•16h ago

> ROCm isn't part of AMD drivers, its a software library that helps you support legacy compute APIs and stuff in the BLAS/GEMM/LAPACK end of things.

AMD says otherwise:

> AMD ROCm™ is an open software stack including drivers, development tools, and APIs that enable GPU programming from low-level kernel to end-user applications.

https://www.amd.com/en/products/software/rocm.html

The issues involving AMD hardware not only applied to the drivers, but to the firmware below the drivers:

https://www.tomshardware.com/pc-components/gpus/amds-lisa-su...

Tinygrad’s software looks like a userland driver:

https://github.com/tinygrad/tinygrad/blob/master/tinygrad/ru...

It loads various firmware blobs, manages part of the initialization process, manages memory, writes to registers, etcetera. These are all things a driver does.

latchkey•15h ago

https://community.amd.com/t5/ai/what-s-new-in-amd-rocm-6-4-b...

ROCm 6.4 software introduces the Instinct GPU Driver, a modular driver architecture that separates the kernel driver from ROCm user space.

DiabloD3•15h ago

They were doing this before, the difference with this is, the version of ROCm you use is locked to the driver versions that are supported, which is a very narrow range.

With this new thing, the backend API is now formalized and easier to support wider range of difference.

DiabloD3•15h ago

AMD is extremely bad at communications. The driver already contains everything ROCm requires to talk to the GPU, and ROCm itself is only a SDK that contains runtimes, libraries, and compilers.

This part of TinyGrad is not a driver, however it tries to hijack the process to do part of that task. You cannot boot the system with this, and it does not replace any part of the Mesa/DRI/DRM/KMS/etc stack. It does reinitialize the hardware with a different firmware, which might be why you think this is a driver.

ryao•14h ago

I consider it to be a driver, or at least part of one. Userspace drivers exist. Graphic drivers originally were entirely in userspace, until portions of them were moved into the kernel for kernel mode setting and DRM. These days, graphics drivers themselves have both kernel mode and user mode components. The shader compiler for example would be a user mode component.

faldore•23h ago

That was last year Mi300x firmware and software have gotten much better since then

ryao•17h ago

Unfortunately, AMD and ATI before it have had driver quality issues for decades; and both they and their fans have claimed that they have solved the problems every year since.

Even if they have made progress, I doubt that they have reached parity with Nvidia. I have had enough false hope from them that I am convinced that the only way that they will ever improve their drivers if they let another group write the drivers for them.

Coincidentally, Valve has been developing the Vulkan driver used by SteamOS and other Linux distributions, which is how SteamOS is so much better than Windows. If AMD could get someone else to work on improving their GPGPU support, we would likely see it become quite good too. Until then, I have very low expectations.

latchkey•17h ago

Gaming != HPC

ryao•16h ago

GPUs were originally designed for gaming. Their ability to be used in HPC grew out of that. The history of issues goes back rather far.

latchkey•16h ago

Thanks, I totally had no idea what the G stood for. /s

Seriously though, you’re clearly stuck in the past. This is tech. It evolves fast.

Holding onto grudges just slows you down.

ryao•16h ago

The G stood for graphics.

As for being stuck in the past, I got fed up in 2006 after 8 years of nothing but ATI graphics. I spent years hoping that the issues would be fixed after the latest update, but they never were. I had a fairly problem free experience after switching to Nvidia. When issues did occur, Nvidia fixed them within months. While enjoying the relatively problem free experience on Nvidia, I would hear people claim everything was fixed on ATI (and later AMD), only to hear waves of people complaining about issues. Then Valve got involved with the driver development for AMD graphics and made the Steam deck. I brought one and it has been fantastic. I still hear about numerous problems involving drivers AMD wrote (especially their windows drivers), but I am using drivers that were in part authored by Valve, and Valve fixed the issues AMD was incapable of fixing themselves.

You claim that things are fine for HPC on AMD graphics hardware, but I have reason to be skeptical given that numerous people have reported severe problems just last year with no follow up that the various headaches have been fixed.

Also, you have repeatedly claimed that tinygrad’s software is not a driver, yet I see a userland driver here:

https://github.com/tinygrad/tinygrad/blob/master/tinygrad/ru...

As I have said elsewhere: It loads various firmware blobs, manages part of the initialization process, manages memory, writes to registers, etcetera. These are all things a driver does.

I am going to listen to others and my own eyes over you on these matters.

latchkey•15h ago

¯\_(ツ)_/¯

dragonwriter•19h ago

MI355X isn't out yet, and the upcoming B300 also has 288GB HBM3e

latchkey•18h ago

June 12th.

B300 is Q4 2025.

Yes, they keep leapfrogging each other. AMD is still ahead in vram.

rixed•16h ago

> when downvoting, please offer some insight why you disagree

And remind that (down)voting is not for (dis)agreement.

cyptus•1d ago

could such a network with all its nodes and weights be deployed to an analog circuit and be superfast?

rpmisms•1d ago

Please go into more detail about this proposal, this piqued my interest in a really strange way.

cyptus•1d ago

The idea is to replicate the weights of the network in the electronics. Somehow like our brains work? This way an analog input signal could lead to a neural network processed output signal without the digital emulation on an gpu. As this is very much simplified, the question is if this could work for modern llms?

koiueo•1d ago

Suddenly "temperature" parameter starts making sense

(If you ever tried fine-tuning an analog circuit, you'll know how finicky the process due to the environment, including temperature)

cyptus•23h ago

haha very true!

TuringNYC•22h ago

Do you mean something like this? https://www.etched.com/

ryao•1d ago

> As models become bigger, this does not scale anymore because the model weights will not fit into GPU memory anymore and you need to distribute them across GPUs or across nodes. Even with NVLink and Infiniband, these communications are slower than loading from VRAM. NVlink is still fine for tensor parallelism, but across nodes this is quite slow.

Inference works by computing layers and then have a very small vector that you send to the next layer as input. When a model does not fit in a single GPU, you just divide it into layers and send the vector over a fabric to the GPU holding the next layer. The transfer happens so quickly that there is a negligible amount of idle time and then the next layer can be computed. The fastest inference on the planet at Cerebras uses this technique to do 2500T/sec on Llama 4 Maverick.

jimmySixDOF•1d ago

Groq and Cerebras both take a big chip approach to architecture and, at least in the case of Groq, they only make economic sense under high batch loads.

https://x.com/swyx/status/1760065636410274162?s=46

ryao•18h ago

There is nothing big about Groq’s chips. Their individual chips have only 230 MB RAM. Unlike Cerebras, which can load multiple layers into a single chip, grok must divide a layer across many chips.

dgfitz•1d ago

I am so sincerely amused that “we” figured out how to monetize LLMs from the jump using tokens.

It isn’t tech for techs sake, it’s a money grab. Reminds me of paying to send a text message or buying minutes for a phone plan. Purely rent-seeking.

kaashif•1d ago

Can you explain how this is rent seeking? It seems to be straightforwardly not rent seeking.

1. Company develops model, invests in research, hardware, and software.

2. Company sells access to the model.

(1) is the step that makes this not rent seeking.

Rent seeking is when you profit from something you didn't earn - land rent, monopoly profits, protectionism.

dgfitz•1d ago

That’s fair. My thought was, when there is an interesting new technology, it usually takes time to figure out how to monetize it. Figuring out how to monetize LLMs took no time at all.

davidmurdoch•1d ago

"GPT 1.0" was released in 2018, I think that's a decent amount of time.

dgfitz•22h ago

We must have different definitions of released.

davidmurdoch•21h ago

Don't shoot the messenger. It's just what wikipedia says.

Generative Pre-trained Transformer 1 (GPT-1) Original author(s) OpenAI Initial release June 2018; 7 years ago

vikramkr•1d ago

I don't think it's obvious that any of these model providers are even profitable right now. I'm also not sure what there is to "figure out" - it's an expensive technology where the cost scales per token, so they charge per token? would you rather they burned even more money giving it away for free until everyone was dependent on it and then hyper enshittified to try and not go broke like so much of the rest of tech?

dgfitz•1d ago

My point, poorly made, was that I can run it myself for “free” without caring about tokens at all. Tokens are an artificial construct.

AndroTux•1d ago

So by that logic all VPS providers are just a money grab because you can run your software yourself for “free” without having to pay for that artificial construct these greedy people call “compute?”

I don’t understand your point. You’re using a resource. You’re wasting time on the GPU of someone else. That chunk is called a token. And that’s what you’re being billed.

dgfitz•1d ago

VPS providers don’t put out articles espousing their value twice a week because they don’t need to, the value is obvious.

I didn’t mean to come off as argumentative. Again, in my head it’s so obvious what the end game is, and it isn’t to better humanity.

vikramkr•7h ago

That's not remotely correct- how would you be able to ignore tokens? That's literally what defines the context size of an llm, the larger the more memory. Generating each token is literally the compute you're doing. Your hardware limits how many tokens per second you can produce with a particular model. It's literally what you're consuming electricity to produce. That's like saying you can drive your own car for free without worrying about how many miles you've driven.

Workaccount2•1d ago

It's likely that no one who makes base models is currently making money from LLMs. More likely losing it at a crazy rate.

These prices are almost certainly "introductory offer" prices to get people/devs to integrate AI into their lives/workflow/product.

In a few years is when we will see what the actual cost is.

imtringued•1d ago

>It’s a peculiar feature of transformer-based LLMs that computing a batch of completions at the same time is almost as fast as computing a single completion. Why is that?

Incorrect. Transformers usually contain a classical MLP layer. Only the MLP layer can be batched. Hence all classical neural networks including convolutional networks (via im2col) can be batched.

If there's anything that the transformer architecture changes, it is that the attention layer cannot be batched.

krackers•22h ago

Yeah this part was confusing, because it's only mentioned halfway through the article that the attention step can only be batched across matching context-window sizes.

gok•1d ago

MoE is in general kind of a stupid optimization. It seems to require around 5x more total parameters for the same modeling power as a dense model in exchange for around 2x less memory bandwidth needs.

The primary win of MoE models seems to be that you can list an enormous parameter count in your marketing materials.

hansvm•1d ago

Stupid? By paying 5x (normally 2-4x, but whatever) of a thing you don't care about at inference you can gain 2x in the primary thing you care about at inference. It's like handing out 4 extra bricks and getting back an extra lump of gold.

bick_nyers•1d ago

The general rule of thumb when assessing MoE <-> Dense model intelligence is SQRT(Total_Params*Active_Params). For Deepseek, you end up with ~158B params. The economics of batch inferencing a ~158B model at scale are different when compared to something like Deepseek (it is ~4x more FLOPS per inference after all), particularly if users care about latency.

philipodonnell•1d ago

Isn’t this an arbitrage opportunity? Offer to pay a fraction of the cost per token but accept that your tokens will only be processed when the batch window isn’t big enough, then resell that for a markup to people who need non-time sensitive inference?

pama•1d ago

You may have already noticed that many providers have separate, much lower, prices for offline inference.

angry_octet•1d ago

This is a great explainer from an LLM perspective, and it would be interesting to see a computational scheduling explanation in depth. I presume that hyperscale LLM companies extensively examine the computation trace to identify bottlenecks and idle bubbles, and develop load balancers, pipeline architectures and schedulers in order to optimise their workload.

The batching requirement for efficiency makes high security applications quite difficult, because the normal technique of isolating unrelated queries would become very expensive. The nVidia vGPU GPU virtualisation time shares GPU memory, and every switch requires unload/reload context switches, doubtful they have deduplication. Multi-Instance GPU (MIG) splits GPU memory between users, but it is a fixed partitioning scheme (you have to reboot the GPU to change), and nobody wants to split their 96GB GPU into 4x24GB GPUs.

Makes me wonder what the tradeoff is for putting second level memory on the GPU board (i.e. normal DRAM), so that different matrix data can be loaded in faster than over PCIe, i.e. the HBM becomes a cache.

(I'm also really liking the honesty in the authors book on Software Engineering, not in the dry IEEE sense, but as a survival guide in a large enterprise. https://www.seangoedecke.com/book/ )

slavboj•1d ago

It is not "slow and expensive", although it could be "or". You can get 3 tokens / second running on DDR4 memory on a two generation old workstation system that costs ~1K, via llama.cpp .

KolmogorovComp•1d ago

You’re most likely confusing the real deepseek with a distilled version. Unless you have more than 192Gb of RAM.

slavboj•18h ago

Workstations routinely accommodate much more than that. The "under $1K" price referred to a 768gb build (12x 64gb sticks on a Skylake based system), you could also do a dual-socket version with twice that, at the cost of messing with NUMA (which could be a pro or a con for throughput depending on how you're spreading bandwidth between nodes).

bick_nyers•1d ago

There's still a lot of opportunity for software optimizations here. Trouble is that really only two classes of systems get optimizations for Deepseek, namely 1 small GPU + a lot of RAM (ktransformers) and the system that has all the VRAM in the world.

A system with say 192GB VRAM and rest standard memory (DGX station, 2xRTX Pro 6000, 4xB60 Dual, etc.) could still in theory run Deepseek @4bit quite quickly because of the power law type usage of the experts.

If you aren't prompting Deepseek in Chinese, a lot of the experts don't activate.

This would be an easier job for pruning, but still I think enthusiast systems are going to trend in a way the next couple years that makes these types of software optimizations useful on a much larger scale.

There's a user on Reddit with a 16x 3090 system (PCIE 3.0 x4 interconnect which doesn't seem to be using full bandwidth during tensor parallelism) that gets 7 token/s in llama.cpp. A single 3090 has enough VRAM bandwidth to scan over its 24GB of memory 39 times per second, so there's something else going on limiting performance.

latchkey•1d ago

A single MI300x has 192GB of vram.

ElectricalUnion•17h ago

Sad reality is that the MI300x isn't a monolithic die, so the chiplets have internal bandwidth limitations (ofc less severe that using PCIe/nvlink).

In AMD own parlance, the "Modular Chiplet Platform" presents itself as either single-I-don't-care-about-speed-or-latency "Single Partition X-celerator" mode or in multiple-I-actually-totally-do-care-about-speed-and-latency-NUMA-like "Core Partitioned X-celerator" mode.

So you kinda still need to care what-loads-where.

latchkey•17h ago

I have never heard of a GPU where a deep understanding of how memory is managed was not critical towards getting the best performance.

MoonGhost•1d ago

> 16x 3090 system

That's about 5KW of power

> that gets 7 token/s in llama.cpp

Just looking at electricity bill it's cheaper to use API of any major providers.

> If you aren't prompting Deepseek in Chinese, a lot of the experts don't activate.

That's interesting, it means the model can be cut and those token routed to another closest expert, just in case they happened.

bick_nyers•20h ago

Or merge the bottom 1/8 (or whatever) experts together and (optionally) do some minimal training with all other weights frozen. Would need to modify the MoE routers slightly to map old -> new expert indices so you don't need to retrain the routers.

corey_moncure•1d ago

If I understand it correctly, the effect of experts is a weighted sum of the individual calculation of each token meeting each expert, where experts to be met by a token are selected on an individual basis. Since a sum is commutative, though, it should be possible to send a large batch of tokens copied to multiple GPUs, where experts are streamed into VRAM, partitioned across GPUs. Then the bottleneck is your PCI-E bandwidth. With 2 GPUs at Gen 4 x16, you should have 60 GB/s of TX bandwidth, allowing you to upload a half precision quant of DeepSeek (about 360 GB) in about 6 seconds.

  1 GPU  -  30 GB/s TX - 12 seconds
  2 GPUs -  60 GB/s TX - 6 seconds
  4 GPUs - 120 GB/s TX - 3 seconds

Then you just optimize your batch size to match the compute time to the upload time of each GPU. The expert calculation results can be retrieved from the GPUs and summed up.

briian•1d ago

This reminded me that the economies of scale in AI, especially inference, is huge.

When people say LLMs will be commoditised, I am not sure that means that the market is going to be super competitive. As the economies of scale of AI get even bigger (larger training costs + batch inference etc.) it just seems likely only around 3 companies will dominate LLMs.

riku_iki•1d ago

For inference cost, I don't see how this is different from cloud providers vs dedicated server providers, where AWS is 5-10x more expensive than hetzner.

Somehow cloud providers manage to add lots of extra-cost on offering.

ryan_glass•1d ago

I run Deepseek V3 locally as my daily driver and I find it affordable, fast and effective. The article assumes GPU which in my opinion is not the best way to serve large models like this locally. I run a mid-range EPYC 9004 series based home server on a supermicro mobo which cost all-in around $4000. It's a single CPU machine with 384GB RAM (you could get 768GB using 64GB sticks but this costs more). No GPU means power draw is less than a gaming desktop. With the RAM limitation I run an Unsloth Dynamic GGUF which, quality wise in real-world use performs very close to the original. It is around 270GB which leaves plenty of room for context - I run 16k context normally as I use the machine for other things too but can up it to 24k if I need more. I get about 9-10 tokens per second, dropping to 7 tokens/second with a large context. There are plenty of people running similar setups with 2 CPUs who run the full version at similar tokens/second.

nardi•1d ago

Whats your prompt processing speed? That’s more important in this situation than output TPS. If you have to wait minutes to start getting an answer, that makes it much worse than a cloud-hosted version.

pclmulqdq•1d ago

I assume KV caching makes this a non issue, but I'm also curious.

idonotknowwhy•22h ago

If you're just chatting with it starting with "Hi", that's correct. The conversation remains in the KV cache as it grows gradually.

But if you're posting code, writing drafts, or even small snippets of articles, etc in there it becomes a huge problem.

pclmulqdq•21h ago

Usually, when people think about the prompt tokens for a chat model, the initial system prompt is the vast majority of the tokens and it's the same regardless for many usage modes. You might have a slightly different system prompt for code than you have for English or for chatting, but that is 3 prompts which you can permanently put in some sort of persistent KV cache. After that, only your specific request in that mode is uncached.

ryao•1d ago

If he is doing multiturn conversations, he can reuse the kv cache from the last turn and skip the prompt processing on the history that would make time to first token too slow, by only doing prompt processing on his actual prompt for the current turn. This turns a quadratic amount of tokens to process into a linear number. I am not sure if this is what he is doing, but that is what I would do if I had his hardware.

ryan_glass•1d ago

Prompt eval time varies a lot with context but it feels real-time for short prompts - approx 20 tokens per second but I haven't done much benchmarking of this. When there is a lot of re-prompting in a long back and forth it is still quite fast - I do use KV cache which I assume helps and also quantize the KV cache to Q8 if I am running contexts above 16k. However, if I want it to summarize a document of say 15,000 words it does take a long time - here I walk away and come back in about 20 minutes and it will be complete.

jeff_carr•1d ago

I am impressed. Your personal website is down. HN doesn't allow private messages.

I'm Jeff Carr. I co-founded digital ocean. I assume I can't post email addresses here, but I will try. lets see how smart things are from banning me. I am: wit AT wit com

p12tic•1d ago

State of the art of local models is even further.

For example, look into https://github.com/kvcache-ai/ktransformers, which achieve >11 tokens/s on a relatively old two socket Xeon servers + retail RTX 4090 GPU. Even more interesting is prefill speed at more than 250 tokens/s. This is very useful in use cases like coding, where large prompts are common.

The above is achievable today. In the mean time Intel guys are working on something even more impressive. In https://github.com/sgl-project/sglang/pull/5150 they claim that they achieve >15 tokens/s generation and >350 tokens/s prefill. They don't share what exact hardware they run this on, but from various bits and pieces over various PRs I reverse-engineered that they use 2x Xeon 6980P with MRDIMM 8800 RAM, without GPU. Total cost of such setup will be around $10k once cheap Engineering samples hit eBay.

qeternity•23h ago

It's not impressive nor efficient when you consider batch sizes > 1.

p12tic•23h ago

All of this is for batch size 1.

adastra22•21h ago

You can put your email in your profile

trustinmenowpls•21h ago

fyi, your website is also down... wit.com doesn't resolve for me

x______________•21h ago

Bold of you to assume that an email domain needs a web server listening on port 80 for http packets..

simondotau•16h ago

You don’t even need an A/AAAA record on the domain.

saagarjha•14h ago

Pretty sure you can post email addresses here, this is mine: saagar@saagarjha.com. It's more about avoiding spam.

stavros•6h ago

You can post emails fine, you just might get spammed (because it's a public forum).

pclmulqdq•1d ago

CPUs are quietly becoming very well-balanced machines for BS 1 inference. The latest Intel Xeons should be at ~20 TPS.

Spooky23•1d ago

A base Mac Mini is ~20 :)

pclmulqdq•23h ago

Oh yeah, I did that math not assuming any quantization. I think if you can get a 3-4 bit quant working + int8 math, ~80 might be achievable.

platevoltage•1d ago

Impressive. I need to look more into this. I'm doing my best to limit my LLM usage to what I can run locally.

jbellis•1d ago

impressive, but that's 1/5 to 1/10 of the throughput that you'd get with a hosted provider, with 1/4 to 1/8 the supported context

michelsedgh•1d ago

Dude he's running locally, and I think this setup is the best bang for the buck if you wanna run locally, we're not comparing to data centers, you gotta keep it in perspective. That's very impressive results for running local. Thanks for the numbers you saved me a chatgpt search :)

carstenhag•1d ago

Title says: locally it's expensive

Other person says: I had to spend 4000$ and it's still slow

justsid•22h ago

Not to mention that $4000 is in fact expensive. If anything the OP really makes the point of the articles title.

BoorishBears•22h ago

CPU-only is really terrible bang for your buck, and I wish people would stop pushing these impractical builds on people genuinely curious in local AI.

The KV cache won't soften the blow the first time they paste a code sample into a chat and end up waiting 10 minutes with absolutely no interactivity before they even get first token.

You'll get an infinitely more useful build out of a single 3090 and sticking to stuff like Gemma 27B than you will out of trying to run Deepseek off a CPU-only build. Even a GH200 struggles to run Deepseek at realistic speeds with bs=1, and there's an entire H100 attached to CPU there: there just isn't a magic way to get "affordable fast effective" AI out of a CPU offloaded model right now.

ryan_glass•13h ago

The quality on Gemma 27B is nowhere near good enough for my needs. None of the smaller models are.

BoorishBears•12h ago

And that's fine, but the average person asking is already willing to give up some raw intelligence going local, and would not expect the kind of abysmal performance you're likely getting after describing it as "fast".

I setup Deepseek bs=1 on a $41,000 GH200 and got double digit prompt processing speeds (~50 tk/s): you're definitely getting worse performance than the GH200 was, and that's already unacceptable for most users.

They'd be much better served spending less money than you had to spend and getting an actually interactive experience, instead of having to send off prompts and wait several minutes to get an actual reply the moment the query involves any actual context.

ryan_glass•13h ago

It might be 5 to 10 times slower than a hosted provider but that doesn't really matter when the output is still faster than a person can read. Context wise, for troubleshooting I have never needed over 16k and for the rare occasion when I need to summarise a very large document I can change up the model to something smaller and get a huge context. I have never needed more than 32k though.

refibrillator•1d ago

> Unsloth Dynamic GGUF which, quality wise in real-world use performs very close to the original

How close are we talking?

I’m not calling you a liar OP, but in general I wish people perpetuating such broad claims would be more rigorous.

Unsloth does amazing work, however as far as I’m aware even they themselves do not publish head to head evals with the original unquantized models.

I have sympathy here because very few people and companies can afford to run the original models, let alone engineer rigorous evals.

However I felt compelled to comment because my experience does not match. For relatively simple usage the differences are hard to notice, but they become much more apparent in high complexity and long context tasks.

ryan_glass•1d ago

You are right that I haven't been rigorous - it's easy to benchmark tokens/second but quality of output is more difficult to nail down. I couldn't find any decent comparisons for Unsloth either. So I just tried a few of their models out, looking for something that was 'good enough' i.e. does all I need: coding, summarizing documents, troubleshooting anything and everything. I would like to see head to head comparisons too - maybe I will invest in more RAM at some stage but so far I have no need for it. I ran some comparisons between the smaller and larger versions of the Unsloth models and interestingly (for me anyway) didn't notice a huge amount of difference in quality between them. But, the smaller models didn't run significantly faster so I settled for the biggest model I could fit in RAM with a decent context. For more complex coding I use Deepseek R1 (again the Unsloth) but since it's a reasoning model it isn't real-time so no use as my daily driver.

danielhanchen•22h ago

Thanks for using our quants and appreciate it :) - We're still doing internal benchmarks since they're very slow to do - but they definitely pass our internal benchmarks :)

ryan_glass•13h ago

Thank you for making the dynamic quantisations! My setup wouldn't be possible without them and for my personal use, they do exactly what I need and are indeed excellent.

ysosirius•16h ago

How do you find the quality of the output compares to that of, say, o3 or Sonnet 4?

ryan_glass•12h ago

To be honest I haven't used o3 or Sonnet as the code I work with is my own proprietary code which I like to keep private, which is one reason for the local setup. For troubleshooting day to day things I have found it at least as good as than the free in-browser version of ChatGPT (not sure which model it uses).

danielhanchen•22h ago

Oh hey :) Thanks for the kind words - we did provide benchmarks (MMLU, KLD, Perplexity) for Llama 4 Scout, Gemma 3 27B using our methodology - https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs and https://x.com/UnslothAI/status/1915476692786962441

For R1 specifically, we did an internal benchmark on the original model - https://unsloth.ai/blog/deepseekr1-dynamic

For R1-0528 specifically on evals - we're still running them :)) It's quite expensive to run, so we first do "vibe check" on some internal test cases, and they do pretty well!

But we generally stress the bug fixes that we do, which objectively increase performance by +1 to sometimes +10% accuracy - for example Llama 4 bug fixes, Gemma bug fixes - https://news.ycombinator.com/item?id=39671146 etc are much more important :)

We also provide Q8_0 and Q8_K_XL quants, which are mostly equivalent to FP8 - you can also use the magical `-ot ".ffn_.*_exps.=CPU"` incantation to offload MoE layers to RAM!

saurik•5h ago

> All Distilled and the original R1 versions seem to have accidentally assigned the padding token to <｜endofsentence｜>, which is mostly not a good idea, especially if you want to further finetune on top of these reasoning models. This will cause endless infinite generations, since most frameworks will mask the EOS token out as -100.

I couldn't tell if this was an error in the code running the model or in the model weights themselves; if/assuming the former, are these fixes being upstreamed to anywhere?

3eb7988a1663•1d ago

Do you have hard numbers on the idle/average/max power draw? I assumed that server machines are built as if they are going to red-lined constantly so put less effort into low-utilization optimizations.

ryan_glass•23h ago

No hard numbers I'm afraid in that I don't monitor the power draw. But the machine uses a standard ATX power supply: a Corsair RM750e 750W PSU and the default TDP of the CPU is 280W - I have my TDP set at 300W. It is basically built like a desktop - ATX form factor, fans spin down at idle etc.

3eb7988a1663•16h ago

Approximation is still better than I was expecting. You said supermicro and I was assuming a pizza box with dual power supplies sucking down 1kw at idle. That it can run with a large, but not unreasonable PSU says enough.

dotancohen•23h ago

Just curious what your use cases are? What type of texts are you producing?

Thank you.

ysosirius•16h ago

I've always wondered this as well, and never seem to get an answer. Why would someone want to do this when they can get a better result either renting in the cloud, or just using a subscription?

Obviously I see the value in having something local from a control and privacy perspective, but it's surely always a net loss in terms of quality and capability of output, right?

ryan_glass•7h ago

Coding, my own proprietary code hence my desire for local hosting, a decent amount of legacy code. General troubleshooting of anything and everything from running Linux servers to fixing my car. Summarizing and translation of large documents occasionally. Also, image generation and other automations but obviously not LLMs for this.

dotancohen•1h ago

Terrific, thank you.

If you don't mind another question, how do you adapt the LLM to your codebase? Keep the whole thing in context? Fine tune on your own code? Fine tune on lots of code in whatever language you're using (e.g. Python, Rust)? Just rely on the original model training?

Thank you very much!

mechagodzilla•22h ago

I use a dual-socket 18-core (so 36 total) xeon with 768GB of DDR4, and get about 1.5-2 tokens/sec with a 4-bit quantized version of the full deepseek models. It really is wild to be able to run a model like that at home.

stirfish•21h ago

Dumb question: would something like this have a graphics card too? I assume not

twotwotwo•22h ago

The latest V3 strikes me as a really practical go-to among open-weights models. Lots of tasks don't need the reasoning tokens, and not having to wait for them is nice. (If something does need it you can always switch.) If you're not running it yourself a couple providers have it with full context, 80tps, and a promise not to use your data.

9004 home server is awesome!

blindriver•21h ago

I thought GPUs with a lot of extremely fast memory was required for inference. Are you saying that we can accomplish inference with just a large amount of system memory that is non-unified and no GPU? How is that possible?

adastra22•21h ago

I’m confused as to why you think a GPU is necessary? It’s just linear algebra.

oreoftw•21h ago

most likely he was referring the fact that you need plenty of GPU-fast memory to keep the model, and GPU cards have it.

adastra22•7h ago

There is nothing magical about GPU memory though. It’s just faster. But people have been doing CPU inference since the first llama code came out.

ryan_glass•8h ago

Basically it comes down to memory bandwidth of server CPUs being decent. A bit of oversimplification here but... The model and context have to be pulled through RAM (or VRAM) every time a new token is generated. CPUs that are designed for servers with lots of cores have decent bandwidth - up to 480GB/s with the EPYC 9 series and they can use 16 channels simultaneously to process memory. So, in theory they can pull 480GB through the system every second. GPUs are faster but you also have to fit the entire model and context into RAM (or VRAM) so for larger models they are extremely expensive because a decent consumer GPU only has 24GB of VRAM and costs silly money, if you need 20 of them. Whereas you get a lot of RDIMM RAM for a couple thousand bucks so you can run bigger models and 480GB/s gives output faster than most people can read.

6Az4Mj4D•21h ago

Can we run Deepseek using Ollama or something similar for code generation like Github copilot on a 40 core CPU with about 256GB RAM say 200 GB usable for the model?

cycomanic•1d ago

I was talking with a colleague the other day and we came to the conclusion that in our experience if you're using llms as a programming help models are really being optimised for the wrong things.

At work I often compare locallly run 4-30B models against various GPTs (we can only use non-local models for few things, because of confidentiality issues). While e.g. GPT-4o gives better results on average, the chances of it making parts of the response up is high enough that one has to invest significant amount to check and iterate over results. So the difference in effort is not much lower compared to the low parameter models.

The problem is both are just too slow to really iterate quickly, which makes things painful. I'd rather have a lower quality model (but with large context) that gives me near instant responses instead of a higher quality model that is slow. I guess that's not giving you the same headlines as the improved score on some evaluation.

mikewarot•21h ago

Imagine an FPGA big enough to hold the whole model in LUTS (and not RAM) with latches in appropriate places to keep race conditions in check. Even a 100 Mhz clock cycle would beat almost anything else in the world running it. Even if there's 500 stages of pipeline involved, you could still get 200,000 tokens per second for a single stream and have 499 streams ready for other uses.

With an FPGA like that, you could translate all of the matrix multiplies and weights directly into binary logic, optimizing out every multiply or add of a zero bit. This alone could cut the number of gates and computations, and power consumption in half.

Because you wouldn't need to throw data to/from RAM, you'd save a huge percentage of the usual latency and eliminate memory bandwidth issues. The effective equivalent memory bandwidth would likely be measured in exabytes per second.

This is the type of compute load that would perfectly match a bit level systolic array.

saagarjha•14h ago

You'd need an insanely big FPGA for this.

mikewarot•5h ago

Thanks to gigabit SERDES links, it should be reasonably easy to send the vectors between chips if you need to distribute the work to fit available FPGA hardware.

Note this could also be done if you're just emulating a systolic array on cheap hardware, like Raspberry pi picos, using the built-in PIOs to handle the much lower signal rates.

Demo of kons-9 Common Lisp 3D graphics system

Dev snapshot: Godot 4.5 dev 5

Japanese Scientists Develop Artificial Blood Compatible with All Blood Types

The Oracle of Lexiconia – A Fantasy That Explains How LLMs Work

Forge – an advanced 3D Gaussian Splatting renderer for Three.js

Street Fighter 2 composer Yoko Shimomura has created a new track for SF6

Tests should not contain logic

Tech-bro satire Mountainhead is an insufferable disappointment

Shop Talk Show episode 667

How to Find a Good Available .COM Domain

Reverse Engineering Apple's Proprietary NFC Wallet Protocol (2024)

T1000-E Card Tracker is a thin, credit card-sized GPS with Meshtastic support

Obsidian Smart Composer Plugin

Gen Z parents don't like reading to their kids

Hate filling forms – Built an AI that just does that in one click

Open Sourced NeurIPS 2025 Position Papers

Everything Is a Prompt

Corpdle – Wordle for S&P 500 companies

Iron Pillar of Delhi

Making computers multiply FASTER (matrix hacking) [video]

An unfiltered conversation with Dwarkesh Patel [video]

My AI Skeptic Friends Are All Nuts

Show HN: AI makes/answers calls through your own mobile phone number [video]

AI stirs up the optimal recipe for sustainable concrete

From Military Brat to Tech Entrepreneur

3D printed models help blind and low-vision students learn about their world

India and Pakistan's Air Battle Is Over. Their Water War Has Begun

Pattern Matching 20 Habits of Exceptional Startups

Large-Scale Research with Historical Newspapers: A Turning Point Through Gen AI

Adult sports leagues took over your city

Demo of kons-9 Common Lisp 3D graphics system

Dev snapshot: Godot 4.5 dev 5

Japanese Scientists Develop Artificial Blood Compatible with All Blood Types

The Oracle of Lexiconia – A Fantasy That Explains How LLMs Work

Forge – an advanced 3D Gaussian Splatting renderer for Three.js

Street Fighter 2 composer Yoko Shimomura has created a new track for SF6

Tests should not contain logic

Tech-bro satire Mountainhead is an insufferable disappointment

Shop Talk Show episode 667

How to Find a Good Available .COM Domain

Reverse Engineering Apple's Proprietary NFC Wallet Protocol (2024)

T1000-E Card Tracker is a thin, credit card-sized GPS with Meshtastic support

Obsidian Smart Composer Plugin

Gen Z parents don't like reading to their kids

Hate filling forms – Built an AI that just does that in one click

Open Sourced NeurIPS 2025 Position Papers

Everything Is a Prompt

Corpdle – Wordle for S&P 500 companies

Iron Pillar of Delhi

Making computers multiply FASTER (matrix hacking) [video]

An unfiltered conversation with Dwarkesh Patel [video]

My AI Skeptic Friends Are All Nuts

Show HN: AI makes/answers calls through your own mobile phone number [video]

AI stirs up the optimal recipe for sustainable concrete

From Military Brat to Tech Entrepreneur

3D printed models help blind and low-vision students learn about their world

India and Pakistan's Air Battle Is Over. Their Water War Has Begun

Pattern Matching 20 Habits of Exceptional Startups

Large-Scale Research with Historical Newspapers: A Turning Point Through Gen AI

Adult sports leagues took over your city

Why DeepSeek is cheap at scale but expensive to run locally

Comments