Two different tricks for fast LLM inference

https://www.seangoedecke.com/fast-llm-inference/

28•swah•1h ago

Comments

criemen•1h ago

One other thing I'd assume Anthropic is doing is routing all fast requests to the latest-gen hardware. They most certainly have a diverse fleet of inference hardware (TPUs, GPUs of different generations), and fast will be only served by whatever is fastest, whereas the general inference workload will be more spread out.

retinaros•1h ago

Very interesting. OAI releases since their router all seem focused on cost cutting/efficiency while anthropic is mostly going the opposite direction spending all budget to overhype their models in media and release neo-hipster (aka normies) ads on taste and on how they wont do ads. The first red flag - beside every time dario speaks - was the popup events with shitty caps overhyped by all ai influencers.

It seems OAI was forced by investors to shift quickly to making money. Anthropic seem to have more time? Might be hard for OAI to keep the pace while focusing on cost

Der_Einzige•1h ago

Another possible explanation, especially if quality degrades at all (I.e on openAI) is aggressive quantization.

Another possible explanation is speculative decoding, where you trade unused GPU memory for speed (via a drafting model).

But my money is on the exact two mechanisms the OP proposes.

anonymous908213•53m ago

> especially if quality degrades at all

It is worth noting that consumers are completely and totally incapable of detecting quality degradation with any accuracy. Which is a given since the models are already effectively random, but there is a strong bent to hallucinate degradations. Having done frontend work for an AI startup, complaints of degrading the model were by far the most common, despite the fact that not only did our model not change, users could easily verify that it didn't change because we expose seeds. A significant portion of complainers continue to complain about model degradation even when shown they could regenerate from the same seed+input and get the exact same output. Humans, at scale, are essentially incapable of comprehending the concept of randomness.

dist-epoch•58m ago

The batch size explanation is wrong. Given how much Claude Code is used, finding fellow "bus passengers" is not an issue, you don't need to wait.

The real reason which batching increases latency is multi-factored and more complex to explain.

qeternity•51m ago

Yes this article is full of misunderstanding. The main explanation of bottleneck is wrong: it’s the model weights which dominate memory bandwidth (and hence why batching multiple requests in a single pass increases total throughput). If copying user tokens was the bottle neck, batching would not achieve any speed up.

When an author is confused about something so elementary, I can’t trust anything else they write.

kouteiheika•47m ago

> The main explanation of bottleneck is wrong: it’s the model weights which dominate memory bandwidth (and hence why batching multiple requests in a single pass increases total throughput). If copy user tokens was the bottle neck, batching would not achieve any speed up.

Inference is memory-bound only at low batch sizes. At high batch sizes it becomes compute-bound. There's a certain threshold where stuffing more requests in a batch will slow down every request in isolation even though it may still increase the number of tokens/second across the whole batch for all request in aggregate.

gchadwick•9m ago

> If copying user tokens was the bottle neck, batching would not achieve any speed up.

Reality is more complex. As context length grows your KV cache becomes large and will begin to dominate your total FLOPs (and hence bytes loaded). The issue with KV cache is you cannot batch it because only one user can use it, unlike static layer weights where you can reuse them across multiple users.

Emerging sparse attention techniques can greatly relieve this issue though the extent to which frontier labs deploy them is uncertain. Deepseek v3.2 uses sparse attention though I don't know off hand how much this reduces KV cache FLOPs and associated memory bandwidth.

gostsamo•54m ago

If the author is right, OpenAI have room for improvement where they can further improve the fast models for correctness for certain tasks while Anthropic are left with scaling vertically. OFC, it is likely that over time both approaches will converge when the companies understand the problem space better and what tradeoofs are worth making.

My personal take is that they will need a big model to plan and break down tasks and schedule them to specialized smaller models while there is a good enough model for real time interactions with the user, but it is the naive take and many other things might be shaping the decisions.

EdNutting•46m ago

This author thinks Cerebras chips were deployed at scale to serve users worldwide in just one month since the partnership announcement?

Seems like nonsense to me.

yorwba•20m ago

> The idea is to have a chip with SRAM large enough to fit the entire model, so inference can happen entirely in-memory. [...] So how much internal memory does the latest Cerebras chip have? 44GB. This puts OpenAI in kind of an awkward position. 44GB is enough to fit a small model (~20B params at fp16, ~40B params at int8 quantization), but clearly not enough to fit GPT-5.3-Codex.

You don't really need to fit the entire model on a single chip. Just as with GPUs, you can shard the model across multiple chips. Of course when you have a long pipeline of chips that each token needs to pass through, that decreases the end-to-end tokens per second correspondingly.

So the size of GPT-5.3-Codex-Spark isn't limited by the memory of a single Cerebras chip, but the number of such chips that you can chain together and still hit the 1000 tokens per second target. Given that Cerebras offers models much larger than 40B at faster speeds https://www.cerebras.ai/pricing#exploration GPT-5.3-Codex-Spark is likely closer to GLM 4.7 in size. (≈355B total parameters, 32B active)

I love the work of the ArchWiki maintainers

Flashpoint Archive – Over 200k web games and animations preserved

Two different tricks for fast LLM inference

My smart sleep mask broadcasts users' brainwaves to an open MQTT broker

A practical guide to observing the night sky for real skies and real equipment

Zvec: A lightweight, fast, in-process vector database

Instagram's URL Blackhole

Guitars of the USSR and the Jolana Special in Azerbaijani Music

Inspecting the Source of Go Modules

uBlock filter list to hide all YouTube Shorts

Interference Pattern Formed in a Finger Gap Is Not Single Slit Diffraction

5,300-year-old 'bow drill' rewrites story of ancient Egyptian tools

How often do full-body MRIs find cancer?

News publishers limit Internet Archive access due to AI scraping concerns

Amsterdam Compiler Kit

OpenAI should build Slack

MDST Engine: run GGUF models in the browser with WebGPU/WASM

Breaking the spell of vibe coding

Discord Distances Itself from Peter Thiel's Palantir Age Verification Firm

Seeing Theory

Ooh.directory: a place to find good blogs that interest you

Oat – Ultra-lightweight, semantic, zero-dependency HTML UI component library

The consequences of task switching in supervisory programming

NewPipe: YouTube client without vertical videos and algorithmic feed

A review of M Disc archival capability with long term testing results (2016)

Windows NT/OS2 Design Workbook

Descent, ported to the web

Show HN: MOL – A programming language where pipelines trace themselves

Flood Fill vs. The Magic Circle

A Visual Source for Shakespeare's 'Tempest'

Two different tricks for fast LLM inference

Comments

I love the work of the ArchWiki maintainers

Flashpoint Archive – Over 200k web games and animations preserved

Two different tricks for fast LLM inference

My smart sleep mask broadcasts users' brainwaves to an open MQTT broker

A practical guide to observing the night sky for real skies and real equipment

Zvec: A lightweight, fast, in-process vector database

Instagram's URL Blackhole

Guitars of the USSR and the Jolana Special in Azerbaijani Music

Inspecting the Source of Go Modules

uBlock filter list to hide all YouTube Shorts

Interference Pattern Formed in a Finger Gap Is Not Single Slit Diffraction

5,300-year-old 'bow drill' rewrites story of ancient Egyptian tools

How often do full-body MRIs find cancer?

News publishers limit Internet Archive access due to AI scraping concerns

Amsterdam Compiler Kit

OpenAI should build Slack

MDST Engine: run GGUF models in the browser with WebGPU/WASM

Breaking the spell of vibe coding

Discord Distances Itself from Peter Thiel's Palantir Age Verification Firm

Seeing Theory

Ooh.directory: a place to find good blogs that interest you

Oat – Ultra-lightweight, semantic, zero-dependency HTML UI component library

The consequences of task switching in supervisory programming

NewPipe: YouTube client without vertical videos and algorithmic feed

A review of M Disc archival capability with long term testing results (2016)

Windows NT/OS2 Design Workbook

Descent, ported to the web

Show HN: MOL – A programming language where pipelines trace themselves

Flood Fill vs. The Magic Circle

A Visual Source for Shakespeare's 'Tempest'