Two different tricks for fast LLM inference

https://www.seangoedecke.com/fast-llm-inference/

21•swah•1h ago

Comments

criemen•57m ago

One other thing I'd assume Anthropic is doing is routing all fast requests to the latest-gen hardware. They almost certainly have a diverse fleet of inference hardware (TPUs, GPUs of different generations), and fast will only be served by whatever is fastest, whereas the general inference workload will be more spread out.

retinaros•56m ago

Very interesting. OAI releases since their router all seem focused on cost cutting/efficiency, while Anthropic is mostly going the opposite direction, spending all its budget to overhype their models in the media and release neo-hipster (aka normies) ads on taste and on how they won't do ads. The first red flag, besides every time Dario speaks, was the popup events with shitty caps overhyped by all the AI influencers.

It seems OAI was forced by investors to shift quickly to making money. Anthropic seems to have more time? It might be hard for OAI to keep pace while focusing on cost.

Der_Einzige•49m ago

Another possible explanation, especially if quality degrades at all (i.e. on OpenAI), is aggressive quantization.

Another possible explanation is speculative decoding, where you trade unused GPU memory for speed (via a drafting model).

But my money is on the exact two mechanisms the OP proposes.
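
For readers unfamiliar with the speculative decoding mentioned above, here is a minimal greedy sketch. It is not any provider's implementation: `target`, `draft`, and the toy next-token rules are hypothetical stand-ins for a large model and a small drafting model, and the verification loop stands in for what would be a single batched forward pass of the large model.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def target(ctx: List[int]) -> int:
    # Hypothetical "large model": a deterministic toy rule for the next token.
    return (ctx[-1] * 3 + 1) % 17


def draft(ctx: List[int]) -> int:
    # Hypothetical "small model": agrees with the target except at some positions.
    guess = target(ctx)
    return (guess + 1) % 17 if len(ctx) % 4 == 0 else guess


def greedy_decode(model: NextToken, prompt: List[int], num_tokens: int) -> List[int]:
    # Plain autoregressive decoding: one model call per generated token.
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(model(out))
    return out


def speculative_decode(prompt: List[int], num_tokens: int, k: int = 4) -> List[int]:
    out = list(prompt)
    goal = len(prompt) + num_tokens
    while len(out) < goal:
        # 1. The draft model cheaply proposes k tokens in sequence.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks the proposals. A real system scores all k
        #    positions in one forward pass; here it is a loop for clarity.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        out.extend(accepted)
    return out[:goal]
```

In this greedy variant the output is identical to decoding with the target alone; the speedup comes from accepting several draft tokens per target pass when the two models agree.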

anonymous908213•27m ago

> especially if quality degrades at all

It is worth noting that consumers are completely and totally incapable of detecting quality degradation with any accuracy. Which is a given since the models are already effectively random, but there is a strong bent to hallucinate degradations. Having done frontend work for an AI startup, complaints of degrading the model were by far the most common, despite the fact that not only did our model not change, users could easily verify that it didn't change because we expose seeds. A significant portion of complainers continue to complain about model degradation even when shown they could regenerate from the same seed+input and get the exact same output. Humans, at scale, are essentially incapable of comprehending the concept of randomness.
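
The seed point above can be illustrated with a toy sampler. This is only a sketch: `sample_tokens`, its vocabulary, and its weighting rule are hypothetical and stand in for a real model's sampling step, under the assumption that the sampler's only source of randomness is the seeded generator.

```python
import random
from typing import List


def sample_tokens(prompt: str, seed: int, n: int = 8) -> List[str]:
    # Hypothetical toy "model": the token weights depend only on the prompt.
    vocab = ["the", "a", "cat", "dog", "sat", "ran", "fast", "slow"]
    weights = [len(prompt) % 5 + 1 + i for i in range(len(vocab))]
    # All randomness comes from this generator, so seed + input fix the output.
    rng = random.Random(seed)
    return rng.choices(vocab, weights=weights, k=n)


first = sample_tokens("write a poem", seed=42)
second = sample_tokens("write a poem", seed=42)
print(first == second)
```

Regenerating with the same seed and input reproduces the same output, which is the check users could have run themselves.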

dist-epoch•33m ago

The batch size explanation is wrong. Given how much Claude Code is used, finding fellow "bus passengers" is not an issue, you don't need to wait.

The real reason batching increases latency is multifactorial and more complex to explain.

qeternity•26m ago

Yes, this article is full of misunderstandings. The main explanation of the bottleneck is wrong: it’s the model weights that dominate memory bandwidth (which is why batching multiple requests in a single pass increases total throughput). If copying user tokens were the bottleneck, batching would not achieve any speedup.

When an author is confused about something so elementary, I can’t trust anything else they write.

kouteiheika•22m ago

> The main explanation of bottleneck is wrong: it’s the model weights which dominate memory bandwidth (and hence why batching multiple requests in a single pass increases total throughput). If copying user tokens was the bottleneck, batching would not achieve any speed up.

Inference is memory-bound only at low batch sizes. At high batch sizes it becomes compute-bound. There's a certain threshold beyond which stuffing more requests into a batch slows down every individual request, even though it may still increase the number of tokens/second across all requests in the batch in aggregate.
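
The threshold described above can be sketched with a simple roofline-style estimate. Everything here is an assumption for illustration: the function name, the hardware numbers, and the model size are hypothetical, and the model ignores KV-cache traffic, attention cost, and kernel overheads.

```python
def decode_step_time(batch_size: int,
                     weight_bytes: float = 140e9,     # assumed: 70B params at 2 bytes each
                     mem_bandwidth: float = 3.0e12,   # assumed: bytes/second
                     flops_per_token: float = 140e9,  # assumed: ~2 FLOPs per parameter
                     peak_flops: float = 1.0e15) -> float:  # assumed: FLOPs/second
    """Estimated seconds for one decode step that emits one token per request."""
    # The weights are read once per step, however many requests share the batch.
    memory_time = weight_bytes / mem_bandwidth
    # Compute grows linearly with the number of requests in the batch.
    compute_time = batch_size * flops_per_token / peak_flops
    # The step takes as long as the slower of the two.
    return max(memory_time, compute_time)


for batch in (1, 64, 256, 1024):
    step = decode_step_time(batch)
    # Per-request latency is the step time; aggregate throughput is batch / step.
    print(f"batch={batch:5d}  step={step * 1e3:7.2f} ms  "
          f"aggregate={batch / step:9.1f} tok/s")
```

Under these assumed numbers the step time is flat while the batch is memory-bound, so extra requests are nearly free. Past the crossover, each added request lengthens the step for everyone, and in this simplified model aggregate throughput stops growing.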

gostsamo•28m ago

If the author is right, OpenAI has room to further improve the fast models' correctness on certain tasks, while Anthropic is left with scaling vertically. OFC, it is likely that over time both approaches will converge as the companies understand the problem space better and learn which trade-offs are worth making.

My personal take is that they will need a big model to plan and break down tasks and schedule them to specialized smaller models, with a good-enough model for real-time interactions with the user. But that is the naive take, and many other things might be shaping the decisions.

EdNutting•20m ago

This author thinks Cerebras chips were deployed at scale to serve users worldwide in just one month since the partnership announcement?

Seems like nonsense to me.
