Two different tricks for fast LLM inference

https://www.seangoedecke.com/fast-llm-inference/
28•swah•1h ago

Comments

criemen•1h ago
One other thing I'd assume Anthropic is doing is routing all fast requests to the latest-gen hardware. They most certainly have a diverse fleet of inference hardware (TPUs, GPUs of different generations), and the fast tier will be served only by whatever is fastest, whereas the general inference workload will be spread out more.
retinaros•1h ago
Very interesting. OAI's releases since their router all seem focused on cost cutting/efficiency, while Anthropic is mostly going the opposite direction, spending all its budget on hyping its models in the media and releasing neo-hipster (aka normie) ads about taste and about how they won't do ads. The first red flag - besides every time Dario speaks - was the pop-up events with shitty caps, overhyped by all the AI influencers.

It seems OAI was forced by investors to shift quickly to making money. Anthropic seems to have more time? It might be hard for OAI to keep pace while focusing on cost.

Der_Einzige•1h ago
Another possible explanation, especially if quality degrades at all (i.e., on OpenAI's side), is aggressive quantization.

Another possible explanation is speculative decoding, where you trade unused GPU memory for speed via a draft model (rough sketch below).

But my money is on the exact two mechanisms the OP proposes.
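
For reference, the speculative decoding loop in miniature. This is a toy greedy sketch; `target`, `draft`, and their methods are illustrative stand-ins, not any real API:

    def speculative_decode(target, draft, prompt, k=4, max_new=256):
        # Greedy sketch: the cheap draft model guesses k tokens, the big
        # target model checks all of them in one forward pass, and we keep
        # the longest agreeing prefix plus the target's correction token,
        # so the big model runs once per ~k accepted tokens.
        tokens = list(prompt)
        while len(tokens) - len(prompt) < max_new:
            guess = []
            for _ in range(k):
                guess.append(draft.next_token(tokens + guess))
            checked = target.check(tokens, guess)  # k+1 greedy picks, one pass
            n = 0
            while n < k and guess[n] == checked[n]:
                n += 1
            tokens += guess[:n] + [checked[n]]
        return tokens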

anonymous908213•53m ago
> especially if quality degrades at all

It is worth noting that consumers are completely incapable of detecting quality degradation with any accuracy. That is to be expected, since the models are already effectively random, but there is a strong tendency to hallucinate degradations. When I did frontend work for an AI startup, complaints that we had degraded the model were by far the most common, despite the fact that our model never changed, and users could easily verify that it hadn't because we exposed seeds. A significant portion of complainers kept complaining about model degradation even when shown that regenerating from the same seed+input produced the exact same output. Humans, at scale, are essentially incapable of comprehending the concept of randomness.
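
(The seed check really is that mechanical. A minimal sketch, with `logits_fn` as a hypothetical stand-in for the model's forward pass:)

    import random

    def sample_reply(seed, prompt_tokens, logits_fn, steps=50):
        # All sampling randomness comes from one RNG seeded here, so the
        # reply is a pure function of (seed, prompt): replaying the same
        # pair must reproduce it token for token.
        rng = random.Random(seed)
        tokens = list(prompt_tokens)
        for _ in range(steps):
            weights = logits_fn(tokens)  # model forward pass (stub)
            tokens.append(rng.choices(range(len(weights)), weights)[0])
        return tokens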

dist-epoch•58m ago
The batch size explanation is wrong. Given how much Claude Code is used, finding fellow "bus passengers" is not an issue; you don't need to wait.

The real reason batching increases latency is multi-factored and more complex to explain.

qeternity•51m ago
Yes, this article is full of misunderstandings. The main bottleneck explanation is wrong: it's the model weights that dominate memory bandwidth (which is why batching multiple requests into a single pass increases total throughput). If copying user tokens were the bottleneck, batching would not achieve any speedup.

When an author is confused about something so elementary, I can’t trust anything else they write.
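
Back-of-the-envelope version of the first point (illustrative numbers, not any particular deployment):

    # During decode, every generated token must stream the full weights
    # from memory, and one batch shares that single stream.
    weight_bytes = 70e9 * 2              # e.g. a 70B-param model at fp16
    bandwidth = 3.35e12                  # ~3.35 TB/s, roughly an H100's HBM

    tokens_per_sec_alone = bandwidth / weight_bytes       # ~24 tok/s
    tokens_per_sec_total = 64 * bandwidth / weight_bytes  # batch of 64: ~1500 tok/s
    # If per-user token copying were the bottleneck instead, the cost would
    # scale with batch size and batching would buy no throughput at all.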

kouteiheika•47m ago
> The main bottleneck explanation is wrong: it's the model weights that dominate memory bandwidth (which is why batching multiple requests into a single pass increases total throughput). If copying user tokens were the bottleneck, batching would not achieve any speedup.

Inference is memory-bound only at low batch sizes; at high batch sizes it becomes compute-bound. There's a threshold past which stuffing more requests into a batch slows down every individual request, even though it may still increase aggregate tokens/second across the whole batch.
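
The crossover falls out of a simple roofline argument; a sketch with toy numbers:

    # A weight matrix with P parameters costs ~2*P FLOPs per request per
    # token, but its ~P weight loads are shared by the whole batch.
    # Decode stops being memory-bound once compute time catches up:
    #   batch * 2P / peak_flops >= P * bytes_per_param / bandwidth
    #   batch* ~= peak_flops * bytes_per_param / (2 * bandwidth)
    peak_flops = 990e12          # ~990 dense fp16 TFLOPs (roughly an H100)
    bandwidth = 3.35e12          # ~3.35 TB/s HBM
    bytes_per_param = 2          # fp16
    critical_batch = peak_flops * bytes_per_param / (2 * bandwidth)  # ~300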

gchadwick•9m ago
> If copying user tokens were the bottleneck, batching would not achieve any speedup.

Reality is more complex. As context length grows, the KV cache becomes large and begins to dominate total FLOPs (and hence bytes loaded). The issue with the KV cache is that you cannot batch it, because only one user can use it, unlike static layer weights, which can be reused across multiple users.

Emerging sparse attention techniques can greatly relieve this issue, though the extent to which frontier labs deploy them is uncertain. DeepSeek V3.2 uses sparse attention, though I don't know offhand how much it reduces KV-cache FLOPs and the associated memory bandwidth.
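
Rough numbers for the KV-cache point (shapes loosely Llama-70B-like, purely illustrative):

    # Each layer stores one K and one V vector per KV head per token,
    # and none of it is shared across users, unlike the weights.
    layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2     # fp16
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # ~320 KB
    per_user = per_token * 128_000 / 1e9                        # ~42 GB at 128k context
    # At long context a single user's cache rivals the weights themselves,
    # and every decode step streams it for that one user alone.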

gostsamo•54m ago
If the author is right, OpenAI has room for improvement: it can keep tuning its fast models for correctness on certain tasks, while Anthropic is left with scaling vertically. OFC, it is likely that over time both approaches will converge once the companies understand the problem space better and know which tradeoffs are worth making.

My personal take is that they will need a big model to plan, break down tasks, and schedule them onto specialized smaller models, plus a good-enough model for real-time interaction with the user. But that is the naive take, and many other things might be shaping these decisions.

EdNutting•46m ago
This author thinks Cerebras chips were deployed at scale to serve users worldwide just one month after the partnership announcement?

Seems like nonsense to me.

yorwba•20m ago
> The idea is to have a chip with SRAM large enough to fit the entire model, so inference can happen entirely in-memory. [...] So how much internal memory does the latest Cerebras chip have? 44GB. This puts OpenAI in kind of an awkward position. 44GB is enough to fit a small model (~20B params at fp16, ~40B params at int8 quantization), but clearly not enough to fit GPT-5.3-Codex.

You don't really need to fit the entire model on a single chip. Just as with GPUs, you can shard the model across multiple chips. Of course, when you have a long pipeline of chips that each token needs to pass through, the end-to-end tokens per second decrease correspondingly.

So the size of GPT-5.3-Codex-Spark isn't limited by the memory of a single Cerebras chip, but by the number of such chips you can chain together while still hitting the 1000-tokens-per-second target. Given that Cerebras offers models much larger than 40B at faster speeds (https://www.cerebras.ai/pricing#exploration), GPT-5.3-Codex-Spark is likely closer to GLM 4.7 in size (≈355B total parameters, 32B active).
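
The arithmetic behind that, sketched (the 44 GB figure is from the quote above; everything else is illustrative):

    def chips_needed(params_billion, bytes_per_param, sram_gb=44):
        # ceil(model bytes / per-chip SRAM), all in GB-scale units
        return -(-params_billion * bytes_per_param // sram_gb)

    chips_needed(40, 1)     # -> 1: a ~40B model at int8 fits on one chip
    chips_needed(355, 1)    # -> 9: a GLM-4.7-sized model needs a pipeline
    # Each extra pipeline stage adds per-token hop latency, so the real
    # limit is how many stages still clear the 1000 tok/s target.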

I love the work of the ArchWiki maintainers

https://k7r.eu/i-love-the-work-of-the-archwiki-maintainers/
483•panic•10h ago•87 comments

Flashpoint Archive – Over 200k web games and animations preserved

https://flashpointarchive.org
122•helloplanets•5h ago•29 comments

My smart sleep mask broadcasts users' brainwaves to an open MQTT broker

https://aimilios.bearblog.dev/reverse-engineering-sleep-mask/
468•minimalthinker•19h ago•210 comments

A practical guide to observing the night sky for real skies and real equipment

https://stargazingbuddy.com/
22•constantinum•2d ago•1 comment

Zvec: A lightweight, fast, in-process vector database

https://github.com/alibaba/zvec
156•dvrp•2d ago•26 comments

Instagram's URL Blackhole

https://medium.com/@shredlife/instagrams-url-blackhole-c1733e081664
205•tkp-415•1d ago•31 comments

Guitars of the USSR and the Jolana Special in Azerbaijani Music

https://caucascapades.wordpress.com/2012/06/14/guitars-of-the-ussr-and-the-jolana-special-in-azer...
52•bpierre•7h ago•6 comments

Inspecting the Source of Go Modules

https://words.filippo.io/go-source/
12•todsacerdoti•2d ago•0 comments

uBlock filter list to hide all YouTube Shorts

https://github.com/i5heu/ublock-hide-yt-shorts/
911•i5heu•17h ago•276 comments

Interference Pattern Formed in a Finger Gap Is Not Single Slit Diffraction

https://note.com/hydraenids/n/nbe89030deaba
33•uolmir•2d ago•5 comments

5,300-year-old 'bow drill' rewrites story of ancient Egyptian tools

https://www.ncl.ac.uk/press/articles/latest/2026/02/ancientegyptiandrillbit/
123•geox•4d ago•29 comments

How often do full-body MRIs find cancer?

https://www.usatoday.com/story/life/health-wellness/2026/02/11/full-body-mris-cancer-aneurysm/883...
115•brandonb•1d ago•149 comments

News publishers limit Internet Archive access due to AI scraping concerns

https://www.niemanlab.org/2026/01/news-publishers-limit-internet-archive-access-due-to-ai-scrapin...
504•ninjagoo•16h ago•310 comments

Amsterdam Compiler Kit

https://github.com/davidgiven/ack
133•andsoitis•18h ago•43 comments

OpenAI should build Slack

https://www.latent.space/p/ainews-why-openai-should-build-slack
176•swyx•1d ago•195 comments

MDST Engine: run GGUF models in the browser with WebGPU/WASM

https://mdst.app/blog/mdst_engine_run_gguf_models_in_your_browser
21•vmirnv•3d ago•3 comments

Breaking the spell of vibe coding

https://www.fast.ai/posts/2026-01-28-dark-flow/
260•arjunbanker•1d ago•202 comments

Discord Distances Itself from Peter Thiel's Palantir Age Verification Firm

https://kotaku.com/discord-palantir-peter-thiel-persona-age-verification-2000668951
80•thisislife2•5h ago•34 comments

Seeing Theory

https://seeing-theory.brown.edu/
22•Tomte•2h ago•1 comment

Ooh.directory: a place to find good blogs that interest you

https://ooh.directory/
523•hisamafahri•21h ago•132 comments

Oat – Ultra-lightweight, semantic, zero-dependency HTML UI component library

https://oat.ink/
142•twapi•3h ago•32 comments

The consequences of task switching in supervisory programming

https://martinfowler.com/fragments/2026-02-13.html
90•bigwheels•1d ago•39 comments

NewPipe: YouTube client without vertical videos and algorithmic feed

https://newpipe.net/
267•nvader•10h ago•79 comments

A review of M Disc archival capability with long term testing results (2016)

http://www.microscopy-uk.org.uk/mag/artsep16/mol-mdisc-review.html
86•1970-01-01•19h ago•106 comments

Windows NT/OS2 Design Workbook

https://computernewb.com/~lily/files/Documents/NTDesignWorkbook/
119•markus_zhang•4d ago•44 comments

Descent, ported to the web

https://mrdoob.github.io/three-descent/
254•memalign•15h ago•48 comments

Show HN: MOL – A programming language where pipelines trace themselves

https://github.com/crux-ecosystem/mol-lang
37•MouneshK•3d ago•14 comments

Flood Fill vs. The Magic Circle

https://www.robinsloan.com/winter-garden/magic-circle/
75•tobr•4d ago•20 comments

A Visual Source for Shakespeare's 'Tempest'

https://profadamroberts.substack.com/p/a-visual-source-for-shakespeares
8•seegodanddie•3d ago•0 comments