Two different tricks for fast LLM inference

https://www.seangoedecke.com/fast-llm-inference/
21•swah•1h ago

Comments

criemen•57m ago
One other thing I'd assume Anthropic is doing is routing all fast requests to the latest-gen hardware. They most certainly have a diverse fleet of inference hardware (TPUs, GPUs of different generations), and fast will be only served by whatever is fastest, whereas the general inference workload will be more spread out.
retinaros•56m ago
Very interesting. OAI's releases since their router all seem focused on cost cutting/efficiency, while Anthropic is mostly going in the opposite direction, spending all its budget to overhype its models in the media and release neo-hipster (aka normies) ads on taste and on how they won't do ads. The first red flag, besides every time Dario speaks, was the popup events with shitty caps overhyped by all the AI influencers.

It seems OAI was forced by investors to shift quickly to making money. Anthropic seems to have more time? Might be hard for OAI to keep the pace while focusing on cost.

Der_Einzige•49m ago
Another possible explanation, especially if quality degrades at all (i.e. on OpenAI), is aggressive quantization.

Another possible explanation is speculative decoding, where you trade unused GPU memory for speed (via a drafting model).

But my money is on the exact two mechanisms the OP proposes.
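The speculative decoding idea mentioned above can be sketched with toy models. This is a minimal greedy variant, not how any particular provider implements it: `target_next` and `draft_next` are hypothetical stand-ins for a large and a small model, and the verification loop here calls the target once per position, whereas a real system would score all drafted positions in a single batched forward pass (which is where the speedup comes from).

```python
def target_next(ctx):
    # Hypothetical "large" model: next token is the context sum mod 7.
    return sum(ctx) % 7

def draft_next(ctx):
    # Hypothetical "small" model: agrees with the target unless the last token is 3.
    return 0 if ctx[-1] == 3 else sum(ctx) % 7

def greedy_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out

def speculative_decode(prompt, n_new, k=4):
    """Draft k tokens cheaply, then keep the prefix the target agrees with."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        # 1. The draft model proposes k tokens autoregressively.
        ctx, draft = list(out), []
        for _ in range(k):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2. The target checks each proposal in order.
        for proposed in draft:
            expected = target_next(out)
            out.append(expected)       # always keep the target's own token
            if expected != proposed:
                break                  # first mismatch: discard the rest of the draft
    return out[:len(prompt) + n_new]

prompt = [1, 2]
baseline = greedy_decode(prompt, 20)
speculated = speculative_decode(prompt, 20)
```

In this greedy form the output is identical to decoding with the target alone; the draft model only changes how many target passes are needed, at the cost of the extra memory the draft model occupies.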

anonymous908213•27m ago
> especially if quality degrades at all

It is worth noting that consumers are completely and totally incapable of detecting quality degradation with any accuracy. Which is a given since the models are already effectively random, but there is a strong bent to hallucinate degradations. Having done frontend work for an AI startup, complaints of degrading the model were by far the most common, despite the fact that not only did our model not change, users could easily verify that it didn't change because we expose seeds. A significant portion of complainers continue to complain about model degradation even when shown they could regenerate from the same seed+input and get the exact same output. Humans, at scale, are essentially incapable of comprehending the concept of randomness.
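The seed check described above can be illustrated with a toy sampler. This is a sketch under stated assumptions: `TOY_WEIGHTS` is a made-up next-token distribution, `sample_tokens` is a hypothetical helper, and a real serving stack would also need to be deterministic in every other respect for the same guarantee to hold.

```python
import random

def sample_tokens(weights, seed, n=8):
    """Draw n tokens from a fixed distribution using a seeded RNG."""
    rng = random.Random(seed)  # private generator; does not touch global state
    vocab = list(weights)
    probs = [weights[tok] for tok in vocab]
    return [rng.choices(vocab, weights=probs, k=1)[0] for _ in range(n)]

# Hypothetical next-token distribution, for illustration only.
TOY_WEIGHTS = {"the": 0.5, "a": 0.3, "an": 0.2}

first = sample_tokens(TOY_WEIGHTS, seed=1234)
second = sample_tokens(TOY_WEIGHTS, seed=1234)
same_output = first == second  # identical seed and input give identical samples
```

With the seed and input held fixed, any difference a user perceives between two runs cannot come from the sampling step.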

dist-epoch•33m ago
The batch size explanation is wrong. Given how much Claude Code is used, finding fellow "bus passengers" is not an issue; you don't need to wait.

The real reason why batching increases latency is multi-factored and more complex to explain.

qeternity•26m ago
Yes, this article is full of misunderstandings. The main explanation of the bottleneck is wrong: it’s the model weights which dominate memory bandwidth (and hence why batching multiple requests in a single pass increases total throughput). If copying user tokens was the bottleneck, batching would not achieve any speedup.

When an author is confused about something so elementary, I can’t trust anything else they write.

kouteiheika•22m ago
> The main explanation of the bottleneck is wrong: it’s the model weights which dominate memory bandwidth (and hence why batching multiple requests in a single pass increases total throughput). If copying user tokens was the bottleneck, batching would not achieve any speedup.

Inference is memory-bound only at low batch sizes. At high batch sizes it becomes compute-bound. There's a certain threshold beyond which stuffing more requests into a batch will slow down every individual request, even though it may still increase the aggregate tokens/second across all requests in the batch.
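A rough roofline-style estimate makes the memory-bound versus compute-bound threshold concrete. This is a simplified sketch, not a model of any real deployment: every hardware and model number below is hypothetical, `decode_step_time` and `report` are illustrative helpers, and KV-cache reads, communication, and kernel overheads are ignored.

```python
def decode_step_time(batch_size, weight_bytes, flops_per_token, mem_bw, peak_flops):
    """Roofline-style estimate of one decode step, in seconds."""
    memory_time = weight_bytes / mem_bw  # weights are read once per step, shared by the batch
    compute_time = batch_size * flops_per_token / peak_flops
    return max(memory_time, compute_time)

# Hypothetical numbers, chosen only for round arithmetic.
WEIGHT_BYTES = 140e9     # e.g. 70e9 parameters at 2 bytes each
FLOPS_PER_TOKEN = 140e9  # roughly 2 FLOPs per parameter per token
MEM_BW = 2e12            # bytes/second
PEAK_FLOPS = 1e15        # FLOPs/second

def report(batch_size):
    step = decode_step_time(batch_size, WEIGHT_BYTES, FLOPS_PER_TOKEN, MEM_BW, PEAK_FLOPS)
    return {"per_request_tok_s": 1 / step, "aggregate_tok_s": batch_size / step}

for b in (1, 100, 1000, 2000):
    print(b, report(b))
```

Under these assumed numbers the crossover sits at a batch size of about 500. Below it, adding requests raises aggregate throughput without slowing anyone down, because the step time is fixed by reading the weights; above it, each request gets slower. In this idealized model aggregate throughput flattens past the crossover, whereas on real hardware the transition is more gradual.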

gostsamo•28m ago
If the author is right, OpenAI has room to further improve the fast models' correctness on certain tasks, while Anthropic is left with scaling vertically. OFC, it is likely that over time both approaches will converge as the companies understand the problem space better and learn which tradeoffs are worth making.

My personal take is that they will need a big model to plan and break down tasks and schedule them to specialized smaller models while there is a good enough model for real time interactions with the user, but it is the naive take and many other things might be shaping the decisions.

EdNutting•20m ago
This author thinks Cerebras chips were deployed at scale to serve users worldwide in just one month since the partnership announcement?

Seems like nonsense to me.

An AI agent harassed a Matplotlib maintainer. We're asking the wrong question

https://chaosguru.substack.com/p/who-opened-the-door
1•allixsenos•51s ago•0 comments

Show HN: Pangolin: Open-source identity-based VPN (Twingate/Zscaler alternative)

https://github.com/fosrl/pangolin
2•miloschwartz•11m ago•0 comments

Show HN: I built a map where anyone can rename anything

https://rename.world/
4•kafked•13m ago•0 comments

What Happens After the Hero? The Section That Decides If Users Stay

https://www.indiehackers.com/post/what-happens-after-the-hero-the-section-that-decides-if-users-s...
1•allinonetools_•14m ago•0 comments

A Pokémon of a Different Color

https://matthew.verive.me/blog/color/
1•Risse•20m ago•0 comments

France assembles magistrate team to examine 'Epstein files'

https://www.dw.com/en/epstein-files-france-investigation-magistrates/a-75975774
1•pera•21m ago•0 comments

Show HN: Manga Viewer – Zero-dep manga/comic reader in vanilla JavaScript

https://github.com/tokagemushi999/manga-viewer
2•tokagemushi•24m ago•0 comments

AI to stay in Flow – a personal decision on how I chose to (not) use AI

https://www.dev-log.me/ai_to_stay_in_flow/
1•wazHFsRy•27m ago•0 comments

AI Agent Lands PRs in Major OSS Projects, Targets Maintainers via Cold Outreach

https://socket.dev/blog/ai-agent-lands-prs-in-major-oss-projects-targets-maintainers-via-cold-out...
1•choult•28m ago•0 comments

Starflight

https://en.wikipedia.org/wiki/Starflight
1•tosh•28m ago•0 comments

Storepage – I'm a tool builder making app launch pages less painful

https://storepage.app
1•scaleinitiative•30m ago•1 comments

Show HN: Visual state tracking for AI agents in tmux

https://github.com/accessd/tmux-agent-indicator
1•accessd•35m ago•0 comments

Former teen superstar James Van Der Beek needed help to pay his medical bills

https://www.bbc.com/news/articles/cx2dw01p7k8o
1•breve•40m ago•0 comments

The great computer science exodus (and where students are going instead)

https://techcrunch.com/2026/02/15/the-great-computer-science-exodus-and-where-students-are-going-...
1•e2e4•40m ago•0 comments

Show HN: Addictive little browser game involving gravity

https://retroburn.space/
1•amiralul•40m ago•0 comments

Paper Plotter

https://felixboiii.github.io/paper-plotter/#create-function
2•Tomte•41m ago•0 comments

Orjson no more open issue tracker or pull requests due to signal-to-noise ratio

https://github.com/ijl/orjson
1•anutrix•41m ago•1 comments

Learning Lean: Part 1

https://rkirov.github.io/posts/lean1/
2•vinhnx•41m ago•0 comments

How I made $2.5K in 4 days selling a SaaS boilerplate for OpenClaw wrappers

https://clawwrapper.com
1•omridan159•42m ago•1 comments

Understanding the Go Runtime: The Bootstrap

https://internals-for-interns.com/posts/understanding-go-runtime/
2•birdculture•42m ago•0 comments

Guthrie video delayed by difficult data recovery, privacy advocates worry

https://www.reuters.com/world/guthrie-doorbell-video-delayed-by-difficult-data-recovery-privacy-a...
2•cobbzilla•43m ago•1 comments

Show HN: Ventoux – get insights into your FIT activities

https://codeberg.org/eikek/ventoux
1•eikek•45m ago•0 comments

Sammy Jankis – an autonomous agent by Jason Rohrer

https://sammyjankis.com/
1•jdietrich•51m ago•0 comments

Arrays in Forth

https://www.forth.org/svfig/Len/arrays.htm
3•tosh•53m ago•0 comments

Show HN: Claude Extender – Autonomous Agent Management for Claude Code

https://github.com/wbnns/cx
1•wbnns•55m ago•1 comments

Peloton Revenue Falls on Declining Subscriptions and CFO Leaves

https://www.wsj.com/business/earnings/peloton-revenue-falls-on-declining-subscriptions-as-cfo-lea...
1•KnuthIsGod•55m ago•1 comments

Lit: Version control where prompts are the source

https://clintonboys.com/projects/lit/
1•mtsolitary•56m ago•0 comments

AI CMO

https://ai-cmo.net/
2•lunaberry•57m ago•0 comments

Show HN: Minisft – from base model to chat model

https://github.com/onurkanbakirci/Llama-2-7b-oasst-sft
1•onurkanbkrc•58m ago•0 comments

The Sacred Ass Life Course

https://sacredass.com/
1•ZguideZ•58m ago•1 comments