Tokasaurus: An LLM inference engine for high-throughput workloads

https://scalingintelligence.stanford.edu/blogs/tokasaurus/
197•rsehrlich•18h ago

Comments

behnamoh•18h ago
While Tokasaurus’s Async-TP shows impressive throughput gains, it seems over-engineered for common use cases. The CPU overhead from async tensor parallelism only pays off at 6k+ token batches, and you need NVLink-connected GPUs to see real benefits. Most prod deployments don’t need this complexity — you’re better off with simpler approaches unless you’re specifically optimizing for massive batch throughput. The adaptive manager skipping “optional” tasks under load also feels concerning from a reliability perspective.
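For illustration only, the trade-off described in this comment can be sketched as a simple gate: take the async tensor-parallel path only when the batch is large enough and the GPUs are NVLink-connected. The 6,000-token threshold is the figure quoted in the comment; the function and parameter names are hypothetical and are not taken from the Tokasaurus codebase.

```python
# Hypothetical sketch: names and structure are assumptions, not Tokasaurus's API.
ASYNC_TP_MIN_BATCH_TOKENS = 6_000  # threshold quoted in the comment above


def use_async_tp(
    batch_tokens: int,
    has_nvlink: bool,
    threshold: int = ASYNC_TP_MIN_BATCH_TOKENS,
) -> bool:
    """Return True when the async tensor-parallel path is likely worthwhile.

    Below the threshold the extra CPU overhead is assumed to outweigh the
    gain, and without NVLink the communication overlap is assumed not to pay off.
    """
    return has_nvlink and batch_tokens >= threshold


if __name__ == "__main__":
    for tokens in (512, 6_000, 32_000):
        print(tokens, use_async_tp(tokens, has_nvlink=True))
```

Under these assumptions, small interactive batches stay on the simpler path, while large offline batches opt in.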
bjt12345•17h ago
But surely next year's production deployments will be very different from today's, with different use cases, etc.
jdiff•16h ago
Sure. Things change over time. Is there a reason to believe they'd be different in such a way that this would be more useful than in today's landscape? I haven't seen such a forecast myself.
YetAnotherNick•17h ago
Depends on what production means for you. This is useful for batch production jobs.

Also, this seems very useful for generating synthetic data or labelling a bunch of data. 6k batch size is small for data labelling.

cpard•12h ago
How big of a use case is synthetic data generation? I’m curious as I see a lot about it coming from academic projects but I haven’t seen much related to commercial use cases
electroglyph•10h ago
tiny NNs distilled from LLMs can produce some amazing results, i'm surprised it's not more common tbh
cpard•10h ago
I agree, there are impressive results. This just came out from Berkeley https://arxiv.org/abs/2506.04178

But still, I mainly see work in this direction in academia.

nabakin•17h ago
> On throughput-focused benchmarks, Tokasaurus can outperform vLLM and SGLang by up to 3x+.

Looks like they don't compare to TensorRT-LLM throughput numbers which, last I checked, are SOTA in open source.

andersa•15h ago
TensorRT-LLM being open source is a lie, all the important kernels are loaded from cubins.
qeternity•8h ago
It also appears that this was a sampling benchmark...which is not representative.

Generation benchmark was 5% faster than SGLang.

symbolicAGI•17h ago
Given chat and API needs for low latency, llama.cpp is probably still the best choice for self-hosted models, with or without GPU support. And Ollama is the leader for wrapping llama.cpp.

Because Tokasaurus was mentioned as better than Ollama for conducting darwinian godel machine operations (self-improvement), I looked for the linked repo on GitHub and it was 404. So glad it is back https://github.com/ScalingIntelligence/tokasaurus.

radq•14h ago
Cool project! The codebase is simple and well documented, a good starting point for anyone interested in how to implement a high-performance inference engine. The prefix sharing is very relevant for anyone running batch inference to generate RL rollouts.
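As a rough sketch of why prefix sharing matters for rollout-style batches, the snippet below counts how many prompt tokens need prefill when a common prefix is processed once instead of once per sequence. It is a simplified accounting model with hypothetical helper names, not the mechanism Tokasaurus actually implements.

```python
from typing import Sequence

# Hypothetical helpers for illustration; not part of Tokasaurus.


def common_prefix_len(seqs: Sequence[Sequence[int]]) -> int:
    """Length of the longest token prefix shared by every sequence."""
    if not seqs:
        return 0
    shortest = min(len(s) for s in seqs)
    for i in range(shortest):
        tok = seqs[0][i]
        if any(s[i] != tok for s in seqs):
            return i
    return shortest


def prefill_tokens(seqs: Sequence[Sequence[int]], share_prefix: bool) -> int:
    """Tokens to prefill, optionally processing the shared prefix only once."""
    total = sum(len(s) for s in seqs)
    if not share_prefix or not seqs:
        return total
    prefix = common_prefix_len(seqs)
    # Every sequence after the first reuses the shared prefix.
    return total - prefix * (len(seqs) - 1)


if __name__ == "__main__":
    system_prompt = list(range(1000))  # same long prompt for every rollout
    rollouts = [system_prompt + [10_000 + i] for i in range(64)]
    print(prefill_tokens(rollouts, share_prefix=False))
    print(prefill_tokens(rollouts, share_prefix=True))
```

The longer the shared prompt and the more rollouts per prompt, the larger the gap between the two counts.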
refibrillator•14h ago
The code has few comments but gotta love when you can tell someone was having fun!

https://github.com/ScalingIntelligence/tokasaurus/blob/65efb...

I’m honestly impressed that a pure python implementation can beat out vLLM and SGLang. Granted they lean on FlashInfer, and of course torch.compile has gotten incredibly powerful in the last few years. Though dynamic shapes have still been a huge thorn in my side, I’ll need to look closer at how they pulled it off…

bobrenjc93•12h ago
Hi! I work on dynamic shapes in pytorch and would love to hear more about the challenges you’ve run into. We’re always looking to improve the experience, so if you’re open to chatting, feel free to DM me on Twitter (@bobrenjc93) or email me at bobren@meta.com.
gricardo99•11h ago
Since you work on PyTorch, what would you say is the best place to ask questions about general usage and troubleshooting? I’ve been struggling with what I would consider a simple torchrun elastic training example, and haven’t found any good resources online. I’ve been spelunking through PyTorch but have a feeling a little back and forth with someone familiar with these features would immensely clear things up.
bobrenjc93•11h ago
PyTorch Dev Discuss is a fantastic forum where many core devs actively participate and answer questions: https://dev-discuss.pytorch.org

In addition to Dev Discuss, a number of core contributors are also active on Twitter. Two particularly helpful and prolific voices are @ezyang and @cHHillee.

Finally, don’t overlook GitHub issues—they’re a surprisingly effective way to start conversations. If you’ve found a bug or have ideas on how to improve the APIs, opening an issue is always welcome.

almostgotcaught•4h ago
There's also the slack but you gotta know someone to get on that ;)
chillee•11h ago
I mean, vllm and sglang are both "pure python" essentially as well. But yeah, in ML you rarely require C++ to get good performance for most of the systems people are writing.
AStonesThrow•13h ago
Stanford was edgy enough to reefer to “toking” in the moniker, but exercises restraint by depicting the titular thunder lizard smoking a putatively conventional tobacco cigarette.

I am hoping to use this “Tokasaurus” nickname with affection for my neighbors. If Stanford is ok with informal usage.

Success with Meta AI / Llama 4:

Hey Meta, I would like to see an image of a Tyrannosaurus Rex, who is clad in a leather jacket, sunglasses, and fedora. He is so cool looking, and smoking a joint of marijuana, and his image is superimposed against a skyline of Phoenix in the golden glow of sunset.

Can you light up the joint with a glowing tip?

Art9681•12h ago
Proof that attention is not only highly desired by Stanford tech bros, but HN keyboard warriors equipped with LLM tech. Everyone is clever all of the time.
catlifeonmars•12h ago
I appreciate the double entendre
DiabloD3•6h ago
Shame this is written in Python, looks very interesting, but I'm no expert in this field.

If there is anything here worth using, it's entirely possible that the llama.cpp crew can save it from vanishing into obscurity.

Szpadel•6h ago
I'm curious how big the latency tradeoff is. I know the assumption here is that it does not matter in those use cases, but what order of magnitude is it? 10x? 100x?

This is important for usage in "soft realtime" applications, where you do not need an instant response but someone is still waiting.

If the latency is really big, then it can basically only be used for background processes.
