4x faster LLM inference (Flash Attention guy's company)

https://www.together.ai/blog/adaptive-learning-speculator-system-atlas
79•alecco•5h ago

Comments

petesergeant•2h ago
> Built on top of Together Turbo Speculator, ATLAS reaches up to 500 TPS on DeepSeek-V3.1 and up to 460 TPS on Kimi-K2 in a fully adapted scenario — 2.65x faster than standard decoding, outperforming even specialized hardware like Groq

and yet, if you click on: https://openrouter.ai/moonshotai/kimi-k2-0905

You'll see Groq averaging 1,086tps vs Together doing 59tps. Groq and Cerebras often feel like the only games in town. I'd love that to be different (because I'd like more models!), but nobody else is coming close right now.

Comparing how quickly gpt-oss-120b runs gives a broader picture: https://openrouter.ai/openai/gpt-oss-120b -- Vertex (Google) and SambaNova do pretty well on it too, but still, the difference between a top provider and an also-ran is giant.

God I love OpenRouter.

senko•2h ago
> You'll see Groq averaging 1,086tps

What I don't understand is Groq itself reporting 200tps for the same model: https://console.groq.com/docs/model/moonshotai/kimi-k2-instr...

OpenRouter numbers look fishy.

p1esk•2h ago
Do these numbers compare performance at the same cost?
petesergeant•23m ago
You can see the cost in the links, and the answer is “pretty much” for the consumer. The backend maths, no idea.
Havoc•2h ago
>Groq and Cerebras often feel like the only games in town.

SambaNova should be similar...they've got a similar specialized hardware approach

jbellis•2h ago
groq is quantizing, even though it's not labeled as such on openrouter (super frustrating)
bn-l•1h ago
Do you have a source for that? They are pretty close to the ref implementation on moonshot’s ranking
immortal3•2h ago
There's another angle to this comparison. Groq and Cerebras use custom chips, but I'm not sure about Together. In this case, Together is sharing results based on the B200 GPU. Another important point is the accuracy of these speed-ups compared to the baseline model. It's known that such tricks reduce accuracy, but by how much? Kimi has already benchmarked several providers. https://x.com/Kimi_Moonshot/status/1976926483319763130
jsheard•2h ago
> Groq and Cerebras use custom chips

Not just custom chips, but custom chips which derive much of their performance from enormous amounts of SRAM. There's no denying that approach is fast, but it's also incredibly expensive, and SRAM scaling has slowed to a crawl so it won't get much cheaper any time soon.

rfoo•1h ago
> It's known that such tricks reduce accuracy

AFAIU, speculative decoding (and this fancier version of spec. decoding) does not reduce accuracy.

martinald•1h ago
No, it shouldn't. "All" you're doing is having a small model draft the next few tokens and then having the large model "verify" them. Where the large model diverges from the small one, you throw away the rest and restart the process from that point.
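
In toy Python, the greedy version of that loop looks something like this (draft_model/target_model are made-up stand-ins, not anything from the article; a real engine batches the verification calls into a single forward pass):

    # Greedy speculative decoding: the draft proposes k tokens, the
    # target checks them. In a real engine the k+1 target calls below
    # are one batched forward pass, which is where the speedup comes from.
    def speculative_decode(target_model, draft_model, prompt, k=4, max_len=16):
        tokens = list(prompt)
        while len(tokens) < max_len:
            draft = []
            for _ in range(k):                        # k cheap serial steps
                draft.append(draft_model(tokens + draft))
            # Target's next-token choice at every draft position (+1 bonus).
            verified = [target_model(tokens + draft[:i]) for i in range(k + 1)]
            n = 0
            while n < k and draft[n] == verified[n]:  # longest agreeing prefix
                n += 1
            # Accept the prefix plus the target's own token at the split,
            # then restart drafting from there. The output is exactly what
            # the target alone would have produced, so quality is unchanged.
            tokens += draft[:n] + [verified[n]]
        return tokens

    # Deterministic toy "models" that agree most of the time:
    target = lambda seq: (sum(seq) * 31) % 100
    draft  = lambda seq: (sum(seq) * 31) % 100 if sum(seq) % 3 else (sum(seq) * 31 + 1) % 100
    print(speculative_decode(target, draft, [1, 2, 3]))
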
meander_water•2h ago
Interesting, if you take a look at the median throughput chart [0], groq goes insane after 7th Oct. Wonder what happened.

[0] https://openrouter.ai/moonshotai/kimi-k2-0905/performance

awestroke•2h ago
Heavy quantization
KronisLV•49m ago
> I'd love that to be different (because I'd like more models!), but nobody else is coming close right now.

I'm currently on the Cerebras Code subscription for like 50 USD a month because it more or less makes the rate limits I used to deal with on other platforms disappear (without making me spend upwards of 100 USD paying per token): https://www.cerebras.ai/blog/introducing-cerebras-code

At the same time, their Qwen Coder 480B model is fine, but I still find myself going for Claude or GPT-5 or Gemini 2.5 Pro for more complex issues (or ones where I need good usage of the Latvian language). At least for programming tasks, it'd eventually be super cool if they could offer more models.

Or have some sort of a partnership with Anthropic or whoever, because getting my questions answered at around 500-1500 TPS is really, really pleasant, especially for agentic use cases with code modifications, even if I still bump into the 128k context limits occasionally.

ashvardanian•2h ago
Will need some time to go through the details, but it’s increasingly rare to see teams consistently delivering meaningful improvements in the open. Impressive work!
wishawa•2h ago
Inference is impressively fast. But what about quality? In the Kimi vendor verifier (https://github.com/MoonshotAI/K2-Vendor-Verifier/), Together has one of the highest tool call failure rates (>300 failures over the benchmark, compared to 0-2 for the official API, groq, SiliconFlow, and Infinigence).
rfoo•1h ago
If you compare "schema validation error count" plus "Count of Finish Reason others", then SiliconFlow and Infinigence are in the same bucket too. Maybe their API layer detected incorrect tool calls and set the finish reason to something else?

IMO this likely is what you get from running the model correctly as-is (i.e. using the same weight and activation dtype), so Together is not bad.

Moonshot AI themselves and Groq likely use some sampler tricks to eliminate schema validation errors.

So really the only thing this shows is: Nebius, Chutes, and AtlasCloud could be running something else (for example, a further-quantized model). Or bugs.
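
A guess at what such a sampler trick looks like: constrained decoding, where any next token that would break the tool-call grammar is masked out before sampling, so schema errors become impossible by construction. A toy sketch (the allowed set would come from a real grammar state machine, which is hand-waved here):

    # Toy constrained sampler: zero out tokens the grammar forbids at
    # this position, then pick greedily (a real sampler would renormalize
    # and draw). The model's weights are untouched; only sampling is filtered.
    def masked_pick(probs, vocab, allowed):
        masked = [p if tok in allowed else 0.0 for p, tok in zip(probs, vocab)]
        if sum(masked) == 0:
            raise ValueError("grammar and model distribution are disjoint")
        return max(zip(vocab, masked), key=lambda pair: pair[1])[0]

    # E.g. right after emitting '{"name": ' only a quote is legal, even
    # though the raw model prefers '{':
    vocab = ['"', '{', 'get_weather', '}']
    probs = [0.10, 0.50, 0.30, 0.10]
    print(masked_pick(probs, vocab, allowed={'"'}))  # -> "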

wishawa•13m ago
Fair point. If Moonshot is holding back the true weights or inference techniques that affect correctness, then providers including Together should call them out on that. I for one would stop using Kimi if that is the case.

Anyway, Novita is doing significantly better on the vendor verifier chart than Together, so the low quality must be at least partially Together's fault.

Havoc•2h ago
>a faster speculator (also known as the draft model) proposes multiple tokens ahead, and the target model verifies them in parallel in a single forward pass

TIL. Bit of an aha moment - never understood till now how a big model can verify faster than it can generate

woadwarrior01•26m ago
As with almost everything else in CS, it's a tradeoff. Pre-fill is compute bound; decoding is memory-bandwidth bound. Speculative decoding works when the draft model is more often right than wrong, because most architectures have a lot more compute than memory bandwidth.
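
Back-of-the-envelope, with made-up numbers (i.i.d. acceptance rate, a draft step costing 5% of a target pass):

    # Rough speedup model for speculative decoding with k drafted tokens
    # and per-token acceptance probability a (i.i.d. assumption).
    def expected_speedup(a, k=4, draft_cost=0.05):
        # Mean tokens per cycle: accepted prefix plus the target's one
        # free token = (1 - a**(k+1)) / (1 - a), a geometric series.
        tokens_per_cycle = (1 - a ** (k + 1)) / (1 - a)
        # One cycle = k serial draft steps + one target pass (cost 1.0);
        # plain decoding yields 1 token per target pass, hence the ratio.
        return tokens_per_cycle / (k * draft_cost + 1.0)

    for a in (0.6, 0.8, 0.9):
        print(f"acceptance {a:.0%}: {expected_speedup(a):.2f}x")
    # ~1.9x, ~2.8x, ~3.4x: in the ballpark of the 2.65x the article claims
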
andblac•1h ago
At first glance, this reminds me of how branch prediction is used in CPUs to speed up execution. As I understand it, this development is like a form of soft branch prediction over language trajectories: a small model predicts what the main model will do, takes a few steps ahead, and then verifies the results (and this can be done in parallel). If it checks out, you just jump forward; if not, you take a miss, but that's rare. I find it funny how small-big ideas like this come up in different contexts again and again in the history of our technological development. Of course, ideas, as always, are cheap. The hard part is how to actually use them and cash in on them.
LogicFailsMe•9m ago
No barrier to entry whatsoever? Backprop on the speculative decoding weights during inference to improve their accuracy on a per-application basis?

Cool hack though, kudos. Wonder if they can make Groq or Cerebras do the same thing?

Macro Gaussian Splats

https://danybittel.ch/macro.html
120•danybittel•3h ago•16 comments

Why it took 4 years to get a lock files specification

https://snarky.ca/why-it-took-4-years-to-get-a-lock-files-specification/
54•birdculture•4h ago•33 comments

Loko Scheme: bare metal optimizing Scheme compiler

https://scheme.fail/
32•dTal•5d ago•2 comments

Nostr and ATProto (2024)

https://shreyanjain.net/2024/07/05/nostr-and-atproto.html
30•sph•4h ago•4 comments

Meta Superintelligence's surprising first paper

https://paddedinputs.substack.com/p/meta-superintelligences-surprising
330•skadamat•14h ago•177 comments

Blood test detecting Long Covid in kids with 94% accuracy (microclots)

https://www.researchsquare.com/article/rs-7483367/v1
81•thenerdhead•2h ago•55 comments

The Flummoxagon

https://n-e-r-v-o-u-s.com/blog/?p=9827
62•robinhouston•4d ago•10 comments

C++ Reflection and Qt MOC

https://wiki.qt.io/C%2B%2B_reflection_(P2996)_and_moc
38•coffeeaddict1•3d ago•6 comments

Pipelining in psql (PostgreSQL 18)

https://postgresql.verite.pro/blog/2025/10/01/psql-pipeline.html
118•tanelpoder•9h ago•19 comments

Django: One ORM to rule all databases

https://www.paulox.net/2025/10/06/django-orm-comparison/
19•pauloxnet•6d ago•12 comments

Show HN: I made an esoteric programming language that's read like a spellbook

https://github.com/sirbread/spellscript
55•sirbread•8h ago•11 comments

I/O Multiplexing (select vs. poll vs. epoll/kqueue)

https://nima101.github.io/io_multiplexing
74•pykello•3d ago•23 comments

Ask HN: Abandoned/dead projects you think died before their time and why?

186•ofalkaed•15h ago•530 comments

Anthropic's Prompt Engineering Tutorial

https://github.com/anthropics/prompt-eng-interactive-tutorial
234•cjbarber•19h ago•44 comments

Vancouver Stock Exchange: Scam capital of the world (1989) [pdf]

https://scamcouver.wordpress.com/wp-content/uploads/2012/04/scam-capital.pdf
109•thomassmith65•14h ago•47 comments

Coral Protocol: Open infrastructure connecting the internet of agents

https://arxiv.org/abs/2505.00749
35•joj333•10h ago•7 comments

Show HN: A Lisp Interpreter for Shell Scripting

https://github.com/gue-ni/redstart
71•quintussss•3d ago•17 comments

A Guide for WireGuard VPN Setup with Pi-Hole Adblock and Unbound DNS

https://psyonik.tech/posts/a-guide-for-wireguard-vpn-setup-with-pi-hole-adblock-and-unbound-dns/
110•pSYoniK•18h ago•19 comments

Why Wikipedia cannot claim the Earth is not flat

https://en.wikipedia.org/wiki/Wikipedia:Why_Wikipedia_cannot_claim_the_Earth_is_not_flat
92•duncanjbrown•3h ago•51 comments

CamoLeak: Critical GitHub Copilot Vulnerability Leaks Private Source Code

https://www.legitsecurity.com/blog/camoleak-critical-github-copilot-vulnerability-leaks-private-s...
72•greyadept•14h ago•14 comments

Show HN: Rift – A tiling window manager for macOS

https://github.com/acsandmann/rift
163•atticus_•13h ago•86 comments

Paper2Video: Automatic Video Generation from Scientific Papers

https://arxiv.org/abs/2510.05096
64•jinqueeny•14h ago•14 comments

The World's 2.75B Buildings

https://tech.marksblogg.com/building-footprints-gba.html
87•marklit•4d ago•42 comments

Microsoft only lets you opt out of AI photo scanning 3x a year

https://hardware.slashdot.org/story/25/10/11/0238213/microsofts-onedrive-begins-testing-face-reco...
683•dmitrygr•19h ago•265 comments

A New Algorithm Makes It Faster to Find the Shortest Paths

https://www.wired.com/story/new-method-is-the-fastest-way-to-find-the-best-routes/
15•quapster•2h ago•7 comments

LineageOS 23

https://lineageos.org/Changelog-30/
272•cdesai•14h ago•106 comments

Testing two 18 TB white label SATA hard drives from datablocks.dev

https://ounapuu.ee/posts/2025/10/06/datablocks-white-label-drives/
192•thomasjb•6d ago•119 comments

The World Trade Center under construction through photos, 1966-1979

https://rarehistoricalphotos.com/twin-towers-construction-photographs/
224•kinderjaje•5d ago•106 comments

The <output> Tag

https://denodell.com/blog/html-best-kept-secret-output-tag
780•todsacerdoti•1d ago•174 comments