frontpage.

(Bsky thread) "This turns the maintainer into an unwitting vibe coder"

https://bsky.app/profile/fullmoon.id/post/3meadfaulhk2s
1•todsacerdoti•49s ago•0 comments

Software development is undergoing a Renaissance in front of our eyes

https://twitter.com/gdb/status/2019566641491963946
1•tosh•1m ago•0 comments

Can you beat ensloppification? I made a quiz for Wikipedia's Signs of AI Writing

https://tryward.app/aiquiz
1•bennydog224•2m ago•1 comments

Spec-Driven Design with Kiro: Lessons from Seddle

https://medium.com/@dustin_44710/spec-driven-design-with-kiro-lessons-from-seddle-9320ef18a61f
1•nslog•2m ago•0 comments

Agents need good developer experience too

https://modal.com/blog/agents-devex
1•birdculture•3m ago•0 comments

The Dark Factory

https://twitter.com/i/status/2020161285376082326
1•Ozzie_osman•3m ago•0 comments

Free data transfer out to internet when moving out of AWS (2024)

https://aws.amazon.com/blogs/aws/free-data-transfer-out-to-internet-when-moving-out-of-aws/
1•tosh•4m ago•0 comments

Interop 2025: A Year of Convergence

https://webkit.org/blog/17808/interop-2025-review/
1•alwillis•6m ago•0 comments

Prejudice Against Leprosy

https://text.npr.org/g-s1-108321
1•hi41•7m ago•0 comments

Slint: Cross Platform UI Library

https://slint.dev/
1•Palmik•10m ago•0 comments

AI and Education: Generative AI and the Future of Critical Thinking

https://www.youtube.com/watch?v=k7PvscqGD24
1•nyc111•11m ago•0 comments

Maple Mono: Smooth your coding flow

https://font.subf.dev/en/
1•signa11•12m ago•0 comments

Moltbook isn't real but it can still hurt you

https://12gramsofcarbon.com/p/tech-things-moltbook-isnt-real-but
1•theahura•15m ago•0 comments

Take Back the Em Dash–and Your Voice

https://spin.atomicobject.com/take-back-em-dash/
1•ingve•16m ago•0 comments

Show HN: 289x speedup over MLP using Spectral Graphs

https://zenodo.org/login/?next=%2Fme%2Fuploads%3Fq%3D%26f%3Dshared_with_me%25253Afalse%26l%3Dlist...
1•andrespi•17m ago•0 comments

Teaching Mathematics

https://www.karlin.mff.cuni.cz/~spurny/doc/articles/arnold.htm
2•samuel246•19m ago•0 comments

3D Printed Microfluidic Multiplexing [video]

https://www.youtube.com/watch?v=VZ2ZcOzLnGg
2•downboots•19m ago•0 comments

Abstractions Are in the Eye of the Beholder

https://software.rajivprab.com/2019/08/29/abstractions-are-in-the-eye-of-the-beholder/
2•whack•20m ago•0 comments

Show HN: Routed Attention – 75-99% savings by routing between O(N) and O(N²)

https://zenodo.org/records/18518956
1•MikeBee•20m ago•0 comments

We didn't ask for this internet – Ezra Klein show [video]

https://www.youtube.com/shorts/ve02F0gyfjY
1•softwaredoug•21m ago•0 comments

The Real AI Talent War Is for Plumbers and Electricians

https://www.wired.com/story/why-there-arent-enough-electricians-and-plumbers-to-build-ai-data-cen...
2•geox•24m ago•0 comments

Show HN: MimiClaw, OpenClaw (Clawdbot) on $5 Chips

https://github.com/memovai/mimiclaw
1•ssslvky1•24m ago•0 comments

I Maintain My Blog in the Age of Agents

https://www.jerpint.io/blog/2026-02-07-how-i-maintain-my-blog-in-the-age-of-agents/
3•jerpint•24m ago•0 comments

The Fall of the Nerds

https://www.noahpinion.blog/p/the-fall-of-the-nerds
1•otoolep•26m ago•0 comments

Show HN: I'm 15 and built a free tool for reading ancient texts.

https://the-lexicon-project.netlify.app/
3•breadwithjam•29m ago•1 comments

How close is AI to taking my job?

https://epoch.ai/gradient-updates/how-close-is-ai-to-taking-my-job
1•cjbarber•29m ago•0 comments

You are the reason I am not reviewing this PR

https://github.com/NixOS/nixpkgs/pull/479442
2•midzer•31m ago•1 comments

Show HN: FamilyMemories.video – Turn static old photos into 5s AI videos

https://familymemories.video
1•tareq_•32m ago•0 comments

How Meta Made Linux a Planet-Scale Load Balancer

https://softwarefrontier.substack.com/p/how-meta-turned-the-linux-kernel
1•CortexFlow•32m ago•0 comments

A Turing Test for AI Coding

https://t-cadet.github.io/programming-wisdom/#2026-02-06-a-turing-test-for-ai-coding
2•phi-system•33m ago•0 comments

AdapTive-LeArning Speculator System (ATLAS): Faster LLM inference

https://www.together.ai/blog/adaptive-learning-speculator-system-atlas
198•alecco•3mo ago

Comments

petesergeant•3mo ago
> Built on top of Together Turbo Speculator, ATLAS reaches up to 500 TPS on DeepSeek-V3.1 and up to 460 TPS on Kimi-K2 in a fully adapted scenario — 2.65x faster than standard decoding, outperforming even specialized hardware like Groq

and yet, if you click on: https://openrouter.ai/moonshotai/kimi-k2-0905

You'll see Groq averaging 1,086tps vs Together doing 59tps. Groq and Cerebras often feel like the only games in town. I'd love that to be different (because I'd like more models!), but nobody else is coming close right now.

Comparing how quickly gpt-oss-120b runs gives a broader picture: https://openrouter.ai/openai/gpt-oss-120b -- Vertex (Google) and SambaNova do pretty good on it too, but still, the difference between a top provider and an also-ran is giant.

God I love OpenRouter.

senko•3mo ago
> You'll see Groq averaging 1,086tps

What I don't understand is that Groq themselves report 200tps for the same model: https://console.groq.com/docs/model/moonshotai/kimi-k2-instr...

OpenRouter numbers look fishy.

petesergeant•3mo ago
Wonder if it’s prompt caching? OpenRouter is (I guess) just reporting actual throughput, where presumably groq is reporting a from-scratch figure? Just a guess tho.
p1esk•3mo ago
Do these numbers compare performance at the same cost?
petesergeant•3mo ago
You can see the cost in the links, and the answer is “pretty much” for the consumer. The backend maths, no idea.
Havoc•3mo ago
>Groq and Cerebras often feel like the only games in town.

SambaNova should be similar...they've got a similar specialized hardware approach

jbellis•3mo ago
groq is quantizing, even though it's not labeled as such on openrouter (super frustrating)
bn-l•3mo ago
Do you have a source for that? They are pretty close to the ref implementation on moonshot’s ranking
jbellis•3mo ago
https://groq.com/blog/inside-the-lpu-deconstructing-groq-spe...
immortal3•3mo ago
There's another angle to this comparison. Groq and Cerebras use custom chips, but I'm not sure about Together. In this case, Together is sharing results based on the B200 GPU. Another important point is the accuracy of these speed-ups compared to the baseline model. It's known that such tricks reduce accuracy, but by how much? Kimi has already benchmarked several providers. https://x.com/Kimi_Moonshot/status/1976926483319763130
jsheard•3mo ago
> Groq and Cerebras use custom chips

Not just custom chips, but custom chips which derive much of their performance from enormous amounts of SRAM. There's no denying that approach is fast, but it's also incredibly expensive, and SRAM scaling has slowed to a crawl so it won't get much cheaper any time soon.

petesergeant•3mo ago
This is an "expensive for whom" question. I'd be keen to know if they're burning investor money hosting these right now or if they're able to run these at cost.
rfoo•3mo ago
> It's known that such tricks reduce accuracy

AFAIU, speculative decoding (and this fancier version of spec. decoding) does not reduce accuracy.

martinald•3mo ago
No, it shouldn't. "All" you're doing is having a small model run ahead on the prompt and then having the large model "verify" its tokens. When the large model diverges from the small one, you restart the process from that point.
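A toy sketch of that loop (illustrative only, not Together's implementation: greedy decoding with exact-match acceptance, and stand-in bigram "models" so the snippet runs on its own):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16

# Stand-ins for the two models: each maps the last token to next-token logits.
# The "draft" is a cheap, slightly-off copy of the "target" (in reality it
# would be a much smaller network, not a perturbed copy).
W_target = rng.normal(size=(VOCAB, VOCAB))
W_draft = W_target + 0.1 * rng.normal(size=(VOCAB, VOCAB))

def greedy(W, seq):
    return int(np.argmax(W[seq[-1]]))  # greedy next token given the prefix

def speculative_decode(prompt, n_new, k=4):
    """Greedy speculative decoding with exact-match acceptance."""
    seq = list(prompt)
    target_rounds = 0
    while len(seq) < len(prompt) + n_new:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(greedy(W_draft, seq + draft))
        # 2. Target model checks each drafted token. In a real transformer
        #    these k checks are a single batched forward pass; the toy just
        #    loops per position.
        target_rounds += 1
        accepted = []
        for i, tok in enumerate(draft):
            target_tok = greedy(W_target, seq + draft[:i])
            if target_tok == tok:
                accepted.append(tok)
            else:
                # 3. First divergence: keep the target's token instead and
                #    restart drafting from this point.
                accepted.append(target_tok)
                break
        else:
            # All k accepted; a real system gets one extra target token
            # for free out of the same verification pass.
            accepted.append(greedy(W_target, seq + draft))
        seq += accepted
    return seq[len(prompt):][:n_new], target_rounds

out, rounds = speculative_decode([1, 2, 3], n_new=20)
print(out, "target rounds:", rounds)  # typically far fewer than 20 rounds
```

With exact-match acceptance the output is identical to what greedy decoding with the target model alone would produce; the win is that each target round covers several tokens instead of one.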
Der_Einzige•3mo ago
It’s quantization which is crippling accuracy…
petesergeant•3mo ago
People all over this subthread are saying that with no evidence provided. The company say they don't — which would be pretty embarrassing to have to walk back — so who's saying they do?
meander_water•3mo ago
Interesting, if you take a look at the median throughput chart [0], groq goes insane after 7th Oct. Wonder what happened.

[0] https://openrouter.ai/moonshotai/kimi-k2-0905/performance

awestroke•3mo ago
Heavy quantization
petesergeant•3mo ago
They claim (or someone on Reddit who claims to be staff claims) that's not accurate: https://www.reddit.com/r/LocalLLaMA/comments/1mk4kt0/comment...
sigmar•3mo ago
2x jump overnight. new LPU hardware? I checked the speed for groq's gpt-oss-120B, Llama4-maverick, and Llama4-scout; none of them had a noticeable change this month
KronisLV•3mo ago
> I'd love that to be different (because I'd like more models!), but nobody else is coming close right now.

I'm currently on the Cerebras Code subscription for like 50 USD a month because it more or less makes the rate limits I used to deal with on other platforms disappear (without making me spend upwards of 100 USD paying per token): https://www.cerebras.ai/blog/introducing-cerebras-code

At the same time, their Qwen Coder 480B model is fine, but I still find myself going for Claude or GPT-5 or Gemini 2.5 Pro for more complex issues (or ones where I need good usage of the Latvian language). At least for programming tasks, it'd eventually be super cool if they could offer more models.

Or have some sort of a partnership with Anthropic or whoever, because getting my questions answered at around 500-1500 TPS is really, really pleasant, especially for agentic use cases with code modifications, even if I still bump into the 128k context limits occasionally.

alecco•3mo ago
But Groq/Cerebras are hardware accelerators. It's an unrelated optimization. I wouldn't be surprised if they could also use speculators (today or in the future).
ashvardanian•3mo ago
Will need some time to go through the details, but it’s increasingly rare to see teams consistently delivering meaningful improvements in the open. Impressive work!
wishawa•3mo ago
Inference is impressively fast. But what about quality? In the Kimi vendor verifier (https://github.com/MoonshotAI/K2-Vendor-Verifier/), Together has one of the highest tool call failure rates (>300 failures over the benchmark, compared to 0-2 for the official API, groq, SiliconFlow, and Infinigence).
rfoo•3mo ago
If you compare "schema validation error count" plus "Count of Finish Reason others", then SiliconFlow and Infinigence are in the same bucket too. Maybe their API layer detected incorrect tool calls and set the finish reason to something else?

IMO this likely is what you get from running the model correctly as-is (i.e. using the same weight and activation dtype), so Together is not bad.

Moonshot AI themselves and Groq likely use some sampler tricks to eliminate schema validation errors.

So really the only thing this shows is that Nebius, Chutes, and AtlasCloud could be running something else (for example, a further quantized model). Or they have bugs.

wishawa•3mo ago
Fair point. If Moonshot is holding back the true weights or inference techniques that affect correctness, then providers including Together should call them out on that. I for one would stop using Kimi if that is the case.

Anyway, Novita is doing significantly better on the vendor verifier chart than Together, so the low quality must be partially Together's fault at least.

rfoo•3mo ago
I don't think it's the weights being different or special inference techniques; more likely they aren't able to train the model to follow the tool schema perfectly yet, and both Moonshot and Groq decided to use something like https://github.com/noamgat/lm-format-enforcer to make sure at least the output format is correct.
sailingparrot•3mo ago
I don't know anything about Together quality in general, but the specific technique discussed here (speculative decoding) has no impact on the quality of generations. So you should be able to apply it to whichever model you want, and see the advertised speedup while retaining the quality of your base model.
furyofantares•3mo ago
> the specific technique discussed here (speculative decoding) has no impact on the quality of generations

I don't see why that would be true. As I understand it, the verifier is checking whether the tokens are good enough, not whether they're the exact same tokens it would have selected. The predicted tokens could be consistently slightly worse, which could have a cascading effect that makes the overall output a lot worse.

sailingparrot•3mo ago
> the verifier is checking if the tokens are good-enough, not if they're the exact same tokens it would have selected

That's up to you; it depends on how you implement it and how much you want to prioritize speed at the expense of quality, and it is not an intrinsic attribute of speculative decoding. The verifier checks if the tokens predicted by the draft model are part of the top-k tokens predicted by the full-size model at each step. Set k to 1 and you will only accept perfect matches. Set k to > 1 and you will indeed start selecting "good enough" tokens, but will get faster inference.

But no matter what value you choose for k, the technique described in the article can apply and will result in faster inference at no loss when compared to a setup without this technique, with the same value of k.

buildbot•3mo ago
It can be exact or not! Depends on the kind of sampling you are doing.

You can do exact verification, and as soon as a token mismatches you reject everything after that token from your draft. Relaxed acceptance techniques measure how wrong that mispredicted token is via some metric, and accept it if it’s close enough. So you get longer draft lengths with higher acceptance rates.
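A hedged illustration of those two acceptance styles; the exact rule is unambiguous, while the relaxed rule below is just one possible "close enough" metric (real systems choose their own):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def accept_exact(target_logits, draft_token):
    """Strict rule: keep the drafted token only if the target model would
    have picked exactly the same token under greedy decoding."""
    return int(np.argmax(target_logits)) == draft_token

def accept_relaxed(target_logits, draft_token, top_k=5, min_prob=0.05):
    """Relaxed rule (one possible metric): keep the drafted token if the
    target ranks it within its top-k AND assigns it at least min_prob.
    Looser thresholds mean longer accepted drafts but more drift from
    what the target would have written on its own."""
    probs = softmax(target_logits)
    in_top_k = draft_token in np.argsort(target_logits)[::-1][:top_k]
    return bool(in_top_k and probs[draft_token] >= min_prob)

# Tiny demo over a fake 8-token vocabulary.
logits = np.array([0.1, 2.0, 1.9, -1.0, 0.0, 0.5, -2.0, 0.3])
print(accept_exact(logits, draft_token=2))    # False: the target prefers token 1
print(accept_relaxed(logits, draft_token=2))  # True: token 2 is a close second
```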

gkapur•3mo ago
Adding to the prior comments as my intuition matched yours, there’s a nice Reddit thread that gives some context into how it can be faster even if you require exact matches: https://www.reddit.com/r/LocalLLaMA/s/ARxHLqRjdM

The TLDR/key (from my understanding) is that verifying N tokens can be faster than generating N tokens.

sailingparrot•3mo ago
> The TLDR/key (from my understanding) is that verifying N tokens can be faster than generating N tokens.

Yes. This is because to generate token n+1 you need token n, and so on. So generating from scratch is a sequential (thus slow) process. When we verify tokens, we can, for each token, use all preceding tokens as input and check that the output token matches the expectation. But since the full sequence we want to verify already exists, we can do this for every token in parallel rather than sequentially.

This is why training transformer models is much faster than training RNNs: we do the same thing during training, it's just that the sequence we compare against is the ground truth rather than the output of another model.
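A tiny self-contained demonstration of that point (a toy model, nothing like a real transformer; the only thing it shows is that every verification step depends on already-known tokens rather than on a previous output):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D = 16, 32
embed = rng.normal(size=(VOCAB, D))
unembed = rng.normal(size=(D, VOCAB))

def next_logits(prefix):
    """Stand-in for one forward pass: logits for the token after `prefix`."""
    return embed[prefix].mean(axis=0) @ unembed

# Generation: token n+1 needs token n, so this loop is inherently sequential.
seq = [3]
for _ in range(8):
    seq.append(int(np.argmax(next_logits(seq))))

# Verification: the candidate sequence already exists, so the check at every
# position uses only known prefixes. In a transformer this is one batched
# forward pass (the same teacher-forcing trick used in training); the toy
# just writes it as an order-independent comprehension.
draft = seq[1:]
ok = [int(np.argmax(next_logits(seq[:i + 1]))) == tok for i, tok in enumerate(draft)]
print(ok)  # all True here, since the "draft" was produced by the model itself
```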

wishawa•3mo ago
I didn't know this! I've always thought speculative decoding was "if p(draft_token) > threshold, use it". You made me go read how it actually works and it's pretty neat!

That said, I still think some providers are cheating. Please correct me if the test below is flawed.

I generated texts at temperature = 0 vs temperature = 2. At high temperature, the distributions effectively become flatter, meaning the difference between real and draft effective distributions (the D_LK used in theorem 3.5 of 2211.17192) becomes smaller. When T=2, the model speaks complete gibberish, so the effective distribution must be pretty flat. This should mean fewer rejections --> a lot faster speculative decoding. Yet, I see no increase in throughput at all...
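For what it's worth, the measurement described above could be reproduced roughly like this through OpenRouter's OpenAI-compatible endpoint (a hypothetical sketch: the model slug, prompt, and provider-pinning field are assumptions to adjust, and wall-clock TPS also absorbs network and batching noise):

```python
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter speaks the OpenAI API
    api_key="sk-or-...",                      # your OpenRouter key
)

def throughput(temperature, model="moonshotai/kimi-k2-0905", runs=3):
    """Average completion tokens per second at a given temperature."""
    tps = []
    for _ in range(runs):
        t0 = time.time()
        resp = client.chat.completions.create(
            model=model,
            temperature=temperature,
            max_tokens=512,
            messages=[{"role": "user",
                       "content": "Write a long story about a lighthouse."}],
            # Pinning the provider (here Together) is assumed to work via
            # OpenRouter's provider-routing extension; adjust if the field differs.
            extra_body={"provider": {"order": ["Together"],
                                     "allow_fallbacks": False}},
        )
        tps.append(resp.usage.completion_tokens / (time.time() - t0))
    return sum(tps) / len(tps)

print("T=0:", throughput(0.0))
print("T=2:", throughput(2.0))  # flatter distributions; naively, fewer rejections
```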

sailingparrot•3mo ago
Not sure exactly what setup you are running, but in theory yes: higher temperature for both models means a higher chance of overlap and thus fewer rejections -> faster sampling (but worse quality overall).

However, if you have a higher temperature but are still operating under top-k sampling where k is small, I'm not sure it's going to translate into any noticeable difference, since the small k keeps your actual distributions very much non-uniform.

wishawa•3mo ago
This is with Together's API via OpenRouter, running DeepSeek V3 0324 and Kimi K2 0905.

I didn't set a top-k. So it seems like Together must be doing something weird in their speculative decoding implementation.

sailingparrot•3mo ago
Oh, in that case there is definitely a top-k or top-p behind the scenes; it might just not be exposed to the user as a param they can change through the API. I haven't heard of anyone running an LLM in prod with actual pure sampling.
wishawa•3mo ago
I see. That's slightly unfortunate. In principle, increasing temperature flattens out the distribution but the ordering between different tokens' probabilities remains the same, so setting a top-k shouldn't break my test. Can't say the same for top-p though. And all of this is probably too deep into the provider's implementation details for me to make assumptions about.
Havoc•3mo ago
>a faster speculator (also known as the draft model) proposes multiple tokens ahead, and the target model verifies them in parallel in a single forward pass

TIL. Bit of an aha moment - never understood till now how a big model can verify faster than it can generate

woadwarrior01•3mo ago
As with almost everything else in CS, it's a tradeoff. Pre-fill is compute bound; decoding is memory-bandwidth bound. Speculative decoding works when the draft model is more often right than wrong, because most architectures have a lot more compute available than memory bandwidth.
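A back-of-the-envelope version of that argument, with made-up but representative hardware numbers (the point is the ratio, not the absolute values):

```python
# Why verifying several drafted tokens per step is nearly free when
# decoding is memory-bandwidth bound (illustrative numbers only).
weight_bytes = 70e9 * 2      # hypothetical 70B-parameter model in bf16
mem_bw = 3.0e12              # ~3 TB/s of HBM bandwidth
flops_peak = 1.0e15          # ~1 PFLOP/s of matmul throughput

# Decoding one token streams all the weights once: bandwidth-bound.
t_mem = weight_bytes / mem_bw            # ~47 ms per step

# The matmul work for one token is roughly 2 FLOPs per parameter.
t_compute_1 = 2 * 70e9 / flops_peak      # ~0.14 ms

# Verifying k drafted tokens in the same pass multiplies the compute,
# but the weights are still read only once.
k = 8
t_compute_k = k * t_compute_1            # ~1.1 ms, still far below t_mem

print(f"memory-bound step time: {t_mem * 1e3:.1f} ms")
print(f"compute for 1 token:    {t_compute_1 * 1e3:.2f} ms")
print(f"compute for {k} tokens:   {t_compute_k * 1e3:.2f} ms")
# Either way the step is dominated by the weight read, so checking k tokens
# per target pass costs little extra -- the headroom spec decoding exploits.
```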
andblac•3mo ago
At first glance, this reminds me of how branch prediction is used in CPUs to speed up execution. As I understand it, this development is like a form of soft branch prediction over language trajectories: a small model predicts what the main model will do, runs a few steps ahead, and then the results are verified (and this can be done in parallel). If it checks out, you just jump forward; if not, you take a miss, but that's rare. I find it funny how small-big ideas like this come up in different contexts again and again in the history of our technological development. Of course, ideas as always are cheap. The hard part is how to actually use them and cash in on them.
red2awn•3mo ago
A lot of optimizations in LLMs now are low-hanging fruit inspired by techniques in classical computer science. Another one that comes to mind is paged KV caching, which is based on memory paging.
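As a toy version of that analogy (not vLLM's actual implementation): KV memory is carved into fixed-size blocks and each sequence keeps a small "page table" of block indices, so nothing has to be pre-reserved at its maximum length:

```python
BLOCK_TOKENS = 16                      # tokens per KV block ("page size")

class PagedKVCache:
    """Toy paged KV cache: a free list of fixed-size blocks plus a
    per-sequence block table, mirroring OS-style memory paging."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.block_table = {}          # seq_id -> list of block ids
        self.lengths = {}              # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_TOKENS == 0:      # current block full (or first token)
            if not self.free:
                raise MemoryError("KV cache exhausted; evict or preempt a sequence")
            self.block_table.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Sequence finished: its blocks go straight back to the pool."""
        self.free.extend(self.block_table.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4)
for _ in range(40):
    cache.append_token("req-1")        # 40 tokens -> ceil(40/16) = 3 blocks
print(cache.block_table)               # {'req-1': [3, 2, 1]}
cache.release("req-1")
print(len(cache.free))                 # 4: every block is reusable again
```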
LogicFailsMe•3mo ago
No barrier to entry whatsoever? Backprop on the speculative decoding weights during inference to improve their accuracy on a per application basis?

Cool hack though, kudos. Wonder if they can make Groq or Cerebras do the same thing?

necovek•3mo ago
So with a 4x speed-up, Together will give us at least 2x lower price for top-end models, right? :)
jsutton97•3mo ago
I can't help but wonder how much longer we'll see this work shared openly.
diamond559•3mo ago
Great, my slop memes can come out much faster now. This is the future of the world economy!
hazrmard•3mo ago
Do I understand this right?

A lightweight speculator model adapts to usage, keeping the acceptance rate of its drafts by the static heavyweight model within acceptable bounds.

Do they adapt with LoRAs?
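The blog post is the place to check for how ATLAS actually adapts; purely as a sketch of the general feedback idea (and not an answer to the LoRA question), an online system might track its recent acceptance rate and adjust how far the speculator runs ahead:

```python
from collections import deque

class DraftLengthController:
    """Illustrative only: tune how many tokens the draft model proposes based
    on the acceptance rate recently measured against the target model.
    (ATLAS also adapts the speculator itself to live traffic; whether that
    uses LoRA-style updates is not something this sketch answers.)"""
    def __init__(self, k=4, k_min=1, k_max=16, window=256):
        self.k = k
        self.k_min, self.k_max = k_min, k_max
        self.history = deque(maxlen=window)   # 1 = accepted token, 0 = rejected

    def record(self, accepted, proposed):
        self.history.extend([1] * accepted + [0] * (proposed - accepted))
        rate = sum(self.history) / len(self.history)
        # High acceptance -> speculate further; low acceptance -> back off.
        if rate > 0.8 and self.k < self.k_max:
            self.k += 1
        elif rate < 0.5 and self.k > self.k_min:
            self.k -= 1
        return self.k

ctl = DraftLengthController()
print(ctl.record(accepted=4, proposed=4))  # drafts matching well -> k grows to 5
print(ctl.record(accepted=0, proposed=6))  # heavy rejection -> rate 0.4, k back to 4
```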