frontpage.

Running GPT-OSS-120B at 500 tokens per second on Nvidia GPUs

https://www.baseten.co/blog/sota-performance-for-gpt-oss-120b-on-nvidia-gpus/
96•philipkiely•4h ago•25 comments

Claude Code IDE integration for Emacs

https://github.com/manzaltu/claude-code-ide.el
632•kgwgk•17h ago•208 comments

Rules by which a great empire may be reduced to a small one (1773)

https://founders.archives.gov/documents/Franklin/01-20-02-0213
128•freediver•7h ago•75 comments

We replaced passwords with something worse

https://blog.danielh.cc/blog/passwords
103•max__dev•4h ago•77 comments

Project Hyperion: Interstellar ship design competition

https://www.projecthyperion.org
210•codeulike•10h ago•155 comments

A candidate giant planet imaged in the habitable zone of α Cen A

https://arxiv.org/abs/2508.03814
51•pinewurst•4h ago•15 comments

Litestar is worth a look

https://www.b-list.org/weblog/2025/aug/06/litestar/
238•todsacerdoti•10h ago•58 comments

You know more Finnish than you think

https://dannybate.com/2025/08/03/you-know-more-finnish-than-you-think/
96•infinate•2d ago•59 comments

Show HN: Kitten TTS – 25MB CPU-Only, Open-Source TTS Model

https://github.com/KittenML/KittenTTS
816•divamgupta•1d ago•327 comments

Jules, our asynchronous coding agent

https://blog.google/technology/google-labs/jules-now-available/
268•meetpateltech•14h ago•177 comments

Mac history echoes in current Mac operating systems

http://tenfourfox.blogspot.com/2025/08/mac-history-echoes-in-mac-operating.html
96•classichasclass•4h ago•28 comments

We'd be better off with 9-bit bytes

https://pavpanchekha.com/blog/9bit.html
131•luu•11h ago•225 comments

Writing a Rust GPU kernel driver: a brief introduction on how GPU drivers work

https://www.collabora.com/news-and-blog/blog/2025/08/06/writing-a-rust-gpu-kernel-driver-a-brief-introduction-on-how-gpu-drivers-work/
242•losgehts•14h ago•31 comments

A fast, growable array with stable pointers in C

https://danielchasehooper.com/posts/segment_array/
165•ibobev•12h ago•61 comments

The Bluesky Dictionary

https://www.avibagla.com/blueskydictionary/
137•gaws•9h ago•46 comments

Show HN: Rust framework for advanced file recognition and identification

https://crates.io/crates/magical_rs
18•reimisdev•3h ago•4 comments

How ChatGPT spoiled my semester (2024)

https://benborgers.com/chatgpt-semester
47•edent•1h ago•14 comments

SQLite offline sync for Android quick start

https://github.com/sqliteai/sqlite-sync/tree/main/examples/android-integration
9•marcobambini•2d ago•3 comments

Researchers Uncover RCE Attack Chains in HashiCorp Vault and CyberArk Conjur

https://www.csoonline.com/article/4035274/researchers-uncover-rce-attack-chains-in-popular-enterprise-credential-vaults.html
5•GavCo•14m ago•0 comments

Herbie detects inaccurate expressions and finds more accurate replacements

https://herbie.uwplse.org/
3•bwidlar•3d ago•0 comments

FDA approves eye drops that fix near vision without glasses

https://newatlas.com/aging/age-related-near-sighted-drops-vizz/
53•geox•3h ago•28 comments

Multics

https://www.multicians.org/multics.html
107•unleaded•13h ago•23 comments

Compaq’s Rod Canion broke IBM's hold on the PC market

https://every.to/feeds/b0e329f3048258e8eeb7/the-man-who-beat-ibm
61•vinnyglennon•3d ago•22 comments

What is the average length of a queue of cars? (2023)

https://e-dorigatti.github.io/math/2023/11/01/queue-length.html
10•alexmolas•3d ago•2 comments

Comptime.ts: compile-time expressions for TypeScript

https://comptime.js.org/
115•excalo•3d ago•24 comments

Out-Fibbing CPython with the Plush Interpreter

https://pointersgonewild.com/2025-08-06-out-fibbing-cpython-with-the-plush-interpreter/
29•Bogdanp•7h ago•0 comments

Automerge 3.0

https://automerge.org/blog/automerge-3/
279•surprisetalk•3d ago•24 comments

Breaking the sorting barrier for directed single-source shortest paths

https://www.quantamagazine.org/new-method-is-the-fastest-way-to-find-the-best-routes-20250806/
141•baruchel•15h ago•44 comments

Rethinking DOM from first principles

https://acko.net/blog/html-is-dead-long-live-html/
206•puzzlingcaptcha•23h ago•193 comments

Zig Error Patterns

https://glfmn.io/posts/zig-error-patterns/
135•Bogdanp•15h ago•36 comments

Running GPT-OSS-120B at 500 tokens per second on Nvidia GPUs

https://www.baseten.co/blog/sota-performance-for-gpt-oss-120b-on-nvidia-gpus/
94•philipkiely•4h ago

Comments

tmshapland•1h ago
Such a fascinating read. I didn't realize how much massaging needed to be done to get the models to perform well. I just sort of assumed they worked out of the box.
acters•44m ago
Personally, I think the bigger companies should be more proactive and work with the popular inference engine devs to get their special snowflake LLMs working before release. I guess it is all very much experimental at the end of the day. Those devs are doing God's work so the rest of us can run these models on budget-friendly hardware.
magicalhippo•1h ago
Maybe I'm especially daft this morning, but I don't get the point of speculative decoding.

How does the target model validate the draft tokens without running the inference as normal?

Because if it is doing just that, I don't get the point: you can't trust the draft tokens before they are validated, so you're still stuck waiting for the target model.

joliu•1h ago
It does run inference, but on the batch of tokens that were drafted, akin to the prefill phase.

So your draft model can decode N new tokens, then the real model does one inference pass to score the N new drafted tokens.

Prefill is computation bound whereas decode is bandwidth bound, so in practice doing one prefill over N tokens is cheaper than doing N decode passes.

furyofantares•1h ago
Not an expert, but here's how I understand it. You know how input tokens are cheaper than output tokens? It's related to that.

Say the model so far has "The capital of France". The small model generates "is Paris.", which let's say is 5 tokens.

You feed the large model "The capital of France is Paris." to validate all 5 of those tokens in a single forward pass.
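
To make that concrete, here is a minimal Python sketch of the draft-then-verify loop (the greedy-acceptance variant). draft_model and target_model are hypothetical stand-ins for "one forward pass, take the argmax token"; this illustrates the general technique, not Baseten's actual implementation:

    def speculative_step(target_model, draft_model, context, k=5):
        # 1) Draft k tokens cheaply with the small model (serial, but fast).
        draft = []
        for _ in range(k):
            draft.append(draft_model(context + draft))

        # 2) Score every drafted position with the big model. In a real engine
        #    this is ONE batched forward pass (like prefill); the loop here is
        #    only to keep the sketch dependency-free.
        target_preds = [target_model(context + draft[:i]) for i in range(k + 1)]

        # 3) Accept draft tokens until the first disagreement, then take the
        #    big model's token at that position. We always gain at least one token.
        accepted = []
        for i, tok in enumerate(draft):
            if tok == target_preds[i]:
                accepted.append(tok)
            else:
                accepted.append(target_preds[i])
                break
        else:
            accepted.append(target_preds[k])  # bonus token if every draft matched
        return context + accepted

If the draft proposes ["is", " Paris", "."] and the big model agrees with all of them, you emit several tokens for the price of one big-model pass; if it disagrees somewhere, you still keep everything up to the first mismatch.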

ahmedfromtunis•1h ago
But what would happen if the small model's prediction was "is Rome."? Wouldn't that result in costlier inference if the small model is "wrong" more often than it is correct?

Also, if the small model were sufficiently more often "correct" than "wrong", wouldn't it be more efficient to get rid of the large model at that point?

cwyers•47m ago
So, the way speculative decoding works, the target model takes over at the first wrong draft token, so you still get 'is' for free.
acters•37m ago
I believe that is exactly the downside of speculative decoding, which is why it is very important to size the two models properly relative to each other: the small one should be big enough to be mostly correct while also being dramatically faster than the large one, and the large one has to be fast enough that catching mistakes doesn't introduce too many random delays. Also, if the small one is incorrect, having the larger one correct the mistake is miles better than leaving the incorrect output in.

It is about keeping the larger model's quality while getting the smaller model's speed most of the time. The tradeoff is that you consume more memory by having two models loaded instead of just one.

If memory were the only concern, it would make more sense to just run the smaller model.

acters•27m ago
Another caveat with this method is that the larger and smaller models need to behave very similarly, because a lot of the savings come from the draft model generating the necessary fluff around each detail, such as grammar, formatting, and the connecting words in between.

Unsurprisingly, gpt-oss ships larger and smaller models that behave very similarly! The two are close enough that getting a few draft tokens wrong won't slow things down to the speed of running the larger model alone, which is the worst case with this setup. We want the speed of the smaller model as much of the time as possible. That is all.

isoprophlex•1h ago
but... do you get any validation during the forward pass? the small model could just as well have generated "is Berlin." or whatever. do these models somehow give you a likelihood for the next token when you're prefilling, that you can compare against? if so why not just... use that always?

or is this a scenario where computation is expensive but validation is cheap?

cristoperb•1h ago
My simplified understanding: The target model can validate the draft tokens all at once, in a single forward pass. The output of that forward pass is a list of probabilities for each draft token, which are compared to the probabilities produced by the draft model. If the target model's probabilities are the same or greater than the draft model's, the tokens are accepted. Worst case, none of the draft tokens are accepted and the target model instead selects the single next token as usual.
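
For reference, the textbook acceptance rule (Leviathan et al. 2023) is probabilistic rather than a straight greater-or-equal comparison. The sketch below is that general rule in plain Python, with token-to-probability dicts standing in for real logits; it is not necessarily exactly what this particular inference stack does:

    import random

    def accept_drafts(p_target, q_draft, draft_tokens):
        # p_target[i] / q_draft[i]: dicts mapping token -> probability at position i,
        # from the target's single verification pass and from the draft model.
        out = []
        for i, tok in enumerate(draft_tokens):
            p = p_target[i].get(tok, 0.0)
            q = max(q_draft[i].get(tok, 0.0), 1e-12)
            if random.random() < min(1.0, p / q):
                out.append(tok)  # target is at least as confident: keep the draft token
                continue
            # Rejected: resample from the leftover distribution max(0, p - q), which
            # keeps the final output distributed exactly like the target model alone.
            residual = {t: max(0.0, p_target[i].get(t, 0.0) - q_draft[i].get(t, 0.0))
                        for t in p_target[i]}
            total = sum(residual.values()) or 1.0
            r, acc = random.random() * total, 0.0
            for t, w in residual.items():
                acc += w
                if acc >= r:
                    out.append(t)
                    break
            return out  # stop at the first rejection
        return out
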
robrenaud•56m ago
I think your core misunderstanding is that you are assuming K calls that each generate 1 token are as expensive as 1 call that processes K tokens. It is actually much more expensive to generate serially than even in small batches.
porridgeraisin•46m ago
Let's say I want to run f2(f1(x)) where f1 and f2 are both a single pass through GPT4.

This takes 2 seconds time, assuming 1 second for every pass.

What I instead do is kick off f1(x) in another thread, and then run f2(g1(x)) where g1 is one pass through GPT-nano.

This takes 1 + 0.1 seconds, assuming gpt nano takes 0.1s for every pass. In this 1.1 seconds, the f1(x) that we kicked off in the 2nd thread would have finished (it takes 1 second).

So in 1.1 seconds we have both f1(x) and f2(g1(x)) available to us, and we store the intermediate g1(x) as well.

We compare g1(x) and f1(x)

If they were equal, i.e. g1(x) = f1(x), then we have our answer f2(g1(x)) in just 1.1s.

If they were not, we compute f2(output of f1(x) from 2nd thread) which takes 1 further second, bringing our total to 2.1s.

If the small model is equalling the big model in say 2/3 of cases, you will spend 2/3 * 1.1 + 1/3 * 2.1 = 1.433s on average for this computation. Without speculative decoding, it is always 2s.
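
The arithmetic from that comment, written out so the break-even behaviour is visible (same assumed latencies: 1 s per big-model pass, 0.1 s per small-model pass):

    def expected_latency(match_rate, big=1.0, small=0.1):
        fast_path = big + small      # draft agrees: f1(x) overlapped with g1 + f2 -> 1.1 s
        slow_path = 2 * big + small  # draft was wrong: pay one extra big-model pass -> 2.1 s
        return match_rate * fast_path + (1 - match_rate) * slow_path

    print(expected_latency(2 / 3))  # ~1.43 s, vs a flat 2.0 s without speculation
    print(expected_latency(1.0))    # 1.1 s best case
    print(expected_latency(0.0))    # 2.1 s worst case: slightly slower than no speculation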

arkmm•43m ago
This is a really great explanation.
magicalhippo•36m ago
Thanks, very nice explanation, that makes perfect sense. I guess their graphics confused me for some reason and had me thinking all wrong.

Now I see they tried to point out the obvious thing which is to predict multiple tokens ahead, not just two as in your example.

bhaney•11m ago
> How does the target model validate the draft tokens without running the inference as normal?

It does run the inference as normal, just in parallel with the other inferences

> if it is doing just that, I don't get the point

Running inferences in parallel allows you to read the model weights out of memory only once for N parallel inferences, as opposed to reading them out of memory N times for N serial inferences. Inference is massively bottlenecked by memory bandwidth to the tune of one or two orders of magnitude compared to compute, so this helps a lot.
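
Rough numbers to illustrate the point (illustrative, not from the article): single-stream decode speed is roughly "how fast can you stream the weights through the GPU", so verifying a handful of drafted tokens in one pass costs about the same as generating one token:

    weights_gb   = 64      # roughly the size of gpt-oss-120b's weights, per this thread
    bandwidth_gb = 3350    # ballpark HBM bandwidth of a modern datacenter GPU, in GB/s

    per_pass = weights_gb / bandwidth_gb          # time to read the weights once, ~19 ms
    print(f"serial decode ceiling: ~{1 / per_pass:.0f} tokens/s")

    k = 5                                         # drafted tokens verified per big-model pass
    print(f"{k} serial passes: ~{k * per_pass * 1000:.0f} ms of weight reads")
    print(f"1 verification pass over {k} drafted tokens: ~{per_pass * 1000:.0f} ms")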

littlestymaar•7m ago
> Inference is massively bottlenecked by memory bandwidth to the tune of one or two orders of magnitude compared to compute, so this helps a lot.

Nitpick: it's only bottlenecked by memory bandwidth if the batch size is too low (that is: if you don't have many users calling the same model in parallel).

Speculative decoding is just a way of running a single query as if it was parallel queries.
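
A back-of-the-envelope version of that nitpick: for a single weight matrix, compute time grows with batch size while the weight-read time doesn't, so there is a crossover batch size below which you are memory-bound. The spec numbers below are round, illustrative values for an H100-class GPU, not measurements:

    peak_flops  = 1.0e15   # ~1 PFLOP/s dense BF16 (illustrative)
    mem_bw      = 3.35e12  # ~3.35 TB/s HBM (illustrative)
    bytes_per_w = 2        # BF16 weights

    # For N parameters and batch size B:
    #   compute time ~ 2*N*B / peak_flops, weight-read time ~ N*bytes_per_w / mem_bw.
    # They cross where B ~ peak_flops * bytes_per_w / (2 * mem_bw):
    b_star = peak_flops * bytes_per_w / (2 * mem_bw)
    print(f"memory-bound below roughly B ~ {b_star:.0f} tokens per forward pass")
    # A lone chat session decodes at B = 1; speculative decoding raises the big
    # model's effective B without needing more users.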

modeless•1h ago
What's the best speed people have gotten on 4090s?
ActorNightly•56m ago
You can't fit the model into a 4090 without quantization; it's like 64 gigs.

For home use, Gemma 27B QAT is king. It's almost as good as DeepSeek R1.
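
A quick sanity check on the "64 gigs" figure versus a 4090's 24 GB. The parameter count and the ~4-bit MXFP4 format below are rough public figures for gpt-oss-120b, so treat this as an estimate:

    params         = 117e9   # ~117B total parameters (MoE; only ~5B active per token)
    bits_per_param = 4.25    # MXFP4-ish average, including scaling factors

    weights_gb = params * bits_per_param / 8 / 1e9
    print(f"weights alone: ~{weights_gb:.0f} GB")   # ~62 GB, before KV cache and runtime overhead
    print("RTX 4090 VRAM: 24 GB -> the 120B weights alone don't fit")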

modeless•37m ago
The 20B one fits.
asabla•51m ago
I'm on a 5090, so it's not an apples-to-apples comparison, but I'm getting ~150 t/s for the 20B version with a ~16,000-token context.
modeless•36m ago
Cool, what software?
steinvakt2•9m ago
And flash attention doesn't work on the 5090 yet, right? So currently a 4090 is probably faster, or?
littlestymaar•45m ago
Very fast “Sorry I can't help with that” generator.