Running GPT-OSS-120B at 500 tokens per second on Nvidia GPUs

https://www.baseten.co/blog/sota-performance-for-gpt-oss-120b-on-nvidia-gpus/
57•philipkiely•3h ago

Comments

tmshapland•1h ago
Such a fascinating read. I didn't realize how much massaging needed to be done to get the models to perform well. I just sort of assumed they worked out of the box.
davepeck•45m ago
This is a good read, as is the linked-to deeper dive on Baseten’s inference and infrastructure stacks: https://www.baseten.co/resources/guide/the-baseten-inference...
magicalhippo•40m ago
Maybe I'm especially daft this morning but I don't get the point of the speculative decoding.

How does the target model validate the draft tokens without running the inference as normal?

Because if it is doing just that, I don't get the point as you can't trust the draft tokens before they are validated, so you're still stuck waiting for the target model.

joliu•28m ago
It does run inference, but on the batch of tokens that were drafted, akin to the prefill phase.

So your draft model can decode N new tokens, then the real model does one inference pass to score the N new drafted tokens.

Prefill is computation bound whereas decode is bandwidth bound, so in practice doing one prefill over N tokens is cheaper than doing N decode passes.
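A minimal sketch of one such speculative step in the greedy-acceptance style, with toy stand-in models (nothing here is Baseten's actual stack; the shapes and "models" are made up for illustration):

  import numpy as np

  VOCAB, DIM, K = 1000, 64, 4
  rng = np.random.default_rng(0)
  E = rng.standard_normal((VOCAB, DIM))                # shared toy embeddings
  W_draft = rng.standard_normal((DIM, VOCAB)) * 0.01   # toy "small" model
  W_target = rng.standard_normal((DIM, VOCAB)) * 0.01  # toy "large" model

  def logits(W, tokens):
      # One "forward pass": all positions are scored in a single matmul,
      # so scoring K drafted tokens costs about one pass, not K passes.
      return E[tokens] @ W

  def speculative_step(prefix):
      # 1. The draft model proposes K tokens autoregressively (cheap).
      seq = list(prefix)
      for _ in range(K):
          seq.append(int(np.argmax(logits(W_draft, seq)[-1])))
      proposed = seq[len(prefix):]
      # 2. The target model scores every proposed token in ONE pass.
      tgt_next = np.argmax(logits(W_target, seq[:-1]), axis=-1)
      # 3. Keep the longest prefix the target agrees with (greedy variant);
      #    at the first disagreement, substitute the target's own token.
      out = []
      for i, tok in enumerate(proposed):
          want = int(tgt_next[len(prefix) - 1 + i])
          out.append(tok if tok == want else want)
          if tok != want:
              break
      return out

  print(speculative_step([1, 2, 3]))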

furyofantares•20m ago
Not an expert, but here's how I understand it. You know how input tokens are cheaper than output tokens? It's related to that.

Say the model so far has "The capital of France". The small model generates "is Paris.", which let's say is 5 tokens.

You feed the large model "The capital of France is Paris." to validate all 5 of those tokens in a single forward pass.
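Spelled out with that example (a toy illustration treating each word as one token; the target model's predictions below are made-up values, not real model output):

  prefix = ["The", "capital", "of", "France"]
  draft  = ["is", "Paris", "."]
  # Hypothetical next-token predictions from the large model's single
  # forward pass over prefix+draft:
  target_next = {"France": "is", "is": "Paris", "Paris": "."}
  seq = prefix + draft
  ok = all(target_next[seq[i - 1]] == tok
           for i, tok in enumerate(draft, start=len(prefix)))
  print(ok)  # True -> every draft token validated from one pass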

ahmedfromtunis•12m ago
But what would happen if the small model's prediction was "is Rome."? Wouldn't that result in costlier inference if the small model is "wrong" more often than it is correct?

Also, if the small model were sufficiently more often "correct" than "wrong", wouldn't it be more efficient to get rid of the large model at this point?

isoprophlex•9m ago
but... do you get any validation during the forward pass? the small model could just as well have generated "is Berlin." or whatever. do these models somehow give you a likelihood for the next token when you're prefilling, that you can compare against? if so why not just... use that always?

or is this a scenario where computation is expensive but validation is cheap?

cristoperb•10m ago
My simplified understanding: The target model can validate the draft tokens all at once, in a single forward pass. The output of that forward pass is a list of probabilities for each draft token, which are compared to the probabilities produced by the draft model. If the target model's probability for a token is the same as or greater than the draft model's, the token is accepted. Worst case none of the draft tokens are accepted and instead the target model selects the single next token as usual.
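For reference, the acceptance rule in the speculative sampling papers is slightly more precise than "same or greater": a draft token is always kept when the target's probability p is at least the draft's q, and otherwise kept with probability p/q, which makes the output distribution exactly match the target model. A sketch:

  import numpy as np

  rng = np.random.default_rng()

  def accept(p_target, p_draft):
      # Standard speculative-sampling rule: keep the draft token with
      # probability min(1, p/q). If p >= q it is always kept; otherwise
      # it survives with probability p/q, so the final output is
      # distributed exactly as if the target model had sampled alone.
      return rng.random() < min(1.0, p_target / p_draft)

  # On a rejection, the target resamples that position from the
  # renormalized leftover distribution max(0, p - q), and the step ends.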
robrenaud•1m ago
I think your core misunderstanding is that you are assuming K calls to generate 1 token each are as expensive as 1 call to generate K tokens. It is actually much more expensive to generate serially than even in small batches.
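A quick way to see that asymmetry (a toy numpy timing on CPU, standing in for the memory-bandwidth-bound decode path; the effect is far larger on a GPU):

  import time
  import numpy as np

  D, K = 4096, 8
  W = np.random.standard_normal((D, D)).astype(np.float32)   # stand-in weights
  xs = np.random.standard_normal((K, D)).astype(np.float32)  # K token activations

  t0 = time.perf_counter()
  for x in xs:   # K serial "decode" steps: W is re-read from memory each time
      _ = x @ W
  t1 = time.perf_counter()
  _ = xs @ W     # one batched pass over all K tokens: W is read once
  t2 = time.perf_counter()
  print(f"serial {t1 - t0:.4f}s vs batched {t2 - t1:.4f}s")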
modeless•15m ago
What's the best speed people have gotten on 4090s?
ActorNightly•1m ago
You can't fit the model into a 4090 without quantization; it's like 64 gigs.
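Rough arithmetic behind that figure, assuming ~117B total parameters stored mostly in MXFP4 at ~4.25 bits per parameter (both numbers are assumptions, not from the thread):

  params = 117e9         # approx. total parameter count (assumption)
  bits_per_param = 4.25  # MXFP4: 4-bit values plus per-block scales (assumption)
  print(f"~{params * bits_per_param / 8 / 1e9:.0f} GB of weights "
        f"vs 24 GB of VRAM on one RTX 4090")  # -> ~62 GB, before KV cache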

"OpenAI Harmony Response Format": standardize CoT+tools via <|specialtokens|>

https://cookbook.openai.com/articles/openai-harmony
1•almatia•1m ago•0 comments

Postgres Logical Replication Slot Invalidations in 16.9 and 17.5

1•saisrirampur•2m ago•0 comments

Show HN: Pinterest Videos Downloader

https://chromewebstore.google.com/detail/pinterest-videos-download/ocpbpkfophcggcebcdacallifdnlmbai
1•qwikhost•5m ago•0 comments

Arab States Call for Hamas to Disarm Amid Push for a Palestinian State

https://nytimes.com/2025/07/31/world/middleeast/hamas-arab-states-palestinians.html
1•thomassmith65•5m ago•0 comments

Tennessee school book bans include Calvin and Hobbes and The Magic Tree House

https://www.bklmy.com/archives/30187
1•caned•8m ago•0 comments

Will Future Civilizations Bother to Excavate Our Remains?

https://www.palladiummag.com/2025/07/08/will-future-civilizations-bother-to-excavate-our-remains/
1•MrBuddyCasino•9m ago•0 comments

Maruko: Write SwiftUI iOS Apps on Your Phone with No-Code Magic

https://apps.apple.com/us/app/maruko-craft-your-apps/id6470918527
1•anyinfa•11m ago•0 comments

Show HN: Don't settle for a waitlist. Create a VIP list that drives conversions

https://www.vipli.st/
1•doppelgunner•14m ago•0 comments

Show HN: Complexipy, calculate the cognitive complexity of Python

https://github.com/rohaquinlop/complexipy
2•rohaquinlop•15m ago•0 comments

GPT-5 model descriptions accidentally leaked on GitHub

https://twitter.com/ns123abc/status/1953318288286519676
5•codergautam•17m ago•1 comments

RIP to the Macintosh HD hard drive icon, 2000–2025

https://arstechnica.com/gadgets/2025/08/rip-to-the-macintosh-hd-hard-drive-icon-2000-2025/
2•xrayarx•25m ago•1 comments

Viral TikTok Challenge Leaves 9-Year-Old with Burns: Police

https://www.msn.com/en-us/news/crime/viral-tiktok-challenge-leaves-9-year-old-with-severe-burns-police/ar-AA1JYCIK
2•josephcsible•27m ago•0 comments

Writing Your Own Simple Tab-Completions for Bash and Zsh

https://mill-build.org/blog/14-bash-zsh-completion.html
3•lihaoyi•28m ago•0 comments

Waymos of Loving Grace

https://www.kvncnnlly.com/2025-08-02-waymos-of-loving-grace/
4•wintercarver•36m ago•0 comments

Bifrost – LLM gateway 90x faster than LiteLLM at p99

https://github.com/maximhq/bifrost
4•havercosine•39m ago•1 comments

How ChatGPT spoiled my semester (2024)

https://benborgers.com/chatgpt-semester
33•edent•44m ago•6 comments

Tech company reaches gender quotas by replacing half the workforce with female AI assistants

https://www.betootaadvocate.com/uncategorized/tech-company-reaches-gender-quotas-by-replacing-half-the-workforce-with-female-ai-assistants/
1•tjmc•47m ago•0 comments

Ask HN: What's the best career move you made in tech–and why?

2•karma_7•50m ago•4 comments

Wary of sticker shock, retailers clash with brands on price hikes

https://www.reuters.com/business/retail-consumer/wary-sticker-shock-retailers-clash-with-brands-price-hikes-2025-08-07/
1•petethomas•52m ago•0 comments

Elvis is alive: How 'AI' stunts modern mythmaking

https://bsdly.blogspot.com/2025/08/elvis-is-alive-how-ai-stunts-modern.html
1•peter_hansteen•52m ago•0 comments

Onion-Lang

https://github.com/sjrsjz/onion-lang
2•todsacerdoti•52m ago•0 comments

Apple hit by string of departures in AI talent war

https://www.ft.com/content/6b9ce8ce-a327-40c1-a8a1-579c2727fc60
2•mfiguiere•53m ago•0 comments

The rise of couples location sharing

https://www.theguardian.com/lifeandstyle/2025/jul/24/inside-the-rise-of-couple-location-sharing
2•bryanrasmussen•1h ago•1 comments

Official Reserve Revaluations: The International Experience

https://www.federalreserve.gov/econres/notes/feds-notes/official-reserve-revaluations-the-international-experience-20250801.html
2•palmfacehn•1h ago•0 comments

Actual LLM agents are coming

https://pleias.fr/blog/blogactual-llm-agents-are-coming
5•whoami_nr•1h ago•1 comments

Fun Command-Line Tricks You Should Try

https://www.nxgntools.com/blog/5-fun-and-handy-curl-command-line-tricks-you-should-try
1•doppelgunner•1h ago•1 comments

New Gemini app tools to help students learn, understand and study better

https://blog.google/products/gemini/new-gemini-tools-students-august-2025/
4•from_neverland•1h ago•1 comments

Sleep Ledger

https://domofutu.substack.com/p/sleep-ledger
1•wjb3•1h ago•0 comments

Your LLM Does Not Care About MCP

https://hackteam.io/blog/your-llm-does-not-care-about-mcp/
2•gethackteam•1h ago•1 comments

AI in production: reflecting on one year, five projects and factories deployed

https://medium.com/oss-ventures/ai-in-production-reflecting-on-one-year-five-projects-and-dozens-of-factories-deployed-582e627d6cec
1•philberto•1h ago•0 comments