frontpage.

I'm Eric Ries, author of "The Lean Startup" and new book "Incorruptible" – AMA

287•eries•3h ago•226 comments

How JPL keeps the 13-year-old Curiosity rover doing science

https://spectrum.ieee.org/curiosity-rover-jpl-mars-science
35•pseudolus•57m ago•0 comments

Claude Desktop spins up a VM with no way of stopping it

https://github.com/anthropics/claude-code/issues/29045
46•tonyrice•1h ago•28 comments

PgDog is funded and coming to a database near you

https://pgdog.dev/blog/our-funding-announcement
239•levkk•4h ago•115 comments

Why SpaceX 2040 Revenue FCST $4.3T is highly unlikely

https://www.matteast.io/spacex-escape-velocity.html
24•meast•43m ago•3 comments

GitHub Authentication issues related to API requests

https://www.githubstatus.com/incidents/fcj3088jg1wx
87•Multicomp•2h ago•22 comments

Building an HTML-first site doubled our users overnight

https://mohkohn.co.uk/writing/html-first/
761•edent•5h ago•354 comments

Providers, not insurers, are responsible for excess U.S. health care cost (2024)

https://www.noahpinion.blog/p/insurers-arent-the-main-villain-of
20•paulpauper•50m ago•11 comments

Mercedes‑Benz starts large‑scale production of electric axial flux motor

https://media.mercedes-benz.com/en/article/bebac2af-acdc-465a-9538-adb0bf3d8ccf
430•raffael_de•10h ago•259 comments

Apache Burr: Build reliable AI agents and applications

https://burr.apache.org/
100•anhldbk•3h ago•65 comments

The Dynamo and the Computer: The Modern Productivity Paradox (1989) [pdf]

https://www.almendron.com/tribuna/wp-content/uploads/2018/03/the-dynamo-and-the-computer-an-histo...
10•simonpure•40m ago•0 comments

All 9,300 Japanese train stations, animated by the year they opened (1872–2026)

https://jivx.com/eki
138•momentmaker•6h ago•46 comments

DiffusionGemma: 4x Faster Text Generation

https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-gen...
137•meetpateltech•2h ago•30 comments

Show HN: HelixDB – A graph database built on object storage

https://github.com/HelixDB/helix-db/tree/main
19•GeorgeCurtis•2h ago•14 comments

Buy a train, bridge or tracks from the Swiss Railway

https://sbbresale.ch/
136•kisamoto•2d ago•67 comments

Show HN: Extend UI – open-source UI kit for modern document apps

https://www.extend.ai/ui
7•kbyatnal•2h ago•0 comments

Who's the Smartest Corvid?

https://thetyee.ca/Culture/2026/06/05/Whos-the-Smartest-Corvid/
13•NaOH•1d ago•4 comments

A €0.01 bank transfer could compromise a banking AI agent

https://blue41.com/blog/how-we-helped-bunq-secure-their-financial-ai-assistant/
86•tvissers•4h ago•70 comments

Babel-USB: USB drive with every file

https://github.com/p2r3/babel-usb
7•LorenDB•1h ago•3 comments

Anatomy of a high-performance EP kernel

https://fergusfinn.com/blog/anatomy-of-a-high-performance-ep-kernel/
8•kkm•2h ago•1 comment

'They take you out of life, out of time': a journey into Spain's cave paintings

https://www.theguardian.com/science/2026/jun/02/journey-into-spain-palaeolithic-cave-paintings-al...
42•NaOH•2d ago•19 comments

Who Runs Your Rust Future? Hands-On Intro to Async Rust

https://aibodh.com/posts/async-rust-chapter-1-hands-on-intro-to-async-rust/
74•febin•2d ago•12 comments

The Case for Free Online Books (2014)

http://from-a-to-remzi.blogspot.com/2014/01/the-case-for-free-online-books-fobs.html
62•jimsojim•1h ago•58 comments

The Last Evolution, by John W Campbell Jr. (1932)

https://www.gutenberg.org/files/27462/27462-h/27462-h.htm
12•cf100clunk•2h ago•0 comments

Reviving Papers with Code

https://paperswithcode.co/
169•nielz_r•2d ago•38 comments

Smudging the game disc to make speedrunning 'SpongeBob' faster

https://www.inverse.com/input/gaming/the-dirty-secret-that-makes-speedrunning-on-spongebob-a-lot-...
36•pncnmnp•16h ago•22 comments

Ask HN: Are most corporate SWE jobs performative?

129•hnthrow10282910•5h ago•149 comments

The iPad was on Tailscale: a WebRTC debugging story

https://p2claw.com/blog/2026-06-09-the-ipad-was-on-tailscale/
32•syllogistic•3h ago•15 comments

macOS Container Machines

https://github.com/apple/container/blob/main/docs/container-machine.md
1121•timsneath•17h ago•389 comments

AWS Bedrock to require sharing data with Anthropic for Mythos and future models

361•TomAnthony•10h ago•214 comments

DiffusionGemma: 4x Faster Text Generation

https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/
137•meetpateltech•2h ago

Comments

minimaxir•2h ago
A few days ago I was just thinking that Google never talked about their diffusion text generation model after demoing it at I/O a year ago. The rumor is that it was too expensive to run, but with the provided chart using the same 1x H100 hardware and comparing DiffusionGemma to regular Gemma, that shouldn't be the case. I'm curious what the downside for this speed is here aside from being slightly weaker than Gemma.
ac29•1h ago
> I'm curious what the downside for this speed is here

"DiffusionGemma's speedup is designed for local and low-concurrency inference. In high-QPS cloud serving, autoregressive models can be deployed to saturate compute efficiently, so DiffusionGemma's parallel decoding offers diminishing returns and can result in higher serving costs"

GaggiX•1h ago
Well, with a standard autoregressive model you can generate, for example, 256 tokens at once if you have 256 users. With this approach you can generate 256 tokens for a single user, but you need several forward steps.

So the diffusion process takes more GFLOPs; if you have enough users, you can already balance memory and compute.

minimaxir•54m ago
Batching is a fair counterpoint.
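One rough way to see the batching trade-off in this exchange is to count two things: sequential forward passes (a latency proxy) and token positions pushed through the model (a FLOPs proxy). This is a minimal sketch, not DiffusionGemma's actual schedule: the 256-token block size comes from the announcement, but the 32 denoising steps per block and the assumption that every step reprocesses the whole block are illustrative guesses.

```python
import math
from dataclasses import dataclass


@dataclass
class Cost:
    forward_passes: int   # sequential model invocations (latency proxy)
    token_positions: int  # positions pushed through the weights (FLOPs proxy)


def autoregressive_cost(users: int, tokens_per_user: int) -> Cost:
    # One batched pass emits one new token for every user; with a KV cache,
    # each pass only runs the new position per user through the weights.
    return Cost(forward_passes=tokens_per_user,
                token_positions=users * tokens_per_user)


def diffusion_cost(users: int, tokens_per_user: int,
                   block_size: int, steps_per_block: int) -> Cost:
    # Assumed block diffusion: each block of block_size positions is refined
    # for steps_per_block passes, and every pass reprocesses the whole block.
    # Users are batched here too, so passes do not grow with the user count.
    blocks = math.ceil(tokens_per_user / block_size)
    passes = blocks * steps_per_block
    return Cost(forward_passes=passes,
                token_positions=users * passes * block_size)


if __name__ == "__main__":
    for users in (1, 256):
        print(users, "users | AR:", autoregressive_cost(users, 256),
              "| diffusion:", diffusion_cost(users, 256, 256, 32))
```

With these assumed numbers a single user gets 32 sequential passes instead of 256, but every token position is processed 32 times. At 256 users the autoregressive server already gives each pass plenty of work, and once passes are compute bound the extra 32x work shows up directly as serving cost.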
rvz•2h ago
We need more local open-weight models that are performant and just as good as (or good enough compared to) the best frontier ones.

Then you will be able to achieve Jevons Paradox and enjoy the same “productivity gains” without paying the extortionate token prices of closed-model providers, or at least pay as little as possible.

And especially, no silent nerfing of the model.

_fw•1h ago
We have this though, right? Compare SOTA local models to where the frontier was last year. There weren't many people complaining that last year's frontier models were incapable.

Next year, and the year after, Fable, GPT 5.5 and Gemini 3.5 will feel quite ordinary. And perhaps even within reach of a prosumer running models locally.

beklein•1h ago
A good visual explanation of how text diffusion models like DiffusionGemma work: https://newsletter.maartengrootendorst.com/p/a-visual-guide-...
kkukshtel•1h ago
I think this is the future. The sort of left-field rumble that turns into a quake in 5 years.
lambda•1h ago
This may be the future of local models.

The thing is, diffusion models perform somewhat worse than autoregressive ones on text. So you lose some performance.

Speed is the big advantage. Autoregressive decoding in local inference is mostly memory bound: you generate one token at a time, and for each token you need to load all the weights. MTP helps a bit by letting you draft tokens with a smaller model and then verify them in parallel with the larger model, so you do several computations per memory load. But because you're still producing tokens sequentially and have to discard rejected drafts, you can only get so much speedup.

For hosted models, however, you can batch many token generations together, fully utilizing all of the compute while no longer being bottlenecked on memory bandwidth. So they are already operating at close to max efficiency.

So diffusion largely loses its benefit for hosted models. Sure, maybe you could pay more for slightly lower-latency responses by doing diffusion for one user at a time instead of autoregressive decoding for many in parallel. But given that it also reduces accuracy, it's hard to see where you'd really want that. Unless they can bring it up to par with autoregressive models, it looks like a bit of a dead end outside of local models, where you're generally doing just one thing at a time.
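A simple roofline-style estimate shows why batching lets hosted models escape the memory bottleneck described above. All numbers here are illustrative round figures, not measurements of any specific GPU: an 8B-parameter dense model in 16-bit weights, 2 TB/s of memory bandwidth, 1 PFLOP/s of usable compute, about 2 FLOPs per parameter per token, and KV-cache traffic ignored.

```python
PARAMS = 8e9          # assumed dense model size
BYTES_PER_PARAM = 2   # assumed 16-bit weights
BANDWIDTH = 2e12      # assumed memory bandwidth, bytes/s
COMPUTE = 1e15        # assumed usable compute, FLOP/s


def step_time(batch: int) -> float:
    # One decode step streams all weights once, shared by the whole batch...
    memory_s = PARAMS * BYTES_PER_PARAM / BANDWIDTH
    # ...while matmul work grows with the batch (~2 FLOPs per param per token).
    compute_s = 2 * PARAMS * batch / COMPUTE
    # Roofline: the slower of the two bounds sets the step time.
    return max(memory_s, compute_s)


def tokens_per_second(batch: int) -> float:
    return batch / step_time(batch)


if __name__ == "__main__":
    for b in (1, 16, 256, 1000, 4000):
        print(f"batch={b:5d}  tokens/s={tokens_per_second(b):10.0f}")
```

Under these assumptions throughput grows linearly with batch size until roughly 500 concurrent sequences, where the compute term overtakes the weight-streaming term and throughput plateaus. A single local user sits at the far memory-bound end of that curve.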

horsawlarway•21m ago
I'm particularly curious to know how this plays out, and I seriously hope that more labs focus on diffusion models for text usage.

My immediate thought - this performs slightly worse than the autoregressive gemma equivalent, but it may also let me functionally run better models in diffusion variants.

Ex - I can run 70b-120b autoregressive models locally right now, but I get ~5-15t/s, which just isn't fast enough for serious work.

That caps me at the 20-36b models (ex - gemma4), where I can get 100+t/s on the same hardware.

So the question becomes - does the quality drop from a diffusion model outweigh the quality bump from using a larger model?

Because if not... sounds like diffusion models have a lot of space to thrive.

---

Sadly - if they can't be hosted profitably, I question whether this space will actually be explored.

vineyardmike•1h ago
I recently switched to OpenCode to try out many of the models from non-US frontier labs. My unexpected favorite was Mercury (a diffusion model). Not because it was “smart”, but because it was stupid fast. It was more of a pair-programming experience than the SOTA agentic experience of prompting and waiting. Honestly, it was also way more fun and brought back some of the pre-AI coding experience while still getting some of the benefits of AI. It felt less like a slot machine where you prompt, wait, and hope it went in the right direction. It even made me use tiny models like Gemini Flash Lite and GPT Mini/Nano more.

Anyway, I'm so excited for an open-weight model, and I hope it performs well. I’ll be testing this ASAP.

onlyrealcuzzo•22m ago
If you can run your tests quickly and cheaply, and you have metrics that flag bad or sloppy code and are cheap and fast to compute, a worse but fast model can outperform a far better but far slower model, if you value your time...

I've had pretty good success with LLMs after putting in place metrics to measure true complexity (not cyclomatic).

skybrian•22m ago
Could you say more about how you use it? What does your workflow look like?
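The "fast but worse can win" argument above has a simple expected-value form if you assume every attempt is independent, takes a fixed wall-clock time (generation plus tests), and the tests reliably reject bad output. The attempt times and pass rates below are made-up placeholders, not benchmarks of Mercury or any other model.

```python
from dataclasses import dataclass


@dataclass
class Setup:
    name: str
    seconds_per_attempt: float  # generation + running the tests/metrics
    p_pass: float               # chance a single attempt passes the checks


def expected_seconds_to_pass(s: Setup) -> float:
    # Retrying until the checks pass is a geometric process: the expected
    # number of attempts is 1 / p_pass, each costing seconds_per_attempt.
    if not 0.0 < s.p_pass <= 1.0:
        raise ValueError("p_pass must be in (0, 1]")
    return s.seconds_per_attempt / s.p_pass


if __name__ == "__main__":
    fast = Setup("fast, weaker model", seconds_per_attempt=10, p_pass=0.3)
    slow = Setup("slow, stronger model", seconds_per_attempt=60, p_pass=0.7)
    for s in (fast, slow):
        print(f"{s.name}: {expected_seconds_to_pass(s):.1f} s expected")
```

The advantage shrinks if the checks are slow, or if they let bad code through, which is why cheap and trustworthy metrics matter to this argument.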
xnx•1h ago
Is the diffusion approach any use in Multi-Token Prediction (MTP) drafters? https://blog.google/innovation-and-ai/technology/developers-...
fcanesin•1h ago
Yes, DFlash is currently a SOTA speculative decoding method that Xiaomi just used in their MiMo model for >1000 tokens/s.
doctorpangloss•1h ago
MTP is a training optimization. Drafting requires verification, and verification is full-model inference. Speculative decoding is the name for the inference-time optimization, which is more like a verifier that is a smaller model.
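Whatever produces the drafts (MTP heads or a separate small model), the payoff of draft-and-verify decoding can be estimated with the standard expected-acceptance formula. This sketch assumes each drafted token is accepted independently with the same probability `alpha`, which real drafters only approximate; `alpha` and `draft_len` are illustrative parameters, not figures from the thread.

```python
def expected_tokens_per_verify(alpha: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model pass in speculative decoding.

    Assumes each of the draft_len drafted tokens is accepted independently
    with probability alpha. Verification stops at the first rejection, and
    the target model always contributes one token of its own (a correction
    or a bonus token), so the expectation is 1 + alpha + ... + alpha**draft_len.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(draft_len + 1)
    # Closed form of the geometric series above.
    return (1 - alpha ** (draft_len + 1)) / (1 - alpha)


if __name__ == "__main__":
    for k in (1, 4, 16, 64):
        print(k, round(expected_tokens_per_verify(0.8, k), 3))
```

However long the draft, the result never exceeds 1 / (1 - alpha), which is 5 tokens per target pass at alpha = 0.8. That ceiling is one way to read the "only so much speedup" point made elsewhere in the thread.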
SkitterKherpi•1h ago
It's cool, but local models, while okay, already feel noticeably worse than even the cheapest APIs, so I can't see myself sacrificing even a little of their quality for speed. I'm sure it's worth it for some use cases; I'm curious to hear specific ones that people are already planning to deploy to production.
Mashimo•49m ago
Maybe writing / bootstrapping unit tests?

They don't need an Opus-level model to write, and they're easy to iterate on.

SkitterKherpi•46m ago
I can see it, but even for something like tests I'd still eat the time cost of normal Gemma for 10% extra performance. Further, if you switch between the fast and normal Gemma for different tasks, you pay the big time cost of loading the other model (and of maintaining both in the first place).
roosgit•48m ago
Can LoRAs be used to increase the quality of these diffusion models? Nvidia mentions something about this https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-8B#inf...
samuelknight•46m ago
Some of these comments miss the advantage of diffusion. This will have a big impact on edge devices, such as your phone or the GPU in your computer.

An LLM's decoder computes tokens one at a time because attention has to account for each previous token. Existing LLM decoders scale well when you have enough load to batch many inferences together, so diffusion is of limited benefit there. On the edge you have a different problem: your inference accelerator is starved while sloshing GBs of weights back and forth from RAM. That's because consumer memory like LPDDRx/GDDRx has lower bandwidth than HBM, and the requests are serial, so you can't batch computation over shared weights. Diffusion can compute tokens in parallel, which relieves the memory-bandwidth bottleneck.
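To make "compute tokens in parallel" concrete, here is a toy version of confidence-based unmasking, one common decoding scheme for masked text-diffusion models. The thread does not say which schedule DiffusionGemma uses, and `toy_denoiser` is a hypothetical stand-in that reads its "predictions" from a fixed target sentence with random confidences; the only point is that each call (one pass over the weights) can commit several tokens.

```python
import random

MASK = "<mask>"
TARGET = "the quick brown fox jumps over the lazy dog".split()


def toy_denoiser(tokens: list[str], rng: random.Random) -> dict[int, tuple[str, float]]:
    # Hypothetical stand-in for one forward pass: it scores every masked
    # position at once. A real model would predict from context; this toy
    # reads the answer from TARGET and attaches a random confidence.
    return {i: (TARGET[i], rng.random()) for i, t in enumerate(tokens) if t == MASK}


def diffusion_decode(tokens_per_step: int, seed: int = 0) -> tuple[list[str], int]:
    rng = random.Random(seed)
    tokens = [MASK] * len(TARGET)  # start from a fully masked block
    passes = 0
    while MASK in tokens:
        preds = toy_denoiser(tokens, rng)
        passes += 1
        # Commit only the most confident predictions; the rest stay masked
        # and are re-predicted on the next pass (in a real model, with more
        # of the block already filled in as context).
        ranked = sorted(preds, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:tokens_per_step]:
            tokens[i] = preds[i][0]
    return tokens, passes


if __name__ == "__main__":
    for k in (1, 3):
        tokens, passes = diffusion_decode(tokens_per_step=k)
        print(k, passes, " ".join(tokens))
```

With `tokens_per_step=1` the 9-token block takes 9 passes (one token per weight load, like autoregressive decoding); with `tokens_per_step=3` it takes 3. In a real model, committing more tokens per pass means committing them with less context, which plausibly contributes to the quality gap others mention.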

schmorptron•44m ago
What would a diffusing reasoning model look like? Have a predefined-length [thinking] block that gets diffused over a long time, and then the final output block uses what's in that thinking block as part of its input? And how do diffusion models decide the output length in the first place? Is it a preset parameter, or does the model diffuse an [end] token into the middle somewhere?
schmorptron•41m ago
Got one answer by reading the rest of the comments; it makes sense that the diffusion process is inherently reasoning-like. See https://www.inceptionlabs.ai/blog/introducing-mercury-2
incognito124•38m ago
I just *love* the commit message on Github: "Make TPUs go brr"
bachmeier•15m ago
> DiffusionGemma reverses this inefficiency. Instead of predicting words sequentially, it drafts an entire 256-token paragraph simultaneously. By giving the computer's processor a larger chunk of work at once, DiffusionGemma utilizes your hardware to its full potential. It upgrades your model inference from a single, sequential typewriter to a massive printing press that stamps the entire block of text simultaneously.

> Operating as a 26B total Mixture of Experts (MoE) model that activates only 3.8B parameters during inference, DiffusionGemma fits comfortably within 18GB VRAM limits of high-end dedicated consumer GPUs when quantized.

Okay, so Gemma 4 26B is a MoE model that's really fast on my 24 GB GPU using ollama. This sounds like speculative decoding but I don't think that works with MoE models? It's hard to keep up with all this when it's not your job to keep up with it.

regularfry•6m ago
This is a different model with, confusingly, approximately the same number of params as the existing gemma4 MoE. Unclear from a quick scan whether one was trained somehow from the other.

The mechanism isn't the same as speculative decoding. Speculative decoding happens sequentially and (usually) a couple of tokens at a time; diffusion doesn't, and does blocks of text at once. I haven't read the collateral yet but my assumption would be that it's trained to keep the specific experts stable across a diffusion block.
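The "fits within 18GB when quantized" figure quoted in this subthread is easy to sanity-check: weight memory is roughly parameter count times bits per parameter. This back-of-the-envelope sketch ignores the KV cache, activations and per-group quantization scales, so real footprints run somewhat higher; the bit widths tried are examples, not the format Google ships.

```python
TOTAL_PARAMS = 26e9    # total MoE parameters, from the quoted announcement
ACTIVE_PARAMS = 3.8e9  # parameters active per token, from the same quote


def weight_gb(params: float, bits_per_param: float) -> float:
    # Weights only, in decimal gigabytes (1e9 bytes).
    return params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    for bits in (4, 5, 8, 16):
        print(f"{bits:2d}-bit: all weights {weight_gb(TOTAL_PARAMS, bits):5.2f} GB, "
              f"active per token {weight_gb(ACTIVE_PARAMS, bits):5.2f} GB")
```

At about 4 to 5 bits per weight the total lands at 13 to 16.25 GB, consistent with an 18 GB budget; at 8 bits it would not fit. The much smaller active-parameter figure is what makes MoE decoding cheap per token, although the tokens in a 256-token diffusion block may be routed to different experts, so a single pass could touch more than 3.8B parameters unless routing is kept stable across the block.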

simonw•7m ago
NVIDIA are hosting a free endpoint for this one, details at https://build.nvidia.com/google/diffusiongemma-26b-a4b-it - you have to create an account and (I think) verify a phone number too.

(I got it to draw a pelican: https://tools.simonwillison.net/markdown-svg-renderer#url=ht... )

alfirous•1m ago
I registered a few weeks ago and the account still isn't verified, despite following the procedure. You can't use the API if the account isn't verified.
famouswaffles•1h ago
Almost certainly not, if things remain as they are. The reason there's been little traction is that the quality gap between diffusion and autoregressive models is pretty stark. I mean, just look at the benchmarks here: large dropoffs, with the hardest benchmarks seeing the largest drops. On top of that, almost all the speed benefits of diffusion models are negated at scale. So this is only attractive for local model development, and almost everyone training local models still cares about pound-for-pound quality and inference efficiency at scale.
regularfry•53s ago
It's fast enough that "ask it twice and pick the best" should still come out ahead performance-wise. I don't know how much that would close the quality gap by, but it's worth a play.
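The "ask it twice and pick the best" idea can be roughed out under two strong assumptions: samples succeed independently with the same probability, and the "pick the best" step always recognizes a good answer when one exists. The 4x speedup comes from the headline; the per-sample success rates are placeholders, not benchmark numbers.

```python
def best_of_n_success(p_single: float, n: int) -> float:
    # Probability that at least one of n independent samples is acceptable,
    # assuming a perfect selector.
    return 1 - (1 - p_single) ** n


def relative_latency(n: int, speedup: float) -> float:
    # Wall-clock time of n sequential samples from a model that is `speedup`
    # times faster, relative to one sample from the baseline model.
    return n / speedup


if __name__ == "__main__":
    p_ar, p_diff = 0.70, 0.60  # placeholder single-sample success rates
    print("autoregressive x1:", best_of_n_success(p_ar, 1), "at latency", 1.0)
    print("diffusion x2:", round(best_of_n_success(p_diff, 2), 2),
          "at latency", relative_latency(2, speedup=4.0))
```

Correlated failures (the same prompt tripping up both samples) and an imperfect selector would both shrink the gain, so this is closer to an upper bound than a prediction.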