Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency

https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/
107•theanonymousone•2h ago

Comments

minimaxir•1h ago
It's a bit awkward to release Gemma 4 12B (https://news.ycombinator.com/item?id=48385906), and then a canonical Q4_0 Gemma 4 12B a couple days later.

It's good that this post lists the expected VRAM usage for the models: Q4_0 Gemma 4 12B comes in at 6.7GB, which does fit Google's claim of running comfortably within 16GB, although it confirms that only the quantized version will do so.
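
As a rough sanity check on the 6.7GB figure, here is a back-of-the-envelope sketch. It assumes the usual GGUF Q4_0 layout (32 weights per block, stored as 16 bytes of 4-bit values plus one 2-byte fp16 scale) and treats "12B" as exactly 12e9 weights, all stored as Q4_0. Both are simplifications, and the helper names are made up for illustration.

```python
# Rough Q4_0 size estimate. Assumptions (not taken from the blog post):
# each block holds 32 weights as 16 bytes of 4-bit values plus a 2-byte
# fp16 scale, and every weight in the model is stored as Q4_0.

BLOCK_WEIGHTS = 32
BLOCK_BYTES = 16 + 2  # packed nibbles + fp16 scale


def q4_0_bits_per_weight() -> float:
    """Effective storage cost per weight, including the per-block scale."""
    return BLOCK_BYTES * 8 / BLOCK_WEIGHTS


def q4_0_size_gb(n_params: float) -> float:
    """Approximate weight storage in GB (1e9 bytes)."""
    return n_params * q4_0_bits_per_weight() / 8 / 1e9


if __name__ == "__main__":
    print(q4_0_bits_per_weight())  # bits per weight
    print(q4_0_size_gb(12e9))      # nominal 12B-parameter model
```

Under those assumptions the estimate is 4.5 bits per weight and about 6.75GB for a nominal 12B model, which is in the same ballpark as the quoted 6.7GB. Real files differ because the true parameter count is not exactly 12e9 and some tensors are typically kept at higher precision.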

Relatedly, in Google's newly released Edge Gallery for macOS, Gemma 4 12B is explicitly listed as unsupported due to not enough RAM even on a 16GB machine, but given the expected VRAM usage here the Q4_0 variant definitely should fit and Google should fix that.

netdur•1h ago
not sure if I understand you, but 4Q and QAT 4Q are different
refulgentis•1h ago
It's super annoying when you have products that utilize these because there's...4? releases in 3 weeks?

- Gemma 4 2B/4B/27BE3B/31B

- Gemma 4 2B/4B/27BE3B/31B x "assistant" / MTP drafter models (i.e. multitoken prediction)

- Gemma 4 12B (2 days ago? 1?)

- Gemma 4 QAT 2B/4B/12B/27BE3B/31B x "assistant" models (i.e. multitoken prediction)

It probably sounds silly and really whiny in the abstract. It just causes a ton of work / confusion downstream that feels unnecessary.

Extremely glad for the output, not glad to have to chase it.

For example: llama.cpp currently supports the originals but not the MTP predictors. There is a patch for the MTP predictors, but not for the small MoE models. I think it supports the 12B, but maybe not media for it yet. Now we have these too, and the blog says there are GGUFs (llama.cpp models), but there aren't any in the 12? repos I clicked through. And ~every consumer-facing local LLM app is built on llama.cpp or a fork of it.

Also if anyone at Google is taking feedback over to b/ or product, pleaseeee stop the "E"2B "E"4B thing, unless it's actually taking up less RAM on Android during CPU inference. I can't tell if I need to treat the 4B like an 8B (i.e. beyond most consumer hardware without a GPU) or a 4B (i.e. will run on most consumer hardware since 2021)

ddarolfi•1h ago
These models aren't products? They are open source ish (open weight I guess), research outputs. While the naming scheme may be confusing, it is relevant and important. I believe it's on you to understand it.
refulgentis•37m ago
I understand it. :)
satvikpendem•1h ago
Just use Unsloth Studio; it supports them all.
Aurornis•1h ago
I'm not sure why you think it's awkward to have multiple releases. It's better to release models and variations as they're ready, not withhold them all until everything is ready to release all at once.

The Q4_0 is a quantization aware training checkpoint. It's not a simple quantization of the original Gemma 4 12B.
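
To make the distinction concrete, here is a minimal sketch of the "fake quantization" idea that quantization-aware training is generally built on. It is not Google's recipe (the thread does not describe it); the function names, the symmetric per-block grid, and the plain SGD step are all illustrative assumptions.

```python
# Minimal sketch of fake quantization as used in QAT in general.
# Everything here (names, symmetric grid, plain SGD) is an illustrative
# assumption, not the recipe used for the Gemma 4 QAT checkpoints.


def fake_quant(weights, bits=4):
    """Quantize then dequantize: snap each weight to a symmetric grid."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    amax = max(abs(w) for w in weights)
    if amax == 0.0:
        return list(weights)
    scale = amax / qmax          # one scale shared by the whole block
    out = []
    for w in weights:
        q = max(qmin, min(qmax, round(w / scale)))
        out.append(q * scale)
    return out


def ste_sgd_step(weights, grad_wrt_quantized, lr=0.1):
    """Straight-through estimator: the gradient computed at the quantized
    weights is applied directly to the full-precision shadow weights."""
    return [w - lr * g for w, g in zip(weights, grad_wrt_quantized)]
```

In a training loop the forward pass would use `fake_quant(weights)` while updates go to the full-precision copy, so the model learns weights that still work after rounding. A simple post-training quantization only does the rounding step once, after training has finished.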

netdur•1h ago
had a good run with Gemma 4 E2B Unsloth 4Q: https://youtube.com/shorts/XLsAnz5aAAI

The E4B model doesn't fit on my phone's TPU, so it swaps to RAM. The QAT version means more accuracy, which is good!

refulgentis•1h ago
@google.com'ers, there are no GGUFs (blog says there are)
minimaxir•1h ago
Isn’t this it? https://huggingface.co/google/gemma-4-12B-it-qat-q4_0-gguf
refulgentis•1h ago
Ah, nice, ty! My excuse is those repos were added to the collection after my comment, but perhaps not :3
satvikpendem•1h ago
Unsloth's collection as well [0], with their results [1]. Looks like they can get very close to 100% accuracy compared to the unquantized BF16 model, and Unsloth's quants are better than Google's original QAT as posted in the article.

Personally, I'm using the 2B model for web search and structured JSON output via Unsloth Studio and its API; it works very well for that, even with the model embedded on phones.

[0] https://huggingface.co/collections/unsloth/gemma-4-qat

[1] https://unsloth.ai/docs/models/gemma-4/qat#qat-analysis

llmoorator•58m ago
you misunderstand what that chart shows - it shows BF16 QAT Q4_0, not BF16 regular.

meaning Google quantized the model to 4 bit and stored the result in BF16 format for compatibility and convenience to downstream packers.

Like storing small 8 bit numbers in full 32 bit integers.

So it's not close to 100% of unquantized BF16.

I'm curious if anybody can explain why Google's released 4-bit QAT Q4_0 is not exactly 100% of BF16 QAT Q4_0. It seems like converting between these two packings should be just bit twiddling, with no further quantization. Unsloth talks about "lattice alignment" being an issue.
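
One way a repack can fail to be pure bit twiddling, sketched below, is if the packer re-derives the block scale from the data. This is a guess at the kind of effect "lattice alignment" might refer to, not a confirmed account of what Unsloth or Google describe. The sketch follows my understanding of llama.cpp's reference Q4_0 quantizer (scale = largest-magnitude value divided by -8), ignores fp16 rounding of the scale, and uses short toy blocks instead of 32 weights.

```python
# Simplified Q4_0 round trip (assumptions: scale chosen as the signed
# largest-magnitude value divided by -8, no fp16 rounding of the scale,
# toy block sizes). Intended only to show how re-quantizing values that
# already sit on a 4-bit grid can move them.


def q4_0_roundtrip(block):
    """Quantize a block to Q4_0-style codes and dequantize it again."""
    vmax = max(block, key=abs)       # signed value with largest magnitude
    d = vmax / -8.0                  # block scale
    if d == 0.0:
        return [0.0] * len(block)
    inv = 1.0 / d
    out = []
    for x in block:
        q = min(15, int(x * inv + 8.5))  # code in 0..15
        out.append((q - 8) * d)
    return out


if __name__ == "__main__":
    # Values on an integer grid with scale 1.0. The first block uses the
    # -8 code, so the scale is recovered exactly; the second does not.
    print(q4_0_roundtrip([-8.0, 7.0, 3.0, -1.0, 0.0]))
    print(q4_0_roundtrip([7.0, 3.0, -1.0, 0.0]))
```

In the second block the largest value is 7.0, so the recomputed scale becomes 0.875 in magnitude instead of 1.0, and 3.0 comes back as 2.625. If the trained grid and the packer's grid disagree like this, the conversion is a second quantization and not a lossless repack.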

That being said, I hate that smol model makers like Google, Qwen, ... only show the BF16 benchmarks when they release new models, knowing that what people really run are 4-8 bit quantizations, so it's really hard to understand how much you lose when you run 4 bit vs 6 bit...

slopinthebag•51m ago
I'm confused: the Unsloth model is ~600MB and the one from Google is 7GB?
cr3cr3•42m ago
For a moment I got excited thinking QAT is Intel Quick Assist Technology...
somewhatrandom9•39m ago
Could these quantized models make MTP (Multi-Token Prediction) significantly faster when used as drafters for larger regular Gemma 4 models?
MillionOClock•16m ago
Where can the mobile text-only GGUF models be found? I see the mobile ones but not the text-only variant.
simonw•9m ago
I just ran one of these locally on a Mac like this:

  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
    gemma-4-E2B-it.litertlm \
    --backend=gpu \
    --prompt="Generate an SVG of a pelican riding a bicycle"

The first time you run that it downloads 3.2GB to ~/.cache/huggingface/hub/models--litert-community--gemma-4-E2B-it-litert-lm

It can handle audio and image input too, which is pretty cool for a 3.2GB model. For images:

  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
    gemma-4-E2B-it.litertlm \
    --backend=gpu --vision-backend gpu \
    --attachment image.jpg --prompt describe

And for audio:

  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
    gemma-4-E2B-it.litertlm \
    --backend=gpu --audio-backend cpu \
    --attachment audio.wav --prompt transcribe

(The pelican is rubbish, but it's only a 3.2GB file so the fact it even outputs valid SVG is impressive to me: https://gist.github.com/simonw/94b318afde4b1ce5ff67d4b5d0362... )
