
Google releases Gemma 4 open models

https://deepmind.google/models/gemma/gemma-4/
384•jeffmcjunkin•1h ago

Comments

danielhanchen•1h ago
Thinking / reasoning + multimodal + tool calling.

We made some quants at https://huggingface.co/collections/unsloth/gemma-4 for folks to run them - they work really well!

Guide for those interested: https://unsloth.ai/docs/models/gemma-4

Also note: use temperature = 1.0, top_p = 0.95, top_k = 64. The EOS token is "<turn|>", and "<|channel>thought\n" is used for the thinking trace!
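For anyone wiring these settings into their own sampler, here's a minimal pure-Python sketch of what top-k followed by top-p (nucleus) filtering does to a logit distribution. This is only an illustration of the settings above, not the actual inference code:

```python
import math

def sample_filter(logits, top_k=64, top_p=0.95):
    """Return (token_index, prob) pairs that survive top-k then top-p
    (nucleus) filtering, renormalized. Illustrative sketch only."""
    # Softmax over the raw logits (stabilized by subtracting the max).
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = sorted(((i, e / z) for i, e in enumerate(exps)),
                   key=lambda t: t[1], reverse=True)
    # top-k: keep only the k most likely tokens.
    probs = probs[:top_k]
    # top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    total = sum(p for _, p in kept)
    return [(i, p / total) for i, p in kept]

print(sample_filter([2.0, 1.0, 0.1, -1.0], top_k=2, top_p=0.95))
```

Real inference stacks (llama.cpp, transformers) apply this per decoding step and then draw one token from the surviving set.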

l2dy•1h ago
FYI, the screenshot for the "Search and download Gemma 4" step in your guide is for Qwen3.5, and when I searched for gemma-4 in Unsloth Studio it only showed Gemma 3 models.
danielhanchen•1h ago
We're still updating it haha! Sorry! It's been quite complex to support new models without breaking old ones
Imustaskforhelp•1h ago
Daniel, I know you probably hear this a lot, but I really appreciate what you've been doing at Unsloth and the way you handle your communication, whether on Hacker News or Reddit.

I'm not sure if someone has already asked this, but out of curiosity: which open-source model do you find best? And which AI training team (Qwen/Gemini/Kimi/GLM) has cooperated the most with the Unsloth team and been the friendliest to work with?

danielhanchen•54m ago
Thanks a lot for the support :)

Tbh Gemma-4 haha - it's sooooo good!!!

For teams - Google, haha, definitely hands down, then Qwen, then Meta (through PyTorch and Llama) and Mistral - tbh all the labs are great!

Imustaskforhelp•50m ago
Now you've gotten me a bit excited for Gemma-4. Definitely gonna see if I can run the Unsloth quants of this on my MacBook Air. Thanks for responding to my comment :-)
danielhanchen•47m ago
Thanks! Have a super good day!!
evilelectron•44m ago
Daniel, your work is changing the world. More power to you.

I set up a pipeline for inference with OCR, full-text search, embedding, and summarization of land records dating back to the 1800s, all powered by the GGUFs you generate and llama.cpp. People are so excited that they can now search the records in multiple languages that a one-minute wait to process a document seems like nothing. Thank you!

danielhanchen•43m ago
Oh appreciate it!

Oh nice! That sounds fantastic! I hope Gemma-4 will make it even better! The small ones 2B and 4B are shockingly good haha!

zaat•13m ago
Thank you for your work.

You have an answer on your page for "Should I pick 26B-A4B or 31B?", but can you please clarify: assuming 24 GB of VRAM, should I pick the smaller model at full precision or the larger model at 4-bit?

jwr•1h ago
Really looking forward to testing and benchmarking this on my spam-filtering benchmark. gemma-3-27b was a really strong model, later surpassed by gpt-oss:20b (which was also much faster). Qwen models have always had more variance.
jeffbee•1h ago
Does spam filtering really need a better model? My impression is that the whole game is based on having the best and freshest user-contributed labels.
a7om_com•1h ago
Gemma models are already in our AIPI inference pricing index. Open source models like Gemma run 70.7% cheaper than proprietary equivalents at the median across the 2,614 SKUs we track. With Gemma 4 hitting third-party platforms the pricing will be worth watching closely. Full data at a7om.com.
flakiness•1h ago
It's good they still have non-instruction-tuned models.
minimaxir•1h ago
The benchmark comparisons to Gemma 3 27B on Hugging Face are interesting: The Gemma 4 E4B variant (https://huggingface.co/google/gemma-4-E4B-it) beats the old 27B in every benchmark at a fraction of parameters.

The E2B/E4B models also support voice input, which is rare.

regularfry•40m ago
Thinking vs non-thinking. There'll be a token cost there. But still fairly remarkable!
DoctorOetker•11m ago
Is there a reason we can't use thinking completions to train non-thinking? i.e. gradient descent towards what thinking would have answered?
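A minimal sketch of one common approach: treat the teacher's thinking completions as training data but strip the trace, so plain SFT (gradient descent on next-token loss) pushes the student toward the final answers directly. The `<think>` tag names below are illustrative, not any particular model's markers:

```python
def to_nonthinking_target(completion,
                          open_tag="<think>", close_tag="</think>"):
    """Strip the thinking trace from a teacher completion so a student
    can be fine-tuned to emit the final answer directly."""
    start = completion.find(open_tag)
    end = completion.find(close_tag)
    if start == -1 or end == -1:
        return completion.strip()  # no trace present: use as-is
    return (completion[:start] + completion[end + len(close_tag):]).strip()

# Teacher (prompt, thinking completion) pairs become plain SFT pairs.
pairs = [
    ("What is 6*7?", "<think>6*7 = 42</think>The answer is 42."),
    ("Capital of France?", "Paris."),
]
sft_data = [(q, to_nonthinking_target(a)) for q, a in pairs]
print(sft_data)
```

Distillation on logits (matching the teacher's full output distribution) is the stronger variant, but this data-level version is the simplest instance of "train non-thinking towards what thinking would have answered".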
NitpickLawyer•1h ago
Best thing is that this is Apache 2.0 (edit: and they have base models available. Gemma3 was good for finetuning)

The sizes are E2B and E4B (following gemma3n arch, with focus on mobile) and 26BA4 MoE and 31B dense. The mobile ones have audio in (so I can see some local privacy focused translation apps) and the 31B seems to be strong in agentic stuff. 26BA4 stands somewhere in between, similar VRAM footprint, but much faster inference.

babelfish•1h ago
Wow, 30B parameters as capable as a 1T parameter model?
darshanmakwana•1h ago
This is awesome! I will try using them locally with opencode and see if they are usable as a replacement for Claude Code for basic tasks
antirez•1h ago
Featuring the ELO score as the main benchmark in the chart is very misleading. The big dense Gemma 4 model does not seem to reach the Qwen 3.5 27B dense model in most benchmarks, which is obviously what matters. The small 2B/4B models are interesting and may potentially be better ASR models than specialized ones (not just for performance, but because they will be easy to serve via llama.cpp / MLX and front-ends). They're also interesting for "fast" OCR, given they are vision models as well. But other than that, the release is a bit disappointing.
nabakin•53m ago
Public benchmarks can be trivially faked. Lmarena is a bit harder to fake and is human-evaluated.

I agree it's misleading for them to hyper-focus on one metric, but public benchmarks are far from the only thing that matters. I place more weight on Lmarena scores and private benchmarks.

moffkalast•13m ago
LM Arena is so easy to game that it ceased to be a relevant metric over a year ago. People are not reliable validators beyond "yeah, that looks good to me"; nobody checks whether the facts are correct.
WarmWash•46m ago
I can't shake the fact that the Chinese models all perform awfully on the private ARC-AGI 2 tests.
minimaxir•42m ago
I can't find which ELO score the benchmark chart is referring to; it's just labeled "Elo Score". It's not Codeforces ELO, since Gemma 4 31B scores 2150 there, which would be off the given chart.
nabakin•36m ago
It's referring to the LMSYS Leaderboard / LMArena / Arena.ai [0]. It's very well known in the LLM community as one of the few sources of human evaluation data.

[0] https://arena.ai/leaderboard/chat

azinman2•41m ago
I find the benchmarks to be suggestive but not necessarily representative of reality. It's really best if you have your own use case and can benchmark the models yourself. I've found the results to be surprising and not what these public benchmarks would have you believe.
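A private benchmark doesn't have to be elaborate; a sketch like this (with `fake_model` as a stand-in for whatever wraps your real local or API model call) is often enough to start:

```python
def run_benchmark(model_fn, cases):
    """Score a model callable on your own (prompt, expected) cases.
    A case counts as a hit if the expected substring appears in the
    model's answer, case-insensitively. Crude, but a useful baseline."""
    hits = sum(1 for prompt, expected in cases
               if expected.lower() in model_fn(prompt).lower())
    return hits / len(cases)

# Stub "model" standing in for a real inference call.
def fake_model(prompt):
    return "Paris" if "France" in prompt else "I don't know"

cases = [("Capital of France?", "paris"),
         ("Capital of Atlantis?", "none")]
print(run_benchmark(fake_model, cases))  # 0.5 on this toy set
```

Swap `fake_model` for a call into llama.cpp, Ollama, or an API client, and grow `cases` from real examples in your workload; even a few dozen cases usually says more about fit than public leaderboards do.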
rvz•1h ago
Open-weight models once again marching on, slowly becoming a viable alternative to the larger ones.

We are at least 1 year and at most 2 years until they surpass closed models for everyday tasks that can be done locally to save spending on tokens.

echelon•1h ago
> We are at least 1 year and at most 2 years until they surpass closed models for everyday tasks that can be done locally to save spending on tokens.

Until they pass what closed models today can do.

By that time, closed models will be 4 years ahead.

Google would not be giving this away if they believed local open models could win.

Google is doing this to slow down Anthropic, OpenAI, and the Chinese, knowing that in the fullness of time they can be the leader. They'll stop being so generous once the dust settles.

pixl97•1h ago
I mean, correct, but running open models locally will still massively drop your costs even if you still need to interface with the large paid models. At the end of the day, Google will still make less money than if theirs were the only model in existence.
ma2kx•27m ago
I think it will be less a local-versus-cloud situation and more one where the two complement each other. The next step will undoubtedly be for local LLMs to become fast and intelligent enough to allow for voice conversation. A low-latency model will then run locally, enabling smoother conversations, while batch jobs in the cloud handle the more complex tasks.

Google, at least, is likely interested in such a scenario, given their broad smartphone market. And if their local Gemma/Gemini-nano LLMs perform better with Gemini in the cloud, that would naturally be a significant advantage.

james2doyle•1h ago
Hmm, just tried google/gemma-4-31B-it through Hugging Face (the inference provider seems to be Novita) and function/tool calling was not enabled...
james2doyle•1h ago
Yeah you can see here that tool calling is disabled: https://huggingface.co/inference/models?model=google%2Fgemma...

At least, as of this post

linolevan•1h ago
It's hosted on Parasail + Google themselves (both for free, as of now); I'd probably give those a shot
originalvichy•1h ago
The wait is finally over. One or two iterations, and I’ll be happy to say that language models are more than fulfilling my most common needs when self-hosting. Thanks to the Gemma team!
adamtaylor_13•1h ago
What sort of tasks are you using self-hosting for? Just curious as I've been watching the scene but not experimenting with self-hosting.
irishcoffee•1h ago
I would personally be much more interested in using LLMs if I didn't need to depend on an internet connection and spend money on tokens.
vunderba•56m ago
Not OP but one example is that recent VL models are more than sufficient for analyzing your local photo albums/images for creating metadata / descriptions / captions to help better organize your library.
kejaed•49m ago
Any pointers on some local VLMs to start with?
canyon289•45m ago
You could try Gemma4 :D
vunderba•41m ago
The easiest way to get started is probably to use something like Ollama with the `qwen3-vl:8b` 4-bit quantized model [1].

It's a good balance between accuracy and memory, though in my experience it's slower than older model architectures such as LLaVA. Just be aware that Qwen-VL tends to be a bit verbose [2], and you can't really control that reliably with token limits - it'll just cut off abruptly. You can ask it to be more concise, but that can be hit or miss.

What I often end up doing (and I admit it's a bit ridiculous) is letting Qwen-VL generate its full detailed output and then passing that to a different LLM to summarize.

- [1] https://ollama.com/library/qwen3-vl:8b

- [2] https://mordenstar.com/other/vlm-xkcd
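The caption-then-condense flow can be sketched like this, with both model calls stubbed out; in practice `vlm_fn` and `llm_fn` (names hypothetical) would wrap real calls to, say, an Ollama server:

```python
def caption_then_condense(image_path, vlm_fn, llm_fn, max_words=12):
    """Two-stage flow: let a verbose VLM describe the image fully,
    then have a second model compress the result into a short caption.
    Both callables are stand-ins for real model calls."""
    verbose = vlm_fn(image_path)
    return llm_fn(f"Summarize in at most {max_words} words: {verbose}")

# Stubs standing in for real VLM / LLM calls.
vlm = lambda p: ("A grey cabin beside a lake at dusk, pine trees, "
                 "mist over the water, a red canoe pulled ashore ...")
llm = lambda prompt: "Lakeside cabin at dusk with a red canoe."

print(caption_then_condense("photo_0141.jpg", vlm, llm))
```

The nice part of keeping the two stages separate is that you can store the verbose description as searchable metadata and the condensed one as the display caption.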

BoredPositron•54m ago
I use local models for autocomplete in simple coding tasks, CLI autocomplete, formatting, a Grammarly replacement, translation (it/de/fr -> en), OCR, simple web research, dataset tagging, file sorting, email sorting, validating configs, and creating boilerplate for well-known tools. Basically anything I would have used the old OpenAI mini models for.
ktimespi•37m ago
For me, receipt scanning and tagging documents and parts of speech in my personal notes. It's a lot of manual labour and I'd like to automate it if possible.
mentalgear•22m ago
Adding to the question: any good small open-source model with high correctness for reading/extracting tables and/or PDFs with more uncommon layouts?
vunderba•59m ago
Strongly agree. Gemma3:27b and Qwen3-vl:30b-a3b are among my favorite local LLMs and handle the vast majority of translation, classification, and categorization work that I throw at them.
scrlk•1h ago
Comparison of Gemma 4 vs. Qwen 3.5 benchmarks, consolidated from their respective Hugging Face model cards:

    | Model          | MMLUP | GPQA  | LCB   | ELO  | TAU2  | MMMLU | HLE-n | HLE-t |
    |----------------|-------|-------|-------|------|-------|-------|-------|-------|
    | G4 31B         | 85.2% | 84.3% | 80.0% | 2150 | 76.9% | 88.4% | 19.5% | 26.5% |
    | G4 26B A4B     | 82.6% | 82.3% | 77.1% | 1718 | 68.2% | 86.3% |  8.7% | 17.2% |
    | G4 E4B         | 69.4% | 58.6% | 52.0% |  940 | 42.2% | 76.6% |   -   |   -   |
    | G4 E2B         | 60.0% | 43.4% | 44.0% |  633 | 24.5% | 67.4% |   -   |   -   |
    | G3 27B no-T    | 67.6% | 42.4% | 29.1% |  110 | 16.2% | 70.7% |   -   |   -   |
    | GPT-5-mini     | 83.7% | 82.8% | 80.5% | 2160 | 69.8% | 86.2% | 19.4% | 35.8% |
    | GPT-OSS-120B   | 80.8% | 80.1% | 82.7% | 2157 |  --   | 78.2% | 14.9% | 19.0% |
    | Q3-235B-A22B   | 84.4% | 81.1% | 75.1% | 2146 | 58.5% | 83.4% | 18.2% |  --   |
    | Q3.5-122B-A10B | 86.7% | 86.6% | 78.9% | 2100 | 79.5% | 86.7% | 25.3% | 47.5% |
    | Q3.5-27B       | 86.1% | 85.5% | 80.7% | 1899 | 79.0% | 85.9% | 24.3% | 48.5% |
    | Q3.5-35B-A3B   | 85.3% | 84.2% | 74.6% | 2028 | 81.2% | 85.2% | 22.4% | 47.4% |

    MMLUP: MMLU-Pro
    GPQA: GPQA Diamond
    LCB: LiveCodeBench v6
    ELO: Codeforces ELO
    TAU2: TAU2-Bench
    MMMLU: MMMLU
    HLE-n: Humanity's Last Exam (no tools / CoT)
    HLE-t: Humanity's Last Exam (with search / tool)
    no-T: no think
kpw94•49m ago
Wild differences in ELO compared to tfa's graph: https://storage.googleapis.com/gdm-deepmind-com-prod-public/...

(Comparing Q3.5-27B to G4 26B A4B and G4 31B specifically)

I'd have assumed Q3.5-35B-A3B would perform worse than the dense Q3.5 27B model, but the cards you pasted above somehow show it's the other way around for ELO and TAU2...

Very impressed by unsloth's team releasing the GGUF so quickly, if that's like the qwen 3.5, I'll wait a few more days in case they make a major update.

Overall, great news if it's at parity with or slightly better than the Qwen 3.5 open weights; hope to see both of these evolve in the sub-32GB-RAM space. Disappointed in Mistral/Ministral being so far behind these US and Chinese models

coder543•34m ago
> Wild differences in ELO compared to tfa's graph

Because those are two different, completely independent Elos... the one you linked is for LMArena, not Codeforces.

nateb2022•23m ago
> Very impressed by unsloth's team releasing the GGUF so quickly, if that's like the qwen 3.5, I'll wait a few more days in case they make a major update.

Same here. I can't wait until mlx-community releases MLX optimized versions of these models as well, but happily running the GGUFs in the meantime!

ceroxylon•1h ago
Even with search grounding, it scored 2.5/5 on a basic botanical benchmark. It would take the average human much longer to do a similar write-up, but they would likely do better than 50% hallucination if they had access to a search engine.
WarmWash•43m ago
Even multimodal models are still really bad when it comes to vision. The strength is still definitely language.
wg0•1h ago
Google might not have the best coding models (yet), but they seem to have the most intelligent and knowledgeable models of all; Gemini 3.1 Pro especially is something.

One more thing about Google is that they have everything that others do not:

1. Huge data: audio, video, geospatial.

2. Tons of expertise; "Attention Is All You Need" was born there.

3. Libraries that they wrote.

4. Their own data centers and cloud.

5. Most of all, their own hardware: TPUs that no one else has.

Therefore once the bubble bursts, the only player standing tall and above all would be Google.

chasd00•57m ago
Not sure why you're being downvoted; the other thing Google has is Google. They just have to spend the effort/resources to keep up and wait for everyone else to go bankrupt. At the end of the day, I think Google will be the eventual LLM winner. I think this is why Meta isn't really in the race and just releases open-weight models; the writing is on the wall. It's also probably why Apple went ahead and signed a deal with Google and not OpenAI or Anthropic.
wg0•52m ago
I don't know why I'm being downvoted, but Google has data, expertise, hardware, and deep pockets. This whole LLM thing was invented at Google, and the machine-learning ecosystem's libraries come from Google. I don't know how people can be so irrational as to discount Google's muscle.

Others have just borrowed data, money, and hardware, and they will surely run out of resources.

greenavocado•49m ago
This remains true so long as advertisers give Google money.
bitpush•14m ago
Why wouldn't advertisers give Google money? Are you noticing any shift in the trend?
faangguyindia•9m ago
The same can be said for Java, yet Google owns Android.
WarmWash•41m ago
The rumor is also that Meta is looking to lease Gemini similar to Apple, as their recent efforts reportedly came up short of expectations.
whimblepop•16m ago
I recently canceled my Google One subscription because getting accurate answers out of Gemini for chat is basically impossible afaict. Whether I enable thinking makes no difference: Gemini always answers me super quickly, rarely actually looks something up, and lies to me. It has a really bad unchecked hallucination problem because it prioritizes speed over accuracy and (astonishingly, to me) is way more hesitant to run web searches than ChatGPT or Claude.

Maybe the model is good but the product is so shitty that I can't perceive its virtues while using it. I would characterize it as pretty much unusable (including as the "Google Assistant" on my phone).

It's extremely frustrating every way that I've used it but it seems like Gemini and Gemma get nothing but praise here.

mudkipdev•1h ago
Can't wait for gemma4-31b-it-claude-opus-4-6-distilled-q4-k-m on huggingface tomorrow
entropicdrifter•17m ago
I'd rather see a distill of the 26B model, which uses only 3.8B parameters at inference time. Seems like it would be wildly productive for locally hosted stuff.
bertili•1h ago
Qwen: Hold my beer

https://news.ycombinator.com/item?id=47615002

xfalcox•1h ago
Comparing a model you can download weights for with an API-only model doesn't make much sense.
regularfry•37m ago
My money's on whatever models Qwen does release edging ahead. Probably not by much, but I reckon they'll be better coders just because that's where Qwen's edge over Gemma has always been. Plus, having seen this land, they'll probably tack on a couple of extra epochs just to be sure.
svachalek•45m ago
The Qwen Plus models should be compared to Gemini, not Gemma.
fooker•1h ago
What's a realistic way to run this locally or on a single expensive remote dev machine (in a VM, not through API calls)?
matja•46m ago
I'm running Gemma 4 with the llama.cpp web UI.

https://unsloth.ai/docs/models/gemma-4 > Gemma 4 GGUFs > "Use this model" > llama.cpp > llama-server -hf unsloth/gemma-4-31B-it-GGUF:Q8_0

If you already have llama.cpp you might need to update it to support Gemma 4.

evanbabaallos•1h ago
Impressive
heraldgeezer•53m ago
Gemma vs Gemini?

I'm only a casual AI chatbot user; I use whatever gives me the best free limits and versions.

daemonologist•47m ago
Gemma will give you the most, Gemini will give you the best. The former is much smaller and therefore cheaper to run, but less capable.

Although I'm not sure whether Gemma will be available even in AI Studio - they took the last one down after people got it to say/do questionable stuff. It's very much intended for self-hosting.

worldsavior•44m ago
Gemma is only tens of billions of parameters; Gemini is hundreds of billions.
VadimPR•49m ago
Gemma 3 E4B runs very quickly on my Samsung S26, so I'm looking forward to trying Gemma 4! It's fantastic to have local, offline alternatives to frontier models.
canyon289•46m ago
Hi all! I work on the Gemma team - one of many, as this was a bigger effort given it was a mainline release. Happy to answer whatever questions I can
wahnfrieden•44m ago
How is the performance for Japanese, voice in particular?
canyon289•13m ago
I don't have the metrics on hand, but I'd say try it and see if you're impressed! What matters at the end of the day is whether it's useful for your use cases, and only you can assess that!
k3nz0•40m ago
How do you test codeforces ELO?
canyon289•14m ago
On this one I don't know :) I'll ask my friends on the evaluation side of things how they do it
azinman2•40m ago
How do the smaller models differ from what you guys will ultimately ship on Pixel phones?

What's the business case for releasing Gemma and not just focusing on Gemini + cloud only?

canyon289•14m ago
It's hard to say, because Pixel comes prepacked with a lot of models, not just text-output models.

With the caveat that I'm not on the Pixel team and I'm not building _all_ the models used there, it's evident there are many models supporting the Android experience, from keyboard autocomplete to image editing.

https://store.google.com/us/magazine/magic-editor?hl=en-US&p...

abhikul0•37m ago
Thanks for this release! Any reason the 12B variant was skipped this time? I was looking forward to a competitor to Qwen3.5 9B, as it allows for a good agentic flow without taking up a whole lot of VRAM. I guess E4B is taking its place.
mohsen1•37m ago
On LM Studio I'm only seeing models/google/gemma-4-26b-a4b

Where can I download the full model? I have 128GB Mac Studio

tjwebbnorfolk•30m ago
Will larger-parameter versions be released?
canyon289•25m ago
We're always figuring out which parameter sizes make sense.

The decision is always a mix of how good we can make the models from a technical standpoint and how good they need to be to make all of you super excited to use them. And it's a bit of a challenge in an ever-changing ecosystem.

I'm personally curious: is there a certain parameter size you're looking for?

WarmWash•18m ago
Mainstream consumer cards are 16GB, so everyone wants models they can run on their $400 GPU.
NitpickLawyer•14m ago
Jeff Dean apparently didn't get the message that you weren't releasing the 124B MoE :D

Was it too good or not good enough? (blink twice if you can't answer lol)

philipkglass•28m ago
Do you have plans to do a follow-up model release with quantization aware training as was done for Gemma 3?

https://developers.googleblog.com/en/gemma-3-quantized-aware...

Having 4-bit QAT versions of the larger models would be great for people who only have 16 or 24 GB of VRAM.
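For context, the core trick of QAT is simulating low-precision weights during training: the forward pass quantizes and immediately dequantizes ("fake quantization") so the network learns to tolerate the rounding (gradients flow through via straight-through estimation in real training). A rough plain-Python sketch of symmetric 4-bit fake quantization, not Google's actual recipe:

```python
def fake_quant(ws, bits=4):
    """Symmetric per-tensor fake quantization: round weights to a
    `bits`-bit signed grid and immediately dequantize, as a QAT
    forward pass would. Illustrative sketch only."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit signed
    scale = max(abs(w) for w in ws) / qmax  # one scale for the tensor
    return [round(w / scale) * scale for w in ws]

ws = [0.7, -0.35, 0.1, -0.02]
print(fake_quant(ws))
```

Training with this in the loop is why QAT checkpoints lose much less quality at 4-bit than post-hoc quantization of a full-precision model.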

_boffin_•20m ago
What was the main focus when training this model? Aside from the ELO score, it looks like the models (31B / 26B-A4) are underperforming on some of the typical benchmarks by a wide margin. Do you believe there's an issue with the tests, or are the results misleading (e.g., comparative models benchmaxxing)?

Thank you for the release.

mwizamwiinga•45m ago
Curious how this scales with larger datasets. Anyone tried it in production?
chrislattner•37m ago
If you want the fastest open source implementation on Blackwell and AMD MI355, check out Modular's MAX nightly. You can pip install it super fast, check it out here: https://www.modular.com/blog/day-zero-launch-fastest-perform...

-Chris Lattner (yes, affiliated with Modular :-)

nabakin•10m ago
Faster than TensorRT-LLM on Blackwell? Or do you not consider TensorRT-LLM open source because some dependencies are closed source?
simonw•28m ago
I ran these in LM Studio and got unrecognizable pelicans out of the 2B and 4B models, and an outstanding pelican out of the 26b-a4b model - I think the best I've seen from a model that runs on my laptop.

https://gist.github.com/simonw/12ae4711288637a722fd6bd4b4b56...

The gemma-4-31b model is completely broken for me - it just spits out "---\n" no matter what prompt I feed it.

wordpad•22m ago
Do you think it's just part of their training set now?
entropicdrifter•19m ago
Your posting of the pelican benchmark is honestly the biggest reason I check the HackerNews comments on big new model announcements
bertili•14m ago
The timing is interesting, as Apple will supposedly distill Google models in the upcoming Siri update [1]. So maybe Gemma is a lower bound on what we can expect baked into iPhones.

[1] https://news.ycombinator.com/item?id=47520438
