Gemma 4 12B: A unified, encoder-free multimodal model

https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/

177•rvz•1h ago

Comments

minimaxir•58m ago

The big story here is the encoder-free part, which I still don't fully understand.

> Vision: We replaced Gemma 4’s vision encoder with a lightweight embedding module consisting of a single matrix multiplication, positional embedding and normalizations.

That's technically encoding, just without using a dedicated model for it like SigLIP? The Developer's Guide elaborates: it's still a 35M layer, and I am curious whether it is robust enough. https://developers.googleblog.com/gemma-4-12b-the-developer-...

> Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.

I am assuming that involves quantization, which, due to the quality loss, makes that statement somewhat misleading IMO.

GaggiX•54m ago

> That's technically encoding

Isn't that just projecting the patches into the d_model-sized vectors that the model takes?

> I am assuming that involves quantization

A 12B model in 16GB seems very reasonable to me; int8 is top quality for running models.

minimaxir•48m ago

The guide describes it as projection although there is apparently an extra step: "A factorized coordinate lookup (X and Y matrices) attaches spatial location information directly to the input."

12B at int8 would take up 12GB of memory, or 75% of system memory, which technically fits within 16GB, but the OS will not like that.

kristjansson•54m ago

> quantization

12b means 12G @ 8 bits/param (basically lossless) and 6G at 4 b/p (generally accepted 'pretty close' level). Not too bad?

But TBD how well the base model performs before thinking too much about quantization
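
The back-of-the-envelope figures quoted in this subthread can be reproduced with a short sketch. It counts weight storage only and assumes 1 GB = 1e9 bytes; KV cache, activations, and runtime overhead are ignored, so real usage would be higher. The function name is hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Assumption: only the weights are counted; KV cache, activations,
    and runtime overhead are ignored.
    """
    total_bits = params_billion * 1e9 * bits_per_param
    return total_bits / 8 / 1e9


SYSTEM_RAM_GB = 16  # the laptop size mentioned in the announcement

for bits in (16, 8, 4):
    gb = weight_memory_gb(12, bits)
    share = gb / SYSTEM_RAM_GB
    print(f"{bits:>2} bits/param: {gb:.0f} GB of weights, {share:.0%} of {SYSTEM_RAM_GB} GB")
```

At 8 bits per parameter this gives the 12 GB (75% of 16 GB) figure above, and at 4 bits the 6 GB figure.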

jszymborski•52m ago

Totally agree that it is "encoding" in the general sense, but I think they are referring to the lack of an "encoder" neural network.

minimaxir•50m ago

In hindsight I may have been pedantic.

wilkystyle•25m ago

I had a similar thought to you, and found your question and the resulting discussion helpful!

alberto467•21m ago

Not at all. Getting really pedantic, tokenization is also a form of encoding, so it doesn't matter the modality you're using, you'll end up doing some type of encoding in some way.

reactordev•52m ago

It actually works well because unlike encoders, the latent space is trained on that initial layer so it “knows” what to do with that sparse density. I’ve been using gemma4-12b with Flux2 and its ability to reason on visual input is pretty good. That said, each model is good in its own way, so YMMV, but overall it’s about as solid as Qwen, just with a more advanced architecture.

LarsDu88•51m ago

Well, it's a real simple encoder I guess

wolttam•49m ago

I think the idea is that the model is seeing embeddings that map directly to underlying pixel data, rather than being fed semantically rich embeddings from an encoder model which itself had seen the raw pixel data.
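
For readers trying to picture the module quoted earlier in the thread (a single matrix multiplication, positional embedding, and normalizations, plus the factorized X and Y coordinate lookup), here is a minimal pure-Python sketch. The patch size, model width, single-channel image, RMS-style normalization, and the order of the steps are all assumptions for illustration, not Gemma 4's actual configuration.

```python
import math
import random


def rms_norm(vec, eps=1e-6):
    """Scale a vector to roughly unit root-mean-square (assumed norm type)."""
    scale = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / scale for v in vec]


def embed_patches(image, patch, w_proj, pos_x, pos_y):
    """Turn an H x W grayscale image into one d_model vector per patch.

    w_proj: d_model rows, each of length patch*patch (the single matmul).
    pos_x / pos_y: one d_model vector per patch column / patch row
    (the factorized coordinate lookup).
    """
    tokens = []
    for py in range(len(image) // patch):
        for px in range(len(image[0]) // patch):
            # Flatten the raw pixels of this patch.
            flat = [image[py * patch + dy][px * patch + dx]
                    for dy in range(patch) for dx in range(patch)]
            flat = rms_norm(flat)
            # Single matrix multiplication: patch*patch -> d_model.
            proj = [sum(f * w for f, w in zip(flat, row)) for row in w_proj]
            # Add the X and Y position vectors for this patch's coordinates.
            tok = [p + x + y for p, x, y in zip(proj, pos_x[px], pos_y[py])]
            tokens.append(rms_norm(tok))
    return tokens


random.seed(0)
H = W = 8   # toy image size
P = 4       # toy patch size
D = 6       # toy d_model
image = [[random.random() for _ in range(W)] for _ in range(H)]
w_proj = [[random.gauss(0, 0.1) for _ in range(P * P)] for _ in range(D)]
pos_x = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(W // P)]
pos_y = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(H // P)]

tokens = embed_patches(image, P, w_proj, pos_x, pos_y)
print(len(tokens), len(tokens[0]))  # 4 patches, each a 6-dimensional vector
```

There is no stack of attention layers between the pixels and the language model here, which is the contrast with a separate encoder such as SigLIP that the comment above describes.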

matja•26m ago

One side effect is that the separate .mmproj file (Multi-Modal Projection encoder) is no longer needed when using the model with llama.cpp etc.

georgehm•18m ago

Embedded within that developer page is a good explainer of the encoder-free architecture. https://newsletter.maartengrootendorst.com/p/a-visual-guide-...

nickandbro•57m ago

Wow, Google is becoming the new pre-Llama 4 Meta when it comes to releasing open-weights models.

embedding-shape•50m ago

I dunno, it feels a bit unfair to companies that actually do FOSS releases (Gemma 4 being released under the Apache 2.0 license) to compare them to a company that has never done any FOSS releases and has mostly done proprietary "available to download" releases.

seba_dos1•37m ago

Note that a binary released under Apache 2.0 license does not yet make it FOSS.

embedding-shape•31m ago

Agreed, though it's miles ahead of "proprietary", which is what Meta has been using for most model releases.

Ideally companies would share the fucking datasets and training code already, but no, no one wants to talk about the source of those or even share the ones they have as then who knows what comes out of Pandora's box...

redman25•46m ago

IDK, this model release is a bit disappointing considering the community has been chomping at the bit for the 124ba4b model. There was some leaked info about it, but people suspect it was not released because it was too close to Gemini Flash in performance.

brianwawok•35m ago

ethanpil•48m ago

What's Google's business case for releasing open models? Don't get me wrong, I am grateful and appreciative of these releases. I'm trying to understand how it fits into their bigger picture as a for profit company? Are they not helping competitors build on the novel technology they have developed?

Is it simply goodwill and/or marketing? Or am I missing something strategic?

mmarian•46m ago

Marketing + Pro Serv if I had to take a guess.

XzAeRosho•44m ago

Google's MO has always been to release great products or services for free, position themselves high, and then abandon them or just find uses for Enterprise sales.

I'm pretty sure they are doing it because they get some research experience by shrinking and improving these models, and because they know that by doing this they get some good PR among the dev community.

Aachen•21m ago

Google's "free" is and was ad-supported, even if some products now have a paid tier. These models don't include ads. Doesn't seem like the same underlying reason.

theturtletalks•43m ago

Maybe they are hedging against a future where local models are just as good as cloud models? Or maybe they can go the Taalas route and start hardcoding Gemma on a chip and hardware manufacturers can use it for local private AI.

zuminator•41m ago

How does it compare with e4b, aside from being larger?

thomasjb•34m ago

That's what I want to know too. A smarter E4B that's happy in opencode would be a good selfhosted model for me

anonova•27m ago

There's a comparison of all the Gemma 4 models (+ Gemma 3 27B) on the Huggingface model card: https://huggingface.co/google/gemma-4-12B-it#benchmark-resul...

dwa3592•39m ago

This is a pretty good update. The demo video is a bit funny though: the tester asks to turn the release into bullet points, and the model obliges. Then the tester says to draft an email with this content. BAM! The LLM turns the content from bullets into passages even though it was not asked to, undoing the last good thing it did. I am not sure if it's email etiquette to avoid bullets in an email.

Zambyte•35m ago

Is this Mac only? Or is that an Ollama issue that it only supports this release of models on Mac? It seems like every tag with the MLX badge is only supported on Mac[0], and that includes all of the tags in this release.

[0] https://ollama.com/library/gemma4/tags

Edit: MLX being Mac-only is independent of the model being MLX (and therefore Mac) only. The latter is what I am asking about.

embedding-shape•29m ago

MLX is quite literally macOS-specific technology, for other platforms you want non-MLX.

I was sure "MLX" stood for "Metal-something-something" but can't find any reference to that somehow, anywho, "Metal" is hardware-accelerated graphics on Apple platforms FWIW.

Edit: about the actual release on Ollama, if you're on non-Apple hardware you probably want the NVFP4 variant ("gemma4:12b-nvfp4"), which was uploaded 45 minutes ago, especially if you have a recent NVIDIA GPU.

jw1224•26m ago

MLX is Apple’s own machine learning framework, designed for Apple Silicon: https://opensource.apple.com/projects/mlx/

jasonjmcghee•6m ago

There's a CUDA backend for MLX now. Not sure about the maturity.

randomNumber7•35m ago

> Novel unified architecture: No multimodal encoders. The vision and audio inputs flow directly into the LLM backbone.

I would be interested in how this actually works. I couldn't find a description of the model architecture (and I did check the links in the Google blog).

djyde•33m ago

What are the use cases for these small models? Is there anyone using models of this scale in their daily life who could share their experience?

Xiol•16m ago

I've yet to see someone answer a question like this with a decent, useful answer.

Aachen•15m ago

"Small" models are the ones I can run myself on my own terms. LLMs aren't useful enough for me to justify spending hundreds of euros on a GPU with 16GB VRAM or something, and that's assuming I have the rest of the desktop just lying around. Back when I checked (before the RAM price hike), these models weren't meaningfully better than 4-8GB ones anyway. You'd have to go for the top-tier cards at 24 or 32 GB iirc to get something vaguely in the direction of the SaaS versions, and that was absolutely out of my budget. Even if that changed, so have hardware prices, so it'd probably still work out the same.

jdelman•30m ago

I can’t help but wonder if this is the basis of the model they’ve helped tune for Apple.

digdugdirk•21m ago

I do enjoy the immediate out of touch signaling with the "runs on your 16gb vram laptop" line. Because everyone has a laptop with 16gb vram, or can just pop out and buy a new one, right?

Havoc•20m ago

Quite a niche release. The MoE outperforms it on score and will likely be faster thanks to lower active weights. So this really only makes sense for specific RAM-constrained applications that can’t fit a quantized MoE.

dist-epoch•14m ago

The un-quantized MoE outperforms it.

But at the same (V)RAM requirement, between the 4-bit 26B-A3B and the 8-bit 12B, it's unclear which one will win, especially given one is MoE and the other dense.

All the launch benchmarks are at 16 bit.
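
To put rough numbers on that comparison, here is a small sketch of weight footprint and weights read per generated token for the two variants. It reads "A3B" as roughly 3B active parameters per token, which is an assumption, and it treats per-token weight reads as a crude proxy for decode speed. It says nothing about which variant scores better.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    total_params_b: float   # total parameters, in billions
    active_params_b: float  # parameters used per token, in billions
    bits: int               # bits per parameter after quantization

    def weights_gb(self) -> float:
        """Weight storage in GB (1 GB = 1e9 bytes), excluding KV cache."""
        return self.total_params_b * self.bits / 8

    def read_per_token_gb(self) -> float:
        """Weights touched per generated token, in GB (rough speed proxy)."""
        return self.active_params_b * self.bits / 8


variants = [
    # Assumption: "A3B" means about 3B active parameters per token.
    Variant("26B-A3B MoE @ 4-bit", total_params_b=26, active_params_b=3, bits=4),
    # A dense model uses all of its parameters for every token.
    Variant("12B dense @ 8-bit", total_params_b=12, active_params_b=12, bits=8),
]

for v in variants:
    print(f"{v.name}: {v.weights_gb():.1f} GB resident, "
          f"{v.read_per_token_gb():.1f} GB read per token")
```

Under these assumptions the two sit within about 1 GB of each other in resident size, while the MoE touches far fewer weights per token, which is the basis for the "likely faster" claim above.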

ComputerGuru•16m ago

Quite aside from the architectural changes, I suppose this is the answer to why Google had such a glaring hole in the (pretrained) Gemma4 model lineup between the Gemma4 4b and Gemma4 26b models!

A model that comfortably fits in 16GB of VRAM (allowing room for context) is a welcome upgrade.

claysmithr•16m ago

I don’t see the download in LM Studio.

BiraIgnacio•15m ago

Using an embedder instead of an encoder is quite clever. Not sure who came up with that first, but it's a cool idea.

lxgr•14m ago

Am I missing something or are the Ollama versions of this (https://ollama.com/library/gemma4/tags) text-only for now?

philipkglass•12m ago

Since Ollama has diverged from llama.cpp, it will take a bit of time for Ollama to support multi-modality. If you're using plain llama.cpp, it looks like a PR has already been merged for this model with vision and audio support:

https://github.com/ggml-org/llama.cpp/pull/24077/changes
