And for that matter, what is
> mix’n’match capability in Gemma 3n to dynamically create submodels
It seems like mixture-of-experts taken to the extreme, where you actually create an entire submodel instead of routing per token?
> Gemma 3n models are listed with parameter counts, such as E2B and E4B, that are lower than the total number of parameters contained in the models. The E prefix indicates these models can operate with a reduced set of Effective parameters. This reduced parameter operation can be achieved using the flexible parameter technology built into Gemma 3n models to help them run efficiently on lower resource devices.
> The parameters in Gemma 3n models are divided into 4 main groups: text, visual, audio, and per-layer embedding (PLE) parameters. With standard execution of the E2B model, over 5 billion parameters are loaded when executing the model. However, using parameter skipping and PLE caching techniques, this model can be operated with an effective memory load of just under 2 billion (1.91B) parameters, as illustrated in Figure 1.
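Back-of-the-envelope guess at how that accounting might look. The per-group splits below are invented for illustration; the only figures taken from the quote are the ~5B+ parameters loaded under standard execution and the ~1.91B effective parameters:

```python
# Illustrative accounting only: the individual group sizes are made up;
# only the ~5B loaded / ~1.91B effective totals come from Google's description.
param_groups = {
    "text":   1.91e9,  # stays in accelerator RAM (the "effective" set)
    "visual": 0.30e9,  # skipped entirely for text-only use
    "audio":  0.40e9,  # skipped entirely for text-only use
    "ple":    2.50e9,  # per-layer embeddings, cached to fast local storage
}

standard_load = sum(param_groups.values())   # everything in RAM: ~5.1B
effective_load = param_groups["text"]        # with skipping + PLE caching: ~1.91B

print(f"standard execution: {standard_load / 1e9:.2f}B params in RAM")
print(f"effective load:     {effective_load / 1e9:.2f}B params in RAM")
```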
Experimenters in the open source tinkering community have done the opposite (copy/pasting layers in existing models to make them deeper) and it seems to work... fine, with minimal post-training on the new, deeper model required to exceed the performance of the original model. So it's not a crazy idea.
> Gemma 3n leverages a Google DeepMind innovation called Per-Layer Embeddings (PLE) that delivers a significant reduction in RAM usage.
Like you I’m also interested in the architectural details. We can speculate but we’ll probably need to wait for some sort of paper to get the details.
PLE is most likely a reference to a Per-Layer Embeddings paper that will be published in the future, once it no longer gives away any secret sauce.
At a very high level, instead of having embeddings only at the input layer, this method keeps embeddings at the layer level. That is, every transformer layer has its own set of learnable embedding vectors that are used to modify the hidden states flowing through the network. Mostly, the embeddings are precomputed and stored separately; they are queried at inference time with very low latency, so you can get comparable performance with half the RAM. (I'm not exactly sure how 3n is doing it; I'm speaking in a general sense.)
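A toy sketch of that general idea, to make it concrete. This is just my reading of the description above, not how Gemma 3n actually implements it; the gating/addition into the hidden states and the idea of keeping the table off-accelerator are assumptions:

```python
import torch
import torch.nn as nn

class PerLayerEmbeddingBlock(nn.Module):
    """Toy transformer block that owns its own per-layer embedding table."""

    def __init__(self, d_model: int = 256, vocab_size: int = 32000, ple_dim: int = 64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        # This layer's own embedding table. In the scheme described above it
        # could be precomputed and kept in cheap storage / CPU RAM, then
        # looked up on demand instead of living on the accelerator.
        self.ple = nn.Embedding(vocab_size, ple_dim)
        self.ple_proj = nn.Linear(ple_dim, d_model)

    def forward(self, h: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(h, h, h)
        h = h + attn_out
        h = h + self.mlp(h)
        # Fold this layer's embedding of the input tokens back into the
        # hidden states (the "modify the processed hidden states" part).
        return h + self.ple_proj(self.ple(token_ids))

# Smoke test with tiny shapes: batch of 2 sequences, 16 tokens each.
block = PerLayerEmbeddingBlock()
h = torch.randn(2, 16, 256)
ids = torch.randint(0, 32000, (2, 16))
out = block(h, ids)  # (2, 16, 256)
```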
On the other hand, I'm really not looking forward to app sizes ballooning even more – there's no reasonable way to share them across apps at least on iOS, and I can absolutely imagine random corporate apps to start including LLMs, just because it's possible.
Though, I won't be surprised if they try to force devs to use their models for "privacy" (and not monopolistic reasons, of course).
Why, on Earth, would Apple ever want to solve the problem of Apps taking up more space? That's just not good business. Way better business right now to put R&D into increased memory access speeds.
Apple would need to have a different business model to have a business case for fixing this. They may fix it because they just want to help out the AI guys? Maybe they're getting money from them or something?
But all other things being equal, the money for Apple is in this not being fixed.
However, there are now alternatives like the official RPi AI Hat, which has roughly 3x to 6x the TOPS (4 for the Coral vs 13/26 for the RPi, depending on the model), so there is that. 20 TOPS on an RPi 5, complete with a nice vertically integrated camera etc., is quite interesting.
I've also seemingly hit a rate limit on Gemini Pro 2.5 (on an account not subscribed to Gemini Advanced) yesterday, even though my last query was weeks ago.
Possibly there's a capacity shortage (I'd presume it all runs on the same Google hardware in the end), and they are prioritizing paid inference?
E4B has a score of 44.4 on the Aider polyglot leaderboard, which means it's on par with gemini-2.5-flash (not the latest preview, but the version used for the benchmark on Aider's website), gpt-4o, and gpt-4.5.
That sounds very good; imagine what a coding-focused version of this could do if this is a "generic" embedded-only model.
On the other hand, this does have a much lower score on LiveCodeBench.
Also:
> These models were evaluated at full precision (float32)
For 4B effective parameters, that's 16 GB of RAM.
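The arithmetic behind that, and what you'd roughly expect at lower precisions (weights only; this ignores activations, KV cache, and runtime overhead):

```python
# Weights-only memory for 4e9 effective parameters at different precisions.
params = 4e9
for dtype, bytes_per_param in [("float32", 4), ("float16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{dtype:>13}: {params * bytes_per_param / 2**30:5.1f} GiB")
# float32 comes out to ~14.9 GiB, i.e. the ~16 GB (decimal) figure above.
```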
Download the Edge Gallery APK from GitHub: https://github.com/google-ai-edge/gallery/releases/tag/1.0.0
Download one of the .task files from huggingface: https://huggingface.co/collections/google/gemma-3n-preview-6...
Import the .task file in Edge Gallery with the + button at the bottom right.
You can take pictures right from the app. The model is indeed pretty fast.
Hint: you have to set Max tokens to 32000 for longer conversations. The slider makes it look like it's limited to 1024; just enter the value manually.
https://huggingface.co/collections/google/gemma-3n-preview-6...
Gemma 3n Preview
google/gemma-3n-E4B-it-litert-preview
google/gemma-3n-E2B-it-litert-preview
Interesting; hope it comes to LM Studio as MLX or GGUF. Sparse and/or MoE models make a real difference when running locally. MoE Qwen3-30B-A3B is the most recent game changer for me: activating only ~3B weights on the GPU cores of the sparse Qwen3-30B-A3B, rather than the comparable ~30B of dense models (Qwen3-32B, Gemma3-27b, GLM-{4,Z1}-32B, older QwQ-32B), is a huge speedup for me. The MoE A3B achieves 20-60 tps on my oldish M2 in LM Studio, versus only 4-5 tps for the dense models.
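For anyone wondering why activating ~3B of 30B parameters is such a win: a minimal sketch of top-k expert routing, which is roughly what the A3B naming refers to (the dimensions, expert count, and k here are made up, not Qwen3's actual config):

```python
import torch
import torch.nn as nn

class ToyMoEFFN(nn.Module):
    """Minimal top-k mixture-of-experts feed-forward layer (illustrative only)."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048, n_experts: int = 16, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)  # pick k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():  # only k of n_experts actually run per token
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

x = torch.randn(8, 512)    # 8 tokens
y = ToyMoEFFN()(x)         # each token only touched 2 of the 16 expert FFNs
```

Per token, only k of the expert FFNs actually run, which is why throughput tracks the ~3B active parameters rather than the 30B total; the full 30B still has to fit in (unified) memory, though.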
Looking forward to trying gemma-3n. Kudos to Google for open-sourcing their Gemmas. I would not have predicted that the lab with "open" in the name has yet to release even v1 (at the moment at 0, disregarding GPT-2), while other, more commercial labs are at versions 3, 4, etc. already.
Seems like we will not be able to run this with Llama and friends.
Also it's funny that they are saying that Llama 4 Maverick performs about the same as GPT-4.1 Nano.
onlyrealcuzzo•2h ago
Gemma 3n is a model utilizing Per-Layer Embeddings to achieve an on-device memory footprint of a 2-4B parameter model.
At the same time, it performs nearly as well as Claude 3.7 Sonnet in Chatbot Arena.
ai-christianson•2h ago
What's the catch?
Vuizur•1h ago
refulgentis•1h ago
I can give more details if you (or anyone else!) are interested.
TL;DR: it is effectively scoring only "How authoritative did the answer look? How much flattery and how many emojis?"
Jowsey•52m ago
refulgentis•38m ago
Not sure if they've shared more since.
IMVHO it won't help, at all, even if they trained a perfect model that could accurately penalize it*
The main problem is that it's one-off responses, A/B tested. There's no way to connect it to all the stuff we're using to do work these days (i.e. tools / MCP servers), so at this point it's sort of skipping the hard problems we'd want to see graded.
(This situation is an example: what's more likely, that style control is a small idea for an intractable problem, or that Google has now released multiple free models better than Sonnet, including the latest at only 4B params?
To my frustration, I have to go and bench these things myself because I have an AI-agnostic app I build, but I can confirm it is not the case that Gemma 3-not-n is better than Sonnet. 12B can half-consistently make file edits, which is a major step forward for local tbh)
* In this case I'm also setting aside how that could be done; I'm just saying it as an illustration that, no matter what, it's the wrong platform for a "how intelligent is this model?" signal at this point, post-Eliza, post-Turing, a couple of years out from ChatGPT 1.0.
Deathmax•2h ago
zamadatix•1h ago
esafak•2h ago
edit: I seem to be the only one excited by the possibilities of such small yet powerful models. This is an iPhone moment: a computer that fits in your pocket, except this time it's smart.
codr7•1h ago
esafak•1h ago
codr7•1h ago
TeMPOraL•1h ago
rhdjsjebshjffn•1h ago
TeMPOraL•1h ago
I'd go as far as saying LLMs are meaning made incarnate - that huge tensor of floats represents a stupidly high-dimensional latent space, which encodes the semantic similarity of every token, and of combinations of tokens (up to a limit). That's as close to reifying the meaning of "meaning" itself as we'll ever come.
(It's funny that we got there through brute force instead of developing philosophy, and it's also nice that we get a computational artifact out of it that we can poke and study, instead of incomprehensible and mostly bogus theories.)
rhdjsjebshjffn•42m ago
To anyone who questions why we might produce such a machine, I ask them to kill themselves out of pity for myself so that I am not obligated to perform such a task.
rhdjsjebshjffn•1h ago
onlyrealcuzzo•1h ago
goatlover•1h ago
rhdjsjebshjffn•1h ago
Now would I take AI as a trivia partner? Absolutely. But that's not really the same as what I look for in "smart" humans.