And for that matter, what is
>mix’n’match capability in Gemma 3n to dynamically create submodels
It seems like mixture-of-experts taken to the extreme, where you actually create an entire submodel instead of routing per token?
> Gemma 3n models are listed with parameter counts, such as E2B and E4B, that are lower than the total number of parameters contained in the models. The E prefix indicates these models can operate with a reduced set of Effective parameters. This reduced parameter operation can be achieved using the flexible parameter technology built into Gemma 3n models to help them run efficiently on lower resource devices.
> The parameters in Gemma 3n models are divided into 4 main groups: text, visual, audio, and per-layer embedding (PLE) parameters. With standard execution of the E2B model, over 5 billion parameters are loaded when executing the model. However, using parameter skipping and PLE caching techniques, this model can be operated with an effective memory load of just under 2 billion (1.91B) parameters, as illustrated in Figure 1.
Experimenters in the open source tinkering community have done the opposite (copy/pasting layers in existing models to make them deeper) and it seems to work... fine, with minimal post-training on the new, deeper model required to exceed the performance of the original model. So it's not a crazy idea.
It seems to be embedding from 262k possible vocab tokens down to 256 dims. 262144 matches the same vocab size used for the existing Gemma model, so it really does seem to be an embedding of the input token directly, fed into each layer.
I guess intuitively it might help the model somewhat for later layers to have direct access to the input query without needing to encode it in the residual stream, and it can use those parameters for something else. I'm kind of surprised no one tried this before, if the idea is that simple? Reminds me of resnet where you have the "skip" layers so future layers can access the input directly.
Edit: As for what exactly the embedding is used for, it could be that the embedding is still used for something more clever than induction head-type stuff. Responses in [1] suggest it might be some low-rank data/token dependent signal that can be "factored out"/precomputed. Another clever suggestion was that it's a per-layer input-token-derived control/steering vector.
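If that reading is right, a block might look very roughly like the sketch below. This is pure speculation on my part: the 262144-token vocab and 256-dim width come from the comments above, while the hidden size, the projection, and where the embedding gets added are all made up for illustration.

    import torch
    import torch.nn as nn

    VOCAB = 262_144      # vocab size mentioned above
    PLE_DIM = 256        # per-layer embedding width seen in the model files
    D_MODEL = 2048       # hidden size: made up for illustration

    class BlockWithPLE(nn.Module):
        """A transformer block that also sees a small embedding of the raw input token."""
        def __init__(self):
            super().__init__()
            self.per_layer_emb = nn.Embedding(VOCAB, PLE_DIM)        # this layer's own table
            self.ple_proj = nn.Linear(PLE_DIM, D_MODEL, bias=False)  # fold it into the stream
            self.mlp = nn.Sequential(                                # attention omitted for brevity
                nn.Linear(D_MODEL, 4 * D_MODEL), nn.GELU(), nn.Linear(4 * D_MODEL, D_MODEL))

        def forward(self, hidden, input_ids):
            # Residual update plus a ResNet-style "skip" straight from the input tokens.
            ple = self.ple_proj(self.per_layer_emb(input_ids))
            return hidden + self.mlp(hidden) + ple

    hidden = torch.zeros(1, 8, D_MODEL)             # (batch, seq, d_model)
    input_ids = torch.randint(0, VOCAB, (1, 8))     # (batch, seq)
    print(BlockWithPLE()(hidden, input_ids).shape)  # torch.Size([1, 8, 2048])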
> Gemma 3n leverages a Google DeepMind innovation called Per-Layer Embeddings (PLE) that delivers a significant reduction in RAM usage.
Like you I’m also interested in the architectural details. We can speculate but we’ll probably need to wait for some sort of paper to get the details.
PLE is much more likely to be a reference to the Per-Layer Embeddings paper that will be published in the future once it doesn't give away any secret sauce anymore.
At a very high level, instead of having embeddings only at the input layer, this method keeps embeddings at the layer level. That is, every transformer layer has its own set of learnable embedding vectors that are used to modify the hidden states flowing through the network. The embeddings are mostly precomputed and stored separately; they are queried at inference time with very low latency, so you can get comparable performance with half the RAM. (I'm not exactly sure how 3n is doing it, just talking in a general sense.)
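In that general sense (not claiming this is what 3n actually does), the memory argument is easy to see: the per-layer tables are big in total, but each token only ever touches one row per layer, so they can live outside fast accelerator memory and be fetched on demand. The layer count below is a guess; the vocab size and width are the ones mentioned elsewhere in the thread.

    VOCAB, N_LAYERS, PLE_DIM = 262_144, 30, 256   # layer count is a guess

    table_params = VOCAB * N_LAYERS * PLE_DIM
    print(f"full PLE tables: {table_params / 1e9:.2f}B params, "
          f"{table_params * 2 / 2**30:.1f} GiB in fp16")          # ~2.01B params, ~3.8 GiB

    # Per generated token you only need one 256-wide row per layer:
    per_token_bytes = N_LAYERS * PLE_DIM * 2
    print(f"{per_token_bytes / 1024:.0f} KiB fetched per token")  # 15 KiB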
It's fine to speculate based on the name, but don't pretend that it's a known technique when it clearly isn't.
I have not looked at the model, but since the embedding dimension of 256 seems quite small (for reference, according to [3] the old Gemma 1B had a 1152-dimensional input embedding), I'm guessing that this is not done _in lieu_ of the main input embedding to the first layer, but in addition to it.
[1] https://twitter.com/cccntu/status/1925043973170856393
[2] https://news.ycombinator.com/edit?id=44048662
[3] https://developers.googleblog.com/en/gemma-explained-whats-n...
On the other hand, I'm really not looking forward to app sizes ballooning even more – there's no reasonable way to share them across apps at least on iOS, and I can absolutely imagine random corporate apps to start including LLMs, just because it's possible.
Though, I won't be surprised if they try to force devs to use their models for "privacy" (and not monopolistic reasons, of course).
Why on Earth would Apple ever want to solve the problem of apps taking up more space? That's just not good business. Way better business right now to put R&D into increased memory access speeds.
Apple would need to have a different business model entirely to have a business case for fixing this. They may fix it because they just want to help out the AI guys? Maybe in the future they'll be getting money from the AI guys or something? Then fixing it starts to make a lot of sense.
But all other things being equal, the money for Apple is in this not being fixed.
To make their devices more pleasant / less frustrating to use.
They’ve got a long track record of introducing features that reduce app size, speed up installs, and reduce storage use from infrequently used apps - there’s no reason to believe they’d stop doing that except for cynical vitriol.
I occasionally post decompositions of public .ipa's on the App Store, and I'm looking forward to seeing how these change over the next year.
On top of the already hefty markup, they don't even take storage capacity into consideration for trade-ins.
Only half joking. I really do think the majority of that revenue will be going away.
Given the most "modern" and "hip" way of shipping desktop applications seems to be for everyone and their mother to include a browser runtime together with their 1MB UI, don't get your hopes up.
However, there are now alternatives like the official RPi AI Hat, which has roughly 3x to 6x the TOPS (4 for the Coral vs 13/26 for the RPi, depending on the model), so there is that. 20 TOPS on an RPi 5 - complete with a nice vertically integrated camera etc. - is quite interesting.
I've also seemingly hit a rate limit on Gemini Pro 2.5 (on an account not subscribed to Gemini Advanced) yesterday, even though my last query was weeks ago.
Possibly there's a capacity shortage (I'd presume it all runs on the same Google hardware in the end), and they are prioritizing paid inference?
This sounds unintuitive, but in Google's defense the rate limit errors include a link to docs that explain this.
https://www.reddit.com/r/LocalLLaMA/comments/1kr8s40/comment...
E4B has a score of 44.4 on the Aider polyglot dashboard, which means it's on par with gemini-2.5-flash (not the latest preview, but the version used for the bench on Aider's website), gpt-4o, and gpt-4.5.
That sounds very good - imagine what a coding-focused version of this could do, if this is a "generic" embedded-only model.
On the other hand - this does have a much lower score for livecodebench.
Also:
> These models were evaluated at full precision (float32)
For 4B effective parameters, that's 16 GB of RAM.
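Back-of-envelope (weights only, ignoring KV cache and activations; the quantized sizes are the usual rough rules of thumb, not official figures):

    params = 4e9                                   # ~4B effective parameters
    for name, bytes_per_param in [("float32", 4), ("fp16/bf16", 2), ("int4", 0.5)]:
        gib = params * bytes_per_param / 2**30
        print(f"{name:>10}: ~{gib:.1f} GiB")
    # float32 ~14.9 GiB (the "16 GB" above, in decimal GB), fp16 ~7.5 GiB, int4 ~1.9 GiB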
Download the Edge Gallery apk from github: https://github.com/google-ai-edge/gallery/releases/tag/1.0.0
Download one of the .task files from huggingface: https://huggingface.co/collections/google/gemma-3n-preview-6...
Import the .task file in Edge Gallery with the + bottom right.
You can take pictures right from the app. The model is indeed pretty fast.
Hint: You have to set the Max tokens to 32000 for longer conversations. The slider makes it look like it's limited to 1024, just enter it manually.
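If you'd rather script the model download than click through, something like this should work too. The filename is my guess - check the repo's file list - and the preview repos are gated, so you need to accept the license on Hugging Face and `huggingface-cli login` first.

    from huggingface_hub import hf_hub_download

    # Repo id is from the collection above; the exact .task filename may differ.
    path = hf_hub_download(
        repo_id="google/gemma-3n-E2B-it-litert-preview",
        filename="gemma-3n-E2B-it-int4.task",
    )
    print(path)  # then push the .task file to the phone and import it via the + button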
Okay, perhaps my phone's not great and perhaps this isn't optimized/pruned for phone use, but it's unusably slow. The answers are solid from my brief test.
I wouldn't exactly say phone use, unless you have no internet and you don't mind a bit of a wait.
Really impressive, regardless.
Or, run them on a microcontroller! https://github.com/tensorflow/tflite-micro
Final stats:
15.9 seconds to first token
16.4 tokens/second prefill speed
0.33 tokens/second decode speed
662 seconds to complete the answer
First image ('Describe', photo of my desk)
- 15.6 seconds to first token
- 2.6 tokens/second
- Total 180 seconds
Second image ('What can you see?', photo of a bowl of pasta)
- 10.3 seconds to first token
- 3.1 tokens/second
- Total 26 seconds
The Edge Gallery app defaults to CPU as the accelerator. Switched to GPU.
Pasta / what can you see:
- It actually takes a full 1-2 minutes to start printing tokens. But the stats say 4.2 seconds to first token...
- 5.8 tokens/second
- 12 seconds total
Desk / describe:
- The output is: while True: print("[toxicity=0]")
- Bugged? I stopped it after 80 seconds of output. 1st token after 4.1 seconds, then 5.7 tokens/second.
Pixel Fold was in the Pixel 8 generation but uses the Tensor G2 from the 7s. Pixel 7 release date = October 2022
That's a 26 month difference, yet a full order of magnitude difference in token generation rate on the CPU. Who said Moore's Law is dead? ;)
CPU:
7.37 seconds to first token
35.55 tokens/second prefill speed
7.09 tokens/second decode speed
27.97 seconds to complete the answer
GPU:
1.96 seconds to first token
133.40 tokens/second prefill speed
7.95 tokens/second decode speed
14.80 seconds to complete the answer
("What can you see?"; photo of small monitor displaying stats in my home office)
1st token: 7.48s
Prefill speed: 35.02 tokens/s
Decode speed: 5.72 tokens/s
Latency: 86.88s
It did a pretty good job. The photo had lots of glare and was at a bad angle and a distance, with small text; it picked out weather, outdoor temperature, CO2/ppm, temp/C, and pm2.5/ug/m^3 in the office. It misread "Homelab" as "Household" but got the UPS load and power correctly, misread "Homelab" again (smaller text this time) as "Hereford" but got the power in W, and misread "Wed May 21" on the weather map as "World May 21".
Overall very good considering how poor the input image was.
Edit: E4B
Stats -
CPU -
first token - 4.52 sec
prefill speed - 57.50 tokens/s
decode speed - 10.59 tokens/s
Latency - 20.66 sec
GPU -
first token - 1.92 sec
prefill speed - 135.35 tokens/s
decode speed - 11.92 tokens/s
Latency - 9.98 sec
Although my entire use case for local models is amoral questions, which it blocks. Excited for the abliterated version.
I just installed the apk on a GrapheneOS endpoint (old Pixel 7 Pro) without the Google Play Services installed. The app requires network access to contact Hugging Face and download the model through your HF account. It also requires some interaction/permission agreement with Kaggle. Upon install _with_ network perms the app works, and I'm getting decent performance on the Gemma-3n-E2B-it-int4 model (5-6 token/s). Ok, cool.
Now kill the app, disable network permissions and restart it. Choose one of the models that you downloaded when it had network access. It still works. It does appear to be fully local. Yay.
And you really notice that the model is dumber on GPU, because OpenGL doesn't take accuracy that seriously.
[0] https://blog.tensorflow.org/2020/08/faster-mobile-gpu-infere...
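Mobile GPU delegates typically run in fp16; a toy illustration of how much precision that costs compared to fp32, independent of any particular model:

    import numpy as np

    print(np.finfo(np.float32).eps)   # ~1.19e-07
    print(np.finfo(np.float16).eps)   # ~9.77e-04, i.e. roughly 3 decimal digits

    # Accumulating many small contributions, as matmuls/attention do constantly:
    acc32, acc16 = np.float32(0.0), np.float16(0.0)
    for _ in range(4096):
        acc32 += np.float32(0.001)
        acc16 += np.float16(0.001)
    print(acc32)   # ~4.096
    print(acc16)   # 4.0 -- the half-precision sum drifts, then stops moving entirely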
https://huggingface.co/collections/google/gemma-3n-preview-6...
Gemma 3n Preview
google/gemma-3n-E4B-it-litert-preview
google/gemma-3n-E2B-it-litert-preview
Interesting; hope it comes to LM Studio as MLX or GGUF. Sparse and/or MoE models make a difference when running on localhost. MoE Qwen3-30B-A3B was the most recent game changer for me. Activating only 3B weights on the GPU cores of sparse Qwen3-30B-A3B, rather than the comparable ~30B of dense models (Qwen3-32B, Gemma3-27b, GLM-{4,Z1}-32B, older QwQ-32B), is a huge speedup for me: MoE A3B achieves 20-60 tps on my oldish M2 in LM Studio, versus only 4-5 tps for the dense models.
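Back-of-envelope for why the A3B speedup is roughly what you'd expect: single-stream decode is mostly memory-bandwidth bound, so tokens/s scales with how many weight bytes you stream per token. The bandwidth figure and quantization below are rough assumptions, and KV cache/overhead are ignored.

    BANDWIDTH = 100e9            # bytes/s, rough guess for an M2-class machine
    BYTES_PER_PARAM = 0.5        # ~4-bit quantized weights

    def rough_decode_tps(active_params):
        return BANDWIDTH / (active_params * BYTES_PER_PARAM)

    print(f"dense ~32B active: ~{rough_decode_tps(32e9):.0f} tok/s")  # ~6
    print(f"MoE    ~3B active: ~{rough_decode_tps(3e9):.0f} tok/s")   # ~67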
Looking forward to trying gemma-3n. Kudos to Google for open sourcing their Gemmas. Would not have predicted that the lab with "open" in the name has yet to release even v1 (at the moment at 0, disregarding GPT-2), while other, more commercial labs are at versions 3, 4, etc. already.
Seems like we will not be able to run this with Llama and friends.
Also it's funny that they are saying that Llama 4 Maverick performs about the same as GPT-4.1 Nano.
Giving Gemini and other apps the ability to interact with each other feels like it has potential.
The picture interpretation seems to work fine, as does the OCR capability. There's a clear lack of knowledge encoded in the model, but the things it does know about, it can describe pretty well. Impressive for a model only a bit larger than a DVD.
Loads pretty fast, starts to reply near-instantly (text chat mode).
Doesn't answer questions like "when is your cutoff date".
Apparently gives "May 15 2024" as today's date, which probably explains why it answered Joe Biden when asked who the US president is.
Am I missing something?
Wouldn't matter for a tech demo, but once you deploy this on all Android devices, it matters.
In the video they've added to the announcement, they show some live interaction with the model (which is quite fast compared to the AI Edge Gallery app). How is it built, and how can I use it like that?
# Is Eiffel Tower or a soccer ball bigger ?
> A soccer ball is bigger than the Eiffel Tower! Here's a breakdown:
> Eiffel Tower: Approximately 330 meters (1,083 feet) tall.
> Soccer Ball: A standard soccer ball has a circumference of about 68-70 cm (27-28 inches).
> While the Eiffel Tower is very tall, its base is relatively small compared to its height. A soccer ball, though much smaller in height, has a significant diameter, making it physically larger in terms of volume.
That’s all I really want for Christmas.
Wow!!
It is pretty good indeed (despite the ~30sec input limit), but this feels unrelated to the topic at hand.
onlyrealcuzzo•1mo ago
Gemma 3n is a model utilizing Per-Layer Embeddings to achieve an on-device memory footprint of a 2-4B parameter model.
At the same time, it performs nearly as well as Claude 3.7 Sonnet in Chatbot Arena.
ai-christianson•1mo ago
What's the catch?
refulgentis•1mo ago
I can give more details if you (or anyone else!) are interested.
TL;DR: it ends up scoring only "How authoritative did the answer look? How much flattery & how many emojis?"
refulgentis•1mo ago
Not sure if they've shared more since.
IMVHO it won't help, at all, even if they trained a perfect model that could accurately penalize it*
The main problem is that it's one-off responses, A/B tested. There's no way to connect it to all the stuff we're using to do work these days (i.e. tools / MCP servers), so at this point it's sort of skipping the hard problems we'd want to see graded.
(This situation is an example: what's more likely - that style control is a small idea for an intractable problem, or that Google has now released multiple free models better than Sonnet, including the latest, at only 4B params?
To my frustration, I have to go and bench these things myself because I have an AI-agnostic app I build, but I can confirm it is not the case that Gemma 3-not-n is better than Sonnet. 12B can half-consistently make file edits, which is a major step forward for local tbh)
* I'm not sure how - "correctness" is a confounding metric here: we're probably much more likely to describe a formatted answer in negative terms if the answer is incorrect.
In this case I'm also setting aside how that could be done; I'm just saying it as an illustration that, no matter what, it's the wrong platform for a "how intelligent is this model?" signal at this point - post-Eliza, post-Turing, a couple of years out from ChatGPT 1.0.
int_19h•1mo ago
In general, all those scores are mostly useful to filter out the models that are blatantly and obviously bad. But to determine whether the model is actually good at any specific thing that you need, you'll have to evaluate them yourself to find out.
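The "evaluate it yourself" part doesn't have to be elaborate either - a handful of prompts from your actual workload with a crude pass/fail check already tells you more than a leaderboard. Minimal sketch; `ask_model` is a placeholder to wire up to whatever runtime you use.

    CASES = [
        ("Extract the email from: 'Contact Jo at jo@example.com.'", "jo@example.com"),
        ("What is 17 * 23?", "391"),
        # ...add cases from your real workload
    ]

    def ask_model(prompt: str) -> str:
        raise NotImplementedError("wire this to your local runtime (llama.cpp, LM Studio, ...)")

    def run_eval() -> None:
        passed = sum(expected in ask_model(prompt) for prompt, expected in CASES)
        print(f"{passed}/{len(CASES)} passed")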
esafak•1mo ago
edit: I seem to be the only one excited by the possibilities of such small yet powerful models. This is an iPhone moment: a computer that fits in your pocket, except this time it's smart.
TeMPOraL•1mo ago
I'd go as far as saying LLMs are meaning made incarnate - that huge tensor of floats represents a stupidly high-dimensional latent space, which encodes semantic similarity of every token, and combinations of tokens (up to a limit). That's as close to reifying the meaning of "meaning" itself as we've ever come.
(It's funny that we got there through brute force instead of developing philosophy, and it's also nice that we get a computational artifact out of it that we can poke and study, instead of incomprehensible and mostly bogus theories.)
croes•1mo ago
https://neurosciencenews.com/llm-ai-logic-27987/
otabdeveloper4•1mo ago
Ask these questions again in two years when the next winter happens.
dinfinity•1mo ago
They trained models on only task specific data, not on a general dataset and certainly not on the enormous datasets frontier models are trained on.
"Our training sets consist of 2.9M sequences (120M tokens) for shortest paths; 31M sequences (1.7B tokens) for noisy shortest paths; and 91M sequences (4.7B tokens) for random walks. We train two types of transformers [38] from scratch using next-token prediction for each dataset: an 89.3M parameter model consisting of 12 layers, 768 hidden dimensions, and 12 heads; and a 1.5B parameter model consisting of 48 layers, 1600 hidden dimensions, and 25 heads."
Zambyte•1mo ago
If you don't believe me, here is a fun mental exercise: define "understand" and "reason" in a measurable way, that includes humans but excludes LLMs.
otabdeveloper4•1mo ago
> The `foobar` is also incorrect. It should be a valid frobozz, but it currently points to `ABC`, which is not a valid frobozz format. It should be something like `ABC`.
Where the two `ABC`s are the exact same string of tokens.
Obviously nonsense to any human, but a valid LLM output for any LLM.
This is just one example. Once you start using LLMs as tools instead of virtual pets you'll find lots more similar.
croes•1mo ago
But an LLM is sometimes the genius and sometimes the idiot.
That doesn't happen often if you always talk to the same person.
otabdeveloper4•1mo ago
But for LLMs these kinds of illogical statements are common.
LLMs don't think; what they do cannot be in good faith called "thinking" by any definition.
rhdjsjebshjffn•1mo ago
Bruh. Do you need to be paid to interact in good faith or were you raised to be social
rhdjsjebshjffn•1mo ago
Now would I take AI as a trivia partner? Absolutely. But that's not really the same as what I look for in "smart" humans.
hmapple•1mo ago
If not, I strongly encourage you to discuss your area of expertise with it and rate based on that
It is incredibly competent
rhdjsjebshjffn•1mo ago
But maybe I'm just not looking for a trivia partner in my software.
selcuka•1mo ago
Note that "smarter than smart humans" and "smarter than most humans" are not the same. The latter is a pretty low bar.
anonzzzies•1mo ago
Compared to what you are used to, right?
I know it's elitist, but for most people <=100 IQ (and no, this is not exact obviously, but we don't have many other things to go by), a lot of state-of-the-art LLMs are just better at everything by comparison - outside of body "things" (for now), of course, as they don't have one. They hallucinate/bluff/lie about as much as the humans do, and the humans might at least know when they don't know, but outside of that, the LLMs win at everything. So I guess that, for now, people with 120-160 IQs find LLMs funny but wouldn't call them intelligent, but below that...
My circle of people I talk with during the day has changed since I took on more charity work, which consists of fixing up old laptops and installing Ubuntu on them; I get them for free from everyone and I give them to people who cannot afford one, including some lessons and remote support (which is easy as I can just ssh in via tailscale). Many of them believe in chemtrails, that vaccinations are a government ploy, etc., and multiple have told me they read that these AI chatbots are Nigerian or Indian (or similar) farms trying to defraud them of "things" (they usually don't have anything to defraud, otherwise I would not be there). This is about half of humanity; Gemma is going to be smarter than all of them, even though I don't register any LLM as intelligent, and with the current models it won't happen either. Maybe a breakthrough in models will change that, but there's not much chance of it yet.
disgruntledphd2•1mo ago
This is incorrect: IQ tests are normally scaled such that average intelligence is 100, and such that scores are approximately normally distributed, so most people will be somewhere between 85 and 115 (about 68%).
anonzzzies•1mo ago
85 is special housing where I live... LLMs are far beyond that now.
rhdjsjebshjffn•1mo ago
Not only is this unbelievable, it's reprehensible
selcuka•1mo ago
> average intelligence is 100
You both are saying the same thing.
IQ is defined such that both the mean and the median equal 100. The combination of sub-100 and exactly-100 covers more people than above-100, hence "most people <=100 IQ".
rhdjsjebshjffn•1mo ago
I refuse to touch the IQ bait
nsonha•1mo ago
Leave that part out and I'm excited. I'd love to see this play some role in "inference caching", to reduce dependence on external services.
If only agents could plan and match patterns of tasks locally, and only needed real intelligence for self-contained/computationally heavy tasks.
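That split is basically a router in front of two backends: default to the small local model, escalate only when the task looks heavy. A sketch with placeholder backends and a stand-in heuristic, not any particular framework:

    HEAVY_HINTS = ("prove", "refactor", "entire codebase", "long document")

    def looks_heavy(task: str) -> bool:
        # Crude routing heuristic; a small local model could also make this call itself.
        return len(task) > 2000 or any(hint in task.lower() for hint in HEAVY_HINTS)

    def run_local(task: str) -> str:
        raise NotImplementedError("on-device model, e.g. Gemma 3n via your runtime of choice")

    def run_remote(task: str) -> str:
        raise NotImplementedError("hosted frontier model, used only when needed")

    def handle(task: str) -> str:
        return run_remote(task) if looks_heavy(task) else run_local(task)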