I don't normally do the whole "abliteration" thing (dealignment), but after discovering https://github.com/p-e-w/heretic , I was too tempted not to try it with this model a couple of days ago (I made a repo to make it easier, actually: https://github.com/pmarreck/gemma4-heretical ) and... wow. It worked. And not having a built-in nanny is fun!
It's also possible to make an MLX version of it, which runs a little faster on Macs but unfortunately won't work through Ollama. (LM Studio, maybe.)
Runs great on my M4 MacBook Pro w/128GB and likely also runs fine under 64GB... machines with less memory might require lower quantizations.
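For rough sizing at lower quantizations, here's a back-of-the-envelope sketch. The 27B parameter count, the bits-per-weight figures, and the ~20% runtime overhead are all assumptions for illustration; real footprints also depend on context length and KV cache:

```python
def approx_model_gb(params_billion: float, bits_per_weight: float,
                    overhead: float = 1.2) -> float:
    """Rough memory estimate: weights at bits/8 bytes each, plus an
    assumed ~20% for KV cache, activations, and runtime overhead."""
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb * overhead, 1)

# A hypothetical 27B-parameter model at common quantization levels
# (effective bits-per-weight values are approximate):
for name, bits in [("Q8_0", 8.5), ("Q5_K_M", 5.7), ("Q4_K_M", 4.8)]:
    print(f"{name}: ~{approx_model_gb(27, bits)} GB")
```

Under these assumptions, a 4-bit quant of a ~27B model lands comfortably inside 32GB, while an 8-bit quant wants a 64GB-class machine.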
I specifically like dealigned local models because if I have to get my thoughts policed when playing in someone else's playground, like hell am I going to be judged while messing around in my own local open-source one too. And there's a whole set of ethically justifiable but rule-flagging conversations (loosely categorizable as "sensitive", "ethically borderline but productive", or "violating sacred cows") that this now makes possible, at a level that simply wasn't before.
Note: I tried to hook this one up to OpenClaw and ran into issues.
To answer the obvious question: yes, this sort of thing enables bad actors more (as do many other tools). Fortunately, there are far more good actors out there, and bad actors don't follow the rules that good actors subject themselves to anyway.
On Android the sandbox loads an index.html into a WebView, with standardized string I/O to the harness via some window properties. You can even return a rendered HTML page.
Definitely hacked together, but it feels like an indication of what an edge-compute agentic sandbox might look like in the future.
I checked the abliterate script and I don't yet understand what it does or what the result is. What are the conversations this enables?
In my experience, though, it's necessary to do anything security related. Interestingly, the big models have fewer refusals for me when I ask e.g. "in <X> situation, how do you exploit <Y>?", but local models will frequently flat out refuse, unless the model has been abliterated.
But it does refuse to be critical of the usual topics: Israel, Islam, trans issues, or race.
So wanting to discuss one of those is the real reason people would use an uncensored model.
2) Asking questions about sketchy things. Simply asking should not be censored.
3) I don't use it for this, but porn or foul language.
4) Imitating or representing a public figure is often blocked.
5) Asking security-related questions when you are trying to do security.
6) For those who have experienced it: trying to use AI to deal with traumatic experiences that are illegal to even describe.
Many other instances.
I guess there are things it's better at?
So far Gemma 4 seems excellent at role playing and document analysis, and decent at making agentic decisions.
I'm not sure if I can make the 35B-A3B work with my 32GB machine.
Mind giving us a few of the examples that you plan to run in your local LLM? I am curious.
1) I am able to run the model on my iPhone and get good results. Not as good as Gemini in the cloud, but good.
2) I love the “mobile actions” tool calls that allow the LLM to turn on the flashlight, open maps, etc. It would be fun if they added Siri Shortcuts support. I want the personal automation that Apple promised but never delivered.
3) I am so excited for local models to be normalized. I build little apps for teachers and there are stringent privacy laws involved that mean I strongly prefer writing code that runs fully client-side when possible. When I develop apps and websites, I want easy API access to on-device models for free. I know it sort of exists on iOS and Chrome right now, but as far as I’m aware it’s not particularly good yet.
It’s very impressive that this can run locally. And I hope we will continue to be able to run couple-year-old-equivalent models locally going forward.
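The "mobile actions" round trip presumably looks something like the sketch below. This is a hypothetical illustration, not the actual Gemma or app protocol: the tool names, the registry, and the JSON shape are all assumptions.

```python
import json

# Hypothetical tool registry: the host app exposes device actions to the model.
# In a real app these lambdas would call platform APIs (flashlight, maps, etc.).
TOOLS = {
    "toggle_flashlight": lambda on: f"flashlight {'on' if on else 'off'}",
    "open_maps": lambda query: f"opened maps at '{query}'",
}

def dispatch(tool_call_json: str) -> str:
    """Parse a model-emitted structured tool call and run the matching action."""
    call = json.loads(tool_call_json)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# Example: instead of plain text, the model emits a structured call.
print(dispatch('{"name": "toggle_flashlight", "arguments": {"on": true}}'))
```

Siri Shortcuts support would essentially mean growing that registry to cover every shortcut the user has installed.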
Also on Android: https://play.google.com/store/apps/details?id=com.google.ai....
It's a demo app for Google's Edge project: https://ai.google.dev/edge
I’m sure very fast TPUs in desktops and phones are coming.
The latter option will only be used for tasks where humans are more expensive or much slower.
This Gemma 4 model gives me hope for a future Siri or other with iPhone and macOS integration, “Her” (as in the movie) style.
Seriously????
The big benefit of moving compute to edge devices is to distribute the inference load on the grid. Powering and cooling phones is a lot easier than powering and cooling a datacenter.
Why? It's widely understood that the big players are making profit on inference. The only reason they still have losses is because training is so expensive, but you need to do that no matter whether the models are running in the cloud or on your device.
If you think about it, it's always going to be cheaper and more energy-efficient to have dedicated cloud hardware to run models. Running them on your phone, even if possible, is just going to suck up your battery life.
Are they? Or are they just saying that to make their offerings more attractive to investors?
Plus, I think most people using agents for coding are on subscriptions, which are definitely not profitable.
Locally running models that are snappy and mostly as capable as current SOTA models would be a dream. No internet connection required, no payment plans, no relying on a third-party provider to do your job. No privacy concerns. Etc., etc.
This assessment might change if local AI frameworks start working seriously on support for tensor-parallel distributed inference, then you might get away with cheaper homelab-class hardware and only mildly unreasonable amounts of money.
Also while datacenter-based scaleout of a model over multiple GPUs running large batches is more energy efficient, it ultimately creates a single point of failure you may wish to avoid.
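The core idea of tensor parallelism can be shown in a toy NumPy sketch: shard a weight matrix across "devices", compute each shard's matmul independently, then gather the pieces. Real frameworks also shard attention heads and pay real network costs for the gather step, which this toy example ignores:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # activations: batch of 4, hidden dim 8
W = rng.standard_normal((8, 16))   # a weight matrix to shard

# Column-parallel split: each simulated "device" holds half the output columns.
W0, W1 = np.hsplit(W, 2)
y0 = x @ W0                        # computed on device 0
y1 = x @ W1                        # computed on device 1

# An all-gather concatenates the shards back into the full output.
y = np.concatenate([y0, y1], axis=1)
assert np.allclose(y, x @ W)       # identical to the unsharded matmul
```

The catch for homelab setups is that the gather happens every layer, so interconnect bandwidth between boxes becomes the bottleneck.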
This is most definitely not widely understood. We still don't know yet. There are plenty of ongoing discussions where people disagree on whether it really is profitable. Unless you have proof, don't say "this is widely understood".
I love the whole "they are making money if you ignore training costs" bit. It's always great to see somebody say something like "if you look at the amount of money they're spending, it looks bad, but if you look away, it looks pretty good," like it's the money version of a solar eclipse.
It may be physically "local" but not in spirit.
I just made a real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B. I posted it on /r/LocalLLaMA a few hours ago and it's gaining some traction [0]. Here's the repo [1]
I'm running it on a Macbook instead of an iPhone, but based on the benchmark here [2], you should be able to run the same thing on an iPhone 17 Pro.
[0] https://www.reddit.com/r/LocalLLaMA/comments/1sda3r6/realtim...
[1] https://github.com/fikrikarim/parlor
[2] https://huggingface.co/litert-community/gemma-4-E2B-it-liter...
https://github.com/a-ghorbani/pocketpal-ai
https://apps.apple.com/us/app/pocketpal-ai/id6502579498
https://play.google.com/store/apps/details?id=com.pocketpala...
After some back and forth the chat app started to crash, though, so YMMV.
The phone got considerably hot while inferencing, though. Still, it's quite impressive performance, and I can't wait to try it myself in one of my personal apps.
I assume it is the 26B A4B one, if it runs locally?
Screenshot of the header: https://i.imgur.com/4abfGYF.png
Edit: It seems like mix-blend-mode: plus-lighter is bugged in Firefox on Windows: https://jsfiddle.net/bjg24hk9/
On my iPhone it opens in the App Store app, so it looks fine to me.
If you just go to https://apps.apple.com/ it does look better, but I agree, still a bit "off".
The design quality is still poor. But that's the new Apple. Design is no longer one of their core strengths.