If we can get this down to a single Raspberry Pi, then we have crazy embedded toys and tools. Locally, at the edge, with no internet connection.
Kids will be growing up with toys that talk to them and remember their stories.
We're living in the sci-fi future. This was unthinkable ten years ago.
I think there's something beautiful and important about the fact that parents shape their kids, passing on to them some of the best (and worst) aspects of themselves. The same goes for kids' interactions with other people.
The tech is cool. But I think we should aim to be thoughtful about how we use it.
What a radical departure from the social norms of childhood. Next you'll tell me that they've got an AI toy that can change their diaper and cook Chef Boyardee.
We're at the precipice of having a real "A Young Lady's Illustrated Primer" from The Diamond Age.
Graphics cards with a decent amount of memory are still massively overpriced (even used), big, noisy, and they draw a lot of energy.
Apple really is #2 and probably could be #1 in AI consumer hardware.
I'd try the whole AI thing on my work MacBook, but Apple's built-in AI features aren't available in my language, so perhaps that's also why I haven't heard anybody mention it.
The hard part is identifying those filter functions outside of the code domain.
> People don’t know what they want yet, you have to show it to them
Henry Ford famously quipped that had he asked his customers what they wanted, they would have wanted a faster horse.

You have to get into the highest 16-core M4 Max configurations to begin pulling away from that number.
Seems like at the consumer hardware level you just have to pick your poison, or whichever single factor you care about most. Macs with a Max or Ultra chip can have good memory bandwidth but low compute, though also ultra-low power consumption. Discrete GPUs have great compute and bandwidth but low-to-middling VRAM, plus high costs and power consumption. The unified-memory PCs like the Ryzen AI Max and the Nvidia DGX deliver middling compute, higher VRAM, and terrible memory bandwidth.
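To put rough numbers on the bandwidth point (my own back-of-the-envelope, with assumed figures rather than anything measured): single-stream decode tends to be memory-bandwidth-bound, so you can estimate a ceiling by dividing bandwidth by the bytes of weights you have to read per token.

```python
# Rough rule of thumb: single-stream decode is usually memory-bandwidth-bound,
# so tokens/sec is capped near bandwidth / bytes-of-weights-read-per-token.
def max_decode_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens/sec if each token requires one pass over the weights."""
    return bandwidth_gb_s / model_size_gb

# Illustrative, assumed figures only (an ~18 GB quantized model in each case):
for name, bw in [("high-bandwidth Mac", 800), ("discrete GPU", 1000), ("unified-memory mini PC", 270)]:
    print(f"{name}: ~{max_decode_tps(bw, 18):.0f} tok/s ceiling")
```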
Also, I don't think power consumption is important for AI. Typically you do AI at home or in the office, where there is plenty of electricity.
Being able to quickly calculate a dumb or unreliable result because you're VRAM-starved is not very useful for most scenarios. To run capable models you need VRAM, so high VRAM and lower compute is usually more useful than the inverse (a lot of both is even better, but you need a lot of money and power for that).
Even in this post with four RPis, the Qwen3 30B A3B is still an MoE model, not a dense model. It runs fast with only 3B active parameters and can be parallelized across computers, but it's much less capable than a dense 30B model running on a single GPU.
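To make the active-parameter point concrete, here's a rough sketch (my arithmetic, assuming ~4.5 bits per weight after quantization) of how much less memory traffic the MoE needs per generated token:

```python
# Why a 30B-A3B MoE decodes fast on modest hardware: each token only touches
# the ~3B active parameters, so far fewer weight bytes cross the memory bus.
BITS_PER_WEIGHT = 4.5  # assumed average for a ~Q4-style quantization

def gb_read_per_token(active_params_billion: float) -> float:
    return active_params_billion * 1e9 * BITS_PER_WEIGHT / 8 / 1e9

moe = gb_read_per_token(3)     # ~3B active params per token
dense = gb_read_per_token(30)  # a dense 30B touches everything
print(f"MoE ~{moe:.1f} GB/token vs dense ~{dense:.1f} GB/token ({dense / moe:.0f}x the traffic)")
```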
> Also, I don't think power consumption is important for AI. Typically you do AI at home or in the office, where there is plenty of electricity.
Depends on what scale you're discussing. If you want to match the VRAM of a 512GB Mac Studio Ultra with a bunch of Nvidia GPUs like RTX 3090 cards, you're not going to be able to run that on a typical American 15-amp circuit; you'll trip a breaker halfway there.
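Rough math behind that claim (my own estimate, assuming ~350 W per 3090 under load and 24 GB of VRAM each):

```python
# Back-of-the-envelope for matching 512 GB of memory with 24 GB RTX 3090s.
TARGET_GB = 512
GB_PER_CARD = 24
WATTS_PER_CARD = 350              # assumed typical load power per 3090
CIRCUIT_WATTS = 120 * 15 * 0.8    # 15 A at 120 V, derated to 80% for continuous load

cards = -(-TARGET_GB // GB_PER_CARD)   # ceiling division -> 22 cards
print(f"{cards} cards ~ {cards * WATTS_PER_CARD} W of GPUs vs ~{CIRCUIT_WATTS:.0f} W per circuit")
# Roughly 7.7 kW of GPU load against ~1.4 kW usable on one circuit.
```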
I would recommend sticking to macOS if compatibility and performance are the goal.
Asahi is an amazing accomplishment, but running native optimized macOS software including MLX acceleration is the way to go unless you’re dead-set on using Linux and willing to deal with the tradeoffs.
More devices mean faster performance, leveraging tensor parallelism and high-speed synchronization over Ethernet.
The maximum number of nodes is equal to the number of KV heads in the model (#70).
I found this[1] article nice for an overview of the parallelism modes.
[1]: https://medium.com/@chenhao511132/parallelism-in-llm-inferen...
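A minimal sketch (my illustration, not the project's actual code) of why the KV-head count caps the node count: tensor parallelism shards attention heads across nodes, each node needs at least one whole KV head, and the split has to divide evenly.

```python
# Why node count is capped by KV heads: tensor parallelism shards attention
# heads across nodes, and each node needs at least one whole KV head.
def usable_nodes(kv_heads: int, requested_nodes: int) -> int:
    """Largest node count <= requested that divides the KV heads evenly."""
    for n in range(min(requested_nodes, kv_heads), 0, -1):
        if kv_heads % n == 0:
            return n
    return 1

# Illustrative only: a model with 4 KV heads (e.g. GQA with few KV heads)
print(usable_nodes(kv_heads=4, requested_nodes=8))   # -> 4: can't go beyond 4 nodes
print(usable_nodes(kv_heads=4, requested_nodes=3))   # -> 2: must divide evenly
```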
I'm curious about the applications though. Do people randomly buy 4xRPi5s that they can now dedicate to running LLMs?
You'll be much better off spending that money on something else more useful.
Yeah, like a Mac Mini or something with better bandwidth.
Karpathy said in his recent talk, on the topic of AI developer-assistants: don't bother with less capable models.
So... using an RPi is probably not what you want.
Karpathy elides that he is one individual. Across a distribution of individuals, a nontrivial number will be fine with 5-10% off leading-edge performance[1]. Why? At the least, because it's free as in beer. Beyond that, concerns about connectivity, IP rights, and so on.
[1] GPT-5 finally dethroned Sonnet after 7 months
Interesting because he also said the future is small "cognitive core" models:
> a few billion param model that maximally sacrifices encyclopedic knowledge for capability. It lives always-on and by default on every computer as the kernel of LLM personal computing.
https://xcancel.com/karpathy/status/1938626382248149433#m
In which case, a raspberry Pi sounds like what you need.
For an LLM, size is a virtue - the larger a model is, the more intelligent it is, all other things being equal - and even aggressive distillation only gets you so far.
Maybe with significantly better post-training, a lot of distillation from a very large and very capable model, and extremely high quality synthetic data, you could fit GPT-5 Pro tier of reasoning and tool use, with severe cuts to world knowledge, into a 40B model. But not into a 4B one. And it would need some very specific training to know when to fall back to web search or knowledge databases, or delegate to a larger cloud-hosted model.
And if we had the kind of training mastery required to pull that off? I'm a bit afraid of what kind of AI we would be able to train as a frontier run.
We could go back and forth on this all day.
The high-end Pis aren’t $25, though.
Though I must admit to first noticing the trend decades before discovering Arduino, when I looked at the stack of 289, 302, and 351W intake manifolds on my shelf and realised that I needed the width of the 351W manifold but the fuel injection of the 302. Some things just never change.
An Intel Arc Pro B50 in a dumpster PC would serve you much better for this model (not enough RAM for a dense 30B, alas), getting close to 20 tokens a second, and it's so much cheaper.
though at what quality?
I get 8.2 tokens per second on a random Orange Pi board with Qwen3-Coder-30B-A3B at Q3_K_XL (~12.9GB). I need to try two of them in parallel ... should be significantly faster than this even at Q6.
If that problem gets solved, even if only for a batch approach that enables parallel batch inference with high total tokens/s but low per-session speed, and for bigger models, then it would be a serious game changer for large-scale, low-cost AI automation without billions in capex. My intuition says it should be possible, so perhaps someone has done it or started on it already.
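A toy model of that intuition (my own simplification, all figures assumed): one sweep over the weights can serve a whole batch of sequences, so aggregate tokens/s grows with batch size until compute becomes the ceiling, even though each individual session stays slow.

```python
# Toy model of why batching raises total throughput on bandwidth-starved hardware:
# one pass over the weights serves the whole batch, amortizing weight traffic,
# until arithmetic throughput becomes the new ceiling.
def total_tps(batch, bandwidth_gb_s, model_gb, compute_tps_ceiling):
    bandwidth_bound = batch * (bandwidth_gb_s / model_gb)  # amortized weight reads
    return min(bandwidth_bound, compute_tps_ceiling)

for batch in (1, 4, 16, 64):
    # Assumed figures: 100 GB/s bandwidth, 12 GB of weights, ~400 tok/s compute limit.
    print(batch, round(total_tps(batch, 100, 12, 400), 1))
# Per-session speed stays ~8 tok/s, but aggregate throughput climbs until compute-bound.
```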