Going from BF16 to 2.8 bpw and losing only ~5% sounds odd to me.
They detail their methodology here: https://byteshape.com/blogs/Qwen3-4B-I-2507/
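As a rough sanity check on the memory side (my own back-of-the-envelope math, assuming the quoted bits-per-weight covers essentially all tensors): the 30B-A3B model has about 30.5B total parameters, and at ~2.7 bpw that works out to roughly 30.5e9 × 2.7 / 8 ≈ 10.3 GB of weights, which lines up with the ~10 GB of RAM reported further down the thread. The quality retention is the part that's harder to eyeball, hence the linked methodology.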
- the interactive devices - all the alexa/google/apple devices out there are this interface, plus probably some TV input that stays local and that I can voice control. That kind of thing. It should have a good speaker and voice control. It probably should also do other things like act as a wifi range extender or be the router. That would actually be good. I would buy one for each room, so no need for crazy antennas if they are close and can create a true mesh network for me. But I digress.
- the home 'cloud' server that is storage and control. This is a cheap CPU, a little ram and potentially a lot of storage. It should hold the 'apps' for my home and be the one place I can back up everything about my network (including the network config!)
- the inference engines. That is where this kind of repo/device combo comes in. I buy it and it knows how to advertise in a standard way its services and the controlling node connects it to the home devices. It would be great to just plug it in and go.
Of course all of these could be combined but conceptually I want to be able to swap and mix and match at these levels so options here and interoperability is what really matters.
I know a lot of (all of) these pieces exist, but they don't work well together. There isn't a simple, standard 'buy this, turn it on, and pair it with your local network' kind of plug-and-play environment.
My core requirements are really privacy and that it starts taking over the unitaskers and plays well with other things. There is a reason I am buying all this local stuff. If you phone home or require me to set up an account with you, I probably don't want to buy your product. I want to be able to say 'Freddy, set timer for 10 mins' or 'Freddy, what is the number one tourist attraction in South Dakota' (Wall Drug, if you were wondering)
"Hey, are you still having trouble with[succinct summary of a problem it identified]?" "Yes" "I have a solution that meets your requirements as I understand them, and fits in your budget."
I call that Dreaming.
(TM)
Doubly so if you could just talk and brainstorm while it's listening and condensing, so you can circle back later and see what raindrops formed from the brainstorm.
Call that DayDreaming (TM)
Reminds me of a video from the 90's where some wizard put a camcorder and a giant antenna on a petrol powered rc car, an even bigger antenna on his house and controlled it from a 40's style sofa and a huge tube TV in his cramped garage. Over a mile range. Surrounded by enormous cars I think he was going 40-50 mph but with the screaming engine sound and the camera so low to the ground it looked like 500 mph. I'm still laughing, it looked like he was having all of the fun.
Great timing, as I was looking into it yesterday and was thinking about writing my own set of agents to run house stuff. I don't want to spend loads of time on voice interaction, so the HA wakeword stuff would be useful. If not, I'll bypass HA for voice and really only use HA via MCP.
I can do fw dev for micros...but omg do I not want to spend the time looking thru a datasheet and getting something to run efficiently myself these days.
I'd imagine you'd have a bunch of cheap ones in the house that are all WiFi + Mic + Speakers, streaming back to your actual voice processing box (which would cost a wee bit more, but also have local access to all the data it needs).
You can see quite quickly that this becomes just another program running on a host, so if you use a slightly beefier machine and chuck a WiFi card in as well, you've got your WiFi extenders.
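As a very rough sketch of what one of those satellites could look like (purely illustrative - the hostname and port here are made up, and a real setup would do proper wake word handling and audio framing): a cheap Linux box with a mic could get away with little more than

  arecord -t raw -f S16_LE -r 16000 -c 1 | nc voicebox.local 9000

i.e. capture 16 kHz mono PCM and stream it to the central voice-processing box, which handles wake word detection, STT and everything else. The satellite stays dumb and cheap; all the smarts live on the one machine with the horsepower (and the data).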
But really my use case is as simple as
1. Wake word, what time is it in ____
2. Wake word, how is the weather in ____
3. Wake word, will it rain/snow/?? in _____ today / tomorrow / ??
4. Wake word, what is ______
5. Wake word, when is the next new moon / full moon?
6. Wake word, when is sunrise / sunset?
And other things along those lines.
As compared to Alexa? I bought their preview hardware (and had a home-rolled ESP32 version before that even) and things are getting closer; I can see the future where this works, but we aren't there today IMHO. HA Voice (the current hardware) does not do well enough in the mic or speaker [0] department when compared to the Echos. My Echo can hear me over just about anything and I can hear it back; the HA Voice hardware is too quiet and the mic does not pick me up from the same distances or noise pollution levels as the Echo.
I _love_ my HA setup and run everything through it. I'd like nothing more than to trash all my Echos. I came close to ordering multiple of the preview devices but convinced myself to get just 1 to test (glad I did).
Bottom line: I think HA Voice is the future (for me) but it's not ready yet, it doesn't compare to the Echos. I wish so much that my Sonos speakers could integrate with HA Voice since I already have those everywhere and I know they sound good.
[0] I use Sonos for all my music/audio listening in my house so I only care about the speaker for hearing it talk back to me, I don't need high-end audiophile speakers.
Sadly, the Jabra (or any USB) audio device means I'll need to shift over to an rPi, which comes with its own lifecycle challenges.
I failed to mention I have Claude connected to it rather than their default assistant. To us, this just beats Alexa hands down. I have the default assistant on another wake word and Mistral on the last; they're about as good as Alexa but I rarely use them.
I will say, while it was too slow (today) with my local inference hardware (CPU, older computer and a little on my newer MBP), it was magical to talk to and hear back from HA all locally. I look forward to a future where I can do that at the same speed/quality as the cloud models. Yes, I know cloud models will continue to get better, but turning on/off my fans/lights/etc doesn't need the best model available, it just needs to be reliable and fast. I'm even fine with it "shelling out" to the cloud if I ask it for something outside of the basics, though I doubt I'll care to do that.
Yeah, because dynamic digital price signs in shops, based on what data vendors have about you and what AI can extract from it, are such fun! Total surveillance. More than what's already happening. Such fun!
You have all of the different components:
* you can use a number of things for the interactive devices (any touchscreen device, buttons, voice, etc)
* have HA do the basic parsing (word for word matching), with the option of plugging into something more complex (a cloud service like ChatGPT, or self-hosted Ollama or whatever) for more advanced parsing (logical parsing)
Every part of the ecosystem is interchangeable and very open. You can use a bunch of different devices, a bunch of different LLMs to do the advanced parsing if you want it. HA can control pretty much everything with an API, and can itself be controlled by pretty much anything that can talk an API.
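To make that last point concrete, here's a minimal sketch of driving HA from the outside via its standard REST API (the host, token variable and entity name are placeholders for whatever your install uses):

  curl -X POST http://homeassistant.local:8123/api/services/light/turn_on \
    -H "Authorization: Bearer $HA_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"entity_id": "light.living_room"}'

Anything that can make an HTTP call like that - a script, a voice pipeline, another service - can control HA, and HA can call out to other things the same way.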
The market is not ready for building this due to costs etc., not because the big companies block them or anything. And Nvidia is not selling subscriptions at all.
./build/bin/llama-cli -m "models/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.70bpw.gguf" -e --no-mmap -t 4
...
Loading model... ggml_aligned_malloc: insufficient memory (attempted to allocate 24576.00 MB)
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 25769803776
alloc_tensor_range: failed to allocate CPU buffer of size 25769803776
llama_init_from_model: failed to initialize the context: failed to allocate buffer for kv cache
Segmentation fault
I'm not sure how they're running it... any kind of guide for replicating their results? It does take up a little over 10 GB of RAM (watching with btop) before it segfaults and quits. [Edit: had to add -c 4096 to cut down the context size, now it loads]
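In other words, the command that actually loads is the same one as above with the context capped, i.e. something like:

  ./build/bin/llama-cli -m "models/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.70bpw.gguf" -e --no-mmap -t 4 -c 4096

Without -c capping it, llama.cpp sizes the KV cache for a much larger context, which is the ~24 GB allocation the error above is complaining about.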
I'm able to get 6-7 tokens/sec generation with 10-11 tokens/sec prompt processing with their model. Seems quite good, actually—much more useful than llama 3.2:3b, which has comparable performance on this Pi.
You need to have a fan/heatsink to get that speed of course, it's maxing out the CPU for the entire time.
That’s a pretty big caveat. In my experience, using a small context size is only okay for very short answers and questions. The output looks coherent until you try to use it for anything, then it turns into the classic LLM babble that looks like words are being put into a coherent order but the sum total of the output is just rambling.
https://github.com/ikawrakow/ik_llama.cpp and their 4Bit-quants?
Or maybe even Microsoft's BitNet? https://github.com/microsoft/BitNet
https://github.com/ikawrakow/ik_llama.cpp/pull/337
https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf ?
That would be an interesting comparison for running local LLMs on such low-end/edge-devices. Or common office machines with only iGPU.
I have not figured out which models that fit in the available memory (say 16 GB) would be best for doing this. A CPU model I can run on a laptop would be nice. The models I have tried are much smaller than 30B.
llama-server -m /Qwen3-30B-A3B-Instruct-2507-GGUF:IQ3_S --jinja -c 4096 --host 0.0.0.0 --port 8033
Got <= 10 t/s, which I think is not so bad!
On an AMD Ryzen 5 5500U with Radeon Graphics, compiled for Vulkan: got 15 t/s (could swear this morning it was <= 20 t/s).
On an AMD Ryzen 7 H 255 with Radeon 780M Graphics, compiled for Vulkan: got 40 t/s. On the last I did a quick comparison with the unsloth version (unsloth/Qwen3-30B-A3B-GGUF:Q4_K_M) and got 25 t/s. Can't really comment on quality of output - seems similar.
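In case anyone wants to replicate the Vulkan side of this: it's just a CMake flag on a stock llama.cpp checkout (the generic upstream build recipe, assuming the Vulkan SDK/drivers are installed - adjust to taste):

  cmake -B build -DGGML_VULKAN=ON
  cmake --build build --config Release

then run llama-server from that build and pass -ngl with however many layers you want offloaded to the iGPU.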
Realistically, the biggest models you can run at a reasonable price right now are quantized versions of things like the Qwen3 30B A3B family. A 4-bit quantized version fits in roughly 15GB of RAM. This will run very nicely on something like an Nvidia 3090. But you can also use your regular RAM (though it will be slower).
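To make that concrete (just a sketch - the filename is whatever quant you actually downloaded, and -ngl 99 simply means "offload everything that fits"):

  llama-server -m Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf -ngl 99 -c 8192 --host 0.0.0.0 --port 8080

Per the sizes quoted above, that fits on a 24GB card like the 3090; with less VRAM you can lower -ngl and let the remaining layers run from system RAM, at the cost of speed.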
These models aren't competitive with GPT 5 or Opus 4.5! But they're mostly all noticeably better than GPT-4o, some by quite a bit. Some of the 30B models will run as basic agentic coders.
There are also some great 4B to 8B models from various organizations that will fit on smaller systems. An 8B model, for example, can be a great translator.
(If you have a bunch of money and patience, you can also run something like GPT OSS 120B or GLM 4.5 Air locally.)
Don't need patience for these, just money. A single RTX 6000 Pro runs those great and super fast.
Runs everything today with bleeding edge performance.
Overall, what's the difference between 8k and 30k?
/s
As long as you have the money, this hardware is easily accessible to normal people, unlike fancy server hardware.
This one runs at a perfectly serviceable pace locally on a laptop 5090 with 64 GB of system RAM, with zero effort required. Just download Ollama and select this model from the drop-down.
OpenRouter gives you $10 credit when you sign up - stick your API key in and compare as many models as you want. It's all browser local storage.
The industry has to copy CUDA, or give up and focus on raster. ASIC solutions are a snipe chase, not to mention small and slow.
There have been a lot of boards and chips for years with dedicated compute hardware, but they’re only so useful for these LLM models that require huge memory bandwidth.
It's just that practically nothing uses those NPUs.
In a nutshell: LLMs generate tokens one at a time. "Only 3B parameters active at a time" means that for each of those tokens only 3B parameters need to be fetched from memory, instead of all of them (30B).
As to how the selection works - each mixture-of-experts layer in the network has essentially a small subnetwork called a "router" which looks at the input and calculates a score for each expert; then the best-scoring experts are picked and the inputs are only routed to them.
MoE models still operate on a token-by-token basis, i.e. "pot/at/o" -> "12345/7654/8472". "Experts" are selected on a per-token basis, not per-interaction, so the "expert" naming might be a bit of a misnomer, or marketing.
It punches well above the weight class expected from 3B active parameters. You could build the bear in Spielberg's "AI" with this thing, if not the kid.
Its accuracy across GSM8K, MMLU, IFEval and LiveCodeBench.
They detail their methodology here: https://byteshape.com/blogs/Qwen3-4B-I-2507/
Eight tokens per second is "real time" in that sense, but that's also the kind of speed we used to mock old video games for, when they would show "computers" but the text would slowly get printed to the screen letter by letter or word by word.
For anyone interested in a comparative review of different models that can run on a Pi, here’s a great article [1] I came across while working on my project.
[0] https://github.com/syxanash/maxheadbox
[1] https://www.stratosphereips.org/blog/2025/6/5/how-well-do-ll...
That got me thinking again about what practical even means when it comes to AI running on the edge. Like, away from big servers.
I came up with a basic way to look at it. First, capability: what kinds of tasks can it handle decently? Then latency: does it respond quickly enough that it does not feel laggy? There are also constraints to consider, things like power use, how much memory it needs, and heat buildup.
The use case part seems key too. What happens if you try to take it off the cloud and run it locally. In my experience, a lot of these edge AI demos fall short there. The tech looks impressive, but it is hard to see why you really need it that way.
It seems like most people overlook that unclear need. I am curious about how others see it. Local inference probably beats cloud in spots where you cannot rely on internet, or maybe for privacy reasons. Or when data stays on device for security.
Some workloads feel close right now. They might shift soon as hardware gets better. I think stuff like voice assistants or simple image recognition could tip over.
If someone has actually put a model on limited hardware, like in a product, what stood out as a surprise? The thermals maybe, or unexpected power drains? It feels like that part gets messy in practice. I might be oversimplifying how tricky it all is.
If you have very specific, constrained tasks it can do quite a lot. It's not perfect though.
https://tools.nicklothian.com/llm_comparator.html?gist=fcae9... is an example conversation where I took OpenAI's "Natural language to SQL" prompt[1], sent it to Ollama's qwen3:0.6b, and then asked Gemini Flash 3 to compare what qwen3:0.6b did vs what Flash did.
Flash was clearly correct, but the qwen3:0.6b errors are interesting in themselves.
[1] https://platform.openai.com/docs/examples/default-sql-transl...
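For anyone wanting to reproduce that kind of test, the mechanics are trivial once Ollama has pulled the model (a sketch - the prompt file here is hypothetical, standing in for whatever system prompt plus question you're testing with):

  ollama run qwen3:0.6b "$(cat nl_to_sql_prompt.txt)"

then hand both models' outputs to a bigger model to judge, which is roughly what the comparison above amounts to.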
They still aren't useful like large LLMs, but for things like summarization and other tasks where you can give them structure but want the sheen of natural language, they are much better than things like the Phi series were.
Original: 11 tok/s, Byteshape: 16 tok/s
Quite a nice improvement!
yjftsjthsd-h•1d ago
> On a Pi 5 (16GB), Q3_K_S-2.70bpw [KQ-2] hits 8.03 TPS at 2.70 BPW and maintains 94.18% of BF16 quality.
And they talk about other hardware and details. But that's the expanded version of the headline claim.
Imustaskforhelp•23h ago
You can paste any article into ChatGPT (I took the most layman AI thing) and just writing "summarize this article https://byteshape.com/blogs/Qwen3-30B-A3B-Instruct-2507/" can give you insights about it.
Although I am all for freedom, one forgets that this is one of the few places left on the internet where discussions feel meaningful. I am not judging you if you want AI, but do it at your own discretion using chatbots.
If you want, you can even hack together a simple extension (Tampermonkey etc.) with a button which can do this for you, if you really so desire.
Ended up being bored and asked ChatGPT to do this, but something was off with ChatGPT - it just got stuck blinking - so I asked Claude web (4.5 Sonnet) to do it and ended up building it as a Tampermonkey script.
Created the code. https://github.com/SerJaimeLannister/tampermonkey-hn-summari...
I was just writing this comment and got curious, I guess, so in the end I ended up building it.
Edit: Thinking about it, I felt that we should read other people's articles as well. I just created this tool out of curiosity or boredom, not as an endorsement of the idea or anything, but I think that we should probably read the articles themselves instead of asking ChatGPT or LLMs about them.
There is this quote which I remembered right now:
If something is worth talking/discussing about, it's worth writing.
If something is worth writing, then it's worth reading.
Information that we write is fundamentally subjective (our writing style, our biases, etc.); passing it through a black box which will try to homogenize all of it just feels like it misses the point.
6510•22h ago
haha, like so works too
https://raw.githubusercontent.com/SerJaimeLannister/tampermo...
Imustaskforhelp•22h ago
Is this what you are talking about? If you need any cooperation from my side, let me know. I don't know too much about Tampermonkey, but I end up using it for my mini scripts because it's much easier to deal with than building pure extensions, and it has its own editor as well, so I just copy-paste for a faster way to prototype stuff like this.
Alex2037•22h ago
sure, and reading an LLM summary allows one to decide whether the full article is worth reading or not.
Aurornis•21h ago
Their output is not great so they get downvoted and spotted quickly.