A couple suggestions:
1. I have an M3 Ultra with 256GB of memory, but the options list only goes up to 192GB. The M3 Ultra supports up to 512GB.
2. It'd be great if I could flip this around and choose a model, then see the performance for all the different processors. It would help with buying decisions!
Not sure if it still works.
I too was a little surprised by this. My browser (Vivaldi) makes a big deal about how privacy-conscious they are, but apparently browser fingerprinting is not on their radar.
> Estimates based on browser APIs. Actual specs may vary
One thing I do wonder is what sort of solutions there are for running your own model, but using it from a different machine. I don't necessarily want to run the model on the machine I'm also working from.
You can also use the kubernetes operator to run them on a cluster: https://ollama-operator.ayaka.io/pages/en/
Just FYI.
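If Kubernetes is more than you need, the simpler pattern is to run the model server on one box and call its HTTP API from the machine you work on. A minimal sketch using Ollama's documented /api/generate endpoint; the hostname, model name, and prompt below are placeholders:

    # On the serving machine: OLLAMA_HOST=0.0.0.0 ollama serve
    # From the machine you actually work on:
    import requests

    resp = requests.post(
        "http://gpu-box.local:11434/api/generate",  # placeholder hostname, default Ollama port
        json={
            "model": "qwen2.5-coder:7b",            # any model already pulled on the server
            "prompt": "Write a haiku about local inference.",
            "stream": False,                        # single JSON response instead of a stream
        },
        timeout=120,
    )
    print(resp.json()["response"])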
I also want to run vision (like Yocto) and a basic LLM with TTS/STT.
For wake words I have used Picovoice Rhino.
I want to use these I2S breakout mics
I haven't tried on a Raspberry Pi, but on Intel it uses a little less than 1s of CPU time per second of audio. Using https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/a... for chunked streaming inference, it takes 6 cores to process audio ~5x faster than realtime. I expect with all cores on a Pi 4 or 5, you'd probably be able to at least keep up with realtime.
(Batch inference, where you give it the whole audio file up front, is slightly more efficient, since chunked streaming inference is basically running batch inference on overlapping windows of audio.)
EDIT: there are also the multitalker-parakeet-streaming-0.6b-v1 and nemotron-speech-streaming-en-0.6b models, which have similar resource requirements but are built for true streaming inference instead of chunked inference. In my tests, these are slightly less accurate. In particular, they seem to completely omit any sentence at the beginning or end of a stream that was partially cut off.
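For reference, the batch path is only a few lines with NeMo's Python API. A minimal sketch, assuming one of the Parakeet checkpoints (the model name and file path below are just examples):

    # pip install "nemo_toolkit[asr]"
    import nemo.collections.asr as nemo_asr

    # Load a Parakeet checkpoint; swap in whichever model you actually want to test
    model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

    # Batch inference: hand it whole files up front and get transcripts back
    results = model.transcribe(["meeting_recording.wav"])
    print(results[0].text)  # recent NeMo returns Hypothesis objects with a .text field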
It says I have an Arc 750 with 2 GB of shared RAM, because that's the GPU that renders my browser... but I actually have an RTX 1000 Ada with 6 GB of GDDR6. It's kind of like an RTX 4050 (not listed in the dropdowns) with lower thermal limits. I also have 64 GB of LPDDR5 main memory.
It works - Qwen3 Coder Next, Devstral Small, Qwen3.5 4B, and others can run locally on my laptop in near real-time. They're not quite as good as the latest models. I've also tried some bigger ones (up to 24GB) that are slower but smarter; at that size it produces tokens about half as fast as I can type, which is disappointingly slow.
But I don't run out of tokens.
I’d like to be able to use a local model (which one?) to power Copilot in vscode, and run coding agent(s) (not general purpose OpenClaw-like agents) on my M2 MacBook. I know it’ll be slow.
I suspect this is actually fairly easy to set up - if you know how.
You're probably not going to get anything working well as an agent on an M2 MacBook, but smaller models do surprisingly well for focused autocomplete. Maybe the Qwen3.5 9B model would run decently on your system?
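One low-effort way to wire that up is to run the model behind an OpenAI-compatible endpoint (llama.cpp's llama-server and Ollama both expose one) and point the editor extension at it. A minimal sketch to sanity-check the endpoint before touching VS Code, assuming a server is already listening on localhost:8080 with a model loaded:

    # pip install openai  (the client works against any OpenAI-compatible server)
    from openai import OpenAI

    # base_url, port, and model name depend on how you launched the local server
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    reply = client.chat.completions.create(
        model="local-model",  # llama-server typically ignores this; Ollama wants the pulled model name
        messages=[{"role": "user", "content": "Complete this function: def fizzbuzz(n):"}],
        max_tokens=128,
    )
    print(reply.choices[0].message.content)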
https://github.com/russellballestrini/unfirehose-nextjs-logg...
Thanks, I'll check for comments. Feel free to fork, but if you want to contribute you'll have to find me off of GitHub; I develop privately on my own self-hosted GitLab server. Good luck & God bless.
In reality, gpt-oss-120b fits great on the machine with plenty of room to spare and easily runs inference north of 50 t/s depending on context.
I've been working with quite a few open-weight models for the last year, and especially for things like images, models from 6 months ago would quickly return garbage data. But these days Qwen3.5 is incredible, even the 9B model.
But yes, if there is a choice I want quality over speed. At the same quality, I definitely want speed.
(In terms of intelligence, they tend to score similarly to a dense model that's as big as the geometric mean of the full model size and the active parameters, i.e. for GPT-OSS-20B, it's roughly as smart as a sqrt(20b*3.6b) ≈ 8.5b dense model, but produces tokens 2x faster.)
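As a quick sketch of that rule of thumb (the heuristic itself, not anything this tool computes):

    from math import sqrt

    def dense_equivalent(total_params_b, active_params_b):
        # Rule-of-thumb "effective size" of a MoE: geometric mean of total and active params
        return sqrt(total_params_b * active_params_b)

    print(dense_equivalent(20, 3.6))     # GPT-OSS-20B  -> ~8.5B dense-equivalent
    print(dense_equivalent(46.7, 12.9))  # Mixtral 8x7B -> ~24.5B dense-equivalent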
For MoE models, it should be using the active parameters in memory bandwidth computation, not the total parameters.
> GPT-OSS-20B has 3.6B active parameters, so it should perform similarly to a 3-4B dense model, while requiring enough VRAM to fit the whole 20B model.
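A back-of-the-envelope decode estimate using active parameters looks something like this (a sketch, assuming decode is purely memory-bandwidth bound and ignoring prefill and KV cache; the 400 GB/s bandwidth and ~4-bit weights are illustrative numbers, not the site's):

    def decode_tokens_per_s(bandwidth_gb_s, active_params_b, bytes_per_param):
        # Rough ceiling: each generated token streams the active weights once
        return bandwidth_gb_s / (active_params_b * bytes_per_param)

    # GPT-OSS-20B-style MoE (3.6B active, 20B total) at ~4-bit weights on a 400 GB/s machine
    print(decode_tokens_per_s(400, 3.6, 0.5))  # ~222 t/s ceiling using active params
    print(decode_tokens_per_s(400, 20, 0.5))   # ~40 t/s if total params were (wrongly) used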
First, the token generation speed is going to be comparable, but not the prefill speed (context processing is going to be much slower on a big MoE than on a small dense model).
Second, without speculative decoding, it is correct to say that a small dense model and a bigger MoE with the same number of active parameters are going to be roughly as fast. But if you use a small dense model you will see token generation performance improvements with speculative decoding (up to 3x the speed), whereas you probably won't gain much from speculative decoding on a MoE model (because two consecutive tokens won't trigger the same “experts”, so you'd need to load more weights to the compute units, using more bandwidth).
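For reference, the "up to 3x" figure falls out of the standard speculative decoding estimate: with k draft tokens per pass and a per-token acceptance rate a, the expected number of tokens committed per target-model forward pass is (1 - a^(k+1)) / (1 - a). A quick sketch (the acceptance rates below are made up for illustration):

    def expected_tokens_per_pass(accept_rate, draft_len):
        # Expected tokens committed per target forward pass (assumes i.i.d. acceptance)
        return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)

    print(expected_tokens_per_pass(0.8, 4))  # strong draft model: ~3.4 tokens per pass
    print(expected_tokens_per_pass(0.5, 4))  # weaker draft: ~1.9 tokens per pass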
> A Mixture of Experts model splits its parameters into groups called "experts." On each token, only a few experts are active — for example, Mixtral 8x7B has 46.7B total parameters but only activates ~12.9B per token. This means you get the quality of a larger model with the speed of a smaller one. The tradeoff: the full model still needs to fit in memory, even though only part of it runs at inference time.
> A dense model activates all its parameters for every token — what you see is what you get. A MoE model has more total parameters but only uses a subset per token. Dense models are simpler and more predictable in terms of memory/speed. MoE models can punch above their weight in quality but need more VRAM than their active parameter count suggests.
2. Add a 150% size bonus to your site.
Otherwise, cool site, bookmarked.
I don't really understand what the interface to the NPU looks like from the perspective of a non-system caller, if it exists at all. This is a Samsung device, but I'm wondering about the general principle.