Personally I prefer the 35B MoE model, which is fast enough to be interactively useful, and capable, but I would probably use the 27B if I wanted to generate whole applications like that.
I am unconvinced that most "local" AI applications need anything much more powerful than the Gemma 4 12B model. Local agentic coding is a small niche, but there are plenty of ways a local model can help with development tasks.
I would really like to see a 12B or 16B Qwen 3.6.
I am currently playing with Ornith 1.0 in the MoE configuration, which is based on the 35B variant of Qwen 3.5; I am not sure if it is better than the 3.6 version.
Benchmarks say it is; my own silly tests either suggest otherwise or suggest that I have to talk to it a bit differently.
I really want to have a model that i can run locally on my 24gb m4 pro mbp for when i don't have internet to connect to my 3090 running the qwen, and i love how gemma 4 models 'feel', but i can't make them be competent. I am in the middle of finetuning both qwen3.5 9B and gemma 4 12B just to try and make those bridge closer to 27B for coding/agentic tasks (and am trying to ternarize and DQT 27B so that it fits in ~9gb pre-KV).
How do you run the gemma? What do you use it for (and in what harness), maybe llama.cpp and pi-mono just aren't for this model and that's what i'm doing wrong.
Qwen on the other hand got straight to work with astonishing competency on the same system.
From what I read llama3 needs beefier compute to reliably invoke tools, which I presume relates to it focussing more on simulating AGI rather than being a useful tool.
https://arena.ai/leaderboard/code/webdev/pareto?license=open...
https://arena.ai/leaderboard/text/pareto?license=open-source
Qwen 3.6 27B will run in full offload with a 4-bit quantisation in 64GB on an M1 Max. It is quite slow.
I don't know about 48GB but 64GB should be enough.
So for example I'd favour a used M1 Max over a used M2 Pro, at least based on my naïve understanding. Not quite sure where the balance changes.
There appear to be some hardware improvements with the M3 and up regarding the Apple Neural Engine which I'd hope would show up in MLX performance; I remember seeing some optimisations in image generation models that are only possible on later hardware.
The GPU cores are progressively better I believe, but the memory bandwidth is lower. Though perhaps the M4 can get closer to actually saturating said bandwidth.
(And I must reiterate that my understanding of this stuff is pretty naïve.)
Is there any way to use MLX and GPU at the same time? Or does memory become a big problem?
TBH, I never understood Apple hyping these neural cores because I didn't think anyone actually uses them except maybe certain photo/video editing software.
If I can generate voice at the same time as video, that would be useful.
The neural cores aren't suitable for LLMs/transformers and isn't used in LLM inference. On the M5 and later chips, it comes with neural accelerators, aka Tensor Cores, which speed up the 'prefill' (i.e. processing your context window) part, but don't do anything for inference.
The MLX vs GGUF debate is mostly irrelevant. The GGUF pathways are optimised for apple silicon to the extent of practically identical performance to MLX. MLX is just one way of using Apple GPUs, it comes with many optimisations in the box, but they're not hard and they're no longer MLX-exclusive.
>
> --jinja for tool calling support
Pretty sure this flag hasn't done anything for a while. It's enabled by default since ~November of last year
Lora if effective could be a great reason to run local models.
Local development for who? How many of y'all are rocking 128GB of memory? Am I reading Apple's site correctly that it's a $10,000 laptop?
I’m not having it build whole features from scratch, though. I give it pretty explicit instructions closer to the class or function level, and it still saves me an immense amount of time, while I’m very connected to the code that’s written.
Definitely the sweet spot for me.
Sent from my 8gb M2 Mac mini.
The real test is whether or not it can work with your existing codebases. In my limited experiments Qwen 3.5 (maybe 3.6 is loads better) does OK on a Rust+React app, and less well on a C# monolith. Not to the point of being unusable but definitely poorly enough that I went back to Claude after 20 minutes. If I lost access to a cloud model and had to use Qwen instead I'd be visibly sad.
1. Maybe you should tell us what those limited experiments are.
2. Maybe you should actually try 3.6 because it's huge difference in most cases. Don't forget to tell us quants and don't forget to tell us scope.
3. Maybe actually show us data compared to frontier models instead of this... vibe comment. Pretty tired of this kind of comments on HN that doesn't require logic or evidence. Just vibes. Like the pelican riding a bicycle crap that everyone has taken for granted but has no objective way of assessing goodness.
Modifying existing code is way easier if you don't expect it to be smart about it. Don't say "add X feature" and let it explore the codebase and build its own understanding. Point it at the relevant files and say "the goal is to add X feature to this code, follow Y guidelines". Now you've done the hardest part of making the decisions and it just has to follow instructions while coloring within the lines.
tweaking sampler might help
https://github.com/ikawrakow/ik_llama.cpp
Edit: it's gonna be slow if you're not using any VRAM. But it's possible. Software isn't going to speed that up anytime soon, it's just a hardware bandwidth limit.
Some people will be happy to pay that premium for privacy, but at roughly 10X the cost of a MacBook Neo, that money could also buy a lot of credits on OpenRouter or frontier labs.
[0]: https://www.apple.com/shop/buy-mac/macbook-pro/14-inch-space...
QAT, MTP, 128k context.
I liked Qwen 3.6 27b too, it just seems that Gemma4 is a bit underrated.
I'm running the NVFP4 alongside Gemma4 at the same quant on an OEM Spark
However, text-to-speech, speech-to-text, and non-code LLM use cases are so useful to have local, and don't require big hardware.
Having a universal reliable inference engine interface, I think, is the big unlock that needs to happen before app devs can ship these features.
Personal concrete use case: meeting recording app. This uses Parakeet + Qwen to create local transcriptions and post-cleanup, respectively.
Right now this app has to download and manage all these models, then bundle an inference engine to run them. It's a lot of code that probably should belong to the OS, or at least a standard interface.
While apps can offload some of this to llama.cpp or a similar process over http, that's another set of setup for the user to do before they can have a useful app.
Anyway, if you're getting started on a Mac, I'd suggest trying out oMLX (https://github.com/jundot/omlx) before messing with llama.cpp. In particular they have community benchmarks so you can see what kind of performance you're likely to get: https://omlx.ai/benchmarks. I wished each one had more configuration details though.
Certainly this is falsifiable easily by any of us doing it on a regular basis
> Qwen stuck in thought loops
This does happen when context is not managed effectively; creating plans, using subagents and compactions strategically resolves this
(I'm aware the price is, in absolute terms, more expensive where I live compared to the USA. That reinforces what I think, because anyone sane that would've bought one of those in another country would sell them as soon as they landed here and save that money.)
Jackrong has a few different ones available depending on what you're trying to do: https://huggingface.co/Jackrong
I do not have a crazy rig, a modest gaming one at that, but in trying to understand more about agents and their capabilities, I am SOL with my 16 GB of RAM and 8GB of VRAM. I can get most small, non tool calling models to perform well, but I've had major issues with anything over 9B doing anything more than reasoning (egregiously slow at higher parameter counts).
And so far, I cant get even Pi to extend itself or do any meaningful work with any of the models I currently can get to run.
I am still mostly tinkering/learning rather than spilling out code, and I feel quite slow on it. So it doesn't matter too much to me if it is really slow. More the journey than the destination if that makes sense. I'm stubborn.
I have tried the Gemma 4 12B model (Unsloth's QAT version) with search/browse tools in LM Studio and Unsloth Studio, when I am trying to understand a new thing.
Basically I get it to write introductory starter documentation for me to absorb, because my big personal problem, these days, is focussing enough to start a project, digging in and getting going; I need the help.
I have found its limits on obscure packages (that it sometimes makes up) but before that it's a bit like stumbling on a blog post that happens to be about right for your particular need. Good enough to work through.
It's stuff I could ask Perplexity to do, or ChatGPT, to be fair, I just like LM Studio for this.
In your case: I don't believe it's the quant. I'm sure it's the model — it has good coding knowledge but it's clearly not specialised. It might be good enough at writing Python/PHP/JavaScript at a novice level. It is also quite good on WordPress tooling and functions.
But I wouldn't bother with it for agentic coding if you've got experience elsewhere. Might be interesting to see what you can do with the 9B Ornith model?
Qwen 3.6 MoE in its Unsloth version is another matter. Impressive and I am trying to find ways to support my old brain doing what I've done before.
For 24GB VRAM cards (e.g. 4090) you can use Q6_K (22.5GB) or Q5_K_M (19.5GB) quants, possibly offloading some of the weights to RAM.
Not really germane to your comment but I hope I don’t sound old when I say I remember a time when spinning up a PoC was a week of work, and a statement like yours was pure science fiction.
... but, the models that WILL run on 128GB (or 64GB or even 32GB) models today are a huge improvement on the best models that would run in the same amount of memory six months ago.
In the open model space an insane amount of effort goes into getting more powerful models to run with the same or less RAM. For example in the diffusion world many things that could not be run on easily under 24GB of VRAM actually run much better today with much less VRAM than they did a few years ago. You can do many things today with 8-16GB of VRAM that would not have been possible. At the same time the most advanced open models, like LTX 2.3 for video gen, still seem to respect 24GB of VRAM as the upper bound.
Similarly the standard "big" but localish open model for LLMs back in the day was Llama 3 70B, this was both a much worse and much larger model than Qwen 3.6 27B
So in two different spaces I've witnessed the "RAM required to run the best" decreasing or at least remaining stable, while the performance being achieved in both areas is astounding (LTX 2.3 is faster, better and more capable than the Wan 2.2 model that held popularity before it).
The biggest thing to watch out for is not just RAM/VRAM but memory bandwidth. You can try to "future proof" yourself with lots of RAM, but if it's 400 GB/S you're still constrained to smaller models.
Hahhahahahhahahahhahahhahaha
Do you know how much VRAM/unified is needed for the 27B model, which is generally regarded as better between the two compared in the article, is needed with little to no KLD loss and at 256k context?
Also, once you worked out how much memory is needed for that, maybe tell us how much a non-Apple system that you can run that (probably similarly or faster) would cost?
And when you have answered that, can you tell us how much privacy costs? Maybe also tell us how private OpenRouter is?
Edit: looking at other replies that are basically pointing out the same thing I did, I guess it's my wording. It's frustrating that people who misinform others in some nicely packaged ways or just simply uninformed get to keep doing that if they sound nice. Thanks.
Ryzen AI Max 395+ with 128GB of unified memory can be found around $3-4k.
But 27B isn't that large, either, especially if you are ok with the quantized models. So this laptop choice seems to more be a "because they had it" rather than "this is what's necessary for this particular workflow"
If you want to run unquantized, you definitely need 128GB.
I don't know how much serious hands-free agentic coding I will ever do on my MacBook alone, but I do know that I would not have got so far into understanding this without tinkering with local models, llama.cpp, LM Studio, and LM Studio and all that.
I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled. Because it's just huge, exhausting, jargon-drenched, unknowable, and I am over the hill at fifty-plus.
Until, that is, I could poke around with setting it up on my own (secondhand) machine, watching the API calls, understanding some of the terminology. I didn't even buy the machine for that; it's just adequate to the task.
The Neo is too small to really get much benefit from this opportunity to make it more visceral and knowable.
I found LM studio to be a nice starting point. Frindlier and more featureful than Ollama and not as intimidating as llama.cpp (though you will want to use that eventually)
- opencode with it's webui
- deer-flow with it's research/powered front end
They both run websites so you don't have to baby sit them (eg, keep your mac open). I've build a pdf compressor over a few days by first having deer flow try and research the frameworks and pipeline. It stalls out because its not really a fluid programmer. Once it stalls out, I transferred it (manually for now) to opencode and it's refactoring it because it's just a collective bundle of sticks and it needs a lot of testing to tweak out the limited scop context. LLMs can't really hold large scopes (locally anyway, from what I've read from HN, it's possible with longer context).
It'll complete in a few days with maybe 3-4 hours of full attention interaction, but it's running 3x that without my attention. Obviously, if I paid more attention it'd run quicker, but since it's local, it's not pumping out large volumes of code, it's mostly looping over tests and capabilities as observed.
It's running Qwen3.6 35B MoE on a AMD 128GB strix halo. If I switched to the dense models, perhaps it'd be smarter, but the trade off seems to be much slower gen.
I use my MBP essentially as my workstation, it's almost always plugged in. I have a MBA (M4, 24GB RAM) that I picked up for ~A$1500 or so, and that's an amazing daily driver. I don't do local LLM inference on that unit, I can just hit my own APIs (via LM Studio) on the MBP over Tailscale.
From what I understand, for a developer, $5000/month is maybe the high end, but $5000/year is fairly standard. (Is that accurate?) So if it pays back in 15 months, that's pretty decent. If it pays back in two months, that's spectacular.
In theory you can also get 48GB of VRAM with, say, two 3090s, but it will take up a lot of space and generate a lot of heat compared to the Macbook Pro and GB10.
So like... $2000+ just for the used GPUs? Plus I assume it's considerably more effort to get it working.
Nah, not really. It is a little annoying in terms of space and power, though. Not every case and motherboard can support cards that big.
Qwen3.6-27B would be faster on a 3090 that costs around $1000-1200 though so I don't think it's a good counter-argument.
Op just happened to have that MacBook, but it doesn't mean it's necessary to run the model.
I think you might be a little to into the stew here.
217•1h ago
madduci•55m ago
dofm•53m ago
sleepyeldrazi•46m ago