I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude Code well enough to be useful.
Maybe this will be the one? This Unsloth guide from a sibling comment suggests it might be: https://unsloth.ai/docs/models/qwen3-coder-next
I use opencode and have done a few toy projects and little changes in small repositories, and I get a pretty speedy and stable experience up to a 64k context.
It would probably fall apart if I wanted to use it on larger projects, but I've often set tasks running on it, stepped away for an hour, and had a solution when I return. It's definitely useful for smaller projects: scaffolding, basic bug fixes, UI tweaks, etc.
I don't think "usable" is a binary thing, though. I know you write a lot about this, but it'd be interesting to understand what you're asking the local models to do, and what it is about what they do that you consider unusable on a relative monster of a laptop.
I've had mild success with GPT-OSS-120b (MXFP4, ends up taking ~66GB of VRAM for me with llama.cpp) and Codex.
I'm wondering if one could crowdsource chat logs from GPT-OSS-120b running with Codex, then use the good runs to seed another post-training pass fine-tuning the 20b variant, and whether that would make a big difference. Both models with reasoning_effort set to high are actually quite good compared to other downloadable models, but the 120b is just about out of reach for 64GB, so making the 20b better for specific use cases seems like it'd be useful.
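The distillation idea above amounts to filtering agent transcripts by outcome and serializing the survivors as fine-tuning data. A minimal sketch, assuming a hypothetical log format where each run carries a `messages` list and a success `score` (the field names and threshold are my assumptions, not any real Codex log schema):

```python
import json

def select_good_runs(transcripts, min_score=1.0):
    """Keep only runs whose task was marked successful.

    Each transcript is a dict like {"messages": [...], "score": float}
    -- a hypothetical schema for illustration.
    """
    return [t for t in transcripts if t.get("score", 0) >= min_score]

def to_jsonl(transcripts):
    """Serialize the selected runs as JSONL, one training example per line,
    the shape most chat fine-tuning toolkits accept."""
    return "\n".join(json.dumps({"messages": t["messages"]}) for t in transcripts)

runs = [
    {"messages": [{"role": "user", "content": "fix the bug"}], "score": 1.0},
    {"messages": [{"role": "user", "content": "broken run"}], "score": 0.0},
]
print(to_jsonl(select_good_runs(runs)))  # only the score-1.0 run survives
```

The real work would of course be in scoring runs reliably (did the patch compile, did the tests pass), not in the plumbing.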
On an M1 64GB, Q4KM on llama.cpp gives only 20 tok/s, while on MLX it is more than twice as fast. However, MLX has problems with KV cache consistency, especially with branching. So while in theory it is twice as fast as llama.cpp, it often does the prompt processing (PP) all over again, which completely trashes performance, especially with agentic coding.
So the agony is deciding whether to endure half the possible generation speed but get much better KV caching in return, or to have twice the speed but often sit through prompt processing again.
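A rough back-of-the-envelope shows why cache reuse can beat raw generation speed in agentic loops. All numbers below are illustrative assumptions (generation and PP speeds, context size, reuse fraction), not measurements:

```python
def effective_toks(gen_toks, gen_speed, prompt_toks, pp_speed, reuse_frac):
    """Effective tokens/s for one agent turn: prompt processing over the
    non-reused part of the context, plus generation time."""
    pp_time = prompt_toks * (1 - reuse_frac) / pp_speed
    gen_time = gen_toks / gen_speed
    return gen_toks / (pp_time + gen_time)

# llama.cpp-like: 20 tok/s generation, but 95% of a 50k context reused
print(round(effective_toks(500, 20, 50_000, 300, 0.95), 1))  # -> 15.0
# MLX-like: ~2x generation speed, but the whole prompt re-processed
print(round(effective_toks(500, 45, 50_000, 600, 0.0), 1))   # -> 5.3
```

With a long context and frequent turns, the slower engine with working cache reuse ends up several times faster end-to-end.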
But who knows, maybe Qwen gives them a hand? (hint,hint)
Now if you are not happy with the last answer, you may want to simply regenerate it or change your last question - this is branching of the conversation. Llama.cpp is capable of re-using the KV cache up to that point, while MLX is not (I am using the MLX server from the MLX community project). I haven't tried with LM Studio. Maybe worth a try, thanks for the heads-up.
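The branching case above reduces to a longest-common-prefix computation over token sequences: only the KV entries for the shared prefix can be kept, and everything after the divergence point must be re-processed. A minimal sketch (not any engine's actual implementation):

```python
def reusable_prefix(cached_tokens, new_tokens):
    """Number of leading tokens shared between the cached conversation
    and the new (branched) prompt; only these KV entries can be reused."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Branch: the user edits their last question after a long shared history.
history = [1, 2, 3, 4, 5, 6]
branched = [1, 2, 3, 4, 9, 10]  # diverges at position 4
keep = reusable_prefix(history, branched)
print(f"reuse {keep} tokens, re-process {len(branched) - keep}")  # reuse 4, re-process 2
```

In a real agentic session the shared prefix (system prompt, repo context, earlier turns) dominates the sequence, which is why dropping it hurts so much.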
Overall, it's allowed me to maintain more consistent workflows as I'm less dependent on Opus. Now that Mastra has introduced the concept of Workspaces, which allow for more agentic development, this approach has become even more powerful.
On a misc note: What's being used to create the screen recordings? It looks so smooth!
The benchmark consists of a bunch of tasks. The chart shows the distribution of the number of turns taken over all those tasks.
I got stuff done with Sonnet 3.7 just fine; it needed a bunch of babysitting, but it was still a net positive for productivity. Now local models are at that level, closing in on the current SOTA.
When "anyone" can run an Opus 4.5 level model at home, we're going to be getting diminishing returns from closed online-only models.
Don't forget that they want to make money in the end. They release small models for free because the publicity is worth more than they could charge for them, but they won't just give away models that are good enough that people would pay significant amounts of money to use them.
There's no reason for a coding model to contain all of ao3 and wikipedia =)
It's one thing running the model without any context, but coding agents build it up close to the max and that slows down generation massively in my experience.
In practice, I've found the economics work like this:
1. Code generation (boilerplate, tests, migrations) - smaller models are fine, and latency matters more than peak capability
2. Architecture decisions, debugging subtle issues - worth the cost of frontier models
3. Refactoring existing code - the model needs to "understand" before changing, so context and reasoning matter more
The 3B active parameters claim is the key unlock here. If this actually runs well on consumer hardware with reasonable context windows, it becomes the obvious choice for category 1 tasks. The question is whether the SWE-Bench numbers hold up for real-world "agent turn" scenarios where you're doing hundreds of small operations.
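The three-category split above is essentially a routing policy. A minimal sketch of what that could look like; the model names and category keywords are placeholders of mine, not real endpoints:

```python
def pick_model(task_kind: str) -> str:
    """Route a task to a model tier (names are hypothetical)."""
    local_fast = "qwen3-coder-next-local"  # small, low-latency, on-device
    frontier = "frontier-api-model"        # expensive, strongest reasoning
    if task_kind in {"boilerplate", "tests", "migrations"}:
        return local_fast  # category 1: latency beats peak capability
    if task_kind in {"architecture", "debugging"}:
        return frontier    # category 2: worth the cost
    return frontier        # category 3 (refactoring): context/reasoning heavy

print(pick_model("tests"))  # -> qwen3-coder-next-local
```

The interesting open question is whether routing can be automated well; misclassifying a subtle debugging task as boilerplate is where this scheme would hurt.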
It's that simple. Everyone else is trying to compete in other ways, and Anthropic is pushing to dominate the market.
They'll eventually lose their performance edge, and suddenly they'll be back to being cute and fluffy.
I've cancelled a Claude sub, but still have one.
I've tried all of the models available right now, and Claude Opus is by far the most capable.
I had an assertion triggered in a fairly complex open-source C library I was using, and Claude Opus not only found the cause, but wrote a self-contained reproduction I could add to a GitHub issue. It also added tests for the issue and fixed the underlying bug.
I am sincerely impressed by the capabilities of Claude Opus. Too bad its usage is so expensive.
I wonder what they are up to.
The green/yellow/red indicators for different levels of hardware support are really helpful, but far from enough IMO.
Great work as always btw!