I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude Code well enough to be useful.
Maybe this will be the one? This Unsloth guide from a sibling comment suggests it might be: https://unsloth.ai/docs/models/qwen3-coder-next
I use opencode and have done a few toy projects and little changes in small repositories, and I get a pretty speedy and stable experience up to a 64k context.
It would probably fall apart if I wanted to use it on larger projects, but I've often set tasks running on it, stepped away for an hour, and had a solution when I return. It's definitely useful for smaller projects: scaffolding, basic bug fixes, UI tweaks, etc.
I don't think "usable" is a binary thing, though. I know you write a lot about this, but it'd be interesting to understand what you're asking the local models to do, and what it is about what they do that you consider unusable on a relative monster of a laptop.
I've had mild success with GPT-OSS-120b (MXFP4, ends up taking ~66GB of VRAM for me with llama.cpp) and Codex.
I'm wondering if one could crowdsource chat logs from GPT-OSS-120b running with Codex, then use the good runs to seed another post-training pass fine-tuning the 20b variant, and whether that would make a big difference. Both models with reasoning_effort set to high are actually quite good compared to other downloadable models, but the 120b is just about out of reach for 64GB, so making the 20b better for specific use cases seems like it'd be useful.
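The distillation idea above amounts to filtering agent transcripts by outcome and serializing the survivors as fine-tuning data. A minimal sketch, assuming a hypothetical log format where each run carries a `messages` list and a success `score` (the field names and threshold are my assumptions, not any real Codex log schema):

```python
import json

def select_good_runs(transcripts, min_score=1.0):
    """Keep only runs whose task was marked successful.

    Each transcript is a dict like {"messages": [...], "score": float}
    -- a hypothetical schema for illustration.
    """
    return [t for t in transcripts if t.get("score", 0) >= min_score]

def to_jsonl(transcripts):
    """Serialize the selected runs as JSONL, one training example per line,
    the shape most chat fine-tuning toolkits accept."""
    return "\n".join(json.dumps({"messages": t["messages"]}) for t in transcripts)

runs = [
    {"messages": [{"role": "user", "content": "fix the bug"}], "score": 1.0},
    {"messages": [{"role": "user", "content": "broken run"}], "score": 0.0},
]
print(to_jsonl(select_good_runs(runs)))  # only the score-1.0 run survives
```

The real work would of course be in scoring runs reliably (did the patch compile, did the tests pass), not in the plumbing.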
On an M1 64GB, Q4KM on llama.cpp gives only 20 tok/s, while on MLX it is more than twice as fast. However, MLX has problems with KV cache consistency, especially with branching. So while in theory it is twice as fast as llama.cpp, it often does the prompt processing (PP) all over again, which completely trashes performance, especially with agentic coding.
So the agony is deciding whether to endure half the possible generation speed but get much better KV caching in return, or to have twice the speed but often sit through prompt processing again.
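A rough back-of-the-envelope shows why cache reuse can beat raw generation speed in agentic loops. All numbers below are illustrative assumptions (generation and PP speeds, context size, reuse fraction), not measurements:

```python
def effective_toks(gen_toks, gen_speed, prompt_toks, pp_speed, reuse_frac):
    """Effective tokens/s for one agent turn: prompt processing over the
    non-reused part of the context, plus generation time."""
    pp_time = prompt_toks * (1 - reuse_frac) / pp_speed
    gen_time = gen_toks / gen_speed
    return gen_toks / (pp_time + gen_time)

# llama.cpp-like: 20 tok/s generation, but 95% of a 50k context reused
print(round(effective_toks(500, 20, 50_000, 300, 0.95), 1))  # -> 15.0
# MLX-like: ~2x generation speed, but the whole prompt re-processed
print(round(effective_toks(500, 45, 50_000, 600, 0.0), 1))   # -> 5.3
```

With a long context and frequent turns, the slower engine with working cache reuse ends up several times faster end-to-end.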
But who knows, maybe Qwen gives them a hand? (hint,hint)
Now if you are not happy with the last answer, you may want to simply regenerate it or change your last question - this is branching of the conversation. Llama.cpp is capable of re-using the KV cache up to that point, while MLX is not (I am using the MLX server from the MLX community project). I haven't tried with LM Studio. Maybe worth a try, thanks for the heads-up.
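The branching case above reduces to a longest-common-prefix computation over token sequences: only the KV entries for the shared prefix can be kept, and everything after the divergence point must be re-processed. A minimal sketch (not any engine's actual implementation):

```python
def reusable_prefix(cached_tokens, new_tokens):
    """Number of leading tokens shared between the cached conversation
    and the new (branched) prompt; only these KV entries can be reused."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Branch: the user edits their last question after a long shared history.
history = [1, 2, 3, 4, 5, 6]
branched = [1, 2, 3, 4, 9, 10]  # diverges at position 4
keep = reusable_prefix(history, branched)
print(f"reuse {keep} tokens, re-process {len(branched) - keep}")  # reuse 4, re-process 2
```

In a real agentic session the shared prefix (system prompt, repo context, earlier turns) dominates the sequence, which is why dropping it hurts so much.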
Overall, it's allowed me to maintain more consistent workflows as I'm less dependent on Opus. Now that Mastra has introduced the concept of Workspaces, which allow for more agentic development, this approach has become even more powerful.
On a misc note: What's being used to create the screen recordings? It looks so smooth!
The benchmark consists of a bunch of tasks. The chart shows the distribution of the number of turns taken over all those tasks.
I got stuff done with Sonnet 3.7 just fine; it needed a bunch of babysitting, but it was still a net positive for productivity. Now local models are at that level, closing in on the current SOTA.
When "anyone" can run an Opus 4.5 level model at home, we're going to be getting diminishing returns from closed online-only models.
Don't forget that they want to make money in the end. They release small models for free because the publicity is worth more than they could charge for them, but they won't just give away models that are good enough that people would pay significant amounts of money to use them.
There's no reason for a coding model to contain all of ao3 and wikipedia =)
It's one thing running the model without any context, but coding agents build it up close to the max and that slows down generation massively in my experience.
In practice, I've found the economics work like this:
1. Code generation (boilerplate, tests, migrations) - smaller models are fine, and latency matters more than peak capability
2. Architecture decisions, debugging subtle issues - worth the cost of frontier models
3. Refactoring existing code - the model needs to "understand" before changing, so context and reasoning matter more
The 3B active parameters claim is the key unlock here. If this actually runs well on consumer hardware with reasonable context windows, it becomes the obvious choice for category 1 tasks. The question is whether the SWE-Bench numbers hold up for real-world "agent turn" scenarios where you're doing hundreds of small operations.
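The three-category split above is essentially a routing policy. A minimal sketch of what that could look like; the model names and category keywords are placeholders of mine, not real endpoints:

```python
def pick_model(task_kind: str) -> str:
    """Route a task to a model tier (names are hypothetical)."""
    local_fast = "qwen3-coder-next-local"  # small, low-latency, on-device
    frontier = "frontier-api-model"        # expensive, strongest reasoning
    if task_kind in {"boilerplate", "tests", "migrations"}:
        return local_fast  # category 1: latency beats peak capability
    if task_kind in {"architecture", "debugging"}:
        return frontier    # category 2: worth the cost
    return frontier        # category 3 (refactoring): context/reasoning heavy

print(pick_model("tests"))  # -> qwen3-coder-next-local
```

The interesting open question is whether routing can be automated well; misclassifying a subtle debugging task as boilerplate is where this scheme would hurt.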
It's that simple. Everyone else is trying to compete in other ways, and Anthropic is pushing to dominate the market.
They'll eventually lose their performance edge, and suddenly they'll be back to being cute and fluffy.
I've cancelled a Claude sub, but still have one.
I've tried all of the models available right now, and Claude Opus is by far the most capable.
I had an assertion triggered in a fairly complex open-source C library I was using, and Claude Opus not only found the cause, but wrote a self-contained reproduction I could add to a GitHub issue. It also added tests for the issue and fixed the underlying bug.
I am sincerely impressed by the capabilities of Claude Opus. Too bad its usage is so expensive.
I wonder what they are up to.
The green/yellow/red indicators for different levels of hardware support are really helpful, but far from enough IMO.
Great work as always btw!