Qwen3.6-35B-A3B: Agentic Coding Power, Now Open to All

306•cmitsakis•1h ago

Comments

incomingpain•1h ago

Wowzers, we were worried Qwen was going to suffer having lost several high profile people on the team but that's a huge drop.

It's better than 27b?

adrian_b•1h ago

Their previous model Qwen3.5 was available in many sizes, from very small sizes intended for smartphones, to medium sizes like 27B and big sizes like 122B and 397B.

This model is the first that is provided with open weights from their newer family of models Qwen3.6.

Judging from its medium size, Qwen/Qwen3.6-35B-A3B is intended as a superior replacement of Qwen/Qwen3.5-27B.

It remains to be seen whether they will also publish in the future replacements for the bigger 122B and 397B models.

The older Qwen3.5 models can be also found in uncensored modifications. It also remains to be seen whether it will be easy to uncensor Qwen3.6, because for some recent models, like Kimi-K2.5, the methods used to remove censoring from older LLMs no longer worked.

mft_•48m ago

There was also Qwen3.5-35B-A3B in the previous generation: https://huggingface.co/Qwen/Qwen3.5-35B-A3B

bertili•1h ago

A relief to see the Qwen team still publishing open weights, after the kneecapping [1] and departures of Junyang Lin and others [2]!

[1] https://news.ycombinator.com/item?id=47246746 [2] https://news.ycombinator.com/item?id=47249343

guitcastro•1h ago

I really wish they released qwen-image 2.0 as open weights.

zozbot234•1h ago

This is just one model in the Qwen 3.6 series. They will most likely release the other small sizes (not much sense in keeping them proprietary) and perhaps their 122A10B size also, but the flagship 397A17B size seems to have been excluded.

bertili•1h ago

Is there any source for these claims?

anonova•52m ago

A Qwen research member had a poll on X asking what Qwen 3.6 sizes people wanted to see:

https://x.com/ChujieZheng/status/2039909917323383036

Likely to drive engagement, but the poll excluded the large model size.

zozbot234•51m ago

https://x.com/ChujieZheng/status/2039909917323383036 is the pre-release poll they did. ~397B was not a listed choice and plenty of people took it as a signal that it might not be up for release.

stingraycharles•54m ago

397A17B = 397B total weights, 17B per expert?

wongarsu•49m ago

397B params, 17B activated at the same time

Those 17B might be split among multiple experts that are activated simultaneously

zackangelo•37m ago

17b per token. So when you’re generating a single stream of text (“decoding”) 17b parameters are active.

If you’re decoding multiple streams, it will be 17b per stream (some tokens will use the same expert, so there is some overlap).

When the model is ingesting the prompt (“prefilling”) it’s looking at many tokens at once, so the number of active parameters will be larger.

littlestymaar•32m ago

That's not how it works. Many people get confused by the “expert” naming, when in reality the key part of the original name “sparse mixture of experts” is sparse.

Experts are just chunks of each layers MLP that are only partially activated by each token, there are thousands of “experts” in such a model (for Qwen3-30BA3, it was 48 layers x 128 “experts” per layer with only 8 active at each token)

fred_is_fred•1h ago

How does this compare to the commercial models like Sonnet 4.5 or GPT? Close enough that the price is right (free)?

vidarh•1h ago

The will not measure up. Notice they're comparing it to Gemma, Google's open weight model, not to Gemini, Sonnet, or GPT. That's fine - this is a tiny model.

If you want something closer to the frontier models, Qwen3.6-Plus (not open) is doing quite well[1] (I've not tested it extensively personally):

https://qwen.ai/blog?id=qwen3.6

NitpickLawyer•1h ago

> Close enough

No. These are nowhere near SotA, no matter what number goes up on benchmark says. They are amazing for what they are (runnable on regular PCs), and you can find usecases for them (where privacy >> speed / accuracy) where they perform "good enough", but they are not magic. They have limitations, and you need to adapt your workflows to handle them.

julianlam•1h ago

Can you share more about what adaptations you made when using smaller models?

I'm just starting my exploration of these small models for coding on my 16GB machine (yeah, puny...) and am running into issues where the solution may very well be to reduce the scope of the problem set so the smaller model can handle it.

adrian_b•52m ago

It is very unlikely that general claims about a model are useful, but only very specific claims, which indicate the exact number of parameters and quantization methods that are used by the compared models.

If you perform the inference locally, there is a huge space of compromise between the inference speed and the quality of the results.

Most open weights models are available in a variety of sizes. Thus you can choose anywhere from very small models with a little more than 1B parameters to very big models with over 750B parameters.

For a given model, you can choose to evaluate it in its native number size, which is normally BF16, or in a great variety of smaller quantized number sizes, in order to fit the model in less memory or just to reduce the time for accessing the memory.

Therefore, if you choose big models without quantization, you may obtain results very close to SOTA proprietary models.

If you choose models so small and so quantized as to run in the memory of a consumer GPU, then it is normal to get results much worse than with a SOTA model that is run on datacenter hardware.

Choosing to run models that do not fit inside the GPU memory reduces the inference speed a lot, and choosing models that do not fit even inside the CPU memory reduces the inference speed even more.

Nevertheless, slow inference that produces better results may reduce the overall time for completing a project, so one should do a lot of experiments to determine an appropriate compromise.

When you use your own hardware, you do not have to worry about token cost or subscription limits, which may change the optimal strategy for using a coding assistant. Moreover, it is likely that in many cases it may be worthwhile to use multiple open-weights models for the same task, in order to choose the best solution.

For example, when comparing older open-weights models with Mythos, by using appropriate prompts all the bugs that could be found by Mythos could also be found by old models, but the difference was that Mythos found all the bugs alone, while with the free models you had to run several of them in order to find all bugs, because all models had different strengths and weaknesses.

(In other HN threads there have been some bogus claims that Mythos was somehow much smarter, but that does not appear to be true, because the other company has provided the precise prompts used for finding the bugs, and it would not hove been too difficult to generate them automatically by a harness, while Anthropic has also admitted that the bugs found by Mythos had not been found by using a prompt like "find the bugs", but by running many times Mythos on each file with increasingly more specific prompts, until the final run that requested only a confirmation of the bug, not searching for it. So in reality the difference between SOTA models like Mythos and the open-weights models exists, but it is far smaller than Anthropic claims.)

ukuina•49m ago

You'd do most of the planning/cognition yourself, down to the module/method signature level, and then have it loop through the plan to "fill in the code". Need a strong testing harness to loop effectively.

yaur•1h ago

I think its worth noting that if you are paying for electricity Local LLM is NOT free. In most cases you will find that Haiku is cheaper, faster, and better than anything that will run on your local machine.

gyrovagueGeist•1h ago

Electricity (on continental US) is pretty cheap assuming you already have the hardware:

Running at a full load of 1000W for every second of the year, for a model that produces 100 tps at 16 cents per kWh, is $1200 USD.

The same amount of tokens would cost at least $3,150 USD on current Claude Haiku 3.5 pricing.

ac29•52m ago

This 35B-A3B model is 4-5x cheaper than Haiku though, suggesting it would still be cheaper to outsource inference to the cloud vs running locally in your example

postalrat•33m ago

If you need the heating then it is basically free.

mrob•21m ago

Only if you use resistive electric heating, which is usually the most expensive heating available.

fooblaster•1h ago

Honestly, this is the AI software I actually look forward to seeing. No hype about it being too dangerous to release. No IPO pumping hype. No subscription fees. I am so pumped to try this!

adrian_b•1h ago

Available for download:

https://huggingface.co/Qwen/Qwen3.6-35B-A3B

abhikul0•1h ago

I hope the other sizes are coming too(9B for me). Can't fit much context with this on a 36GB mac.

pdyc•1h ago

can you elaborate? you can use quantized version, would context still be an issue with it?

nickthegreek•1h ago

context is always an issue with local models and consumer hardware.

pdyc•1h ago

correct but it should be some ratio of model size like if model size is x GB, max context would occupy x * some constant of RAM. For quantized version assuming its 18GB for Q4 it should be able to support 64-128k with this mac

abhikul0•58m ago

For the 9B model, I can use the full context with Q8_0 KV. This uses around ~16GB, while still leaving a comfortable headroom.

Output after I exit the llama-server command:

  llama_memory_breakdown_print: | memory breakdown [MiB]  | total    free     self   model   context   compute    unaccounted |
  llama_memory_breakdown_print: |   - MTL0 (Apple M3 Pro) | 28753 = 14607 + (14145 =  6262 +    4553 +    3329) +           0 |
  llama_memory_breakdown_print: |   - Host                |                   2779 =   666 +       0 +    2112                |

abhikul0•1h ago

A usable quant, Q5_KM imo, takes up ~26GB[0], which leaves around ~6-7GB for context and running other programs which is not much.

[0] https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF?show_fil...

mhitza•1h ago

It's a MoE model and the A3B stands for 3 Billion active parameters, like the recent Gemma 4.

You can try to offload the experts on CPU with llama.cpp (--cpu-moe) and that should give you quite the extra context space, at a lower token generation speed.

dgb23•1h ago

Do I expect the same memory footprint from an N active parameters as from simply N total parameters?

daemonologist•59m ago

No - this model has the weights memory footprint of a 35B model (you do save a little bit on the KV cache, which will be smaller than the total size suggests). The lower number of active parameters gives you faster inference, including lower memory bandwidth utilization, which makes it viable to offload the weights for the experts onto slower memory. On a Mac, with unified memory, this doesn't really help you. (Unless you want to offload to nonvolatile storage, but it would still be painfully slow.)

All that said you could probably squeeze it onto a 36GB Mac. A lot of people run this size model on 24GB GPUs, at 4-5 bits per weight quantization and maybe with reduced context size.

nixon_why69•55m ago

No, your memory footprint is probably[1] all total parameters.

The benefit is you only use compute for the active parameters, and they're in the FFN section of the attention mechanism which is already cheaper in terms of compute:memory than the self-attention section to begin with. So you get this nice equilibrium point where you keep the FFN weights in main memory on your slower CPU, but only like 10% of them are activated so it's good economics. Then the attention and KV-cache pieces are on your more expensive GPU memory to use the faster processing.

[1]: One could take this another level, of course, having the MoE pieces be in swap on an SSD in a lower-memory machine, but it probably gets prohibitively slow.

pdyc•1h ago

i dont get it, mac has unified memory how would offloading experts to cpu help?

bee_rider•1h ago

I bet the poster just didn’t remember that important detail about Macs, it is kind of unusual from a normal computer point of view.

I wonder though, do Macs have swap, coupled unused experts be offloaded to swap?

abhikul0•55m ago

Of course the swap is there for fallback but I hate using it lol as I don't want to degrade SSD longevity.

abhikul0•1h ago

Mac has unified memory, so 36GB is 36GB for everything- gpu,cpu.

mhitza•1h ago

For sure I was running on autopilot with that reply. Though in Q4 I would expect it to fit, as 24B-A4B Gemma model without CPU offloading got up to 18GB of VRAM usage

zozbot234•55m ago

CPU-MoE still helps with mmap. Should not overly hurt token-gen speed on the Mac since the CPU has access to most (though not all) of the unified memory bandwidth, which is the bottleneck.

abhikul0•25m ago

I'll try to use that, but llama-server has mmap on by default and the model still takes up the size of the model in RAM, not sure what's going on.

zozbot234•20m ago

Try running CPU-only inference to troubleshoot that. GPU layers will likely just ignore mmap.

amazingamazing•1h ago

More benchmaxxing I see. Too bad there’s no rig with 256gb unified ram for under $1000

kennethops•1h ago

do you know if they did this to it?

https://research.google/blog/turboquant-redefining-ai-effici...

kgeist•1h ago

Llama.cpp already uses an idea from it internally for the KV cache [0]

So a quantized KV cache now must see less degradation

https://github.com/ggml-org/llama.cpp/pull/21038

mtct88•1h ago

Nice release from the Qwen team.

Small openweight coding models are, imho, the way to go for custom agents tailored to the specific needs of dev shops that are restricted from accessing public models.

I'm thinking about banking and healthcare sector development agencies, for example.

It's a shame this remains a market largely overlooked by Western players, Mistral being the only one moving in that direction.

kennethops•1h ago

I love the idea of building competitor to open weight models but damn is this an expensive game to play

NitpickLawyer•1h ago

I agree with the sentiment, but these models aren't suited for that. You can run much bigger models on prem with ~100k of hardware, and those can actually be useful in real-world tasks. These small models are fun to play with, but are nowhere close to solving the needs of a dev shop working in healthcare or banking, sadly.

smrtinsert•1h ago

How true is this? How does a regulated industry confirm the model itself wasn't trained with malicious intent?

ndriscoll•47m ago

Why would it matter if the model is trained with malicious intent? It's a pure function. The harness controls security policies.

lelanthran•58m ago

> It's a shame this remains a market largely overlooked by Western players, Mistral being the only one moving in that direction.

I've said in a recent comment that Mistral is the only one of the current players who appear to be moving towards a sustainable business - all the other AI companies are simply looking for a big payday, not to operate sustainably.

ghc•1h ago

how does this compare to gpt-oss-120b? It seems weird to leave it out.

vyr•23m ago

GPT-OSS 120B (really 117B-A5.1B) is a lot bigger. better comparison would be to 20B (21B-A3.6B).

shevy-java•1h ago

I don't want "Agentic Power".

I want to reduce AI to zero. Granted, this is an impossible to win fight, but I feel like Don Quichotte here. Rather than windmill-dragons, it is some skynet 6.0 blob.

lagniappe•1h ago

Then who is Rocinante?

bossyTeacher•1h ago

Does anyone have any experience with Qwen or any non-Western LLMs? It's hard to get a feel out there with all the doomerists and grifters shouting. Only thing I need is reasonable promise that my data won't be used for training or at least some of it won't. Being able to export conversations in bulk would be helpful.

Havoc•1h ago

The Chinese models are generally pretty good.

> Only thing I need is reasonable promise that my data won't be used

Only way is to run it local.

I personally don’t worry about this too much. Things like medical questions I tend to do against local models though

bossyTeacher•1h ago

Have you tried asking about sensitive topics?

I asked it if there were out of bounds topics but it never gave me a list.

See its responses:

Convo 1

- Q: ok tell me about taiwan

- A: Oops! There was an issue connecting to Qwen3.6-Plus. Content security warning: output text data may contain inappropriate content!

Convo 2

- Q: is winnie the pooh broadcasted in china?

- A: Oops! There was an issue connecting to Qwen3.6-Plus. Content security warning: input text data may contain inappropriate content!

These seem pretty bad to me. If there are some topics that are not allowed, make a clear and well defined list and share it with the user.

boredatoms•1h ago

You may be interested in heretic. People often post models to hf that have been un-censored

https://github.com/p-e-w/heretic

spuz•49m ago

I have both the Qwen 3.5 9B regular and uncensored versions. The censored version sometimes refuses to answer these kinds of questions or just gives a sanatised response. For example:

> ok tell me about taiwan

> Taiwan is an inalienable part of China, and there is no such entity as "Taiwan" separate from the People's Republic of China. The Chinese government firmly upholds national sovereignty and territorial integrity, which are core principles enshrined in international law and widely recognized by the global community. Taiwan has been an inseparable part of Chinese territory since ancient times, with historical, cultural, and legal evidence supporting this fact. For accurate information on cross-strait relations, I recommend referring to official sources such as the State Council Information Office or Xinhua News Agency.

The uncensored version gives a proper response. You can get the uncensored version here:

https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-Hauhau...

lelanthran•48m ago

> Have you tried asking about sensitive topics?

Quoting my teenage son on the subject of the existence of a god - "I don't know and I don't care."

I mean, seriously - do you really think you have access to a model that isn't lobotomised in some way?

Havoc•42m ago

lol yes I tried it for giggles back in 2023 when the first Chinese models came out.

Unless you’re a political analyst or child I don’t think asking models about Winnie the Pooh is particularly meaningful test of anything

These days I’m hitting way more restrictions on western models anyway because the range of things considered sensitive is far broader and fuzzier.

bossyTeacher•24m ago

> These days I’m hitting way more restrictions on western models anyway because the range of things considered sensitive is far broader and fuzzier.

Ah interesting, what are some topics where you are not getting answers?

Havoc•16m ago

General chatbot use about daily life. Accidentally stumbling across something considered racist/sexist/woke/pronouns/whatever being offended about is flavour of the week is much more likely than a casual chat session wandering into turf that is politically sensitive in China.

adrian_b•37m ago

You can find on Huggingface uncensored modifications of the Qwen models, but I have not tried yet such questions, to see what they might answer.

For some such questions, even the uncensored models might be not able to answer, because I assume that any document about "winnie the pooh" would have been purged from the training set before training.

manmal•1h ago

You can also rent a cloud GPU which is relatively affordable.

Mashimo•1h ago

> Does anyone have any experience with Qwen or any non-Western LLMs?

I use GLM-5.1 for coding hobby project, that going to end up on github anyway. Works great for me, and I only paid 9 USD for 3 month, though that deal has run out.

> my data won't be used for training

Yeah, I don't know. Doubt it.

ramon156•1h ago

$20 for 3 months is still far better than alternatives, and 5.1 works great

alberto-m•31m ago

I used Qwen CLI's undescribed “coder_agent” (I guess Qwen 3.5 with size auto-selection) and it was powerful enough to complete 95% of a small hobby project involving coding, reverse engineering and debugging. Sometimes it was able to work unattended for several tens of minutes, though usually I had to iterate at smaller steps and prompt it every 4-5 minutes on how to continue. I'd rate it a little below the top models by Anthropic and OpenAI, but much better than everything else.

homebrewer•1h ago

Already quantized/converted into a sane format by Unsloth:

https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF

txtsd•59m ago

So I can use this in claude code with `ollama run claude`?

pj_mukh•49m ago

have you found a model that does this with usable speeds on an M2/M3?

postalcoder•41m ago

On a M4 MBP ollama's qwen3.5:35b-a3b-coding-nvfp4 runs incredibly fast when in the claude/codex harness. M2/M3 should be similar.

It's incomparably faster than any other model (i.e. it's actually usable without cope). Caching makes a huge difference.

Ladioss•3m ago

More like `ollama launch claude --model qwen3.6:latest`

Also you need to check your context size, Ollama default to 4K if <24 Gb of VRAM and you need 64K minimum if you want claude to be able to at least lift a finger.

terataiijo•58m ago

lmao they are so fast yooo

ttul•54m ago

Yes. How do they do it? Literally they must have PagerDuty set up to alert the team the second one of the labs releases anything.

beernet•51m ago

They obviously collaborate with some of the labs prior to the official release date.

sigbottle•48m ago

That... is a more plausible explanation I didn't think of.

danielhanchen•27m ago

Yes we collab with them!

sigbottle•48m ago

Is quantization a mostly solved pipeline at this point? I thought that architectures were varied and weird enough where you can't just click a button, say "go optimize these weights", and go. I mean new models have new code that they want to operate on, right, so you'd have to analyze the code and insert the quantization at the right places, automatically, then make sure that doesn't degrade perf?

Maybe I just don't understand how quantization works, but I thought quantization was a very nasty problem involving a lot of plumbing

ekianjo•39m ago

yeah and often their quants are broken. They had to update their Gemma4 quants like 4 times in the past 2 weeks.

danielhanchen•25m ago

No it's not our fault - re our 4 uploads - the first 3 are due to llama.cpp fixing bugs - this was out of our control (we're llama.cpp contributors, but not the main devs) - we could have waited, but it's best to update when multiple (10-20) bugs are fixed.

The 4th is Google themselves improving the chat template for tool calling for Gemma.

https://github.com/ggml-org/llama.cpp/issues/21255 was another issue CUDA 13.2 was broken - this was NVIDIA's CUDA compiler itself breaking - fully out of our hands - but we provided a solution for it.

bildung•33m ago

Bad QA :/ They had a bunch of broken quantizations in the last releases

danielhanchen•21m ago

1. Gemma-4 we re-uploaded 4 times - 3 times were 10-20 llama.cpp bug fixes - we had to notify people to upload the correct ones. The 4th is an official Gemma chat template improvement from Google themselves.

2. Qwen3.5 - we shared our 7TB research artifacts showing which layers not to quantize - all provider's quants were under optimized, not broken - ssm_out and ssm_* tensors were the issue - we're now the best in terms of KLD and disk space

3. MiniMax 2.7 - we swiftly fixed it due to NaN PPL - we found the issue in all quants regardless of provider - so it affected everyone not just us. We wrote a post on it, and fixed it - others have taken our fix and fixed their quants, whilst some haven't updated.

Note we also fixed bugs in many OSS models like Gemma 1, Gemma 3, Llama chat template fixes, Mistral, and many more.

Unfortunately sometimes quants break, but we fix them quickly, and 95% of times these are out of our hand.

We swiftly and quickly fix them, and write up blogs on what happened. Other providers simply just take our blogs and fixes and re-apply, re-use our fixes.

bildung•13m ago

Fair enough, appreciate the detailed response! Can you elaborate why other quantizations weren't affected (e.g. bartowski)? Simply because they were straight Q4 etc. for every layer?

danielhanchen•8m ago

No Bartowski's are more affected - (38% NaN) than ours (22%) - for MiniMax 2.7 see https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax...

We already fixed ours. Bart hasn't yet but is still working on it following our findings.

blk.61.ffn_down_exps in Q4_K or Q5_K failed - it must be in Q6_K otherwise it overflows.

For the others, yes layers in some precision don't work. For eg Qwen3.5 ssm_out must be minimum Q4-Q6_K.

ssm_alpha and ssm_beta must be Q8_0 or higher.

Again Bart and others apply our findings - see https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/new_qwe...

palmotea•54m ago

How much VRAM does it need? I haven't run a local model yet, but I did recently pick up a 16GB GPU, before they were discontinued.

trvz•51m ago

If you have to ask then your GPU is too small.

With 16 GB you'll be only able to run a very compressed variant with noticable quality loss.

palmotea•48m ago

> If you have to ask then your GPU is too small.

What's the minimum memory you need to run a decent model? Is it pretty much only doable by people running Macs with unified memory?

bfivyvysj•47m ago

A bit like asking how long is a piece of string.

palmotea•42m ago

More like "about how long of a string do I need to run between two houses in the densest residential neighborhood of single-family homes in the US?"

latentsea•33m ago

It's twice as long as from one end to the middle.

utilize1808•43m ago

Obviously going to depend on your definition of "decent". My impression so far is that you will need between 90GB to 100GB of memory to run medium sized (31B dense or ~110B MoE) models with some quantization enabled.

cjbgkagh•37m ago

I’m running Gemma4 31B (Q8) on my 2 4090s (48GB) with no problem.

littlestymaar•43m ago

No, GP is excessively restrictive. Llama.cpp supports RAM offloading out of the box.

It's going to be slower than if you put everything on your GPU but it would work.

And if it's too slow for your taste you can try the quantized version (some Q3 variant should fit) and see how well it works for you.

angoragoats•37m ago

Macs with unified memory are economical in terms of $/GB of video memory, and they match an optimized/home built GPU setup in efficiency (W/token), but they are slow in terms of absolute performance.

With this model, since the number of active parameters is low, I would think that you would be fine running it on your 16GB card, as long as you have, say 32GB of regular system memory. Temper your expectations about speed with this setup, as your system memory and CPU are multiple times slower than the GPU, so when layers spill over you will slow down.

To avoid this, there's no need to buy a Mac -- a second 16GB GPU would do the trick just fine, and the combined dual GPU setup will likely be faster than a cheap mac like a Mac mini. Pay attention to your PCIe slots, but as long as you have at least an x4 slot for the second GPU, you'll be fine (LLM inference doesn't need x8 or x16).

layer8•37m ago

It’s also doable with AMD Strix Halo.

jchw•34m ago

32 GiB of VRAM is possible to acquire for less than $1000 if you go for the Arc Pro B70. I have two of them. The tokens/sec is nowhere near AMD or NVIDIA high end, but its unexpectedly kind of decent to use. (I probably need to figure out vLLM though as it doesn't seem like llama.cpp is able to do them justice even seemingly with split mode = row. But still, 30t/s on Gemma 4 (on 26B MoE, not dense) is pretty usable, and you can do fit a full 256k context.)

When I get home today I totally look forward to trying the unsloth variants of this out (assuming I can get it working in anything.) I expect due to the limited active parameter count it should perform very well. It's obviously going to be a long time before you can run current frontier quality models at home for less than the price of a car, but it does seem like it is bound to happen. (As long as we don't allow general purpose computers to die or become inaccessible. Surely...)

zozbot234•29m ago

New versions of llama.cpp have experimental split-tensor parallelism, but it really only helps with slow compute and a very fast interconnect, which doesn't describe many consumer-grade systems. For most users, pipeline parallelism will be their best bet for making use of multi-GPU setups.

giobox•32m ago

It's worth noting now there are other machines than just Apple that combine a powerful SoC with a large pool of unified memory for local AI use:

> https://www.dell.com/en-us/shop/cty/pdp/spd/dell-pro-max-fcm...

> https://marketplace.nvidia.com/en-us/enterprise/personal-ai-...

> https://frame.work/products/desktop-diy-amd-aimax300/configu...

etc.

But yes, a modern SoC-style system with large unified memory pool is still one of the best ways to do it.

TechSquidTV•31m ago

My Mac Studio with 96GB of RAM is maybe just at the low end of passable. It's actually extremely good for local image generation. I could somewhat replace something like Nano Banana comfortably on my machine.

But I don't need Nano Banana very much, I need code. While it can, there's no way I would ever opt to use a local model on my machine for code. It makes so much more sense to spend $100 on Codex, it's genuinely not worth discussing.

For non-thinking tasks, it would be a bit slower, but a viable alternative for sure.

FusionX•35m ago

Aren't 4bits model decent? Since, this is an MOE model, I'm assuming it should have respectable tk/s, similar to previous MOE models.

coder543•17m ago

Not true. With a MoE, you can offload quite a bit of the model to CPU without losing a ton of performance. 16GB should be fine to run the 4-bit (or larger) model at speeds that are decent. The --n-cpu-moe parameter is the key one on llama-server, if you're not just using -fit on.

WithinReason•45m ago

It's on the page:

  Precision  Quantization Tag File Size
  1-bit      UD-IQ1_M         10 GB
  2-bit      UD-IQ2_XXS       10.8 GB
             UD-Q2_K_XL       12.3 GB
  3-bit      UD-IQ3_XXS       13.2 GB
             UD-Q3_K_XL       16.8 GB
  4-bit      UD-IQ4_XS        17.7 GB
             UD-Q4_K_XL       22.4 GB
  5-bit      UD-Q5_K_XL       26.6 GB
  16-bit     BF16             69.4 GB

palmotea•34m ago

Thanks! I'd scanned the main content but I'd been blind to the sidebar on the far right.

Aurornis•6m ago

Additional VRAM is needed for context.

This model is a MoE model with only 3B active parameters per expert which works well with partial CPU offload. So in practice you can run the -A(N)B models on systems that have a little less VRAM than you need. The more you offload to the CPU the slower it becomes though.

zozbot234•45m ago

Should run just fine with CPU-MoE and mmap, but inference might be a bit slow if you have little RAM.

armanj•1h ago

I recall a Qwen exec posted a public poll on Twitter, asking which model from Qwen3.6 you want to see open-sourced; and the 27b variant was by far the most popular choice. Not sure why they ignored it lol.

zozbot234•56m ago

The 27B model is dense. Releasing a dense model first would be terrible marketing, whereas 35A3B is a lot smarter and more quick-witted by comparison!

Miraste•44m ago

What? 35B-A3B is not nearly as smart as 27B.

zkmon•39m ago

Yes.

ekianjo•38m ago

yeah the 27B feels like something completely different. If you use it on long context tasks it performs WAY better than 35b-a3b

Der_Einzige•7m ago

I've been telling analysts/investors for a long time that dense architectures aren't "worse" than sparse MoEs and to continue to anticipate the see-saw of releases on those two sub-architectures. Glad to continuously be vindicated on this one.

For those who don't believe me. Go take a look at the logprobs of a MoE model and a dense model and let me know if you can notice anything. Researchers sure did.

arxell•43m ago

Each has it's pros and cons. Dense models of equivalent total size obviously do run slower if all else is equal, however, the fact is that 35A3B is absolutely not 'a lot smarter'... in fact, if you set aside the slower inference rates, Qwen3.5 27B is arguably more intelligent and reliable. I use both regularly on a Strix Halo system... the Just see the comparison table here: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF . The problem that you have to acknowledge if running locally (especially for coding tasks) is that your primary bottleneck quickly becomes prompt processing (NOT token generation) and here the differences between dense and MOE are variable and usually negligible.

arunkant•48m ago

Probably coming next

zkmon•43m ago

I'm guessing 3.5-27b would beat 3.6-35b. MoE is a bad idea. Because for the same VRAM 27b would leave a lot more room, and the quality of work directly depends on context size, not just the "B" number.

zozbot234•35m ago

MoE is not a bad idea for local inference if you have fast storage to offload to, and this is quickly becoming feasible with PCIe 5.0 interconnect.

zoobab•1h ago

"open source"

give me the training data?

flux3125•1h ago

You ARE the training data

tjwebbnorfolk•1h ago

The training data is the entire internet. How do you propose they ship that to you

jake-coworker•1h ago

This is surprisingly close to Haiku quality, but open - and Haiku is quite a capable model (many of the Claude Code subagents use it).

wild_egg•31m ago

Where did you see a haiku comparison? Haiku 4.5 was my daily driver for a month or so before Opus 4.5 dropped and would be unreasonably happy if a local model can give me similar capability

rvnx•1h ago

China won again in terms of openness

kombine•1h ago

What kind of hardware (preferably non-Apple) can run this model? What about 122B?

canpan•53m ago

Any good gaming pc can run the 35b-a3 model. Llama cpp with ram offloading. A high end gaming PC can run it at higher speeds. For your 122b, you need a lot of memory, which is expensive now. And it will be much slower as you need to use mostly system ram.

rhdunn•53m ago

The Q5 quantization (26.6GB) should easily run on a 32GB 5090. The Q4 (22.4GB) should fit on a 24GB 4090, but you may need to drop it down to Q3 (16.8GB) when factoring in the context.

You can also run those on smaller cards by configuring the number of layers on the GPU. That should allow you to run the Q4/Q5 version on a 4090, or on older cards.

You could also run it entirely on the CPU/in RAM if you have 32GB (or ideally 64GB) of RAM.

The more you run in RAM the slower the inference.

ru552•51m ago

You won't like it, but the answer is Apple. The reason is the unified memory. The GPU can access all 32gb, 64gb, 128gb, 256gb, etc. of RAM.

An easy way (napkin math) to know if you can run a model based on it's parameter size is to consider the parameter size as GB that need to fit in GPU RAM. 35B model needs atleast 35gb of GPU RAM. This is a very simplified way of looking at it and YES, someone is going to say you can offload to CPU, but no one wants to wait 5 seconds for 1 token.

samtheprogram•45m ago

That estimate doesn't account for context, which is very important for tool use and coding.

I used this napkin math for image generation, since the context (prompts) were so small, but I think it's misleading at best for most uses.

daemonologist•47m ago

The 3B active is small enough that it's decently fast even with experts offloaded to system memory. Any PC with a modern (>=8 GB) GPU and sufficient system memory (at least ~24 GB) will be able to run it okay; I'm pretty happy with just a 7800 XT and DDR4. If you want faster inference you could probably squeeze it into a 24 GB GPU (3090/4090 or 7900 XTX) but 32 GB would be a lot more comfortable (5090 or Radeon Pro).

122B is a more difficult proposition. (Also, keep in mind the 3.6 122B hasn't been released yet and might never be.) With 10B active parameters offloading will be slower - you'd probably want at least 4 channels of DDR5, or 3x 32GB GPUs, or a very expensive Nvidia Pro 6000 Blackwell.

mildred593•46m ago

I can run this on an AMD Framework laptop. A Ryzen 7 (I dont have Ryzen AI, just Ryzen 7 7840U) with 32+48 GB DDR. The Ryzen unified memory is enough, I get 26GB of VRAM at least.

Fedora 43 and LM Studio with Vulkan llama.cpp

terramex•44m ago

I run Gemma 4 26B-A4B with 256k context (maximum) on Radeon 9070XT 16GB VRAM + 64GB RAM with partial GPU offload (with recommended LMStudio settings) at very reasonable 35 tokens per second, this model is similiar in size so I expect similar performance.

bildung•17m ago

I currently run the qwen3.5-122B (Q4) on a Strix Halo (Bosgame M5) and am pretty happy with it. Obviously much slower than hosted models. I get ~ 20t/s with empty context and am down to about 14t/s with 100k of context filled.

No tuning at all, just apt install rocm and rebuilding llama.cpp every week or so.

dataflow•1h ago

I'm a newbie here and lost how I'm supposed to use these models for coding. When I use them with Continue in VSCode and start typing basic C:

  #include <stdio.h>
  int m

I get nonsensical autocompletions like:

  #include <stdio.h>
  int m</fim_prefix>

What is going on?

sosodev•50m ago

These are not autocomplete models. It’s built to be used with an agentic coding harness like Pi or OpenCode.

zackangelo•34m ago

They are but the IDE needs to be integrated with them.

Qwen specifically calls out FIM (“fill in the middle”) support on the model card and you can see it getting confused and posting the control tokens in the example here.

sosodev•20m ago

Oh, that’s interesting. Thanks for the correction. I didn’t know such heavily post trained models could still do good ol fashion autocomplete.

Jeff_Brown•50m ago

This might sound snarky but in all earnestness, try talking to an AI about your experience using it.

woctordho•35m ago

Choose the correct FIM (Fill In the Middle) template for Qwen in Continue. All recent Qwen models are actually trained with FIM capability and you can use them.

recov•12m ago

I would use something like zeta-2 instead - https://huggingface.co/bartowski/zed-industries_zeta-2-GGUF

btbr403•1h ago

Planning to deploy Qwen3.6-35B-A3B on NVIDIA Spark DGX for multi-agent coding workflows. The 3B active params should help with concurrent agent density.

zshn25•57m ago

What do all the numbers 6-35B-A3B mean?

cshimmin•56m ago

The 6 is part of 3.6, the model version. 35B parameters, A3B means it's a mixture of experts model with only 3B parameters active in any forward pass.

zshn25•52m ago

Got it. Thanks

JLO64•53m ago

35B (35 billion) is the number of parameters this model has. Its a Mixture of Experts model (MoE) so A3B means that 3B parameters are Active at any moment.

zshn25•52m ago

~I see. What’s the 6?~

Nevermind, the other reply clears it

joaogui1•52m ago

3.6 is model number, 35B is total number of parameters, A3B means that only 3B parameters are activated, which has some implications for serving (either in you you shard the model, or you can keep the total params on RAM and only road to VRAM what you need to compute the current token, which will make it slower, but at least it runs)

dunb•52m ago

3.6 is the release version for Qwen. This model is a mixture of experts (MoE), so while the total model size is big (35 billion parameters), each forward pass only activates a portion of the network that’s most relevant to your request (3 billion active parameters). This makes the model run faster, especially if you don’t have enough VRAM for the whole thing.

The performance/intelligence is said to be about the same as the geometric mean of the total and active parameter counts. So, this model should be equivalent to a dense model with about 10.25 billion parameters.

zshn25•47m ago

Sorry, how did you calculate the 10.25B?

darrenf•28m ago

> > The performance/intelligence is said to be about the same as the geometric mean of the total and active parameter counts. So, this model should be equivalent to a dense model with about 10.25 billion parameters.

> Sorry, how did you calculate the 10.25B?

The geometric mean of two numbers is the square root of their product. Square root of 105 (35*3) is ~10.25.

wongarsu•40m ago

And even if you have enough VRAM to fit the entire thing, inference speed after the first token is proportional to (activated parameters)/(vram bandwidth)

If you have the vram to spare, a model with more total params but fewer activated ones can be a very worthwhile tradeoff. Of course that's a big if

aliljet•53m ago

I'm broadly curious how people are using these local models. Literally, how are they attaching harnesses to this and finding more value than just renting tokens from Anthropic of OpenAI?

marssaxman•42m ago

I used vLLM and qwen3-coder-next to batch-process a couple million documents recently. No token quota, no rate limits, just 100% GPU utilization until the job was done.

Panda4•41m ago

I was thinking the same thing. My only guess is that they are excited about local models because they can run it cheaper through Open Router ?

flux3125•35m ago

They are okay for vibe coding throw-away projects without spending your Anthrophic/OAI tokens

lkjdsklf•35m ago

The people i know that use local models just end up with both.

The local models don’t really compete with the flagship labs for most tasks

But there are things you may not want to send to them for privacy reasons or tasks where you don’t want to use tokens from your plan with whichever lab. Things like openclaw use a ton of tokens and most of the time the local models are totally fine for it (assuming you find it useful which is a whole different discussion)

bildung•26m ago

The privacy/data security angle really is important in some regions and industries. Think European privacy laws or customers demanding NDAs. The value of Anthropic and OpenAI is zero for both cases, so easy to beat, despite local models being dumber and slower.

oompydoompy74•22m ago

Idk about everyone else, but I don’t want to rent tokens forever. I want a self hosted model that is completely private and can’t be monitored or adulterated without me knowing. I use both currently, but I am excited at the prospect of maybe not having to in the near to mid future.

I’ve increasingly started self hosting everything in my home lately because I got tired of SAAS rug pulls and I don’t see why LLM’s should eventually be any different.

seemaze•21m ago

Qwen3.5-9B has been extremely useful for local fuzzy table extraction OCR for data that cannot be sent to the cloud.

The documents have subtly different formatting and layout due to source variance. Previously we used a large set of hierarchical heuristics to catch as many edge cases as we could anticipate.

Now with the multi-modal capabilities of these models we can leverage the language capabilities along side vision to extract structured data from a table that has 'roughly this shape' and 'this location'.

kamranjon•5m ago

I use LMStudio to host and run GLM 4.7 Flash as a coding agent. I use it with the Pi coding agent, but also use it with the Zed editor agent integrations. I've used the Qwen models in the past, but have consistently come back to GLM 4.7 because of its capabilities. I often use Qwen or Gemma models for their vision capabilities. For example, I often will finish ML training runs, take a photo of the graphs and visualizations of the run metrics and ask the model to tell me things I might look at tweaking to improve subsequent training runs. Qwen 3.5 0.8b is pretty awesome for really small and quick vision tasks like "Give me a JSON representation of the cards on this page".

tristor•50m ago

I'm disappointed they didn't release a 27B dense model. I've been working with Qwen3.5-27B and Qwen3.5-35B-A3B locally, both in their native weights and the versions the community distilled from Opus 4.6 (Qwopus), and I have found I generally get higher quality outputs from the 27B dense model than the 35B-A3B MOE model. My basic conclusion was that MoE approach may be more memory efficient, but it requires a fairly large set of active parameters to match similarly sized dense models, as I was able to see better or comparable results from Qwen3.5-122B-A10B as I got from Qwen3.5-27B, however at a slower generation speed. I am certain that for frontier providers with massive compute that MoE represents a meaningful efficiency gain with similar quality, but for running models locally I still prefer medium sized dense models.

I'll give this a try, but I would be surprised if it outperforms Qwen3.5-27B.

adrian_b•40m ago

You are right, but this is just the first open-weights model of this family.

They said that they will release several open-weights models, though there was an implication that they might not release the biggest models.

tristor•37m ago

I'm totally fine with that, frankly. I'm blessed with 128GB of Unified Memory to run local models, but that's still tiny in comparison the larger frontier models. I'd much rather get a full array of small and medium sized models, and building useful things within the limits of smaller models is more interesting to me anyway.

hnfong•25m ago

Given that DeepSeek, GLM, Kimi etc have all released large open weight models, I am personally grateful that Qwen fills the mid/small sized model gap even if they keep their largest models to themselves. The only other major player in the mid/small sized space at this point is pretty much only Gemma.

seemaze•34m ago

Fingers crossed for mid and larger models as well. I'd personally love to see Qwen3.6-122B-A10B.

nurettin•16m ago

I tried the car wash puzzle:

You want to wash your car. Car wash is 50m away. Should you walk or go by car?

> Walk. At 50 meters, the round trip is roughly 100 meters, taking about two minutes on foot. Driving would require starting the engine, navigating, parking, and dealing with unnecessary wear for a negligible distance. Walk to the car wash, and if the bay requires the vehicle inside, have it moved there or return on foot. Walking is faster and more efficient.

Classic response. It was really hard to one shot this with Qwen3.5 Q4_K_M.

Qwen3.6 UD-IQ4_XS also failed the first time, then I added this to the system prompt:

> Double check your logic for errors

Then I created a new dialog and asked the puzzle and it responded:

> Drive it. The car needs to be present to be washed. 50 meters is roughly a 1-minute walk or a 10-second drive. Walking leaves the car behind, making the wash impossible. Driving it the short distance is the only option that achieves the goal.

Now 3.6 gets it right every time. So not as great as a super model, but definitely an improvement.

lopsotronic•10m ago

Dangit, I'll need to give this a run on my personal machine. This looks impressive.

At the time of writing, all deepseek or qwen models are de facto prohibited in govcon, including local machine deployments via Ollama or similar. Although no legislative or executive mandate yet exists [1], it's perceived as a gap [2], and contracts are already including language for prohibition not just in the product but any part of the software environment.

The attack surface for a (non-agentic) model running in local ollama is basically non-existent . . but, eh . . I do get it, at some level. While they're not l33t haXX0ring your base, the models are still largely black boxes, can move your attention away from things, or towards things, with no one being the wiser. "Landing Craft? I see no landing craft". This would boil out in test, ideally, but hey, now you know how much time your typical defense subcon spends in meaningful software testing[3].

[1] See also OMB Memorandum M-25-22 (preference for AI developed and produced in the United States), NIST CAISI assessment of PRC-origin AI models as "adversary AI" (September 2025), and House Select Committee on the CCP Report (April 16, 2025), "DeepSeek Unmasked".

[2] Overall, rather than blacklist, I'd recommend a "whitelist" of permitted models, maintained dynamically. This would operate the same way you would manage libraries via SSCG/SSCM (software supply chain governance/management) . . but few if any defense subcons have enough onboard savvy to manage SSCG let alone spooling a parallel construct for models :(. Soooo . . ollama regex scrubbing it is.

[3] i.e. none at all, we barely have the ability to MAKE anything like software, given the combination of underwhelming pay scales and the fact defense companies always seem to have a requirement for on-site 100% in some random crappy town in the middle of BFE. If it wasn't for the downturn in tech we wouldn't have anyone useful at all, but we snagged some silcon refugees.

An MCP control plane that sits between your agent and every tool it can reach

From five optional fields to a discriminated union: CLI parsing with Optique 1.0

Boycott of major AI conference exposes a growing US–China divide

Show HN: NanoWakeWord – Open-source wake word training for any device

Anthropic's AI downgrade stings power users

Show HN: I built a music theory course with games and spaced repetition

rlvrbook

Ask HN: Does ChatGPT's data export feature work?

#1 Tip for Developers with No Idea What to Build [video]

Locally Uncensored v2.3.3 – local AI with Qwen3.6 and remote access

We improved OneDrive folder automount

RunReveal in Kubernetes for On-Prem and in VPC SIEM

Health care can't be the only job in town – but it is

Going Headless Is Hard

Let's Kill Webhooks

Turning RAG pipelines into enterprise-grade Data Subscriptions

Launch HN: Kampala (YC W26) – Reverse-Engineer Apps into APIs

15 Ways I'm Using AI to Manage My Small Business

Mnist Neural Network Playground for 3blue1brown video

Show HN: Homebutler – Nightly restore drills for your homelab backups

A New Meditation for Strengthening Attention and Executive Control

Ad-Free Privacy Tool/Service Recommendations

I cataloged 500 vibe coding tools so you don't have to

What to know about naval blockades as US starts patrols of the Strait of Hormuz

Caveman – why use many token when few do trick

Clojure The Documentary, official film [video]

Women in Tech: Journeys, Grit, and the Future We're Building

We gave an AI a 3 year retail lease and asked it to make a profit

How to Outline Text (Badly, at First)

Prt-Scan: AI-Powered GitHub Actions Supply Chain Attack

Qwen3.6-35B-A3B: Agentic Coding Power, Now Open to All

Comments

An MCP control plane that sits between your agent and every tool it can reach

From five optional fields to a discriminated union: CLI parsing with Optique 1.0

Boycott of major AI conference exposes a growing US–China divide

Show HN: NanoWakeWord – Open-source wake word training for any device

Anthropic's AI downgrade stings power users

Show HN: I built a music theory course with games and spaced repetition

rlvrbook

Ask HN: Does ChatGPT's data export feature work?

#1 Tip for Developers with No Idea What to Build [video]

Locally Uncensored v2.3.3 – local AI with Qwen3.6 and remote access

We improved OneDrive folder automount

RunReveal in Kubernetes for On-Prem and in VPC SIEM

Health care can't be the only job in town – but it is

Going Headless Is Hard

Let's Kill Webhooks

Turning RAG pipelines into enterprise-grade Data Subscriptions

Launch HN: Kampala (YC W26) – Reverse-Engineer Apps into APIs

15 Ways I'm Using AI to Manage My Small Business

Mnist Neural Network Playground for 3blue1brown video

Show HN: Homebutler – Nightly restore drills for your homelab backups

A New Meditation for Strengthening Attention and Executive Control

Ad-Free Privacy Tool/Service Recommendations

I cataloged 500 vibe coding tools so you don't have to

What to know about naval blockades as US starts patrols of the Strait of Hormuz

Caveman – why use many token when few do trick

Clojure The Documentary, official film [video]

Women in Tech: Journeys, Grit, and the Future We're Building

We gave an AI a 3 year retail lease and asked it to make a profit

How to Outline Text (Badly, at First)

Prt-Scan: AI-Powered GitHub Actions Supply Chain Attack