Deepseek R1 also has a MTP layer (layer 61) https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/mod...
But Deepseek R1 adds embed_tokens and shared_head.head tensors, which are [129280, 7168] or about 2GB in size at FP8.
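The "about 2GB" figure is easy to sanity-check from the quoted shape (a back-of-the-envelope sketch, assuming FP8 means 1 byte per element):

```python
# Two [129280, 7168] tensors (embed_tokens and shared_head.head)
# at FP8, i.e. 1 byte per element.
vocab, hidden = 129280, 7168
total_gb = 2 * vocab * hidden / 1e9  # two tensors, 1 byte each
print(f"~{total_gb:.2f} GB")         # ~1.85 GB
```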
Qwen3-Next doesn't have that: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct/blob...
So it saves a few GB in active parameters for MTP, which is a Big Deal. This is one of the changes that significantly speeds up inference.
Instead of generating tokens strictly one at a time, the model also predicts a second token ahead, and speculative decoding verifies that second token (instead of having it come from a separate draft model like Qwen 0.6B). If the draft token passes the check, the 2nd token arrives MUCH faster.
If it's wrong, you have to generate it again the normal way (a lot slower than just checking it). Usually it's correct, so inference is a lot faster overall.
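The accept/reject loop above can be sketched with a toy model. Everything here is a stand-in (the model deterministically predicts token + 1, and the draft head is made deliberately flaky to exercise the rejection path); it's the shape of the technique, not Qwen3-Next's actual implementation:

```python
# Toy sketch of MTP-style (self-)speculative decoding.

def forward_pass(seq):
    """One expensive forward pass: the model's next-token prediction
    at every position of seq (toy rule: each token predicts token+1)."""
    return [(t + 1) % 100 for t in seq]

def mtp_draft(ids, flaky=False):
    """Cheap MTP-head guess for the next token; optionally wrong on
    multiples of 10 to exercise the rejection path."""
    guess = (ids[-1] + 1) % 100
    if flaky and guess % 10 == 0:
        guess = (guess + 50) % 100  # deliberately wrong draft
    return guess

def generate(prompt, n, flaky=False):
    ids, passes, draft = list(prompt), 0, None
    target = len(prompt) + n
    while len(ids) < target:
        # Tentatively include the draft token; the same expensive pass
        # both verifies it and predicts the token after it.
        seq = ids if draft is None else ids + [draft]
        preds = forward_pass(seq)
        passes += 1
        if draft is None:
            ids.append(preds[-1])        # plain decoding step
        elif draft == preds[-2]:         # full model agrees with draft
            ids.append(draft)            # accepted: (nearly) free token
            if len(ids) < target:
                ids.append(preds[-1])    # next token from the same pass
        else:
            ids.append(preds[-2])        # rejected: take the real token
        draft = mtp_draft(ids, flaky) if len(ids) < target else None
    return ids, passes
```

With the toy model, `generate([1], 4)` yields four tokens in three forward passes instead of four; when a draft is rejected, the pass that checked it already computed the correct token, so nothing extra is wasted beyond the lost speculation.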
This stuff can run on a local machine without internet access, correct?
And it can pretty much match Nano Banana? https://github.com/PicoTrex/Awesome-Nano-Banana-images/blob/...
Also -- what are the specs for a machine to run it (even if slowly!)
Make sure to lurk on r/LocalLlama.
The model discussed here is a text model, similar to ChatGPT. You'll be able to run it on your local machine too, but not yet: the apps (llama.cpp, Ollama, etc.) need to be updated with Qwen3-Next support first.
This has nothing to do with nano banana, or image generation. For that you want the qwen image edit[1] models.
Yes.
> And it can pretty much match Nano Banana?
No, Qwen3-Next is not a multimodal model, it has no image generation function.
But in practice you need a bit more than that. You also need some space for context, and then for kv cache, potentially a model graph, etc.
So you'll see in practice that you need 20-50% more RAM than this rule of thumb.
For this model, you'll need anywhere from 50GB (tight) to 200GB (full) RAM, but it also depends on how you run it. With MoE models, you can selectively load some experts (parts of the model) into VRAM while offloading the rest to RAM. Or you could run it fully on CPU+RAM, since the active parameter count is low (3B). This should work pretty well even on older systems (DDR4).
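The 50-200GB range falls out of a simple rule of thumb: weights take roughly (params x bits / 8) bytes, plus the 20-50% overhead for context, KV cache, graph, etc. A rough sketch (the overhead factor is an illustrative assumption, not a measured value):

```python
# Back-of-the-envelope RAM estimate for a quantized model.
def est_ram_gb(total_params_b, bits_per_weight, overhead=0.3):
    """total_params_b in billions; 1B params ~ 1 GB at 8-bit."""
    weights_gb = total_params_b * bits_per_weight / 8
    return weights_gb * (1 + overhead)  # context/KV-cache/graph headroom

# Qwen3-Next-80B at common quantizations (approximate):
for bits in (4, 8, 16):
    print(f"{bits:2d}-bit: ~{est_ram_gb(80, bits):.0f} GB")
```

That gives roughly 52 GB at 4-bit, 104 GB at 8-bit, and 208 GB at 16-bit, which lines up with the "50GB (tight) to 200GB (full)" range.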
Here's a classic ASCII art representation of SpongeBob SquarePants:
.------.
/ o o \
| |
| \___/ |
\_______/
llm -m chutes/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
"An ASCII of spongebob"
Here's an ASCII art of SpongeBob SquarePants:
.--..--..--..--..--..--.
.' \ (`._ (_) _ \
.' | '._) (_) |
\ _.')\ .----..--.' /
|(_.' | / .-\-. \---.
\ 0| | ( O| O) | |
| _ | .--.____.'._.-. |
\ (_) | o -` .-` |
| \ |`-._ _ _ _ _\ /
\ | | `. |_||_| |
| o | \_ \ | -. .-.
|.-. \ `--..-' O | `.`-' .'
_.' .' | `-.-' /-.__ ' .-'
.' `-.` '.|='=.='=.='=.='=|._/_ `-'.'
`-._ `. |________/\_____| `-.'
.' ).| '=' '='\/ '=' |
`._.` '---------------'
//___\ //___\
|| ||
||_.-. ||_.-.
(_.--__) (_.--__)
Meta: I generated a few dozen spongebobs last night on the same model and NONE were as good as this. Most started well but collapsed into decoherence at the end - missing the legs off. Then this morning the very same prompt to the same model API produced a perfect bob on the first attempt. Can utilization affect response quality, if all else remains constant? Or was it just random luck?

Edit: Ok, the very next attempt, a few minutes later, failed, so I guess it is just random, and you have about a 1 in 10 chance of getting a perfect spongebob from qwen3-coder, and ~0 chance with qwen3-next.
Humans do it too. I have given up on my country's non-local information sources, because I could recognize original sources that were being deliberately omitted. There's a satirical webpage that is basically a reddit scrape. Most users don't notice, and those who do don't seem to care.
llm -m chutes/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
"An ASCII of spongebob"
Here's an ASCII art of SpongeBob SquarePants:
```
.--..--..--..--..--..--.
.' \ (`._ (_) _ \
.' | '._) (_) |
\ _.')\ .----..--. /
|(_.' | / .-\-. \
\ 0| | ( O| O) |
| _ | .--.____.'._.-.
/.' ) | (_.' .-'"`-. _.-._.-.--.-.
/ .''. | .' `-. .-'-. .-'"`-.`-._)
.'.' | | | | | | | | | |
.'.' | | | | | | | | | |
.'.' | | | | | | | | | |
.'.' | | | | | | | | | |
.'.' | | | | | | | | | |
.'.' | | | | | | | | | |
```
The larger model already has it in the training corpus, so it's not a particularly good measure. I'd much rather see a model's capabilities by having it represent in ASCII something it's unlikely to have in its training data.
Maybe a pelican riding a bike as ascii for both?
And it appears like it's thinking about it! /s
It's amazing how far and how short we've come with software architectures.
I use ollama every day for spam filtering: gemma3:27b works great, but I use gpt-oss:20b on a daily basis because it's so much faster and comparable in quality.
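A minimal sketch of that kind of setup, hitting Ollama's documented /api/generate endpoint on localhost; the prompt wording and the verdict parsing are my own assumptions, not the commenter's actual pipeline:

```python
import json
import urllib.request

PROMPT = ("Classify the following email as SPAM or HAM. "
          "Answer with a single word.\n\n{email}")

def classify(verdict_text):
    """Map the model's free-text answer to a boolean (True = spam)."""
    return verdict_text.strip().upper().startswith("SPAM")

def ask_ollama(email, model="gpt-oss:20b",
               host="http://localhost:11434"):
    """Send one email to a local Ollama server and parse the verdict."""
    payload = json.dumps({
        "model": model,
        "prompt": PROMPT.format(email=email),
        "stream": False,  # single JSON response instead of a stream
    }).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return classify(json.load(resp)["response"])
```

Swapping `model` between `gemma3:27b` and `gpt-oss:20b` is the only change needed to compare the two.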
Jgoauh•2h ago
NitpickLawyer•1h ago
That being said, qwen models are extremely overfit. They can do some things well, but they are very limited in generalisation compared to closed models. I don't know if it's simply scale, or training recipes, or regimes. But if you test them out of distribution (OOD), the models utterly fail to deliver, where the closed models still provide value.
vintermann•1h ago
NitpickLawyer•1h ago
- in math, if they can solve a problem, or a class of problems, they'll solve it. If you use a "thinking" model + maj@x, you'll get strong results. But if you try, for example, to have the model consider a particular way or method of exploring a problem, it'll default to "solving" mode. It's near impossible to have it do anything else with a math problem other than solve it. Say "explore this part, in this way, using this method" - it can't do it. It'll maybe play a bit, but then enter "solving" mode and continue to solve it as it was trained.
In practice, this means that "massive parallel" test time compute becomes harder to do with these models, because you can't "guide" them towards certain aspects of a problem. They are extremely "stubborn".
- in coding it's even more obvious. Ask them to produce any 0-shot, often-tested and often-shown thing (SPA, game, visualisation, etc.) and they do it. Convincingly.
But ask them to look at a piece of code and extract meaning, and they fail. Or ask them to reverse an implementation - figure out what a function does and reverse its use, or make it do something else - and they fail.
elbear•1h ago
vintermann•54m ago
It does sound like an artifact of the dialog/thinking tuning though.