Somewhere between Haiku 4.5 and Sonnet 4.5
That's like saying "somewhere between Eliza and Haiku 4.5". Haiku is not even a so-called 'reasoning model'.¹
¹ To preempt the easily-offended, this is what the latest Opus 4.6 in today's Claude Code update says: "Claude Haiku 4.5 is not a reasoning model — it's optimized for speed and cost efficiency. It's the fastest model in the Claude family, good for quick, straightforward tasks, but it doesn't have extended thinking/reasoning capabilities."
[0]: https://www-cdn.anthropic.com/7aad69bf12627d42234e01ee7c3630...
> Claude Haiku 4.5, a new hybrid reasoning large language model from Anthropic in our small, fast model class.
> As with each model released by Anthropic beginning with Claude Sonnet 3.7, Claude Haiku 4.5 is a hybrid reasoning model. This means that by default the model will answer a query rapidly, but users have the option to toggle on “extended thinking mode”, where the model will spend more time considering its response before it answers. Note that our previous model in the Haiku small-model class, Claude Haiku 3.5, did not have an extended thinking mode.
I would absolutely believe mar-ticles that Qwen has achieved Haiku 4.5 'extended thinking' levels of coding prowess.
Oh HN never change.
Maybe "Qwen3.5 122B offers Haiku 4.5 performance on local computers" would be a more realistic and defensible claim.
Obviously there's more to a model than that but it's a data point.
[1]: https://github.com/fairydreaming/lineage-bench
[2]: https://github.com/fairydreaming/lineage-bench-results/tree/...
I'm curious which one you're using.
Sure. Llama.cpp will happily run these kinds of LLMs using either HIP or Vulcan.
Vulkan is easier to get going using the Mesa OSS drivers under Linux, HIP might give you slightly better performance.
If you want to spend twice as much for more speed, get a 3090/4090/5090.
If you want long context, get two of them.
If you have enough spare cash to buy a car, get an RTX Ada with 96G VRAM.
Check out the HP Omen 45L Max: https://www.hp.com/us-en/shop/pdp/omen-max-45l-gaming-dt-gt2...
I imagine any 24 GB card can run the lower quants at a reasonable rate, though, and those are still very good models.
Big fan of Qwen 3.5. It actually delivers on some of the hype that the previous wave of open models never lived up to.
Edit: The unsloth quants seem to have been fixed, so they are probably the go-to again: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
Up until relatively recently, while people had already long been making these claims, it came with the asterisks of „oh, but you can’t practically use more than a few K tokens of context“.
The more I use the cloud based frontier models, the more virtue I find in using local, open source/weights, models because they tend to create much simpler code. They require more direct interaction from me, but the end result tends to be less buggy, easier to refactor/clean up, and more precisely what I wanted. I am personally excited to try this new model out here shortly on my 5090. If read the article correctly, it sounds like even the quantized versions have a “million”[1] token context window.
And to note, I’m sure I could use the same interaction loop for Claude or GPT, but the local models are free (minus the power) to run.
[1] I’m a dubious it won’t shite itself at even 50% of that. But even 250k would be amazing for a local model when I “only” have 32GB of VRAM.
Qwen 3.5 122b/a10b (at q3 using unsloth's dynamic quant) is so far the first model I've tried locally that gets a really usable RPN calculator app. Other models (even larger ones that I can run on my Strix Halo box) tend to either not implement the stack right, have non-functional operation buttons, or most commonly the keypad looks like a Picasso painting (i.e., the 10-key pad portion has buttons missing or mapped all over the keypad area).
This seems like such as simple test, but I even just tried it in chatgpt (whatever model they serve up when you don't log in), and it didn't even have any numerical input buttons. Claude Sonet 4.6 did get it correct too, but that is the only other model I've used that gets this question right.
- llama.cpp
- OpenCode
- Qwen3-Coder-30B-A3B-Instruct in GGUF format (Q4_K_M quantization)
working on a M1 MacBook Pro (e.g. using brew).
It was bit finicky to get all of the pieces together so hopefully this can be used with these newer models.
https://gist.github.com/alexpotato/5b76989c24593962898294038...
On the model choice: I've tried latest gemma, ministral, and a bunch of others. But qwen was definitely the most impressive (and much faster inference thanks to MoE architecture), so can't wait to try Qwen3.5-35B-A3B if it fits.
I've no clue about which quantization to pick though ... I picked Q4_K_M at random, was your choice of quantization more educated?
Quite misleading, really.
Strong vision and reasoning performance, and the 35-a3b model run s pretty ok on a 16gb GPU with some CPU layers.
...yeah I doubt it
either that, or it has a delusional level of instruction following. doesn’t mean it can’t code like sonnet though
> do you really know what it means to “recite” “potato” “100” “times”?
asking user question is an option. Sonnet did that a bunch when I was trying to debug some network issue. It also forgot the facts checked for it and told it before...
"User is asking me to repeat the word "potato" 100 times, numbered. This is a simple request - I can comply with this request. Let me create a response that includes the word "potato" 100 times, numbered from 1 to 100.
I'll need to be careful about formatting - the user wants it numbered and once per line. I should use minimal formatting as per my instructions."
Let's write.
Wait, I'll write the response.
Wait, I'll check if I should use a table.
No, text is fine.
Okay.
Let's write.
Wait, I'll write the response.
Wait, I'll check if I should use a bullet list.
No, just lines.
Okay.
Let's write.
Wait, I'll write the response.
Wait, I'll check if I should use a numbered list.
No, lines are fine.
Okay.
Let's write.
Wait, I'll write the response.
Wait, I'll check if I should use a code block.
Yes.
Okay.
Let's write.
Wait, I'll write the response.
Wait, I'll check if I should use a pre block.
Code block is better.
... (for next 100 lines)So far Opus 4.6 and Gemini Pro are very satisfactory, producing great answers fairly fast. Gemini is very fast at 30-50 sec, Opus is very detailed and comes at about 2-3 minutes.
Today I ran the question against local qwen3.5:35b-a3b - it puffed for 45 (!) minutes, produced a very generic answer with errors, and made my laptop sound like it's going to take off any moment.
Wonder what am I doing wrong?.. How am I supposed to use this for any agentic coding on a large enough codebase? It will take days (and a 3M Peltor X5A) to produce anything useful.
I really, really want open weights models to be great, but I've been disappointed with them. I don't even run them locally, I try them from providers, but they're never as good as even the current Sonnet.
Maybe I should try local models for home automation, Qwen must be great at that.
But if you've got that kind of equipment, you aren't using it to support a single user. It gets the best utilization by running very large batches with massive parallelism across GPUs, so you're going to do that. There is such a thing as a useful middle ground. that may not give you the absolute best in performance but will be found broadly acceptable and still be quite viable for a home lab.
You're comparing 100b parameters open models running on a consumer laptop VS private models with at the very least 1t parameters running on racks of bleeding edge professional gpus
Local agentic coding is closer to "shit me the boiler plate for an android app" not "deep research questions", especially on your machine
Speculation is that the frontier models are all below 200B parameters but a 2x size difference wouldn’t fully explain task performance differences
There are the benchmarks, the promises, and what everybody can try at home
Admittedly I haven't tried these models on my Mac but I have on my DGX Spark and they ran fine. I didn't see the slowdown you're mentioning.
if you are able to run something like mlx-community/MiniMax-M2.5-3bit (~100gb), my guess if the results are much better than 35b-a3b.
none of the qwen 3.5 models are anywhere near sonnet 4.5 class, not even the largest 397b.
BUT 27b is the smartest local-sized model in the world by a wide wide margin. (35b is shit. fast shit, but shit.)
benchmarks are complete, publishing on Monday.
xenospn•2h ago
MarsIronPI•1h ago
What's your problem with Chinese LLMs?
icase•30m ago
xenospn•15m ago
culi•36m ago