Or you can rent a newer one for $300/mo on the cloud
The RAM is for the 400GB of experts.
If you have 500GB of SSD, llama.cpp can offload to disk -> it'll be slow though, less than 1 token/s.
3 t/s isn't going to be a lot of fun to use.
For reference, an RTX 3090 has about 900GB/sec of memory bandwidth, and a Mac Studio 512GB has 819GB/sec.
So you just need a workstation with 8-channel DDR5 memory and 8 sticks of RAM, and stick a 3090 GPU inside of it. Should be cheaper than $5000 for 512GB of DDR5-6400, which runs at a memory bandwidth of 409GB/sec, plus an RTX 3090.
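For anyone sanity-checking that 409GB/sec figure, the back-of-the-envelope math is just channels x channel width x transfer rate (a quick sketch; the only assumption is the standard 64-bit DDR5 channel width):

    # Peak DDR5 bandwidth ~= channels * bytes per transfer * transfers per second
    channels = 8
    bytes_per_transfer = 8        # a DDR5 channel is 64 bits wide = 8 bytes
    transfers_per_sec = 6400e6    # DDR5-6400 -> 6400 MT/s

    peak_gb_s = channels * bytes_per_transfer * transfers_per_sec / 1e9
    print(f"{peak_gb_s:.0f} GB/s")  # -> 410 GB/s, i.e. the ~409GB/sec quoted above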
Qwen says this is similar in coding performance to Sonnet 4, not Opus.
How casually we enter the sci-fi era.
With the performance apparently comparable to Sonnet, some of the heavy Claude Code users could be interested in running it locally. They have instructions for configuring it for use with Claude Code. Huge bills for usage are regularly shared on X, so maybe it could even be economical (e.g. for a team of 6 or so sharing a local instance).
So likely it needs 2x the memory.
e: They did announce smaller variants will be released.
You will see people making quantized distilled versions but they never give benchmark results.
That machine will set you back around $10,000.
https://learn.microsoft.com/en-us/azure/virtual-machines/siz...
Edit: I just asked ChatGPT and it says that even with no memory bandwidth bottleneck, I can still only achieve around 1 token/s from a 96-core CPU.
Edit: actually I forgot the MoE part, so that makes sense.
I would totally buy a device like this for $10k if it were designed to run Linux.
I'm not sure if there are higher-capacity GDDR6 and GDDR7 RAM chips to buy. I semi-doubt you can add more without more channels, to some degree, but also, AMD just shipped the R9700 based on the RX 9070 but with double the RAM. Something like Strix Halo, an APU with more LPDDR channels, could work. Word is that Strix Halo's 2027 successor Medusa Halo will go to 6 channels, and it's hard to see a significant advantage without that win; the processing is already throughput-constrained-ish, and a leap in memory bandwidth will definitely be required. Dual-channel 128-bit isn't enough!
There's also the MRDIMM standard, which multiplexes multiple chips. That promises a doubling of both capacity and throughput.
Apple's definitely done two brilliant, costly things: putting very wide (but not really fast) memory on package (Intel had dabbled in something similar with regular-width RAM in the consumer space a while ago, with Lakefield), and then tiling multiple cores together, so that if they had four perfect chips next to each other they could ship them as one. An incredibly brilliant maneuver to get fantastic yields and to scale very big.
It has 8 channels of DDR5-8000.
It might be technically correct to call it 8 channels of LPDDR5 but 256-bits would only be 4 channels of DDR5.
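The disagreement is just over channel width; a quick sketch of the arithmetic, assuming the usual 64-bit DDR5 channel and the 32-bit LPDDR5 channel counting the comment implies:

    bus_width_bits = 256
    print(bus_width_bits // 64)  # 4 -> "4 channels" counted as 64-bit DDR5 channels
    print(bus_width_bits // 32)  # 8 -> "8 channels" counted as 32-bit LPDDR5 channels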
I'm sure the engineers are doing the best work they can. I just don't think leadership is as interested in making a good product as they are in creating a nice exit down the line.
https://github.com/QwenLM/qwen-code https://github.com/QwenLM/qwen-code/blob/main/LICENSE
I hope these OSS CC clones converge at some point.
Actually, it is mentioned on the page:
we’re also open-sourcing a command-line tool for agentic coding: Qwen Code. Forked from Gemini Code
I’ve instead used Gemini via plain ol’ chat, first building a competitive, larger context than Claude can hold, then manually bringing detailed plans and patches to Gemini for feedback, with excellent results.
I presumed MCP wouldn’t give me the focused results I get from completely controlling Gemini.
And that having CC interface via MCP would also use up context on that side.
For example, you can drive one model to a very good point through several turns, and then have the second “red team” the result of the first.
Then return that to the first model with all of its built up context.
This is particularly useful in big plans doing work on complex systems.
Even with a detailed plan, it is not unusual for Claude Code to get “stuck”, which can look like trying the same thing repeatedly.
You can just stop that, ask CC to summarize the current problem and attempted solutions into a “detailed technical briefing.”
Have CC then list all related files to the problem including tests, then provide the briefing and all of the files to the second LLM.
This is particularly good for large contexts that might take multiple turns to get into Gemini.
You can have the consulted model wait to provide any feedback until you’ve said you’re done adding context.
And then boom, you get a detailed solution without even having to directly focus on whatever minor step CC is stuck on. You stay high level.
In general, CC is immediately cured and will finish its task. This is a great time to flip it into planning mode and get plan alignment.
Get Claude to output an update on its detailed plan, including what has already been accomplished, then again ship it to the consulting model.
If you did a detailed system specification in advance (which CC hopefully was also originally working from), you can then ask the consulting model to review the work done and the planned next steps.
Inevitably the consulting model will have suggestions to improve CC’s work so far and plans. Send it on back and you’re getting outstanding results.
Update: Here is what o3 thinks about this topic: https://chatgpt.com/share/688030a9-8700-800b-8104-cca4cb1d0f...
You set the environment variable ANTHROPIC_BASE_URL to an OpenAI-compatible endpoint and ANTHROPIC_AUTH_TOKEN to the API token for the service.
I used Kimi-K2 on Moonshot [1] with Claude Code with no issues.
There's also Claude Code Router and similar apps for routing CC to a bunch of different models [2].
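If it helps, here's the env-var approach described above as a minimal script (a sketch only; the endpoint URL and token are placeholders, the variable names are the ones mentioned above):

    import os
    import subprocess

    env = os.environ.copy()
    # Point Claude Code at a third-party endpoint -- substitute whatever your provider documents.
    env["ANTHROPIC_BASE_URL"] = "https://example-provider.com/api"  # placeholder
    env["ANTHROPIC_AUTH_TOKEN"] = "your-provider-api-token"         # placeholder

    # Launch the Claude Code CLI with the overridden environment.
    subprocess.run(["claude"], env=env)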
Our main focuses were to be 1) CLI-first and 2) truly an open-source community. We have 5 independent maintainers with full commit access -- they aren't from the same org or entity (disclaimer: one has joined me at my startup Gobii, where we're working on web browsing agents).
I'd love someone to do a comparison with CC, but IME we hold our own against Cursor, Windsurf, and other agentic coding solutions.
But yes, there really needs to be a canonical FOSS solution that is not tied to any specific large company or model.
It focuses especially on large context and longer tasks with many steps.
It would be great if it starts supporting other models too natively. Wouldn't require people to fork.
Also docs on running it in a 24GB GPU + 128 to 256GB of RAM here: https://docs.unsloth.ai/basics/qwen3-coder
Recommended context: 65,536 tokens (can be increased)
That should be the recommended output length, as shown in the official docs: “Adequate Output Length: We recommend using an output length of 65,536 tokens for most queries, which is adequate for instruct models.”
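If you're calling it through an OpenAI-compatible endpoint, that just means setting max_tokens accordingly; a hedged sketch (base_url, key, and model id are placeholders -- check what your provider actually exposes):

    from openai import OpenAI

    client = OpenAI(base_url="https://your-provider.example/v1", api_key="YOUR_KEY")  # placeholders

    resp = client.chat.completions.create(
        model="qwen3-coder-480b-a35b-instruct",  # assumed model id; providers may name it differently
        messages=[{"role": "user", "content": "Write a binary search in Python."}],
        max_tokens=65536,  # the "adequate output length" from the docs quoted above
    )
    print(resp.choices[0].message.content)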
If you don't have enough RAM, then expect < 1 token/s.
Important layers are kept in 8-bit or 6-bit; less important ones are left in 2-bit! I talk more about it here: https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs
Great to know that this is already a thing, and I assume model "compression" is going to be the next hot topic.
I'm most excited for the smaller sizes because I'm interested in locally-runnable models that can sometimes write passable code, and I think we're getting close. But since for the foreseeable future, I'll probably sometimes want to "call in" a bigger model that I can't realistically or affordably host on my own computer, I love having the option of high-quality open-weight models for this, and I also like the idea of "paying in" for the smaller open-weight models I play around with by renting access to their larger counterparts.
Congrats to the Qwen team on this release! I'm excited to try it out.
Very interesting. Any subs or threads you could recommend/link to?
Thanks
They don't need to match bigger models, though. They just need to be good enough for a specific task!
This is more obvious when you look at the things language models are best at, like translation. You just don't need a super huge model for translation, and in fact you might sometimes prefer a smaller one because being able to do something in real-time, or being able to run on a mobile device, is more important than marginal accuracy gains for some applications.
I'll also say that due to the hallucination problem, beyond whatever knowledge is required for being more or less coherent and "knowing" what to write in web search queries, I'm not sure I find more "knowledgeable" LLMs very valuable. Even with proprietary SOTA models hosted on someone else's cloud hardware, I basically never want an LLM to answer "off the dome"; IME it's almost always wrong! (Maybe this is less true for others whose work focuses on the absolute most popular libraries and languages, idk.) And if an LLM I use is always going to be consulting documentation at runtime, maybe that knowledge difference isn't quite so vital— summarization is one of those things that seems much, much easier for language models than writing code or "reasoning".
All of that is to say:
Sure, bigger is better! But for some tasks, my needs are still below the ceiling of the capabilities of a smaller model, and that's where I'm focusing on local usage. For now that's mostly language-focused tasks entirely apart from coding (translation, transcription, TTS, maybe summarization). It may also include simple coding tasks today (e.g., fancy auto-complete, "ghost-text" style). I think it's reasonable to hope that it will eventually include more substantial programming tasks— even if larger models are still preferable for more sophisticated tasks (like "vibe coding", maybe).
If I end up having a lot of fun, in a year or two I'll probably try to put together a machine that can indeed run larger models. :)
This reminds me of ~”the best camera is the one you have with you” idea.
Though large models are an HTTP request away, there are plenty of reasons to want to run one locally, not the least of which is getting useful results in the absence of internet.
I feel like I'm the exact opposite here (despite heavily mistrusting these models in general): if I came to the model to ask it a question, and it decides to do a Google search, it pisses me off as I not only could do that, I did do that, and if that had worked out I wouldn't be bothering to ask the model.
FWIW, I do imagine we are doing very different things, though: most of the time, when I'm working with a model, I'm trying to do something so complex that I also asked my human friends and they didn't know the answer either, and my attempts to search for the answer are failing as I don't even know the terminology.
Is this an effort to chastise the viewpoint advanced? Because his viewpoint makes sense to me: I can run biggish models on my 128GB MacBook but not huge ones -- even 2-bit quantized ones suck up too many resources.
So I run a combination of local stuff and remote stuff depending upon various factors (cost, sensitivity of information, convenience/whether I'm at home, amount of battery left, etc ;)
Yes, bigger models are better, but often smaller is good enough.
I don't have $10-20k to spend on this stuff, which is about the minimum to run a 480B model, even with heavy quantisation. And it would be pretty slow, because for that price all you get is an old Xeon with a lot of memory or some old Nvidia datacenter cards. If you want a good setup it will cost a lot more.
So small models it is. Sure, the bigger models are better but because the improvements come so fast it means I'm only 6 months to a year behind the big ones at any time. Is that worth 20k? For me no.
Under the hood, the way it works is that when you have the final probabilities, it really doesn't matter if the most likely token is selected with 59% or 75% - in either case it gets selected. If the 59% case gets there with a smaller amount of compute, and that holds across the board for the training set, the model will have similar performance.
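A toy illustration of that point (my own sketch, not from the original comment): with greedy decoding only the argmax matters, so a less confident distribution still picks the same token.

    import numpy as np

    large_model_probs = np.array([0.10, 0.15, 0.75])  # confident: 75% on token 2
    small_model_probs = np.array([0.20, 0.21, 0.59])  # less confident: 59% on token 2

    # Greedy decoding selects the argmax either way.
    print(np.argmax(large_model_probs))  # 2
    print(np.argmax(small_model_probs))  # 2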
In theory, it should be possible to shrink models even further and still match the performance of big models, because I really doubt that you need the full transformer compute for every single forward pass. There are probably plenty of shortcuts you can take in terms of compute for sets of tokens in the context. For example, coding structure is much more deterministic than natural text, so you probably don't need as much compute to generate accurate code.
You do need a big model first to train a small model though.
As for running huge models locally, it's not enough to run them; you need good throughput as well. If you spend $2k on a graphics card, that is way more expensive than realistic usage with a paid API, and gives slower output as well.
I was surprised in the AlphaEvolve paper how much they relied on the flash model because they were optimizing for speed of generating ideas.
Untrue. The big important issue for LLMs is hallucination, and making your model bigger does little to solve it.
Increasing model size is a technological dead end. The future advanced LLM is not that.
Likewise, I found that the regular Qwen3-30B-A3B worked pretty well on a pair of L4 GPUs (60 tokens/second, 48 GB of memory) which is good enough for on-prem use where cloud options aren't allowed, but I'd very much like a similar code specific model, because the tool calling in something like RooCode just didn't work with the regular model.
In those circumstances, it isn't really a comparison between cloud and on-prem, it's on-prem vs nothing.
assuming it doesn't all implode due to a lack of profitability, it should be obvious
That said, code+git+agent is the only acceptable way for technical staff to interact with AI. Tools with a sparkles button can go to hell.
https://a16z.com/llmflation-llm-inference-cost/ https://openrouter.ai/anthropic/claude-sonnet-4
Obsession?
It's a bit like the media cycle: the more jacked in you are, the more behind you feel. I'm less certain there will be winners as much as losers, but for sure the time investment in staying up to date on these things will not pay dividends to the average HN reader.
A look at fandom wikis is humbling. People will persist and go very deep into stuff they care about.
In this case: Read a lot, try to build a lot, learn, learn from mistakes, compare.
I think the bottleneck is file read/write tooling right now
Saw a repo recently with probably 80% of those
Now I have a git repo I add as a submodule and tell each tool to read through and create their own WHATEVER.md
The closest you get is https://github.com/opencode-ai/opencode in Go.
Alibaba Plus: input $1 to $6, output $5 to $60
Alibaba OpenSource: input $1.50 to $4.50, output $7.50 to $22.50
So it doesn't look that cheap compared to Kimi K2 or their non-coder version (Qwen3 235B A22B 2507).
What's more confusing is this "up to" pricing, which can supposedly reach $60 for output - with agents it's not that easy to control context.
5%: Making code changes
10%: Running build pipelines
20%: Learning about changed process and people via zoom calls, teams chat and emails
15%: Raising incident tickets for issues outside of my control
20%: Submitting forms, attending reviews and chasing approvals
20%: Reaching out to people for dependencies, following up
10%: Finding and reading up some obscure and conflicting internal wiki page, which is likely to be outdated
Open, small, roughly Sonnet 4-ish (if the benchmarks are to be believed), tool use?
"Today, we're announcing Qwen3-Coder, our most agentic code model to date. Qwen3-Coder is available in multiple sizes, but we're excited to introduce its most powerful variant first: Qwen3-Coder-480B-A35B-Instruct."
https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct
You won't be out of work creating ggufs anytime soon :)
https://winbuzzer.com/2025/01/29/alibabas-new-qwen-2-5-max-m...
Alibaba is not a company whose culture is conducive to earnest acknowledgement that they are behind SOTA.
This is disingenuous. There are a bunch of hurdles to using open models over closed models, and you know them as well as the rest of us.