Or you can rent a newer one for $300/mo on the cloud
The RAM is for the 400GB of experts.
If you have 500GB of SSD, llama.cpp can offload to disk -> it'll be slow though, less than 1 token/s.
3 t/s isn't going to be a lot of fun to use.
For reference, an RTX 3090 has about 900GB/sec of memory bandwidth, and a Mac Studio 512GB has 819GB/sec.
So you just need a workstation with 8-channel DDR5 memory and 8 sticks of RAM, and stick a 3090 GPU inside of it. Should be cheaper than $5000 for 512GB of DDR5-6400, which runs at a memory bandwidth of 409GB/sec, plus an RTX 3090.
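For anyone sanity-checking that 409GB/sec figure, the back-of-the-envelope math is just channels x channel width x transfer rate (a quick sketch; the only assumption is the standard 64-bit DDR5 channel width):

    # Peak DDR5 bandwidth ~= channels * bytes per transfer * transfers per second
    channels = 8
    bytes_per_transfer = 8        # a DDR5 channel is 64 bits wide = 8 bytes
    transfers_per_sec = 6400e6    # DDR5-6400 -> 6400 MT/s

    peak_gb_s = channels * bytes_per_transfer * transfers_per_sec / 1e9
    print(f"{peak_gb_s:.0f} GB/s")  # -> 410 GB/s, i.e. the ~409GB/sec quoted above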
Qwen says this is similar in coding performance to Sonnet 4, not Opus.
How casually we enter the sci-fi era.
With the performance apparently comparable to Sonnet, some of the heavy Claude Code users could be interested in running it locally. They have instructions for configuring it for use with Claude Code. Huge bills for usage are regularly shared on X, so maybe it could even be economical (e.g. for a team of 6 or so sharing a local instance).
So likely it needs 2x the memory.
e: They did announce smaller variants will be released.
You will see people making quantized distilled versions but they never give benchmark results.
That machine will set you back around $10,000.
https://learn.microsoft.com/en-us/azure/virtual-machines/siz...
Edit: I just asked ChatGPT and it says that even with no memory bandwidth bottleneck, I can still only achieve around 1 token/s from a 96-core CPU.
Edit: actually I forgot the MoE part, so that makes sense.
I would totally buy a device like this for $10k if it were designed to run Linux.
I'm not sure if there are higher-capacity GDDR6 and GDDR7 RAM chips to buy. I semi-doubt you can add more without more channels, to some degree, but also, AMD just shipped the R9700 based on the RX 9070 but with double the RAM. Something like Strix Halo, an APU with more LPDDR channels, could work. Word is that Strix Halo's 2027 successor Medusa Halo will go to 6 channels, and it's hard to see a significant advantage without that win; the processing is already throughput-constrained-ish, and a leap in memory bandwidth will definitely be required. Dual-channel 128-bit isn't enough!
There's also the MRDIMM standard, which multiplexes multiple chips. That promises a doubling of both capacity and throughput.
Apple's definitely done two brilliant, costly things: putting very wide (but not really fast) memory on package (Intel had dabbled in something similar with regular-width RAM in the consumer space a while ago, with Lakefield), and then tiling multiple cores together, so that if they had four perfect chips next to each other they could ship them as one. An incredibly brilliant maneuver to get fantastic yields and to scale very big.
It has 8 channels of DDR5-8000.
It might be technically correct to call it 8 channels of LPDDR5 but 256-bits would only be 4 channels of DDR5.
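The disagreement is just over channel width; a quick sketch of the arithmetic, assuming the usual 64-bit DDR5 channel and the 32-bit LPDDR5 channel counting the comment implies:

    bus_width_bits = 256
    print(bus_width_bits // 64)  # 4 -> "4 channels" counted as 64-bit DDR5 channels
    print(bus_width_bits // 32)  # 8 -> "8 channels" counted as 32-bit LPDDR5 channels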
I'm sure the engineers are doing the best work they can. I just don't think leadership is as interested in making a good product as they are in creating a nice exit down the line.
https://github.com/QwenLM/qwen-code https://github.com/QwenLM/qwen-code/blob/main/LICENSE
I hope these OSS CC clones converge at some point.
Actually, it is mentioned on the page:
we’re also open-sourcing a command-line tool for agentic coding: Qwen Code. Forked from Gemini Code
I’ve instead used Gemini via plain ol’ chat, first building a competitive, larger context than Claude can hold, then manually bringing detailed plans and patches to Gemini for feedback, with excellent results.
I presumed MCP wouldn’t give me the focused results I get from completely controlling Gemini.
And that having CC interface via MCP would also use up context on that side.
For example, you can drive one model to a very good point through several turns, and then have the second “red team” the result of the first.
Then return that to the first model with all of its built up context.
This is particularly useful in big plans doing work on complex systems.
Even with a detailed plan, it is not unusual for Claude Code to get “stuck”, which can look like trying the same thing repeatedly.
You can just stop that, ask CC to summarize the current problem and attempted solutions into a “detailed technical briefing.”
Have CC then list all related files to the problem including tests, then provide the briefing and all of the files to the second LLM.
This is particularly good for large contexts that might take multiple turns to get into Gemini.
You can have the consulted model wait to provide any feedback until you’ve said you’re done adding context.
And then boom, you get a detailed solution without even having to directly focus on whatever minor step CC is stuck on. You stay high level.
In general, CC is immediately cured and will finish its task. This is a great time to flip it into planning mode and get plan alignment.
Get Claude to output an update on its detailed plan, including what has already been accomplished, then again ship it to the consulting model.
If you did a detailed system specification in advance (which CC hopefully was also originally working from), you can then ask the consulting model to review the work done and the planned next steps.
Inevitably the consulting model will have suggestions to improve CC’s work so far and plans. Send it on back and you’re getting outstanding results.
Update: Here is what o3 thinks about this topic: https://chatgpt.com/share/688030a9-8700-800b-8104-cca4cb1d0f...
You set the environment variable ANTHROPIC_BASE_URL to an OpenAI-compatible endpoint and ANTHROPIC_AUTH_TOKEN to the API token for the service.
I used Kimi-K2 on Moonshot [1] with Claude Code with no issues.
There's also Claude Code Router and similar apps for routing CC to a bunch of different models [2].
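If it helps, here's the env-var approach described above as a minimal script (a sketch only; the endpoint URL and token are placeholders, the variable names are the ones mentioned above):

    import os
    import subprocess

    env = os.environ.copy()
    # Point Claude Code at a third-party endpoint -- substitute whatever your provider documents.
    env["ANTHROPIC_BASE_URL"] = "https://example-provider.com/api"  # placeholder
    env["ANTHROPIC_AUTH_TOKEN"] = "your-provider-api-token"         # placeholder

    # Launch the Claude Code CLI with the overridden environment.
    subprocess.run(["claude"], env=env)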
Our main focuses were to be 1) CLI-first and 2) truly an open-source community. We have 5 independent maintainers with full commit access -- they aren't from the same org or entity (disclaimer: one has joined me at my startup Gobii, where we're working on web browsing agents).
I'd love someone to do a comparison with CC, but IME we hold our own against Cursor, Windsurf, and other agentic coding solutions.
But yes, there really needs to be a canonical FOSS solution that is not tied to any specific large company or model.
It focuses especially on large context and longer tasks with many steps.
It would be great if it starts supporting other models too natively. Wouldn't require people to fork.
Also docs on running it in a 24GB GPU + 128 to 256GB of RAM here: https://docs.unsloth.ai/basics/qwen3-coder
Recommended context: 65,536 tokens (can be increased)
That should be the recommended output length, as shown in the official docs: “Adequate Output Length: We recommend using an output length of 65,536 tokens for most queries, which is adequate for instruct models.”
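If you're calling it through an OpenAI-compatible endpoint, that just means setting max_tokens accordingly; a hedged sketch (base_url, key, and model id are placeholders -- check what your provider actually exposes):

    from openai import OpenAI

    client = OpenAI(base_url="https://your-provider.example/v1", api_key="YOUR_KEY")  # placeholders

    resp = client.chat.completions.create(
        model="qwen3-coder-480b-a35b-instruct",  # assumed model id; providers may name it differently
        messages=[{"role": "user", "content": "Write a binary search in Python."}],
        max_tokens=65536,  # the "adequate output length" from the docs quoted above
    )
    print(resp.choices[0].message.content)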
If you don't have enough RAM, then expect < 1 token/s.
Important layers are kept in 8-bit or 6-bit; less important ones are left in 2-bit! I talk more about it here: https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs
Great to know that this is already a thing, and I assume model "compression" is going to be the next hot topic.
I'm most excited for the smaller sizes because I'm interested in locally-runnable models that can sometimes write passable code, and I think we're getting close. But since for the foreseeable future, I'll probably sometimes want to "call in" a bigger model that I can't realistically or affordably host on my own computer, I love having the option of high-quality open-weight models for this, and I also like the idea of "paying in" for the smaller open-weight models I play around with by renting access to their larger counterparts.
Congrats to the Qwen team on this release! I'm excited to try it out.
Very interesting. Any subs or threads you could recommend/link to?
Thanks
They don't need to match bigger models, though. They just need to be good enough for a specific task!
This is more obvious when you look at the things language models are best at, like translation. You just don't need a super huge model for translation, and in fact you might sometimes prefer a smaller one because being able to do something in real-time, or being able to run on a mobile device, is more important than marginal accuracy gains for some applications.
I'll also say that due to the hallucination problem, beyond whatever knowledge is required for being more or less coherent and "knowing" what to write in web search queries, I'm not sure I find more "knowledgeable" LLMs very valuable. Even with proprietary SOTA models hosted on someone else's cloud hardware, I basically never want an LLM to answer "off the dome"; IME it's almost always wrong! (Maybe this is less true for others whose work focuses on the absolute most popular libraries and languages, idk.) And if an LLM I use is always going to be consulting documentation at runtime, maybe that knowledge difference isn't quite so vital— summarization is one of those things that seems much, much easier for language models than writing code or "reasoning".
All of that is to say:
Sure, bigger is better! But for some tasks, my needs are still below the ceiling of the capabilities of a smaller model, and that's where I'm focusing on local usage. For now that's mostly language-focused tasks entirely apart from coding (translation, transcription, TTS, maybe summarization). It may also include simple coding tasks today (e.g., fancy auto-complete, "ghost-text" style). I think it's reasonable to hope that it will eventually include more substantial programming tasks— even if larger models are still preferable for more sophisticated tasks (like "vibe coding", maybe).
If I end up having a lot of fun, in a year or two I'll probably try to put together a machine that can indeed run larger models. :)
This reminds me of ~”the best camera is the one you have with you” idea.
Though large models are an HTTP request away, there are plenty of reasons to want to run one locally, not the least of which is getting useful results in the absence of internet.
I feel like I'm the exact opposite here (despite heavily mistrusting these models in general): if I came to the model to ask it a question, and it decides to do a Google search, it pisses me off as I not only could do that, I did do that, and if that had worked out I wouldn't be bothering to ask the model.
FWIW, I do imagine we are doing very different things, though: most of the time, when I'm working with a model, I'm trying to do something so complex that I also asked my human friends and they didn't know the answer either, and my attempts to search for the answer are failing as I don't even know the terminology.
Is this an effort to chastise the viewpoint advanced? Because his viewpoint makes sense to me: I can run biggish models on my 128GB MacBook but not huge ones -- even 2-bit quantized ones suck up too many resources.
So I run a combination of local stuff and remote stuff depending upon various factors (cost, sensitivity of information, convenience/whether I'm at home, amount of battery left, etc ;)
Yes, bigger models are better, but often smaller is good enough.
I don't have $10-20k to spend on this stuff, which is about the minimum to run a 480B model, even with heavy quantisation. And it would be pretty slow, because for that price all you get is an old Xeon with a lot of memory or some old Nvidia datacenter cards. If you want a good setup it will cost a lot more.
So small models it is. Sure, the bigger models are better but because the improvements come so fast it means I'm only 6 months to a year behind the big ones at any time. Is that worth 20k? For me no.
Under the hood, the way it works is that when you have the final probabilities, it really doesn't matter if the most likely token is selected with 59% or 75% - in either case it gets selected. If the 59% case gets there with a smaller amount of compute, and that holds across the board for the training set, the model will have similar performance.
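A toy illustration of that point (my own sketch, not from the original comment): with greedy decoding only the argmax matters, so a less confident distribution still picks the same token.

    import numpy as np

    large_model_probs = np.array([0.10, 0.15, 0.75])  # confident: 75% on token 2
    small_model_probs = np.array([0.20, 0.21, 0.59])  # less confident: 59% on token 2

    # Greedy decoding selects the argmax either way.
    print(np.argmax(large_model_probs))  # 2
    print(np.argmax(small_model_probs))  # 2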
In theory, it should be possible to shrink models even further and still match the performance of big models, because I really doubt that you need the full transformer compute for every single forward pass. There are probably plenty of shortcuts you can take in terms of compute for sets of tokens in the context. For example, coding structure is much more deterministic than natural text, so you probably don't need as much compute to generate accurate code.
You do need a big model first to train a small model though.
As for running huge models locally, it's not enough to run them; you need good throughput as well. If you spend $2k on a graphics card, that is way more expensive than realistic usage with a paid API, and gives slower output as well.
I was surprised in the AlphaEvolve paper how much they relied on the flash model because they were optimizing for speed of generating ideas.
Untrue. The big important issue for LLMs is hallucination, and making your model bigger does little to solve it.
Increasing model size is a technological dead end. The future advanced LLM is not that.
Likewise, I found that the regular Qwen3-30B-A3B worked pretty well on a pair of L4 GPUs (60 tokens/second, 48 GB of memory) which is good enough for on-prem use where cloud options aren't allowed, but I'd very much like a similar code specific model, because the tool calling in something like RooCode just didn't work with the regular model.
In those circumstances, it isn't really a comparison between cloud and on-prem, it's on-prem vs nothing.
assuming it doesn't all implode due to a lack of profitability, it should be obvious
That said, code+git+agent is the only acceptable way for technical staff to interact with AI. Tools with a sparkles button can go to hell.
https://a16z.com/llmflation-llm-inference-cost/ https://openrouter.ai/anthropic/claude-sonnet-4
Obsession?
It's a bit like the media cycle: the more jacked in you are, the more behind you feel. I'm less certain there will be winners as much as losers, but for sure the time investment in staying up to date on these things will not pay dividends to the average HN reader.
A look at fandom wikis is humbling. People will persist and go very deep into stuff they care about.
In this case: Read a lot, try to build a lot, learn, learn from mistakes, compare.
I think the bottleneck is file read/write tooling right now
Saw a repo recently with probably 80% of those
Now I have a git repo I add as a submodule and tell each tool to read through and create their own WHATEVER.md
The closest you get is https://github.com/opencode-ai/opencode in Go.
Alibaba Plus: input $1 to $6, output $5 to $60
Alibaba OpenSource: input $1.50 to $4.50, output $7.50 to $22.50
So it doesn't look that cheap compared to Kimi K2 or their non-coder version (Qwen3 235B A22B 2507).
What's more confusing is this "up to" pricing, which can supposedly reach $60 for output - with agents it's not that easy to control context.
5%: Making code changes
10%: Running build pipelines
20%: Learning about changed process and people via zoom calls, teams chat and emails
15%: Raising incident tickets for issues outside of my control
20%: Submitting forms, attending reviews and chasing approvals
20%: Reaching out to people for dependencies, following up
10%: Finding and reading up some obscure and conflicting internal wiki page, which is likely to be outdated
Open, small, roughly Sonnet 4-ish (if the benchmarks are to be believed), tool use?
"Today, we're announcing Qwen3-Coder, our most agentic code model to date. Qwen3-Coder is available in multiple sizes, but we're excited to introduce its most powerful variant first: Qwen3-Coder-480B-A35B-Instruct."
https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct
You won't be out of work creating ggufs anytime soon :)
https://winbuzzer.com/2025/01/29/alibabas-new-qwen-2-5-max-m...
Alibaba is not a company whose culture is conducive to earnest acknowledgement that they are behind SOTA.
This is disingenuous. There are a bunch of hurdles to using open models over closed models, and you know them as well as the rest of us.