Or you can rent a newer one for $300/mo on the cloud
The RAM is for the 400gb of experts.
If you have 500GB of SSD, llama.cpp can do disk offloading -> it'll be slow though, less than 1 token/s.
3 t/s isn't going to be a lot of fun to use.
For reference, a RTX 3090 has about 900GB/sec memory bandwidth, and a Mac Studio 512GB has 819GB/sec memory bandwidth.
So you just need a workstation with 8-channel DDR5 memory, 8 sticks of RAM, and a 3090 GPU stuck inside it. Should be cheaper than $5,000 for 512GB of DDR5-6400, which runs at a memory bandwidth of 409GB/sec, plus an RTX 3090.
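A quick back-of-envelope check of that bandwidth figure (a sketch assuming 64-bit channels and no overhead):

  channels, bus_bits, mt_per_s = 8, 64, 6400
  print(channels * (bus_bits / 8) * mt_per_s / 1000)  # 409.6 GB/s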
Qwen says this is similar in coding performance to Sonnet 4, not Opus.
How casually we enter the sci-fi era.
With performance apparently comparable to Sonnet, some of the heavy Claude Code users could be interested in running it locally. They have instructions for configuring it for use with Claude Code. Huge usage bills are regularly shared on X, so maybe it could even be economical (e.g. for a team of 6 or so sharing a local instance).
So likely it needs 2x the memory.
e: They did announce smaller variants will be released.
https://arxiv.org/abs/2505.24832
LLMs usually have about 3.6 bits of data per parameter. You're losing a lot of information if quantized to 2 bits. 4 bit quants are the sweet spot where there's not much quality loss.
You will see people making quantized distilled versions but they never give benchmark results.
That machine will set you back around $10,000.
https://learn.microsoft.com/en-us/azure/virtual-machines/siz...
Edit: I just asked ChatGPT, and it says that with no memory bandwidth bottleneck, I can still only achieve around 1 token/s from a 96-core CPU.
Edit: actually forgot the MoE part, so that makes sense.
And this is why I'm so excited about MoE models! qwen3:30b-a3b runs at the speed of a 3B parameter model. It's completely realistic to run on a plain CPU with 20 GB RAM for the model.
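As a rough ceiling (my own assumptions: Q4 weights, dual-channel DDR4-3200 at ~51 GB/s, ignoring compute and prefill):

  active_params, bytes_per_param, ram_bw_gbs = 3e9, 0.5, 51
  gb_per_token = active_params * bytes_per_param / 1e9  # ~1.5 GB read per generated token
  print(ram_bw_gbs / gb_per_token)                      # ~34 tokens/sec upper bound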
Here's Deepseek R1 running off of RAM at 8tok/sec: https://www.youtube.com/watch?v=wKZHoGlllu4
If a $4,000 Mac does something at X tok/s, a $400 AMD PC on pure CPU does it at 0.1*X tok/s.
Assuming good choices for how that money is spent. You can always waste more money. As others have said, it's all about memory bandwidth. AMD's "AI Max+ 395" is gonna make this interesting.
And of course you can always just not have enough RAM to even run the model. This tends to happen with consumer discrete GPUs not having that much VRAM, they were built for gaming.
I would totally buy a device like this for $10k if it were designed to run Linux.
I'm not sure if there are higher-capacity GDDR6 and GDDR7 RAM chips to buy. I somewhat doubt you can add more without more channels, but AMD just shipped the R9700, based on the RX 9070 but with double the RAM. Something like Strix Halo, an APU with more LPDDR channels, could work. Word is that Strix Halo's 2027 successor Medusa Halo will go to 6 channels, and it's hard to see a significant advantage without that win; the processing is already throughput-constrained-ish, and a leap in memory bandwidth will definitely be required. Dual-channel 128-bit isn't enough!
There's also the MRDIMM standard, which multiplexes multiple chips. That promises a doubling of both capacity and throughput.
Apple's definitely done two brilliant costly things, by putting very wide (but not really fast) memory on package (Intel had dabbled in doing similar with regular width ram in consumer space a while ago with Lakefield). And then by tiling multiple cores together, making it so that if they had four perfect chips next to each other they could ship it as one. Incredibly brilliant maneuver to get fantastic yields, and to scale very big.
It has 8 channels of DDR5-8000.
It might be technically correct to call it 8 channels of LPDDR5 but 256-bits would only be 4 channels of DDR5.
Sure you can build a cluster of RTX 6000s but then you start having to buy high-end motherboards and network cards to achieve the bandwidth necessary for it to go fast. Also it's obscenely expensive.
I'm sure the engineers are doing the best work they can. I just don't think leadership is as interested in making a good product as they are in creating a nice exit down the line.
OpenHands is clearly the best I've used so far. Even Gemini CLI is lesser.
https://github.com/QwenLM/qwen-code https://github.com/QwenLM/qwen-code/blob/main/LICENSE
I hope these OSS CC clones converge at some point.
Actually it is mentioned on the page:
> we’re also open-sourcing a command-line tool for agentic coding: Qwen Code. Forked from Gemini Code

I’ve instead used Gemini via plain ol’ chat, first building a competitive, larger context than Claude can hold, then manually bringing detailed plans and patches to Gemini for feedback, with excellent results.
I presumed MCP wouldn’t give me the focused results I get from completely controlling Gemini.
And that making CC interface via the MCP would also use up context on that side.
For example, you can drive one model to a very good point through several turns, and then have the second “red team” the result of the first.
Then return that to the first model with all of its built up context.
This is particularly useful in big plans doing work on complex systems.
Even with a detailed plan, it is not unusual for Claude code to get “stuck” which can look like trying the same thing repeatedly.
You can just stop that, ask CC to summarize the current problem and attempted solutions into a “detailed technical briefing.”
Have CC then list all related files to the problem including tests, then provide the briefing and all of the files to the second LLM.
This is particularly good for large contexts that might take multiple turns to get into Gemini.
You can have the consulted model wait to provide any feedback until you’ve said you’re done adding context.
And then boom, you get a detailed solution without even having to directly focus on whatever minor step CC is stuck on. You stay high level.
In general, CC is immediately cured and will finish its task. This is a great time to flip it into planning mode and get plan alignment.
Get Claude to output an update on its detailed plan, including what has already been accomplished, then, again, ship it to the consulting model.
If you did a detailed system specification in advance (which CC hopefully was originally also working from), you can then ask the consulting model to review the work done and the planned next steps.
Inevitably the consulting model will have suggestions to improve CC’s work so far and plans. Send it on back and you’re getting outstanding results.
Update: Here is what o3 thinks about this topic: https://chatgpt.com/share/688030a9-8700-800b-8104-cca4cb1d0f...
You set the environment variable ANTHROPIC_BASE_URL to an OpenAI-compatible endpoint and ANTHROPIC_AUTH_TOKEN to the API token for the service.
I used Kimi-K2 on Moonshot [1] with Claude Code with no issues.
There's also Claude Code Router and similar apps for routing CC to a bunch of different models [2].
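A minimal sketch of that env-var setup (the endpoint URL and token below are placeholders, not real values; substitute whatever your provider documents):

  import os, subprocess

  env = dict(os.environ,
             ANTHROPIC_BASE_URL="https://api.example.com/anthropic",  # your provider's OpenAI-compatible endpoint
             ANTHROPIC_AUTH_TOKEN="sk-...")                           # your provider's API key
  subprocess.run(["claude"], env=env)  # launch Claude Code against that backend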
Our main focuses were to be 1) CLI-first and 2) truly an open source community. We have 5 independent maintainers with full commit access -- they aren't from the same org or entity (disclaimer: one has joined me at my startup Gobii, where we're working on web browsing agents).
I'd love someone to do a comparison with CC, but IME we hold our own against Cursor, Windsurf, and other agentic coding solutions.
But yes, there really needs to be a canonical FOSS solution that is not tied to any specific large company or model.
It focuses especially on large context and longer tasks with many steps.
I’ve found getting CC to farm out to subagents to be the only way to keep context under control, but would love to bring in a different model as another subagent to review the work of the others.
It would be great if it starts supporting other models too natively. Wouldn't require people to fork.
They had made a bunch of hard-coded assumptions
Or they simply did that because it's much faster. Adding configuration options requires more testing and input handling. Later on, they can accept a PR from someone who needs it a lot, saving their own time.
That's precisely half of the point of OSS and I am pretty much okay with that.
Imo, the point of custom CLIs is that each model is trained to handle tool calls differently. In my experience, the tool call performance is wildly different (although they have started converging recently). Convergence is meaningful only when the models and their performance are commoditized and we haven't reached that stage yet.
Also docs on running it in a 24GB GPU + 128 to 256GB of RAM here: https://docs.unsloth.ai/basics/qwen3-coder
Recommended context: 65,536 tokens (can be increased)
That should be recommended token output, as shown in the official docs:

> Adequate Output Length: We recommend using an output length of 65,536 tokens for most queries, which is adequate for instruct models.

If you don't have enough RAM, then < 1 token/s.
Important layers are kept in 8-bit or 6-bit. Less important ones are left in 2-bit! I talk more about it here: https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs
sounds convincing, eh ... /s
On a less cynical note, the approach does look interesting, but I'd also like to understand how and why it works, if it works at all.
For example, in Phi 3 the end-of-sentence token was wrong - if we had used it, our quants would have been calibrated incorrectly, since chatting with the model uses the actual correct token.
Another is Llama 4 - https://github.com/ggml-org/llama.cpp/pull/12889 - in which I fixed a RoPE issue. If we hadn't fixed it first, then again the calibration process would have been incorrect.
Great to know that this is already a thing. I assume model "compression" is going to be the next hot topic.
I am wondering about the possibility that different use cases might require different "intelligent quantization", i.e., quantization for LLM for financial analysis might be different from LLM for code generation. I am currently doing a postdoc in this. Interested in doing research together?
Yes, different use cases will be different - oh interesting! Sorry, I doubt I can be of much help in your research - I'm mainly an engineering guy, so less research focused!
If so, I'd love detailed instructions.
The guide you posted earlier goes over my (and likely many others') head!
I have pretty bad ADHD, and I've only run locally using kobold; a dilettante at DIY AI.
So, yeah, I'm a bit lost in it.
Would llama.cpp support multiple (rtx 3090, no nvlink hw bridge) GPUs over PCIe4? (Rest of the machine is 32 CPU cores, 256GB RAM)
You will be mostly using 1 of your 3090s. The other one will be basically doing nothing. You CAN put the MoE weights on the 2nd 3090, but it's not going to speed up inference much, like <5% speedup. As in, if you lack a GPU, you'd be looking at <1 token/sec speeds depending on how fast your CPU does flops; if you have a single 3090 you'd be doing 10 tokens/sec; but with 2 3090s you'll still just be doing maybe 11 tokens/sec. These numbers are made up, but you get the idea.
Qwen3 Coder 480B is 261GB for IQ4_XS, 276GB for Q4_K_XL, so you'll be putting all the expert weights in RAM. That's why your RAM bandwidth is your limiting factor. I hope you're running off a workstation with dual CPUs and 12 sticks of DDR5 RAM per CPU, which gives you 24-channel DDR5 RAM.
The (approximate) equation for milliseconds per token, is:
Time for token generation = (number of params active in the model)*(quantization size in bits)/8 bits*[(percent of active params in common weights)/(memory bandwidth of GPU) + (percent of active params in experts)/(memory bandwidth of system RAM)].
This equation ignores prefill (prompt processing) time. This assumes the CPU and GPU is fast enough compute-wise to do the math, and the bottleneck is memory bandwidth (this is usually true).
So for example, if you are running Kimi K2 (32b active params per token, 74% of those params are experts, 26% of those params are common params/shared expert) at Q4 quantization (4 bits per param), and have a 3090 gpu (935GB/sec) and an AMD Epyc 9005 cpu with 12 channel DDR5-6400 (614GB/sec memory bandwidth), then:
Time for token generation = (32b params)*(4bits/param)/8 bits*[(26%)/(935 GB/s) + (74%)/(614GB/sec)] = 23.73 ms/token or ~42 tokens/sec. https://www.wolframalpha.com/input?i=1+sec+%2F+%2816GB+*+%5B...
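The same estimate as a small Python helper (a sketch with my own naming, nothing official):

  def ms_per_token(active_params_b, bits, pct_common, gpu_bw, ram_bw):
      gb_read = active_params_b * bits / 8  # GB traversed per generated token
      secs = gb_read * (pct_common / gpu_bw + (1 - pct_common) / ram_bw)
      return secs * 1000

  # Kimi K2 at Q4: 32B active params, 26% common weights on a 3090 (935 GB/s),
  # experts in 12-channel DDR5-6400 (614 GB/s):
  t = ms_per_token(32, 4, 0.26, 935, 614)
  print(f"{t:.2f} ms/token, ~{1000 / t:.0f} tokens/sec")  # 23.73 ms/token, ~42 tokens/sec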
Notice how this equation explains how the second 3090 is pretty much useless. If you load up the common weights on the first 3090 (which is standard procedure), then the 2nd 3090 is just "fast memory" for some expert weights. If the quantized model is 256GB (rough estimate, I don't know the model size off the top of my head), and common weights are 11GB (this is true for Kimi K2, I don't know if it's true for Qwen3, but this is a decent rough estimate), then you have 245GB of "expert" weights. Yes, this is generally the correct ratio for MoE models, Deepseek R1 included. If you put 24GB of that 245GB on your second 3090, you have 935GB/sec speed on... 24/245 or ~10% of each token. In my Kimi K2 example above, you start off with 18.08ms per token spent reading the model from RAM, so even if your 24GB on your GPU was infinitely fast, it would still take... about 16ms per token reading from RAM. Or in total about 22ms/token, or in total 45 tokens/sec. That's with an infinitely fast 2nd GPU, you get a speedup of merely 3 tokens/sec.
https://unexcitedneurons.substack.com/p/how-to-calculate-hom...
In my case it is a fairly old system I built from cheap eBay parts. Threadripper 3970X with 8x32GB dual channel 2666Mhz DDR4.
What would be a reasonable throughput level to expect from running 8-bit or 16-bit versions on 8x H200 DGX systems?
You should be able to get 40 to 50 tokens/s at minimum. High-throughput mode + a small draft model might get you 100 tokens/s generation.
I'm most excited for the smaller sizes because I'm interested in locally-runnable models that can sometimes write passable code, and I think we're getting close. But since for the foreseeable future, I'll probably sometimes want to "call in" a bigger model that I can't realistically or affordably host on my own computer, I love having the option of high-quality open-weight models for this, and I also like the idea of "paying in" for the smaller open-weight models I play around with by renting access to their larger counterparts.
Congrats to the Qwen team on this release! I'm excited to try it out.
Very interesting. Any subs or threads you could recommend/link to?
Thanks
They don't need to match bigger models, though. They just need to be good enough for a specific task!
This is more obvious when you look at the things language models are best at, like translation. You just don't need a super huge model for translation, and in fact you might sometimes prefer a smaller one because being able to do something in real-time, or being able to run on a mobile device, is more important than marginal accuracy gains for some applications.
I'll also say that due to the hallucination problem, beyond whatever knowledge is required for being more or less coherent and "knowing" what to write in web search queries, I'm not sure I find more "knowledgeable" LLMs very valuable. Even with proprietary SOTA models hosted on someone else's cloud hardware, I basically never want an LLM to answer "off the dome"; IME it's almost always wrong! (Maybe this is less true for others whose work focuses on the absolute most popular libraries and languages, idk.) And if an LLM I use is always going to be consulting documentation at runtime, maybe that knowledge difference isn't quite so vital— summarization is one of those things that seems much, much easier for language models than writing code or "reasoning".
All of that is to say:
Sure, bigger is better! But for some tasks, my needs are still below the ceiling of the capabilities of a smaller model, and that's where I'm focusing on local usage. For now that's mostly language-focused tasks entirely apart from coding (translation, transcription, TTS, maybe summarization). It may also include simple coding tasks today (e.g., fancy auto-complete, "ghost-text" style). I think it's reasonable to hope that it will eventually include more substantial programming tasks— even if larger models are still preferable for more sophisticated tasks (like "vibe coding", maybe).
If I end up having a lot of fun, in a year or two I'll probably try to put together a machine that can indeed run larger models. :)
This reminds me of the "best camera is the one you have with you" idea.
Though large models are just an HTTP request away, there are plenty of reasons to want to run one locally. Not the least of which is getting useful results in the absence of internet.
I feel like I'm the exact opposite here (despite heavily mistrusting these models in general): if I came to the model to ask it a question, and it decides to do a Google search, it pisses me off as I not only could do that, I did do that, and if that had worked out I wouldn't be bothering to ask the model.
FWIW, I do imagine we are doing very different things, though: most of the time, when I'm working with a model, I'm trying to do something so complex that I also asked my human friends and they didn't know the answer either, and my attempts to search for the answer are failing as I don't even know the terminology.
When a model does a single web search and emulates a compressed version of the "I'm Feeling Lucky" button, I am disappointed, too. ;)
I usually want the model to perform multiple web searches, do some summarization, refine/adjust search terms, etc. I tend to avoid asking LLMs things that I know I'll find the answer to directly in some upstream official documentation, or a local man page. I've long been and remain a big "RTFM" person; imo it's still both more efficient and more accurate when you know what you're looking for.
But if I'm asking an LLM to write code for me, I usually still enable web search on my query to the LLM, because I don't trust it to "remember" APIs. (I also usually rewrite most or all of the code because I'm particular about style.)
Is this an effort to chastise the viewpoint advanced? Because his viewpoint makes sense to me: I can run biggish models on my 128GB MacBook but not huge ones - even 2-bit quantized ones suck too many resources.
So I run a combination of local stuff and remote stuff depending upon various factors (cost, sensitivity of information, convenience/whether I'm at home, amount of battery left, etc ;)
Yes, bigger models are better, but often smaller is good enough.
I don't have $10-20k to spend on this stuff, which is about the minimum to run a 480B model, even with heavy quantisation. And it would be pretty slow, because for that price all you get is an old Xeon with a lot of memory or some old Nvidia datacenter cards. If you want a good setup, it will cost a lot more.
So small models it is. Sure, the bigger models are better but because the improvements come so fast it means I'm only 6 months to a year behind the big ones at any time. Is that worth 20k? For me no.
Under the hood, the way it works is that when you have final probabilities, it really doesn't matter whether the most likely token is selected with 59% or 75% - in either case it gets selected. If the 59% case gets there with a smaller amount of compute, and that holds across the board for the training set, the model will have similar performance.
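A toy illustration (made-up numbers) of the point that greedy decoding only cares about the argmax, not the margin:

  small_model = {"cat": 0.59, "dog": 0.21, "car": 0.20}  # less confident
  big_model   = {"cat": 0.75, "dog": 0.15, "car": 0.10}  # more confident
  pick = lambda probs: max(probs, key=probs.get)
  print(pick(small_model), pick(big_model))  # cat cat -> same token either way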
In theory, it should be possible to narrow down models even smaller to match the performance of big models, because I really doubt that you do need transformers for every single forward pass. There are probably plenty of shortcuts you can take in terms of compute for sets of tokens in the context. For example, coding structure is much more deterministic than natural text, so you probably don't need as much compute to generate accurate code.
You do need a big model first to train a small model though.
As for running huge models locally, it's not enough to run them; you need good throughput as well. If you spend $2k on a graphics card, that is way more expensive than realistic usage of a paid API, with slower output as well.
I was surprised in the AlphaEvolve paper how much they relied on the flash model because they were optimizing for speed of generating ideas.
Untrue. The big important issue for LLMs is hallucination, and making your model bigger does little to solve it.
Increasing model size is a technological dead end. The future advanced LLM is not that.
Likewise, I found that the regular Qwen3-30B-A3B worked pretty well on a pair of L4 GPUs (60 tokens/second, 48 GB of memory) which is good enough for on-prem use where cloud options aren't allowed, but I'd very much like a similar code specific model, because the tool calling in something like RooCode just didn't work with the regular model.
In those circumstances, it isn't really a comparison between cloud and on-prem, it's on-prem vs nothing.
Might have to swap out Ollama for vLLM though and see how different things are.
Oh, that might be it. Using GGUF is slower than, say, AWQ if you want 4-bit, or fp8 if you want the best quality (especially on the Ada arch, which I think your GPUs are).
edit: vLLM is better for Tensor Parallel and also better for batched inference, some agentic stuff can do multiple queries in parallel. We run devstral fp8 on 2x A6000 (old, not even Ada) and even with marlin kernels we get ~35-40 t/s gen and 2-3k pp on a single session, with ~4 parallel sessions supported at full context. But in practice it can work with 6 people using it concurrently, as not all sessions get to the max context. You'd get 1/2 of that for 2x L4, but should see higher t/s in generation since you have Ada GPUs (native support for fp8).
Sadly it falls short during real world coding usage, but fingers crossed that a similarly sized coder variant of Qwen 3 can fill in that gap for me.
This is my script for the Q4_K_XL version from unsloth at 45k context:
llama-server.exe --host 0.0.0.0 --no-webui --alias "Qwen3-30B-A3B-Q4_K_XL" --model "F:\models\unsloth\Qwen3-30B-A3B-128K-GGUF\Qwen3-30B-A3B-128K-UD-Q4_K_XL.gguf" --ctx-size 45000 --n-gpu-layers 99 --slots --metrics --batch-size 2048 --ubatch-size 2048 --temp 0.6 --top-p 0.95 --min-p 0 --presence-penalty 1.5 --repeat-penalty 1.1 --jinja --reasoning-format deepseek --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn --no-mmap --threads 8 --cache-reuse 256 --override-tensor "blk\.([0-9][02468])\.ffn_._exps\.=CPU"
It has also been helpful (when run locally, of course) for addressing questions-- good faith questions, not censorship tests to which I already know the answers-- about Chinese history and culture that the DeepSeek app's censorship is a little too conservative for. This is a really fun use case actually, asking models from different parts of the world to summarize and describe historical events and comparing the quality of their answers, their biases, etc. Qwen3-30B-A3B is fast enough that this can be as fun as playing with the big, commercial, online models, even if its answers are not equally detailed or accurate.
yep, when you hire an immigrant software engineer, you don't ask them if Israel has a right to exist, or whether Vladivostok is part of China. Unless you are a DoD vendor, in which case there won't be an interview anyway.
It very much reminds of tabbing autocomplete with IntelliSense step by step, but in a more diffusion-like way.
but my tool-set is a mixture of agentic and autocomplete, not 100% of either. I try to keep a clear focus on the architecture, and actually own the code by reading most of it, keeping the parts of the code straight the way I like.
Is this the one? https://github.com/ggml-org/llama.vscode It seems to be built for code completion rather than outright agent mode.
The plugin itself provides chat also, but my gut feeling is that ggerganov runs several models at the same time, given he uses a 192GB machine.
Have not tried this scenario yet, but looking at my API bill, I'm probably going to try 100% local dev at some point. Besides, vibe coding with existing tools seems to not work that well for enterprise-size codebases.
It feels intuitively obvious (so maybe wrong?) that a 32B Java Coder would be far better at coding Java than a generalist 32B Coder.
First, Java code tends to be written a certain way, and for certain goals and business domains.
Let’s say 90% of modern Java is a mix of:

- students learning to program and writing algorithms
- corporate legacy software from non-tech focused companies
If you want to build something that is uncommon in that subset, it will likely struggle due to a lack of training data. And if you wanted to build something like a game, the majority of your training data is going to be based on ancient versions of Java, back when game development was more common in Java.
Comparatively, including C in your training data gives you exposure to a whole separate set of domain data for training, like IoT devices, kernels, etc.
Including Go will likely include a lot more networking and infrastructure code than Java would have had, which means there is also more context to pull from in what networking services expect.
Code for those domains follow different patterns, but the concepts can still be useful in writing Java code.
Now, there may be a middle ground where you could have a model that is still general for many coding languages, but given extra data and fine-tuning focused on domain-specific Java things — like more of a “32B CorporateJava Coder” model — based around the very specific architecture of Spring. And you’d be willing to accept that model to fail at writing games in Java.
It’s interesting to think about for sure - but I do feel like domain-specific might be more useful than language-specific
fine-tuned rather than created from scratch, though.
assuming it doesn't all implode due to a lack of profitability, it should be obvious
That said, code+git+agent is the only acceptable way for technical staff to interact with AI. Tools with a sparkles button can go to hell.
https://a16z.com/llmflation-llm-inference-cost/ https://openrouter.ai/anthropic/claude-sonnet-4
Obsession?
Having to tack on top of that 2-4h of work per day is not normal, and again, it's probably unhealthy.
It's a bit like the media cycle. The more jacked in you are, the more behind you feel. I'm less certain there will be winners as much as losers, but for sure the time investment of staying up to date on these things will not pay dividends for the average HN reader.
This should be written on the coffin of full stack development.
A look at fandom wikis is humbling. People will persist and go very deep into stuff they care about.
In this case: Read a lot, try to build a lot, learn, learn from mistakes, compare.
Oh, but it is.
Imagine you were there, back in those days. A few years after VHS won, you couldn't find your favorite movies on Betamax. There was a lot more hardware available for VHS, and it was cheaper.
Mass adoption largely wins out over almost everything.
Case in point from software: Visual Basic, PHP, Javascript, Python (though Python is slightly more technically sound than the other ones), early MySQL, MongoDB, early Windows, early Android.
Figuring out how to stay sane while staying abreast of developments will be a key skill to cultivate.
I’m pretty skeptical there will be a single model with a defensible moat TBH. Like cloud compute, there is both economy of scale and room for multiple vendors (not least because bigco’s want multiple competing bids).
1. Where they can be used as autocompletion in an IDE at speeds comparable with Intellisense
2. And where they're good enough to generate most code reliably, while using a local LLM
3. While running on hardware costing in total max 2000€
4. And definitely with just a few "standard" pre-configured Open Source/open weights LLMs where I don't have to become an LLM engineer to figure out the million knobs
I have no clue how Intellisense works behind the scenes, yet I use it every day. Same story here.
Given how much better the bleeding edge models are now than 6 months ago, as long as any model is getting smarter I don’t see stagnation as a possibility. If Gemini starts being better at coding than Claude, you’re gonna switch over if your livelihood is dependent on it.
As Heraclitus said "The only constant in life is change"
(and maybe Emacs)
We don't actually need a winner, we need 2-3-4 big, mature commercial contenders for the state of the art stuff, and 2-3-4 big, mature Open Source/open weights models that can be run on decent consumer hardware at near real-time speeds, and we're all set.
Sure, there will probably be a long tail, but the average programmer probably won't care much about those, just like they don't care about Erlang, D, MoonScript, etc.
I think the bottleneck is file read/write tooling right now
Saw a repo recently with probably 80% of those
Now I have a git repo I add as a submodule and tell each tool to read through and create their own WHATEVER.md
In which case you'd have 1 markdown file and at least for the ones that are invoked via the CLI, just set up a Makefile entry point that leads them to the correct location.
> This node.js CLI tool processes CLAUDE.md files with hierarchical collection and recursive @-import resolution. Walks directory tree from current to ~/.claude/, collecting all CLAUDE.md files and processing them with file import resolution. Saves processed context files with resolved imports next to the original CLAUDE.md files or in a specific location (configurable).
I mostly use Claude Code, but every now and then go with Gemini, and having to maintain two sets of (hierarchical) instructions was annoying. And then opencode showed up, which meant yet another tool I wanted to try out and …well.
Library to help with this. Not great that a library is necessary, but useful until this converges to a standard (if it ever does).
Closest you get is https://github.com/opencode-ai/opencode in Go.
Alibaba Plus: input $1 to $6, output $5 to $60
Alibaba OpenSource: input $1.50 to $4.50, output $7.50 to $22.50
So it doesn't look that cheap compared to Kimi K2 or their non-coder version (Qwen3 235B A22B 2507).
What's more confusing is the "up to" pricing, which can supposedly reach $60 for output - with agents it's not that easy to control context.
It seems like a very expensive model to run with Alibaba as the provider. You only get the low price for input <32k. For input <256k, both Gemini 2.5 Pro and o3 are cheaper.
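A quick cost sketch against the quoted top tier (the session sizes are made up):

  in_mtok, out_mtok = 2.0, 0.5      # hypothetical heavy agentic session, in millions of tokens
  price_in, price_out = 6.0, 60.0   # $/M tokens at the top Alibaba Plus tier
  print(in_mtok * price_in + out_mtok * price_out)  # $42.00 for one session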
05%: Making code changes
10%: Running build pipelines
20%: Learning about changed process and people via zoom calls, teams chat and emails
15%: Raising incident tickets for issues outside of my control
20%: Submitting forms, attending reviews and chasing approvals
20%: Reaching out to people for dependencies, following up
10%: Finding and reading up some obscure and conflicting internal wiki page, which is likely to be outdated
Many of those things could be improved today even without AI, but e.g. raising incident tickets for issues outside of your control could also start from an AI-generated suggestion that you just have to tick off.
Not saying we are there yet but hard to imagine it's not possible.
It's probably messier than you think.
Coding, debugging builds, paperwork, doc chasing are all tasks that AI is improving on rapidly.
It’s true to say that time writing code is usually a minority of a developer’s work time, and so an AI that makes coding 20% faster may only translate to a modest dev productivity boost. But 5% time spent coding is a sign of serious organizational dysfunction.
Many reasons to touch existing code.
Still, writing a few hundred lines doesn't take a whole week.
- Agentic DevOps: provisions infra and solves platform issues as soon as a support ticket is created.
- Agentic Technical Writer: one GenAI agent writes the docs and keeps the wiki up to date, while another 100 agents review it all and flag hallucinations.
- Agentic Manager: attends meetings, parses emails and logs 24x7 and creates daily reports, shares these reports with other teams, and manages the calendar of the developers to shield them from distractions.
- Agentic Director: spots patterns in the data and approves things faster, without the fear of getting fired.
- Agentic CEO: helps with decision-making, gives motivational speeches, and aligns vision with strategy.
- Agentic Pet: a virtual mascot you have to feed four times a day, Monday to Friday, from your office's IP address. Miss a meal and it dies, and HR gets notified. (This was my boss's idea)
> sign of serious organizational dysfunction.

You're not wrong, but it's a "dysfunction" that many successful tech companies have learned to leverage.

The reality is, most engineers spend far less than half their time writing new code. This is where the 80/20 principle comes into play. It's common for 80% of a company's revenue to come from 20% of its features. That core, revenue-generating code is often mature and requires more maintenance than new code. Its stability allows the company to afford what you call "dysfunction": having a large portion of engineers work on speculative features and "big bets" that might never see the light of day.
So, while it looks like a bug from a pure "coding hours" perspective, for many businesses, it's a strategic feature!
1) aligning the work of multiple developers
2) ensuring that developer attention is focused only on the right problems
3) updating stakeholders on progress of code buildout
4) preventing too much code being produced because of the maintenance burden
If agentic tooling reduces the cost of code ownership, annd allows individual developers to make more changes across a broader scope of a codebase more quickly, all of this organizational overhead also needs to be revisited.
I can point at a huge doc for some API and get the important things right away, or ask questions of it. I can get it to review PRs so I can quickly get the gist of the changes before digging into the code myself.
For coding, I don't find agents boost my productivity that much where I was already productive. However, they definitely allow me to do things I was unable to before (or would have taken very long as I wasn't an expert) – for example my type signatures have improved massively, in places where normally I would have been lazy and typed as any I now ask claude to come up with some proper types.
I've had it write code for things that I'm not great at, like geometry, or dataviz. But these are not necessarily increasing my productivity, they reduce my reliance on libraries and such, but they might actually make me less productive.
I actually tried to use Qwen3[1] to analyse customer cases and it was worse than useless at it.
[1] We can't use any online model as these bug reports contain large amounts of PII, customer data, etc.
You only have to describe how you want commits written once, and then the AI will just handle it. It's not that any of us can't write good commits, but humans get tired, lose focus, get interrupted, etc.
Just in my short time using Claude Code, it generally writes pretty good commits; it often adds more detail than I normally would not because I'm not capable but because there's a certain amount of cognitive overhead when it comes to writing good commits and it gets harder as our mental energy decreases.
I found this custom command [1] for Claude Code and it reminded me that there's no way a human can consistently do this every single time, perhaps a dozen times per day, unless they're doing nothing else--no meetings, no phone calls, etc. And we know that's not possible:
[1]: https://github.com/qdhenry/Claude-Command-Suite/blob/main/.c...
# Git Status Command
Show detailed git repository status
*Command originally created by IndyDevDan (YouTube: https://www.youtube.com/@indydevdan) / DislerH (GitHub: https://github.com/disler)*
## Instructions
Analyze the current state of the git repository by performing the following steps:
1. *Run Git Status Commands*
- Execute `git status` to see current working tree state
- Run `git diff HEAD origin/main` to check differences with remote
- Execute `git branch --show-current` to display current branch
- Check for uncommitted changes and untracked files
2. *Analyze Repository State*
- Identify staged vs unstaged changes
- List any untracked files
- Check if branch is ahead/behind remote
- Review any merge conflicts if present
3. *Read Key Files*
- Review README.md for project context
- Check for any recent changes in important files
- Understand project structure if needed
4. *Provide Summary*
- Current branch and its relationship to main/master
- Number of commits ahead/behind
- List of modified files with change types
- Any action items (commits needed, pulls required, etc.)
This command helps developers quickly understand:
- What changes are pending
- The repository's sync status
- Whether any actions are needed before continuing work
Arguments: $ARGUMENTS

The LLM is better than you at math, too.
https://www.reuters.com/world/asia-pacific/google-clinches-m...
Plenty of us are using LLM/agentic coding in highly regulated production applications. If you're not getting very impressive results in backend and frontend, it's purely a skill issue on your part. "This hammer sucks because I hit my thumb every time!"
I am not really sure what to say, except that if you are simply looking for a way to insult people, just admit you are a mean person and you won't have to justify it in ways that make no sense. But if you really only hate LLMs, you can do that in ways that don't involve insulting people. To be so full of disdain for a technology that it turns you irrational, though, should be a bit concerning.
It seems to me the only reason someone would feel the need to do such a thing is to validate their own experience. If everyone else seems to be finding value in a tool, but you cannot, it must be because everyone else just isn't doing important things with it.
As I said earlier, I would be concerned about such behavior if I found myself doing it.
In the former case… I’m interested to hear how they’re better? Do you choose an agent with the full context of the changes to write the message, so it knows where you started, why certain things didn’t work? Or are you prompting a fresh context with your summary and asking it to make it into a commit message? Or something else?
If I’m using a CLI:
the agent already has:

- the context from the chat
- the ticket number, via me or from when it created the ticket
- meta info via project memory or other terminal commands like API calls, etc.
- info on commit format from project memory
So it boils down to asking it to commit and update the ticket when we’re done with the task in that case. Having a good workflow is key
For your question: I still read and validate/correct; in the end, I'm the one committing the code! So the usual requirements apply from there. If someone else used their LLM, the results would vary; here they have an approved summary. This is why a human in the loop is essential.
But I find it very interesting how others find prompting more productive for their use cases. It's definitely a new skill. Over years I also built my skill to write commits, so it comes natural to me as opposed to prompting, which requires extra effort and thinking in a different way and context and it doesn't work well for something that I do basically automatically already.
Give it a try, it's kind of impressive.
It’s definitely a new skill.
Are you in a heavily regulated industry or a dysfunctional organization?
Most big tech companies optimize their build pipelines a lot to reduce commit-to-deploy (or validation/test) time, which keeps engineers focused on the same task while the problem/solution is fresh.
Also, you're not making an argument against agentic coding, you're actually making an argument for it - you don't have time to code, so you need someone or something to code for you.
- Running build pipelines: make a CLI tool to initiate them, monitor them, and notify you on completion/error with audio (see the sketch below). Allows chaining multiple things. Run it in a background terminal.
- Learning about changed process and people via Zoom calls, Teams chat and emails: pass the chat and email logs to an LLM with a particular focus. Demand that Zoom call transcripts be published for that purpose (we use Meet).
- Raising incident tickets for issues outside of my control: automate this with an agent: allow it to access as much as needed, and steer it with short guidance - all doable via Claude Code + a custom MCP.
- Submitting forms, attending reviews and chasing approvals: the best thing to automate. They want forms? They will have forms. Chasing approvals: fire and forget + queue management, same.
- Reaching out to people for dependencies, following up: LLM as personal assistant is the classic job. Code this away.
- Finding and reading up on some obscure and conflicting internal wiki page, which is likely to be outdated: index all the data, put it into RAG, and let the agent dig deeper.

Most of the time you spend is on scheduling micro-tasks, switching between them, and maintaining an unspoken queue of checking various SaaS frontends. Formalize micro-task management, automate the endpoints, and delegate it to your own selfware (an ad-hoc tool chain you vibe-coded for yourself only, tailored to your particular working environment).
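The pipeline notifier, for example, can be tiny. A sketch (the build command, sound player, and file path are assumptions; swap in whatever your platform has):

  import subprocess, time

  def run_and_notify(cmd, poll_secs=30):
      proc = subprocess.Popen(cmd)  # kick off the pipeline
      while proc.poll() is None:    # still running, check back later
          time.sleep(poll_secs)
      status = "ok" if proc.returncode == 0 else "FAILED"
      # any audio notifier works; paplay is the PulseAudio player on Linux
      subprocess.run(["paplay", "/usr/share/sounds/freedesktop/stereo/complete.oga"])
      print("pipeline finished:", status)

  run_and_notify(["make", "deploy"])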
I do all of this (almost) to automate away non-coding tasks. Life is fun again.
Hope this helps.
agentic coding will not fix these systemic issues caused by organizational dysfunction. it will, however, allow the software created by these companies to be rewritten from scratch for 1/100th the cost, with better reliability and performance.
the resistance to AI adoption inside corporations that operate like this is intense and will probably intensify.
it takes a combination of external competitive pressure, investor pressure, attrition, PE takeovers, etc, to grind down internal resistance, which takes years or decades depending on the situation.
Cheaper yes. More reliable? Absolutely not. Not with today’s models at least.
There are so many problems in the world we need to stop cramming into the same bus.
Example: Partition a linked list in linear time. None of these models seems to get that `reverse`, or converting the whole list to a vector, are themselves linear operations and therefore off-limits. When you tell them not to use those, they still continue to do so and blatantly claim that they aren't using them. À la:
"You are right, ... . The following code avoids using `reverse`, ... :
[code that still uses reverse]"
And in languages like Python they will cheat, because Python's list is more like an array, where random access is O(1).
This means they only work well when you are doing something quite mainstream, where the amount of training data is a strong enough signal in the noise. But even there they often struggle. For example, I found them somewhat useful for doing Django things, but just as often they gave bullshit code, or it took a lot of back and forth to get something useful out of them.
I think it is embarrassing that, with sooo much training data, they are still unable to do much more than go by frequency in the training data when suggesting "solutions". They "learn" differently than a human being. When a human being sees a new concept, they can often apply that new concept, even if it does not happen to be needed that often, as long as they remember it. But these LLMs seem to deem everything that isn't mainstream irrelevant.
I get this kind of lying from Gemini 2.5 Flash sometimes. It's super frustrating and dissolves all the wonder that accumulated when the LLM was giving useful responses. When it happens, I abandon the session and either figure out the problem myself or try a fresh session with more detailed prompting.
-- | Takes a predicate and a list, and returns a pair of lists.
-- | The first list contains elements that satisfy the predicate.
-- | The second contains the rest.
partitionRecursive :: (a -> Bool) -> [a] -> ([a], [a])
partitionRecursive _ [] = ([], []) -- Base case: An empty list results in two empty lists.
partitionRecursive p (x:xs) =
  -- Recursively partition the rest of the list
  let (trues, falses) = partitionRecursive p xs
  in if p x
       -- If the predicate is true for x, add it to the 'trues' list.
       then (x : trues, falses)
       -- Otherwise, add it to the 'falses' list.
       else (trues, x : falses)

I just checked my code, and while I think the partition example still shows the problem, the one I used to check is a similar but different one:
Split a list at an element that satisfies a predicate. Here is some code for that in Scheme:
(define split-at
(λ (lst pred)
"Each element of LST is checked using the predicate. If the
predicate is satisfied for an element, then that element
will be seen as the separator. Return 2 values: The split
off part and the remaining part of the list LST."
(let iter ([lst° lst]
[index° 0]
[cont
(λ (acc-split-off rem-lst)
(values acc-split-off rem-lst))])
(cond
[(null? lst°)
(cont '() lst°)]
[(pred (car lst°) index°)
(cont '() (cdr lst°))]
[else
(iter (cdr lst°)
(+ index° 1)
(λ (new-tail rem-lst)
(cont (cons (car lst°) new-tail)
rem-lst)))]))))
For this kind of stuff with constructed continuations, they somehow never get it. They will do `reverse` and `list->vector` and `append` all day long, or make some other attempt at working around what you specified they shall not do. The concept of building up a continuation seems completely unknown to them.

I would have expected good providers like Together, Fireworks, etc. to support it, but I can't find it, except if I run vLLM myself on self-hosted instances.
I wonder if there's a python expert that can be isolated.
I downloaded the 4bit quant to my mac studio 512GB. 7-8 minutes until first tokens with a big Cline prompt for it to chew on. Performance is exceptional. It nailed all the tool calls, loaded my memory bank, and reasoned about a golang code base well enough to write a blog post on the topic: https://convergence.ninja/post/blogs/000016-ForeverFantasyFr...
Writing blog posts is one of the tests I use for these models. It is a very involved process including a Q&A phase, drafting phase, approval, and deployment. The filenames follow a certain pattern. The file has to be uploaded to s3 in a certain location to trigger the deployment. It's a complex custom task that I automated.
Even the 4-bit model was capable of this, but it was incapable of actually working on my code, preferring to hallucinate methods that would be convenient rather than admitting it didn't know what it was doing. This is the 4-bit "lobotomized" model though. I'm excited to see how it performs at full power.
Edit:
I ran my fun test on it and it unfortunately failed.
> ”How can I detect whether a user is running in a RemoteApp context using C# and .NET? That is, not a full RDP desktop session, but a published RemoteApp as if the app is running locally. The reason I’m asking is that we have an unfortunate bug in a third party library that only shows up in this scenario, and needs a specific workaround when it happens.”
It started by trying to read hallucinated environment variables that just aren’t there. Gemini 2.5 Pro had the same issue and IIRC also Claude.
The only one I have seen give the correct answer that is basically ”You can’t. There’s no official method to do this and this is intentional by Microsoft.” along with a heuristic to instead determine the root launching process which is thus far (but not guaranteed to be) RDPINIT.EXE rather than EXPLORER.EXE as in typical desktop or RDP scenarios… has been OpenAI o3. o3 also provided additional details about the underlying protocol at play here which I could confirm with external sources to be correct.
I like my query because it forces the LLM to actually reply with that you just can’t do this, there’s no ”sign” of it other than going by a completely different side-effect. They are usually too eager to try to figure out a positive reply and hallucinate in the process. Often, there _are_ these env vars to read in cases like these, but not here.
Not quite as good as Claude, but by far the best Qwen model yet, and 2x as fast as qwen3-235b-a22b-07-25.
Specific results for qwen3-coder here: https://llm-benchmark.tinybird.live/models/qwen3-coder
Open, small, roughly Sonnet 4-ish if the benchmarks are to be believed, tool use?
"Today, we're announcing Qwen3-Coder, our most agentic code model to date. Qwen3-Coder is available in multiple sizes, but we're excited to introduce its most powerful variant first: Qwen3-Coder-480B-A35B-Instruct."
https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct
You won't be out of work creating ggufs anytime soon :)
https://winbuzzer.com/2025/01/29/alibabas-new-qwen-2-5-max-m...
Alibaba is not a company whose culture is conducive to earnest acknowledgement that they are behind SOTA.
this is disingenuous. there are a bunch of hurdles to using open models over closed models, and you know them as well as the rest of us.
To the extent that there's a solution, the solution is choice!