Also, this model (https://huggingface.co/Qwen/Qwen3-235B-A22B) is native 32k, so the 64k and 131k variants use RoPE scaling, which is not the best for effective context.
Whereas Qwen3-Coder (https://qwenlm.github.io/blog/qwen3-coder/) is 256k native: https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct.
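For what it's worth, a rough sketch of how that kind of context extension is typically switched on for a 32k-native checkpoint: a YaRN-style rope_scaling override in the Hugging Face config. The field names follow the transformers convention; the exact values here are assumptions, so check the model card.

```python
# Sketch: extending a 32k-native checkpoint toward ~131k via YaRN-style RoPE
# scaling. Field names follow the Hugging Face `transformers` rope_scaling
# convention; the 4x factor below is an assumption -- check the model card.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen3-235B-A22B")
cfg.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                              # 32768 * 4 ~= 131k positions
    "original_max_position_embeddings": 32768,  # the native window
}
cfg.max_position_embeddings = 131072
# Pass `config=cfg` to AutoModelForCausalLM.from_pretrained(...) to load with
# the extended window. Static scaling like this is exactly why effective
# long-context quality can lag behind the native 32k window.
```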
> Cerebras Systemstoday [sic] announced the launch of Qwen3-235B with full 131K context support on its inference cloud platform
Then later:
> Cline users can now access Cerebras Qwen models directly within the editor—starting with Qwen3-32B at 64K context on the free tier. This rollout will expand to include Qwen3-235B with 131K context
Not sure where you get the 40K number from.
(These model names are so confusing.)
I've often observed thinking/reasoning to cause models to completely disregard important constraints, because they essentially can act as conversational turns.
Funny that, when given too much brainpower, AIs manifest ADHD symptoms…
I know that Letta have a decent approach to this, but I haven't yet seen it done well with a coding agent, by them or anyone else. Is there anyone doing this with any measure of success?
I run plenty of agent loops and the speed makes a somewhat interesting difference in time "compression". Having a Claude 4 Sonnet-level model running at 1000-1500 tok/s would be extremely impressive.
To FEEL THE SPEED, you can try it yourself on the Cerebras Inference page, through their API, or, for example, on Mistral's Le Chat with their "Flash Answers" (powered by Cerebras). Iterating on code at 1000 tok/s makes it feel even more magical.
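If you want to poke at the API route from code, a minimal sketch with the stock openai client pointed at an OpenAI-compatible endpoint looks like this; the base_url and model id are assumptions, so substitute the values from Cerebras' docs.

```python
# Minimal sketch: point the standard OpenAI client at an OpenAI-compatible
# endpoint. The base_url and model name are assumptions -- check the docs.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_CEREBRAS_API_KEY",
)

resp = client.chat.completions.create(
    model="qwen-3-235b-a22b",  # assumed model id
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    stream=True,  # streaming is where the 1000+ tok/s really shows
)

for chunk in resp:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```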
However, I think Cerebras first needs to get the APIs to be more OpenAI-compatible. I tried their existing models with a bunch of coding agents (including Cline, which they did a PR for) and they all failed to work, either due to a 400 error or tool calls not being formatted correctly (a minimal tool-call check is sketched below). Very disappointed.
Deciding if I should switch to Qwen 3 and Cerebras.
(Also, off-topic, but the name reminds me of cerebrates from Starcraft. The Zerg command hierarchy lore was fascinating when I was a young child.)
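On the tool-call point above, a minimal compliance check against an OpenAI-compatible endpoint might look like this. The base_url, model id, and the read_file tool are all hypothetical placeholders for the test.

```python
# Minimal tool-call round trip, useful for checking whether tool calls come
# back well-formed. base_url, model name, and the tool itself are assumptions.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.cerebras.ai/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # hypothetical tool, only used to exercise the path
        "description": "Read a file from the workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen-3-235b-a22b",  # assumed model id
    messages=[{"role": "user", "content": "Open src/main.py"}],
    tools=tools,
)

msg = resp.choices[0].message
for call in msg.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))  # should parse cleanly
```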
But then it also sounds like running a proxy tool locally is needed.
I haven’t tried this setup, and can’t say offhand if Cerebras’ hosted qwen described here is “OpenAI” compatible.
I also don’t know if all of the tools CC uses out of the box are supported in the most compatible non-Anthropic models.
Can anyone provide clarity / additional testimony on swapping out the engine on Claude Code?
A workflow I've been hearing about is: use Claude Code until quota exhaustion, then use Gemini CLI with Gemini 2.5 Pro free credits until quota exhaustion, then use something like a cheap-ish K2 or Qwen 3 provider, with OpenCode or the new Qwen Code, until your Claude Code credits reset and you begin the cycle anew.
It will also impact how we work: interactive IDEs like Cursor probably make more sense than CLI tools like Claude Code when answers are nearly instant.
It opens up a whole lot of use cases that'd be a nightmare if you have to look at each individual change.
https://console.groq.com/docs/model/moonshotai/kimi-k2-instr...
very fun to see agents using those backends
I tested it and the speed is incredible, though.
But then, same for humans yes?
https://www.kip.uni-heidelberg.de/Veroeffentlichungen/downlo...
https://archive.ll.mit.edu/publications/journal/pdf/vol02_no...
The second's patents would also be long-expired since it's from 1989.
With 44GB of SRAM per Cerebras chip, you'd need 45 chips chained together. $3m per chip. $135m total to run this.
For comparison, you can buy a DGX B200 with 8x B200 Blackwell chips and 1.4TB of memory for around $500k. Two systems would give you 2.8TB memory which is enough for this. So $1m vs $135m to run this model.
It's not very scalable unless you have some ultra-high-value task that needs super fast inference speed. Maybe hedge funds or some sort of financial markets?
PS. The reason why I think we're only in the beginning of the AI boom is because I can't imagine what we can build if we can run models as good as Claude Opus 4 (or even better) at 1500 tokens/s for a very cheap price and tens of millions of context tokens. We're still a few generations of hardware away I'm guessing.
The smallest real floating point type is FP4.
EDIT: Who knew that correctness is controversial. What a weird place HN has become.
Remembering my CS classes, storing an FP value requires the mantissa and the exponent; that's a design decision. Also remembering some assembler classes, int arithmetic is way faster than FP.
Could there be a better "representation" for the numbers needed in NNs that would provide the accuracy of floating point but allow faster operations? (Maybe even allowing the required operations to be performed as bitwise ops, kind of like left/right shifting to double/halve ints.)
A modern processor can do something similar to an integer bit shift about as quickly on a floating-point value, courtesy of FSCALE and similar instructions. Indeed, modern processors are extremely performant at floating-point math.
Yes. Look up “block floating point”.
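A toy numpy sketch of the idea (illustrative only, not any particular hardware format): values are grouped into small blocks that share one exponent, so each element stores just a narrow integer mantissa, and rescaling is a bit shift.

```python
import numpy as np

def bfp_quantize(x, block=16, mant_bits=8):
    """Toy block floating point: each block of `block` values shares one
    exponent; elements store only a small signed integer mantissa."""
    x = x.reshape(-1, block)
    # Shared exponent per block, chosen from the block's largest magnitude.
    exp = np.ceil(np.log2(np.abs(x).max(axis=1, keepdims=True) + 1e-30)).astype(int)
    scale = 2.0 ** (exp - (mant_bits - 1))
    mant = np.clip(np.round(x / scale), -(2 ** (mant_bits - 1)), 2 ** (mant_bits - 1) - 1)
    return mant.astype(np.int8), exp

def bfp_dequantize(mant, exp, mant_bits=8):
    return mant * 2.0 ** (exp - (mant_bits - 1))

w = np.random.randn(4, 16).astype(np.float32)
mant, exp = bfp_quantize(w)
print(np.abs(w - bfp_dequantize(mant, exp)).max())  # small reconstruction error
```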
That's not how you would do it with Cerebras. 44GB is SRAM, so on chip memory, not HBM memory where you would store most of the params. For reference one GB200 has only 126MB of SRAM, if you tried to estimate how many GB200 you would need for a 2TB model just by looking at the L2 cache size you would get 16k GB200 aka ~600M$, obviously way off.
Cerebras uses a different architecture than Nvidia: the HBM is not packaged directly with the chips; it's handled by a separate system, so you can scale memory and compute separately. Specifically, you can use something like MemoryX to act as your HBM, connected to the chip's SRAM over a high-speed interconnect, see [1]. I'm not at all an expert in Cerebras, but IIRC you can connect up to something like 2PB of memory to a single Cerebras chip, so almost 1000x the FP16 model.
[1]: https://www.cerebras.ai/blog/announcing-the-cerebras-archite...
That's my take on it all; these two systems are a bit apples-to-oranges, so there aren't many direct comparisons to work from, even rolling down the same slope.
I don’t follow what you are saying and what “that” is specifically. Assuming it’s referencing using HBM and not just SRAM, this is not optional on a GPU; SRAM is many orders of magnitude too small. Data is constantly flowing between HBM and SRAM by design, and to get data in/out of your GPU you have to go through HBM first, you can’t skip that.
And while it is quite massive on a Cerebras system it is also still too small for very large models.
> That's not how you would do it with Cerebras. 44GB is SRAM, so on chip memory, not HBM memory where you would store most of the params. For reference one GB200 has only 126MB of SRAM, if you tried to estimate how many GB200 you would need for a 2TB model just by looking at the L2 cache size you would get 16k GB200 aka ~600M$, obviously way off.
Yes, but Cerebras achieves its speed by using SRAM. That doesn’t mean you are only using SRAM, though; that would be impractical. Just like using a CPU by only ever storing stuff in the L3 cache and never going to RAM. Unless I am missing something from the original link, I don’t know how you got to the conclusion that they only used SRAM.
So if Cerebras uses HBM to store the model but streams weights into SRAM, I really don't see the long-term advantage over smaller chips like the GB200, since both architectures use HBM.
The whole point of having a wafer chip is that you limit the need to reach out to external parts for memory since that's the slow part.
I don’t think you can look at those things binarily. 44GB of SRAM is still a massive amount; you don’t need infinite SRAM to get better performance. There is a reason Nvidia is increasing the L2 cache size with every generation rather than just sticking with 32MB, if it really changed nothing to have a bit more. The more SRAM you have, the more you are able to mask communication behind computation. You can imagine, with 44GB, being able to load the weights of layer N+1 into SRAM while computing layer N, thereby entirely negating the penalty of going to HBM (same idea as FSDP).
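That layer-prefetch idea as a toy sketch (generic double buffering in Python, not Cerebras' actual API): fetch layer N+1's weights on a background thread while layer N computes, so the transfer cost hides behind the compute cost.

```python
# Toy double-buffering sketch: overlap the "slow memory -> fast memory"
# transfer of the next layer's weights with the current layer's compute.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

N_LAYERS, DIM = 8, 1024
slow_memory = [np.random.randn(DIM, DIM).astype(np.float32) for _ in range(N_LAYERS)]

def fetch(layer_idx):            # stands in for the HBM/MemoryX -> SRAM transfer
    return slow_memory[layer_idx].copy()

def forward(x):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, 0)              # prefetch layer 0
        for i in range(N_LAYERS):
            weights = pending.result()               # waits only if the transfer lagged compute
            if i + 1 < N_LAYERS:
                pending = pool.submit(fetch, i + 1)  # start the next transfer now
            x = np.tanh(x @ weights)                 # "compute layer i"
    return x

print(forward(np.random.randn(1, DIM).astype(np.float32)).shape)
```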
You would have to have an insanely fast bus to prevent I/O stalls with this. With a 235B fp16 model you'd be streaming 470GiB of data every graph execution. To do that at 1000 tok/s, you'd need a bus that can deliver a sustained ~500 TiB/s. Even if you do a 32-wide MoE model, that's still about 15 TiB/s of bandwidth you'd need from the HBM to avoid stalls at 1000 tok/s.
It would seem like this either isn’t fp16 or this is indeed likely running completely out of SRAM.
Of course, Cerebras doesn't use a dense representation, so these memory numbers could be way off, and maybe it's an SRAM+DRAM combo.
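Spelling out the arithmetic above (same assumptions: dense fp16, one full weight read per generated token; MoE and on-chip reuse change this, as noted):

```python
# Back-of-the-envelope for the bandwidth numbers above.
params = 235e9
bytes_per_param = 2                              # fp16
tok_per_s = 1000

bytes_per_token = params * bytes_per_param       # ~470 GB streamed per token
dense_bw = bytes_per_token * tok_per_s / 2**40   # TiB/s needed to keep up
moe_bw = dense_bw / 32                           # if only ~1/32 of the weights are active

print(f"dense: ~{dense_bw:.0f} TiB/s, 1/32-active MoE: ~{moe_bw:.0f} TiB/s")
# -> dense: ~427 TiB/s, MoE: ~13 TiB/s, the same order as the figures above
```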
Because they are doing 1,500 tokens per second.
That's exactly how Graphcore's current chips work, and I wouldn't be surprised if that's how Cerebras's wafer works. It's probably even harder for Cerebras to use DRAM because each chip in the wafer is "landlocked" and doesn't have an easy way to access the outside world. You could go up or down, but down is used for power input and up is used for cooling.
You're right that it's not a good way to do things for memory-hungry models like LLMs, but all of these chips were designed before it became obvious that LLMs are where the money is. Graphcore's next chip (if they are even still working on it) can access a mountain of DRAM with very high bandwidth. I imagine Cerebras will be working on that too. I wouldn't be surprised if they abandon WSI entirely due to needing to use DRAM.
Checking: Anthropic charges $70 per 1 million output tokens. @1500 tokens per second that would be around 10 cents per second, or around $8k per day.
The $500k sounds about right then, unless I’m mistaken.
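The arithmetic, for anyone checking (list output price times sustained token rate, ignoring input tokens, batching, and idle time):

```python
# Checking the revenue estimate above.
price_per_output_token = 70 / 1_000_000          # $70 per 1M output tokens
tok_per_s = 1500

per_second = price_per_output_token * tok_per_s  # ~$0.105/s
per_day = per_second * 86_400                    # ~$9.1k/day at 100% utilization
print(f"${per_second:.3f}/s  ->  ${per_day:,.0f}/day")
# Same ballpark as the ~$8k figure above once you round to 10 cents/s.
```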
Well now I'm curious; how is a layer judged on its relative need for precision? I guess I still have a lot of learning to do w.r.t. how quantization is done. I was under the impression it was done once, statically, and produced a new giant GGUF blob or whatever format your weights are in. Does that assumption still hold true for the approach you're describing?
That on-chip SRAM is purely temporary working memory and does not need to hold the entire model weights. The Cerebras chip works on a sparse weight representation, streams the non-zero weights from their external memory server, and the cores work in a transport-triggered dataflow manner.
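A rough illustration of the "store/stream only the non-zeros" part, using a generic hand-rolled CSR matvec; this is just the sparsity idea, not Cerebras' actual dataflow scheme.

```python
# Generic sparse matvec: only non-zero weights are stored and moved.
import numpy as np

dim = 1024
W = np.random.randn(dim, dim).astype(np.float32)
W[np.abs(W) < 1.0] = 0.0                      # pretend ~68% of the weights were pruned

# CSR-ish storage: non-zero values plus their column indices, per row.
values, cols, row_ptr = [], [], [0]
for row in W:
    nz = np.nonzero(row)[0]
    values.extend(row[nz]); cols.extend(nz); row_ptr.append(len(values))
values, cols = np.array(values, np.float32), np.array(cols)

def sparse_matvec(x):
    y = np.zeros(dim, np.float32)
    for i in range(dim):                      # each row touches only its non-zeros
        s, e = row_ptr[i], row_ptr[i + 1]
        y[i] = values[s:e] @ x[cols[s:e]]
    return y

x = np.random.randn(dim).astype(np.float32)
print("density:", len(values) / dim**2)       # fraction of weights actually moved
print(np.allclose(sparse_matvec(x), W @ x, atol=1e-3))
```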
I'd think that HFT is already mature and doesn't really benefit from this type of model.
EDIT: online it seems TSMC prices are about $25K-30K per wafer. So even 10x-ing that, a wafer-scale system should be about $300K.
Definitely not hedge funds / quant funds.
You'd just buy a dgx
> For comparison, you can buy a DGX B200 with 8x B200 Blackwell chips and 1.4TB of memory for around $500k. Two systems would give you 2.8TB memory which is enough for this.
That would be enough to support a single user. If you want to host a service that provides this to 10k users in parallel your cost per user scales linearly with the GPU costs you posted. But we don't know how many users a comparable wafer-scale deployment can scale to (aside from the fact that the costs you posted for that are disputed by users down the thread as well), so your comparison is kind of meaningless in that way, you're missing data.
No. Magic of batching allows you to handle multiple user requests in parallel using the same weights with little VRAM overhead per user.
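A toy illustration of why the per-user overhead is small (sizes here are hypothetical): the weight matrix is read once per batch, while the per-user state is just activations (plus KV cache in a real transformer).

```python
import numpy as np

d_model, batch = 4096, 64
W = np.random.randn(d_model, d_model).astype(np.float16)   # shared weights
x = np.random.randn(batch, d_model).astype(np.float16)     # one row per user

# One weight pass through the matmul serves all 64 users at once.
y = x.astype(np.float32) @ W.astype(np.float32)

print(f"weights read once: {W.nbytes / 2**20:.0f} MiB")              # ~32 MiB per layer
print(f"per-user activations: {x.nbytes // batch / 2**10:.0f} KiB")  # ~8 KiB per user
```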
I don't know; I think we could be running models "as good as" Claude Opus 4, a few years down the line, with a lot less hardware — perhaps even going backwards, with "better" later models fitting on smaller, older — maybe even consumer-level — GPUs.
Why do I say this? Because I get the distinct impression that "throwing more parameters at the problem" is the current batch of AI companies' version of "setting money on fire to scale." These companies are likely leaving huge amounts of (almost-lossless) optimization on the table, in the name of having a model now that can be sold at huge expense to those few customers who really want it and are willing to pay (think: intelligence agencies automating real-time continuous analysis of the conversations of people-of-interest). Having these "sloppy but powerful" models, also enables the startups themselves to make use of them in expensive one-time batch-processing passes, to e.g. clean and pluck outliers from their training datasets with ever-better accuracy. (Think of this as the AI version of "ETL data migration logic doesn't need to be particularly optimized; what's the difference between it running for 6 vs 8 hours, if we're only ever going to run it once? May as well code it in a high-level scripting language.")
But there are only so many of these high-value customers to compete over, and only so intelligent these models need to get before achieving perfect accuracy on training-set data-cleaning tasks can be reduced to "mere" context engineering / agentic cross-validation. At some point, an inflection point will be passed where the marginal revenue to be earned from cost-reduced volume sales outweighs the marginal revenue to be earned from enterprise sales.
And at that point, we'll likely start to see a huge shift in in-industry research in how these models are being architected and optimized.
No longer would AI companies set their goal in a new model generation first as purely optimizing for intelligence on various leaderboards (ala the 1980s HPC race, motivated by serving many of the same enterprise customers!), and then, leaderboard score in hand, go back and re-optimize to make the intelligent model spit tokens faster when run on distributed backplanes (metric: tokens per watt-second).
But instead, AI companies would likely move to a combined optimization goal of training models from scratch to retain high-fidelity intelligent inference capabilities on lower-cost substrates — while minimizing work done [because that's what OEMs running local versions of their models want] and therefore minimizing "useless motion" of semantically-meaningless tokens. (Implied metric: bits of Shannon informational content generated per (byte-of-ram x GPU FLOP x second)).
Which is not enough to even pay the interest on one $3m chip.
What am I missing here?
What sort of latency do you think one would get with 8x B200 Blackwell chips? Do you think 1500 tokens/sec would be achievable in that setup?
If someone from Cerebras is reading this feel free to dm me as optimizing this power is what we do.
I think the gist of this thread is entirely: "please do the same for Qwen 3 coder", with us all hoping for:
a) A viable alternative to Sonnet 3
b) Specifically a faster and cheaper alternative
They are also, anecdotally, scarily censored. Asking it if anything "interesting has happened in Tiananmen Square?", then refining with "any notable protests?", and finally "maybe something to do with a tank"... all you get is vague allusions to the square being a beautiful place with a rich history.
Anyway, YOU'RE NOT SERVING FULL CONTEXT, CEREBRAS, YOU'RE SERVING HALF. Also, what quantization exactly is this? Can the customers know?
Edit: Looks like it did. They both introduced pay as you go, and have prepaid limits too at $1500. I wonder if they have any limitations on parallel execution for pay as you go...
Insane that Cerebras succeeded where everyone else failed for 5 decades.
Now, if something like Gemini 2.5 Pro or even Sonnet 4 could run on Cerebras, generating tens of thousands of tokens of code in a few seconds, that could really make a difference.
There's always going to be some latency in any compute architecture. Assume some insane billionaire cast the entire Qwen3-235B model into silicon, so it all ran in parallel, tokens going in one end, and the next token coming out the other end. This wafer (or likely, stack of interconnected wafers) would likely add up to a latency from end to end of 10 to 100 milliseconds.
If you then added pipelining, the latency might actually increase a millisecond or two, but the aggregate throughput would be multiplied by N, the number of pipeline stages.
If you could increase N to the point that the clock cycle were a nanosecond... what would the economic value of this thing be? 100,000 separate streams at 10,000 tokens per second, multiplexing through it.
If you change it from cast in silicon to a program that configures the silicon (like an FPGA, but far less clunky), I believe you get the future of LLM compute. Ever faster and wider lanes between compute and RAM are a dead end, a premature optimization.
Could anyone recommend a solution?
We don't know how/whether the Qwen3-235B served by Cerebras has been quantized.