The difference here is that Cerebras does not appear to offer Qwen3-Coder through their API! So now there is a crazy fast (and apparently good too?) model that they only provide if you pay the crazy monthly sub?
it's two kilotokens per second. that's fast.
Certainly, somewhere between fast and crazy.
In other words, it's needlessly fast.
So maybe there's something useful to do with the extra speed. But it does seem more "useful" for vibe coding than for writing usable/good code.
The way I would use this $50 Cerebras offering is as a delegate for high-token-count items like documentation, lint fixing, and other operations: a way not only to speed up the workflow but also to release some back pressure on Anthropic/Claude so you don't hit your limits as quickly, especially with the new weekly throttle coming. This $50 jump seems very reasonable. Now, for the 1k completions a day, I'd really want to see and get a feel for how chatty it is.
I suppose that's how it starts, but if the model is competent and fast, the speed alone might push you to delegate more to it (maybe sub-agent tasks).
Still, definitely the right direction!
EDIT: doesn't seem like anything but a first-party api with a monthly plan.
It hits the requests-per-minute limit instantly, and then you wait a minute.
https://github.com/pchalasani/claude-code-tools?tab=readme-o...
I was excited, then I read this:
> Send up to 1,000 messages per day—enough for 3–4 hours of uninterrupted vibe coding.
I don't mind paying for services I use. But it's hard to take this seriously when the first-paragraph claim contradicts the fine print.
I understand it's a sales tactic. But it seems less than forthcoming, and it makes it hard for me to trust the rest of the claims.
Also, the weekly limit selling point is silly; it almost certainly only impacts those who are abusing the service, i.e. running it 24/7.
How do Claude's rate limits actually work?
I'm not a Pro/Max5/Max20 subscriber, only light API usage for Anthropic - so it's likely that I don't really understand the limits there.
For example, the community reports that Anthropic's message limit for Max 5 translates to roughly 88k tokens per 5-hour window (there's variance, but it's somewhere in the 80-120k ballpark based on system load; also assuming Sonnet, not Opus). A normal user probably won't consume more than 250k tokens per day with this subscription. That's like 5M tokens for a month of 20 active days, which doesn't justify the 100 USD subscription cost. This also doesn't square with Anthropic's statement that users can consume 10,000+ USD of usage on the Max 20 tier.
I'm clearly misunderstanding Claude's rate limits here. Can someone enlighten me? Is the 5-hour window somehow per task/instance and not per account?
Anyway, my personal experience on Max 20x is that, with Opus at least, on a busy day I could burn through between 150 and 200 million tokens using Claude Code for development work. Split that up into 5-hour windows, and assume I'm using 2 or 3 windows in a day; that still works out to a lot of tokens per window, well into the millions. So I'm not sure the Max 5x limit is really as small as 88k tokens per 5-hour window; maybe the apparent recent reductions in usage limits have dropped it to around that ballpark. Originally I saw Max 5x as a heavy-usage Sonnet plan, with Max 20x being a heavy-usage Opus plan. However, with the new additional weekly usage limit arriving on August 28th, I'd now see the plans as moderate-to-heavy-usage Sonnet for Max 5x, and heavy-usage Sonnet with multiple concurrent agents for Max 20x.
TL;DR: I strongly suspect that Claude subscription usage limits are based on some kind of internal credit value, perhaps in USD, not specifically on tokens, and which model you use determines how fast this "credit" is depleted.
The usage limits are currently for an account, based on a 5-hour window, from the first message that was sent in a new 5-hour window. From August 28th there's an additional weekly limit which looks like it will primarily make Opus usage restricted.
Cerebras is jumping on a marketing faux pas by Anthropic. I say this for the point you bring up about monthly session limits: no one on the Claude subreddit has yet reported being hit by this, despite many going way over that. These are checks to deal with abusive accounts.
Because it hasn't gone into effect yet: "From August 28, we’ll introduce new weekly limits that’ll mitigate these problems while impacting as few customers as possible." [0]
[0] https://xcancel.com/AnthropicAI/status/1949898514844307953#m
> How do you calculate messages per day? Actual number of messages per day depends on token usage per request. Estimates based on average requests of ~8k tokens each for a median user.
So it seems there is a token limit? But they're not clear about what exactly it is? I haven't tried to subscribe; just going by the publicly available information.
The anti-AI people would be pulling their pitchforks out against these people.
Would there be any way of compiling this without people's consent? Looking at GitHub public repos, etc.?
I imagine a future where we're all automatically profiled like this. Kind of like perverse employee tracking software.
OTOH, 5 hour limits are far superior to daily limits when both can be realistically hit.
It's harder to set up, lends itself to lower margins, and consumers generally do prefer more predictable/simpler pricing. But so many AI devtools products have pissed their users off by throttling their "unlimited"/plan-based pricing that I think it's now seen as a yellow flag.
The whole point of vibe coding is that you work faster than you would on your own. If you're reviewing the output carefully and understand how it works, you might as well have written it by hand.
"Don't be curmudgeonly."
You take your Cerebras Code endpoint and configure XYZ CLI tool or IDE plugin to point at it.
For in-editor use that's game changing.
Or is VS Code pretty good at this point? Or is there something better? Those are the only two ways I'd know how to actually consume this with any success.
Personally, I use code-companion on neovim.
Maybe not the best solution for vibe coders but for serious engineers using these tools for AI-assisted development, OpenAI API compatibility means total flexibility.
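For instance, a minimal sketch with the OpenAI Python SDK (the base URL here is Cerebras's documented OpenAI-compatible endpoint, and the model id is taken from elsewhere in this thread; check your dashboard for the exact values):

import os
from openai import OpenAI

# Point the standard OpenAI client at Cerebras's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.cerebras.ai/v1",  # assumed endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],
)

resp = client.chat.completions.create(
    model="qwen-3-coder-480b",  # model id as reported elsewhere in this thread
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
)
print(resp.choices[0].message.content)

The same base URL and key work in any tool that accepts a custom OpenAI-compatible provider.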
If they can maintain this pricing level, and if Qwen3-Coder is as good as people say, then they will have an enormous hit on their hands. A massive money-losing hit, but a hit.
Very interesting!
PS: Did they reduce the context window? It looks like it.
The $200/month is their "poor person" product for people who can't shell out $500k on one of their rigs.
But this will certainly be a money loser. They have likely been waiting for an open source model that somewhat conforms to their hardware's limitations and which gives acceptable recommendations.
It looks like they have found it with Qwen. We'll see!
https://www.lesswrong.com/posts/CCQsQnCMWhJcCFY9x/openai-los...
For the $200 plan, there's a 40M-token cap per day, so assuming API pricing, the max usage is $12/day, or $360 per month (assuming the user maxes out usage every day and doesn't hit the 1,000-message limit first).
That's relatively standard subscription pricing vs. API pricing; I believe they are making money from this and counting on people comparing it to Claude Code, which is a much more generous offer.
Claude and Gemini have similar offerings for a similar/same price, I thought. E.g., if Claude Code can do it for $200/m, why can't Cerebras?
(honest question, trying to understand the challenge for Cerebras that you're pointing to)
edit: Maybe it's the speed? 2k tokens/s sounds... fast, much faster than Claude. Is that what you're referring to?
Cerebras uses the entire 12-inch wafer and builds in redundancy so that, at current defect rates, a large fraction of wafers are usable. This allows a huge level of parallelism, a large amount of on-wafer memory, and removes the need to move data on and off the wafer. So the available bandwidth is insane, and inference is mostly bandwidth-limited.
(I would've just said, "the throughput is fantastic, but the latency is about 3 times higher than other offerings".)
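A back-of-envelope sketch of why bandwidth dominates (all figures below are illustrative assumptions, not vendor specs): for decoding, every active parameter has to stream through the compute units once per generated token, so single-stream speed is roughly bandwidth divided by model bytes.

# Rough upper bound on decode speed for a bandwidth-limited accelerator.
# All figures are illustrative assumptions, not Cerebras specifications.
params_active = 35e9       # assumed active params per token (MoE-style model)
bytes_per_param = 2        # bf16 weights
bandwidth_bytes_s = 21e15  # assumed aggregate on-wafer SRAM bandwidth

tokens_per_second = bandwidth_bytes_s / (params_active * bytes_per_param)
print(f"~{tokens_per_second:,.0f} tokens/s theoretical ceiling")  # ~300,000

Real throughput lands far below that ceiling, but the direction holds: more bandwidth, more tokens per second.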
brew install boldsoftware/tap/sketch
CEREBRAS_API_KEY=...
sketch --model=qwen3-coder-cerebras -skaband-addr=
Our experience is that it seems overloaded right now, to the point where we have better results with our usual hosted version: sketch --model=qwen

They shat the bed. They went for super crazy fast compute and not much memory, assuming that models would plateau at a few billion parameters.
Last year, 70B parameters was considered huge and a good place to standardize around.
Today we have 1T-parameter models, and we know performance still scales linearly with parameters.
So next year we might have 10T-parameter LLMs, and these guys will still be playing catch-up.
All that matters for inference right now is how many HBM chips you can stack, and that's it.
[0]: https://xcancel.com/CerebrasSystems/status/19513503371867015...
Roo Code support added in v3.25.5: https://github.com/RooCodeInc/Roo-Code/releases/tag/v3.25.5
Cerebras has also been added as a provider for Qwen 3 Coder in OpenRouter: https://openrouter.ai/qwen/qwen3-coder?sort=throughput
The quality is also not quite what Claude Code gave me, but the speed is definitely way faster. If Cerebras supported caching and reduced token pricing for cache hits, I think I would run this more, but right now it's too expensive per agent run.
> Actual number of messages per day depends on token usage per request. Estimates based on average requests of ~8k tokens each for a median user.
https://cerebras-inference.help.usepylon.com/articles/346886...
It was adopted because trying to generate diffs with AI opens a whole new can of worms, but there's a very efficient approach in between: slice the files at the symbol level.
So if the AI only needs the declaration of foo() and the definition of bar(), the entire file can be collapsed like this:
class MyClass {
void foo();
void bar() {
//code
}
}
Any AI-suggested changes are then easy to merge back (renamings are the only notable exception), so it works really fast.

I am currently working on an editor that combines this approach with the ability to step back and forth between edits, and it works really well. I absolutely love the Cerebras platform (they have a free tier directly and a pay-as-you-go offering via OpenRouter). It can get very annoying refactorings done in one or two seconds from single-sentence prompts, and it usually costs about half a cent per refactoring in tokens. It's also great for things like applying known algorithms to spread-out data structures, where including all the files would kill the context window, but pulling individual types works just fine with a fraction of the tokens.
If you don't mind the shameless plug, there's more explanation of how it works here: https://sysprogs.com/CodeVROOM/documentation/concepts/symbol...
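For a rough illustration of the idea (a hypothetical Python sketch, not the tool's actual implementation), collapsing a file to signatures while keeping only the requested symbol bodies could look like this:

import ast

# Hypothetical symbol-level slicing: keep full bodies only for symbols
# the model asked for; collapse everything else to its signature line.
def collapse(source: str, keep: set[str]) -> str:
    tree = ast.parse(source)
    lines = source.splitlines()
    out = []
    for node in tree.body:
        is_def = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        if is_def and node.name not in keep:
            out.append(lines[node.lineno - 1] + "  # ...body elided...")
        else:
            out.extend(lines[node.lineno - 1 : node.end_lineno])
    return "\n".join(out)

src = open("my_module.py").read()   # hypothetical input file
print(collapse(src, keep={"bar"}))  # foo() collapses, bar() keeps its body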
This approach saves tokens in theory, but I find it can lead to waste as the model tries to figure out why things aren't working, when loading the full file would have solved the problem in a single step.
What works for me (adding features to huge interconnected projects) is to think about what classes, algorithms and interfaces I want to add, and then give very brief prompts like "split class into abstract base + child like this" and "add another child supporting x, y and z".
So, I still make all the key decisions myself, but I get to skip typing the most annoying and repetitive parts. Also, the code doesn't look much different from what I could have written by hand; it just gets done about 5x faster.
I tried copy-pasting all the relevant parts into ChatGPT and gave it instructions like "add support for X to Y, similar to Z", and it got it pretty much right each time. The bottleneck was really pasting things into the context window and merging the changes back. So I made a GUI that automated it: it showed links on top of functions/classes to quickly attach them to the context window, either as declarations only or as editable chunks.
That worked faster, but navigating to definitions and manually clicking on them still looked like an unnecessary step. But if you asked the model "hey, don't follow these instructions yet, just tell me which symbols you need to complete them", it would give reasonable machine-readable results. And then it's easy to look those up at the symbol level and do the actual edit with them.
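That two-pass flow could look roughly like this (a hedged sketch; the prompts, model id, JSON shape, and the lookup_symbol helper are all assumptions, not the actual protocol):

import json
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint
instruction = "Add support for X to Y, similar to Z."

# Pass 1: ask which symbols are needed, not for the edit itself.
plan = client.chat.completions.create(
    model="qwen-3-coder-480b",  # assumed model id
    messages=[{
        "role": "user",
        "content": f"Don't follow these instructions yet: {instruction}\n"
                   'Reply with JSON like {"symbols": ["name", ...]} listing '
                   "the declarations/definitions you need to complete them.",
    }],
)
needed = json.loads(plan.choices[0].message.content)["symbols"]

# Pass 2: attach just those symbols and ask for the actual edit.
# lookup_symbol() is a hypothetical symbol-level index lookup.
context = "\n\n".join(lookup_symbol(name) for name in needed)
edit = client.chat.completions.create(
    model="qwen-3-coder-480b",
    messages=[{"role": "user", "content": f"{context}\n\n{instruction}"}],
)
print(edit.choices[0].message.content)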
It doesn't do magic, but it takes most of the effort out of getting the first draft of the edit, which you can then verify, tweak, and step through in a debugger.
The API price is not a reason to reject the subscription price.
In fact, it seems obvious that you should use the flat-fee model instead.
Any attempt to deal with "<think>" in the code gets it replaced with "<tool_call>".
Both in the inference.cerebras.ai chat and via the API.
The same model on chat.qwen.ai doesn't do it.
>While they advertise a 1,000-request limit, the actual daily constraint is a 7.5 million-token limit. [1]
That assumes an average of 7.5k tokens per request (1,000 requests × 7.5k = 7.5M), whereas in their marketing videos they show API requests ballooning by ~24k tokens per request. Still cheaper than the API price.
[1] https://old.reddit.com/r/LocalLLaMA/comments/1mfeazc/cerebra...
I subscribed to the $50 plan. It's super fast for sure, but rate limits kick in after just a couple of requests, completely defeating the point of fast responses.
Did I miss something?
Who is the intended audience for Cerebras?
No weekly limits so far. Just you wait: if you get the same traction as Claude or more, you're going to follow the same playbook they did.
> Yeah I filed a ticket with Cursor
> They have problems with OpenAI customization
One workaround we're doing now that seems to work is to use Claude for all tasks but delegate specific tools, backed by the cerebras/qwen-3-coder-480b model, to generate files and handle other token-heavy tasks, so we avoid spiking the total number of requests. This has cost and latency consequences (and adds complexity to the code), but until those throttle limits are lifted it seems to be a good combo. I also find that Claude has better quality with tool selection when the number of tools required is > 15, which our current setup has.
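A minimal sketch of that delegation pattern (the threshold, model ids, and client setup are assumptions, not the commenter's actual code):

from openai import OpenAI

orchestrator = OpenAI()  # placeholder client for the Claude-side orchestrator
cerebras = OpenAI(base_url="https://api.cerebras.ai/v1")  # assumed endpoint

HEAVY_TOKEN_THRESHOLD = 4_000  # arbitrary cutoff for "token-heavy" work

# Route bulk generation (docs, big files, lint fixes) to the fast
# Cerebras-hosted model; keep everything else on the primary model.
def run_task(prompt: str, estimated_output_tokens: int) -> str:
    if estimated_output_tokens > HEAVY_TOKEN_THRESHOLD:
        client, model = cerebras, "qwen-3-coder-480b"
    else:
        client, model = orchestrator, "claude-sonnet"  # placeholder id
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content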
I think a lot more companies will follow suit and the competition will make pricing much better for the end user.
congrats on the launch Cerebras team!