When they partnered with Cerebras, I kind of had a gut feeling that they wouldn't be able to use their technology for larger models because Cerebras doesn't have a track record of serving models larger than GLM.
It pains me that five days before my Codex subscription ends, I have to switch to Anthropic because despite getting less quota compared to Codex, at least I'll be able to use my quota _and_ stay in the flow.
But even Codex's slowness aside, it's just not as good an "agentic" model as Opus. Here's what drove me crazy: https://x.com/OrganicGPT/status/2021462447341830582?s=20. The Codex model (gpt-5.3-xhigh) has no idea how to call agents, smh.
> I don't want a faster, smaller model. I want a faster, better model
Will you pay 10x the price? They didn't solve the "wrong problem". They did what they could with the resources they have.
It's entirely possible that this is the first step and that they will do faster, better models too.
The video is pretty outdated now; this was a PoC. I'm working on a dependency-free version.
> more than 1000 tokens per second
Perhaps, no more?
(Not to mention, if you're waiting for one LLM, sometimes it makes sense to multi-table. I think Boris from Anthropic says he runs 5 CC instances in his terminal and another 5-10 in his browser on CC web.)
If 60% of the work is "edit this file with this content" or "refactor according to this abstraction", then low-latency, high-throughput inference seems like a needed improvement.
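A rough way to put a number on that is Amdahl's law. This is only a sketch: it assumes the 60% is a share of wall-clock time spent waiting on the model, and the 10x speedup factor is made up for illustration.

```python
def overall_speedup(fraction: float, factor: float) -> float:
    """Amdahl's law: speedup of the whole job when `fraction` of it runs `factor` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# 0.6 is the share from the comment above; 10.0 is a hypothetical speedup.
print(f"{overall_speedup(0.6, 10.0):.2f}x")
```

Under those assumptions the whole session gets about 2.17x faster, and no matter how fast the model gets, the ceiling is 2.5x because the other 40% is untouched.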
Recently someone made a Claude plugin to offload low-priority work to the Anthropic Batch API [1].
Also I expect both Nvidia and Google to deploy custom silicon for inference [2]
[1] https://github.com/s2-streamstore/claude-batch-toolkit/blob/...
[2] https://www.tomshardware.com/tech-industry/semiconductors/nv...
I've had great success with it, and it speeds up development considerably at fairly minimal cost.
(Overall, batches do have quite a bit of potential for agentic work as-is but you have to cope with them taking potentially up to 24h for just a single roundtrip with your local agent harness.)
[1] https://z.ai/blog/glm-4.7
[2] https://openai.com/index/introducing-gpt-5-3-codex-spark/
> Under the hood, we streamlined how responses stream from client to server and back, rewrote key pieces of our inference stack, and reworked how sessions are initialized so that the first visible token appears sooner and Codex stays responsive as you iterate. Through the introduction of a persistent WebSocket connection and targeted optimizations inside of Responses API, we reduced overhead per client/server roundtrip by 80%, per-token overhead by 30%, and time-to-first-token by 50%. The WebSocket path is enabled for Codex-Spark by default and will become the default for all models soon.
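To get a feel for what those three percentages add up to, here is a toy model. It is a sketch, not anything from OpenAI: the `Overheads` fields, the baseline numbers, and the assumption that every roundtrip pays both the time-to-first-token and the roundtrip overhead are all hypothetical, and model compute time is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Overheads:
    ttft_ms: float           # time to first token, paid once per roundtrip (assumption)
    per_roundtrip_ms: float  # client/server overhead per roundtrip
    per_token_ms: float      # overhead added per streamed token


def apply_quoted_reductions(base: Overheads) -> Overheads:
    """Apply the quoted reductions: 50% TTFT, 80% per roundtrip, 30% per token."""
    return Overheads(
        ttft_ms=base.ttft_ms * (1 - 0.50),
        per_roundtrip_ms=base.per_roundtrip_ms * (1 - 0.80),
        per_token_ms=base.per_token_ms * (1 - 0.30),
    )


def session_overhead_ms(o: Overheads, roundtrips: int, tokens: int) -> float:
    """Total overhead for a session under the toy model described above."""
    return roundtrips * (o.ttft_ms + o.per_roundtrip_ms) + tokens * o.per_token_ms


# Hypothetical baseline, picked only to make the arithmetic easy to follow.
baseline = Overheads(ttft_ms=400.0, per_roundtrip_ms=100.0, per_token_ms=1.0)
improved = apply_quoted_reductions(baseline)
print(session_overhead_ms(baseline, roundtrips=20, tokens=2000))
print(session_overhead_ms(improved, roundtrips=20, tokens=2000))
```

With these made-up inputs the overhead roughly halves. The more roundtrips an agent session makes, the more the per-roundtrip and TTFT cuts dominate over the per-token one.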
It's certainly not "untested".
So labelling it "untested" even at Meta's scale as a customer (which exceeds OpenAI's scale) is quite nonsensical and frankly an uninformed take.
[0] https://www.cerebras.ai/customer-spotlights/meta
[1] https://www.cerebras.ai/news/hugging-face-partners-with-cere...
[2] https://www.cerebras.ai/press-release/cerebras-powers-perple...
I have yet to see this (produce anything actually useful).
Anthropic is actually sort of concerned with not burning through cash and charging people a reasonable price. OpenAI doesn’t care. I can use Codex CLI all day and not approach any quotas with just my $20 a month ChatGPT subscription.
I treat coding agents like junior developers and never take my hand off the wheel except for boilerplate refactoring.
If you have a deterministic unit test that can reproduce the bug through your app's front door, but you have no idea how the bug is actually happening, then having a coding agent grind through the slog of sticking debug prints everywhere, testing hypotheses, etc. is an ideal use case.
The important role for me, as a SWE, in the process is to verify that the code does what we actually want it to do. If you remove yourself from the process by letting it run on its own overnight, how does it know it's doing what you actually want it to do?
Or is it more like your use case, where you can say "here's a failing test: do whatever you can to fix it and don't stop until you do"? I could see that limited case working.
Bad idea. It can modify the code so that the test passes while everything else is now broken.
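One way to guard against that failure mode is to gate acceptance on the whole suite, not just the target test. A minimal sketch, assuming you can collect per-test pass/fail results before and after the agent's change (the `accept_fix` name and the dict shape are hypothetical):

```python
def accept_fix(before: dict[str, bool], after: dict[str, bool], target: str) -> bool:
    """Accept only if the target test now passes and nothing that passed before has regressed."""
    if not after.get(target, False):
        return False
    regressions = [
        name
        for name, passed in before.items()
        # A test that passed before and is now failing or missing counts as a regression.
        if passed and not after.get(name, False)
    ]
    return not regressions
```

Treating a missing test as a regression matters here, since deleting an inconvenient test is one of the ways an agent can make a suite go green.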
This is impressive, you’ve completely mitigated the risk of learning or understanding.
I don't discount the value of blood, sweat and tears spent on debugging those hard issues, and the lessons learned from doing so, but there is a certain point where it's OK to take a pass and just let the robots figure it out.
I've been finding that the Opus 4.5/4.6 and GPT-5.2/5.3 models really have represented a step-change in how good they are at running long tasks.
I can one-shot prompt all sorts of useful coding challenges now that previously I would have expected to need multiple follow-ups to fix mistakes the agents made.
I got all of this from a single prompt, for example: https://github.com/simonw/research/tree/main/cysqlite-wasm-w... - including this demo page: https://simonw.github.io/research/cysqlite-wasm-wheel/demo.h... - using this single prompt: https://github.com/simonw/research/pull/79
There are maybe 5 relevant lines in the script, and nothing complex at all that would require it to run for days.
I don't think I've got any examples of multi-hour or multi-day sessions that ran completely uninterrupted - this one back in December took 4.5 hours but I had to prompt it to keep going a few times along the way: https://simonwillison.net/2025/Dec/15/porting-justhtml/
I am a bit thick with such things, but just wanted to provide the context that Emscripten can be a fickle beast :)
I sure am glad I can now deploy Infinite Mechanized Autistic Persistence to such soul-crushing tasks, and go make a sandwich or something.
(The bug turned out to be that if I included a boolean in a class member, the whole game crashed, but only the Emscripten version. Sad. Ended up switching back to JS, which you basically need anyway for most serious web game dev.)
It's easy to say that these increasingly popular tools are only able to produce useless junk. Either you haven't tried, or you haven't "closed the loop" so that the agent can evaluate its own progress toward acceptance criteria, or you are monitoring the feeds of incompetent users.
Quick/Instant LLMs for human use (think UI). Slow, deep thinking LLMs for autonomous agents.
Slow, deep tasks are mostly for flashy one-shot demos that have little to no practical use in the real world.
I imagine it's a win-win. This could significantly help their tokenomics.
The example showing a plan being generated instantaneously is interesting. Human understanding will end up as the last, true bottleneck.
These are the bane of any staff engineer's life - lol. Because people above need to know a plan in art form.
So I'm seriously interested in how I can make it easier.
1. "One more thing."
2. "It's not just a thing; it's one more thing."
3. "<sparkleEmoji>One more thing."
Nevermind. [0]
Yields on silicon are great, but not perfect
Their only chance is an acquihire, but Nvidia just spent $20b on Groq instead. Dead man walking.
Compare the photos of a Cerebras deployment to a TPU deployment.
https://www.nextplatform.com/wp-content/uploads/2023/07/cere...
https://assets.bwbx.io/images/users/iqjWHBFdfxIU/iOLs2FEQxQv...
The difference is striking.
Let's not forget that the CEO is an SEC felon who got caught trying to pull a fast one.
Training models needs everything in one DC, inference doesn't.
Terrible yield: one defect can ruin a whole wafer instead of just a chip region. Poor perf./cost (see above). Difficult to program. Little space for RAM.
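The yield part of that argument can be sketched with the standard Poisson yield model, Y = exp(-A * D). The die areas and defect density below are made-up round numbers, and the model assumes a single defect kills the whole die, so it says nothing about designs that use redundancy to route around defects.

```python
import math


def poisson_yield(area_mm2: float, defects_per_mm2: float) -> float:
    """Probability that a die of the given area has zero defects (Poisson model)."""
    if area_mm2 < 0 or defects_per_mm2 < 0:
        raise ValueError("area and defect density must be non-negative")
    return math.exp(-area_mm2 * defects_per_mm2)


# Hypothetical inputs for illustration only.
DEFECT_DENSITY = 0.001   # defects per mm^2
RETICLE_SIZED_DIE = 800.0
WAFER_SCALE_DIE = 46_000.0

print(f"reticle-sized die: {poisson_yield(RETICLE_SIZED_DIE, DEFECT_DENSITY):.3f}")
print(f"wafer-scale die:   {poisson_yield(WAFER_SCALE_DIE, DEFECT_DENSITY):.3e}")
```

Under the zero-defect assumption the wafer-scale number is effectively zero, which is why the claim only holds if defects can't be tolerated; whether that is true for a given design is the real question.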
Google is crushing them on inference. By TPUv9, they could be 4x cheaper (even if Nvidia cuts their margins from 75% to 40%).
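The margin half of that claim is easy to check with arithmetic. A small sketch, assuming "margin" means gross margin (price minus cost, divided by price) and that unit cost stays fixed; the cost of 1.0 is an arbitrary unit:

```python
def price_for_margin(cost: float, gross_margin: float) -> float:
    """Selling price that yields the given gross margin, where margin = (price - cost) / price."""
    if not 0.0 <= gross_margin < 1.0:
        raise ValueError("gross_margin must be in [0, 1)")
    return cost / (1.0 - gross_margin)


unit_cost = 1.0  # arbitrary unit, for illustration
price_at_75 = price_for_margin(unit_cost, 0.75)
price_at_40 = price_for_margin(unit_cost, 0.40)
print(f"price drop from margin cut alone: {price_at_75 / price_at_40:.1f}x")
```

Going from a 75% to a 40% margin cuts the price by 2.4x at the same cost, so the "4x cheaper even then" claim implies a large cost advantage on top of the margin gap.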
Cerebras will be substantially better for agentic workflows in terms of speed.
And if you don't care as much about speed and only cost, Google will still crush Nvidia.
And Nvidia won't be cheaper for training new models either. The vast majority of chips will be used for inference by 2028 instead of training anyway.
Nvidia has no manufacturing reliability story. Anyone can buy TSMC's output.
What am I missing? I don't understand how Nvidia could've been so far ahead and just let every part of the market slip away.
This is blazing fast but it definitely has a small model feel.
It's tearing up Bluey bench (my personal speed benchmark), which is a file system benchmark where I have the agent generate transcripts of all the episodes of a season of Bluey, perform a web search to find the episode descriptions, and then match the transcripts against the descriptions to generate the file names and metadata. It's unbelievably fast (did it all in ~2 minutes), but it also has to be prompted to do actions in my AGENTS.md that the larger models adhere to without additional prompting.
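For a sense of what the matching step involves, here is a minimal sketch using plain word overlap (Jaccard similarity). This is not how the benchmark or the agent does it; the function names and the placeholder episode data are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set; crude, but enough for an overlap score."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection over size of the union; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_transcripts(transcripts: dict[str, str], descriptions: dict[str, str]) -> dict[str, str]:
    """Map each transcript id to the description title with the highest word overlap."""
    desc_tokens = {title: tokens(text) for title, text in descriptions.items()}
    matches = {}
    for transcript_id, text in transcripts.items():
        words = tokens(text)
        matches[transcript_id] = max(
            desc_tokens, key=lambda title: jaccard(words, desc_tokens[title])
        )
    return matches


# Placeholder data, not real episode content.
descriptions = {
    "Episode A": "The kids build a fort out of pillows",
    "Episode B": "A trip to the beach to find a lost ball",
}
transcripts = {
    "clip1": "Let's build a big fort with all the pillows",
    "clip2": "Where is the ball I lost it at the beach",
}
print(match_transcripts(transcripts, descriptions))
```

Each transcript is matched greedily and independently, so two transcripts can land on the same title; a one-to-one assignment would need an extra step.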
throwup238•2h ago
(Yes, I know they released /fast last week, but I’m loving the constant one-upmanship)
dude250711•1h ago
rvz•1h ago
[0] https://www.anthropic.com/news/anthropic-raises-30-billion-s...
bearjaws•7m ago
Last night it got stuck in a loop (in plan mode, I use vanilla CC) and burnt through $22 in 15 minutes.