When they partnered with Cerebras, I kind of had a gut feeling that they wouldn't be able to use their technology for larger models because Cerebras doesn't have a track record of serving models larger than GLM.
It pains me that five days before my Codex subscription ends, I have to switch to Anthropic because despite getting less quota compared to Codex, at least I'll be able to use my quota _and_ stay in the flow.
But even Codex's slowness aside, it's just not as good an "agentic" model as Opus. Here's what drove me crazy: https://x.com/OrganicGPT/status/2021462447341830582?s=20. The Codex model (gpt-5.3-xhigh) has no idea how to call agents smh
> I don't want a faster, smaller model. I want a faster, better model
Will you pay 10x the price? They didn't solve the "wrong problem". They did what they could with the resources they have.
It's entirely possible that this is just the first step and that they will do faster, better models too.
The video is pretty outdated now; this was a PoC. Working on a dependency-free version.
> more than 1000 tokens per second
Perhaps, no more?
(Not to mention, if you're waiting for one LLM, sometimes it makes sense to multi-table. I think Boris from Anthropic says he runs 5 CC instances in his terminal and another 5-10 in his browser on CC web.)
If 60% of the work is "edit this file with this content" or "refactor according to this abstraction", then low-latency, high-throughput inference seems like a needed improvement.
Recently someone made a Claude plugin to offload low-priority work to the Anthropic Batch API [1].
Also I expect both Nvidia and Google to deploy custom silicon for inference [2]
1: https://github.com/s2-streamstore/claude-batch-toolkit/blob/...
2: https://www.tomshardware.com/tech-industry/semiconductors/nv...
I've had great success with it, and it rapidly speeds up development time at fairly minimal cost.
(Overall, batches do have quite a bit of potential for agentic work as-is but you have to cope with them taking potentially up to 24h for just a single roundtrip with your local agent harness.)
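For anyone curious what that offloading looks like outside the plugin, here's a minimal sketch against the Anthropic Message Batches API via the Python SDK (the model name, prompt, and polling interval are placeholders; the plugin linked above handles the real wiring into Claude Code):

```python
# Sketch: push a low-priority task to the Anthropic Message Batches API and poll lazily.
import time
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

batch = client.messages.batches.create(
    requests=[{
        "custom_id": "docs-pass-1",
        "params": {
            "model": "claude-sonnet-4-5",   # placeholder; use whatever model your plan allows
            "max_tokens": 2048,
            "messages": [{"role": "user", "content": "Rewrite README.md for clarity."}],
        },
    }]
)

# Batch results can take up to 24h, so poll lazily instead of blocking an agent on it.
while client.messages.batches.retrieve(batch.id).processing_status != "ended":
    time.sleep(60)

for result in client.messages.batches.results(batch.id):
    print(result.custom_id, result.result.type)
```

That 24h worst case is exactly why this fits "low-priority" work: the agent fires the batch and moves on rather than waiting for a roundtrip.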
[1] https://z.ai/blog/glm-4.7 [2] https://openai.com/index/introducing-gpt-5-3-codex-spark/
> Under the hood, we streamlined how responses stream from client to server and back, rewrote key pieces of our inference stack, and reworked how sessions are initialized so that the first visible token appears sooner and Codex stays responsive as you iterate. Through the introduction of a persistent WebSocket connection and targeted optimizations inside of Responses API, we reduced overhead per client/server roundtrip by 80%, per-token overhead by 30%, and time-to-first-token by 50%. The WebSocket path is enabled for Codex-Spark by default and will become the default for all models soon.
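For intuition about why the persistent WebSocket helps, here's a rough sketch (not OpenAI's actual protocol; the endpoint URL and message shapes are made up) of streaming several turns over one connection instead of paying connection/TLS/auth overhead per request:

```python
# Sketch: one long-lived WebSocket for a whole session, so each turn is just frames,
# not a fresh HTTPS request. Endpoint and event schema below are hypothetical.
import asyncio
import json
import websockets  # pip install websockets

WS_URL = "wss://example.invalid/v1/agent"  # placeholder, not a real OpenAI endpoint

async def run_session(turns):
    async with websockets.connect(WS_URL) as ws:  # one handshake for the whole session
        for turn in turns:
            await ws.send(json.dumps({"type": "user_turn", "content": turn}))
            while True:  # stream tokens as they arrive; no per-turn reconnect cost
                event = json.loads(await ws.recv())
                if event.get("type") == "token":
                    print(event["text"], end="", flush=True)
                elif event.get("type") == "turn_done":
                    break

asyncio.run(run_session(["fix the failing test", "now add a changelog entry"]))
```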
It's certainly not "untested".
So labelling it "untested" even at OpenAI's scale, a scale which Meta as a customer clearly exceeds, is quite nonsensical and frankly an uninformed take.
[0] https://www.cerebras.ai/customer-spotlights/meta
[1] https://www.cerebras.ai/news/hugging-face-partners-with-cere...
[2] https://www.cerebras.ai/press-release/cerebras-powers-perple...
I have yet to see this (produce anything actually useful).
If you have a deterministic unit test that can reproduce the bug through your app's front door, but you have no idea how the bug is actually happening, having a coding agent just grind through the slog of sticking debug prints everywhere, testing hypotheses, etc. is an ideal use case.
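A minimal sketch of that kind of closed loop, assuming pytest and some non-interactive agent CLI (the `claude -p` invocation, test path, and retry count here are placeholders):

```python
# Sketch: keep handing the failing test output back to a coding agent until the
# deterministic reproduction passes. Agent command and test path are placeholders.
import subprocess

TEST = ["pytest", "-x", "tests/test_repro.py"]  # the deterministic reproduction
AGENT = ["claude", "-p"]                        # assumes a non-interactive agent CLI

for attempt in range(10):
    run = subprocess.run(TEST, capture_output=True, text=True)
    if run.returncode == 0:
        print(f"fixed after {attempt} attempt(s)")
        break
    prompt = (
        "This test must pass; it reproduces a real bug. Investigate with debug "
        "prints, form hypotheses, and fix the root cause (not the test):\n\n"
        + run.stdout[-4000:] + run.stderr[-4000:]
    )
    subprocess.run(AGENT + [prompt])            # let the agent grind on this attempt
else:
    print("still failing; time for a human")
```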
The important role for me, as a SWE, in the process is to verify that the code does what we actually want it to do. If you remove yourself from the process by letting it run on its own overnight, how does it know it's doing what you actually want it to do?
Or is it more like your use case: you can say "here's a failing test; do whatever you can to fix it and don't stop until you do". I could see that limited case working.
Bad idea. It can modify the code so that the test passes but everything else is now broken.
This is impressive, you’ve completely mitigated the risk of learning or understanding.
I don't discount the value of blood, sweat and tears spent on debugging those hard issues, and the lessons learned from doing so, but there is a certain point where it's OK to take a pass and just let the robots figure it out.
I've been finding that the Opus 4.5/4.6 and GPT-5.2/5.3 models really have represented a step-change in how good they are at running long tasks.
I can one-shot prompt all sorts of useful coding challenges now that previously I would have expected to need multiple follow-ups to fix mistakes the agents made.
I got all of this from a single prompt, for example: https://github.com/simonw/research/tree/main/cysqlite-wasm-w... - including this demo page: https://simonw.github.io/research/cysqlite-wasm-wheel/demo.h... - using this single prompt: https://github.com/simonw/research/pull/79
There are maybe 5 relevant lines in the script and nothing complex at all that would require it to run for days.
I don't think I've got any examples of multi-hour or multi-day sessions that ran completely uninterrupted - this one back in December took 4.5 hours but I had to prompt it to keep going a few times along the way: https://simonwillison.net/2025/Dec/15/porting-justhtml/
I am a bit thick with such things, but just wanted to provide the context that Emscripten can be a fickle beast :)
I sure am glad I can now deploy Infinite Mechanized Autistic Persistence to such soul-crushing tasks, and go make a sandwich or something.
(The bug turned out to be that if I included a boolean in a class member, the whole game crashed, but only the Emscripten version. Sad. Ended up switching back to JS, which you basically need anyway for most serious web game dev.)
It's easy to say that these increasingly popular tools are only able to produce useless junk. Either you haven't tried, or you haven't "closed the loop" so that the agent can evaluate its own progress toward acceptance criteria, or you are judging them by the feeds of other users who use them incompetently.
Quick/Instant LLMs for human use (think UI). Slow, deep thinking LLMs for autonomous agents.
Slow, deep tasks are mostly for flashy one-shot demos that have little to no practical use in the real world.
I imagine it's a win-win. This could significantly help their tokenomics.
The example showing a plan being generated instantaneously is interesting. Human understanding will end up as the last, true bottleneck.
1. "One more thing.
2. "It's not just a thing; it's one more thing."
3. "<sparkleEmoji>One more thing."Nevermind. [0]
Their only chance is an acquihire, but Nvidia just spent $20B on Groq instead. Dead man walking.
Let's not forget that the CEO is an SEC felon who got caught trying to pull a fast one.
Training models needs everything in one DC; inference doesn't.
Terrible yield: one defect can ruin a whole wafer instead of just a chip region. Poor perf./cost (see above). Difficult to program. Little space for RAM.
It's also tearing up Bluey bench, where the agent has to generate transcripts of unlabeled Bluey episodes and rename them according to the episode descriptions (it's provided an MCP server for transcription, plus web search for the episode descriptions).
(Yes I know they released /fast last week but I'm loving the constant one-upmanship)
[0] https://www.anthropic.com/news/anthropic-raises-30-billion-s...