GPT‑5.3‑Codex‑Spark

https://openai.com/index/introducing-gpt-5-3-codex-spark/
209•meetpateltech•1h ago

Comments

throwup238•1h ago
Your move, Anthropic.

(Yes I know they released /fast last week but I'm loving the constant one-upmanship)

dude250711•1h ago
They asked Google to cover them this time. They will owe them a reciprocal favour.
rvz•50m ago
ok. [0]

[0] https://www.anthropic.com/news/anthropic-raises-30-billion-s...

OsrsNeedsf2P•1h ago
No hint on pricing. I'm curious whether faster means more expensive, given the slight trade-off in accuracy.
sauwan•20m ago
It's either more expensive or dumber.
behnamoh•1h ago
In my opinion, they solved the wrong problem. The main issue I have with Codex is that the best model is insanely slow, except at nights and weekends when Silicon Valley goes to bed. I don't want a faster, smaller model (already have that with GLM and MiniMax). I want a faster, better model (at least as fast as Opus).

When they partnered with Cerebras, I kind of had a gut feeling that they wouldn't be able to use their technology for larger models because Cerebras doesn't have a track record of serving models larger than GLM.

It pains me that, five days before my Codex subscription ends, I have to switch to Anthropic: despite getting less quota than with Codex, at least I'll be able to use my quota _and_ stay in the flow.

But Codex's slowness aside, it's just not as good an "agentic" model as Opus. Here's what drove me crazy: https://x.com/OrganicGPT/status/2021462447341830582?s=20. The Codex model (gpt-5.3-xhigh) has no idea how to call agents, smh.

re-thc•1h ago
> In my opinion, they solved the wrong problem

> I don't want a faster, smaller model. I want a faster, better model

Will you pay 10x the price? They didn't solve the "wrong problem". They did what they could with the resources they have.

cjbarber•1h ago
> In my opinion, they solved the wrong problem. The main issue I have with Codex is that the best model is insanely slow, except at nights and weekends when Silicon Valley goes to bed. I don't want a faster, smaller model (already have that with GLM and MiniMax). I want a faster, better model (at least as fast as Opus).

It's entirely possible that this is the first step and that they will do faster, better models too.

behnamoh•1h ago
I doubt it; there's a limit on the model size that Cerebras tech can support, and GPT-5.3 is supposedly 1T+ parameters...
properbrew•1h ago
I was using a custom skill to spawn subagents, but it looks like the `/experimental` feature in codex-cli has the SubAgent setting (https://github.com/openai/codex/issues/2604#issuecomment-387...)
behnamoh•1h ago
Yes, I was using that. But the prompt given to the agents is not correct: Codex sends a prompt to the first agent, then sends a second prompt to the second agent, but the second prompt references the first prompt, which is completely wrong.
kachapopopow•1h ago
That's why I built oh-my-singularity (based on oh-my-pi - see the front page from can.ac): https://share.us-east-1.gotservers.com/v/EAqb7_Wt/cAlknb6xz0...

The video is pretty outdated now; this was a PoC. I'm working on a dependency-free version.

mudkipdev•1h ago
Off topic, but how is it always this HN user who shares model releases within a couple of minutes of their announcement?
sho_hn•1h ago
Maybe they set up an agent for it.
Squarex•49m ago
or a simple cron :)
casefields•46m ago
The account isn’t a normal user. They literally only post stuff like this. Their comments are just official links back to said announcements.
cjbarber•1h ago
For a bit, waiting for LLMs was like waiting for code to compile: https://xkcd.com/303/

> more than 1000 tokens per second

Perhaps no more?

(Not to mention, if you're waiting for one LLM, sometimes it makes sense to multi-table. I think Boris from Anthropic says he runs 5 CC instances in his terminal and another 5-10 in his browser on CC web.)

jryio•1h ago
This is interesting for offloading "tiered" workloads / running a priority queue with coding agents.

If 60% of the work is "edit this file with this content" or "refactor according to this abstraction", then low-latency, high token-rate inference seems like a needed improvement.

Recently someone made a Claude plugin to offload low-priority work to the Anthropic Batch API [1].

Also, I expect both Nvidia and Google to deploy custom silicon for inference [2].

1: https://github.com/s2-streamstore/claude-batch-toolkit/blob/...

2: https://www.tomshardware.com/tech-industry/semiconductors/nv...
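
For a sense of what that offload might look like, here is a minimal sketch against Anthropic's Message Batches API (not taken from the linked toolkit; the model name, prompt and custom_id are placeholders, and batch jobs can take up to 24h to complete):

  # Queue a low-priority edit as a batch job instead of an interactive agent turn.
  import anthropic

  client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

  batch = client.messages.batches.create(
      requests=[
          {
              "custom_id": "rename-label-x-to-y",  # placeholder id
              "params": {
                  "model": "claude-haiku-4-5",  # placeholder: whichever cheap model fits the task
                  "max_tokens": 2048,
                  "messages": [
                      {"role": "user", "content": "Rename label X to label Y in the attached diff."}
                  ],
              },
          }
      ]
  )

  # Poll later; results become available once processing_status is "ended".
  print(batch.id, batch.processing_status)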

dehugger•1h ago
I built something similar using an MCP server that allows Claude to "outsource" development to GLM 4.7 on Cerebras (or a different model, but GLM is what I use). The tool allows Claude to set the system prompt and instructions, specify the output file to write to, and crucially allows it to list which additional files (or subsections of files) should be included as context for the prompt.

I've had great success with it, and it rapidly speeds up development time at fairly minimal cost.
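
For anyone curious, a rough sketch of that kind of "outsource" tool (not the commenter's actual code), assuming the official MCP Python SDK and Cerebras's OpenAI-compatible endpoint; the model ID and tool signature are illustrative:

  # Sketch of an MCP tool that lets Claude hand a scoped task to a fast remote model.
  import os
  from pathlib import Path

  from mcp.server.fastmcp import FastMCP
  from openai import OpenAI

  mcp = FastMCP("outsource")
  cerebras = OpenAI(
      base_url="https://api.cerebras.ai/v1",
      api_key=os.environ["CEREBRAS_API_KEY"],
  )

  @mcp.tool()
  def outsource(system_prompt: str, instructions: str, output_file: str,
                context_files: list[str]) -> str:
      """Run instructions against the listed context files and write the result to output_file."""
      context = "\n\n".join(f"### {p}\n{Path(p).read_text()}" for p in context_files)
      response = cerebras.chat.completions.create(
          model="glm-4.7",  # assumed model ID on Cerebras
          messages=[
              {"role": "system", "content": system_prompt},
              {"role": "user", "content": f"{instructions}\n\n{context}"},
          ],
      )
      result = response.choices[0].message.content or ""
      Path(output_file).write_text(result)
      return f"Wrote {len(result)} characters to {output_file}"

  if __name__ == "__main__":
      mcp.run()  # stdio transport by default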

cheema33•1h ago
Why use MCP instead of an agent skill for something like this when MCP is typically context inefficient?
wahnfrieden•45m ago
Models haven't been trained enough on using skills yet, so they typically ignore them
andai•29m ago
Is that true? I had tool use working with GPT-4 in 2023, before function calling or structured outputs were even a thing. My tool instructions were only half a page though. Maybe the long prompts are causing problems?
zozbot234•1h ago
Note that Batch APIs have significantly higher latency than normal AI agent use. They're mostly intended for bulk work where turnaround time is not critical. Also, GPT "Codex" models (and most of the "Pro" models) are currently not available under OpenAI's own Batch API, so you would have to use non-agentic models for these tasks, and it's not clear how well they would cope.

(Overall, batches do have quite a bit of potential for agentic work as-is but you have to cope with them taking potentially up to 24h for just a single roundtrip with your local agent harness.)

pdeva1•1h ago
This seems closer to 5.1-mini and is tied to a Pro account. GLM 4.7 is available on-demand on Cerebras today [1] and performs better while costing less...

[1] https://www.cerebras.ai/blog/glm-4-7
ehzb2827•54m ago
GLM 4.7 scores 41.0% on Terminal Bench 2.0 [1] compared to 58.4% for GPT-5.3-Codex-Spark [2].

[1] https://z.ai/blog/glm-4.7

[2] https://openai.com/index/introducing-gpt-5-3-codex-spark/

cjbarber•1h ago
It'll be nice when there's smarter routing between models, or easier routing, so some things get sent to the fast model, some get sent to the cheap model, some get sent to the smart model, etc.
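A toy illustration of the kind of heuristic routing meant here (model names and thresholds are made up):

  # Route mechanical, small-scope tasks to a fast model; keep hard work on the frontier model.
  def pick_model(task: str, files_touched: int) -> str:
      mechanical = any(k in task.lower() for k in ("rename", "reformat", "move", "bump"))
      if mechanical and files_touched <= 5:
          return "fast-model"   # Spark-class: low latency, fine for rote edits
      if files_touched <= 2:
          return "cheap-model"  # small, inexpensive changes
      return "smart-model"      # slow frontier model for cross-cutting work

  print(pick_model("rename label X to label Y", files_touched=3))  # -> fast-model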
allisdust•1h ago
Normal Codex itself is subpar compared to Opus. This might be even worse.
antirez•1h ago
The search for speed is in vain. On hard enough problems, Claude Code with Opus 4.6 can often give the impression of acting fast without really making progress, because it lacks focus on what matters. Then you spin up the much slower GPT-5.3-Codex and it fixes everything in 3 minutes by doing the right thing.
mickeyp•1h ago
I disagree. This is great for bulk tasks: renaming, finding and searching for things, etc
jusgu•1h ago
Disagree. While intelligence is important, speed is especially important when productionizing AI. It's difficult to formalize the increase in user experience per increase in TPS, but it most definitely exists.
Aurornis•37m ago
I will always take more speed. My use of LLMs always comes back to doing something manually, from reviewing code to testing it to changing direction. The faster I can get the LLM part of the back-and-forth to complete, the more I can stay focused on my part.
alexhans•1h ago
When I saw Spark my mind went to Apache Spark and wondered if we were learning all the lessons in orchestration of driver/worker and data shuffling from that space.
jauntywundrkind•1h ago
Wasn't aware there was an effort to move to websockets. Is there any standards work for this, or is this just happening purely within the walled OpenAI garden?

> Under the hood, we streamlined how responses stream from client to server and back, rewrote key pieces of our inference stack, and reworked how sessions are initialized so that the first visible token appears sooner and Codex stays responsive as you iterate. Through the introduction of a persistent WebSocket connection and targeted optimizations inside of Responses API, we reduced overhead per client/server roundtrip by 80%, per-token overhead by 30%, and time-to-first-token by 50%. The WebSocket path is enabled for Codex-Spark by default and will become the default for all models soon.

kachapopopow•1h ago
Is this the first time one of the big 3 is using Cerebras? I've been waiting for this day...
arisAlexis•1h ago
They were afraid of the untested tech, but it looks like a leap in speed now.
rvz•54m ago
This is nonsense. What do you mean? Mistral uses Cerebras for their LLMs as well. [0]

It's certainly not "untested".

[0] https://www.cerebras.ai/blog/mistral-le-chat

lemming•39m ago
Tested at Mistral’s scale is a very different thing to tested at OpenAI’s scale.
rvz•13m ago
The scale at which it has been "tested" clearly convinced Meta (beyond OpenAI's scale) [0], Hugging Face [1], Perplexity [2], and unsurprisingly many others in the AI industry [3] that require more compute than GPUs can deliver.

So labelling it "untested", even at OpenAI's scale (which Meta as a customer clearly exceeds), is quite nonsensical and frankly an uninformed take.

[0] https://www.cerebras.ai/customer-spotlights/meta

[1] https://www.cerebras.ai/news/hugging-face-partners-with-cere...

[2] https://www.cerebras.ai/press-release/cerebras-powers-perple...

[3] https://www.cerebras.ai/customer-spotlights

deskithere•1h ago
Anyway, token eaters are upgrading their consumption capabilities.
nikkwong•1h ago
> Our latest frontier models have shown particular strengths in their ability to do long-running tasks, working autonomously for hours, days or weeks without intervention.

I have yet to see this (produce anything actually useful).

XCSme•1h ago
Their ability to burn through tokens non-stop for hours, days or weeks without intervention.
gamegoblin•1h ago
I routinely leave codex running for a few hours overnight to debug stuff

If you have a deterministic unit test that can reproduce the bug through your app's front door, but you have no idea how the bug is actually happening, it's an ideal use case for having a coding agent just grind through the slog of sticking debug prints everywhere, testing hypotheses, etc.
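
A minimal sketch of that overnight loop, assuming the Codex CLI's non-interactive `codex exec` mode; the test command and prompt are placeholders:

  # Keep pointing a fresh agent run at one deterministic failing test until it passes.
  import subprocess

  TEST = ["pytest", "tests/test_front_door.py::test_reproduces_bug", "-x"]  # placeholder test
  PROMPT = (
      "The test tests/test_front_door.py::test_reproduces_bug fails. Add temporary debug "
      "logging, test hypotheses, find the root cause, fix it, then remove the logging. "
      "Do not modify the test itself."
  )

  for attempt in range(20):  # hard cap so it can't grind forever
      if subprocess.run(TEST).returncode == 0:
          print(f"Test passing after {attempt} agent run(s)")
          break
      subprocess.run(["codex", "exec", PROMPT])  # one non-interactive agent run
  else:
      print("Still failing; review the transcripts in the morning")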

tsss•51m ago
How can you afford that?
wahnfrieden•49m ago
It costs $200 for a month
nikkwong•33m ago
I have a hard time understanding how that would work. For me, I typically interface with coding agents through Cursor. The flow is like this: ask it something -> it works for a minute or two -> I verify and fix by asking it again, etc., until we're at a happy place with the code. How do you get it to stop going down a bad path and never pulling itself out of it?

The important role for me, as a SWE, in this process is to verify that the code does what we actually want it to do. If you remove yourself from the process by letting it run on its own overnight, how does it know it's doing what you actually want it to do?

Or is it more like your use case: you can say "here's a failing test, do whatever you can to fix it and don't stop until you do". I could see that limited case working.

woah•17m ago
For some reason, setting up agents in a loop with a solid prompt and new context each iteration seems to result in higher-quality work on larger or more difficult tasks than the chat interface does. It's like the agent doesn't have to spend half its time trying to guess what you want.
p1esk•7m ago
“here's a failing test—do whatever you can to fix it”

Bad idea. It can modify the code so that the test passes, but everything else is now broken.

addaon•29m ago
> it's an ideal usecase

This is impressive, you’ve completely mitigated the risk of learning or understanding.

arcanemachiner•25m ago
Or, they have freed up time for more useful endeavours that might otherwise have been spent on drudgery.

I don't discount the value of blood, sweat and tears spent on debugging those hard issues, and the lessons learned from doing so, but there is a certain point where it's OK to take a pass and just let the robots figure it out.

simonw•56m ago
How hard have you tried?

I've been finding that the Opus 4.5/4.6 and GPT-5.2/5.3 models really have represented a step-change in how good they are at running long tasks.

I can one-shot prompt all sorts of useful coding challenges now that previously I would have expected to need multiple follow-ups to fix mistakes the agents made.

I got all of this from a single prompt, for example: https://github.com/simonw/research/tree/main/cysqlite-wasm-w... - including this demo page: https://simonw.github.io/research/cysqlite-wasm-wheel/demo.h... - using this single prompt: https://github.com/simonw/research/pull/79

aeyes•49m ago
What do you mean? The generated script just downloads the sources and runs pyodide: https://github.com/simonw/research/blob/main/cysqlite-wasm-w...

There are maybe 5 relevant lines in the script and nothing complex at all that would need to run for days.

simonw•23m ago
No, not for days - but it churned away on that one for about ten minutes.

I don't think I've got any examples of multi-hour or multi-day sessions that ran completely uninterrupted - this one back in December took 4.5 hours but I had to prompt it to keep going a few times along the way: https://simonwillison.net/2025/Dec/15/porting-justhtml/

andai•22m ago
Maybe so, but I did once spend 12 hours straight debugging an Emscripten C++ compiler bug! (After spending the first day of the jam setting up Emscripten, and the second day getting Raylib to compile in it. Had like an hour left to make the actual game, hahah.)

I am a bit thick with such things, but just wanted to provide the context that Emscripten can be a fickle beast :)

I sure am glad I can now deploy Infinite Mechanized Autistic Persistence to such soul-crushing tasks, and go make a sandwich or something.

(The bug turned out to be that if I included a boolean in a class member, the whole game crashed, but only the Emscripten version. Sad. Ended up switching back to JS, which you basically need anyway for most serious web game dev.)

basilgohar•48m ago
Can you share any examples of these one-shot prompts? I've not gotten to the point where I can get those kind of results yet.
simonw•7m ago
If you look through the commit logs on simonw/research and simonw/tools on GitHub most commits should either list the prompt, link to a PR with the prompt or link to a session transcript.
wahnfrieden•51m ago
It worked for me several times.

It's easy to say that these increasingly popular tools are only able to produce useless junk. You haven't tried, or you haven't "closed the loop" so that the agent can evaluate its own progress toward acceptance criteria, or you are monitoring incompetent feeds of other users.

nikkwong•27m ago
I'm definitely bullish on LLMs for coding. It sounds to me as though getting one to run on its own for hours and produce something usable requires more careful thought and setup than just throwing a prompt at it and wishing for the best, but I haven't seen many examples in the wild yet.
johnfn•23m ago
The other day I got Codex to one-shot an upgrade to Vite 8 at my day job (a real website with revenue). It worked on this for over 3 hours without intervention (I went to sleep). This is now in production.
bitwize•14m ago
PEBKAC
capevace•1h ago
Seems like the industry is moving further towards low-latency/high-speed models for direct interaction and slow, long-thinking models for longer tasks / deeper thinking.

Quick/Instant LLMs for human use (think UI). Slow, deep thinking LLMs for autonomous agents.

varispeed•1h ago
Are they really thinking or are they sprinkling them with Sleep(x)?
gaigalas•1h ago
You always want faster feedback. If not a human leveraging the fast cycles, another automated system (eg CI).

Slow, deep tasks are mostly for flashy one-shot demos that have little to no practical use in the real world.

wxw•1h ago
Great stuff. People are getting used to agents as the interface for everything, even work as simple as "change label X to label Y". More speed on that front is welcome. The Codex "blended mode" they refer to will be useful (similar to Claude Code bouncing between haiku and opus).

I imagine it's a win-win. This could significantly help their tokenomics.

The example showing a plan being generated instantaneously is interesting. Human understanding will end up as the last, true bottleneck.

beklein•58m ago
I love this! I use coding agents to generate web-based slide decks where "master slides" are just components, and we already have rules + assets to enforce corporate identity. With content + prompts, it's straightforward to generate a clean, predefined presentation.

What I'd really want on top is an "improv mode": during the talk, I can branch off based on audience questions or small wording changes, and the system proposes (say) 3 candidate next slides in real time. I pick one, present it, then smoothly merge back into the main deck. Example: if I mention a recent news article / study / paper, it automatically generates a slide that includes a screenshot + a QR code link to the source, then routes me back to the original storyline.

With realtime voice + realtime code generation, this could turn the boring old presenter view into something genuinely useful.
orochimaaru•48m ago
How do you handle the diagrams?
beklein•37m ago
In my AGENTS.md file I have a _rule_ that tells the model to use Apache ECharts; the data comes from the prompt, normally as .csv/.json files. A prompt would be like: "After slide 3 add a new content slide that shows a bar chart with data from @data/somefile.csv" ... works great, and these charts can even be interactive.
turnsout•43m ago
I love the idea of a living slide deck. This feels like a product that needs to exist!
sva_•33m ago
I love the probabilistic nature of this. Presentations could be anywhere from extremely impressive to hilariously embarrassing.
postalcoder•28m ago

  1. "One more thing."

  2. "It's not just a thing; it's one more thing."

  3. "<sparkleEmoji>One more thing."
esafak•18m ago
Can you show one?
rvz•57m ago
> Today, we’re releasing a research preview of GPT‑5.3-Codex-Spark, a smaller version of GPT‑5.3-Codex, and our first model designed for real-time coding. Codex-Spark marks the first milestone in our partnership with Cerebras, which we announced in January .

Nevermind. [0]

[0] https://news.ycombinator.com/item?id=35490837

nusl•47m ago
These graphs are really weird. One only shows the 30-60% range, with the model(s) close to 60%; the other goes up to 80%, but the top model is at 77%.
guessmyname•27m ago
Lying with charts → https://handsondataviz.org/how-to-lie-with-charts.html

Also → https://medium.com/@hypsypops/axes-of-evil-how-to-lie-with-g...

More → https://researchguides.library.yorku.ca/datavisualization/li...

And → https://vdl.sci.utah.edu/blog/2023/04/17/misleading/

anonzzzies•42m ago
Been using glm 4.7 for this with opencode. Works really well.
tsss•41m ago
Does anyone want this? Speed has never been the problem for me; in fact, higher latency means less work for me as a replaceable corporate employee. What I need is the most intelligence possible; I don't care if I have to wait a day for an answer if the answer is perfect. Small code edits, like the ones presented as the use case here, I can do much better myself than by trying to explain to some AI exactly what I want done.
pjs_•30m ago
Continue to believe that Cerebras is one of the most underrated companies of our time. It's a dinner-plate sized chip. It actually works. It's actually much faster than anything else for real workloads. Amazing
arcanemachiner•29m ago
Just wish they weren't so insanely expensive...
azinman2•21m ago
The bigger the chip, the worse the yield.
zozbot234•25m ago
It's "dinner-plate sized" because it's just a full silicon wafer. It's nice to see that wafer-scale integration is now being used for real work but it's been researched for decades.
femiagbabiaka•19m ago
yep
latchkey•18m ago
Not for what they are using it for. It is $1m+/chip and they can fit 1 of them in a rack. Rack space in DCs is a premium asset. The density isn't there. AI models need tons of memory (this product announcement is a case in point) and they don't have it, nor do they have a way to get it, since they are last in line at the fabs.

Their only chance is an acquihire, but Nvidia just spent $20b on Groq instead. Dead man walking.

p1esk•13m ago
The real question is what’s their perf/dollar vs nvidia?
xnx•12m ago
Or Google TPUs.
latchkey•10m ago
Exactly. They won't ever tell you. It is never published.

Let's not forget that the CEO is an SEC felon who got caught trying to pull a fast one.

zozbot234•4m ago
I guess it depends what you mean by "perf". If you optimize everything for the absolutely lowest latency given your power budget, your throughput is going to suck - and vice versa. Throughput is ultimately what matters when everything about AI is so clearly power-constrained, latency is a distraction. So TPU-like custom chips are likely the better choice.
spwa4•6m ago
Oh don't worry. Ever since the power issue started developing, rack space is no longer at a premium. Or at least, it's no longer the limiting factor. Power is.
latchkey•2m ago
The dirty secret is that there is plenty of power. But it isn't all in one place, and it is often stranded in DCs that can't do the density needed for AI compute.

Training models needs everything in one DC, inference doesn't.

xnx•9m ago
Cerebras is a bit of a stunt like "datacenters in spaaaaace".

Terrible yield: one defect can ruin a whole wafer instead of just a chip region. Poor perf./cost (see above). Difficult to program. Little space for RAM.

the_duke•5m ago
They claim the opposite, though, saying the chip is designed to tolerate many defects and work around them.
modeless•25m ago
Why are they obscuring the price? It must be outrageously expensive.
chaos_emergent•19m ago
I think it's a beta so they're trying to figure out pricing by deploying it.
cactusplant7374•16m ago
I was really hoping it would support codex xhigh first.
postalcoder•8m ago
I'm running gpt-5.3-codex-spark in Codex right now. It's insanely fast. I'm definitely going to be using this for pair-programmy type stuff.

It's also tearing up Bluey bench, which is where the agent has to generate transcripts of unlabeled Bluey episodes and rename them according to the episode descriptions (it's provided an MCP server for transcription and web search for episode descriptions).