GPT-5.5 Codex reasoning-token clustering may be leading to degraded performance

https://github.com/openai/codex/issues/30364

76•maille•1h ago

Comments

maille•1h ago

tldr:

The GPT-5.5 Codex model exhibits a clustering phenomenon in which reasoning_output_tokens counts pile up at fixed values spaced 518 apart.

Responses stuck at these fixed thresholds are strongly correlated with errors in complex tasks.

The observed phenomenon is specific to GPT-5.5; it is much less prevalent in GPT-5.4 and almost absent in GPT-5.2 and 5.3.
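
For anyone who wants to check their own logs, here is a minimal sketch of one way to look for this kind of clustering. It assumes you already have a list of reasoning_output_tokens values from your responses; the function name and the sample numbers are made up for illustration and are not taken from the issue. If the counts really sit at values spaced 518 apart, their remainders modulo 518 should concentrate on a single value, whereas unclustered counts would spread across many remainders.

```python
from collections import Counter

SPACING = 518  # spacing reported in the post above


def residue_concentration(token_counts, spacing=SPACING):
    """Return (most_common_residue, share) for the counts modulo spacing.

    A share close to 1.0 means most counts sit on one grid of values
    spaced `spacing` apart; a share near 1/spacing means no clustering.
    """
    if not token_counts:
        raise ValueError("token_counts must be non-empty")
    residues = Counter(n % spacing for n in token_counts)
    residue, hits = residues.most_common(1)[0]
    return residue, hits / len(token_counts)


# Hypothetical sample data, NOT real measurements: six values on a grid
# spaced 518 apart, plus two values that are off the grid.
sample = [516, 1034, 1552, 2070, 1034, 516, 733, 1290]
residue, share = residue_concentration(sample)
print(f"most common residue: {residue}, share of responses: {share:.2f}")
```

This only measures how concentrated the counts are; it says nothing about whether the clustered responses are the ones with errors, which would need per-response correctness labels.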

ProofHouse•56m ago

Personally, I would say very likely, to be honest. I gotta go through this a little more, but I actually use 5.5 codex an obscene amount, and I almost never use it for reasoning anymore. It's not even in the same galaxy as far as actually taking out the thinking and using GPT-5.5 or even Claude and then coming back and giving it the reasoning. Blah blah blah, it's the same model. Well, let me tell you, no, it's not, for several reasons, and the delta on intelligence is pretty staggering.

m101•54m ago

What?

benjiro29•53m ago

Care to explain what you mean by that?

dimitrios1•29m ago

I know that these types of comments are not really popular here, but this struck a chord with me because I feel the same. They aren't remotely close.

I have Codex right now purely because they gave me a month free of ChatGPT Pro, so I have been using it in between my usage resets with Claude. Since it's "free money" for me, I have been using it exclusively on xHigh.

One of my most frequent prompts is "hey codex worked on ____, but it didn't quite hit the mark, can we review the work..."

Yes, part of this is normal even within the same model: you have the highest-power model review the work for correctness, refactoring opportunities, and so on. But man, I tell you, I don't know what it is about Codex. This is obviously one guy's anecdote, but with the same prompting style, the same repository documentation à la MD files, and the same skills, I get way different results.

All that to say, maybe the bug report is on to something here, and it can be fixed.

kleton•30m ago

Clearly they are batching reasoning inference in a few multiples of 512 tokens as a throughput optimization.

zenapollo•21m ago

I’ve definitely experienced step drops in quality on an almost daily basis. I usually use xhigh. The experience of relying on Codex’s outstandingly thorough coding earlier in the year has evaporated for me. I’m seeing incredibly stupid implementations intermittently, and have simply switched to Claude until OpenAI takes the issue seriously. As far as I can tell, they haven’t taken it seriously in the several months I’ve been personally seeing it.

siva7•16m ago

I switched to Codex 3 months ago because Claude got incredibly stupid. 6 months ago it was the other way around. It doesn't matter if you use Codex or Claude. Both will fuck with you at some point. Though Codex probably less.

siva7•8m ago

I swear some days ago someone here claimed OpenAI succeeded in cutting their compute cost in half with a breakthrough optimization. So this is it?
