GPT-5.5 Codex reasoning-token clustering may be leading to degraded performance

https://github.com/openai/codex/issues/30364

76•maille•1h ago

Comments

maille•1h ago

tldr:

The GPT-5.5 Codex model exhibits a clustering phenomenon in which reasoning_output_tokens values pile up at fixed thresholds spaced 518 apart.

Responses stuck at these fixed thresholds are strongly correlated with errors on complex tasks.

The phenomenon is specific to GPT-5.5; it is much less prevalent in GPT-5.4 and almost absent in GPT-5.2 and 5.3.
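For anyone who wants to check their own logs, here is a minimal sketch of how the clustering could be measured. It assumes you already have a list of reasoning_output_tokens values per response; the function name and the sample numbers are made up for illustration and are not taken from the issue. If the counts really sit on thresholds spaced 518 apart, most of them should share the same remainder modulo 518.

```python
from collections import Counter

SPACING = 518  # spacing reported in the tldr above


def residue_concentration(token_counts, spacing=SPACING):
    """Return (most common remainder mod spacing, share of samples with it)."""
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    residues = Counter(n % spacing for n in token_counts)
    residue, hits = residues.most_common(1)[0]
    return residue, hits / len(token_counts)


# Made-up sample values, NOT data from the issue.
sample = [516, 1034, 1552, 2070, 733, 1210, 1034, 516]
residue, share = residue_concentration(sample)
print(residue, share)
```

With evenly spread token counts the top share should be roughly 1/518, so a share far above that is the signal worth looking at.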

ProofHouse•57m ago

Personally, I would say very likely, to be honest. I gotta go through this a little more, but I actually use 5.5 codex an obscene amount, and I almost never use it for reasoning anymore. It's not even in the same galaxy as far as actually taking out the thinking and using GPT-5.5 or even Claude and then coming back and giving it the reasoning. Blah blah blah, it's the same model. Well, let me tell you, no, it's not, for several reasons, and the delta on intelligence is pretty staggering.

m101•54m ago

What?

benjiro29•53m ago

Care to explain what you mean by that?

dimitrios1•29m ago

I know that these types of comments are not really popular here, but this struck a chord with me because I feel the same. They aren't remotely close.

I have codex right now purely because they gave me a month free of ChatGPT Pro, so I have been using it in between my usage resets with claude. Since it's "free money" for me I have been using it exclusively on xHigh.

One of my most frequent prompts is "hey codex worked on ____, but it didn't quite hit the mark, can we review the work..."

Yes, part of this is normal even within the same model: you have the highest-power model review the work for correctness, refactoring opportunities, and so on. But man, I tell you, I don't know what it is about codex. This is obviously one guy's anecdote, but with the same prompting style, the same repository documentation à la MD files, and the same skills, I get way different results.

All that to say, maybe the bug report is on to something here, and it can be fixed.

kleton•30m ago

Clearly they are batching reasoning inference in a few multiples of 512 tokens as a throughput optimization.
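One way to sanity-check the 512 guess against the 518 spacing from the tldr is to estimate the spacing directly from the observed peak values. This is only a sketch: the helper name and the peak values are made up for illustration, and it assumes the peaks have already been picked out of a histogram.

```python
from functools import reduce
from math import gcd


def common_spacing(peaks):
    """Greatest common divisor of the gaps between sorted distinct peaks."""
    ordered = sorted(set(peaks))
    if len(ordered) < 2:
        raise ValueError("need at least two distinct peaks")
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return reduce(gcd, gaps)


# Made-up peak positions, NOT data from the issue.
peaks = [516, 1034, 1552, 2588]
spacing = common_spacing(peaks)
print(spacing, spacing % 512)
```

A spacing of 518 leaves a remainder of 6 against 512, so if the batching explanation holds, something has to account for the extra 6 tokens per step.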

zenapollo•21m ago

I’ve definitely experienced step drops in quality on an almost daily basis. I usually use xhigh. The experience of relying on codex’s outstandingly thorough coding earlier in the year has evaporated for me. I’m seeing incredibly stupid implementations intermittently, and have simply switched to Claude until OpenAI takes the issue seriously. As far as I can tell, they haven’t taken it seriously in the several months I’ve personally been seeing it.

siva7•17m ago

I switched to Codex 3 months ago because Claude got incredibly stupid. 6 months ago it was the other way around. It doesn't matter if you use Codex or Claude. Both will fuck with you at some point. Though Codex probably less.

siva7•9m ago

I swear a few days ago someone here claimed OpenAI had succeeded in cutting their compute cost in half with a breakthrough optimization. So this is it?
