
The LLM Lobotomy?

https://learn.microsoft.com/en-us/answers/questions/5561465/the-llm-lobotomy
77•sgt3v•1h ago

Comments

esafak•1h ago
This was the perfect opportunity to share the evidence. I think undisclosed quantization is definitely a thing. We need benchmarks to be periodically re-evaluated to guard against this.

Providers should keep timestamped models fixed and assign modified versions a new timestamp (and a new price, if they want). The model with the "latest" tag could change over time, like a Docker image. Then we could make an informed decision about which version to use. Companies want to cost-optimize their cake and eat it too.
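
For what it's worth, dated snapshots alongside a floating alias are already the pattern some APIs expose; the ask is that the snapshots be verifiably frozen. A minimal sketch with the OpenAI Python SDK (model names illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = [{"role": "user", "content": "Summarize RFC 2119 in one sentence."}]

# Floating alias -- like a Docker ":latest" tag, this can silently
# resolve to different behavior over time.
latest = client.chat.completions.create(model="gpt-4o", messages=prompt)

# Dated snapshot -- the closest thing to a pinned image tag. The ask
# is that providers guarantee these never change after release.
pinned = client.chat.completions.create(
    model="gpt-4o-2024-08-06", messages=prompt
)
```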

edit: I have the same complaint about my Google Home devices. The models they use today are indisputably worse than the ones they used five whole years ago. And features have been removed without notice. Qualitatively, the devices are no longer what I bought.

colordrops•1h ago
In addition to quantization, I suspect the additions they continually make to their hidden system prompts for legal, business, and other reasons slowly degrade responses over time as well.
jonplackett•1h ago
This is quite similar to all the modifications Intel had to make due to Spectre - I bet those system prompts have grown exponentially.
icyfox•46m ago
I guarantee you the weights are already versioned like you're describing. Each training run results in a static bundle of weights, and these are very much pinned (OpenAI has confirmed multiple times that they don't change the model weights once they issue a public release).

> “Not quantized. Weights are the same. If we did change the model, we’d release it as a new model with a new name in the API.”

- [Ted Sanders](https://news.ycombinator.com/item?id=44242198) (OpenAI)

The problem here is that most issues stem from broader infrastructure problems like numerical instability at inference time. Since this affects the whole service pipeline, the logic can't really be encapsulated in a frozen environment like a Docker container. I suppose _technically_ they could maintain a separate inference cluster for each point release, but that would mean previous models don't benefit from common infrastructure improvements, load balancing would be harder to shard across GPUs, and the coordination might be so logistically difficult as to be effectively impossible.
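
To make "numerical instability" concrete: float addition is not associative, and GPU kernels change their reduction order with batch size and shard layout, so the "same" forward pass can differ in the low bits. A toy NumPy illustration of the underlying arithmetic fact (not their inference stack):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)

# Identical numbers, three different summation orders.
s_forward = float(np.sum(x))
s_reversed = float(np.sum(x[::-1]))
s_batched = float(sum(np.sum(c) for c in np.split(x, 100)))  # "batched" reduction

# The sums typically agree to only ~6 significant digits, not exactly --
# and in a deep network these discrepancies accumulate layer by layer.
print(s_forward, s_reversed, s_batched)
```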

https://www.anthropic.com/engineering/a-postmortem-of-three-...

https://thinkingmachines.ai/blog/defeating-nondeterminism-in...

xg15•15m ago
Sorry, but this makes no sense. Numerical instability would lead to random fluctuations in output quality, but not to a continuous slow decline like the OP described.

I've heard of similar experiences from real-life acquaintances: a prompt worked reliably for hundreds of requests per day for several months - and then, when a newer model was released, the model suddenly started making mistakes, ignoring parts of the prompt, etc.

I agree, it doesn't have to be deliberate malice like intentionally nerfing a model to make people switch to the newer one - it might just be that fewer resources are allocated to the older model once the newer one is available, and so the inference parameters change - but some effect around the release of a newer model does seem to be there.

icyfox•11m ago
I'm responding to the parent comment, who is suggesting we version-control the "model" in Docker. There are infra reasons why companies don't do that. Numerical instability is one class of inference issue, but there can be other bugs in the stack, separate from intentionally changing the weights or switching to a quantized model.

As for the original forum post:

- Multiple numerical instability bugs can compound to make things worse (we saw this in the latest Anthropic post-mortem)

- OP didn't provide any details on eval methodology, so I don't think it's worth speculating on this anecdotal report until we see more data

xg15•3m ago
Good points. And I also agree we'd have to see the data that OP collected.

If it indeed did show a slow decline over time and OpenAI did not change the weights, then something does not add up.

gregsadetsky•38m ago
I commented on the forum asking Sarge whether they could share some of their test results.

If they do, I think that it will add a lot to this conversation. Hope it happens!

ProjectArcturis•1h ago
I'm confused why this is addressed to Azure instead of OpenAI. Isn't Azure just offering a wrapper around ChatGPT?

That said, I would also love to see some examples or data, instead of just "it's getting worse".

SubiculumCode•1h ago
I don't remember where I saw it, but I recall a claim that Azure-hosted models performed worse than those hosted by OpenAI.
transcriptase•1h ago
Explains why the enterprise Copilot ChatGPT wrapper that they shoehorn into every piece of Office 365 performs worse than a badly configured local LLM.
bongodongobob•53m ago
They most definitely do. They have been lobotomized in some way to be ultra corporate-friendly. I can only use their M365 Copilot at work, and it's absolute dogshit at writing anything more than maybe 100 lines of code. It can barely write correct PowerShell. Luckily, I really only need it for quick and dirty short PS scripts.
gwynforthewyn•1h ago
What's the conversation that you're looking to have here? There are fairly widespread claims that GPT-5 is worse than 4, and that's what the help article you've linked to says. I'm not sure how this furthers dialogue about or understanding of LLMs, though; it reads to _me_ like this question just reinforces a notion that lots of people already agree with.

What's your aim here, sgt3v? I'd love to positively contribute, but I don't see how this link gets us anywhere.

bigchillin•1h ago
This is why we have open source. I noticed this with Cursor; it's not just an Azure problem.
ukFxqnLa2sBSBf6•1h ago
It’s a good thing the author provided no data or examples. Otherwise, there might be something to actually talk about.
cush•1h ago
Is setting temperature to 0 even a valid way to measure LLM performance over time, all else equal?
jonplackett•1h ago
It could be that performance on temp zero has declined but performance on a normal temp is the same or better.

I wonder if temp zero would be more influenced by changes to the system prompt too. I can imagine it making responses more brittle.

fortyseven•56m ago
I'd have assumed a fixed seed was used, but he doesn't mention that. Weird. Maybe he meant that?
criemen•48m ago
Even with temperature 0, the LLM output will not be deterministic. It will just have less randomness (not precisely defined) than with temperature 1. There was a recent post on the frontpage about fully deterministic sampling; it turns out to be quite difficult.
visarga•16m ago
It's because batch size is dynamic, so a different batch size can change the output even at temp 0.
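
And at temp 0 those low-bit differences still matter, because greedy decoding is an argmax over logits: two near-tied logits can swap order between runs, and generation diverges from the first flipped token. A toy sketch with hypothetical logits, not any real model:

```python
import numpy as np

# Two "runs" of the same forward pass whose logits differ only by
# float-level noise (e.g. from a different batch size / reduction order).
logits_run_a = np.array([4.1000004, 4.0999999, 1.3], dtype=np.float32)
logits_run_b = np.array([4.0999999, 4.1000004, 1.3], dtype=np.float32)

# Temperature 0 means greedy decoding: pick the argmax, no sampling.
print(int(np.argmax(logits_run_a)))  # -> 0
print(int(np.argmax(logits_run_b)))  # -> 1: one flipped token, after which
                                     #    the two generations diverge
```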
SirensOfTitan•1h ago
I’m convinced all of the major LLM providers silently quantize their models. The absolute worst was Google’s transition from the Gemini 2.5 Pro 03-25 checkpoint to the May checkpoint, but I’ve noticed this effect with Claude and GPT over the years too.

I couldn’t imagine relying on any closed models for a business because of this highly dishonest and deceptive practice.
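
If so, it would be detectable in principle, because quantization perturbs every logit in a way fixed temp-0 prompts can surface. A toy illustration of why int8-quantized weights don't reproduce full-precision outputs exactly (random matrices, not any real model):

```python
import numpy as np

rng = np.random.default_rng(42)
w = rng.standard_normal((256, 256)).astype(np.float32)  # "full-precision" weights
x = rng.standard_normal(256).astype(np.float32)         # one fixed input

# Naive symmetric int8 quantization: scale into [-127, 127], round, rescale.
scale = np.abs(w).max() / 127.0
w_int8 = np.round(w / scale).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale

out_fp32 = w @ x
out_int8 = w_dequant @ x

print(np.max(np.abs(out_fp32 - out_int8)))         # systematic, nonzero error
print(np.argmax(out_fp32) == np.argmax(out_int8))  # near-ties can flip this to False
```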

briga•1h ago
I have a theory: all these people reporting degrading model quality over time aren't actually seeing model quality deteriorate. What they are actually doing is discovering that these models aren't as powerful as they initially thought (i.e., expanding their sample size for judging how good the model is). The probabilistic nature of LLMs produces a lot of confused thinking about how good a model is: just because a model produces nine excellent responses doesn't mean the tenth won't be garbage.
chaos_emergent•1h ago
Yes, exactly. My theory is that the novelty of a new generation of LLMs tends to inflate people's perception of the model, with a reversion to a better-calibrated expectation over time. If the developer reported numerical evaluations that drifted over time, I'd be more convinced of model change.
colordrops•1h ago
Did any of you read the article? They have a test framework that objectively shows the model getting worse over time.
Aurornis•58m ago
I read the article. No proof was included. Not even a graph of declining results.
yieldcrv•1h ago
FTA: “I am glad I have proof of this with the test system”

I think they have receipts, but did not post them there

Aurornis•1h ago
A lot of the claims I’ve seen have claimed to have proof, but details are never shared.

Even a simple graph of the output would be better than nothing, but instead it’s just an empty claim.

yieldcrv•17m ago
That's been my experience too

but I use local models, sometimes the same ones for years now, and the consistency and predictability there are noteworthy, while I also have doubts about the quality consistency I get from closed models in the cloud. I don't see these kinds of complaints from people using local models, which undermines the idea that people were just wowed three months ago and are less impressed now.

so perhaps it's just a matter of transparency

but I think there is ongoing fine-tuning occurring, alongside filters added and removed in an opaque way in front of the model

zzzeek•59m ago
Your theory does not hold up for this specific article: they carefully explained that they send identical inputs into the model each time and observe progressively worse results with other variables unchanged. (Though to be fair, others have noted they provided no replication details as to how they arrived at these results.)
vintermann•59m ago
They test specific prompts with temperature 0. It is of course possible that all their test prompts were lucky, but even then, shouldn't you see an immediate drop followed by a flat or increasing line?

Also, from what I understand from the article, it's not a difficult task but an easily machine-checkable one, i.e. whether the output conforms to a specific format.
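
A harness like that is small enough to sketch: fixed prompts, temperature 0, a machine-checkable pass/fail, and a timestamped log so drift shows up as a declining pass rate. Illustrative only - the prompt, checker, and model name below are placeholders, not OP's actual suite:

```python
import csv, re
from datetime import datetime, timezone
from openai import OpenAI

client = OpenAI()

# Fixed regression cases: a prompt plus a format check on the output.
CASES = [
    ('Reply with exactly the JSON object {"ok": true} and nothing else.',
     lambda out: re.fullmatch(r'\s*\{\s*"ok"\s*:\s*true\s*\}\s*', out) is not None),
]

def run_suite(model: str) -> float:
    passed = 0
    for prompt, check in CASES:
        out = client.chat.completions.create(
            model=model,
            temperature=0,  # greedy decoding; still not perfectly deterministic
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        passed += bool(check(out))
    return passed / len(CASES)

# Append one timestamped row per run; plotting this file over months is
# exactly the declining line the OP describes.
with open("eval_log.csv", "a", newline="") as f:
    csv.writer(f).writerow(
        [datetime.now(timezone.utc).isoformat(), run_suite("gpt-4o-2024-08-06")]
    )
```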

nothrabannosir•59m ago
TFA is about someone running the same test suite, with 0 temperature and fixed inputs and fixtures, on the same model for months on end.

What’s missing is the actual evidence. Which I would love of course. But assuming they’re not actively lying, this is not as subjective as you suggest.

zzzeek•1h ago
I'm sure MSFT will offer this person some upgraded API tier that somewhat improves the issues, though not terrifically, for only ten times the price.
jug•58m ago
At least on OpenRouter, you can often verify what quant a provider is using for a particular model.
mehdibl•32m ago
Since when did LLMs become deterministic?

Designing NotebookLM

https://jasonspielman.com/notebooklm
69•vinhnx•2h ago•25 comments

Ultrasonic Chef's Knife

https://seattleultrasonics.com/
94•hemloc_io•3h ago•65 comments

Cormac McCarthy's tips on how to write a science paper (2019) [pdf]

https://gwern.net/doc/science/2019-savage.pdf
143•surprisetalk•5h ago•53 comments

Scream cipher

https://sethmlarson.dev/scream-cipher
220•alexmolas•2d ago•87 comments

TV Time Machine: A Raspberry Pi That Plays Random 90s TV

https://quarters.captaintouch.com/blog/posts/2025-09-20-tv-time-machine-a-raspberry-pi-that-plays...
19•capitain•58m ago•6 comments

Escapee pregnancy test frogs colonised Wales for 50 years (2019)

https://www.bbc.com/news/uk-wales-44886585
89•Luc•4d ago•37 comments

Images over DNS

https://dgl.cx/2025/09/images-over-dns
125•dgl•8h ago•34 comments

I'm Not a Robot

https://neal.fun/not-a-robot/
148•meetpateltech•4d ago•93 comments

MapSCII – World map in terminal

https://github.com/rastapasta/mapscii
106•_august•2d ago•16 comments

Living microbial cement supercapacitors with reactivatable energy storage

https://www.cell.com/cell-reports-physical-science/fulltext/S2666-3864(25)00409-6
66•PaulHoule•6h ago•39 comments

Bringing restartable sequences out of the niche

https://lwn.net/Articles/1033955/
16•PaulHoule•57m ago•0 comments

Evals in 2025: going beyond simple benchmarks to build models people can use

https://github.com/huggingface/evaluation-guidebook/blob/main/yearly_dives/2025-evaluations-for-u...
32•jxmorris12•2d ago•2 comments

Systemd can be a cause of restrictions on daemons

https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdCanBeRestrictionCause
66•zdw•4h ago•71 comments

Solving a wooden puzzle using Haskell

https://glocq.github.io/en/blog/20250428/
6•Bogdanp•3d ago•2 comments

FLX1s phone is launched

https://furilabs.com/flx1s-is-launched/
94•slau•8h ago•81 comments

After Babel Fish: The promise of cheap translations at the speed of the Web

https://hedgehogreview.com/issues/lessons-of-babel/articles/after-babel-fish
4•miqkt•2d ago•0 comments

Is Zig's new writer unsafe?

https://www.openmymind.net/Is-Zigs-New-Io-Unsafe/
114•ibobev•5h ago•94 comments

Are touchscreens in cars dangerous?

https://www.economist.com/science-and-technology/2025/09/19/are-touchscreens-in-cars-dangerous
120•Brajeshwar•4h ago•113 comments

Claude can sometimes prove it

https://www.galois.com/articles/claude-can-sometimes-prove-it
162•lairv•3d ago•49 comments

Vapor chamber tech keeps iPhone 17 Pro cool

https://spectrum.ieee.org/iphone-17-pro-vapor-chamber
37•rbanffy•6h ago•101 comments

Bezier Curve as Easing Function in C++

https://asawicki.info/news_1790_bezier_curve_as_easing_function_in_c
39•ibobev•6h ago•4 comments

Show HN: Math2Tex – Convert handwritten math and complex notes to LaTeX text

30•leoyixing•3d ago•12 comments

UNESCO Launches the First Virtual Museum of Stolen Cultural Objects

https://www.unesco.org/en/articles/unesco-launches-worlds-first-virtual-museum-stolen-cultural-ob...
3•gnabgib•1h ago•0 comments

The LLM Lobotomy?

https://learn.microsoft.com/en-us/answers/questions/5561465/the-llm-lobotomy
77•sgt3v•1h ago•34 comments

PYREX vs. pyrex: What's the difference?

https://www.corning.com/worldwide/en/products/life-sciences/resources/stories/in-the-field/pyrex-...
69•lisper•13h ago•54 comments

Git: Introduce Rust and announce it will become mandatory in the build system

https://lore.kernel.org/git/20250904-b4-pks-rust-breaking-change-v1-0-3af1d25e0be9@pks.im/
264•WhyNotHugo•7h ago•224 comments

Visa holders on vacation have 15 hours to return to US or pay $100k fee

https://timesofindia.indiatimes.com/technology/tech-news/microsoft-has-a-24-hour-deadline-warning...
274•irthomasthomas•7h ago•393 comments

If all the world were a monorepo

https://jtibs.substack.com/p/if-all-the-world-were-a-monorepo
240•sebg•4d ago•67 comments

Overcoming barriers of hydrogen storage with a low-temperature hydrogen battery

https://www.isct.ac.jp/en/news/okmktjxyrvdc
47•rustoo•8h ago•51 comments

If you are good at code review, you will be good at using AI agents

https://www.seangoedecke.com/ai-agents-and-code-review/
114•imasl42•14h ago•113 comments