That said, I would also love to see some examples or data, instead of just "it's getting worse".
With that said, Microsoft has a different level of responsibility to provide safety, both to its customers and to its stakeholders, than OpenAI or any other frontier provider. That's not a criticism of OpenAI or Anthropic or anyone else, all of whom I believe are trying their best to provide safe usage. (Well, other than xAI and Grok, for which the lack of safety is a feature, not a bug.)
The risk to Microsoft of getting this wrong is simply higher than it is for other companies, and that's why they have a strong focus on Responsible AI (RAI) [1]. I don't know the details, but I have to assume there's a layer of RAI processing on models served through Azure OpenAI that isn't there when using OpenAI models directly through the OpenAI API. That layer is valuable for companies who choose to run their inference through Azure and also want to maximize safety.
I wonder if that's where some of the observed changes are coming from. I hope the commenter posts their proof for further inspection. It would help everyone.
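Purely to illustrate where such a layer would sit (not a claim about what Azure actually runs internally), here's a rough sketch using the `openai` Python SDK. The endpoint, deployment name, and keys are placeholders; the point is that the Azure path attaches content-filter annotations that a direct OpenAI call doesn't.

```python
# Sketch only: the same request sent directly to OpenAI vs. through an
# Azure OpenAI deployment. Endpoint, deployment, and keys are placeholders.
from openai import OpenAI, AzureOpenAI

messages = [{"role": "user", "content": "Summarize this incident report."}]

# Direct OpenAI call: no Azure-side RAI/content-filter layer in the path.
direct = OpenAI(api_key="sk-...").chat.completions.create(
    model="gpt-4o", messages=messages
)

# Same request through Azure OpenAI: Azure applies its own content-filter
# policies and annotates the response with the results.
azure = AzureOpenAI(
    api_key="...",
    api_version="2024-06-01",
    azure_endpoint="https://my-resource.openai.azure.com",  # placeholder
)
filtered = azure.chat.completions.create(
    model="my-gpt4o-deployment",  # Azure deployment name, not the raw model ID
    messages=messages,
)

# Azure responses include per-category filter annotations; the exact field
# names and whether the SDK surfaces them depend on the API version.
print(filtered.model_dump().get("prompt_filter_results"))
```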
What's your aim here, sgt3v? I'd love to positively contribute, but I don't see how this link gets us anywhere.
I can tell you that what the post describes is exactly what I’ve seen too: degraded performance and excruciating slowness.
I wonder if temp zero would be more influenced by changes to the system prompt too. I can imagine it making responses more brittle.
https://learn.microsoft.com/en-us/azure/ai-foundry/openai/re...
That being said, we also keep a test suite to check that model updates don't result in worse results for our users, and it has worked well enough. We had to skip a few versions of Sonnet because they stopped being able to complete tasks (on the same data) that they could previously. I don't blame Anthropic; I would be crazy to assume that new models are a strict improvement across all tasks and domains.
I do just wish they would stop deprecating old models; once you have something working to your satisfaction, it would be nice to freeze it. Ah well, that's only possible with local models.
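For anyone curious, the test suite doesn't need to be elaborate. A minimal sketch of the idea, with made-up tasks and thresholds and a placeholder `call_model` standing in for whatever client you actually use:

```python
# Minimal regression check: run a fixed task set against a candidate model
# version and refuse to switch if it scores worse than the current one.
# Tasks, expected answers, and tolerance are invented for illustration.

FIXED_TASKS = [
    {"prompt": "Extract the invoice total from: ...", "expected": "142.50"},
    {"prompt": "Classify the sentiment of: ...", "expected": "negative"},
]

def call_model(model_id: str, prompt: str) -> str:
    """Placeholder for the real API call (Anthropic, OpenAI, local, ...)."""
    raise NotImplementedError

def pass_rate(model_id: str) -> float:
    passed = sum(
        1 for task in FIXED_TASKS
        if task["expected"] in call_model(model_id, task["prompt"])
    )
    return passed / len(FIXED_TASKS)

def safe_to_upgrade(current: str, candidate: str, tolerance: float = 0.0) -> bool:
    # Skip the new version (as above with Sonnet) if it regresses on the same data.
    return pass_rate(candidate) >= pass_rate(current) - tolerance
```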
I couldn’t imagine relying on any closed models for a business because of this highly dishonest and deceptive practice.
I think they have receipts, but did not post them there
Even a simple graph of the output would be better than nothing, but instead it’s just an empty claim.
but I use local models, sometimes the same ones for years already, and the consistency there is noteworthy, while I have ongoing doubts about the quality consistency of closed models in the cloud. I don't see these kinds of complaints from people using local models, which undermines the idea that people were just wowed three months ago and are less impressed now.
so perhaps it's just a matter of transparency
but I think there is consistent fine-tuning occurring, alongside filters being added and removed in an opaque way in front of the model
Also, from what I understand from the article, it's not a difficult task but an easily machine-checkable one, i.e. whether the output conforms to a specific format.
What’s missing is the actual evidence. Which I would love of course. But assuming they’re not actively lying, this is not as subjective as you suggest.
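To make that concrete, a check like this is all it takes. The expected format below (a JSON object with a constrained `label` field) is invented for illustration, not what the original poster was actually testing:

```python
import json

def conforms(output: str) -> bool:
    """True if the model output matches the expected format.

    Toy format: a JSON object whose 'label' is one of a fixed set,
    e.g. {"label": "spam"}.
    """
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.get("label") in {"spam", "ham"}

# Logging this rate over time, with timestamps and model versions, is
# exactly the kind of evidence that would settle the "it got worse" claim.
outputs = ['{"label": "spam"}', "Sure! Here's the answer: spam"]
rate = sum(conforms(o) for o in outputs) / len(outputs)
print(f"format conformance: {rate:.0%}")  # 50% in this toy example
```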
I think it would have sounded more reasonable in French, which is my actual native tongue. (i.e. I subconsciously translate from French when I'm writing in English)
((this comment was also written without AI!!)) :-)
I wonder how the causal graph looks here: do people (esp. those working with LLMs a lot) lean towards LLM-speak over time, or did both LLMs and native speakers pick up this very particular sentence structure from a common source (e.g. a large corpus of French-English translations in the same style)?
I’ve been removing hyphens and bullet points from my own writing just to appear even less LLM-like! :)
Great stylistic chicken and egg question! French definitely tends to use certain (I’m struggling to not say “fancier”) words even in informal contexts.
I personally value using over-the-top ornate expressions in French: they both sound distinguished and a bit ridiculous, so I get to both ironically enjoy them and feel detached from them… but none of that really translates to casual English. :)
Cheers
Now, I can try the same things, and Claude gets it terribly wrong and works itself into problems it can't find its way out of.
The cynical side of me thinks this is being done on purpose, not to save Anthropic money, but to make more money by burning tokens.
The first issue I ran into was with them not supporting LLaMA for tool calls. Microsoft stated in February that they were working on it [0], and they were just closing the ticket because they were tracking it internally. I'm not sure why, in over six months, they've been unable to do what took me two hours, but I am sure they wouldn't be upset by me using the much more expensive OpenAI models.
There are also consistent performance issues, even on small models, as mentioned elsewhere. This is at a rate on the order of one request per minute. You can solve that with provisioned throughput units. The cheapest option is one of the GPT models, at a minimum of $10k/month (a bit under half the cost of just renting an A100 server). DeepSeek was a minimum of around $72k/month. I don't remember there being any other non-OpenAI models with a provisioned option.
Given that our current usage without provisioning comes to single-digit dollars per month, I have some doubts as to whether we'd be getting our money's worth by provisioning capacity.
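Back-of-the-envelope version, with the per-request cost as a pure guess (only the ~1 request/minute rate and the $10k/month floor come from the numbers above):

```python
# Rough comparison: pay-as-you-go at ~1 request/minute vs. the cheapest
# provisioned-throughput floor mentioned above. Per-request cost is assumed.
requests_per_month = 60 * 24 * 30          # ~1 request per minute
assumed_cost_per_request = 0.0001          # placeholder for a small model/prompt
pay_as_you_go = requests_per_month * assumed_cost_per_request   # ~$4/month
provisioned_floor = 10_000                 # $/month, cheapest PTU option

print(f"pay-as-you-go: ~${pay_as_you_go:,.2f}/month")
print(f"provisioned:    ${provisioned_floor:,}/month "
      f"(~{provisioned_floor / pay_as_you_go:,.0f}x more)")
```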
esafak•4mo ago
Providers should keep timestamped models fixed and assign modified versions a new timestamp (and a new price, if they want). The model with the "latest" tag could change over time, like a Docker image. Then we can make an informed decision about which version to use. Companies want to cost-optimize their cake and eat it too.
edit: I have the same complaint about my Google Home devices. The models they use today are indisputably worse than the ones they used five whole years ago. And features have been removed without notice. Qualitatively, the devices are no longer what I bought.
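Dated snapshots on some APIs already work roughly this way; the catch is that pinning the model ID doesn't freeze the serving stack underneath it. A sketch, with illustrative model IDs:

```python
# Pinning a dated snapshot vs. riding a floating alias, Docker-tag style.
# Model IDs are illustrative; check your provider's current model list.
from openai import OpenAI

client = OpenAI()

FLOATING_ALIAS = "gpt-4o"              # can be repointed over time, like :latest
PINNED_SNAPSHOT = "gpt-4o-2024-08-06"  # dated snapshot, closer to an image digest

def ask(model_id: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

# Pinning rules out "the weights changed under me" as an explanation,
# but not changes in the inference stack serving those weights.
```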
colordrops•4mo ago
jonplackett•4mo ago
icyfox•4mo ago
> Not quantized. Weights are the same. If we did change the model, we’d release it as a new model with a new name in the API.
- [Ted Sanders](https://news.ycombinator.com/item?id=44242198) (OpenAI)
The problem here is that most of these issues stem from broader infrastructure problems like numerical instability at inference time. Since this affects their whole service pipeline, the logic can't really be encapsulated in a frozen environment like a Docker container. I suppose _technically_ they could maintain a separate inference cluster for each of their point releases, but then previous models wouldn't benefit from common infrastructure improvements, load balancing would be harder to shard across GPUs, and the coordination might be so difficult logistically as to be effectively impossible.
https://www.anthropic.com/engineering/a-postmortem-of-three-... https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
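The numerical-instability point is easy to demonstrate without any GPUs; here's a toy illustration of how reduction order alone shifts results (on real inference servers, batching and sharding change that order constantly):

```python
# Floating-point addition is not associative, so the same logits computed
# with a different reduction order (different batch size, kernel, or GPU
# count) can differ in the last bits -- enough to flip a greedy token
# choice when two candidates are nearly tied.
import random

random.seed(0)
values = [random.uniform(-1, 1) for _ in range(100_000)]

left_to_right = sum(values)
reordered = sum(sorted(values))        # same numbers, different order

print(left_to_right == reordered)      # usually False
print(abs(left_to_right - reordered))  # tiny, but nonzero
```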
xg15•4mo ago
I've heard of similar experiences from real-life acquaintances, where a prompt worked reliably for hundreds of requests per day for several months - and then, when a newer model was released, the model suddenly started making mistakes, ignoring parts of the prompt, etc.
I agree, it doesn't have to be deliberate malice like intentionally nerfing a model to make people switch to the newer one - it might just be that fewer resources are allocated to the older model once the newer one is available, and so the inference parameters change - but some effect around the release of a newer model does seem to be there.
icyfox•4mo ago
As for the original forum post:
- Multiple numerical computation bugs can compound to make things worse (we saw this in the latest Anthropic post-mortem)
- OP didn't provide any details on eval methodology, so I don't think it's worth speculating on this anecdotal report until we see more data
xg15•4mo ago
If it indeed did show a slow decline over time and OpenAI did not change the weights, then something does not add up.
esafak•4mo ago
gregsadetsky•4mo ago
If they do, I think that it will add a lot to this conversation. Hope it happens!
gregsadetsky•4mo ago
I asked them to share data/dates as much as that’s possible - fingers crossed