
The LLM Lobotomy?

https://learn.microsoft.com/en-us/answers/questions/5561465/the-llm-lobotomy
77•sgt3v•1h ago

Comments

esafak•1h ago
This was the perfect opportunity to share the evidence. I think undisclosed quantization is definitely a thing. We need benchmarks to be periodically re-evaluated to guard against this.

Providers should keep timestamped models fixed and assign modified versions a new timestamp (and a new price, if they want). The model with the "latest" tag could change over time, like a Docker image. Then we could make an informed decision about which version to use. Companies want to cost-optimize their cake and eat it too.
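
For what it's worth, dated snapshots alongside a floating alias are already the pattern some APIs expose; the ask is that the snapshots be verifiably frozen. A minimal sketch with the OpenAI Python SDK (model names illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = [{"role": "user", "content": "Summarize RFC 2119 in one sentence."}]

# Floating alias -- like a Docker ":latest" tag, this can silently
# resolve to different behavior over time.
latest = client.chat.completions.create(model="gpt-4o", messages=prompt)

# Dated snapshot -- the closest thing to a pinned image tag. The ask
# is that providers guarantee these never change after release.
pinned = client.chat.completions.create(
    model="gpt-4o-2024-08-06", messages=prompt
)
```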

edit: I have the same complaint about my Google Home devices. The models they use today are indisputably worse than the ones they used five whole years ago. And features have been removed without notice. Qualitatively, the devices are no longer what I bought.

colordrops•1h ago
In addition to quantization, I suspect the additions they continually make to their hidden system prompts for legal, business, and other reasons slowly degrade responses over time as well.
jonplackett•1h ago
This is quite similar to all the modifications Intel had to make due to Spectre - I bet those system prompts have grown exponentially.
icyfox•46m ago
I guarantee you the weights are already versioned like you're describing. Each training run results in a static bundle of weights, and these are very much pinned (OpenAI has confirmed multiple times that they don't change the model weights once they issue a public release).

> “Not quantized. Weights are the same. If we did change the model, we’d release it as a new model with a new name in the API.”

- [Ted Sanders](https://news.ycombinator.com/item?id=44242198) (OpenAI)

The problem here is that most issues stem from broader infrastructure problems like numerical instability at inference time. Since this affects the whole service pipeline, the logic can't really be encapsulated in a frozen environment like a Docker container. I suppose _technically_ they could maintain a separate inference cluster for each point release, but that would mean previous models don't benefit from common infrastructure improvements, load balancing would be harder to shard across GPUs, and the coordination might be so logistically difficult as to be effectively impossible.
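
To make "numerical instability" concrete: float addition is not associative, and GPU kernels change their reduction order with batch size and shard layout, so the "same" forward pass can differ in the low bits. A toy NumPy illustration of the underlying arithmetic fact (not their inference stack):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)

# Identical numbers, three different summation orders.
s_forward = float(np.sum(x))
s_reversed = float(np.sum(x[::-1]))
s_batched = float(sum(np.sum(c) for c in np.split(x, 100)))  # "batched" reduction

# The sums typically agree to only ~6 significant digits, not exactly --
# and in a deep network these discrepancies accumulate layer by layer.
print(s_forward, s_reversed, s_batched)
```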

https://www.anthropic.com/engineering/a-postmortem-of-three-...

https://thinkingmachines.ai/blog/defeating-nondeterminism-in...

xg15•15m ago
Sorry, but this makes no sense. Numerical instability would lead to random fluctuations in output quality, but not to a continuous slow decline like the OP described.

I've heard of similar experiences from real-life acquaintances: a prompt worked reliably for hundreds of requests per day for several months - and then, when a newer model was released, the model suddenly started making mistakes, ignoring parts of the prompt, etc.

I agree, it doesn't have to be deliberate malice like intentionally nerfing a model to make people switch to the newer one - it might just be that fewer resources are allocated to the older model once the newer one is available, and so the inference parameters change - but some effect around the release of a newer model does seem to be there.

icyfox•11m ago
I'm responding to the parent comment, who is suggesting we version-control the "model" in Docker. There are infra reasons why companies don't do that. Numerical instability is one class of inference issue, but there can be other bugs in the stack, separate from intentionally changing the weights or switching to a quantized model.

As for the original forum post:

- Multiple numerical instability bugs can compound to make things worse (we saw this in the latest Anthropic post-mortem)

- OP didn't provide any details on eval methodology, so I don't think it's worth speculating on this anecdotal report until we see more data

xg15•3m ago
Good points. And I also agree we'd have to see the data that OP collected.

If it indeed did show a slow decline over time and OpenAI did not change the weights, then something does not add up.

gregsadetsky•38m ago
I commented on the forum asking Sarge whether they could share some of their test results.

If they do, I think that it will add a lot to this conversation. Hope it happens!

ProjectArcturis•1h ago
I'm confused why this is addressed to Azure instead of OpenAI. Isn't Azure just offering a wrapper around ChatGPT?

That said, I would also love to see some examples or data, instead of just "it's getting worse".

SubiculumCode•1h ago
I don't remember where I saw it, but I recall a claim that Azure-hosted models performed worse than those hosted by OpenAI.
transcriptase•1h ago
Explains why the enterprise Copilot ChatGPT wrapper that they shoehorn into every piece of Office 365 performs worse than a badly configured local LLM.
bongodongobob•53m ago
They most definitely do. They have been lobotomized in some way to be ultra corporate-friendly. I can only use their M365 Copilot at work, and it's absolute dogshit at writing anything more than maybe 100 lines of code. It can barely write correct PowerShell. Luckily, I really only need it for quick and dirty short PS scripts.
gwynforthewyn•1h ago
What's the conversation that you're looking to have here? There are fairly widespread claims that GPT-5 is worse than 4, and that's what the help article you've linked to says. I'm not sure how this furthers dialogue about or understanding of LLMs, though; it reads to _me_ like this question just reinforces a notion that lots of people already agree with.

What's your aim here, sgt3v? I'd love to positively contribute, but I don't see how this link gets us anywhere.

bigchillin•1h ago
This is why we have open source. I noticed this with Cursor; it's not just an Azure problem.
ukFxqnLa2sBSBf6•1h ago
It’s a good thing the author provided no data or examples. Otherwise, there might be something to actually talk about.
cush•1h ago
Is setting temperature to 0 even a valid way to measure LLM performance over time, all else equal?
jonplackett•1h ago
It could be that performance on temp zero has declined but performance on a normal temp is the same or better.

I wonder if temp zero would be more influenced by changes to the system prompt too. I can imagine it making responses more brittle.

fortyseven•56m ago
I'd have assumed a fixed seed was used, but he doesn't mention that. Weird. Maybe he meant that?
criemen•48m ago
Even with temperature 0, the LLM output will not be deterministic. It will just have less randomness (not precisely defined) than with temperature 1. There was a recent post on the frontpage about fully deterministic sampling; it turns out to be quite difficult.
visarga•16m ago
It's because batch size is dynamic, so a different batch size can change the output even at temp 0.
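
And at temp 0 those low-bit differences still matter, because greedy decoding is an argmax over logits: two near-tied logits can swap order between runs, and generation diverges from the first flipped token. A toy sketch with hypothetical logits, not any real model:

```python
import numpy as np

# Two "runs" of the same forward pass whose logits differ only by
# float-level noise (e.g. from a different batch size / reduction order).
logits_run_a = np.array([4.1000004, 4.0999999, 1.3], dtype=np.float32)
logits_run_b = np.array([4.0999999, 4.1000004, 1.3], dtype=np.float32)

# Temperature 0 means greedy decoding: pick the argmax, no sampling.
print(int(np.argmax(logits_run_a)))  # -> 0
print(int(np.argmax(logits_run_b)))  # -> 1: one flipped token, after which
                                     #    the two generations diverge
```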
SirensOfTitan•1h ago
I’m convinced all of the major LLM providers silently quantize their models. The absolute worst was Google’s transition from the Gemini 2.5 Pro 03-25 checkpoint to the May checkpoint, but I’ve noticed this effect with Claude and GPT over the years too.

I couldn’t imagine relying on any closed models for a business because of this highly dishonest and deceptive practice.
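
If so, it would be detectable in principle, because quantization perturbs every logit in a way fixed temp-0 prompts can surface. A toy illustration of why int8-quantized weights don't reproduce full-precision outputs exactly (random matrices, not any real model):

```python
import numpy as np

rng = np.random.default_rng(42)
w = rng.standard_normal((256, 256)).astype(np.float32)  # "full-precision" weights
x = rng.standard_normal(256).astype(np.float32)         # one fixed input

# Naive symmetric int8 quantization: scale into [-127, 127], round, rescale.
scale = np.abs(w).max() / 127.0
w_int8 = np.round(w / scale).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale

out_fp32 = w @ x
out_int8 = w_dequant @ x

print(np.max(np.abs(out_fp32 - out_int8)))         # systematic, nonzero error
print(np.argmax(out_fp32) == np.argmax(out_int8))  # near-ties can flip this to False
```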

briga•1h ago
I have a theory: all these people reporting degrading model quality over time aren't actually seeing model quality deteriorate. What they are actually doing is discovering that these models aren't as powerful as they initially thought (i.e., expanding their sample size for judging how good the model is). The probabilistic nature of LLMs produces a lot of confused thinking about how good a model is: just because a model produces nine excellent responses doesn't mean the tenth won't be garbage.
chaos_emergent•1h ago
Yes, exactly. My theory is that the novelty of a new generation of LLMs tends to inflate people's perception of the model, with a reversion to a better-calibrated expectation over time. If the developer reported numerical evaluations that drifted over time, I'd be more convinced of model change.
colordrops•1h ago
Did any of you read the article? They have a test framework that objectively shows the model getting worse over time.
Aurornis•58m ago
I read the article. No proof was included. Not even a graph of declining results.
yieldcrv•1h ago
FTA: “I am glad I have proof of this with the test system”

I think they have receipts, but did not post them there

Aurornis•1h ago
A lot of the claims I’ve seen have claimed to have proof, but details are never shared.

Even a simple graph of the output would be better than nothing, but instead it’s just an empty claim.

yieldcrv•17m ago
That's been my experience too

but I use local models, sometimes the same ones for years now, and the consistency and predictability there are noteworthy, while I also have doubts about the quality consistency I get from closed models in the cloud. I don't see these kinds of complaints from people using local models, which undermines the idea that people were just wowed three months ago and are less impressed now.

so perhaps it's just a matter of transparency

but I think there is ongoing fine-tuning occurring, alongside filters added and removed in an opaque way in front of the model

zzzeek•59m ago
Your theory does not hold up for this specific article: they carefully explained that they send identical inputs into the model each time and observe progressively worse results with other variables unchanged. (Though to be fair, others have noted they provided no replication details as to how they arrived at these results.)
vintermann•59m ago
They test specific prompts with temperature 0. It is of course possible that all their test prompts were lucky, but even then, shouldn't you see an immediate drop followed by a flat or increasing line?

Also, from what I understand from the article, it's not a difficult task but an easily machine-checkable one, i.e. whether the output conforms to a specific format.
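
A harness like that is small enough to sketch: fixed prompts, temperature 0, a machine-checkable pass/fail, and a timestamped log so drift shows up as a declining pass rate. Illustrative only - the prompt, checker, and model name below are placeholders, not OP's actual suite:

```python
import csv, re
from datetime import datetime, timezone
from openai import OpenAI

client = OpenAI()

# Fixed regression cases: a prompt plus a format check on the output.
CASES = [
    ('Reply with exactly the JSON object {"ok": true} and nothing else.',
     lambda out: re.fullmatch(r'\s*\{\s*"ok"\s*:\s*true\s*\}\s*', out) is not None),
]

def run_suite(model: str) -> float:
    passed = 0
    for prompt, check in CASES:
        out = client.chat.completions.create(
            model=model,
            temperature=0,  # greedy decoding; still not perfectly deterministic
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        passed += bool(check(out))
    return passed / len(CASES)

# Append one timestamped row per run; plotting this file over months is
# exactly the declining line the OP describes.
with open("eval_log.csv", "a", newline="") as f:
    csv.writer(f).writerow(
        [datetime.now(timezone.utc).isoformat(), run_suite("gpt-4o-2024-08-06")]
    )
```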

nothrabannosir•59m ago
TFA is about someone running the same test suite, with 0 temperature and fixed inputs and fixtures, on the same model for months on end.

What’s missing is the actual evidence. Which I would love of course. But assuming they’re not actively lying, this is not as subjective as you suggest.

zzzeek•1h ago
I'm sure MSFT will offer this person some upgraded API tier that somewhat improves the issues, though not terrifically, for only ten times the price.
jug•58m ago
At least on OpenRouter, you can often verify what quant a provider is using for a particular model.
mehdibl•32m ago
Since when did LLMs become deterministic?

Designing NotebookLM

https://jasonspielman.com/notebooklm
69•vinhnx•2h ago•25 comments

Ultrasonic Chef's Knife

https://seattleultrasonics.com/
94•hemloc_io•3h ago•65 comments

Cormac McCarthy's tips on how to write a science paper (2019) [pdf]

https://gwern.net/doc/science/2019-savage.pdf
143•surprisetalk•5h ago•53 comments

Scream cipher

https://sethmlarson.dev/scream-cipher
220•alexmolas•2d ago•87 comments

TV Time Machine: A Raspberry Pi That Plays Random 90s TV

https://quarters.captaintouch.com/blog/posts/2025-09-20-tv-time-machine-a-raspberry-pi-that-plays...
19•capitain•58m ago•6 comments

Escapee pregnancy test frogs colonised Wales for 50 years (2019)

https://www.bbc.com/news/uk-wales-44886585
89•Luc•4d ago•37 comments

Images over DNS

https://dgl.cx/2025/09/images-over-dns
125•dgl•8h ago•34 comments

I'm Not a Robot

https://neal.fun/not-a-robot/
148•meetpateltech•4d ago•93 comments

MapSCII – World map in terminal

https://github.com/rastapasta/mapscii
106•_august•2d ago•16 comments

Living microbial cement supercapacitors with reactivatable energy storage

https://www.cell.com/cell-reports-physical-science/fulltext/S2666-3864(25)00409-6
66•PaulHoule•6h ago•39 comments

Bringing restartable sequences out of the niche

https://lwn.net/Articles/1033955/
16•PaulHoule•57m ago•0 comments

Evals in 2025: going beyond simple benchmarks to build models people can use

https://github.com/huggingface/evaluation-guidebook/blob/main/yearly_dives/2025-evaluations-for-u...
32•jxmorris12•2d ago•2 comments

Systemd can be a cause of restrictions on daemons

https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdCanBeRestrictionCause
66•zdw•4h ago•71 comments

Solving a wooden puzzle using Haskell

https://glocq.github.io/en/blog/20250428/
6•Bogdanp•3d ago•2 comments

FLX1s phone is launched

https://furilabs.com/flx1s-is-launched/
94•slau•8h ago•81 comments

After Babel Fish: The promise of cheap translations at the speed of the Web

https://hedgehogreview.com/issues/lessons-of-babel/articles/after-babel-fish
4•miqkt•2d ago•0 comments

Is Zig's new writer unsafe?

https://www.openmymind.net/Is-Zigs-New-Io-Unsafe/
114•ibobev•5h ago•94 comments

Are touchscreens in cars dangerous?

https://www.economist.com/science-and-technology/2025/09/19/are-touchscreens-in-cars-dangerous
120•Brajeshwar•4h ago•113 comments

Claude can sometimes prove it

https://www.galois.com/articles/claude-can-sometimes-prove-it
162•lairv•3d ago•49 comments

Vapor chamber tech keeps iPhone 17 Pro cool

https://spectrum.ieee.org/iphone-17-pro-vapor-chamber
37•rbanffy•6h ago•101 comments

Bezier Curve as Easing Function in C++

https://asawicki.info/news_1790_bezier_curve_as_easing_function_in_c
39•ibobev•6h ago•4 comments

Show HN: Math2Tex – Convert handwritten math and complex notes to LaTeX text

30•leoyixing•3d ago•12 comments

UNESCO Launches the First Virtual Museum of Stolen Cultural Objects

https://www.unesco.org/en/articles/unesco-launches-worlds-first-virtual-museum-stolen-cultural-ob...
3•gnabgib•1h ago•0 comments

The LLM Lobotomy?

https://learn.microsoft.com/en-us/answers/questions/5561465/the-llm-lobotomy
77•sgt3v•1h ago•34 comments

PYREX vs. pyrex: What's the difference?

https://www.corning.com/worldwide/en/products/life-sciences/resources/stories/in-the-field/pyrex-...
69•lisper•13h ago•54 comments

Git: Introduce Rust and announce it will become mandatory in the build system

https://lore.kernel.org/git/20250904-b4-pks-rust-breaking-change-v1-0-3af1d25e0be9@pks.im/
264•WhyNotHugo•7h ago•224 comments

Visa holders on vacation have 15 hours to return to US or pay $100k fee

https://timesofindia.indiatimes.com/technology/tech-news/microsoft-has-a-24-hour-deadline-warning...
274•irthomasthomas•7h ago•393 comments

If all the world were a monorepo

https://jtibs.substack.com/p/if-all-the-world-were-a-monorepo
240•sebg•4d ago•67 comments

Overcoming barriers of hydrogen storage with a low-temperature hydrogen battery

https://www.isct.ac.jp/en/news/okmktjxyrvdc
47•rustoo•8h ago•51 comments

If you are good at code review, you will be good at using AI agents

https://www.seangoedecke.com/ai-agents-and-code-review/
114•imasl42•14h ago•113 comments