The LLM Lobotomy?

https://learn.microsoft.com/en-us/answers/questions/5561465/the-llm-lobotomy
74•sgt3v•1h ago

Comments

esafak•1h ago
This was the perfect opportunity to share the evidence. I think undisclosed quantization is definitely a thing. We need benchmarks to be periodically re-evaluated to ward against this.

Providers should keep timestamped models fixed and assign modified versions a new timestamp, and a new price if they want. The model with the "latest" tag could change over time, like a Docker image. Then we could make an informed decision about which version to use. Companies want to cost-optimize their cake and eat it too.

edit: I have the same complaint about my Google Home devices. The models they use today are indisputably worse than the ones they used five whole years ago. And features have been removed without notice. Qualitatively, the devices are no longer what I bought.
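
A minimal sketch of the pinned-tag idea, using the OpenAI Python client, which already exposes dated snapshots alongside a floating alias (the snapshot name here is illustrative, not a claim about what's currently served):

```python
# Sketch of the pinned-tag idea. Dated snapshots act like immutable
# Docker image tags; the bare alias acts like "latest".
from openai import OpenAI

client = OpenAI()

# Floating alias: the provider may repoint this at a newer build.
floating = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Say hello."}],
)

# Pinned dated snapshot: should resolve to the same weights every time.
pinned = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # illustrative dated snapshot
    messages=[{"role": "user", "content": "Say hello."}],
)
```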

colordrops•56m ago
In addition to quantization, I suspect the additions they make continually to their hidden system prompt for legal, business, and other reasons slowly degrade responses over time as well.
jonplackett•49m ago
This is quite similar to all the modifications Intel had to make because of Spectre - I bet those system prompts have grown exponentially.
icyfox•32m ago
I guarantee you the weights are already versioned like you're describing. Each training run results in a static bundle of outputs and these are very much pinned (OpenAI has confirmed multiple times that they don't change the model weights once they issue a public release).

> Not quantized. Weights are the same. If we did change the model, we’d release it as a new model with a new name in the API.

- [Ted Sanders](https://news.ycombinator.com/item?id=44242198) (OpenAI)

The problem is that most of these issues stem from broader infrastructure problems, like numerical instability at inference time. Since this affects their whole service pipeline, the logic can't really be encapsulated in a frozen environment like a Docker container. I suppose _technically_ they could maintain a separate inference cluster for each point release, but then previous models wouldn't benefit from common infrastructure improvements, load balancing would be harder to shard across GPUs, and the coordination might be so logistically hard as to be effectively impossible.

https://www.anthropic.com/engineering/a-postmortem-of-three-...

https://thinkingmachines.ai/blog/defeating-nondeterminism-in...

xg15•1m ago
Sorry, but this makes no sense. Numerical instability would lead to random fluctuations in output quality, but not to a continuous slow decline like the OP described.
gregsadetsky•24m ago
I commented on the forum asking Sarge whether they could share some of their test results.

If they do, I think that it will add a lot to this conversation. Hope it happens!

ProjectArcturis•1h ago
I'm confused why this is addressed to Azure instead of OpenAI. Isn't Azure just offering a wrapper around ChatGPT?

That said, I would also love to see some examples or data, instead of just "it's getting worse".

SubiculumCode•1h ago
I don't remember where I saw it, but I recall a claim that Azure-hosted models performed worse than those hosted by OpenAI.
transcriptase•57m ago
Explains why the enterprise Copilot ChatGPT wrapper that they shoehorn into every piece of Office 365 performs worse than a badly configured local LLM.
bongodongobob•39m ago
They most definitely do. They have been lobotomized in some way to be ultra corporate friendly. I can only use their M365 Copilot at work and it's absolute dogshit at writing code more than maybe 100 lines. It can barely write correct PowerShell. Luckily, I really only need it for quick and dirty short PS scripts.
gwynforthewyn•1h ago
What's the conversation that you're looking to have here? There are fairly widespread claims that GPT-5 is worse than 4, and that's what the help article you've linked to says. I'm not sure how this furthers dialog about or understanding of LLMs, though, it reads to _me_ like this question just reinforces a notion that lots of people already agree with.

What's your aim here, sgt3v? I'd love to positively contribute, but I don't see how this link gets us anywhere.

bigchillin•1h ago
This is why we have open source. I noticed this with Cursor; it’s not just an Azure problem.
ukFxqnLa2sBSBf6•1h ago
It’s a good thing the author provided no data or examples. Otherwise, there might be something to actually talk about.
cush•1h ago
Is setting temperature to 0 even a valid way to measure LLM performance over time, all else equal?
jonplackett•46m ago
It could be that performance on temp zero has declined but performance on a normal temp is the same or better.

I wonder if temp zero would be more influenced by changes to the system prompt too. I can imagine it making responses more brittle.

fortyseven•42m ago
I'd have assumed a fixed seed was used, but he doesn't mention that. Weird. Maybe he meant that?
criemen•34m ago
Even with temperature 0, LLM output will not be deterministic; it will just have less randomness (in a loosely defined sense) than at temperature 1. There was a recent post on the frontpage about fully deterministic sampling, but it turns out to be quite difficult to achieve.
visarga•2m ago
It's because batch size is dynamic. So a different batch size will change the output even on temp 0.
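
A minimal sketch of the underlying mechanism (just the arithmetic, not any provider's infrastructure): floating-point addition is not associative, so anything that reorders reductions, batch size included, can perturb the logits and occasionally flip the argmax token even at temperature 0.

```python
# Float addition is not associative: a different reduction order
# (as a different batch/tile shape would cause) changes the result.
import numpy as np

a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0 -- c is absorbed into b before a cancels it

# The same effect at scale: summing identical numbers in two
# different orders usually gives slightly different answers.
x = np.random.default_rng(0).standard_normal(10_000).astype(np.float32)
print(x.sum() == x.reshape(100, 100).sum(axis=0).sum())  # often False
```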
SirensOfTitan•1h ago
I’m convinced all of the major LLM providers silently quantize their models. The absolute worst was Google’s transition from Gemini 2.5 Pro 3-25 checkpoint to the May checkpoint, but I’ve noticed this effect with Claude and GPT over the years too.

I couldn’t imagine relying on any closed models for a business because of this highly dishonest and deceptive practice.

briga•1h ago
I have a theory: all these people reporting degrading model quality over time aren't actually seeing model quality deteriorate. What they are actually doing is discovering that these models aren't as powerful as they initially thought (i.e., they are expanding their sample size for judging how good the model is). The probabilistic nature of LLMs produces a lot of confused thinking about how good a model is: just because a model produces nine excellent responses doesn't mean the tenth won't be garbage.
chaos_emergent•58m ago
Yes, exactly. My theory is that the novelty of a new generation of LLMs tends to inflate people's perception of the model, with a reversion to better-calibrated expectations over time. If the developer reported numerical evaluations that drifted over time, I'd be more convinced of model change.
colordrops•57m ago
Did any of you read the article? They have a test framework that objectively shows the model getting worse over time.
Aurornis•44m ago
I read the article. No proof was included. Not even a graph of declining results.
yieldcrv•57m ago
fta: “I am glad I have proof of this with the test system”

I think they have receipts, but did not post them there

Aurornis•45m ago
A lot of the claims I’ve seen say there’s proof, but the details are never shared.

Even a simple graph of the output would be better than nothing, but instead it’s just an empty claim.

yieldcrv•2m ago
That's been my experience too

but I use local models, sometimes the same ones for years, and the consistency and expectations there are noteworthy, while I have doubts about the quality consistency of closed models in the cloud. I don't see these kinds of complaints from people using local models, which undermines the idea that people were just wowed three months ago and are less impressed now.

so perhaps it's just a matter of transparency

but I think there is consistent fine-tuning occurring, alongside filters added and removed in an opaque way in front of the model

zzzeek•45m ago
your theory does not hold up for this specific article, as they carefully explained they are sending identical inputs into the model each time and observing progressively worse results with other variables unchanged. (Though to be fair, others have noted they provided no replication details as to how they arrived at these results.)
vintermann•45m ago
They test specific prompts with temperature 0. It is of course possible that all their test prompts were lucky, but even then, shouldn't you see an immediate drop followed by a flat or increasing line?

Also, from what I understand from the article, it's not a difficult task but an easily machine-checkable one, i.e., whether the output conforms to a specific format.
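
A sketch of what such a machine-checkable test might look like (the expected JSON schema here is invented for illustration):

```python
# Hypothetical conformance test: does the output parse as JSON and
# contain the fields we asked for? Pass/fail is objective, so any
# degradation shows up as a falling pass rate over time.
import json

REQUIRED_FIELDS = {"summary", "sentiment"}  # invented schema

def conforms(output: str) -> bool:
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_FIELDS <= data.keys()

assert conforms('{"summary": "ok", "sentiment": "positive"}')
assert not conforms("Sure! Here's the JSON you asked for: ...")
```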

nothrabannosir•44m ago
TFA is about someone running the same test suite with 0 temperature and fixed inputs and fixtures on the same model over months on end.

What’s missing is the actual evidence. Which I would love of course. But assuming they’re not actively lying, this is not as subjective as you suggest.
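
A longitudinal suite of that shape is simple to sketch (the model name, prompt set, and use of the API's best-effort seed parameter are assumptions, not details from TFA):

```python
# Sketch: fixed prompts, temperature 0, fixed seed, timestamped logs.
# Diffing output hashes across months makes drift visible.
import datetime
import hashlib
import json

from openai import OpenAI

client = OpenAI()
PROMPTS = ["Extract the date from: 'Meeting on 2024-03-05.'"]  # toy fixture

def run_suite(model: str) -> None:
    for prompt in PROMPTS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            seed=42,  # best-effort determinism; not guaranteed by the API
        )
        text = resp.choices[0].message.content or ""
        print(json.dumps({
            "date": datetime.date.today().isoformat(),
            "prompt": prompt,
            "output_sha256": hashlib.sha256(text.encode()).hexdigest(),
        }))  # append to a log and compare across runs

run_suite("gpt-4o-2024-08-06")  # illustrative pinned snapshot
```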

zzzeek•46m ago
I'm sure MSFT will offer this person some upgraded API tier that somewhat improves the issues, though not terrifically, for only ten times the price.
jug•44m ago
At least on OpenRouter, you can often verify what quant a provider is using for a particular model.
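
A sketch of that lookup (the endpoint path and field names reflect my reading of OpenRouter's public API and may differ; check the current docs):

```python
# List per-provider endpoints for a model on OpenRouter and print the
# advertised quantization. Path and field names are assumptions.
import requests

url = "https://openrouter.ai/api/v1/models/meta-llama/llama-3.1-70b-instruct/endpoints"
resp = requests.get(url, timeout=30)
resp.raise_for_status()
for ep in resp.json()["data"]["endpoints"]:
    print(ep.get("provider_name"), ep.get("quantization"))
```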
mehdibl•18m ago
Since when did LLMs become deterministic?

Show HN: WaFlow – Local sandbox to prototype WhatsApp-style bots

1•leandrobon•3m ago•0 comments

103 Early Hints

https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/103
2•tosh•4m ago•0 comments

Self-Reliant Programmer Manifesto

https://yobibyte.github.io/self_reliant_programmer.html
2•yobibyte•7m ago•0 comments

Tariffs make it harder to justify US investments, automakers and suppliers warn

https://www.autonews.com/manufacturing/an-trump-tariffs-suppliers-automaker-letter-mbs-0919/
2•rntn•8m ago•0 comments

The New AirPods Can Translate Languages in Your Ears. This Is Profound

https://www.nytimes.com/2025/09/18/technology/personaltech/new-airpods-language-translation-featu...
1•whack•9m ago•0 comments

Albert Einstein Gives a Speech Praising Immigrants' Contributions to America

https://www.openculture.com/2025/09/albert-einstein-gives-a-speech-praising-diversity-immigrants-...
2•gslin•10m ago•0 comments

AI tool detects hidden consciousness in brain-injured patients

https://www.psypost.org/new-ai-tool-detects-hidden-consciousness-in-brain-injured-patients-by-ana...
1•geox•16m ago•0 comments

Designing Data Intensive Applications 2nd edition

https://www.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/
2•dondraper36•16m ago•1 comments

Ask HN: What Would You Do If You Had 10 Years Left to Live?

1•vinnyglennon•18m ago•2 comments

AV2 Arriving: What We Know, and What We Don't Know

https://www.streamingmedia.com/Articles/ReadArticle.aspx?ArticleID=171548
1•CharlesW•19m ago•0 comments

RIP "Browsers"

https://blog.jim-nielsen.com/2025/rip-browsers/
1•ulrischa•19m ago•0 comments

What went wrong (& what went right) with AIO with Andres Freund

https://talkingpostgres.com/episodes/what-went-wrong-what-went-right-with-aio-with-andres-freund/...
2•pella•24m ago•0 comments

Intervision 2025: Soviet-era version of beloved song contest is revived

https://en.wikipedia.org/wiki/Intervision_2025
1•lovegrenoble•25m ago•1 comments

Ask HN: Where to store my bag for a few hours in SF?

1•wibbily•26m ago•1 comments

Ukraine reveals jammer-resistant Kamikaze strike drones

https://www.tomshardware.com/tech-industry/ukraine-reveals-jammer-resistant-kamikaze-strike-drone...
4•giuliomagnifico•30m ago•0 comments

How Attabotics Went Off the Rails

https://thelogic.co/news/the-big-read/calgary-startup-attabotics-bankruptcy/
1•gnabgib•33m ago•0 comments

Enhancing Startup Success Predictions in Venture Capital

https://arxiv.org/abs/2408.09420
1•Bostonian•33m ago•1 comments

Everyday Rails Testing with RSpec Updated

https://everydayrails.com/2025/09/16/rspec-book-rails-7
1•ruralocity•36m ago•0 comments

Bringing restartable sequences out of the niche

https://lwn.net/Articles/1033955/
12•PaulHoule•42m ago•0 comments

Querying Graph-Relational Data (EdgeQL)

https://arxiv.org/abs/2507.16089
2•gregsadetsky•43m ago•1 comments

Parser generators vs. handwritten parsers: surveying language implementations

https://notes.eatonphil.com/parser-generators-vs-handwritten-parsers-survey-2021.html
3•fuzztester•44m ago•0 comments

TV Time Machine: A Raspberry Pi That Plays Random 90s TV

https://quarters.captaintouch.com/blog/posts/2025-09-20-tv-time-machine-a-raspberry-pi-that-plays...
17•capitain•44m ago•3 comments

Does China Underconsume?

https://www.global-developments.org/p/does-china-underconsume
2•alphabetatango•45m ago•0 comments

New H-1B visa fee will not apply to existing holders, official says

https://www.axios.com/2025/09/20/trump-h-1b-immigration-visas
10•srameshc•46m ago•3 comments

Time to First Byte

https://web.dev/articles/ttfb
2•tosh•46m ago•0 comments

Ongoing Tradeoffs, and Incidents as Landmarks

https://ferd.ca/ongoing-tradeoffs-and-incidents-as-landmarks.html
1•Noghartt•47m ago•0 comments

Native TypeScript-go and esbuild in the browser (WASM)

https://beta.fullstacked.org
2•cplepage•48m ago•0 comments

Destroying Asteroid 2024 YR4 Is the Best Option to Stop It from Hitting the Moon

https://www.universetoday.com/articles/destroying-asteroid-2024-yr4-is-the-best-option-to-stop-it...
2•rbanffy•54m ago•0 comments

Corruption: When norms upstage the law – Knowable Magazine

https://knowablemagazine.org/content/article/society/2025/how-corruption-interplays-with-social-n...
1•rbanffy•56m ago•0 comments

LLVM's AI policy vs. code of conduct vs. reality

https://discourse.llvm.org/t/our-ai-policy-vs-code-of-conduct-and-vs-reality/88300
2•MaskRay•56m ago•0 comments