frontpage.


Xkcd: Game AIs

https://xkcd.com/1002/
1•ravenical•31s ago•0 comments

Windows 11 is finally killing off legacy printer drivers in 2026

https://www.windowscentral.com/microsoft/windows-11/windows-11-finally-pulls-the-plug-on-legacy-p...
1•ValdikSS•1m ago•0 comments

From Offloading to Engagement (Study on Generative AI)

https://www.mdpi.com/2306-5729/10/11/172
1•boshomi•2m ago•1 comments

AI for People

https://justsitandgrin.im/posts/ai-for-people/
1•dive•3m ago•0 comments

Rome is studded with cannon balls (2022)

https://essenceofrome.com/rome-is-studded-with-cannon-balls
1•thomassmith65•9m ago•0 comments

8-piece tablebase development on Lichess (op1 partial)

https://lichess.org/@/Lichess/blog/op1-partial-8-piece-tablebase-available/1ptPBDpC
2•somethingp•10m ago•0 comments

US to bankroll far-right think tanks in Europe against digital laws

https://www.brusselstimes.com/1957195/us-to-fund-far-right-forces-in-europe-tbtb
3•saubeidl•11m ago•0 comments

Ask HN: Have AI companies replaced their own SaaS usage with agents?

1•tuxpenguine•14m ago•0 comments

pi-nes

https://twitter.com/thomasmustier/status/2018362041506132205
1•tosh•16m ago•0 comments

Show HN: Crew – Multi-agent orchestration tool for AI-assisted development

https://github.com/garnetliu/crew
1•gl2334•17m ago•0 comments

New hire fixed a problem so fast, their boss left to become a yoga instructor

https://www.theregister.com/2026/02/06/on_call/
1•Brajeshwar•18m ago•0 comments

Four horsemen of the AI-pocalypse line up capex bigger than Israel's GDP

https://www.theregister.com/2026/02/06/ai_capex_plans/
1•Brajeshwar•18m ago•0 comments

A free Dynamic QR Code generator (no expiring links)

https://free-dynamic-qr-generator.com/
1•nookeshkarri7•19m ago•1 comments

nextTick but for React.js

https://suhaotian.github.io/use-next-tick/
1•jeremy_su•21m ago•0 comments

Show HN: I Built an AI-Powered Pull Request Review Tool

https://github.com/HighGarden-Studio/HighReview
1•highgarden•21m ago•0 comments

Git-am applies commit message diffs

https://lore.kernel.org/git/bcqvh7ahjjgzpgxwnr4kh3hfkksfruf54refyry3ha7qk7dldf@fij5calmscvm/
1•rkta•24m ago•0 comments

ClawEmail: 1min setup for OpenClaw agents with Gmail, Docs

https://clawemail.com
1•aleks5678•30m ago•1 comments

UnAutomating the Economy: More Labor but at What Cost?

https://www.greshm.org/blog/unautomating-the-economy/
1•Suncho•37m ago•1 comments

Show HN: Gettorr – Stream magnet links in the browser via WebRTC (no install)

https://gettorr.com/
1•BenaouidateMed•38m ago•0 comments

Statin drugs safer than previously thought

https://www.semafor.com/article/02/06/2026/statin-drugs-safer-than-previously-thought
1•stareatgoats•40m ago•0 comments

Handy when you just want to distract yourself for a moment

https://d6.h5go.life/
1•TrendSpotterPro•42m ago•0 comments

More States Are Taking Aim at a Controversial Early Reading Method

https://www.edweek.org/teaching-learning/more-states-are-taking-aim-at-a-controversial-early-read...
2•lelanthran•43m ago•0 comments

AI will not save developer productivity

https://www.infoworld.com/article/4125409/ai-will-not-save-developer-productivity.html
1•indentit•48m ago•0 comments

How I do and don't use agents

https://twitter.com/jessfraz/status/2019975917863661760
1•tosh•54m ago•0 comments

BTDUex Safe? The Back End Withdrawal Anomalies

1•aoijfoqfw•57m ago•0 comments

Show HN: Compile-Time Vibe Coding

https://github.com/Michael-JB/vibecode
7•michaelchicory•59m ago•1 comments

Show HN: Ensemble – macOS App to Manage Claude Code Skills, MCPs, and Claude.md

https://github.com/O0000-code/Ensemble
1•IO0oI•1h ago•1 comments

PR to support XMPP channels in OpenClaw

https://github.com/openclaw/openclaw/pull/9741
1•mickael•1h ago•0 comments

Twenty: A Modern Alternative to Salesforce

https://github.com/twentyhq/twenty
1•tosh•1h ago•0 comments

Raspberry Pi: More memory-driven price rises

https://www.raspberrypi.com/news/more-memory-driven-price-rises/
2•calcifer•1h ago•0 comments

The LLM Lobotomy?

https://learn.microsoft.com/en-us/answers/questions/5561465/the-llm-lobotomy
140•sgt3v•4mo ago

Comments

esafak•4mo ago
This was the perfect opportunity to share the evidence. I think undisclosed quantization is definitely a thing. We need benchmarks to be periodically re-evaluated to guard against this.

Providers should keep timestamped models fixed and assign modified versions a new timestamp (and a new price, if they want). The model with the "latest" tag could change over time, like a Docker image. Then we could make an informed decision about which version to use. Companies want to cost-optimize their cake and eat it too.
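
Concretely, the Docker-style pinning already exists on the client side for providers that publish dated snapshots. A minimal sketch, assuming an OpenAI-style client; the snapshot name is only an example of the convention, not a recommendation:

```python
# Sketch of version pinning from the client side, assuming an OpenAI-style API.
# The dated snapshot name is illustrative of the naming convention.
from openai import OpenAI

client = OpenAI()

PINNED_MODEL = "gpt-4o-2024-08-06"   # like a pinned Docker tag: should not change underneath you
FLOATING_MODEL = "gpt-4o"            # like ":latest": may silently point at something new

resp = client.chat.completions.create(
    model=PINNED_MODEL,
    messages=[{"role": "user", "content": "Say 'pinned'."}],
    temperature=0,
)
print(resp.model)  # the response echoes which concrete snapshot actually served the request
```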

edit: I have the same complaint about my Google Home devices. The models they use today are indisputably worse than the ones they used five whole years ago. And features have been removed without notice. Qualitatively, the devices are no longer what I bought.

colordrops•4mo ago
In addition to quantization, I suspect the additions they make continually to their hidden system prompt for legal, business, and other reasons slowly degrade responses over time as well.
jonplackett•4mo ago
This is quite similar to all the modifications Intel had to make due to Spectre - I bet those system prompts have grown exponentially.
icyfox•4mo ago
I guarantee you the weights are already versioned like you're describing. Each training run results in a static bundle of outputs and these are very much pinned (OpenAI has confirmed multiple times that they don't change the model weights once they issue a public release).

> "Not quantized. Weights are the same. If we did change the model, we'd release it as a new model with a new name in the API."

- [Ted Sanders](https://news.ycombinator.com/item?id=44242198) (OpenAI)

The problem here is that most issues stem from broader infrastructure problems like numerical instability at inference time. Since this affects their whole service pipeline, the logic can't really be encapsulated in a frozen environment like a Docker container. I suppose _technically_ they could maintain a separate inference cluster for each of their point releases, but that would mean previous models don't benefit from common infrastructure improvements, load balancing would be harder to shard across GPUs, and it might be so logistically hard to coordinate as to be effectively impossible.

https://www.anthropic.com/engineering/a-postmortem-of-three-... https://thinkingmachines.ai/blog/defeating-nondeterminism-in...

xg15•4mo ago
Sorry, but this makes no sense. Numerical instability would lead to random fluctuations in output quality, but not to a continuous slow decline like the OP described.

I've heard of similar experiences from real-life acquaintances, where a prompt worked reliably for hundreds of requests per day for several months - and then, when a newer model was released, the model suddenly started to make mistakes, ignore parts of the prompt, etc.

I agree, it doesn't have to be deliberate malice like intentionally nerfing a model to make people switch to the newer one - it might just be that fewer resources are allocated to the older model once the newer one is available, and so the inference parameters change - but some effect around the release of a newer model does seem to be there.

icyfox•4mo ago
I'm responding to the parent comment who's suggesting we version control the "model" in Docker. There are infra reasons why companies don't do that. Numerical instability is one class of inference issues, but there can be other bugs in the stack separate from them intentionally changing the weights or switching to a quantized model.

As for the original forum post:

- Multiple numerical computation bugs can compound to make things worse (we saw this in the latest Anthropic post-mortem)

- OP didn't provide any details on eval methodology, so I don't think it's worth speculating on this anecdotal report until we see more data

xg15•4mo ago
Good points. And I also agree we'd have to see the data that OP collected.

If it indeed did show a slow decline over time and OpenAI did not change the weights, then something does not add up.

esafak•4mo ago
That's a great point. However, I think for practical purposes we can treat the serving pipeline as part and parcel of the model. So it is dishonest of companies to say they haven't changed the model while making cost optimizations that impair the model's effective intelligence.
gregsadetsky•4mo ago
I commented on the forum asking Sarge whether they could share some of their test results.

If they do, I think that it will add a lot to this conversation. Hope it happens!

gregsadetsky•4mo ago
Update: Sarge responded in the forum and added more information.

I asked them to share data/dates as much as that’s possible - fingers crossed

ProjectArcturis•4mo ago
I'm confused why this is addressed to Azure instead of OpenAI. Isn't Azure just offering a wrapper around ChatGPT?

That said, I would also love to see some examples or data, instead of just "it's getting worse".

SubiculumCode•4mo ago
I don't remember where I saw it, but I recall a claim that Azure-hosted models performed worse than those hosted by OpenAI.
transcriptase•4mo ago
Explains why the enterprise Copilot ChatGPT wrapper that they shoehorn into every piece of Office 365 performs worse than a badly configured local LLM.
bongodongobob•4mo ago
They most definitely do. They have been lobotomized in some way to be ultra corporate-friendly. I can only use their M365 Copilot at work, and it's absolute dogshit at writing anything longer than maybe 100 lines of code. It can barely write correct PowerShell. Luckily, I really only need it for quick and dirty short PS scripts.
sebazzz•4mo ago
I agree. I asked it for some help refactoring a database and some of the SQL was quite broken. It also doesn't help that their streaming code is broken, so LLM responses sometimes render garbled in the web browser (both Firefox and Edge, so it's not a browser issue) and you need to refresh after a response to check whether it was a rendering problem or just a drunk LLM.
SBArbeit•4mo ago
I know that OpenAI has made compute deals with other companies, and as time goes on the share of inference running in each provider's data centers will shift, but I doubt that much, if any, of it has moved out of Microsoft Azure data centers yet, so that's not a reason for a difference in model performance.

With that said, Microsoft has a different level of responsibility to provide safety, both to its customers and to its stakeholders, than OpenAI or any other frontier provider. That's not a criticism of OpenAI or Anthropic or anyone else, who I believe are all trying their best to provide safe usage. (Well, other than xAI and Grok, for which the lack of safety is a feature, not a bug.)

The risk to Microsoft of getting this wrong is simply higher than it is for other companies, and that's why they have a strong focus on Responsible AI (RAI) [1]. I don't know the details, but I have to assume there's a layer of RAI processing on models served through Azure OpenAI that's not there when using OpenAI models directly through the OpenAI API. That layer is valuable for the companies who choose to run their inference through Azure and who also want to maximize safety.

I wonder if that's where some of the observed changes are coming from. I hope the commenter posts their proof for further inspection. It would help everyone.

[1]: https://www.microsoft.com/en-us/ai/responsible-ai

gwynforthewyn•4mo ago
What's the conversation that you're looking to have here? There are fairly widespread claims that GPT-5 is worse than 4, and that's what the help article you've linked to says. I'm not sure how this furthers dialogue about, or understanding of, LLMs, though; it reads to _me_ like this question just reinforces a notion that lots of people already agree with.

What's your aim here, sgt3v? I'd love to positively contribute, but I don't see how this link gets us anywhere.

bn-l•4mo ago
Maybe to prompt more anecdotes on how gpt-$ is the money-making GPT - where they gut quality and hold prices steady to reduce losses?

I can tell you that what the post describes is exactly what I've seen too: degraded performance and excruciating slowness.

bigchillin•4mo ago
This is why we have open source. I noticed this with Cursor; it's not just an Azure problem.
ukFxqnLa2sBSBf6•4mo ago
It’s a good thing the author provided no data or examples. Otherwise, there might be something to actually talk about.
cush•4mo ago
Is setting temperature to 0 even a valid way to measure LLM performance over time, all else equal?
jonplackett•4mo ago
It could be that performance on temp zero has declined but performance on a normal temp is the same or better.

I wonder if temp zero would be more influenced by changes to the system prompt too. I can imagine it making responses more brittle.

fortyseven•4mo ago
I'd have assumed a fixed seed was used, but he doesn't mention that. Weird. Maybe he meant that?
numpad0•4mo ago
Pure sci-fi idea: what if actually nothing was changed, but RNGs are becoming less random as we extract more randomness out of the universe?
maxbond•4mo ago
I bet they did both. If I'm reading the documentation right you have to supply a seed in order to get "best effort" determinism.

https://learn.microsoft.com/en-us/azure/ai-foundry/openai/re...
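
For what it's worth, the kind of longitudinal check being discussed fits in a few lines. A sketch assuming the OpenAI Python client (adapt the client setup for Azure OpenAI); `seed` is the documented best-effort knob, and `system_fingerprint` is reported to change when the backend configuration does:

```python
# Sketch of a longitudinal regression probe: same prompt, temperature 0,
# fixed seed, logged over time so drift shows up as a diff rather than a feeling.
import datetime
import json
from openai import OpenAI

client = OpenAI()

def probe(model: str, prompt: str) -> dict:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        seed=42,  # "best effort" determinism, per the docs linked above
    )
    return {
        "ts": datetime.datetime.utcnow().isoformat(),
        "model": resp.model,
        "fingerprint": resp.system_fingerprint,  # tracks backend configuration changes
        "text": resp.choices[0].message.content,
    }

# Append one record per run; diff the log to spot drift on identical inputs.
with open("probe_log.jsonl", "a") as f:
    f.write(json.dumps(probe("gpt-4o-2024-08-06", "Reply with exactly: OK")) + "\n")
```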

criemen•4mo ago
Even with temperature 0, the LLM output will not be deterministic. It will just have less randomness (not defined precisely) than with temperature 1. There was a recent post on the frontpage about fully deterministic sampling, but it turns out to be quite difficult.
visarga•4mo ago
It's because batch size is dynamic. So a different batch size will change the output even on temp 0.
criemen•4mo ago
Batch size is dynamic; in MoE the experts chosen apparently depend on the batch (not only on your single inference request, which sounds weird to me, but I'm just an end user); no one has audited the inference pipeline for floating-point nondeterminism; and I'm not even sure that temperature 0 implies deterministic sampling (the quick math formula I found has e^(logit/temp), which means that 0 is not a valid value anyway and needs special handling).
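
For reference, the usual formulation is softmax(logits / T), so T = 0 really is outside the formula and samplers special-case it as greedy argmax. A toy sketch of that convention:

```python
# Toy illustration of temperature sampling: p_i is proportional to exp(logit_i / T).
# T = 0 is undefined in the formula, so samplers treat it as greedy argmax,
# which is why "temperature 0" is shorthand for (intended) deterministic decoding.
import numpy as np

def sample(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    if temperature == 0:
        return int(np.argmax(logits))        # greedy: no randomness at all
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())    # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5])
print(sample(logits, 0.0, rng))   # always index 0
print(sample(logits, 1.0, rng))   # varies with the RNG state
```
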
Spivak•4mo ago
I don't think it's a valid measure across models but, as in the OP, it's a great measure for when they mess with "the same model" behind the scenes.

That being said, we do also keep a test suite to check that model updates don't result in worse results for our users, and it has worked well enough. We had to skip a few versions of Sonnet because it stopped being able to complete tasks (on the same data) it could previously. I don't blame Anthropic; I would be crazy to assume that new models are a strict improvement across all tasks and domains.

I do just wish they would stop deprecating old models; once you have something working to your satisfaction it would be nice to freeze it. Ah well, only for local models.

SirensOfTitan•4mo ago
I’m convinced all of the major LLM providers silently quantize their models. The absolute worst was Google’s transition from Gemini 2.5 Pro 3-25 checkpoint to the May checkpoint, but I’ve noticed this effect with Claude and GPT over the years too.

I couldn’t imagine relying on any closed models for a business because of this highly dishonest and deceptive practice.

bn-l•4mo ago
You can also be clever with language. You can say "we never intentionally degrade model performance" and then claim you had no idea a quant would hurt performance, because it was meant to make the model better (faster).
briga•4mo ago
I have a theory: all these people reporting degrading model quality over time aren't actually seeing model quality deteriorate. What they are actually doing is discovering that these models aren't as powerful as they initially thought (i.e. expanding their sample size for judging how good the model is). The probabilistic nature of LLMs produces a lot of confused thinking about how good a model is; just because a model produces nine excellent responses doesn't mean the tenth response won't be garbage.
chaos_emergent•4mo ago
Yes, exactly. My theory is that the novelty of a new generation of LLMs' performance tends to inflate people's perceptions of the model, with a reversion to a better-calibrated expectation over time. If the developer reported numerical evaluations that drifted over time, I'd be more convinced of model change.
colordrops•4mo ago
Did any of you read the article? They have a test framework that objectively shows the model getting worse over time.
Aurornis•4mo ago
I read the article. No proof was included. Not even a graph of declining results.
colordrops•4mo ago
Ok fair, but not including the data is not the same as the article saying it was subjective "feel".
yieldcrv•4mo ago
fta: “I am glad I have proof of this with the test system”

I think they have receipts, but did not post them there

Aurornis•4mo ago
A lot of the people making these claims say they have proof, but the details are never shared.

Even a simple graph of the output would be better than nothing, but instead it’s just an empty claim.

yieldcrv•4mo ago
That's been my experience too

but I use local models, sometimes the same ones for years already, and the consistency and predictability there are noteworthy, while I also have doubts about the quality consistency I get from closed models in the cloud. I don't see these kinds of complaints from people using local models, which undermines the idea that people were just wowed three months ago and are less impressed now.

so perhaps it's just a matter of transparency

but I think there is consistent fine-tuning occurring, alongside filters added and removed in an opaque way in front of the model

zzzeek•4mo ago
your theory does not hold up for this specific article as they carefully explained they are sending identical inputs into the model each time and observing progressively worse results with other variables unchanged. (though to be fair, others have noted they provided no replication details as to how they arrived at these results.)
vintermann•4mo ago
They test specific prompts with temperature 0. It is of course possible that all their test prompts were lucky, but even then, shouldn't you see an immediate drop followed by a flat or increasing line?

Also, from what I understand from the article, it's not a difficult task but an easily machine-checkable one, i.e. whether the output conforms to a specific format.

Spivak•4mo ago
If it was random luck, wouldn't you expect about half the answers to be better? Assuming the OP isn't lying I don't think there's much room for luck when you get all the questions wrong on a T/F test.
lostmsu•4mo ago
With T=0 on the same model you should get the same exact output text. If they are not getting it, other environmental factors invalidate the test result.
nothrabannosir•4mo ago
TFA is about someone running the same test suite, with temperature 0 and fixed inputs and fixtures, on the same model for months on end.

What’s missing is the actual evidence. Which I would love of course. But assuming they’re not actively lying, this is not as subjective as you suggest.

gtsop•4mo ago
I see your point, but no, it's getting objectively worse. I have a similar experience from casually using ChatGPT for various use cases: when 5 dropped I noticed it was very fast but oddly got some details off. As time moved on it became slower and the output deteriorated.
zzzeek•4mo ago
I'm sure MSFT will offer this person some upgraded API tier that somewhat improves the issues, though not terrifically, for only ten times the price.
jug•4mo ago
At least on OpenRouter, you can often verify what quant a provider is using for a particular model.
mehdibl•4mo ago
Since when did LLMs become deterministic?
thomasmg•4mo ago
LLMs are just software + data and can be made deterministic, in the same way a pseudo-random number generator can be made deterministic by using the same seed. For an LLM, you typically set temperature to 0 or set the random seed to the same value, run it on the same hardware (or an emulation of it), and otherwise ensure the (floating point) calculations get the exact same results. I think that's it. In reality it's not that easy, but it's possible.
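
A local-model sketch of that claim, assuming the Hugging Face transformers stack and greedy decoding (the model choice is arbitrary):

```python
# Minimal local sketch: greedy decoding (do_sample=False) on fixed weights
# gives the same tokens on every run of the same hardware/software stack.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(0)  # only matters if sampling were enabled
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, do_sample=False, max_new_tokens=16)
print(tok.decode(out[0]))  # identical across runs, barring kernel/hardware changes
```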
mr_toad•4mo ago
Unfortunately, because floating point addition isn't associative, and because GPUs don't always perform calculations in the same order, you won't always get the same result even with a temperature of zero.
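
The non-associativity is easy to demonstrate even on a CPU; parallel GPU reductions reorder sums like this all the time:

```python
# Floating-point addition is not associative: reordering a reduction
# (as parallel GPU kernels routinely do) can change the result.
a, b, c = 0.1, 1e20, -1e20
print((a + b) + c)   # 0.0 -- a is swallowed by the huge intermediate sum
print(a + (b + c))   # 0.1 -- a survives because b and c cancel first
```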
ant6n•4mo ago
I used to think running your own local model is silly because it’s slow and expensive, but the nerfing of ChatGPT and Gemini is so aggressive it’s starting to make a lot more sense. I want the smartest model, and I don’t want to second guess some potentially quantized black box.
bbminner•4mo ago
Am I the only person who can sense the exact moment an LLM-written response kicked in? :) "sharing some of the test results/numbers you have would truly help cement this case!" - c'mon :)
gregsadetsky•4mo ago
I actually 100% wrote that comment myself haha!! See https://news.ycombinator.com/item?id=45316437

I think it would have sounded more reasonable in French, which is my actual native tongue. (i.e. I subconsciously translate from French when I'm writing in English)

((this comment was also written without AI!!)) :-)

bbminner•4mo ago
Oh, my honest apologies then, Greg! :) I am not a native speaker myself. And as far as I can tell the phrasing is absolutely grammatically correct, but there's some quality to it that registers as LLM-speak to me.

I wonder what the causal graph looks like here: do people (especially those working with LLMs a lot) lean towards LLM-speak over time, or did both LLMs and native speakers pick up this very particular sentence structure from a common source (e.g. a large corpus of French-English translations in the same style)?

gregsadetsky•4mo ago
No apologies needed, but thanks for your kind words! I think that we’re all understandably “on edge” considering that so much content is now llm-generated, and it’s hard to know what’s real and what isn’t.

I've been removing hyphens and bullet points from my own writing just to appear even less LLM-like! :)

Great stylistic chicken and egg question! French definitely tends to use certain (I’m struggling to not say “fancier”) words even in informal contexts.

I personally value using over-the-top ornate expressions in French: they both sound distinguished and a bit ridiculous, so I get to both ironically enjoy them and feel detached from them… but none of that really translates to casual English. :)

Cheers

mmh0000•4mo ago
I've noticed this with Claude Code recently. A few weeks ago, Claude was "amazing" in that I could feed it some context and a specification, and it could generate mostly correct code and refine it in a few prompts.

Now, I can try the same things, and Claude gets it terribly wrong and works itself into problems it can't find its way out of.

The cynical side of me thinks this is being done on purpose, not to save Anthropic money, but to make more money by burning tokens.

cjtrowbridge•4mo ago
This brings up a point many will not be aware of. If you know the random seed, the prompt, and the hash of the model's binary file, the output is completely deterministic. You can use this information to check whether they are in fact swapping your requests out to cheaper models than what you're paying for. This level of auditability is a strong argument for using open-source, commodified models, because you can easily check if the vendor is ripping you off.
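
A sketch of the weights-verification half of that audit idea (file name and expected digest are hypothetical placeholders; the deterministic-replay half is harder in practice, per the replies below):

```python
# Sketch: verify you are running the exact open-weights release you think you are
# by checking its published checksum before replaying a reference prompt set.
# The file name and expected digest below are placeholders.
import hashlib

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

EXPECTED = "replace-with-the-checksum-published-by-the-model-vendor"
actual = sha256_file("model.safetensors")
assert actual == EXPECTED, f"weights mismatch: {actual}"
```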
TZubiri•4mo ago
Pretty sure this is wrong: requests are batched and the batch size can affect the output; also, GPUs are highly parallel, so there can be many race conditions.
TeMPOraL•4mo ago
Yup. Floating point math turns race conditions into numerical errors, reintroducing non-determinism regardless of inputs used.
romperstomper•4mo ago
Could it be a result of caching of some sort? I suppose in the case of an LLM they can't use a direct cache, but they could group prompts using embeddings and return some most-common result, maybe? (This is just a theory.)
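
A toy sketch of the kind of semantic cache being hypothesized, assuming prompt embeddings are computed elsewhere and matched by a cosine-similarity threshold (all names and the threshold are made up for illustration):

```python
# Toy semantic cache: reuse a stored answer when a new prompt's embedding is
# close enough to one already seen. Embeddings are assumed precomputed.
import numpy as np

cache: list[tuple[np.ndarray, str]] = []  # (prompt embedding, cached response)

def lookup(prompt_emb: np.ndarray, threshold: float = 0.95) -> str | None:
    for emb, response in cache:
        cos = float(emb @ prompt_emb / (np.linalg.norm(emb) * np.linalg.norm(prompt_emb)))
        if cos >= threshold:
            return response  # close enough: serve the cached answer
    return None  # cache miss: caller runs real inference, then calls store()

def store(prompt_emb: np.ndarray, response: str) -> None:
    cache.append((prompt_emb, response))
```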
juliangoldsmith•4mo ago
I've been using Azure AI Foundry for an ongoing project, and have been extremely dissatisfied.

The first issue I ran into was them not supporting LLaMA for tool calls. Microsoft stated in February that they were working on it [0] and that they were just closing the ticket because they were tracking it internally. I'm not sure why they've been unable to do, in over six months, what took me two hours, but I am sure they wouldn't be upset by me using the much more expensive OpenAI models.

There are also consistent performance issues, even on small models, as mentioned elsewhere. This is with a request rate on the order of one per minute. You can solve that with provisioned throughput units. The cheapest option is one of the GPT models, at a minimum of $10k/month (a bit under half the cost of just renting an A100 server). DeepSeek was a minimum of around $72k/month. I don't remember there being any other non-OpenAI models with a provisioned option.

Given that current usage without provisioning comes to single-digit dollars per month, I have some doubts as to whether we'd be getting our money's worth by provisioning capacity.