That said, I would also love to see some examples or data, instead of just "it's getting worse".
With that said, Microsoft has a different level of responsibility to provide safety, both to its customers and to its stakeholders, than OpenAI or any other frontier provider. That's not a criticism of OpenAI or Anthropic or anyone else, all of whom I believe are trying their best to provide safe usage. (Well, other than xAI and Grok, for which the lack of safety is a feature, not a bug.)
The risk to Microsoft of getting this wrong is simply higher than it is for other companies, and that's why they have a strong focus on Responsible AI (RAI) [1]. I don't know the details, but I have to assume there's a layer of RAI processing on models served through Azure OpenAI that isn't there when using OpenAI models directly through the OpenAI API. That layer is valuable for companies who choose to run their inference through Azure and also want to maximize safety.
I wonder if that's where some of the observed changes are coming from. I hope the commenter posts their proof for further inspection. It would help everyone.
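Purely to illustrate where such a layer would sit (not a claim about what Azure actually runs internally), here's a rough sketch using the `openai` Python SDK. The endpoint, deployment name, and keys are placeholders; the point is that the Azure path attaches content-filter annotations that a direct OpenAI call doesn't.

```python
# Sketch only: the same request sent directly to OpenAI vs. through an
# Azure OpenAI deployment. Endpoint, deployment, and keys are placeholders.
from openai import OpenAI, AzureOpenAI

messages = [{"role": "user", "content": "Summarize this incident report."}]

# Direct OpenAI call: no Azure-side RAI/content-filter layer in the path.
direct = OpenAI(api_key="sk-...").chat.completions.create(
    model="gpt-4o", messages=messages
)

# Same request through Azure OpenAI: Azure applies its own content-filter
# policies and annotates the response with the results.
azure = AzureOpenAI(
    api_key="...",
    api_version="2024-06-01",
    azure_endpoint="https://my-resource.openai.azure.com",  # placeholder
)
filtered = azure.chat.completions.create(
    model="my-gpt4o-deployment",  # Azure deployment name, not the raw model ID
    messages=messages,
)

# Azure responses include per-category filter annotations; the exact field
# names and whether the SDK surfaces them depend on the API version.
print(filtered.model_dump().get("prompt_filter_results"))
```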
What's your aim here, sgt3v? I'd love to positively contribute, but I don't see how this link gets us anywhere.
I can tell you that what the post describes is exactly what I’ve seen too: degraded performance and excruciating slowness.
I wonder if temp zero would be more influenced by changes to the system prompt too. I can imagine it making responses more brittle.
https://learn.microsoft.com/en-us/azure/ai-foundry/openai/re...
That being said, we also keep a test suite to check that model updates don't result in worse results for our users, and it has worked well enough. We had to skip a few versions of Sonnet because they stopped being able to complete tasks (on the same data) that they could previously. I don't blame Anthropic; I would be crazy to assume that new models are a strict improvement across all tasks and domains.
I do just wish they would stop deprecating old models; once you have something working to your satisfaction, it would be nice to freeze it. Ah well, that's only possible with local models.
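For anyone curious, the test suite doesn't need to be elaborate. A minimal sketch of the idea, with made-up tasks and thresholds and a placeholder `call_model` standing in for whatever client you actually use:

```python
# Minimal regression check: run a fixed task set against a candidate model
# version and refuse to switch if it scores worse than the current one.
# Tasks, expected answers, and tolerance are invented for illustration.

FIXED_TASKS = [
    {"prompt": "Extract the invoice total from: ...", "expected": "142.50"},
    {"prompt": "Classify the sentiment of: ...", "expected": "negative"},
]

def call_model(model_id: str, prompt: str) -> str:
    """Placeholder for the real API call (Anthropic, OpenAI, local, ...)."""
    raise NotImplementedError

def pass_rate(model_id: str) -> float:
    passed = sum(
        1 for task in FIXED_TASKS
        if task["expected"] in call_model(model_id, task["prompt"])
    )
    return passed / len(FIXED_TASKS)

def safe_to_upgrade(current: str, candidate: str, tolerance: float = 0.0) -> bool:
    # Skip the new version (as above with Sonnet) if it regresses on the same data.
    return pass_rate(candidate) >= pass_rate(current) - tolerance
```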
I couldn’t imagine relying on any closed models for a business because of this highly dishonest and deceptive practice.
I think they have receipts, but did not post them there
Even a simple graph of the output would be better than nothing, but instead it’s just an empty claim.
but I use local models, sometimes the same ones for years already, and the consistency there is noteworthy, while I have ongoing doubts about the quality consistency of closed models in the cloud. I don't see these kinds of complaints from people using local models, which undermines the idea that people were just wowed three months ago and are less impressed now.
so perhaps it's just a matter of transparency
but I think there is consistent fine-tuning occurring, alongside filters being added and removed in an opaque way in front of the model
Also, from what I understand from the article, it's not a difficult task but an easily machine-checkable one, i.e. whether the output conforms to a specific format.
What’s missing is the actual evidence. Which I would love of course. But assuming they’re not actively lying, this is not as subjective as you suggest.
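To make that concrete, a check like this is all it takes. The expected format below (a JSON object with a constrained `label` field) is invented for illustration, not what the original poster was actually testing:

```python
import json

def conforms(output: str) -> bool:
    """True if the model output matches the expected format.

    Toy format: a JSON object whose 'label' is one of a fixed set,
    e.g. {"label": "spam"}.
    """
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.get("label") in {"spam", "ham"}

# Logging this rate over time, with timestamps and model versions, is
# exactly the kind of evidence that would settle the "it got worse" claim.
outputs = ['{"label": "spam"}', "Sure! Here's the answer: spam"]
rate = sum(conforms(o) for o in outputs) / len(outputs)
print(f"format conformance: {rate:.0%}")  # 50% in this toy example
```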
I think it would have sounded more reasonable in French, which is my actual native tongue. (i.e. I subconsciously translate from French when I'm writing in English)
((this comment was also written without AI!!)) :-)
I wonder how the causal graph looks here: do people (esp. those working with LLMs a lot) lean towards LLM-speak over time, or did both LLMs and native speakers pick up this very particular sentence structure from a common source (e.g. a large corpus of French-English translations in the same style)?
I’ve been removing hyphens and bullet points from my own writing just to appear even less LLM-like! :)
Great stylistic chicken and egg question! French definitely tends to use certain (I’m struggling to not say “fancier”) words even in informal contexts.
I personally value using over-the-top ornate expressions in French: they both sound distinguished and a bit ridiculous, so I get to both ironically enjoy them and feel detached from them… but none of that really translates to casual English. :)
Cheers
Now, I can try the same things, and Claude gets it terribly wrong and works itself into problems it can't find its way out of.
The cynical side of me thinks this is being done on purpose, not to save Anthropic money, but to make more money by burning tokens.
The first issue I ran into was with them not supporting LLaMA for tool calls. Microsoft stated in February that they were working on it [0], and they were just closing the ticket because they were tracking it internally. I'm not sure why, in over six months, they've been unable to do what took me two hours, but I am sure they wouldn't be upset by me using the much more expensive OpenAI models.
There are also consistent performance issues, even on small models, as mentioned elsewhere. This is at a rate on the order of one request per minute. You can solve that with provisioned throughput units. The cheapest option is one of the GPT models, at a minimum of $10k/month (a bit under half the cost of just renting an A100 server). DeepSeek was a minimum of around $72k/month. I don't remember there being any other non-OpenAI models with a provisioned option.
Given that our current usage without provisioning comes to single-digit dollars per month, I have some doubts as to whether we'd be getting our money's worth by provisioning capacity.
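Back-of-the-envelope version, with the per-request cost as a pure guess (only the ~1 request/minute rate and the $10k/month floor come from the numbers above):

```python
# Rough comparison: pay-as-you-go at ~1 request/minute vs. the cheapest
# provisioned-throughput floor mentioned above. Per-request cost is assumed.
requests_per_month = 60 * 24 * 30          # ~1 request per minute
assumed_cost_per_request = 0.0001          # placeholder for a small model/prompt
pay_as_you_go = requests_per_month * assumed_cost_per_request   # ~$4/month
provisioned_floor = 10_000                 # $/month, cheapest PTU option

print(f"pay-as-you-go: ~${pay_as_you_go:,.2f}/month")
print(f"provisioned:    ${provisioned_floor:,}/month "
      f"(~{provisioned_floor / pay_as_you_go:,.0f}x more)")
```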
esafak•4mo ago
Providers should keep timestamped models fixed and assign modified versions a new timestamp (and a new price, if they want). The model with the "latest" tag could change over time, like a Docker image. Then we can make an informed decision about which version to use. Companies want to cost-optimize their cake and eat it too.
edit: I have the same complaint about my Google Home devices. The models they use today are indisputably worse than the ones they used five whole years ago. And features have been removed without notice. Qualitatively, the devices are no longer what I bought.
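Dated snapshots on some APIs already work roughly this way; the catch is that pinning the model ID doesn't freeze the serving stack underneath it. A sketch, with illustrative model IDs:

```python
# Pinning a dated snapshot vs. riding a floating alias, Docker-tag style.
# Model IDs are illustrative; check your provider's current model list.
from openai import OpenAI

client = OpenAI()

FLOATING_ALIAS = "gpt-4o"              # can be repointed over time, like :latest
PINNED_SNAPSHOT = "gpt-4o-2024-08-06"  # dated snapshot, closer to an image digest

def ask(model_id: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

# Pinning rules out "the weights changed under me" as an explanation,
# but not changes in the inference stack serving those weights.
```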
colordrops•4mo ago
jonplackett•4mo ago
icyfox•4mo ago
> Not quantized. Weights are the same. If we did change the model, we’d release it as a new model with a new name in the API.
- [Ted Sanders](https://news.ycombinator.com/item?id=44242198) (OpenAI)
The problem here is that most of these issues stem from broader infrastructure problems like numerical instability at inference time. Since this affects their whole service pipeline, the logic can't really be encapsulated in a frozen environment like a Docker container. I suppose _technically_ they could maintain a separate inference cluster for each of their point releases, but then previous models wouldn't benefit from common infrastructure improvements, load balancing would be harder to shard across GPUs, and the coordination might be so difficult logistically as to be effectively impossible.
https://www.anthropic.com/engineering/a-postmortem-of-three-... https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
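The numerical-instability point is easy to demonstrate without any GPUs; here's a toy illustration of how reduction order alone shifts results (on real inference servers, batching and sharding change that order constantly):

```python
# Floating-point addition is not associative, so the same logits computed
# with a different reduction order (different batch size, kernel, or GPU
# count) can differ in the last bits -- enough to flip a greedy token
# choice when two candidates are nearly tied.
import random

random.seed(0)
values = [random.uniform(-1, 1) for _ in range(100_000)]

left_to_right = sum(values)
reordered = sum(sorted(values))        # same numbers, different order

print(left_to_right == reordered)      # usually False
print(abs(left_to_right - reordered))  # tiny, but nonzero
```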
xg15•4mo ago
I've heard of similar experiences from real-life acquaintances, where a prompt worked reliably for hundreds of requests per day for several months - and then, when a newer model was released, the model suddenly started making mistakes, ignoring parts of the prompt, etc.
I agree, it doesn't have to be deliberate malice like intentionally nerfing a model to make people switch to the newer one - it might just be that fewer resources are allocated to the older model once the newer one is available, and so the inference parameters change - but some effect around the release of a newer model does seem to be there.
icyfox•4mo ago
As for the original forum post:
- Multiple numerical computation bugs can compound to make things worse (we saw this in the latest Anthropic post-mortem)
- OP didn't provide any details on eval methodology, so I don't think it's worth speculating on this anecdotal report until we see more data
xg15•4mo ago
If it indeed did show a slow decline over time and OpenAI did not change the weights, then something does not add up.
esafak•4mo ago
gregsadetsky•4mo ago
If they do, I think that it will add a lot to this conversation. Hope it happens!
gregsadetsky•4mo ago
I asked them to share data/dates as much as that’s possible - fingers crossed