That said, I would also love to see some examples or data, instead of just "it's getting worse".
What's your aim here, sgt3v? I'd love to positively contribute, but I don't see how this link gets us anywhere.
I wonder if temperature zero would also be more influenced by changes to the system prompt. I can imagine it making responses more brittle.
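A minimal sketch of what I mean, assuming the OpenAI Python client (the model name and prompts are just placeholders): at temperature 0 decoding is essentially greedy, so the output is a deterministic function of the entire prompt, and even a small system-prompt edit can flip the argmax token and cascade from there.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def ask(system_prompt: str, user_prompt: str) -> str:
    # temperature=0 -> (near-)greedy decoding: the output depends only on the
    # exact prompt and the serving stack, not on sampling randomness
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return resp.choices[0].message.content

question = "List three prime numbers as a comma-separated line."
a = ask("You are a terse assistant.", question)
b = ask("You are a terse assistant. Answer in plain text.", question)
print(a == b)  # a tiny system-prompt tweak can already change the greedy path
```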
I couldn’t imagine relying on any closed models for a business because of this highly dishonest and deceptive practice.
I think they have receipts, but did not post them there
Even a simple graph of the output would be better than nothing, but instead it’s just an empty claim.
but I use local models, sometimes the same ones for years already, and the consistency and predictability there is noteworthy, while I have doubts about the quality consistency I get from closed models in the cloud. I don't see these kinds of complaints from people using local models, which undermines the idea that people were just wowed three months ago and are less impressed now.
so perhaps it's just a matter of transparency
but I think there is constant fine-tuning occurring, alongside filters being added and removed in an opaque way in front of the model
Also, from what I understand from the article, it's not a difficult task but an easily machine-checkable one, i.e. whether the output conforms to a specific format.
What’s missing is the actual evidence. Which I would love of course. But assuming they’re not actively lying, this is not as subjective as you suggest.
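Since the failure being described is just format conformance, the check itself is cheap to automate; a hypothetical sketch (the article's actual expected format isn't specified here, so the JSON-with-required-keys schema below is made up):

```python
import json

REQUIRED_KEYS = {"name", "date", "amount"}  # hypothetical schema, not from the article

def conforms(output: str) -> bool:
    """Return True if the model output is valid JSON containing the expected keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS.issubset(data)

print(conforms('{"name": "ACME", "date": "2024-01-01", "amount": 42}'))  # True
print(conforms("Sure! Here is the JSON you asked for: ..."))             # False
```

Run that over every response and the "evidence" is just a pass rate over time, which is exactly what's missing from the original claim.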
esafak•1h ago
Providers should keep timestamped models fixed and assign modified versions a new timestamp, and a new price if they want. The model with the "latest" tag could change over time, like a Docker image. Then we could make an informed decision about which version to use. Companies want to cost-optimize their cake and eat it too.
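The APIs already gesture at this with dated snapshots versus floating aliases; a rough sketch, assuming the OpenAI Python client (the snapshot name is illustrative and may not match a real release):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# "latest"-style alias: the provider may repoint this under the hood at any time
floating = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "ping"}],
)

# dated snapshot: the closest thing today to a pinned, immutable Docker tag
pinned = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # illustrative snapshot name
    messages=[{"role": "user", "content": "ping"}],
)
```

The complaint in this thread is essentially that even the pinned names don't behave like immutable tags, because the serving stack underneath them keeps changing.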
edit: I have the same complaint about my Google Home devices. The models they use today are indisputably worse than the ones they used five whole years ago. And features have been removed without notice. Qualitatively, the devices are no longer what I bought.
icyfox•46m ago
> "Not quantized. Weights are the same. If we did change the model, we’d release it as a new model with a new name in the API."
- [Ted Sanders](https://news.ycombinator.com/item?id=44242198) (OpenAI)
The problem is that most of these issues stem from broader infrastructure behavior, like numerical instability at inference time. Since this affects their whole serving pipeline, the logic can't really be encapsulated in a frozen environment like a Docker container. I suppose _technically_ they could maintain a separate inference cluster for each of their point releases, but that would mean previous models don't benefit from common infrastructure improvements, load balancing would be harder to shard across GPUs, and it might be so hard to coordinate logistically as to be effectively impossible.
https://www.anthropic.com/engineering/a-postmortem-of-three-...
https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
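For anyone wondering what "numerical instability at inference time" cashes out to: floating-point addition is not associative, so a change in batch size, kernel, or hardware can change the reduction order and flip low-order bits in the logits, and at temperature 0 a flip near a near-tie becomes a different token. A toy illustration (plain numpy, not an actual inference kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.standard_normal(1_000_000).astype(np.float32)

# Same numbers, two different summation orders -> typically slightly different
# results, the same way different batching or kernels reorder reductions on a GPU.
forward = np.add.reduce(values)
reversed_order = np.add.reduce(values[::-1])

print(forward, reversed_order, forward == reversed_order)
# The difference is tiny, but greedy decoding turns "tiny" into "different token"
# whenever two candidate tokens are nearly tied.
```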
xg15•15m ago
I've heard of similar experiences from real-life acquaintances: a prompt worked reliably for hundreds of requests per day over several months, and then, when a newer model was released, the model suddenly started making mistakes, ignoring parts of the prompt, etc.
I agree it doesn't have to be deliberate malice, like intentionally nerfing a model to make people switch to the newer one. It might just be that fewer resources are allocated to the older model once the newer one is available, so the inference parameters change. But some effect around the release of a newer model does seem to be there.
icyfox•11m ago
As for the original forum post:
- Multiple numerical instability bugs can compound to make things worse (we saw this in the latest Anthropic post-mortem)
- OP didn't provide any details on eval methodology, so I don't think it's worth speculating on this anecdotal report until we see more data
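On the second point, the data that would make a report like this actionable is cheap to collect; a hypothetical sketch (the function names, check, and log path are all placeholders) that logs a timestamped pass rate so any decline shows up as a trend rather than a vibe:

```python
import csv
import datetime

def run_eval(prompts, ask, check) -> float:
    """Run a fixed prompt set through the model and return the pass rate."""
    passed = sum(1 for p in prompts if check(ask(p)))
    return passed / len(prompts)

def log_daily(pass_rate: float, path: str = "eval_log.csv") -> None:
    # Append one timestamped data point per run; plotting this file over weeks
    # is the "simple graph" people upthread are asking for.
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([datetime.date.today().isoformat(), f"{pass_rate:.3f}"])
```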
xg15•3m ago
If it indeed did show a slow decline over time and OpenAI did not change the weights, then something does not add up.
gregsadetsky•38m ago
If they do, I think that it will add a lot to this conversation. Hope it happens!