(Speaking of both Claude Code and the desktop app, both Sonnet and Opus >=4, on the Max plan.)
"Use your web search tool to find me the go-to component for doing xyz in $language $framework. Always link the GitHub repo in your response."
Previously Sonnet 4 would return a good answer to this at least 80% of the time.
Now even Opus 4.1 with extended thinking frequently ignores my ask for it to use the search tool, which allows it to hallucinate a component in a library. Or maybe an entire repo.
It's gone backwards severely.
(If someone from Anthropic sees this, feel free to reach out for chat IDs/share links. I have dozens.)
IDK about you but I find it faster to type a few keywords and click the first result than to wait for "extended thinking" to warm up a cup of hot water only to ignore "your ask" (it's a "request," not an "ask," unless you're talking to a Product Manager with corporate brain damage) to search and then output bullshit.
I can only assume after you waste $0.10 asking Claude and reading the bullshit, you use normal search.
Truly revolutionary technology
Might be Claude optimizing for general use cases over code, and that affecting the code side?
Feels strange, because the Claude API isn’t the same as the web tool, so I didn’t expect Claude Code to be the same.
It might be a case of having to learn to read Claude best practice docs and keep up with them. Normally I’d have Claude read them itself and update an approach to use. Not sure that works as well anymore.
Anyone remember GPT4 the day it launched? :)
People seem to turn to this a lot when the suspicion many people have is difficult to verify. And while I don’t trust a suspicion just because it’s held by a lot of people, I also won’t allow myself to embrace the comforting certainty of “it’s surely false and it’s psychological bias”.
Sometimes we just need to not be sure what’s going on.
Based on the discussions here it seems that every model is either about to be great or was great in the past but now is not. Sucks for those of us who are stuck in the now, though.
Yes, but I'll revisit.
On that note, I strongly recommend qwen3:4b. It is _bonkers_ how good it is, especially considering how relatively tiny it is.
FWIW, Codex-CLI w/ ChatGPT5 medium is great right now. Objectively accelerating me. Not a coding god like some posters would have it, but overall freeing up time for me. Observably.
Assuming I haven't had since-cured delusions, the same was true for Claude Code, but isn't any more.
Concrete supporting evidence: From time to time, I have coding CLIs port older projects of varying (but small-ish) sizes from JS to TS. Claude Code used to do well on that. Repeatedly. I did another test last Sunday, and it dug a momentous hole for itself that even a liberal sprinkling of 'as unknown' everywhere couldn't solve. Codex managed both the ab-initio port and was able to dig itself out of the massive hole CC had abandoned mid-port.
So I'd say the evidence points somewhat against random process, given repeated testing shows clear signal both of past capability and of recent loss of capability.
The idea that it's a "random" process is misguided.
You mean like our human brains and our entire bodies? We are the result of random processes.
>Sucks for those of us who are stuck in the now, though
I don't know what you are doing, but GPT5 is incredible. I literally spent 3 hours last night going back and forth on a project where I loaded some files for a somewhat complicated and tedious conversion between two data formats. And I was able to keep going back and forth, making the improvements incrementally, and have AI do 90% of the actual tedious work.
To me it's incredible people don't seem to understand the CURRENT value. It has literally replaced a junior developer for me. I am 100% better off working with AI for all these tedious tasks than passing them off to someone else. We can argue all day about whether that's good for the world (it's not), but in terms of the current state of AI, it's already incredible.
It might not be a junior dev tool. Senior devs are using AI quite differently: to magnify themselves, not to help them manage juniors with developing ceilings.
https://status.anthropic.com/incidents/72f99lh1cj2c
Suggesting people are "out of their mind" is not really appropriate on this forum, especially so in this circumstance.
This most definitely feels like people analyzing the output of a random process - at this point I am feeling like I'm losing my mind.
(As for the phrasing I was quoting the OP, who I believe took it in the spirit in which it was meant)
[1] https://news.ycombinator.com/item?id=45183587
[2] https://news.ycombinator.com/item?id=45182714
> New features like this feel pointless when the underlying model is becoming unusable.
I recognize I could have been clearer.
And for what it's worth, yes, your comment's phrasing didn't bother me at all.
They were wrong, but not inappropriate. They re-used the "out of their mind" phrase from the parent comment to cheekily refer to the possibility of a cognitive bias.
(lol, yes, thank you.)
As an example I’ve been using an MCP tool to provide table schemas to Claude for months.
There was a point in early August where it stopped recognizing the tool unless it was explicitly mentioned. Maybe that’s related to their degraded quality issue.
This morning after pulling the correct schema info Sonnet started hallucinating columns (from Shopify’s API docs) and added them to my query.
That’s a use case I’ve been doing daily for months and in the last few weeks has gone from consistent low supervision to flaky and low quality.
I don’t know what’s going on, Sonnet has definitely felt worse, and the timeline matches their status page incident, but it’s definitely not resolved.
Opus 4.1 also feels flaky, it feels like it’s less consistent about recalling earlier prompt details than 4.0.
I personally am frustrated that there’s no refund or anything after a month of degraded performance, and they’ve had a lot of downtime.
I actually think this is psychological bias. It got a few things right early on, and that's what you remember. As time passes, the errors add up, until the memory doesn't match reality. The "new shiny" feeling goes away, and you perceive it for what it really is: a kind of shitty slot machine
> personally am frustrated that there’s no refund or anything after a month of degraded performance
lol, LMAO. A company operates a shitty slot machine at a loss and you're surprised they have "issues" that reduce your usage?
I'm not paying for any of this shit until these companies figure out how to align incentives. If they make more by applying limits, or charge me when the machine makes errors, that's good for them and bad for me! Why should I continue to pay to pull on the slot machine lever?
It's a waste of time and money. I'll be richer and more productive if I just write the code myself, and the result will be better too.
I think you’re right. I think it’s complete bias with a little bit of “it does more tasks now” so it might behave a bit differently to the same prompt.
I also think you’re right that there’s an incentive to dumb it down so you pull the lever more. Just 2 more $1 spins and maybe you’ll hit jackpot.
Really it’s the enshittification of the SOTA for profits and glory.
Another option could be a system prompt change that made it too long?
I signed up for Claude over a week ago and I totally regret it!
Previously I was using it and some ChatGPT here and there (also had a subscription in the past) and I felt like Claude added some more value.
But it's getting so unstable. It generates code, I see it doing that, and then it throws the code away and gives me the previous version of something 1:1 as a new version.
And then I have to waste CO2 to tell it to please not do that, and then sometimes it generates what I want, sometimes it just generates it again, only to throw it away immediately...
This is soooooooo annoying and the reason I canceled my subscription!
I've had the same experience. Totally unreliable.
1. Ask Claude to fix something
2. It fails to fix the issue
3. I tell it that the fix didn’t work
4. It reverts its failed fix and tells me everything is working now.
This is like finding a decapitated body, trying to bring it back to life by smooshing the severed head against the neck, realizing that didn’t bring them back to life, dropping the head back on the ground, and saying, “There; I’ve saved them now.”
They recently resolved two bugs affecting model quality, one of which was in production Aug 5-Sep 4. They also wrote:
Importantly, we never intentionally degrade model quality as a result of demand or other factors, and the issues mentioned above stem from unrelated bugs.
Sibling comments are claiming the opposite, attributing malice where the company itself says it was a screw-up. Perhaps we should take Anthropic at its word, and also recognize that model performance will follow a probability distribution even for similar tasks, even without bugs making things worse.

- They're reporting that the incident only impacted Haiku 3.5 and Sonnet 4. I used neither model during the time period I'm concerned with.
- It took them a month to publicly acknowledge that issue, so now we lack confidence there isn't another underlying issue going undetected (or undisclosed, less charitably) that affects Opus.
You can be confident there is a non-zero rate of errors and defects in any complex service that's moving as fast as the frontier model providers!
Things they could do that would not technically contradict that:
- Quantize KV cache
- Data-aware model quantization where their own evals will show "equivalent perf" but the overall model quality suffers.
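To make the first bullet concrete, here's a toy sketch (purely illustrative on my part, not anything Anthropic has said they do) of why quantization can look "equivalent" on coarse evals while still quietly losing information:

    # Toy int8 quantization of a stand-in KV-cache block (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    kv = rng.standard_normal((4, 64)).astype(np.float32)  # pretend KV-cache slice

    # Symmetric per-tensor int8 quantization: squeeze floats into 256 levels.
    scale = np.abs(kv).max() / 127.0
    kv_int8 = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    kv_dequant = kv_int8.astype(np.float32) * scale

    # The reconstruction error is small on average, but it's never zero,
    # and it compounds across layers and long contexts.
    print("mean abs error:", np.abs(kv - kv_dequant).mean())
    print("max abs error: ", np.abs(kv - kv_dequant).max())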
Simple fact is that it takes longer to deploy physical compute but somehow they are able to serve more and more inference from a slowly growing pool of hardware. Something has to give...
Is training compute interchangeable with inference compute or does training vs. inference have significantly different hardware requirements?
If training and inference hardware is pooled together, I could imagine a model where training simply fills in any unused compute at any given time (?)
Also, if you pull too many resources from training your next model to make inference revenue today, you’ll fall behind in the larger race.
I picked up Claude at the beginning of the summer and have had the same experience.
https://status.anthropic.com/incidents/72f99lh1cj2c
That being said, they still have capacity issues on any day of the week that ends in Y. No clue how long that would take to resolve.
- They admittedly go off of "vibes" for system prompt updates[0]
- I've seen my coworkers making a lot of bad config and CLAUDE.md updates, MCP server spam, etc. and claiming the model got worse. After running it with a clean slate, they retracted their claims.
> we never intentionally degrade model quality as a result of demand or other factors
Fully giving them the benefit of the doubt, I still think that still allows for a scenario like "we may [switch to quantized models|tune parameters], but our internal testing showed that these interventions didn't materially affect end user experience".
I hate to parse their words in this way, because I don't know how they could have phrased it in a way that closed the door on this concern, but all the anecdata (personal and otherwise) suggests something is happening.
> I don't know how they could have phrased it in a way that closed the door on this concern
Agreed. A full legal document would probably be the only way to convince everyone.
Sure, people complain about Anthropic's AI models getting worse over time. As well as OpenAI's models getting worse over time. But guess what? If you serve them open weights models, they also complain about models getting worse over time. Same exact checkpoint, same exact settings, same exact hardware.
Relative LMArena metrics, however, are fairly consistent across time.
The takeaway is that users are not reliable LLM evaluators.
My hypothesis is that users have a "learning curve", and get better at spotting LLM mistakes over time - both overall and for a specific model checkpoint. Resulting in increasingly critical evaluations over time.
Quantization could be done, not to deliberately make the model worse, but to increase reliability! Like Apple throttling devices - they were just trying to save your battery! After all there are regular outages, and some pretty major ones a handful of weeks back taking eg Opus offline for an entire afternoon.
> But guess what? If you serve them open weights models, they also complain about models getting worse over time.
Isn't this also anecdotal, or is there data informing this statement?
I think you could be partially right, but I also don't think dismissing criticism as just being a change in perspective is correct either. At least some complaints are from power users who can usually tell when something is getting objectively worse (as was the case for some of us Claude Code users recently). I'm not saying we can't fool ourselves too, but I don't think that's the most likely assumption to make.
Living evals can solve for the quantitative issues with infra and model updates, but not sure how to deal with perceptual adaptation.
Intentionally might mean manually, or maybe the system does it on its own when it thinks it's best.
However there have been some bugs causing performance degradation acknowledged by Anthropic as well (and fixed) and so I would guess there's a good amount of real degradation still if people are still seeing issues.
I've seen a lot of people switching to Codex CLI, and yesterday I did too; for now my $200/mo goes to OpenAI. It's quite good and I recommend it.
I'll probably come back and try a Claude Code subscription again, but I'm good for the time being with the alternative I found. I also kind of suspect the subscription model isn't going to work for me long term and instead the pay per use approach (possibly with reserved time like we have for cloud compute) where I can swap models with low friction is far more appealing.
Of course there’s always the problem of teaching to the test and out of test degradations, but presumably bugs would be independent of that.
They do not seem to care at all that what they're peddling is just elaborate smoke and mirrors.
I don't feel Claude would do this intentionally, and am reminded of why I kept Claude for some things but not for general use.
I'm kidding btw.
Maybe the reliability problems have almost nothing to do with what features they build, and are bottlenecked for completely different reasons.
Using only 2 MCP servers and not extending claude.md.
I had to keep prompting it to generate new artifacts all the time.
Thankfully that is mostly gone with Claude Code.
From troubleshooting Claude by reviewing its performance and digging multiple times into why it did what it did, it seems useful to make sure the first sentence is a clear and complete instruction, instead of breaking it up.
As models optimize resources, prompt engineering seems to become relevant again.
https://www.businessinsider.com/anthropic-ceo-ai-90-percent-...
Sonnet was nearly unusable without a perfect prompt, and it took a separate therapy session with another Sonnet chat to deconstruct how it was no longer working.
There appear to be hard overrides being introduced that overlook basic things like using your personal preferences.
Vague or general descriptions get weighted as less important than strong and clear ones.
Maybe. What would you rather have?
A) rock solid Sonnet 4 with Sonnet 5, say, next April
B) buggy Sonnet 4 with Sonnet 5, say, next January
Seems like different customers would have a range of preferences.
This must be one of the questions facing the team at Anthropic: what proportion of effort should go towards quality vs. velocity?
Might be worth trying Claude through Amazon as well.
The Anthropic product adding a feature is not the end of employment or even a step along the way.
MOST PEOPLE can't even use an actual computer, let alone think about programming.
WYSIWYG editors didn't kill web development because most people are simply too stupid to understand a new tool, let alone use it.
Rewind back to the 70s and ask the same question.
they're great for spitting out a lot of code but not so great at making it work or make sense, unfortunately
None?
I mean, Mr. Reason is standing right there!
It can actually drive emacs itself, creating buffers, being told not to edit the buffers and simply respond in the chat etc.
I actually _like_ working with efrit vs other LLM integrations in editors.
In fact I kind of need to have my anthropic console up to watch my usage... whoops!
That's the functionality which I could use for my day job, but I'm not finding an LLM which directly affords that capability (without programming or other steps which are difficult on my work computer).
I'd like an all-in-one LLM front-end tool which can access multiple files, since that is more easily explained and granted permission for.
It looks to me like a variant of the Code Interpreter pattern, where Claude has a (presumably sandboxed) server-side container environment in which it can run Python. When you ask it to make a spreadsheet it runs this:
pip install openpyxl pandas --break-system-packages
And then generates and runs a Python script.

What's weird is that when you enable it in https://claude.ai/settings/features it automatically disables the old Analysis tool - which used JavaScript running in your browser. For some reason you can have one of those enabled but not both.
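For a simple "make me a spreadsheet" request, the generated script looks something like this - my own reconstruction of the pattern, not Claude's actual output:

    # Hypothetical reconstruction of the kind of script Claude generates
    # after `pip install openpyxl pandas` (not actual Claude output).
    import pandas as pd

    # Example data standing in for whatever the user asked to tabulate.
    df = pd.DataFrame({
        "item": ["widgets", "gadgets", "gizmos"],
        "q1_sales": [120, 340, 95],
        "q2_sales": [150, 310, 130],
    })
    df["total"] = df["q1_sales"] + df["q2_sales"]

    # pandas uses openpyxl as the engine for writing .xlsx files.
    df.to_excel("sales_summary.xlsx", sheet_name="Summary", index=False)
    print("Wrote sales_summary.xlsx")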
The new feature is being described exclusively as a system for creating files though! I'm trying to figure out if that gets used for code analysis too now, in place of the analysis tool.
I tried "Tell me everything you can about your shell and Python environments" and got some interesting results after it ran a bunch of commands.
Linux runsc 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 24.04.2 LTS
Python 3.12.3
/usr/bin/node is v18.19.1
Disk Space: 4.9GB total, with 4.6GB available
Memory: 9.0GB RAM
Attempts at making HTTP requests all seem to fail with a 403 error. Suggesting some kind of universal proxy.
But telling it to "Run pip install sqlite-utils" worked, so apparently they have allow-listed some domains such as PyPI.
I poked around more and found these environment variables:
HTTPS_PROXY=http://21.0.0.167:15001
HTTP_PROXY=http://21.0.0.167:15001
On further poking, some of the allowed domains include github.com and pypi.org and registry.npmjs.org - the proxy is running Envoy. Anthropic have their own self-issued certificate to intercept HTTPS.
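If you want to poke at it yourself, here's a rough probe along those lines (assuming you're inside Claude's container; urllib picks up HTTP_PROXY/HTTPS_PROXY from the environment by default):

    # Rough sketch of probing the sandbox's egress proxy (run inside the container).
    import os
    import urllib.error
    import urllib.request

    # The proxy settings Claude reported from its environment.
    print("HTTPS_PROXY =", os.environ.get("HTTPS_PROXY"))
    print("HTTP_PROXY  =", os.environ.get("HTTP_PROXY"))

    for url in ("https://pypi.org/simple/", "https://example.com/"):
        try:
            # urllib honours the proxy env vars, so this goes through the proxy.
            with urllib.request.urlopen(url, timeout=10) as resp:
                print(url, "->", resp.status)
        except urllib.error.HTTPError as e:
            # Non-allow-listed domains reportedly come back as 403.
            print(url, "-> HTTP", e.code)
        except Exception as e:
            print(url, "->", type(e).__name__, e)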
A lot of the people I graduated with spent their 20s making powerpoint and excel. There would be people with a master's in engineering getting phone calls at 1am, with an instruction to change the fonts on slide 75, or to slightly modify some calculation. Most of the real decision making was, funnily enough, not based on these documents. But it still meant people were working 100 hour weeks.
I could see this resulting in the same work being done in a few minutes. But I could also see it resulting in the MDs asking for 10x the number of slide decks.
“Now here you see, it takes all the running you can do, to keep in the same place,” as the Red Queen says.
I fully believe any slack this creates will get gobbled up in competition in a few years.
Would appreciate if that could be fixed but of course new features are more interesting for them to prioritize.
I've been paying $10/month for GitHub Copilot, which I use via Microsoft's Visual Studio Code, and about a month ago, they added ChatGPT5 (preview), which uses the agent model of interaction. It's a qualitative jump that I'm still learning to appreciate in full.
It seems like the worst possible thing, in terms of security, to let an LLM play with your stuff, but I really didn't understand just how much easier it could be to work with an LLM if it's an agent. Previously I'd end up with a blizzard of Python error messages and just give up on a project; now it fixes its own mess. What a relief!
Will also make using Linux tooling a lot easier on non-Linux hosts like Windows/macOS
In practice, they require a lot of sysadmin-related work, and installing all the software inside them is no fun, even if using scripts, etc.
No, because the software that needs to be installed into them keeps changing (new versions, new packages, etc.)
Sysadmin is a job for a reason. And with containers you are a sysadmin for more than one system.
I'm on the $100 Max plan; I would even buy two $200 plans if Opus would stop randomly being dumb. Especially after 7am ET.
Hope not.
I reverse-engineered it a bit, figured out its container specs, used it to render a PDF join diagram for a SQLite database and then re-ran a much more complex "recreate this chart from this screenshot and XLSX file" example that I previously ran against ChatGPT Code Interpreter last night.
Here's my review: https://simonwillison.net/2025/Sep/9/claude-code-interpreter...
I just tried this new feature to work on a text document in a project, and it's a big difference. Now I really want to have this feature (for text at least) in ChatGPT to be able to work on documents through voice and without looking at the screen.
Malware writers are rejoicing!
Something with OAuth authentication.
Our org isn't interested in running a local, unofficial MCP server and having users create their own API keys.
ChatGPT can package up files as a download.
Both Gemini and ChatGPT accept zip files with lots of files in them.
Claude does neither of those things.