The larger monthly scale should be the default, or you should get more samples.
I've been using CC more or less 8 hrs/day for the past 2 weeks, and if anything it feels like CC is getting better and better at actual tasks.
Edit: Before you downvote, can you explain how the model could degrade WITHOUT changes to the prompts? Is your hypothesis that Opus 4.5, a huge static model, is somehow changing? Master system prompt changing? Safety filters changing?
Is CC getting better, or are you getting better at using it? And how do you know the difference?
I'm an occasional user, and I can definitely see improvements in my prompts over the past couple of months.
My initial prompting is boilerplate at this point, and looks like this:
(Explain overall objective / problem without jumping to a solution)
(Provide all the detail / file references / past work I can think of)
(Ask it "what questions do you have for me before we build a plan?")
And then go back and forth until we have a plan.
Compared to my work with CC six months ago, it's just much more capable, able to solve more nuanced bugs, and less likely to generate spaghetti code.
For me I've noticed it getting nothing but better over the past couple months, but I've been working on my workflows and tooling.
For example, I used to use plan mode and would put everything in a single file and then ask it to implement it in a new session.
Switching to the 'superpowers' plugin, with its own skills to brainstorm, write plans, and execute them in batches of tasks, seems to have made a big improvement and helps catch things I wouldn't have before. There's a similar "get shit done" plugin that I want to explore as well.
The code output always looks good to me for the most part, though, and I've never thought it was getting dumber or anything, so I feel like a lot of the improvements I see come down to a skill issue on my part in trying to use everything. Obviously it doesn't help that there's a new way to do things every two weeks.
Thumbs up or down? (could be useful for trends) Usage growth from the same user over time? (as an approximation) Tone of user responses? (Don't do this... this is the wrong path... etc.)
Just ignore the continual degradation of service day over day, long after the "infrastructure bugs" have reportedly been solved.
Oh, and I've got a bridge in Brooklyn to sell ya, it's a great deal!
They should be transparent and tell customers that they're trying to not lose money, but that'd entail telling people why they're paying for service they're not getting. I suspect it's probably not legal to do a bait and switch like that, but this is pretty novel legal territory.
It's the only time cussing worked, though.
It's not my fault, they set high standards!
It was probably 3x faster than usual. I got more done in the next hour with it than I do in half a day usually. It was definitely a bit of a glimpse into a potential future of “what if these things weren’t resource constrained and could just fly”.
Ultimately I can understand that if a new model comes in without as much optimization, it'll add pressure on the older models to achieve the same result.
Nice plausible deniability for a convenient double effect.
This last week it seems way dumber than before.
How do you pay for those SWE-bench runs?
I am trying to run a benchmark but it is too expensive to run enough runs to get a fair comparison.
Thanks!
It's a terrific idea to provide this. An "isitdownorisitjustme" for LLMs would be the canary in the coal mine that could at least inform the multitude of discussion threads about suspected dips in performance (beyond HN).
What we could also use is similar stuff for Codex, and eventually Gemini.
Really, the providers themselves should be running these tests and publishing the data.
Availability status information is no longer sufficient to gauge service delivery, because the service is by nature non-deterministic.
I don't know if they do this or not, but the nature of the API is such that you could absolutely load-balance this way. The context sent at each point is not, I believe, "sticky" to any server.
TL;DR: you could get a "stupid" response and then a "smart" response within a single session because of heterogeneous quantization / model behaviour in the cluster.
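To make the claim above concrete, here is a minimal sketch of stateless routing over heterogeneous replicas. Everything here is hypothetical (the replica names, quantization levels, and random routing policy are assumptions, not how any provider actually works); the point is just that when the full context travels with each request, no session affinity is needed, so consecutive turns of one session can land on differently-quantized replicas:

```python
import random

# Hypothetical replica pool: same model, different quantizations.
REPLICAS = [
    {"name": "replica-a", "quant": "bf16"},  # full precision
    {"name": "replica-b", "quant": "int8"},  # heavier quantization
    {"name": "replica-c", "quant": "bf16"},
]

def route(request_context: str) -> dict:
    """Stateless routing: the full context arrives with every request,
    so any replica can serve it; no session stickiness is required."""
    return random.choice(REPLICAS)

# Two turns of the same session may be served at different quality levels.
session = ["turn 1 context", "turn 1 + turn 2 context"]
served_by = [route(ctx)["quant"] for ctx in session]
```

Under this (assumed) setup, a "stupid" then "smart" response within one session is just the luck of the draw on which replica handled each turn.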
Aha, so the models do degrade under load.
I would imagine a sort of hybrid with qualities of volunteer efforts like Wikipedia, fresh problems like Advent of Code, and benchmarks like this. The goal would be to study, as a collective effort, the effects of AI usage across the many areas where it is used.
[MedWatch](https://www.fda.gov/safety/medwatch-fda-safety-information-a...)
[VAERS](https://www.cdc.gov/vaccine-safety-systems/vaers/index.html)
[EudraVigilance](https://www.ema.europa.eu/en/human-regulatory-overview/resea...)
1. The percentage drop is too small and it oscillates; it goes up and down.
2. A baseline for Sonnet 4.5 (the obvious fallback for when their GPUs are busy with the next training run) should be established, to see whether Opus at some point drops to Sonnet level. This was not done, but if it were, we would likely see a much sharper decline on certain days / in certain periods; the graph would look dominated by a "square wave" shape.
3. There are much better explanations for this oscillation: A) they have multiple checkpoints and are A/B testing (CC asks you for feedback about the session); B) Claude Code itself gets updated, so the exact tool versions the agent can use change. In part it is the natural variability of token sampling that makes runs non-equivalent (sometimes the model makes suboptimal decisions compared to T=0) as well as non-deterministic, but this is the price to pay for some variability.
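The sampling-variability point can be illustrated with a small simulation. The numbers below are made up for illustration: assume a model whose true per-task pass probability never changes, and run the same fixed-size benchmark repeatedly. The observed pass rate still oscillates purely from binomial noise:

```python
import random

random.seed(0)  # fixed seed so the simulation is reproducible

TRUE_PASS_RATE = 0.70  # assumed constant model quality
N_TASKS = 50           # tasks per benchmark run

def run_benchmark() -> float:
    """Simulate one benchmark run: each task passes independently
    with the same fixed probability (no model change at all)."""
    passes = sum(random.random() < TRUE_PASS_RATE for _ in range(N_TASKS))
    return passes / N_TASKS

observed = [run_benchmark() for _ in range(10)]
spread = max(observed) - min(observed)
```

With only 50 tasks per run, a spread of several percentage points across identical runs is routine, so a day-over-day dip of that size need not imply any change to the model or its serving.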
Why January 8? Was that an outlier high point?
IIRC, Opus 4.5 was released in late November.
Doesn't really work like that. I'd remove the "statistically significant" labelling because it's misleading.
I would be curious to see how it fares against a constant harness.
There were threads claiming that Claude Code got worse with 2.0.76, with some people going back to 2.0.62. https://github.com/anthropics/claude-code/issues/16157
So it would be wonderful to measure these.
I wouldn't be surprised if what this is actually benchmarking is just Claude Code's constant system prompt changes.
I wouldn't really trust it to benchmark Opus itself.
"You have a bug in line 23." "Oh yes, this solution is bugged, let me delete the whole feature." That one-line fix I could make even with ChatGPT 3.5 can't just happen. Workflows that I use and are very reproducible start to flake and then fail.
After a certain number of tokens per day, it becomes unusable. I like Claude, but I don't understand why they would do this.
I would suggest adding some clarification to note that a longer measure like the 30 pass rate is raw data only, while the statistically significant labels apply only to changes.
Maybe something like: "Includes all trials; significance labels apply only to confidence in change vs. baseline."
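For what "confidence in change vs. baseline" could mean concretely, here is one standard option, a two-proportion z-test, sketched with made-up numbers (35/50 tasks passing at baseline vs. 28/50 today; both the counts and the choice of test are assumptions, not necessarily what the site uses):

```python
import math

def two_proportion_z(pass1: int, n1: int, pass2: int, n2: int) -> float:
    """Two-proportion z-test: is the change in pass rate between a
    baseline run and a current run larger than sampling noise?"""
    p1, p2 = pass1 / n1, pass2 / n2
    pooled = (pass1 + pass2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

# Hypothetical numbers: 35/50 at baseline vs. 28/50 today.
z = two_proportion_z(35, 50, 28, 50)
significant = abs(z) > 1.96  # ~95% two-sided confidence threshold
# z ≈ -1.45, so this 14-point drop would NOT be labeled significant.
```

This also shows why raw pass rates and significance labels should be presented separately: a visible drop in the raw number can still be well within noise at these sample sizes.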
Very simple queries, even those easily answered via regular web searching, have begun to consistently fail to yield accurate results with Opus 4.5, despite the same prompts previously producing accurate results.
One of the tasks that I had thought was fully saturated, since the most recent releases had no issues solving it, was requesting a list of material combinations for fabrics used in bag construction that use a specific fabric base. In the last two weeks, Claude has consistently and reproducibly provided results that deviate from the requested fabric base, making them inaccurate in a way that someone less familiar with the topic might not notice instantly. There are other queries of this type, on topics I am nerdily familiar with to a sufficient degree to notice such deviations (motorcycle-history queries, for example), so I can say this behaviour isn't limited to fabrics and bag construction.
Looking at the reasoning traces, Opus 4.5 even writes down the correct information, yet somehow provides an incorrect final output anyways.
What makes this so annoying is that in coding tasks, with extensive prompts that require far greater adherence to very specific requirements in a complex code base, Opus 4.5 does not show such a regression.
I can only speculate about what may lead to such an experience, but for non-coding tasks I have seen regression in Opus 4.5, whereas for coding I have not. I'm not saying there is none, but I wanted to point it out, as these discussions are often primarily focused on coding, where I find it can be easier to see potential regressions where there are none as a project goes on and tasks become inherently more complex.