Ask HN: Claude Opus performance affected by time of day?

41•scaredreally•3w ago

I am a big fan of Claude Opus as it has been very good at understanding feature requests and generally staying consistent with my codebase (completely written from scratch using Opus).

I've noticed recently that when I am using Opus at night (Eastern US), I am seeing it go down extreme rabbit holes on the same types of requests I am putting through on a regular basis. It is more likely to undertake refactors that break the code and then iterates on those errors in a sort of spiral. A request that would normally take 3-4 minutes will turn into a 10 minute adventure before I revert the changes, call out the mistake, and try again. It will happily admit the mistake, but the pattern seems to be consistent.

I haven't performed a like for like test and that would be interesting, but has anyone else noticed the same?

Comments

bayarearefugee•3w ago

I mostly use Gemini, so I can't speak for Claude, but Gemini definitely has variable quality at different times, though I've never bothered to try to find a specific time-of-day pattern to it.

The most reliable time to see it fall apart is when Google makes a public announcement that is likely to cause a sudden influx of people using it.

And there are multiple levels of failure, first you start seeing iffy responses of obvious lesser quality than usual and then if things get really bad you start seeing just random errors where Gemini will suddenly lose all of its context (even on a new chat) or just start failing at the UI level by not bothering to finish answers, etc.

The sort of obvious likely reason for this is when the models are under high load they probably engage in a type of dynamic load balancing where they fall back to lighter models or limit the amount of time/resources allowed for any particular prompt.

kevinsync•3w ago

I suspect they might transparently fall back too; Opus 4.5 has been really reasonable lately, except right after it launched, and also surrounding any service interruptions / problems reported on status.claude.ai -- once those issues resolve, for a few hours the results feel very "Sonnet", and it starts making a lot more mistakes. When that happens, I'll usually just pause Claude and prompt Codex and Gemini with the same issue to see what comes out of the black hole.. then a bit later, Claude mysteriously regains its wits.

I just assume it went to the bar, got wasted, and needed time to sober up!

scaredreally•3w ago

Precisely. Once I point out the fact that it is doing this, it seems to produce better results for a bit before going back to the same.

I jokingly (and not so) thought that it was trained on data that made it think it should be tired at the end of the day.

But it is happening daily and at night.

woleium•3w ago

I find it helps to tell it to take some stimulants

stavros•3w ago

I didn't believe such conspiracy theories, until one day I noticed Sonnet 4.5 (which I had been using for weeks to great success) perform much worse, very visibly so. A few hours later, Opus 4.5 was released.

Now I don't know what to think.

pankajdoharey•3w ago

Model router.

astrange•3w ago

They don't ever fall back to cheaper models silently.

What Anthropic does do is poke the model to tell you to go to bed if you use it too long ("long conversation reminder") which distracts it from actually answering.

Sometimes they do have associations with things like the day of the year and might be lazier some months than others.

pankajdoharey•3w ago

If they are real slime balls they can justify it by saying you see we use speculative decoding so we first use a smaller faster model model first and then then answer is enhanced by larger model blah blah ..... "FOr the best User experience"

pankajdoharey•3w ago

Its the router they are using, we surely are not getting what we select. Also after a few queries the intelligence drops. abruptly. and doesn't recover even after we start a new session, so there is another internal quota at play.

janalsncm•3w ago

It’s possible that they could be using fallback models during peak load times (west coast mid day). I assume your traffic would be routed to an east coast data center though. But secretly routing traffic to a worse model is a bit shady so I’d want some concrete numbers to quantify worse performance.

dcre•3w ago

To be clear, the company has very directly denied doing this.

denysvitali•3w ago

They did yes, but should we trust them?

I remember clearly this problem happening in the past, despite their claims. I initially thought it was an elaborate hoax, but it turned out to be factually true in my case.

dcre•3w ago

I tend to think it would be very hard and very risky for large, successful companies to systematically lie about these things without getting caught, and the people who would be doing the lying in this case are not professional liars, they’re engineers who generally seem trustworthy. So yes, if there is a degradation, I think bugs are much more likely than systematic lying.

cma•2w ago

The TPU implementation used approximate top-k instead of the exact used on nvidia. While that wouldn't matter too much and there was a bug with it, it still was a cost savings thing not to use exact from the beginning because it wasn't efficient on TPUs which they were routing to under load. So it was a bit of a model difference under load, even aside from the bug.

dcre•2w ago

To the extent this is an accurate characterization (somewhat, I think), they considered the quality difference a bug and fixed it!

causal•3w ago

I've had the same suspicion for various providers - if I had time and motivation I would put together a private benchmark that runs hourly and chart performance over time. If anyone wants to do that I'll upvote your Show HN :)

fhk•3w ago

Hold my beer

oncallthrow•3w ago

For what it’s worth, Anthropic very strongly claim that they don’t degrade model performance by time of day [1]. I have no reason to doubt that, imo Anthropic are about as ethical as LLM companies get.

[1] https://www.anthropic.com/engineering/a-postmortem-of-three-...

joshribakoff•3w ago

Banning paying users with no warning doesn’t seem super ethical. Probably not unethical, either, but I would not frame them as “the most ethical”

phist_mcgee•3w ago

I'd say they're about as good as the average billion dollar American tech company when it comes to ethics.

hagbard_c•3w ago

Simple, the model is tired after a long day of working so it starts making mistakes. Give it some rest and it is ready to serve again.

anonzzzies•3w ago

Many people 'notice' it (on reddit); I notice it too, but it is hard to prove. I tried the same prompt on the same code every 4 hours for 48 hours, the behaviour was slightly different but not worse or much different in time. But then I just work on my normal code, think wtf is it doing now??? look at the time and see it is US day time and stop.

People put forward many theories for this (weaker model routing; be it a different model, Sonnet or Haiku or lower quantized Opus seem the most popular), Anthropic says it is all not happening.

8note•3w ago

you might want to capture all your context when its doing badly?

maybe youre operating at very full, or with some poison pill file?

keep some logs of when cc reads different files vs when it gets stupid?

RickS•3w ago

I've certainly noticed some variance from opus. there are times it gets stuck and loops on dumb stuff that would have been frustrating from sonnet 3.5, let alone something as good as opus 4.5 when it's locked in. But it's not obviously correlated with time, I've hit those snags at odd hours, and gotten great perf during peak times. It might just be somewhat variable, or a shitty context.

Now GPT4.1 was another story last year, I remember cooking at 4am pacific and feeling the whole thing slam to a halt as the US east coast came online.

UncleEntity•3w ago

>> ...or a shitty context

This is my guess, sometimes it churns through things without a care in the world and other times is seem to be intentionally annoying to eat up the token quota without doing anything productive.

Kind of have to see which mode it's in before turning it loose unsupervised and keep an eye on it just in case it decides to get stupid and/or lazy.

joshribakoff•3w ago

Yep, i have long felt like i randomly get sonnet results despite opus billing. I try to work odd hours and notice better results.

jgbuddy•3w ago

It seems clear that, rather than throttling, anthropic serves lower quality versions of their models during peak usage to keep up with demand. They refuse to admit it, and it's hard to prove, but these threads consistently happen ~3 months after every single model release.

killingtime74•3w ago

Are you using the API or a subscription?

storus•3w ago

I had something similar with GPT, like a clockwork every day after like 1pm it started producing total garbage. Not sure if our account was A/B tested or they just routed us to some brutal quantization of GPT, or even a completely different model.

DANmode•3w ago

Always Be Collecting (accounts)

botacode•3w ago

My limited understanding here is that usage loads impact model outputs to make them less deterministic (and likely degrading in quality). See: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...

schmookeeg•3w ago

I do think Claude does jiggery pokery with its model quality but I have had Clod appear any time of day.

What i find IS tied to time of day is my own fatigue, my own ability to detect garbage tier code and footguns, and my patience is short so if I am going to start cussing at Clod, it is almost always after 4 when I am trying to close out my day.

taurath•3w ago

Yes I’ve noticed that at certain times it gets very stuck, with the exact same setups. If I keep trying with new context windows it will still have poor performance, but if I come back in 30m or an hour it returns to normal. I don’t think it’s my context window changing, it seems to truly be degradation.

FWIW, I experienced it with sonnet as well. My conspiracy brain says they’re testing tuning the model to use up more tokens when they want to increase revenue, especially as agents become more automated. Making things worse == more money! Just like the rest of tech

jdcasale•3w ago

The math is obvious on this one. It's super well-documented that model performance on complex tasks scales (to some asymptote) with the amount of inference-time compute allocated.

LLM providers must dynamically scale inference-time compute based on current load because they have limited compute. Thus it's impossible for traffic spikes _not_ to cause some degradations in model performance (at least until/unless they acquire enough compute to saturate that asymptotic curve for every request under all demand conditions -- it does not seem plausible that they are anywhere close to this)

YetAnotherNick•3w ago

Umm. I run multiple benchmark using APIs for my work and the inference time compute allotted has clear correlation with the metrics. But time of the day certainly isn't. If it is that straightforward people can prove very easily rather than relying on the anecdotes.

They either overprovision the server during low demand or they might dynamically provision servers based on load.

SOLAR_FIELDS•3w ago

Yes, every time I see some variant of this come up (and believe me, this has been coming up since before the GPT3.5 days) there’s never any actual data demonstrating that it’s the case. As you say, it should be completely trivial to run the exact same prompt multiple times per day and capture the output to demonstrate this.

But no one ever seems to do that, they are rather content to “feel” that this is the case instead

baranmelik•2w ago

Have you noticed whether this correlates more with load (e.g. outages / announcements) than actual time of day? Curious if anyone has tried controlled comparisons.

Waffle2180•2w ago

I’ve noticed something similar, though I don’t think it’s literally “time of day” so much as changing system conditions.

My working theory is that under higher load, the model is more likely to: - take broader interpretive leaps - attempt larger refactors instead of minimal diffs - “explain its way forward” after a wrong turn rather than reset cleanly

That shows up as rabbit holes and self-reinforcing iterations, especially on codebases where local consistency matters more than global cleverness.

What’s helped a bit for me: - explicitly asking for minimal, localized changes - telling it not to refactor unless necessary - breaking requests into smaller steps and locking earlier decisions

It could also be variance from routing, context window pressure, or subtle prompt drift rather than a predictable nightly degradation, but the pattern of “overconfident refactor spirals” feels real.

A like-for-like experiment with the same prompt and context at different times would be interesting, though hard to fully control.

muzani•2w ago

Yeah, we did some tests. We're in UTC+8 (probably literally the opposite side of the world). We found it was great in the day, but degraded around midnight. The theory was that it gets worse on American peak time, but I wonder if it's just optimized for daytime.

Creamsicle47•2w ago

Yes, I'm seeing the exact same behavior. Ask it a question and it takes forever to answer at night.

Ask HN: Opus 4.6 ignoring instructions, how to use 4.5 in Claude Code instead?

Ask HN: Anyone Using a Mac Studio for Local AI/LLM?

Ask HN: Ideas for small ways to make the world a better place

Ask HN: Non AI-obsessed tech forums

Ask HN: 10 months since the Llama-4 release: what happened to Meta AI?

LLMs are powerful, but enterprises are deterministic by nature

Ask HN: Who wants to be hired? (February 2026)

Ask HN: Who is hiring? (February 2026)

AI Regex Scientist: A self-improving regex solver

Ask HN: Non-profit, volunteers run org needs CRM. Is Odoo Community a good sol.?

Tell HN: Another round of Zendesk email spam

Ask HN: Is Connecting via SSH Risky?

Ask HN: Has your whole engineering team gone big into AI coding? How's it going?

Ask HN: How does ChatGPT decide which websites to recommend?

Ask HN: Why LLM providers sell access instead of consulting services?

Ask HN: Is there anyone here who still uses slide rules?

Ask HN: Mem0 stores memories, but doesn't learn user patterns

Ask HN: What is the most complicated Algorithm you came up with yourself?

Ask HN: Is it just me or are most businesses insane?

Kernighan on Programming

Ask HN: Anyone Seeing YT ads related to chats on ChatGPT?

Ask HN: Does global decoupling from the USA signal comeback of the desktop app?

We built a serverless GPU inference platform with predictable latency

Ask HN: Does a good "read it later" app exist?

Ask HN: Have you been fired because of AI?

Ask HN: Anyone have a "sovereign" solution for phone calls?

Ask HN: Cheap laptop for Linux without GUI (for writing)

Ask HN: Any International Job Boards for International Workers?

Ask HN: How Did You Validate?

GitHub Actions Have "Major Outage"

Ask HN: Opus 4.6 ignoring instructions, how to use 4.5 in Claude Code instead?

Ask HN: Anyone Using a Mac Studio for Local AI/LLM?

Ask HN: Ideas for small ways to make the world a better place

Ask HN: Non AI-obsessed tech forums

Ask HN: 10 months since the Llama-4 release: what happened to Meta AI?

LLMs are powerful, but enterprises are deterministic by nature

Ask HN: Who wants to be hired? (February 2026)

Ask HN: Who is hiring? (February 2026)

AI Regex Scientist: A self-improving regex solver

Ask HN: Non-profit, volunteers run org needs CRM. Is Odoo Community a good sol.?

Tell HN: Another round of Zendesk email spam

Ask HN: Is Connecting via SSH Risky?

Ask HN: Has your whole engineering team gone big into AI coding? How's it going?

Ask HN: How does ChatGPT decide which websites to recommend?

Ask HN: Why LLM providers sell access instead of consulting services?

Ask HN: Is there anyone here who still uses slide rules?

Ask HN: Mem0 stores memories, but doesn't learn user patterns

Ask HN: What is the most complicated Algorithm you came up with yourself?

Ask HN: Is it just me or are most businesses insane?

Kernighan on Programming

Ask HN: Anyone Seeing YT ads related to chats on ChatGPT?

Ask HN: Does global decoupling from the USA signal comeback of the desktop app?

We built a serverless GPU inference platform with predictable latency

Ask HN: Does a good "read it later" app exist?

Ask HN: Have you been fired because of AI?

Ask HN: Anyone have a "sovereign" solution for phone calls?

Ask HN: Cheap laptop for Linux without GUI (for writing)

Ask HN: Any International Job Boards for International Workers?

Ask HN: How Did You Validate?

GitHub Actions Have "Major Outage"

Ask HN: Claude Opus performance affected by time of day?

Comments