Claude Code Daily Benchmarks for Degradation Tracking

https://marginlab.ai/trackers/claude-code/
87•qwesr123•1h ago

Comments

qwesr123•1h ago
FYI, the MarginLab Claude Code degradation tracker is showing a statistically significant ~4% drop in SWE-Bench-Pro accuracy over the past month.
beardsciences•45m ago
Very interesting. I would be curious to understand how granularly these updates are applied to CC, and what might be causing things like this. I feel like I can notice a very small degradation, but I've compensated with more detailed prompts (which I think, perhaps naively, is offsetting the issue).
goldenarm•38m ago
I really like the idea, but a "±14.0% significance threshold" is meaningless here.

The larger monthly scale should be the default, or you should get more samples.

zacmps•32m ago
Could you elaborate on what you think the problems are? I guess they should be using some form of multiple-comparison correction?
goldenarm•25m ago
The daily scale is not statistically significant and is meaningless. You should narrow the confidence interval by either increasing the time scale or the number of evaluations.
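
To make that concrete, here's a back-of-the-envelope sketch (my own arithmetic, not MarginLab's code) of where a ±14% daily band comes from, assuming each day is roughly 50 independent pass/fail trials:

    # Sketch: normal-approximation 95% CI half-width for a Bernoulli pass rate.
    # Assumes ~50 tasks per daily run (the subset size mentioned downthread).
    import math

    def ci_half_width(n, p=0.5, z=1.96):
        # p = 0.5 is the worst case: it maximizes the variance p * (1 - p).
        return z * math.sqrt(p * (1 - p) / n)

    for n in (50, 300, 1500):  # daily run, larger suite, ~30 daily runs pooled
        print(f"n={n:4d}: +/-{ci_half_width(n):.1%}")
    # n=  50: +/-13.9%  <- essentially the quoted +/-14.0% daily threshold
    # n= 300: +/-5.7%
    # n=1500: +/-2.5%

At the monthly pooled scale a real ~4% shift is resolvable; at the daily scale it drowns in sampling noise.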
turnsout•33m ago
This is probably entirely down to subtle changes to CC prompts/tools.

I've been using CC more or less 8 hrs/day for the past 2 weeks, and if anything it feels like CC is getting better and better at actual tasks.

fragebogen•31m ago
I was going to ask: are all other variables accounted for? Are we really comparing apples to apples here? Still worth doing, obviously, as it serves as a good e2e evaluation; just asking for curiosity's sake.
FfejL•28m ago
Honest, good-faith question.

Is CC getting better, or are you getting better at using it? And how do you know the difference?

I'm an occasional user, and I can definitely see improvements in my prompts over the past couple of months.

billylo•28m ago
That's why benchmarks are useful. We all suffer from the shortcomings of human perception.
gpm•21m ago
Benchmarks' shortcomings are no worse... they inevitably measure something that is only close to the thing you actually care about, not the thing itself. It's entirely plausible that this decreased benchmark score is because Anthropic's initial prompting of the model was overtuned to the benchmark, and as they gain more experience with real-world use they're changing the prompt to do better at that and consequently worse at the benchmark.
billylo•11m ago
I wonder how best we can measure the usefulness of models going forward.

Thumbs up or down? (could be useful for trends)

Usage growth from the same user over time? (as an approximation)

Tone of user responses? ("Don't do this... this is the wrong path..." etc.)

fragebogen•29m ago
Would love to see this idea expanded to every alleged SoTA model currently in production. Any speculation as to why this degradation occurs?
embedding-shape•23m ago
Anecdote, I don't have any proof and it's just a feeling. But in the afternoon in GMT+1, compared to the morning/midday, there seems to be a change in the quality of responses, which lines up with when the US wakes up. I consistently get (what feels like) worse responses in both Codex and Claude Code in the afternoon/night compared to morning/midday, so much so that I usually give up, then try the same prompt the next morning and get better results. But I guess that might just as well be me being more tired at night than in the morning; as I said, I haven't measured this.
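
If I ever do measure it, something as simple as this would settle it (hypothetical sketch; assumes you log a timestamp and a 1-5 self-rating per session, which I don't currently do):

    # Hypothetical sketch: bucket per-session quality ratings by local hour
    # to test the "afternoon slump" feeling. Toy data below.
    from collections import defaultdict
    from datetime import datetime

    ratings = [("2026-01-28T09:30", 5), ("2026-01-28T16:10", 2)]  # (timestamp, rating)

    by_hour = defaultdict(list)
    for ts, score in ratings:
        by_hour[datetime.fromisoformat(ts).hour].append(score)

    for hour in sorted(by_hour):
        scores = by_hour[hour]
        print(f"{hour:02d}:00  n={len(scores)}  mean={sum(scores) / len(scores):.2f}")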
jzig•15m ago
It’s the afternoon slump. The AI needs a cup of coffee and to doomscroll for half an hour!
embedding-shape•12m ago
Or a load balancing technique :) Either way, it kicks me off to do other things so maybe it isn't so bad after all.
sciencejerk•25m ago
Why is this happening?
giwook•21m ago
https://www.anthropic.com/engineering/a-postmortem-of-three-...
Trufa•18m ago
I have absolutely no inside knowledge, but I think it's not a bad assumption: running the models is costly. When they release a new model they eat that cost and give each user more raw power; once they've captured the new users and the wow factor, they start cutting costs by reducing the capacity they provide per user. Rinse and repeat.
Uehreka•2m ago
There are frequently claims that Anthropic is somehow diluting or dumbing down models in some subtle way. Unfortunately it’s tough to validate these claims without a body of regularly checked evals. This test set should hopefully help settle whether Anthropic is actually making changes under the hood or whether the changes are all in people’s heads.
Dowwie•25m ago
Simply search user prompts for curse words and then measure hostility sentiment. User hostility rises as agents fail to meet expectations.
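
A minimal sketch of the idea (toy word list and prompts; a real version would want an actual sentiment model):

    # Toy sketch: count frustration words per prompt as a hostility proxy.
    # The word list and prompts are illustrative placeholders.
    CURSES = {"damn", "wtf", "stupid", "useless"}

    def hostility(prompt: str) -> int:
        return sum(w.strip(".,!?").lower() in CURSES for w in prompt.split())

    session = ["please refactor this module",
               "wtf, you deleted the tests",
               "this is useless, stupid thing"]
    print([hostility(p) for p in session])  # [0, 1, 2]: frustration rising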
Trufa•21m ago
I'm glad I'm not the only one.
mrbananagrabber•17m ago
I, uh, might be skewing that, as I generally just use a lot of curse words with Claude by default.
ctxc•16m ago
I feel bad about it but sometimes it's so daft, I can't even xD

It's not my fault, they set high standards!

smotched•11m ago
there are many times where I just do it myself and it thinks it did well.
silverlight•18m ago
There was a moment about a week ago where Claude went down for about an hour. And right after it came back up it was clear a lot of people had given up and were not using it.

It was probably 3x faster than usual. I got more done in the next hour with it than I do in half a day usually. It was definitely a bit of a glimpse into a potential future of “what if these things weren’t resource constrained and could just fly”.

yoavsha1•6m ago
I had that exact same feeling during the US holidays where I got to enjoy 2x usage limits and everything just seemed to work well
dajonker•17m ago
Wouldn't be surprised if they slowly start quantizing their models over time. Makes it easier to scale and reduce operational cost. Also makes a new release have more impact as it will be more notably "better" than what you've been using the past couple of days/weeks.
ofirpress•14m ago
[SWE-bench co-author here] It seems like they run this test on a subset of 50 tasks, and that they only run the test once per day. So a lot of the movement in accuracy could be attributed to that. I would run on 300 tasks and I'd run the test suite 5 or 10 times per day and average that score. Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded.
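
For a sense of the arithmetic (my sketch, treating each task attempt as an independent Bernoulli trial, which understates things since reruns of the same task are correlated):

    # Sketch: how the daily noise band shrinks with more tasks and repeated runs.
    import math

    def half_width(n_tasks, k_runs, p=0.5, z=1.96):
        # 95% CI half-width for the mean pass rate over n_tasks * k_runs attempts.
        return z * math.sqrt(p * (1 - p) / (n_tasks * k_runs))

    print(f"50 tasks x 1 run/day:    +/-{half_width(50, 1):.1%}")    # ~13.9%
    print(f"300 tasks x 5 runs/day:  +/-{half_width(300, 5):.1%}")   # ~2.5%
    print(f"300 tasks x 10 runs/day: +/-{half_width(300, 10):.1%}")  # ~1.8%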
mohsen1•12m ago
Hope you don't mind the unrelated question:

How do you pay for those SWE-bench runs?

I am trying to run a benchmark but it is too expensive to run enough runs to get a fair comparison.

https://mafia-arena.com

ofirpress•8m ago
Benchmarks can get costly to run; you can reach out to frontier model creators to try to get them to give you free credits, but usually they'll only agree once your benchmark is pretty popular.
mohsen1•7m ago
Yes, I reached out to them, but as you say it's a chicken-and-egg problem.

Thanks!

ghm2199•12m ago
In medicine there is a concept of reporting adverse effects of medications or interventions, which are then collectively studied for public health [MedWatch][VAERS][EudraVigilance] and in academia. We should have something like that for all coding agents (and agents in other fields too), given how widely they're deployed and their effect on "health" in general (not only human). Call it the AI "health of things" benchmark.

I would imagine something with the hybrid qualities of volunteer efforts like Wikipedia, new problems like Advent of Code, and benchmarks like this. The goal? To study the collective effects of usage across the many areas where AI is used.

[MedWatch](https://www.fda.gov/safety/medwatch-fda-safety-information-a...)

[VAERS](https://www.cdc.gov/vaccine-safety-systems/vaers/index.html)

[EudraVigilance](https://www.ema.europa.eu/en/human-regulatory-overview/resea...)

antirez•11m ago
Why I do not believe this shows Anthropic serving folks a worse model:

1. The percentage drop is too small, and it oscillates: it goes up and down.

2. A baseline for Sonnet 4.5 (the obvious fallback for when their GPUs are busy with the next training run) should be established, to see whether Opus at some point drops to Sonnet level. This was not done, but if that were happening we would likely see a much sharper decline on certain days/periods; the graph would be dominated by a "square wave" shape.

3. There are much better explanations for this oscillation: A) they have multiple checkpoints and are A/B testing them (CC asks you for feedback about the session); B) Claude Code itself gets updated, so the exact tools the agent can use change. Part of it is also the natural variability of token sampling, which makes runs non-equivalent (sometimes the model makes suboptimal decisions compared to T=0) as well as non-deterministic, but that is the price to pay for some variability.

IshKebab•8m ago
> We model tests as Bernoulli random variables and compute 95% confidence intervals around daily, weekly, and monthly pass rates. Statistically significant differences in any of those time horizons are reported.

Doesn't really work like that. I'd remove the "statistically significant" labelling because it's misleading.
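
To illustrate with a quick simulation (my sketch; assumes 50 tasks/day and a fixed true pass rate, i.e. a model that never changes):

    # Sketch: how often pure noise trips a naive daily 95% "significance" band.
    import random
    random.seed(0)

    N_TASKS, P_TRUE, DAYS = 50, 0.5, 10_000
    half = 1.96 * (P_TRUE * (1 - P_TRUE) / N_TASKS) ** 0.5  # ~0.139

    flagged = sum(
        abs(sum(random.random() < P_TRUE for _ in range(N_TASKS)) / N_TASKS - P_TRUE) > half
        for _ in range(DAYS)
    )
    print(f"{flagged / DAYS:.1%} of days flagged with zero real change")  # ~6-7%

The discreteness of 50 trials already makes the nominal 5% band trip closer to 7% of the time, and reporting whichever of the daily/weekly/monthly horizons crosses its band inflates the false-positive rate further.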

Opinionated GitHub Action for generating high-quality SBOMs

https://github.com/sbomify/github-action
1•todsacerdoti•5s ago•0 comments

Catching the Next Telnetd-Class Security Bug

https://vartia.ai/posts/telnetd_cve/
1•briandw•1m ago•0 comments

Tmux for Claude Code but accessible from web browser and mobile

https://github.com/kirikov/teleclaude
1•Datkiri•1m ago•0 comments

I Hope You Get to Live Your Life as a Human Being

https://transgamerthoughts.com/post/802327706229456896/i-hope-you-get-to-live-your-entire-life-as...
1•hn_acker•3m ago•0 comments

We're All Beginners Again

https://matthewrocklin.com/ai-beginners/
1•vinhnx•3m ago•0 comments

Modern Pandas (2016)

https://tomaugspurger.net/posts/modern-1-intro/
1•tosh•4m ago•0 comments

The Pacific Northwest Tree Octopus

https://zapatopi.net/treeoctopus/
2•jeffjeffbear•4m ago•0 comments

US trade deficit widens by the most in nearly 34 years in November

https://finance.yahoo.com/news/us-trade-deficit-widens-most-144236696.html
3•thomassmith65•4m ago•1 comment

First Impressions of Readeck

https://www.autodidacts.io/readeck-open-source-read-it-later-app-with-kobo-support/
1•Curiositry•6m ago•0 comments

Data on Neocloud Adoption

https://www.hostingadvice.com/studies/neocloud-adoption/
1•ljh501•7m ago•0 comments

Advancing regulatory variant effect prediction with AlphaGenome

https://www.nature.com/articles/s41586-025-10014-0
1•mellosouls•8m ago•0 comments

US Congress asks Ford for more info on Chinese military battery partnership

https://chinaselectcommittee.house.gov/media/press-releases/moolenaar-questions-ford-about-its-ch...
1•737min•10m ago•1 comment

Create App store and Google Play store screenshots with AppLaunchpad

https://theapplaunchpad.com/
1•applaunchpad•10m ago•0 comments

We may get a trial on whether Elon Musk defrauded Twitter investors

https://bsky.app/profile/annmlipton.bsky.social/post/3mdkowyv7tk2p
3•doener•11m ago•0 comments

The 80% Problem in Agentic Coding

https://addyo.substack.com/p/the-80-problem-in-agentic-coding
1•vinhnx•12m ago•0 comments

New Game Plus

https://mar.coconauts.net/blog/posts/2025-01-29-new-game-plus/
1•marbartolome•12m ago•0 comments

Everyone's okay with their AI, just not yours

https://idiallo.com/blog/ai-is-ok-just-not-yours
1•Brajeshwar•12m ago•0 comments

Recreating the Smells of History

https://knowablemagazine.org/content/article/society/2026/recreating-the-smells-of-the-past
1•Brajeshwar•12m ago•0 comments

Ancient humans were seafaring far earlier than we realised

https://www.newscientist.com/article/2511681-ancient-humans-were-seafaring-far-earlier-than-we-re...
1•Brajeshwar•12m ago•0 comments

ClickBench

https://benchmark.clickhouse.com/
1•tosh•14m ago•0 comments

Don't invert established UX mental models

https://thoughts.wyounas.com/p/dont-invert-established-ux-mental
2•simplegeek•14m ago•0 comments

The Fancy Payment Cards of Taiwan

https://hackaday.com/2026/01/28/the-fancy-payment-cards-of-taiwan/
1•lxm•15m ago•0 comments

Anthropic Is at War with Itself

https://www.theatlantic.com/technology/2026/01/anthropic-is-at-war-with-itself/684892/
1•kerim-ca•18m ago•1 comment

Are Google navigation services getting worse?

https://ilearnt.com/blog/googleworse/
1•speckx•19m ago•1 comment

Something that I used to love

https://andreapivetta.com/posts/something-that-i-used-to-love.html
1•ziggy42•19m ago•0 comments

KiteSQL: Rust-native embedded SQL with TPC-C benchmarks and WASM support

https://github.com/KipData/KiteSQL
1•Jacques2Marais•20m ago•0 comments

Valanza – my Unix way for weight tracking and analysis

https://github.com/paolomarrone/valanza
1•lallero317•22m ago•0 comments

Finding out your public IP address via curl

https://heitorpb.github.io/bla/ip-rs/
1•wilsonfiifi•25m ago•0 comments

History teaches us to deal with societal collapse – TEDxTallinn [video]

https://www.youtube.com/watch?v=SNPjghax6uA
1•obscurette•25m ago•0 comments

AtomVM 2025 Year in Review

https://substack.com/home/post/p-186191026
1•todsacerdoti•26m ago•0 comments