Did Claude Increase Bugs in Rsync?

https://alexispurslane.github.io/rsync-analysis/

58•logicprog•1h ago

Comments

wookmaster•1h ago

Claude is just a tool? The developers who merged that code without properly testing it increased the bugs.

everdrive•1h ago

"Did cars increase traveling deaths?"

"Cars are just a tool. The drivers who piloted the vehicles and weren't careful enough [are responsible for the deaths.]"

Angostura•1h ago

This tool is claimed to be able to find and fix bugs.

roywiggins•1h ago

If something's a bad tool that misleads people into doing bad work, it would be good to know that.

the_real_cher•1h ago

Is there a non-vibe-coded fork of rsync?

throwaway7356•55m ago

Yes: https://news.ycombinator.com/item?id=48390931

So far it reintroduced several security issues and replaced the README.md.

rovr138•1h ago

I'm just curious about testing.

Is this a configuration that's not common and thus not tested?

If people think they can do better, I want to see their forks and them keeping up with it.

https://github.com/RsyncProject/rsync/graphs/contributors?fr...

duk3luk3•1h ago

This article is unfortunately unreadable because all of the prose is unfiltered LLM slop.

roywiggins•1h ago

> A simple distributional analysis of every rsync release with bug data. No model. No assumptions. Just placement.

If you want me to read your analysis, you are going to have to make it not read like Claude wrote it. What does "placement" even mean here?

logicprog•1h ago

"Placement" as in where the Claude-driven releases exist within the existing distribution of bugs per 100 commits. If they're not OOD, then nothing is unusual.

Also, it wasn't written by Claude FWIW, GLM 5.1.
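For readers unsure what "placement" amounts to in practice, here is a minimal sketch of the idea described above: compute bugs per 100 commits for each past release, then ask where a new release falls in that empirical distribution. The release figures and the helper names (`bugs_per_100_commits`, `percentile_rank`) are invented for illustration. They are not the real rsync data and not the code from the linked repo.

```python
from bisect import bisect_right


def bugs_per_100_commits(bugs, commits):
    """Normalize a bug count by release size."""
    return 100.0 * bugs / commits


def percentile_rank(history, value):
    """Percent of historical values that are less than or equal to `value`."""
    ordered = sorted(history)
    return 100.0 * bisect_right(ordered, value) / len(ordered)


# Hypothetical (bugs, commits) pairs for past releases, NOT real rsync numbers.
past_releases = [
    (4, 120), (9, 150), (2, 80), (12, 200),
    (6, 90), (3, 60), (15, 210), (5, 100),
]
history = [bugs_per_100_commits(b, c) for b, c in past_releases]

# Hypothetical release under scrutiny.
new_rate = bugs_per_100_commits(7, 140)
rank = percentile_rank(history, new_rate)

print(f"{new_rate:.1f} bugs/100c sits at percentile {rank:.0f} of past releases")
```

A release that lands near the middle of the historical spread is unremarkable by this measure; one far out in the upper tail would be the "OOD" case. With only a handful of releases, the percentile is coarse, so this is a descriptive check rather than a test.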

rroblak•1h ago

Yeah, made me chuckle that an LLM— probably Claude— was used to write this.

The use of "regime shift" is what gave it away for me. I've never seen a human write that, but Claude does from time to time.

At least they removed occurrences of "load-bearing".

roywiggins•53s ago

"quietly" seems to be the new one recently

gamegod•59m ago

It's the ultimate product for marketers. It inserts itself as an advertisement into every conversation now and defends itself against criticism. Just crazy. There's no hope for the rest of us.

logicprog•56m ago

It's not defending itself here, both because I used GLM 5.1, not Claude, and because I was the one who decided to do this analysis, iterated through six or seven different methodologies to try to find the one that was most honest with the data that I had (all of the methodologies showed directionally and often in magnitude the exact same thing, but I wanted to do something that fit the purpose, in consultation with my wife, who, as I've mentioned elsewhere, has a master's degree in statistics), and, of course, I specifically chose all of the metrics and sources for the data.

If you don't want to read the LLM prose, you can just go to the GitHub of my project, grab the scripts, and run the full pipeline. It will gather the data, build the database, and run the analysis from scratch for you, and you can look at the numbers directly. It's all repeatable.

mschuster91•1h ago

This article reeks of LLM "assistance" at the very least.

Please, why can't people write stuff by hand themselves any more? It's a good analysis but how can I trust it without reviewing everything myself?!

logicprog•1h ago

I mean, you can literally clone my repo, run the Python that rebuilds the database and does the whole data analysis end to end from scratch, and verify that the numbers are accurate. I made the code for this analysis public for that exact reason. This wasn't just an LLM running unsupervised in a loop. I came up with the methodologies, metrics, and data scraping strategies myself, and iterated on them to try to be as honest as possible about what the data could show.

sanitycheck•56m ago

I think the point people are making is that when the text has an "AI smell" (it does), we immediately lose trust in the veracity of any claim being made and feel like continuing to read what is possibly a hallucinated fiction is a complete waste of time.

At this point we're all used to skimming through thousands of AI-generated sentences every working day and constantly thinking "this is likely to be 20% bullshit", it's hard to turn that off even if I try.

logicprog•52m ago

Do you think it would help if I went through and manually rewrote all of the prose? If it would get people to listen, I'd be totally willing to do it. It's not like I don't like writing. I just was focused on something else when I was making this, namely trying to find a good methodology that isn't insane for this low amount of data.

logicprog•1h ago

Some notes on this:

- I used GLM 5.1 to help with the coding and math for this.

- However, I explicitly dictated where the data should be pulled from (GitHub, Bugzilla, mailing list), how it should be tagged and grouped, and what data to look at (e.g. bugs instead of regressions)

- Additionally, I consulted with my wife, who has a master's degree in statistics from Penn State University, about what sort of statistical methodology would be justified for this very limited data set while still giving as much information as possible.

- I know the website looks like we stereotypically consider vibe-coded websites to look, but I actually explicitly asked for that. The original HTML design looked like a website from 1995, and I just prefer how this looks. It's pretty!

jchw•1h ago

I really struggle to believe you wrote text like:

> A simple distributional analysis of every rsync release with bug data. No model. No assumptions. Just placement.

logicprog•1h ago

No, I didn't write the text itself. I'm typically significantly more verbose and elliptical, and more than that, the numbers and methodology changed often enough over the course of the last couple days I was working on this because I was trying to get it to be as accurate and fair as possible that trying to keep the whole thing up to date manually would have been problematic.

jchw•37m ago

Sorry to say but I'm absolutely certain I would've preferred to read your worst attempt at a write-up over the grating utter shite LLMs output. It's not even a question, this is unreadable.

nairboon•1h ago

Is this an analysis made by/with Claude?

quentindanjou•52m ago

It very obviously is. "The Outlier Nobody Noticed" -_-"

Polarity•1h ago

so the answer is: no. actually less bugs. thanks

geraneum•1h ago

> But the critics' accusation is also blunt: "Claude is making things worse." A blunt instrument is the fairest response.

So the criticism was bad, and that somehow makes it ok to use a bad metric?

logicprog•1h ago

That's not what I'm saying. What I'm saying is that if the criticism is referring to a broad set of metrics like bugs per release and number of commits that were made by Claude, then it's correct to look at precisely those things because that's what the claim is about.

faitswulff•1h ago

> The analysis uses a single metric: bugs per 10 commits (bugs/10c).

Bugs per commit as a metric papers over severity, both in terms of security severity as well as the effect on the user. A mislabeled button has the same weight as the entire app crashing in this framework.
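To make the objection concrete, here is a small sketch comparing a raw bug rate with a severity-weighted one. The severity categories, the weights, and both example releases are made up for illustration; nothing here comes from the article or from rsync's bug tracker.

```python
# Hypothetical severity weights, chosen only to illustrate the point.
SEVERITY_WEIGHTS = {"cosmetic": 1, "functional": 3, "crash": 5, "security": 8}


def raw_rate(bugs, commits):
    """Bugs per 100 commits, every bug counted equally."""
    return 100.0 * len(bugs) / commits


def weighted_rate(bugs, commits, weights=SEVERITY_WEIGHTS):
    """Severity points per 100 commits, so a crash outweighs a typo."""
    return 100.0 * sum(weights[severity] for severity in bugs) / commits


# Two invented releases with the same bug count but different severity mixes.
release_a = ["cosmetic", "cosmetic", "cosmetic", "cosmetic"]
release_b = ["crash", "security", "functional", "cosmetic"]
commits = 100

for name, bugs in (("A", release_a), ("B", release_b)):
    print(name, raw_rate(bugs, commits), weighted_rate(bugs, commits))
```

Both releases look identical under the raw rate, while the weighted rate separates them. The catch is that any weighting scheme is itself a judgment call, and it needs severity labels that the underlying bug data may not have.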

skeledrew•16m ago

There was no analysis of severity in all of the rage posting that occurred. The single point being pushed was "use of an LLM led/leads to more bugs". The author specifically states that's what they're addressing (blunt accusation -> blunt response).

gadrev•57m ago

Ok.

  $ apt-cache policy rsync | grep Installed
    Installed: 3.4.1+ds1-7ubuntu0.2
  $ sudo apt-mark hold rsync
    rsync set on hold.

logicprog•54m ago

Did you face any actual bugs or regressions? Or are you doing this just because of the bandwagon that's going around right now? Because until you can actually present an argument for why this release is worse than any of the others, which is precisely the subject of my post, then this is not an argument against my post at all. This is just a self-referential appeal to authority.

gadrev•44m ago

Nah, I skimmed TFA but then I went into the linked GH issues thread, and that's the one that scared me a bit. I just want to hold it for a while and not run into some of the things I'm reading since I'm on the latest ubuntu. Just a precaution.

I didn't have the time to actually think about any "arguments" at all tbh, it's just a knee-jerk reaction as I get ready to log off for the weekend. Not actually looking to argue for or against your post at all lol.

imurray•45m ago

That version has security fixes from the same day as the latest rsync release: https://ubuntu.com/security/notices/USN-8283-1

As usual, Ubuntu backported fixes and didn't upgrade to a new version. Whether or not they also backported regressions in edge cases that afflict the latest rsync, I don't know. Pinning the Ubuntu package may prevent further regressions, but it also prevents you from getting any future backported security fixes.

scsh•52m ago

> It does not control for commit complexity, security intensity, or bug severity. It does not distinguish between a one-line typo fix and a CVE patch. It is a blunt instrument. But the critics' accusation is also blunt: "Claude is making things worse." A blunt instrument is the fairest response.

If by fairest you mean to say that this analysis and response is sufficient, then I'm sorry but I have to disagree. We really need to understand whether the nature of the bugs is worse from a user's perspective. Even if the rate stayed unchanged, if the result is that the perceived quality of the software declined, then I would personally consider that worse, especially if I were a project maintainer.

That's not meant to be wholly dismissive either. But in general, I don't think quantitative analysis alone is enough to fully answer this type of question.

skeledrew•6m ago

But it is fair. Up to this point I have yet to see anyone say they did an analysis of the code and found X regressions of Y severity. All they say is "there are more bugs because LLM". This analysis, which you can verify yourself if you wish, says "the bugs [number of] are pretty average even with LLM", which is a direct response to that. If you'd like a more nuanced analysis you're welcome to do one and share the result, if you're so inclined.

logicprog•45m ago

Okay, I really have to point out to everyone: the numbers and report cards are TEMPLATED IN BY A SCRIPT. Hallucinations are a moot point. https://github.com/alexispurslane/rsync-analysis/blob/main/s...

tappio•41m ago

A lot of people are criticizing it because it's heavily written with an LLM, but I mean, if someone had produced this piece pre-LLM, would they criticize it? Is the critique due to the use of an LLM or due to the content being truly hard to follow? I read it and I would say there are some problems with the writing, but it's not a bad piece.

Of course this is a bigger problem, as it's now harder to distinguish "AI slop" from "content co-authored with AI that is carefully reviewed" at a quick glance, and the "AI smell" is quite off-putting. My initial reaction was also negative, but after skimming through it and reading the summaries, I found it a decent summary, which also... speaks of this thread, of the content of the blog post and everything about the discussion and the strong feelings people have developed around the use of LLMs.

Anyhow, it would be good to disclose the repo with the code for the statistics and the use of an LLM in the writing right up front: which model, why it was used to do the writing, etc. It's enough to say "I think it writes better than I do" or "I was in a hurry, sorry" or whatever, but it really should be disclosed. It reads as more honest.

ps. really... that sideways scroll? plz fix it.

logicprog•39m ago

Thank you for your constructive input, you're one of only a few others here who had any. I'll definitely do that. I didn't think, since the output was templated directly from the numbers generated by a reproducible python script, that people would get so up in arms about the aesthetics, but I guess I forgot to say that.

JasonSage•27m ago

> content co-authored with AI that is carefully reviewed

The problem I see is that this is indistinguishable to a reader at a glance.

Distancing the writing from the "AI smell" not only improves the quality by dropping the unnecessary ocean of rhetorical devices, it forces the human to have real weight and agency on what's being said.

I think that act of distancing from raw LLM output through refinement is a huge quality leap. Even if you're only doing the refinement with an LLM, it forces the writing to have more voice and ideas from the author.

I can see the work that went into the analysis here but again, as a casual reader, it's impossible to tell that there were any original ideas here expressed by the author.

sfink•17m ago

Wow.

I am pretty insensitive to AI writing. I have never commented before about something sounding like AI, because mostly I don't notice. But this was so over the top that I spent the whole article trying to decide whether it was an intentional parody of AI writing style.

This article's language is not en-US. It's not en-BR. It's en-SLOP.

Yes, that was my clumsy attempt at AI parody. Here's another: this article doesn't just have AI tells. It is AI tells.

Every sentence is saturated with AI style. Perhaps the author is so AI-indoctrinated that they can't see this? It doesn't read as even vaguely plausible human writing. Which is mightily ironic given the thesis of "AI generated stuff is just fine, m'kay?" The writing style does more to defeat its conclusion than the analysis itself.

As for the substance of the analysis, it seems pretty good to me but I see some flaws that weaken it a bit.

The presence of "The Outlier Nobody Noticed" proves nothing and deserves no more than a passing mention. A random release introduced way more bugs than the Claude-containing releases. That provides evidence that Claude doesn't introduce more bugs only if your hypothesis is a very naive "AI is the only thing that can ever increase bug introduction rates."

The whole analysis has very limited data. It's necessarily based off a single pair of releases at the very end of the chronological timeline. You would never be able to reject a null hypothesis based only on that, so it's even less sound to present it as proving the null hypothesis. (By the same token, it would be incorrect for critics to claim that it proves their point. Did anyone claim this, though? The heated complaints seemed more based on priors about AI code.)

"The critics' claim is a simple comparison: did the rate go up?" That's reductive. For one, these releases are known to be in reaction to a flood of (AI-discovered!) security reports, which is a novel situation and in fact is a huge confound to anyone arguing about what those two releases mean -- they're both heavily AI-written, but in response to an unusual situation. When the samples are only drawn from a distinct scenario, statistical analysis can only speak to the quality of code in that scenario.

Also, another reasonable hypothesis could be: AI-written code has bugs of a different flavor that bothers users more. It's optimized for passing tests and convincing people and AIs that security holes are closed, which means other considerations like preserving functionality can more easily be regressed as compared to if humans were doing it. (If true, it still doesn't support the claim that depending on AI code is a catastrophe, fwiw.)

I'm not arguing the conclusion is wrong. I'm saying the analysis proves far less than it claims to. As for whether it's a debacle for rsync to become dependent on AI code generation, I think that's a reasonable debate to have but it's not going to be resolved this reductively.
