Tracking Copilot vs. Codex vs. Cursor vs. Devin PR Performance

https://aavetis.github.io/ai-pr-watcher/

108•HiPHInch•3d ago

Comments

zachlatta•5h ago

Wow, this is an amazing project. Great work!

frognumber•5h ago

Missing data: I don't make a codex PR if it's nonsense.

Poor data: If I make one, it's because I want to either:

a) Merge it (success)

b) Modify it (sometimes success, sometimes not). In one case, Codex made the wrong changes in all the right places, but it was still easier to work from that by hand.

c) Pick ideas from it (partial success)

So simple merge rates don't say much.

osigurdson•3h ago

It isn't so much "poor" data as it is a fairly high bar for value generation. If it gets merged it is a fairly clear indicator that some value is created. If it doesn't get merged then it may be adding some value or it may not.

dimitri-vs•4h ago

This might be an obvious question, but why is Claude Code not included?

csallen•4h ago

I believe these are all "background" agents that, by default, are meant to write code and issue pull requests without you watching/babysitting/guiding the process. I haven't used Claude Code in a while, but from what I recall, it's not that.

koakuma-chan•4h ago

Claude Code can run in the background, and I don't see why it wouldn't be able to create pull requests if you gave it such a tool.

cap11235•4h ago

The prompts in Claude Code have specific instructions on doing pull requests.

```
grep 'gh pr ' ~/.claude/local/node_modules/@anthropic-ai/claude-code/cli.js
- Create PR using gh pr create with the format below. Use a HEREDOC to pass the body to ensure correct formatting.
gh pr create --title "the pr title" --body "$(cat <<'EOF'
1. Use \`gh pr view --json number,headRepository\` to get the PR number and repository info
1. If no PR number is provided in the args, use ${O4.name}("gh pr list") to show open PRs
2. If a PR number is provided, use ${O4.name}("gh pr view <number>") to get PR details
3. Use ${O4.name}("gh pr diff <number>") to get the diff
```

cap11235•4h ago

If you enable it in permissions, Claude is very happy to do so. For personal fun/experimental projects (usually I give it arXiv papers to implement), I generally have a couple Claude instances (on different projects) just chugging along all day. I have them write really detailed plans at the start (50-100 steps in the implementation plan, plus actual specifications for project structure, dev practices, and what the actual goals are). I iterate on these plan documents by having Claude write QUESTIONS.md which has dev questions for me to clarify, which I fill out with answers, and then instruct Claude to update the plan docs with my answers. Then most of my interaction throughout the day is just saying something like "P18" to implement implementation plan step #18. I instruct it in CLAUDE.md to stop after each step, output what automated tests have been written for P18's features, and I require that the LLM write a demo script that I can run that shows the features, using real APIs. I'm having a great time with it.

ilteris•4h ago

How much do you pay monthly? What kind of service do you use? Thanks.

cap11235•4h ago

I'm on the $100 Max plan. The default config uses Opus up until some percent of capacity, then uses Sonnet after, which resulted in my having to wait for 30 minutes to an hour to reset usage after running them for 8-10 hours. I've since switched to configuring it to only use Sonnet, then for what I know are "big" questions, I'll run Opus for just that. Since then, I have yet to hit limits, so I don't feel the need for the $200 one.

unshavedyak•3h ago

I really need to try giving it a month at $100. Really not sure it's worth it, but if I'm less concerned about throttling or cost it might be more fun, interesting, etc.

cap11235•3h ago

It makes a psychological difference, yeah. I'm happy now just throwing any whim at it. For instance, I've been meaning for years to fix my disks, since every new computer just has me put the old drives into it, plus the new ones. Prior to the consolidation I had Claude do, the oldest was from 2007 (good job, Western Digital 20 years ago). I had Claude write a plan on how to move files to my most recent spinning disks, and also redo my mounts (for organization and improving my mount flags). I had it write the plan, I went "yeah", had it write a new fstab and a script to perform the moves in the project folder, had it "ultrathink" and web search a couple times to iterate on those for improvements it could suggest. Then I reviewed them, and had it apply the changes in a Claude instance with no automatic permissions beyond reading files in the project directory, so I manually approved each system modification.

furyofantares•3h ago

There's also a soft cap of 50 sessions per month, right?

cap11235•3h ago

Looks that way, but Anthropic docs vaguely say it is vague. I know I haven't hit any hard caps since only using Opus manually, but I wouldn't know if I'm being throttled otherwise, or at least it isn't severe enough that I notice given they just churn in the background.

a_bonobo•3h ago

I think the OP's page works because these coding agents identify themselves as the PR author, so the creator can just search GitHub's issue tracker for things like is:pr+head:copilot or is:pr+head:codex

It seems like Claude Code doesn't do that? Some preliminary searching reveals that PRs generated by people using Claude Code use their own user account but may sign that they used Claude, for example https://github.com/anthropics/claude-code/pull/1732
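A minimal sketch of the search approach described above, assuming the counts come from GitHub's issue-search endpoint (`/search/issues`) with the `is:pr`, `head:`, and `is:merged` qualifiers. The function name and the prefix list are hypothetical; only the `copilot` and `codex` prefixes come from the comment, and the tracker's real queries may differ. The sketch only builds URLs and makes no network calls.

```python
from urllib.parse import urlencode

# Head-branch prefixes mentioned in the comment; the list itself is illustrative.
HEAD_PREFIXES = ["copilot", "codex"]


def pr_search_url(head_prefix: str, merged_only: bool = False) -> str:
    """Build a GitHub issue-search URL for PRs whose head branch matches a prefix."""
    terms = ["is:pr", f"head:{head_prefix}"]
    if merged_only:
        terms.append("is:merged")
    # urlencode turns spaces into '+'; safe=":" keeps the qualifiers readable.
    query = urlencode({"q": " ".join(terms)}, safe=":")
    return "https://api.github.com/search/issues?" + query


for prefix in HEAD_PREFIXES:
    print(pr_search_url(prefix))
    print(pr_search_url(prefix, merged_only=True))
```

Comparing the result count of the second URL with the first would give a merge rate per agent, subject to the caveats raised elsewhere in the thread.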

cap11235•3h ago

Claude does credit itself in the commit messages, e.g.:

feat: add progress bar for token probability calculation

- Add optional progress_cb parameter to get_token_probs function

- Integrate `rich` progress bar in CLI showing real-time token processing progress

- Add comprehensive tests for progress callback functionality

- Maintain backward compatibility with optional parameter

Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
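As a rough illustration of how such a trailer could be detected programmatically, the sketch below scans a commit message for `Co-Authored-By:` lines. The regular expression and helper names are assumptions made for this example, not anything Claude Code or GitHub ships.

```python
import re

# Matches one "Co-Authored-By: Name <email>" line, case-insensitively.
TRAILER_RE = re.compile(
    r"^co-authored-by:[ \t]*(?P<name>[^<\n]+?)[ \t]*<(?P<email>[^>\n]+)>[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def co_authors(message: str) -> list[tuple[str, str]]:
    """Return (name, email) pairs for every co-author trailer in the message."""
    return [(m.group("name"), m.group("email")) for m in TRAILER_RE.finditer(message)]


def credits_claude(message: str) -> bool:
    """True if any co-author trailer names 'Claude'."""
    return any(name.lower() == "claude" for name, _ in co_authors(message))


example = """feat: add progress bar for token probability calculation

- Add optional progress_cb parameter to get_token_probs function

Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
"""

print(co_authors(example))
print(credits_claude(example))
```

Since the trailer is plain text that anyone can type or delete, a match is a hint about how the commit was produced, not proof.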

a_bonobo•3h ago

OK then OP can slightly change their site by using a different search term:

https://github.com/search?q=is:pr+is:merged+Co-Authored-By:+...

Instead of looking at the author of the PR, look for that 'Co-Authored-By: Claude' text bit.

That way I get 753 closed PRs and '1k' PRs in total, that's a pretty good acceptance rate.
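The arithmetic behind that estimate, as a small sketch. It treats the 753 figure as merged PRs (as the `is:merged` search suggests) and takes the displayed '1k' as exactly 1,000, which is an assumption, since the interface rounds the total.

```python
def merge_rate(merged: int, total: int) -> float:
    """Fraction of PRs that were merged."""
    if total <= 0:
        raise ValueError("total must be positive")
    return merged / total


merged_prs = 753
total_prs = 1000  # assumption: the displayed '1k' is a rounded figure

print(f"{merge_rate(merged_prs, total_prs):.1%}")  # prints 75.3%
```

Because the total is rounded, the result is best read as "roughly three quarters" rather than a precise rate.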

ofou•2h ago

I'd submit a PR with this idea to improve coverage of agents

behnamoh•4h ago

How about Google Jules?

Also, of course OpenAI Codex would perform well, because the tool is heavily tailored to this type of task, whereas Cursor is a more general-purpose (in the programming domain) tool/app.

tmvnty•4h ago

Merge rate is definitely a useful signal, but there are certainly other factors we need to consider (small vs. big PR edits, refactors vs. deps upgrades, direct merges, follow-up PRs correcting merged mistakes, how easy it is to set up these AI agents, marketing, usage fees, etc.). Similar to how NPM downloads alone don’t necessarily reflect a package’s true success or quality.

osigurdson•3h ago

I suspect most are pretty small. But hey, that is fine as long as they are making code bases a bit better.

osigurdson•3h ago

I've been underwhelmed with dedicated tools like Windsurf and Cursor in the sense that they are usually more annoying than just using ChatGPT. They have their niche, but they are just so incredibly flow-destroying that it is hard to use them for long periods of time.

I just started using Codex casually a few days ago though and already have 3 PRs. While different tools for different purposes make sense, Codex's fully async nature is so much nicer. It does simple things like improve consistency and make small improvements quite well, which is really nice. Finally we have something that operates more like an appliance for certain classes of problems. Previously it felt more like a teenager with a learner's license.

deadbabe•3h ago

You can just use Cursor as a chat assistant if you want.

threeseed•2h ago

But then you're paying far more than just using Claude web which can be used for tasks other than coding.

deadbabe•1h ago

Your company can be paying for it

koakuma-chan•1h ago

How do I convince my company to pay for it?

elliotec•3h ago

Have you tried Claude Code? I’m surprised it’s not in this analysis, but in my personal experience the competition doesn’t even touch it. I’ve tried them all in earnest. My toolkit has been (neo)vim and tmux for at least a decade now, so I understand the apprehension from less terminal-inclined folks who prefer other stuff, but it’s my jam and it just crushes it.

cap11235•2h ago

Right, after the Sonnet 4 release it was the first time I could tell an agent something and just let it run comfortably. As for the tool itself, I think a large part of its ability comes from how it writes recursive todo-lists for itself, which are shown to the user, so you can intervene early on the occasions it goes full Monkey's Paw.

lukehoban•3h ago

(Disclaimer: I work on coding agents at GitHub)

This data is great, and it is exciting to see the rapid growth of autonomous coding agents across GitHub.

One thing to keep in mind regarding merge rates is that each of these products creates the PR at a different phase of the work. So just tracking PR create to PR merge tells a different story for each product.

In some cases, the work to iterate on the AI generated code (and potentially abandon it if not sufficiently good) is done in private, and only pushed to a GitHub PR once the user decides they are ready to share/merge. This is the case for Codex for example. The merge rates for product experiences like this will look good in the stats presented here, even if many AI generated code changes are being abandoned privately.

For other product experiences, the Draft PR is generated immediately when a task is assigned, and users can iterate on this “in the open” with the coding agent. This creates more transparency into both the success and failure cases (including logs of the agent sessions for both). This is the case for GitHub Copilot coding agent for example. We believe this “learning in the open” is valuable for individuals, teams, and the industry. But it does lead to the merge rates reported here appearing worse - even if logically they are the same as “task assignment to merged PR” success rates for other tools.

We’re looking forward to continuing to evolve the notion of Draft PR to be even more natural for these use cases. And to enabling all of these coding agents to benefit from open collaboration on GitHub.
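To make the funnel argument concrete, here is a small sketch with entirely hypothetical numbers. Both workflows complete the same share of assigned tasks, but the one that opens a PR only after private iteration shows a much higher PR merge rate.

```python
from dataclasses import dataclass


@dataclass
class Funnel:
    tasks_assigned: int
    prs_opened: int
    prs_merged: int

    @property
    def pr_merge_rate(self) -> float:
        """What a PR-based tracker sees: merged PRs over opened PRs."""
        return self.prs_merged / self.prs_opened

    @property
    def task_success_rate(self) -> float:
        """Merged PRs over all tasks handed to the agent."""
        return self.prs_merged / self.tasks_assigned


# Hypothetical figures, chosen only to illustrate the effect.
private_iteration = Funnel(tasks_assigned=100, prs_opened=50, prs_merged=40)
draft_pr_up_front = Funnel(tasks_assigned=100, prs_opened=100, prs_merged=40)

for name, funnel in [("private iteration", private_iteration),
                     ("draft PR up front", draft_pr_up_front)]:
    print(f"{name}: PR merge rate {funnel.pr_merge_rate:.0%}, "
          f"task success rate {funnel.task_success_rate:.0%}")
```

With these made-up inputs the PR merge rates are 80% and 40%, while the task success rate is 40% in both cases.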

soamv•3h ago

This is a great point! But there's an important tradeoff here about human engineering time versus the "learning in the open" benefits; a PR discarded privately consumes no human engineering time, a fact that the humans involved might appreciate. How do you balance that tradeoff? Is there such a thing as a diff that's "too bad" to iterate on with a human?

ambicapter•2h ago

Do people where you work spend time reviewing draft PRs? I wouldn’t do that unless asked to by the author.

drawnwren•2h ago

It’s hard enough for me to get time to review actual PRs, who are these engineers trawling through the drafts?

lukehoban•1h ago

I do agree there is a balance here, and that the ideal point in the spectrum is likely in between the two product experiences that are currently being offered here. There are a lot of benefits to using PRs for the review and iteration: familiar diff UX, great comment/review feedback mechanisms, ability to run CI, visibility and auth tracked natively within GitHub, etc. But Draft PRs are also a little too visible by default in GitHub today, and there are times when you want a shareable PR link that isn't showing up by default on the Pull Requests list in GitHub for your repo. (I frankly want this even for human-authored Draft PRs, but it's even more compelling for agent-authored PRs.)

We are looking into paths where we can support this more personal/private kind of PR, which would provide the foundation within GitHub to support the best of both worlds here.

cjbarber•3h ago

Seems like the high-order bit impacting results here might be how difficult the PR is?

TZubiri•3h ago

Why are there 170k PRs for a product released last month, but 700 for a product that has been around for like 6 months and was so popular it got acquired for 3B?

simoncion•2h ago

It might be the case that "number of PRs" is roughly as good a metric as "number of lines of code produced".

throwaway314155•3h ago

Is this data not somewhat tainted by the fact that there's really zero way to identify how much a human was or wasn't "in the loop" before the PR was created?

tptacek•2h ago

I kind of wondered about that re: Devin vs. Cursor, because the people I know that happen to use Devin are also very hands-on with the code they end up merging.

But you could probably filter this a bit by looking at PR commit counts?
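A sketch of what such a filter might look like, assuming each PR record carries a commit count and a merged flag. The records below are made up, and the idea that extra commits indicate hands-on human iteration is itself an assumption that may not hold, since agents can also push several commits.

```python
from typing import Optional


def merge_rate(prs: list[dict]) -> Optional[float]:
    """Share of merged PRs, or None for an empty group."""
    if not prs:
        return None
    return sum(1 for pr in prs if pr["merged"]) / len(prs)


def split_by_commit_count(prs: list[dict], max_commits: int = 1):
    """Split PRs into those with at most max_commits commits and the rest."""
    few = [pr for pr in prs if pr["commits"] <= max_commits]
    many = [pr for pr in prs if pr["commits"] > max_commits]
    return few, many


# Hypothetical records for illustration only.
sample_prs = [
    {"number": 1, "commits": 1, "merged": True},
    {"number": 2, "commits": 1, "merged": False},
    {"number": 3, "commits": 4, "merged": True},
    {"number": 4, "commits": 6, "merged": True},
]

few, many = split_by_commit_count(sample_prs)
print("single-commit PRs:", merge_rate(few))
print("multi-commit PRs:", merge_rate(many))
```

Reporting the two groups separately would at least show whether the headline merge rate is driven by PRs that needed follow-up commits.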

thorum•2h ago

With Jules, I almost always end up making significant changes before approving the PR. So “successful merge” is not a great indicator of how well the model did in my case. I’ve merged PRs that were initially terrible after going in and fixing all the mistakes.

pryelluw•2h ago

Is it me or are there a lot of documentation related PRs? Not a majority, but enough to mask the impact of agent code.

zekone•2h ago

thanks for posting my project bradda

selvan•2h ago

Total PRs for Codex vs. Cursor are 208K vs. 705, which is an enormous difference in absolute numbers. Since Cursor is very popular, how are its PRs not even 1% of Codex's?

ezyang•1h ago

The happy path way of getting code out of Codex is a PR. This is emphatically not true for Cursor.

cap11235•1h ago

Feels like a sort of pollution.

rahimnathwani•1h ago

I didn't even realize Cursor could make PRs. I thought most people would create PRs themselves once they were happy with a series of commits.

zX41ZdbW•1h ago

It is also worth looking at the number of unique repositories for each agent, or the number of unique large repositories (e.g., by the threshold on the number of stars). Here is the report we can check:

https://play.clickhouse.com/play?user=play#V0lUSCByZXBvX3N0Y...

I've also added some less popular agents like jetbrains-junie, and added a link to a random pull request for each agent, so we can look at the example PRs.
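The per-agent repository count can be sketched in a few lines, assuming PR records of the form (agent, repository, stars). The records and the star threshold below are hypothetical and are not taken from the linked report.

```python
from collections import defaultdict


def unique_repos_per_agent(prs, min_stars: int = 0) -> dict[str, int]:
    """Count distinct repositories per agent, ignoring repos below min_stars."""
    repos = defaultdict(set)
    for agent, repo, stars in prs:
        if stars >= min_stars:
            repos[agent].add(repo)
    return {agent: len(names) for agent, names in repos.items()}


# Hypothetical (agent, repository, stars) records for illustration only.
sample = [
    ("codex", "alice/site", 3),
    ("codex", "alice/site", 3),       # second PR in the same repo
    ("codex", "bigorg/framework", 5200),
    ("copilot", "bob/tool", 150),
    ("copilot", "bigorg/framework", 5200),
]

print(unique_repos_per_agent(sample))
print(unique_repos_per_agent(sample, min_stars=1000))
```

Counting repositories rather than PRs keeps one very active repository from dominating an agent's totals.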

ehsanu1•49m ago

It's hard to attribute a higher PR merge rate to higher tool quality here. Another likely reason is the level of complexity of the task. Just looking at the first PR I saw from the GitHub search for Codex PRs, it was this one-line change that any tool, even years ago, could have easily accomplished: https://github.com/maruyamamasaya/yasukaribike/pull/20/files

nikolayasdf123•23m ago

Yeah, GitHub Copilot PRs are unusable, from personal experience.

ubj•17m ago

Where is Claude Code? Surprised to see it completely left out of this analysis.

NiekvdMaas•9m ago

Same for Google Jules

m3kw9•4m ago

Agents should also sign the PR with secret keys so people can’t just fake the commit message.

