Orchestrating AI code review at scale

https://blog.cloudflare.com/ai-code-review/

69•pramodbiligiri•3d ago

Comments

rzmmm•4h ago

> The entire system also runs locally.

I don't think approaches like this need to run anywhere other than locally, maybe integrated as a pre-push hook. The system is nondeterministic, so it's at odds with the purpose of CI.

fhd2•4h ago

I'd argue it's pretty much like monitoring, which certainly benefits from multiple people seeing the same stats and alerts. I agree it's at odds with CI/CD and should probably not block anything, like deterministic checks commonly do.

proofofcontempt•3h ago

I'm not sure the people integrating it into the CI process understand what CI is.

Scea91•3h ago

Same can be said about human review if the argument is non-determinism.

proofofcontempt•2h ago

Human review is about learning, and there's an implied social contract in that someone is giving you their time to make you better. It isn't strictly necessary, but replacing it with AI shows a fundamental misunderstanding of why it is part of the process.

plmpsu•4h ago

I built a more naive version for our team using Copilot and GitHub actions and it works quite well (wish I had metrics too). The team loves it.

The ROI here is so high that I don't mind using the strongest model available for the actual code review. I don't trust Sonnet and such. Just let Opus or GPT 5.5 do the whole thing and pay a bit more for less complexity.

neebz•3h ago

Do you also have separate prompts for each domain (security, architecture, etc.)?

Would love to look into it if any part of it is open source.

bob1029•3h ago

> One of the operational headaches we didn’t predict was that large, advanced models like Claude Opus 4.7 or GPT-5.4 can sometimes spend quite a while thinking through a problem, and to our users this can make it look exactly like a hung job.

I had the same problem in my recursive agent harness. It would always come back, but it could sometimes take up to 10 minutes depending. I fixed this by adding a required "purpose" argument to every tool and call/return event. As the recursive evaluation proceeds, every single thing that happens streams incremental purpose text to the user's browser (also using the magic of JSONL for this). The incremental progress events contain the purpose and a detail section (tool arg JSON) that the user can expand/collapse.
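A minimal sketch of the "required purpose" idea described above, assuming a plain Python harness. The names `emit`, `call_tool`, and `count_lines` are hypothetical and do not come from the commenter's code; the sketch writes JSONL to any text stream rather than to a browser, and it assumes tool arguments and results are JSON-serializable.

```python
import json
import sys
from typing import Any, Callable, TextIO


def emit(stream: TextIO, event: str, purpose: str, detail: dict[str, Any]) -> None:
    """Write one progress event as a single JSON line (JSONL) and flush it."""
    stream.write(json.dumps({"event": event, "purpose": purpose, "detail": detail}) + "\n")
    stream.flush()  # flush so a listener sees progress immediately


def call_tool(stream: TextIO, tool: Callable[..., Any], *, purpose: str, **args: Any) -> Any:
    """Run a tool, refusing calls that do not say why they are being made."""
    if not purpose.strip():
        raise ValueError("every tool call must state its purpose")
    # The "detail" section carries the raw arguments a UI could expand or collapse.
    emit(stream, "call", purpose, {"tool": tool.__name__, "args": args})
    result = tool(**args)
    emit(stream, "return", purpose, {"tool": tool.__name__, "result": result})
    return result


def count_lines(text: str) -> int:
    """Hypothetical stand-in for a real tool."""
    return len(text.splitlines())


if __name__ == "__main__":
    call_tool(sys.stdout, count_lines, purpose="Measure the size of the diff", text="a\nb\nc")
```

Because `purpose` is keyword-only and validated, a call without it fails immediately instead of producing a silent gap in the progress stream.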

derwiki•2h ago

Nice trick! I am doing something similar but passing those incremental updates to Haiku for a short user-friendly message.

thih9•3h ago

> Today, when an engineer at Cloudflare opens a merge request, it gets an initial pass from a coordinated smörgåsbord of AI agents.

I’d prefer to have that happen as some sort of pre-commit hook, before a merge request is sent. The feedback loop might be a bit faster and the process might produce less noise this way.

rhgraysonii•3h ago

Valid, but you lose the lived history that comes with the audit log: actual review back-and-forth and CI runs are preserved, rather than lost to a developer's machine and left only as a relic in the commit log. I can see both sides, though.

Zanfa•2h ago

Can you elaborate about the practical value of having the history of back and forth, in a PR or even in the commit log? In my 20ish years of experience, I can’t recall a single instance where I’ve solved something thanks to having this work-in-progress state persisted in the repo history.

It’s exclusively been the other way around: having a smaller number of larger squashed commits (post merge) is what has made the project more maintainable.

cush•22m ago

People usually squash merge anyways

derwiki•2h ago

My company has the AI review agents, and you can run them locally, but practically it’s easier to just open a merge request to have CI run the agents. Especially if you’re juggling a bunch of merge requests.

etothet•3h ago

What’s the over/under on when Cloudflare will acquire OpenCode (and keep it open source)?

faangguyindia•2h ago

What's the best workflow for solo devs?

criley2•19m ago

You can do basically the same thing as Cloudflare, except as a skill you run in your local harness. If you're going through the motions with PRs and are familiar with Actions, you can have it run in a GitHub Action instead. But this is basically just a skill. The Claude code review skill is a simple version of exactly this.

jmakov•2h ago

Every iteration, something can be found. How many times do you iterate, e.g. on performance: use an optimized struct, oh, you can change the architecture, etc.? At that point one can just have a while loop for the agents to make changes until no comments are left.
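The loop described above might look roughly like this. It is only a sketch: `review_until_clean`, `stub_review`, and `stub_revise` are hypothetical names, the stubs stand in for real agent calls, and the `max_rounds` cap is an added assumption, since a reviewer that always finds something would otherwise never terminate.

```python
from typing import Callable

Review = Callable[[str], list[str]]       # returns outstanding comments
Revise = Callable[[str, list[str]], str]  # returns code revised against comments


def review_until_clean(
    code: str, review: Review, revise: Revise, max_rounds: int = 5
) -> tuple[str, int, bool]:
    """Alternate review and revision; return (final_code, revisions_made, clean)."""
    for revisions in range(max_rounds):
        comments = review(code)
        if not comments:
            return code, revisions, True
        code = revise(code, comments)
    # Budget exhausted: report whether the last revision happens to be clean.
    return code, max_rounds, not review(code)


def stub_review(code: str) -> list[str]:
    """Hypothetical reviewer: flags TODO markers and debug prints."""
    comments = []
    if "TODO" in code:
        comments.append("resolve TODO")
    if "print(" in code:
        comments.append("remove debug print")
    return comments


def stub_revise(code: str, comments: list[str]) -> str:
    """Hypothetical dev agent: addresses only the first comment each round."""
    if comments[0] == "resolve TODO":
        return code.replace("# TODO", "# done")
    return "\n".join(line for line in code.splitlines() if "print(" not in line)


if __name__ == "__main__":
    final, rounds, clean = review_until_clean(
        "x = 1  # TODO\nprint(x)\ny = x", stub_review, stub_revise
    )
    print(rounds, clean)
    print(final)
```

Returning the `clean` flag separately lets a caller distinguish "no comments left" from "gave up after the budget", which matters if the reviewer keeps finding new things each round.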

suika•18m ago

As a solo dev or rather nowadays more so only a decision maker / agent overseer, I came to enjoy letting my agents develop against a Gerrit repository / workflow. Dev agent pushes a CL, review agent picks it up (not just the diff, but the full repo), runs tests/reviews/review-subagents and concludes by posting a review as well as a vote. This goes back and forth with new patch sets / replies to the threads. Eventually the CL gets a +2 or whatever and I have the final call to manually submit it. It is way slower compared to just pushing through development with one agent doing everything yolo against a normal repository, but it seems to me that the additional time is well spent (no, I don't have fancy graphs or similar analysis to prove this other than my gut feeling after looking at recent development results).
