Nope, no mention of how they do anything to alleviate overfitting. These benchmarks are getting tiresome.
I know this is focused solely on performance, but cost is a major factor here.
Story as old as time.
Apparently this is in support of their 2.0 release: https://www.qodo.ai/blog/introducing-qodo-2-0-agentic-code-r...
> We believe that code review is not a narrow task; it encompasses many distinct responsibilities that happen at once. [...]
> Qodo 2.0 addresses this with a multi-agent expert review architecture. Instead of treating code review as a single, broad task, Qodo breaks it into focused responsibilities handled by specialized agents. Each agent is optimized for a specific type of analysis and operates with its own dedicated context, rather than competing for attention in a single pass. This allows Qodo to go deeper in each area without slowing reviews down.
> To keep feedback focused, Qodo includes a judge agent that evaluates findings across agents. The judge agent resolves conflicts, removes duplicates, and filters out low-signal results. Only issues that meet a high confidence and relevance threshold make it into the final review.
> Qodo’s agentic PR review extends context beyond the codebase by incorporating pull request history as a first-class signal.
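For what it's worth, the pattern the blog describes (parallel specialized reviewers plus a judge that dedups and thresholds findings) is easy to sketch. This is just a toy illustration of that shape, not their implementation; the agent names, the Finding fields, and the 0.6 cutoff are all made up here:

```python
# Hypothetical sketch of "specialized agents + judge" review; everything below
# (agents, fields, thresholds) is assumed for illustration, not Qodo's code.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    category: str      # e.g. "security", "performance", "style"
    message: str
    confidence: float  # 0.0 - 1.0, as scored by the emitting agent

def security_agent(diff: str) -> list[Finding]:
    # Placeholder: a real agent would prompt a model with a security-focused
    # context and return structured findings.
    return [Finding("app.py", 42, "security", "SQL built via string concat", 0.9)]

def performance_agent(diff: str) -> list[Finding]:
    return [Finding("app.py", 42, "performance", "Query runs inside a loop", 0.7)]

def style_agent(diff: str) -> list[Finding]:
    return [Finding("app.py", 10, "style", "Function name shadows a builtin", 0.4)]

def judge(findings: list[Finding], min_confidence: float = 0.6) -> list[Finding]:
    """Keep the strongest finding per (file, line, category) and drop low-signal ones."""
    best: dict[tuple[str, int, str], Finding] = {}
    for f in findings:
        key = (f.file, f.line, f.category)
        if key not in best or f.confidence > best[key].confidence:
            best[key] = f
    return [f for f in best.values() if f.confidence >= min_confidence]

def review(diff: str, agents: list[Callable[[str], list[Finding]]]) -> list[Finding]:
    # Each agent gets the diff in its own pass (its own "dedicated context"),
    # then the judge filters the merged results.
    all_findings = [f for agent in agents for f in agent(diff)]
    return judge(all_findings)

if __name__ == "__main__":
    for f in review("fake diff", [security_agent, performance_agent, style_agent]):
        print(f"{f.file}:{f.line} [{f.category}] {f.message} ({f.confidence:.0%})")
```

The interesting part is all hidden inside the individual agents and the judge's scoring, of course; the orchestration itself is trivial.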
CuriouslyC•1h ago
Agents are pretty good at suggesting ways to improve a piece of code, though. If you get a bunch of agents to wear different hats and debate improvements to a piece of software, it can produce some very useful insights.
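Something like this, roughly. ask_llm, the hats, and the prompts are all placeholders here, not any particular product's API:

```python
# Rough sketch of the "different hats debating" idea; ask_llm is a stand-in
# for whatever model call you actually use, and the hats are invented.
def ask_llm(prompt: str) -> str:
    # Replace with a real model call (hosted API, local model, etc.).
    return "[model reply would go here]"

HATS = {
    "security reviewer": "Focus on injection, authz, and secrets handling.",
    "performance reviewer": "Focus on complexity, allocations, and I/O in loops.",
    "maintainer": "Focus on readability and long-term maintenance cost.",
}

def debate(code: str, rounds: int = 2) -> list[str]:
    """Each hat comments in turn, seeing the transcript so far, for a few rounds."""
    transcript: list[str] = []
    for _ in range(rounds):
        for hat, brief in HATS.items():
            prior = "\n".join(transcript) or "(no prior comments)"
            reply = ask_llm(
                f"You are the {hat}. {brief}\n"
                f"Code under review:\n{code}\n"
                f"Discussion so far:\n{prior}\n"
                "Add one concrete improvement or rebut an earlier point."
            )
            transcript.append(f"{hat}: {reply}")
    return transcript
```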