Agentic AI Code Review: From Confidently Wrong to Evidence-Based

https://platformtoolsmith.com/blog/agentic-ai-code-review/

2•sharp-dev•2h ago

Comments

sharp-dev•2h ago

TL;DR: AI code reviewer went from "confidently wrong" to actually useful. Fix: stopped pre-selecting context, gave the model tools to fetch evidence itself. Now it either cites file:line or stays quiet.

The Problem

Our AI reviewer flagged a "blocker." Cited the diff, built a plausible argument, suggested a fix. The senior engineer spent 20 minutes disproving it. Did the guard clause get missed? Two files away. The model never had that file, so it guessed and sounded certain. Pre-selecting context doesn't work. Code review follows evidence chains, and chains aren't predictable.

The Fix

Agentic loop:

Model: "This calls validate()" → search_code("validate") Model: "Two call sites use withRetry(). Third doesn't." → get_file_content("config/defaults.go") Model: "Missing timeout. Bug found." → submit_code_review(structured_output) Model fetches what it needs. Loop ends when it submits structured findings (path, line, severity, evidence—not prose).

What Changed

- Before: "This might break retries." - After: "In foo/bar.go:123, call bypasses withRetry(). Other call sites use it (see search results). Wrap or document."

The Pieces

1. Tools — boring, fast, deterministic. get_file_content, search_code. Treat them like production APIs. 2. Terminal action — structured JSON submission, not Markdown. No evidence? Can't submit. 3. Loop — model turn → tool turn → repeat. Aggressive context shrinking (old results truncated, diff stays). 4. Guardrails — iteration caps, timeouts, self-critique checklist.

Evaluation

Pick 5-10 PRs where you know the real risks. Check: - Found the issue? - Cited exact file:line? - Hallucinated anything? - Fetched evidence when uncertain? Iterate on tools, not prompts.

The Pattern

Don't build bigger prompts. Build a loop where the model can fetch evidence, test hypotheses, and submit only when it can cite sources. That's the difference between "sounds right" and "is right."

Yakuza creator's new game in doubt as NetEase pulls funding

Freestiler – PMTiles vector tilesets from R and Python

How not to test LLM models

Utilization metrics across accelerators (GPUs, TPUs, and so on)

Behavioral Effects of High Peak Power Microwave Pulses (1992) [pdf]

Microsoft Outlook app now showing paid spam/phishing ad's

Show HN: PDF to JPG converter that runs in the browser (no uploads)

Show HN: ClarifyDoc – explains contracts in plain English

Small web publishing tools and frameworks

Self-hosted docs platform – 4 PHP files, no database, free GitBook alternative

Ask HN: What should an international dev do today?

AI Agent Site Score Scanner

Can the mental health benefits of exercise be bottled?

Coasts: Localhost service isolation and orchestration for Git worktrees

China's AI progress by the numbers: GLM-5 benchmarks, robotaxi, and Huawei chips

Show HN: VectorLens – See why your RAG hallucinates, no config

Agentic Debt

Show HN: Dashboard for monitoring multiple Claude Code sessions

Neuroscientists have pinpointed a potential biological signature for psychopathy

60 Minutes Havana Syndrome report finds U.S. government tested energy weapon

Flexible feline spines shed light on "falling cat" problem

Iran Transformed

Agent Skill to Use a Debugger

EU publishers won a piece of a shrinking pie

Fukushima at 15: Living with radioactive hot spots and stigma

Show HN: ChopChopGo – Sigma-based threat hunting for Linux forensic artifacts

Animator Pro (Autodesk Animator) Source Code

We strongly oppose the Unified Attestation initiative

Oscar Pool Ballot, 98th Academy Awards

Advanced Pet Screen Drawing Techniques