Over the past few months we’ve completely rebuilt our detection engine, and I wanted to share a few things we did to get more out of LLMs.
Context: cubic specializes in code reviews for teams with complex codebases, like Better Auth, Cal.com, and PostHog. Our users have high standards. It’s important that reviews have real depth and actually understand the codebase.
In the past, we sometimes struggled to produce reviews with deep insight into complex changes. It didn't feel like we were leaving comments that truly understood both the codebase and the intent behind the PR. Pushing reasoning to the max could get there, but a single review would often take 15+ minutes, which many users disliked.
We've spent the last few months rebuilding our AI review engine, redoing how the reviewer works piece by piece.
Like the Ship of Theseus, cubic ended up so different (and better) that we're releasing it as cubic 2.0.
I should say up front: I'm biased because I work on this, and part of why I'm posting is to raise awareness. But the main reason is that the work feels broadly useful if you're building anything LLM-based that needs both quality and speed.
*Why this is a "2.0"*
We were optimizing for two things:
1. Higher-signal reviews (comments people actually act on)
2. Lower latency
Quality: three months ago, about 20% of the comments cubic left were addressed by a developer. (We measure this by having an LLM look at the commits that follow a cubic comment and judge whether the change implemented what cubic flagged.) Today that number is 60%+, and for some teams it's over 90%.
Speed: median time to review a PR roughly halved, and P90 dropped to about a third of what it was.
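
To make the "addressed rate" metric concrete, here's a minimal sketch of how such a measurement can be computed. This is illustrative Python, not our actual pipeline: the judge callable and the data shapes are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class ReviewComment:
    body: str                  # what the review comment flagged
    later_commits: list[str]   # diffs pushed to the PR after the comment

def addressed_rate(
    comments: Iterable[ReviewComment],
    judge: Callable[[str, list[str]], bool],  # e.g. an LLM call answering: "do these commits implement what was flagged?"
) -> float:
    """Share of review comments that the judge considers addressed by later commits."""
    comments = list(comments)
    if not comments:
        return 0.0
    hits = sum(1 for c in comments if c.later_commits and judge(c.body, c.later_commits))
    return hits / len(comments)
```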
*What we changed (the parts that mattered)*
1. Pre-mapping the codebase ("AI wiki")
A big inefficiency in LLM code review (and code writing) is that every PR forces the model to rebuild a mental map of the repo from scratch. In large repos, just figuring out "where am I" can consume a lot of context and tokens before you even get to reasoning about the diff.
We built an "AI wiki" that pre-maps the important parts of a codebase and reuses that as context for reviews.
As a side effect, the wiki is also useful to humans (and to AIs via MCP), especially for onboarding or for non-technical people trying to understand a system. Here's an example for Firecrawl: https://www.cubic.dev/wikis/firecrawl/firecrawl
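
Roughly, the idea is: build the map once (or periodically), persist it, and feed the relevant slices into each review instead of re-exploring the repo per PR. A simplified sketch; the file names, cache location, and structure are illustrative rather than our actual implementation.

```python
import json
from pathlib import Path

WIKI_PATH = Path(".cubic-wiki.json")   # hypothetical cache location

def build_wiki(repo_root: Path, suffixes=(".ts", ".tsx", ".py")) -> dict:
    """One-time / periodic pass over the repo. In practice each entry would be an
    LLM-written summary of the module; here a short preview stands in for it."""
    wiki = {}
    for src in repo_root.rglob("*"):
        if not src.is_file() or src.suffix not in suffixes:
            continue
        rel = str(src.relative_to(repo_root))
        wiki[rel] = {"summary": src.read_text(errors="ignore")[:300]}  # stand-in for a real summary
    WIKI_PATH.write_text(json.dumps(wiki))
    return wiki

def review_context(changed_files: list[str]) -> str:
    """Reuse the pre-built map instead of rediscovering the repo on every PR."""
    wiki = json.loads(WIKI_PATH.read_text())
    relevant = {path: wiki[path] for path in changed_files if path in wiki}
    return "Pre-computed codebase map:\n" + json.dumps(relevant, indent=2)
```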
2. External context tools, plus getting tool usage under control
We added tools to fetch external documentation when needed. The hard part wasn't adding the tool; it was getting the model to use it correctly. That took a lot of prompt iteration and guardrails, and it ended up mattering more than we expected.
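
To give a flavor of what "guardrails" means here: the tool description constrains when the model should reach for it, and a check outside the prompt rejects calls that slip through anyway. The tool name, schema, and limits below are illustrative (a generic function-calling-style spec), not our real configuration.

```python
# Illustrative function-calling-style tool spec; names and wording are assumptions.
FETCH_DOCS_TOOL = {
    "name": "fetch_external_docs",
    "description": (
        "Fetch documentation for a third-party package. Use ONLY when the diff calls "
        "an external API whose behavior you cannot infer from the codebase map. "
        "Never use it for first-party code."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "package": {"type": "string"},
            "symbol": {"type": "string", "description": "Function or class being called"},
        },
        "required": ["package"],
    },
}

MAX_DOC_FETCHES_PER_REVIEW = 3  # assumed guardrail value

def allow_tool_call(package: str, first_party_packages: set[str], calls_so_far: int) -> bool:
    """Guardrail enforced outside the prompt: refuse calls the model shouldn't make."""
    if calls_so_far >= MAX_DOC_FETCHES_PER_REVIEW:
        return False
    if package in first_party_packages:
        return False  # the codebase map already covers first-party code
    return True
```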
3. Learning loop, with more weight on senior reviewers
We leaned harder into learning from user interactions and feedback. One change that helped a lot recently was identifying the senior reviewers in an org and weighting their feedback more heavily. In practice, that made the system converge faster toward what "good" looks like for that team.
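
A simplified version of that weighting: per-rule acceptance scores decide which kinds of comments keep getting surfaced, and senior reviewers' reactions count more. The weight, field names, and the notion of a "rule" here are illustrative, not how our learning loop is actually structured.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Feedback:
    rule: str          # e.g. "missing-await", "unvalidated-input"
    reviewer: str
    accepted: bool     # did the reviewer act on / keep the comment?

def rule_scores(feedback: list[Feedback], senior_reviewers: set[str],
                senior_weight: float = 3.0) -> dict[str, float]:
    """Weighted acceptance score per rule; senior reviewers count more.
    Scores near 1.0 -> keep surfacing this kind of comment; near 0.0 -> suppress it."""
    totals: dict[str, float] = defaultdict(float)
    accepted: dict[str, float] = defaultdict(float)
    for f in feedback:
        w = senior_weight if f.reviewer in senior_reviewers else 1.0
        totals[f.rule] += w
        if f.accepted:
            accepted[f.rule] += w
    return {rule: accepted[rule] / totals[rule] for rule in totals}
```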
4. Sandbox snapshotting for large repos
On larger repos, we were wasting minutes on setup work (clone time, environment prep), and we were doing it for every PR. We added a snapshotting approach that cut a lot of that overhead.
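
The shape of it: do the expensive clone and environment prep once per repo, snapshot the result, and restore that snapshot for each PR. The sketch below uses a plain tarball as the snapshot mechanism, which is a stand-in for whatever the sandbox provider actually offers; paths and structure are hypothetical.

```python
import subprocess
import tarfile
from pathlib import Path

SNAPSHOT_DIR = Path("/var/cache/review-snapshots")  # hypothetical location

def snapshot_path(repo: str) -> Path:
    return SNAPSHOT_DIR / (repo.replace("/", "__") + ".tar")

def prepare_or_restore(repo: str, workdir: Path) -> None:
    """Restore a prepared checkout from a snapshot if one exists; otherwise do the
    expensive clone + setup once and snapshot the result for future PRs."""
    snap = snapshot_path(repo)
    workdir.mkdir(parents=True, exist_ok=True)
    if snap.exists():
        with tarfile.open(snap) as tar:
            tar.extractall(workdir)  # fast path: skip clone and environment prep
        subprocess.run(["git", "-C", str(workdir / "repo"), "fetch", "--depth", "1", "origin"], check=True)
        return
    subprocess.run(["git", "clone", f"https://github.com/{repo}.git", str(workdir / "repo")], check=True)
    # ...environment prep (dependency install, build caches) would go here...
    SNAPSHOT_DIR.mkdir(parents=True, exist_ok=True)
    with tarfile.open(snap, "w") as tar:
        tar.add(workdir / "repo", arcname="repo")
```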
Anyway, thanks for reading. Happy to answer questions about any of the above. I’d also love feedback from people who’ve tried AI code review tools:
* What made you keep them, or turn them off?
* What metrics would you trust to measure "review quality"?
* Where do current tools fail in ways that are genuinely harmful?
cubic is here (and free for public repos): https://cubic.dev/home