DeepCodeBench: Real-World Codebase Understanding by Q&A Benchmarking

https://www.qodo.ai/blog/deepcodebench-real-world-codebase-understanding-by-qa-benchmarking/

84•blazercohen•5mo ago

Comments

four_fifths•4mo ago

If you do a bit of digging into most of the popular benchmarks that all the big labs report on, you'll see pretty quickly that they have almost zero correlation with any real world tasks.

The approach that they're taking here of working backwards from an open-source repo's pull request and reverse engineering a question is unusually well thought out for a benchmark.

I haven't dug into more of the dataset questions yet, but the example they give in the blog post for the question generated for Hugging Face's Transformers repo gives me hope that this could actually be a solid benchmark:

> How do the fast image and video processor base classes prevent shared mutable state when instantiating multiple instances?

qsort•4mo ago

I particularly like their usage of LLM-as-a-judge. They don't go "hey chatgpt, sort these from best to worst based on vibes"; rather, they extract a set of ground truths and check how the answer compares, a task that SOTA LLMs can do fairly reliably. It's a very smart way to circumvent the problems introduced by pure LLM-as-a-judge methods.
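
To make that concrete, here's a rough sketch of what fact-based judging could look like. This is not Qodo's implementation: `fact_recall`, `keyword_judge`, and the toy facts are all hypothetical names I made up, and the keyword check is only a stand-in for the per-fact LLM call a real judge would make.

```python
from typing import Callable, List


def fact_recall(answer: str, facts: List[str],
                judge: Callable[[str, str], bool]) -> float:
    """Return the fraction of ground-truth facts the judge finds in the answer."""
    if not facts:
        raise ValueError("need at least one ground-truth fact")
    supported = sum(1 for fact in facts if judge(answer, fact))
    return supported / len(facts)


def keyword_judge(answer: str, fact: str) -> bool:
    """Hypothetical stand-in for an LLM call.

    A fact counts as supported only if every word of the fact appears
    in the answer. A real judge would ask a model a yes/no question here.
    """
    answer_words = set(answer.lower().split())
    return all(word in answer_words for word in fact.lower().split())


# Toy ground truths and answer, invented for illustration only.
facts = ["cache is cleared", "lock is held"]
answer = "on write the cache is cleared before returning"

score = fact_recall(answer, facts, keyword_judge)
print(score)
```

The point is that the judge only answers one narrow yes/no question per fact, so the final score is a simple count rather than a holistic ranking.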

Tiberium•4mo ago

Seems like an interesting benchmark, but my takeaway from the results is that Codex is almost as good as their custom solution (no mention of the underlying model) and only requires a $20 ChatGPT subscription to start using it (of course with limits), without having to shell out $$$ for an enterprise Qodo plan to use Qodo Aware - https://www.qodo.ai/products/qodo-aware/. The "free" plan in Qodo Aware only lets users work with 100 hand-picked open-source repositories.

It also would be nice if the article clearly mentioned what specific model settings were used for Claude Code and Codex. Both of those allow changing the reasoning level, so if the benchmark was done using the default settings, it seems a little unfair - they have a result of their own agent at high reasoning as a separate entry.

esafak•4mo ago

This is in relation to their newly-announced "context agent": https://www.qodo.ai/blog/introducing-qodo-aware-deep-codebas...

asdev•4mo ago

Agentic search is good enough for code search and code understanding; indexing/fancy techniques will only slightly outperform it for a lot more effort.
