Claude Can (Sometimes) Prove It

https://www.galois.com/articles/claude-can-sometimes-prove-it

23•lairv•2d ago

Comments

r0ze-at-hn•14m ago

> What I’ve found is that given a tool that can detect mistakes, the agent can often correct them

This is the most important line of the entire article.

When iterating on a Manifesto for AI Software Development (https://metamagic.substack.com/p/manifesto-for-ai-software-d...) over the last two years, the key attribute I found, more than any other, was empirical validation. AI (and humans) cannot accurately judge their own work, but when we give AI (and humans) the ability to validate empirically, the success rate skyrockets. This might be intuitive, but there are still papers testing whether it applies to AI too. Instead of reaching for AI-written unit tests, I've been embracing fuzzing, because then the AI can't cheat with bonus tests. The idea of reaching back to school and using interactive theorem proving didn't even cross my mind, and now that it has been presented it is a whole paradigm shift in how to push my AI use forward so it can work even more autonomously.

AI can iterate at speeds humans can't. Even basic empirical validation (building the code, running tests) takes the human out of the loop. Move to fuzzing (such as with Go) and you get way better coverage and way more progress before a human has to intervene. So it isn't a surprise that interactive theorem proving is a perfect match for AI.
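To make the fuzzing point concrete, here is a minimal sketch of the idea in Python. It is not Go's native fuzzer, just a seeded random-input loop, and every name in it (`rle_encode`, `rle_decode`, `fuzz_roundtrip`) is hypothetical and chosen for illustration. The point is that the check is a property over generated inputs rather than a handful of hand-picked cases, so there is nothing for the agent to special-case.

```python
import random
from itertools import groupby


def rle_encode(text):
    """Run-length encode a string into (character, count) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


def rle_decode(pairs):
    """Invert rle_encode: expand (character, count) pairs back into a string."""
    return "".join(ch * count for ch, count in pairs)


def fuzz_roundtrip(iterations=1000, seed=0):
    """Feed random strings through encode/decode and check the round trip.

    Returns the first failing input, or None if every input survived.
    A small alphabet is an assumption made here so that repeated runs
    of the same character are common.
    """
    rng = random.Random(seed)
    for _ in range(iterations):
        length = rng.randint(0, 20)
        text = "".join(rng.choice("ab ") for _ in range(length))
        if rle_decode(rle_encode(text)) != text:
            return text  # counterexample the agent can act on
    return None


if __name__ == "__main__":
    failure = fuzz_roundtrip()
    print("no counterexample found" if failure is None else repr(failure))
```

A returned counterexample is exactly the kind of tool feedback the article's quote describes: a concrete failing input the agent can use to correct its own work.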

It is interesting how this same lesson plays out elsewhere, earlier in the article:

> Why is ITP so hard? Some reasons are quite obvious: interfaces are confusing, libraries are sparse, documentation is poor, error messages are mysterious. But these aren’t interesting problems

Remember when LLVM's Clang got really good C++ error messages and it was life changing? High-quality error messages meant we could find and fix errors fast and iterate fast. These are actually the MOST interesting problems, because they enable the user to learn faster. When users have high success, they use a product again and again. High-quality error messages in all tools will let Claude Code work longer on problems without human intervention, make fewer mistakes, and work faster overall.

Error messages should always be good, but a new question that really hammers this home is: "When AI encounters this error message, can it fix the problem?"

