Show HN: Symbolic Circuit Distillation: prove program to LLM circuit equivalence

https://github.com/neelsomani/symbolic-circuit-distillation

16•nsomani•1mo ago

Hi HN, I've been exploring various applications of formal methods to ML/interpretability and I've been hoping to get more eyes on the approach.

I have been working on a small interpretability project I call Symbolic Circuit Distillation. The goal is to take a tiny neuron-level circuit (like the ones in OpenAI's "Sparse Circuits" work) and automatically recover a concise Python program that implements the same algorithm, along with a bounded formal proof that the two are equivalent on a finite token domain.

Roughly, the pipeline is:

1. Start from a pruned circuit graph for a specific behavior (e.g. quote closing or bracket depth) extracted from a transformer. 2. Treat the circuit as an executable function and train a tiny ReLU network ("surrogate") that exactly matches the circuit on all inputs in a bounded domain (typically sequences of length 5–10 over a small token alphabet). 3. Search over a constrained DSL of common transformer motifs (counters, toggles, threshold detectors, small state machines) to synthesize candidate Python programs. 4. Use SMT-based bounded equivalence checking to either: - Prove that a candidate program and the surrogate agree on all inputs in the domain, or - Produce a counterexample input that rules the program out.

If the solver finds a proof, you get a small, human-readable Python function plus a machine-checkable guarantee that it matches the original circuit on that bounded domain.

Why I built this

Mechanistic interpretability has gotten pretty good at extracting "small crisp circuits" from large models, but turning those graphs into clean, human-readable algorithms is still very manual. My goal here is to automate that last step: go from "here is a sparse circuit" to "here is a verified algorithm that explains what it does", without hand-holding.

What works today

- Tasks: quote closing and bracket-depth detection from the OpenAI circuit_sparsity repo. - Exact surrogate fitting on a finite token domain. - DSL templates for simple counters, toggles, and small state machines. - SMT-based bounded equivalence between: sparse circuit -> ReLU surrogate -> Python program in the DSL.

Limitations and open questions

- The guarantees are bounded: equivalence is only proven on a finite token domain (short sequences and a small vocabulary). - Currently focused on very small circuits. Scaling to larger circuits and longer contexts is open engineering and research work. - The DSL is hand-designed around a few motifs. I am not yet learning the DSL itself or doing anything very clever in the search.

What I would love feedback on

- Are the problem framing and guarantees interesting to people working on mechanistic interpretability or formal methods? - Suggestions for next benchmarks: which circuits or behaviors would you want to see distilled next? - Feedback on the DSL design, search strategy, and SMT setup.

Happy to answer questions about implementation details, the SMT encoding, integration with OpenAI's Sparse Circuits repo, or anything else.

Comments

aappleby•1mo ago

No examples in the readme?

nsomani•1mo ago

There are two examples provided - quote matching and bracket closing.

RFCs vs. READMEs: The Evolution of Protocols

Kanchipuram Saris and Thinking Machines

Chinese chemical supplier causes global baby formula recall

I've used AI to write 100% of my code for a year as an engineer

Looking for 4 Autistic Co-Founders for AI Startup (Equity-Based)

AI-native capabilities, a new API Catalog, and updated plans and pricing

What changed in tech from 2010 to 2020?

From Human Ergonomics to Agent Ergonomics

Advanced Inertial Reference Sphere

Toyota Developing a Console-Grade, Open-Source Game Engine with Flutter and Dart

Typing for Love or Money: The Hidden Labor Behind Modern Literary Masterpieces

Show HN: A longitudinal health record built from fragmented medical data

CoreWeave's $30B Bet on GPU Market Infrastructure

Creating and Hosting a Static Website on Cloudflare for Free

"The Stanford scam proves America is becoming a nation of grifters"

Elon Musk on Space GPUs, AI, Optimus, and His Manufacturing Method

X (Twitter) is back with a new X API Pay-Per-Use model

Zlob.h 100% POSIX and glibc compatible globbing lib that is faste and better

Show HN: Deterministic signal triangulation using a fixed .72% variance constant

Scientists Discover Levitating Time Crystals You Can Hold, Defy Newton’s 3rd Law

When Michelangelo Met Titian

Solving NYT Pips with DLX

Baldur's Gate to be turned into TV series – without the game's developers

Interview with 'Just use a VPS' bro (OpenClaw version) [video]

EchoJEPA: Latent Predictive Foundation Model for Echocardiography

Disablling Go Telemetry

Effective Nihilism

The UK government didn't want you to see this report on ecosystem collapse

No 10 blocks report on impact of rainforest collapse on food prices

Seedance 2.0 Is Coming