I have been working on a small interpretability project I call Symbolic Circuit Distillation. The goal is to take a tiny neuron-level circuit (like the ones in OpenAI's "Sparse Circuits" work) and automatically recover a concise Python program that implements the same algorithm, along with a bounded formal proof that the two are equivalent on a finite token domain.
Roughly, the pipeline is:
1. Start from a pruned circuit graph for a specific behavior (e.g. quote closing or bracket depth) extracted from a transformer.
2. Treat the circuit as an executable function and train a tiny ReLU network ("surrogate") that exactly matches the circuit on all inputs in a bounded domain (typically sequences of length 5–10 over a small token alphabet).
3. Search over a constrained DSL of common transformer motifs (counters, toggles, threshold detectors, small state machines) to synthesize candidate Python programs.
4. Use SMT-based bounded equivalence checking to either:
- Prove that a candidate program and the surrogate agree on all inputs in the domain, or
- Produce a counterexample input that rules the program out.
If the solver finds a proof, you get a small, human-readable Python function plus a machine-checkable guarantee that it matches the original circuit on that bounded domain.
Why I built this
Mechanistic interpretability has gotten pretty good at extracting "small crisp circuits" from large models, but turning those graphs into clean, human-readable algorithms is still very manual. My goal here is to automate that last step: go from "here is a sparse circuit" to "here is a verified algorithm that explains what it does", without hand-holding.
What works today
- Tasks: quote closing and bracket-depth detection from the OpenAI circuit_sparsity repo.
- Exact surrogate fitting on a finite token domain.
- DSL templates for simple counters, toggles, and small state machines.
- SMT-based bounded equivalence between:
sparse circuit -> ReLU surrogate -> Python program in the DSL.
Limitations and open questions
- The guarantees are bounded: equivalence is only proven on a finite token domain (short sequences and a small vocabulary).
- Currently focused on very small circuits. Scaling to larger circuits and longer contexts is open engineering and research work.
- The DSL is hand-designed around a few motifs. I am not yet learning the DSL itself or doing anything very clever in the search.
What I would love feedback on
- Are the problem framing and guarantees interesting to people working on mechanistic interpretability or formal methods?
- Suggestions for next benchmarks: which circuits or behaviors would you want to see distilled next?
- Feedback on the DSL design, search strategy, and SMT setup.
Happy to answer questions about implementation details, the SMT encoding, integration with OpenAI's Sparse Circuits repo, or anything else.
nsomani•56m ago
I have been working on a small interpretability project I call Symbolic Circuit Distillation. The goal is to take a tiny neuron-level circuit (like the ones in OpenAI's "Sparse Circuits" work) and automatically recover a concise Python program that implements the same algorithm, along with a bounded formal proof that the two are equivalent on a finite token domain.
Roughly, the pipeline is:
1. Start from a pruned circuit graph for a specific behavior (e.g. quote closing or bracket depth) extracted from a transformer. 2. Treat the circuit as an executable function and train a tiny ReLU network ("surrogate") that exactly matches the circuit on all inputs in a bounded domain (typically sequences of length 5–10 over a small token alphabet). 3. Search over a constrained DSL of common transformer motifs (counters, toggles, threshold detectors, small state machines) to synthesize candidate Python programs. 4. Use SMT-based bounded equivalence checking to either: - Prove that a candidate program and the surrogate agree on all inputs in the domain, or - Produce a counterexample input that rules the program out.
If the solver finds a proof, you get a small, human-readable Python function plus a machine-checkable guarantee that it matches the original circuit on that bounded domain.
Why I built this
Mechanistic interpretability has gotten pretty good at extracting "small crisp circuits" from large models, but turning those graphs into clean, human-readable algorithms is still very manual. My goal here is to automate that last step: go from "here is a sparse circuit" to "here is a verified algorithm that explains what it does", without hand-holding.
What works today
- Tasks: quote closing and bracket-depth detection from the OpenAI circuit_sparsity repo. - Exact surrogate fitting on a finite token domain. - DSL templates for simple counters, toggles, and small state machines. - SMT-based bounded equivalence between: sparse circuit -> ReLU surrogate -> Python program in the DSL.
Limitations and open questions
- The guarantees are bounded: equivalence is only proven on a finite token domain (short sequences and a small vocabulary). - Currently focused on very small circuits. Scaling to larger circuits and longer contexts is open engineering and research work. - The DSL is hand-designed around a few motifs. I am not yet learning the DSL itself or doing anything very clever in the search.
What I would love feedback on
- Are the problem framing and guarantees interesting to people working on mechanistic interpretability or formal methods? - Suggestions for next benchmarks: which circuits or behaviors would you want to see distilled next? - Feedback on the DSL design, search strategy, and SMT setup.
Happy to answer questions about implementation details, the SMT encoding, integration with OpenAI's Sparse Circuits repo, or anything else.