I can see this being tolerant of simple renames, but it seems unlikely the hash will survive any real refactoring of the code.
You are right that AST hashing is brittle. That is why I wrote an engine that virtually executes the induction variables to determine that a `range` loop, a C-style `for` loop, and a raw `goto` loop all perform the same mathematical operation (iterate from 0 to N).
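Concretely, the three shapes look something like this (a simplified sketch, not the exact proof code; the `range`-over-int form needs Go 1.22+):

```go
package main

import "fmt"

// Three surface forms of "iterate 0 to N". The SCEV engine reduces
// all three to the same abstract loop, so they fingerprint identically.
func sumRange(n int) int {
	total := 0
	for i := range n { // range-over-int (Go 1.22+)
		total += i
	}
	return total
}

func sumCStyle(n int) int {
	total := 0
	for i := 0; i < n; i++ { // classic C-style loop
		total += i
	}
	return total
}

func sumGoto(n int) int {
	total, i := 0, 0
loop:
	if i < n { // raw goto loop
		total += i
		i++
		goto loop
	}
	return total
}

func main() {
	fmt.Println(sumRange(10), sumCStyle(10), sumGoto(10)) // 45 45 45
}
```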
I just pushed a proof to the repo that runs exactly those three scenarios; all three produce the same SHA-256 fingerprint.
It also handles control-flow normalization, so `if a > b { return 1 } else { return 0 }` fingerprints identically to `if a <= b { return 0 } else { return 1 }` (inverted condition + swapped branches).
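Written out as standalone functions, that pair looks like this (illustrative):

```go
package example

// pickA and pickB differ only by an inverted condition and swapped
// branches; the normalizer reduces both to the same canonical CFG.
func pickA(a, b int) int {
	if a > b {
		return 1
	}
	return 0
}

func pickB(a, b int) int {
	if a <= b { // condition inverted...
		return 0 // ...branches swapped
	}
	return 1
}
```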
It won't catch O(n) vs O(log n) algorithm changes, but it catches the "syntactic sugar" refactors that make up 90% of code churn.
You can view the proof code here:
https://github.com/BlackVectorOps/semantic_firewall/blob/mai...
Or run it yourself:
`go run examples/proof.go`
I built this because I've become paranoid about "safe" refactors in the wake of supply chain attacks like the xz backdoor.
We spend a lot of time reviewing code for syntax, but we lack good tools for verifying that a large refactor (e.g., renaming variables, changing loop styles) preserves the exact business logic. Standard SHA-256 hashes break if you change a single whitespace character or variable name, which makes them useless for verifying semantic equivalence.
I built Semantic Firewall (sfw) to solve this. It is an open-source tool that fingerprints Go code based on its behavior, not its bytes.
How it works:
1. SSA Conversion: It loads the Go source into Static Single Assignment form using golang.org/x/tools/go/ssa (a minimal loading sketch follows this list).
2. Canonicalization: It renames registers (v0, v1) deterministically and normalizes control-flow graphs. This ensures that `if a { x } else { y }` fingerprints the same even when the branches are swapped and the condition inverted.
3. Scalar Evolution (SCEV): This was the hardest part. I implemented an SCEV engine that mathematically solves loop trip counts. This means a `for range` loop and a `for i++` loop that iterate N times produce the exact same fingerprint.
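For reference, here is a minimal sketch of step 1 (standard x/tools plumbing, not sfw's exact code):

```go
package main

import (
	"fmt"
	"log"

	"golang.org/x/tools/go/packages"
	"golang.org/x/tools/go/ssa"
	"golang.org/x/tools/go/ssa/ssautil"
)

func main() {
	// Load the target package with full syntax and type information.
	cfg := &packages.Config{Mode: packages.LoadAllSyntax}
	pkgs, err := packages.Load(cfg, ".")
	if err != nil {
		log.Fatal(err)
	}

	// Build SSA form for every function in the loaded packages.
	prog, _ := ssautil.AllPackages(pkgs, ssa.SanityCheckFunctions)
	prog.Build()

	// A fingerprinting pass would walk each function from here,
	// canonicalize registers and block order, then hash the result.
	for fn := range ssautil.AllFunctions(prog) {
		fmt.Println(fn.Name())
	}
}
```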
Here is a quick example of what it catches:
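(A representative pair, simplified for the post; the function names are illustrative.)

```go
package example

// Both functions walk buf with the same trip count and body,
// so sfw assigns them the same fingerprint.
func sumIndexed(buf []byte) int {
	total := 0
	for i := 0; i < len(buf); i++ { // C-style index loop
		total += int(buf[i])
	}
	return total
}

func sumRanged(buf []byte) int {
	total := 0
	for _, b := range buf { // range loop over the same slice
		total += int(b)
	}
	return total
}
```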
These two produce identical hashes. If you change the logic (e.g. `i < len(buf)-1`), the hash diverges immediately.

It's written in Go and available as a CLI or GitHub Action. I'd love to hear your thoughts on the approach or edge cases I might have missed in the normalization phase.
Repo: https://github.com/BlackVectorOps/semantic_firewall