Diffsitter – A Tree-sitter based AST difftool to get meaningful semantic diffs

https://github.com/afnanenayet/diffsitter

89•mihau•13h ago

Comments

fjfaase•13h ago

Discussed before on https://news.ycombinator.com/item?id=27875333

koozz•9h ago

I thought I'd seen it before. I use Difftastic myself, amazing diffs. https://github.com/Wilfred/difftastic

alwillis•2h ago

Same.

jbellis•9h ago

If you're looking for something more complete and actively maintained, check out https://github.com/GumTreeDiff/gumtree.

(I evaluated semantic diff tools for use in Brokk but I ultimately went with standard textual diff; the main hangup that I couldn't get past is that semantic diff understandably works very poorly when you have a syntactically invalid file due to an in-progress edit.)

pests•9h ago

I watched a video long ago about how the Roslyn C# compiler handled this but I forget the details.

pfdietz•8h ago

The interesting problem here would be how do you produce a robust parse tree for invalid inputs, in the sense of stably parsing large sections of the text in ways that don't change too much. The tree would have to be an extension of an actual parse tree, with nodes indicating sections that couldn't be fully parsed or had errors. The diff algorithm would have to also be robust in the face of such error nodes.

For the parsing problem, maybe something like Earley's algorithm that tries to minimize an error term?

You need this kind of robust parser for languages with preprocessors.

o11c•3h ago

Unfortunately, this depends on making good decisions during language design; it's not something you can retrofit with a new lexer and parser.

One very important rule is: no token can span more than one (possibly backslash-extended) line. This means having neither delimited comments (use multiple single-line comments; if your editor is too dumb for this you really need a new editor) nor multi-line strings (but you can do implicit concatenation of a string literal flavor that implicitly includes the newline; as a side-effect this fixes the indentation problem).

If you don't follow this rule, you might as well give up on robustness, because how else are you going to ever resynchronize after an error?

For parsing you can generally just aggressively pop on mismatched parens, unexpected semicolons, or on keywords only allowed in a top-ish level context. Of course, if your language is insane (like C typedefs), you might not be able to parse the next top-level function/class anyway. GNU statement-expressions, by contrast, are an actually useful thing that requires some thought. But again, language design choices can mitigate this (such as making classes values, template argument equivalent to array indexing, and statements expressions).
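The one-line-token rule above can be illustrated with a small sketch. It assumes a hypothetical toy token set (the names `TOKEN_RE`, `lex_line`, and `lex` are made up for illustration and do not come from any tool in this thread). Because no pattern can cross a newline, a lexing error is confined to the line it occurs on and the lexer resynchronizes at the next line.

```python
import re

# Hypothetical toy token set; every pattern is confined to a single line.
TOKEN_RE = re.compile(
    r"""
    (?P<ws>\s+)
  | (?P<comment>\#.*)
  | (?P<string>"(?:[^"\\]|\\.)*")
  | (?P<number>\d+)
  | (?P<name>[A-Za-z_]\w*)
  | (?P<punct>[()\[\]{};,=+\-*/])
    """,
    re.VERBOSE,
)


def lex_line(line):
    """Lex one line; on failure, emit one error token for the rest of the line."""
    tokens = []
    pos = 0
    while pos < len(line):
        match = TOKEN_RE.match(line, pos)
        if match is None:
            # The damage stops at the end of this line.
            tokens.append(("error", line[pos:]))
            break
        if match.lastgroup != "ws":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def lex(source):
    """Lex each line independently, so every newline is a resync point."""
    return [lex_line(line) for line in source.splitlines()]


if __name__ == "__main__":
    broken = 'x = 1\ny = "ok\nz = 2'  # unterminated string on line 2
    for number, tokens in enumerate(lex(broken), start=1):
        print(number, tokens)
```

With an unterminated string on the second line, only that line ends in an `error` token; the first and third lines lex exactly as they would in the corrected file.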

pfdietz•3h ago

> how else are you going to ever resynchronize after an error?

An error-cost-minimizing dynamic programming parser could do this.
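As a rough sketch of what "error-cost-minimizing" could mean, here is an interval dynamic program over a toy grammar of balanced brackets. It is an assumption-laden illustration, not Earley's algorithm and not anything a tool in this thread is known to implement: the only repair allowed is deleting a bracket, each deletion costs 1, and `min_deletions` is a hypothetical name.

```python
from functools import lru_cache

PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def min_deletions(text):
    """Fewest bracket deletions that leave `text` with balanced brackets.

    Recursive interval DP, so it is only meant for short inputs.
    """

    @lru_cache(maxsize=None)
    def cost(i, j):
        # Minimum repair cost for the slice text[i:j].
        if i >= j:
            return 0
        ch = text[i]
        if ch not in PAIRS and ch not in CLOSERS:
            # Non-bracket characters never need repair.
            return cost(i + 1, j)
        # Option 1: treat this bracket as an error and delete it.
        best = 1 + cost(i + 1, j)
        # Option 2: pair an opener with some later matching closer.
        if ch in PAIRS:
            for k in range(i + 1, j):
                if text[k] == PAIRS[ch]:
                    best = min(best, cost(i + 1, k) + cost(k + 1, j))
        return best

    return cost(0, len(text))


if __name__ == "__main__":
    for sample in ["f(a[1)", "(()", "([)]", "()[]{}"]:
        print(repr(sample), min_deletions(sample))
```

A real parser would attach costs to grammar rules rather than to single characters, but the shape is the same: every way of explaining the input gets a cost, and the cheapest explanation wins.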

o11c•1h ago

That fundamentally misunderstands the problem in multiple ways:

* this is still during lexing, not yet to parsing

* there are multiple valid token sequences that vary only with a single character at the start of the file. This is very common with Python multi-line strings in particular, since they are widely used as docstrings.
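The second point can be seen with a deliberately naive scanner. This is a sketch under stated assumptions: `classify_lines` is a hypothetical helper that only tracks `"""` delimiters and ignores `'''`, escapes, comments, and single-line strings. Dropping one quote character from the first line flips how every later line is classified.

```python
def classify_lines(source):
    """Label each line 'delim', 'string', or 'code' by toggling on triple quotes."""
    in_string = False
    labels = []
    for line in source.splitlines():
        if line.count('"""') % 2 == 1:
            # An odd number of delimiters opens or closes a multi-line string.
            in_string = not in_string
            labels.append("delim")
        else:
            labels.append("string" if in_string else "code")
    return labels


DOCUMENTED = '"""\nModule docstring\n"""\nx = 1\n"""\nanother\n"""\ny = 2\n'
# Same text with a single quote character removed from the first line.
ONE_CHAR_OFF = DOCUMENTED[1:]

if __name__ == "__main__":
    for left, right in zip(classify_lines(DOCUMENTED), classify_lines(ONE_CHAR_OFF)):
        print(f"{left:8} {right}")
```

In the first version `x = 1` is code and the docstring text is string content; in the second the roles are swapped all the way to the end of the file, which is why resynchronizing after such an edit is hard.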

ilyagr•4h ago

In case anybody happens to be interested in testing `gumtree` with https://github.com/jj-vcs/jj, I think I got them to work together. See https://github.com/GumTreeDiff/gumtree/wiki/VCS-Integration#... (assumes Docker).

affyboi•3h ago

Note that diffsitter isn’t abandoned or anything. I took a year off working and just started a new job so I’ve been busy. I’ve got a laundry list of stuff I want to do with this project that will get done (at some point)

the__alchemist•9h ago

Is there an anti-tree-sitter version too?

davepeck•8h ago

yes, although it's sort of the same as Context-Free-Typing-sitter

esafak•7h ago

Someone make a semantic diff splitter please! Break up big commits into small, atomic, meaningful ones.

0x457•6h ago

Well, that's what git-patch is: https://patch-diff.githubusercontent.com/raw/denoland/deno/p...

esafak•5h ago

I can't make sense of that link. How many parts was the diff split up into, and along what lines?

0x457•5h ago

Yeah, I don't know why I linked that as an example. Wanted to show structure of a patch. Each commit of a patch already has everything ready to be processed and chunked IF you keep them - small, atomic, semantically meaningful. As in do smaller commits.

ethan_smith•4h ago

Check out git-imerge or git-absorb which can help with this problem by intelligently splitting or absorbing changes into the right commits.

alwillis•2h ago

First time I used absorb was in Mercurial back in the day: https://gregoryszorc.com/blog/2018/11/05/absorbing-commit-ch...

pmkary•7h ago

What a genius idea.

affyboi•3h ago

Nah I think most people could make something like this in a weekend

vrm•6h ago

This is neat! I think in general there are really deep connections between semantically meaningful diffs (across modalities) and supervision of AI models. You might imagine a human-in-the-loop workflow where the human makes edits to a particular generation and then those edits are used as supervision for a future implementation of that thing. We did some related work here: https://www.tensorzero.com/blog/automatically-evaluating-ai-... on the coding use case but I'm interested in all the different approaches to the problem and especially on less structured domains.

dcre•6h ago

See also https://mergiraf.org/ for a tool that uses ASTs to resolve (some) merge conflicts.

Iwan-Zotow•6h ago

Integration with VS Code?

1-more•4h ago

ilyagr•4h ago

https://github.com/Wilfred/difftastic/wiki/Structural-Diffs is a nice list of alternatives.

Difftastic itself is great as well! The author wrote up nice posts about its design: https://www.wilfred.me.uk/blog/2022/09/06/difftastic-the-fan..., https://difftastic.wilfred.me.uk/diffing.html.
