(I evaluated semantic diff tools for use in Brokk but I ultimately went with standard textual diff; the main hangup that I couldn't get past is that semantic diff understandably works very poorly when you have a syntactically invalid file due to an in-progress edit.)
For the parsing problem, maybe something like Earley's algorithm that tries to minimize an error term?
You need this kind of robust parser for languages with preprocessors.
One very important rule is: no token can span more than one (possibly backslash-extended) line. This means having neither delimited comments (use multiple single-line comments; if your editor is too dumb for this you really need a new editor) nor multi-line strings (but you can do implicit concatenation of a string literal flavor that implicitly includes the newline; as a side-effect this fixes the indentation problem).
If you don't follow this rule, you might as well give up on robustness, because how else are you going to ever resynchronize after an error?
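To make the "no token spans a line" rule concrete, here is a minimal sketch in Python, assuming a made-up toy token grammar (the `TOKEN_RE` patterns and the `lex_lines` helper are hypothetical, not taken from any real lexer). Because every pattern is confined to a single line, an unterminated string can only damage its own line, and the lexer resynchronizes at the next newline. Backslash-extended lines are not handled here.

```python
import re

# Hypothetical token grammar for a toy language. No pattern can match a
# newline, so no token can ever span more than one line.
TOKEN_RE = re.compile(r"""
    (?P<comment>\#.*)
  | (?P<string>"(?:[^"\\]|\\.)*")
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<name>[A-Za-z_]\w*)
  | (?P<op>[^\s\w"])
  | (?P<error>".*)     # unterminated string: swallows the rest of this line only
""", re.VERBOSE)


def lex_lines(source):
    """Lex each line independently; return (line_no, kind, text) tuples."""
    tokens = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        # Lexer state is implicitly reset at every line boundary,
        # which is what makes resynchronization trivial.
        for match in TOKEN_RE.finditer(line):
            tokens.append((line_no, match.lastgroup, match.group()))
    return tokens


if __name__ == "__main__":
    broken = 'x = 1\ny = "oops\nz = 3\n'
    for token in lex_lines(broken):
        print(token)
```

In the example, line 2 yields an `error` token, while line 3 still lexes as an ordinary name, operator, and number.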
For parsing you can generally just aggressively pop on mismatched parens, unexpected semicolons, or on keywords only allowed in a top-ish level context. Of course, if your language is insane (like C typedefs), you might not be able to parse the next top-level function/class anyway. GNU statement-expressions, by contrast, are an actually useful thing that requires some thought. But again, language design choices can mitigate this (such as making classes values, template argument equivalent to array indexing, and statements expressions).
An error-cost-minimizing dynamic programming parser could do this.
* this is still during lexing, not yet to parsing
* there are multiple valid token sequences that vary only with a single character at the start of the file. This is very common with Python multi-line strings in particular, since they are widely used as docstrings.
If people want to move forward they'll look past it. Garbage in, garbage out.
My experience many years back with using just a syntax highlighting tuned lexer to build character-level diffs showed a lot of great promise: https://github.com/WorldMaker/tokdiff
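For anyone curious what a lexer-driven diff looks like in miniature, here is a rough sketch using Python's standard `difflib`. This is not how tokdiff is implemented; the regex tokenizer below is a crude stand-in for a real syntax-highlighting lexer, and `tokenize` / `token_diff` are names made up for the example.

```python
import difflib
import re

# Crude stand-in for a real lexer: whitespace runs, word runs,
# and single punctuation characters each become one token.
TOKEN_RE = re.compile(r"\s+|\w+|[^\w\s]")


def tokenize(text):
    """Split text into tokens; joining them reproduces the input."""
    return TOKEN_RE.findall(text)


def token_diff(old, new):
    """Return (tag, old_text, new_text) for every non-equal token run."""
    a, b = tokenize(old), tokenize(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, "".join(a[i1:i2]), "".join(b[j1:j2])))
    return changes


if __name__ == "__main__":
    print(token_diff('print("my dog is fluffy")', 'print("my cat is fluffy")'))
```

Since the diff runs over tokens rather than lines or raw characters, a one-word change shows up as a single replaced token instead of a whole changed line.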
> Each commit of a patch already has everything ready to be processed and chunked IF you keep them - small, atomic, semantically meaningful. As in do smaller commits.
Reads like:
User1: I need help with my colleagues who do not make independent, small, semantically intact commits
User2: well, have you tried making smaller, more independent, semantically intact commits?
---
My interpretation of the wish is to convert this, where they have intermixed two semantically independent changes in one diff:
+++ a/alpha.py
--- b/alpha.py
def doit():
- awesome = 3.14
+ awesome = 4.56
- print("my dog is fluffy")
+ print("my cat is fluffy")
into this
+++ a/alpha.py
--- b/alpha.py
def doit():
- awesome = 3.14
+ awesome = 4.56
print("my dog is fluffy")
+++ a/alpha.py
--- b/alpha.py
def doit():
awesome = 3.14
- print("my dog is fluffy")
+ print("my cat is fluffy")
where each one could be cherry-picked at will because they don't semantically collide
The semantics part would be knowing that this one could not be split in that manner, because the cherry-pick would change more than just a few lines; it would change the behavior:
+++ a/alpha.py
--- b/alpha.py
def doit():
- the_weight = 3.14
+ the_weight = 4.56
- print("my dog weighs %f", the_weight)
+ print("my cat weighs %f", the_weight)
I'm sure these are very contrived examples, but they're the smallest ones I could whip up offhand
Difftastic itself is great as well! The author wrote up nice posts about its design: https://www.wilfred.me.uk/blog/2022/09/06/difftastic-the-fan..., https://difftastic.wilfred.me.uk/diffing.html.
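Going back to the dog/cat and `the_weight` diffs further up: one very rough way to approximate the "can these two changes be split?" question is to check whether the changed lines share any identifiers. The sketch below is only a heuristic under that assumption (the helpers `names_in` and `can_split` are invented for the example, and a real tool would need actual data-flow analysis rather than regexes).

```python
import re

# Strip simple single-line string literals so words inside them
# ("dog", "cat") are not mistaken for identifiers.
STRING_RE = re.compile(r"\"(?:[^\"\\]|\\.)*\"|'(?:[^'\\]|\\.)*'")
NAME_RE = re.compile(r"[A-Za-z_]\w*")


def names_in(lines):
    """Collect identifier-like names from changed lines, ignoring strings."""
    names = set()
    for line in lines:
        code = STRING_RE.sub("", line)
        names.update(NAME_RE.findall(code))
    return names


def can_split(hunk_a, hunk_b):
    """Heuristic: two hunks look independent if they share no identifiers."""
    return not (names_in(hunk_a) & names_in(hunk_b))


if __name__ == "__main__":
    fluffy = (
        ["awesome = 3.14", "awesome = 4.56"],
        ['print("my dog is fluffy")', 'print("my cat is fluffy")'],
    )
    weight = (
        ["the_weight = 3.14", "the_weight = 4.56"],
        ['print("my dog weighs %f", the_weight)',
         'print("my cat weighs %f", the_weight)'],
    )
    print(can_split(*fluffy))
    print(can_split(*weight))
```

The first pair shares no names, so it is reported as splittable; the second pair shares `the_weight`, so it is not.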
Although - for more exotic applications parsing structural data I've found langium is far more capable as a platform. Typescript is also a pleasant departure from common AST tools.
Every user gets their own preferred formatting, and linters and tools could operate on already-parsed trees
The thing is that this means sacrificing the enormous advantage of plaintext, which is that it is enormously interoperable: we use a huge quantity of text-based tools to work with source code, including non-code-specific ones (grep, sed…)
Also, code is meant to be read by humans: things like alignment and line breaks really do matter (although opinions often differ about the “right” way)
And of course the lesser (?) problem of invalid ASTs.
I know a lot of people think source control should only have buildable code, but that's what CI processes are for and people use source control (and diffs) for a lot of things that don't need to pass CI 100% of the time.
That may be fine if you are happy with the plain text status quo, but if your goal is to avoid or minimize merge conflicts (as most people want when talking about semantic diff), you don't really solve that as well as you'd like.
(Additionally, and it is a lot less of a concern for git on disk storage but for some git-based email flows and other VCSes patch size matters and a consistent style of diffs between patches can be a useful storage or transfer optimization. Plain text diffs are more likely to produce a lot bigger patches compared to optimization wins you might get from a semantic diff; a mixture of merges between semantic and plain text diffs is often a worst of both worlds case in overall patch sizes as they churn against each other.)
The ultimate goal is to simplify the building and maintenance of a port of an actively-maintained codebase or specification by avoiding the need to know how every last upstream change corresponds to the downstream.
Just from an initial peek at the repo, I might have to take a look at how the author is processing their TreeSitter grammars -- writing the queries by hand is a bit of a slow process. I'm sure there are other good ideas in there too, and Diffsitter looks like it'd be perfect for displaying the actual semantic changes.
Early prototype, heavily relies on manual annotations in the downstream: https://github.com/NTmatter/rawr
(yes, it's admittedly a "Rewrite it in Rust" tool at the moment, but I'd like it to be a generic "Rewrite it in $LANG" in the future)