DiffX – Next-Generation Extensible Diff Format

365•todsacerdoti•1d ago

Comments

charcircuit•1d ago

>A single diff can’t represent a list of commits

I don't understand why this would be a problem. Each commit can have its own diff.

sedatk•1d ago

I think that's for packaging PRs in a single diff file. Here's an example: https://diffx.org/spec/examples.html#diff-of-multiple-commit...

kazinator•1d ago

It's a solved problem. Commits can be represented as e-mails. (e.g. "git format-patch").

The mails can be catenated together into one file; this is called the mbox format.

Get three commits as three separate files:

  $ git checkout
  Your branch is up to date with 'origin/master'.
  $ git format-patch HEAD~3
  0001-Issue-successful-status-out-of-cdlog.recover.patch
  0002-cdalias-take-zero-one-or-two-args.patch
  0003-cdunalias-improve-test-for-undefined-alias.patch

Now checkout to before those commits, detaching HEAD:

  $ git checkout HEAD~3
  Note: checking out 'HEAD~3'.

  You are in 'detached HEAD' state.

  HEAD is now at b3c0288 New feature: auto recovery.

Now, catenate those files together to create one "patch.mbox" file:

  $ cat 000* > patch.mbox

"git am" takes the whole file and applies it:

  $ git am patch.mbox
  Applying: Issue successful status out of cdlog.recover.
  Applying: cdalias: take zero, one or two args.
  Applying: cdunalias: improve test for undefined alias.

Clean up: return to master, delete the file:

  $ git checkout master
  Warning: you are leaving 3 commits behind, not connected to
  any of your branches:
  [ .... ]
  Switched to branch 'master'
  Your branch is up to date with 'origin/master'.
  $ rm patch.mbox

sedatk•1d ago

Interesting. I wonder why they didn't mention this in their docs. They brush over "Git diff is close to the ideal..." but don't go into specifics like this.

anitil•1d ago

I worked somewhere that maintained their own branch of some software in this way, and they then committed the patches to git. A strange workflow for sure, but it seemed to mostly work

chrismorgan•1d ago

I maintained a long-lived branch once that involved major refactorings, with a lot of machine translation and a lot of manual translation. Periodic rebasing involved a lot of diff diffing, and editing patch files was often a very practical approach.

mdaniel•1d ago

This submission about some mysql fork ships that fork as a couple of 15MB .patch files in git https://news.ycombinator.com/item?id=44148039 -> https://github.com/enhancedformysql/enhancedformysql/blob/fb...

I am not the target audience for such a thing and thus didn't try to apply said patch but my experience has been that I'm not going to hold my breath

chipx86•1d ago

`git format-patch` and `git am` are great! Not all SCMs have something like that (many don't even have a diff format with enough information to reliably look up a file or apply all of its changes).

Having the whole state of a series of commits up-front can help with tools like ours that need that information ideally in one place and need to do analysis across multiple commits.

We also wanted to avoid issues where, due to a bug or network error or whatever, a missing patch in a sequence could result in a broken set of changes to apply. This is obvious when you're patching locally, but less so when you're sending to a tool to process later. Having it up-front reduces the chances of problems and simplifies a lot of edge cases.

procaryote•1d ago

perhaps the solution is to stop using those SCMs?

chipx86•1d ago

If wishing made it so...

I have so many stories I could tell at this point.

account42•1d ago

I mean ok but you having a business reason to spend effort on supporting legacy systems doesn't really make for a convincing argument why others should care about your "standard" especially when the largest VCS (by far) already supports most of the things you are adding but in an incompatible way.

kazinator•1d ago

If you're forced to use a crappy SCM, you can still make "stealth" uses of a better SCM in your private workflows.

I have stories also. In one company almost twenty years ago, some Java-spewing troglodytes went on a branch and did stupid things, like simultaneously rename A.java to B.java, B.java to C.java and C.java to A.java! While of course changing all their content too.

When it came time to merge, this was completely beyond the power of Subversion to sort out.

They came to me for help, and I whipped out my Meta-CVS, with excellent support not only for renaming but for doing snapshot imports which figure out renaming by file similarity. We imported all the baselines into it, and it did the merge flawlessly across the rotated renames.

We took the resulting merged baseline out of Meta-CVS and cleanly committed it to the Subversion trunk.

Don't discount other SCMs being useful as tools in SCM problems, even if they are not the main source-of-truth SCM used by the org.

kazinator•1d ago

Electricity and running water are great; alas, not all houses everywhere have them ...

Who cares?

The solution to SCMs not doing this and that is to implement ones which do. We have done that.

You can unlock the cage, and that is enough; if the apes want to stay, let them stay.

xyzzy_plugh•1d ago

I find this whole document hard to read. A "diff" colloquially refers to the difference between two things -- files, directory trees, whatever. What TFA refers to as a diff has been always known as a patch, at least to me.

This is nothing about diffs, but entirely about patch metadata management. I mean, sure, noble goal, but this is just shuffling bits around. If they proposed that metadata was required to be JSON that would be one thing, but instead it's some weird self-describing length-delimited nonsense that just disguises the same problems that exist today. It's already extensible! Just type words!

I've spent a lot of time parsing things out of git commits and patch files and while some standardization would be neat, this isn't it.

That said I find the argument that git diff style is more or less canonical more compelling than I have in the past. So there's that.

> A single diff can't represent a list of commits

A patch set can! Why on earth would you want that represented by a single diff is beyond me.

eddd-ddde•1d ago

The last bit is the first thing I thought. Just use multiple diffs!

ed•1d ago

The patch format addresses all of these issues, no?

https://git-scm.com/docs/git-format-patch

laserbeam•1d ago

TIL these are a thing. Thanks! (Just a regular joe on the internet, not the author)

genocidicbunny•1d ago

It might solve it for git, but this looks like something the Review Board team came up with, and they have to integrate with many other version control systems like SVN, CVS, Perforce, etc. Seems like this is meant to address supporting many different version control systems with a single format.

I've worked at a place that used Review Board, and SVN as their primary vcs, but many devs used a local git-svn mirror for their work. Sometimes this caused problems with uploading diffs, especially if svn and git-svn were being mixed in one review. Having the Review Board cli generate a common diff format for both would have helped with that.

chipx86•1d ago

Exactly that. They all do things so differently that you end up creating and maintaining a separate parser for every SCM's diff format, and sometimes doing a lot of normalization of content or modification to include information the format lacks that's needed to apply the patches. And those are just for the ones that actually have a diff format -- many don't.

We needed something for ourselves at the very least. Much of DiffX came from thinking about these pain points and from talking to other SCM vendors whose engineers have also given some thought to these problems.

motorest•1d ago

> They all do things so differently that you end up creating and maintaining a separate parser for every SCM's diff format (...)

...and you seriously believe that pushing yet another ad-hoc format, and one which no one at all uses, is a way to address your concern?

seanhunter•1d ago

Cue https://xkcd.com/927/

genocidicbunny•1d ago

They are using it. Review Board is a successful project that's been around for a long time, and it's solving a problem they had. One of the most common workflows with Review Board for source code reviews is to use the RBTools command line tools to post or update reviews. The cli would be the one generating the diff (although it supports uploading diffs that you generate iirc.) I haven't looked into the details, but I assume RBTools can generate DiffX diffs which is probably easier for the backend to process. (E - from what chipx86 has said in some of his posts here, they have been using it for several years now)

I don't really see this as pushing anything, more as documentation of something they did for themselves, but are also willing to provide to anyone else if they want to use it. Same as how the source code for the core Review Board product is available for anyone.

If you're happy with the diff format you're using in your workflow, keep using that. No one's twisting your arm to switch to DiffX.

motorest•1d ago

> They are using it.

Which mainstream VCS supports this format?

genocidicbunny•1d ago

You're just being obtuse. It's been explained multiple times that there are tools external to the version control systems that can generate and consume this format. Just because there's no 'svn-diffx' or 'hg-diffx' command/tool built into the vcs itself, doesn't mean that this format can't be generated and used by other tools.

So to answer your question, any vcs that has had tools written for it to generate this format. And it sounds like it's most of the major ones as far as Review Board is concerned.

genocidicbunny•1d ago

DiffX would have been nice to have available way back when I was trying to add support for our custom in-house vcs to Review Board. We had to either contort the diffs from our vcs to some format already understood by Review Board, which was sometimes difficult due to how the vcs structured the data it stored, or add a whole new parser to our Review Board instance, which would have been a major maintenance pain.

As an aside, I applaud you for creating Review Board. I've introduced its usage with several teams that I've worked with, and it really helped change how those teams operated, from a fly-by-night sort of development to actually having a process. The reduction in bugs and improvement in code quality were quite useful too.

chipx86•1d ago

I'm really glad it was useful for you and your teams! :) Hearing that kind of thing always brightens my day. I've felt very lucky getting to work on this as my job all these years.

It'd have been amazing having something like DiffX when we started building some of these SCM integrations too. It's really saved us a lot of trouble with some of the recent ones we've built (PlasticSCM / Unity Version Control, Keysight SOS, and ClearCase), which didn't have a format to work with and needed a lot of extra metadata for lookups and some other stuff.

motorest•1d ago

> Seems like this is meant to address supporting many different version control systems with a single format.

I'm sorry, this is simply wrong at so many levels. You're lauding this as a solution in search for a problem. As OP pointed out, this is already a solved problem as proven by Git. Git is not using a proprietary format. The problem of "integrate with many other version control systems" depends on whether those version control systems want to work on adding support for this feature. I guarantee you there isn't a single SVN or Mercurial maintainer complaining that they would love to share patches with Git but they are blocked because they cannot implement, let alone design, a format to exchange patches. That is not the hard part. That doesn't even register as a concern.

chipx86•1d ago

Git is using a proprietary variant on top of Unified Diffs. Unified Diffs themselves convey very little information about the file being modified, focusing solely on the line-based contents of text files and allowing vendors to provide their own "garbage" lines containing anything else. Every SCM that tracks information beyond line changes in a diff fills out the garbage data differently.
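As a rough illustration of the split between "garbage" lines and real diff content, the Python sketch below sorts the lines of a unified diff into two piles. The function name `split_garbage`, the sample text, and the `# hypothetical vendor metadata` line are made up for this example. It is a simplified classifier under those assumptions, not how GNU patch or any SCM actually parses diffs.

```python
import re

# Matches a unified-diff hunk header such as "@@ -1,2 +1,2 @@".
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def split_garbage(diff_text):
    """Separate vendor 'garbage' lines from file headers and hunks."""
    garbage, diff_lines = [], []
    old_left = new_left = 0  # lines still expected in the current hunk
    lines = diff_text.splitlines()
    for i, line in enumerate(lines):
        if old_left > 0 or new_left > 0:
            # Inside a hunk, every line belongs to the diff itself.
            diff_lines.append(line)
            if line.startswith("-"):
                old_left -= 1
            elif line.startswith("+"):
                new_left -= 1
            elif not line.startswith("\\"):
                # Context line: counts against both sides.
                old_left -= 1
                new_left -= 1
            continue
        match = HUNK_RE.match(line)
        if match:
            # An omitted count in a hunk header means 1.
            old_left = int(match.group(2) or 1)
            new_left = int(match.group(4) or 1)
            diff_lines.append(line)
        elif (line.startswith("--- ") and i + 1 < len(lines)
              and lines[i + 1].startswith("+++ ")):
            diff_lines.append(line)
        elif line.startswith("+++ ") and i > 0 and lines[i - 1].startswith("--- "):
            diff_lines.append(line)
        elif line.startswith("\\ ") and diff_lines:
            # Trailing "\ No newline at end of file" marker.
            diff_lines.append(line)
        else:
            # Anything else is tool-specific and opaque to a plain parser.
            garbage.append(line)
    return garbage, diff_lines


SAMPLE = """\
# hypothetical vendor metadata
diff --git a/hello.txt b/hello.txt
index 1234567..89abcde 100644
--- a/hello.txt
+++ b/hello.txt
@@ -1,2 +1,2 @@
 hello
-world
+there
"""

garbage, diff_lines = split_garbage(SAMPLE)
print("garbage:", garbage)
print("diff:", diff_lines)
```

In this sample, the `diff --git` and `index` lines land in the garbage pile along with the made-up vendor line, which is the part each SCM fills in differently.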

The intent here isn't to let you copy changes from one type of repository to another, but to have a format that can be generated from many SCMs that a tool could parse in a consistent way.

Right now, tools working with diffs from multiple types of SCMs need at least one diff parser per SCM (some provide multiple formats, or have significantly changed compatibility between releases).

For SCMs that lack a diff format (there are several) or lack one that contains enough information to identify a file or its changes (there are several), tools also need to choose a method to represent that information. That often means yet another custom diff format that is specific to the tool consuming the diff.

We've spent over 20 years dealing with the headaches and pain points here, giving it a lot of thought. DiffX (which is now a few years old itself) has worked out very well as a solution for us. This wasn't done in a vacuum, but rather has gone through many rounds of discussion with developers at a few different SCM vendors who have given thought to these issues and supplied much valuable feedback and improvements for the spec.

motorest•1d ago

> Git is using a proprietary variant on top of Unified Diffs.

What definition of "proprietary" are you using?

chipx86•1d ago

Created by the Git team for Git's purposes, rather than something documented or proposed for wider adoption.

Other SCMs can and do use a Git-style diff format, but as there's no defined grammar, there are sometimes important differences. For example, Mercurial's Git-style diffs represent the revisions in a different format than Git's does with different meanings, reuse Git "index" lines for binary files but include SHAs of the file contents instead of any sort of revision, and have a header block that should be stripped out before sending to a Git-style diff parser.

bawolff•1d ago

Aren't you doing the same thing? After all this is just review Board's custom diff format that nobody else uses.

chipx86•1d ago

Yep! We spent 20 years dealing with these problems and in those 20 years nobody really solved these pain points. So we talked to some SCM vendors, bounced ideas around, built a spec, got feedback from them, repeated off-and-on for a couple years until we got the current draft, and implemented it for our needs.

It's been a few years now, and so far so good for the purposes we built it for. And it's there for any other tool or SCM authors to use if it also happens to be useful to them.

seanhunter•1d ago

Feels more like in 20 years nobody else really has those pain points.

1. For most people using multiple SCMs is just a huge and easily-avoidable mistake. Most people can just mandate a single SCM for a project and then all these problems are moot.

2. For the things listed in TFA

    A single diff can’t represent a list of commits

That's what "patch" and "patch format" is for. It works great.

    There’s no standard way to represent binary patches

Very unclear why anyone needs this. There's no standard way to code-review a binary diff (it depends what the blob is that you're diffing) so how would it help if you had this standard way to represent the diff?

    Diffs don’t know about text encodings (which is more of a problem than you might think)

This goes away if people on a project agree on a particular encoding (which is going to be UTF-8, let's face it). If someone sends a diff in an incorrect file encoding via DiffX, it will still apply wrong if someone uses a non-DiffX-aware (aka standard) tool to apply it. So DiffX doesn't really fix this problem.

    Diffs don’t have any standard format for arbitrary metadata, so everyone implements it their own way.

This goes away if you just use one SCM for a project which you should anyway for everyone's sanity.

genocidicbunny•1d ago

> 1. For most people using multiple SCMs is just a huge and easily-avoidable mistake. Most people can just mandate a single SCM for a project and then all these problems are moot.

You talk about SCMs, we're talking about VCSs. Where it's not just source code under control, or even source code with a handful of binary assets. Imagine dealing with a VCS that has to handle 15 years and a few petabytes of binary assets. Or individual files that were multiple gigabytes and had changes made to them several times per day. Can git do that gracefully just by itself? Or SVN? Even Perforce struggled with something like that back in the day.

>Very unclear why anyone needs this. There's no standard way to code-review a binary diff (it depends what the blob is that you're diffing) so how would it help if you had this standard way to represent the diff?

A standard way of handling the binary data doesn't mean understanding the binary data. You can leave that up to specific tools. What you need though is a way to somehow package up and describe those binary diffs enough that you can transport the diff data and pick the right tool to show you the actual differences.

> This goes away if you just use one SCM for a project which you should anyway for everyone's sanity.

And if wishes were fishes, I'd never be hungry again. If you have a lot of history, a lot of data, a lot of workflows and tools built up around multiple VCSs, then changing that to just one VCS is going to be a massive undertaking. And not every VCS can handle all of the kinds of data that might get input into it. Some are going to be good at text data, some might handle binary assets better. Some might have a commit model that makes sense for one type of workflow but not for another. For example, you might be dealing with binary assets where you can only have one person working on a specific file at a time because there's no real way to merge changes from multiple people, so they need to lock it. For text assets though, you might be able to handle having multiple people work on a file. To afford both workflows, your VCS now needs to not only support both locking modes, but be hyper-aware of the specific content to know which kind of locking to permit for specific files.

The world doesn't always fit into the nice little models that the most popular VCSs provide. So if you're trying to not limit your product to supporting just those handful of popular VCSs, you can't just assume everything will fit into one of those models.

motorest•1d ago

> Imagine dealing with a VCS that has to handle 15 years and a few petabytes of binary assets.

Part of the problem is that you're fabricating imaginary problems that no one is actually experiencing, only to argue that the solution for these imaginary problems is a file format.

Does this sound reasonable to you?

naasking•1d ago

How are you so convinced that no one is experiencing these problems?

jamoche•1d ago

I once closed a bug with a comment that it was old enough to drink. Also that the lines mentioned in the bug no longer existed, although the file did. Couldn't even satisfy my curiosity about what it had originally looked like, change history didn't go back that far.

15 years is nothing to OS level codebases.

genocidicbunny•1d ago

> 15 years is nothing to OS level codebases.

Or a gamedev studio that has been making various MMO's since the early 2000's.

drabbiticus•1d ago

> you're fabricating imaginary problems

That's a very strong statement. A less aggressive approach to discussion might involve asking for a concrete example of a problem rather than assuming bad faith argument.

Off the top of my head, and just spitballing, I would be more surprised if mature game devs or animation studios didn't want to version control pretty massive asset libraries.

genocidicbunny•1d ago

> Off the top of my head, and just spitballing, I would be more surprised if mature game devs or animation studios didn't want to version control pretty massive asset libraries.

Exactly the sort of thing I had in mind.

genocidicbunny•1d ago

I'm speaking from personal experience, so yes, it sounds reasonable to me. I had to deal with exactly that situation.

genocidicbunny•1d ago

>I guarantee you there isn't a single SVN or Mercurial maintainer complaining that they would love to share patches with Git

I was one of those maintainers. So you're already wrong there. As I described in my parent comment, I've worked somewhere this was an actual problem I encountered. I was responsible for both maintaining our SVN repository, and our Review Board instance, so I have had to actually deal with this.

motorest•1d ago

> I was one of those maintainers. So you're already wrong there.

Please point out exactly where support for DiffX was implemented in either Mercurial or SVN.

genocidicbunny•1d ago

DiffX is a bit younger than the tooling I've written, but I have added custom diffing tools to the SVN client for one team I worked with. I've also written plenty of tools that used the information provided by a VCS (sometimes even poking around in the server-side data), but external to it. So given a few days to refresh my memory on the interfaces, I could probably whip something up for SVN pretty quickly.

itake•1d ago

One of my issues that remains unsolved with diff tools is they are dependent on new line attributes.

Reviewing changes on a long line (like compressed json or long array) is too difficult.

devman0•1d ago

git does have word diffing if you need something more granular than line diffing, the default delimiter being whitespace.
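For anyone curious what whitespace-delimited word diffing looks like in principle, here is a minimal Python sketch using the standard library's `difflib`. The function name `word_diff` is made up for this example, and the `[-...-]` / `{+...+}` markers only imitate the style of `git diff --word-diff`. It is not git's implementation, and its spacing will not match git's output exactly.

```python
import difflib


def word_diff(old_line, new_line):
    """Mark word-level changes between two lines, splitting on whitespace."""
    old_words = old_line.split()
    new_words = new_line.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(old_words[i1:i2])
            continue
        if i2 > i1:
            # Words present only in the old line.
            out.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if j2 > j1:
            # Words present only in the new line.
            out.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(out)


print(word_diff("timeout = 30 seconds", "timeout = 45 seconds"))
```

Because the comparison runs over words rather than lines, a one-word change in a very long line shows up as a single marked pair instead of a whole removed line plus a whole added line.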

itake•1d ago

Ooh, I didn't realize that. That seems pretty close to what I want, but the git tools (CLI or GitHub Desktop) still print the entire line.

For line diffing, it clips to only show the ~3 lines before and after the change. But the word diff still prints the entire line and you have to scroll to find the change in the line wrapping.

eddd-ddde•1d ago

You can use `delta` as a diff pager and it includes word-diff-syntax-highlight.

saghm•1d ago

I think part of the problem is that the common format used is somewhat of a compromise between being human readable and parseable by tools. I can sort of see where the author of this tool is coming from with trying to address some of that with metadata, but I feel like the better way might be to come up with a format that isn't reliant on plaintext but instead can be rendered to something more readable. Coming up with a way to calculate reasonable diffs on files like you mention that doesn't generate worse diffs than we have now for existing stuff would be challenging, but it doesn't feel like it would be impossible to solve.

chipx86•1d ago

Absolutely agree. I think there's a lot of avenues to explore for better diff representations for structured data (which would also be great for ASTs, something we've been thinking about).

This format is meant to be an extension of Unified Diffs (much like the diff formats of most SCMs), and not something entirely new and focusing on other areas. But if more specific diff formats become widespread, we could directly support encoding them within DiffX as well, as we do for binary diffs formats.

laserbeam•1d ago

I really don’t like the highly hierarchical format, that there’s a “..meta” and a “...meta” somewhere else. I can imagine we want to annotate the whole diff, each file and each chunk. That’s a total of 3 levels of depth. Let’s just give them distinct names and not go full YAML with a format for once?

This helps with readability (if one of the “meta” blocks is missing, for example, I could still tell at a glance what it refers to without counting dots), and is less error prone (it makes little sense to me why the metadata associated with a whole diff should have the same fields as the metadata of a file).

Furthermore, why do we have two formats, JSON and key=value pairs? Is there any reason not to just use one format? It sounds like the number of things we’d want to annotate is quite small. Having a single structure makes it much easier to write parsers or integrate with existing tooling (grep, sed or jq - but not both at once)

Other notes:

- please allow trailing commas in lists

- diffs are inherently splittable. I can grab half of a diff and apply it. How does your format influence that? I guess it breaks because I would need to copy the preamble, then skip 20 lines, then copy the block I need?

- revisions are a file property? Not a commit checksum? (I might just be dumb here)

chipx86•1d ago

In the early drafts, we played with a number of approaches for the structure. Things like "commit-meta", etc. In the end, we broke it down into `#<section_level><section_type>`, just to simplify the parsing requirements. Every meta block is a meta block, and knowing what section level you're supposed to be in and comparing to what section level you get become a matter of "count the dots".

The header formats are meant to be very simple key/value pairs that are known by the parser, and not free-form bits of metadata. That's what the "meta" blocks are for. The parsing rules for the header are intentionally very simple.
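To make the "count the dots" idea concrete, here is a minimal Python sketch that parses a section header line of the shape quoted elsewhere in this thread (`#..meta: format=json, length=270`). The function name `parse_section_header` is made up, and the grammar is inferred only from that one example line, so treat it as an assumption rather than a faithful reading of the spec.

```python
def parse_section_header(line):
    """Parse a header such as '#..meta: format=json, length=270'.

    Returns (section_level, section_type, options), where options is a
    dict of the key=value pairs. Values are kept as strings.
    """
    if not line.startswith("#"):
        raise ValueError("not a section header: %r" % line)
    body = line[1:]
    # The section level is the number of leading dots.
    level = len(body) - len(body.lstrip("."))
    section_type, _, rest = body[level:].partition(":")
    options = {}
    for pair in rest.split(","):
        pair = pair.strip()
        if pair:
            key, _, value = pair.partition("=")
            options[key] = value
    return level, section_type, options


level, section_type, options = parse_section_header(
    "#..meta: format=json, length=270"
)
print(level, section_type, options)
```

Under this reading, a parser would presumably use the `length` option to know how much content to hand to the JSON decoder, but that step is left out here because the thread does not spell out how the length is counted.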

JSON was chosen after a lot of discussion between us and outside parties and after experimentation with other grammars. The header for a meta block can specify a format used to serialize the data, in case down the road something supplants JSON in a meaningful way. We didn't want to box ourselves in, but we also don't want to just let any format sit in there (as that brings us back to the format compatibility headaches we face today).

For the other notes:

1. Compatibility is a key factor here, so we'd want to go with base-level JSON. I'd prefer being able to have trailing commas in lists, but not enough to make life hard for someone implementing this without access to a JSON5 parser.

2. If your goal is to simply feed to GNU patch (or similar), you can still split it. This extra data is in the Unified Diff "garbage" areas, so they'll be ignored anyway (so long as they don't conflict, and we take care to ensure that in our recommendations on encoding).

If your goal is to split into two DiffX files, it does become more complicated in that you'd need to re-add the leading headers.

That said, not all diff formats used in the wild can be split and still retain all metadata. Mercurial diffs, for example, have a header that must be present at the top to indicate parent commit information. You can remove that and still feed to GNU patch, but Mercurial (or tools supporting the format) will no longer have the information on the parent commit.

3. Revisions depend heavily on the SCM. Some SCMs use a commit identifier. Some use per-file identifiers. Some use a combination of the two. Some use those plus additional information that either gets injected into the diff or needs to be known out-of-band. There's a wide variety of requirements here across the SCM landscape.

quotemstr•1d ago

> Compatibility is a key factor here, so we'd want to go with base-level JSON. I'd prefer being able to have trailing commas in lists, but not enough to make life hard for someone implementing this without access to a JSON5 parser.

Everyone has access to a JSON5 parser. Everyone has to suffer for the sake of a few people who don't to pay the trifling tax of pip installing something --- when they're using an external library for a novel file format _anyway_?

genocidicbunny•1d ago

> Everyone has access to a JSON5 parser.

That's just a lack of imagination. When you're making a product for teams that span everything from a brand new startup using the latest tooling to teams that are working on software that runs on embedded systems from the 90's, you need to consider things like this.

roblabla•1d ago

There are JSON5 parsers written in C89 out there. And your embedded systems from the 90s probably don't have a JSON parser built in at all either... If you're going to build your own JSON parser, adding JSON5 support on top is really trivial.

genocidicbunny•1d ago

That doesn't mean it's not going to be difficult to use that parser. Not everyone has the luxury of being able to use third-party code, or having the time allotted to write a JSON5 parser. The JSON parser some places are using may have been written two decades ago and works well enough that there's little motivation to implement JSON5 support. Sometimes it's just company policy or internal politics that prevent the usage.

It's also just not that big a deal overall for the intended use of the DiffX format. It's mainly machine-generated and machine-consumed. There's human readability concerns for sure, but the format looks to be designed mainly for tools to create and consume, so missing a few features that JSON5 brings is not that big of a deal.

quotemstr•1d ago

So the whole world should suffer through vanilla JSON because someone, somewhere, has an overbearing and paranoid software approval process? That's the attitude that delayed universal Unicode adoption by a decade.

genocidicbunny•1d ago

That's a bit dramatic. This isn't something as universal as Unicode. You really only need to care about this if you're writing tools that generate or consume the DiffX format, which is not something most people will be doing. The whole world isn't suffering their decision to use JSON instead of JSON5.

DannyBee•1d ago

"That doesn't mean it's not going to be difficult to use that parser. Not everyone has the luxury of being able to use third-party code, or having the time allotted to write a JSON5 parser."

Why are these people the target market?

I understand it may be important to you, but that isn't the same as "matters to target market/audience".

On top of that, the same constraints you mention here would stop you parsing current git patch formats, and lots of other things anyway. So you were never going to be using modern tools that might care here.

This is all also really meta. Who exactly is writing software with >1% market share, needs to parse the patch format, and can't access a JSON parser?

Instead of this theoretical discussion, let's have a concrete one.

genocidicbunny•1d ago

In this specific instance, those people are part of the target market because the project chooses to make them part of the target market. It's worked well enough for Review Board.

DannyBee•1d ago

I don't think this is true, and honestly, I think it would be a mistake to consider it - they can't serve everyone, down that path is madness. FWIW - I even have a JSON parser in my RTOS-that-must-run-in-less-than-512k.

I also think that target of "embedded systems from the 90's" makes no sense because the tooling for the embedded system, which is what would conceivably want to handle patch format, ran on the host, which easily had access to a JSON parser.

But let's assume it does matter - let's be super concrete - assume they want to serve 95-99% of the users of patch format (i doubt it's even that high).

Which exact pieces of software with even >1% market share that need to process patch format don't have access to a JSON parser?

laserbeam•1d ago

> 1. Compatibility is a key factor here, so we'd want to go with base-level JSON. I'd prefer being able to have trailing commas in lists, but not enough to make life hard for someone implementing this without access to a JSON5 parser.

This is what I was referring to. This is not json:

> #..meta: format=json, length=270

> The header formats are meant to be very simple key/value pairs that are known by the parser, and not free-form bits of metadata. That's what the "meta" blocks are for. The parsing rules for the header are intentionally very simple.

Exactly my point. That level of flexibility for a .patch format to support another language embedded in it is overwhelming. Keep in mind that you are proposing a textual format, not a binary format. So people will use 3rd party text parsing tools to play with it. And having 2 distinct languages in there makes that annoying and a pain.

laserbeam•1d ago

> The header formats are meant to be very simple key/value pairs that are known by the parser, and not free-form bits of metadata. That's what the "meta" blocks are for.

One more thing you should prepare for whenever you have "free-form bits of metadata". They somehow turn into: "some user was storing 100MB blobs in there, and that broke our other thing".

WhyNotHugo•1d ago

What was your reasoning for discarding the existing header format used by git?

koiueo•1d ago

> format (string – recommended):
>
> This would indicate the metadata format. Currently, only json is officially supported, and is the default if not provided.

JSON doesn't seem a good choice for representing metadata in a format that aims to be universal. It is unnecessarily complicated for this purpose IMO

motorest•1d ago

> JSON doesn't seem a good choice for representing metadata in a format that aims to be universal. It is unnecessarily complicated for this purpose IMO

That's an odd statement. Can you please explain why you believe that JSON is "unnecessarily complicated" to represent metadata.

quotemstr•1d ago

What's wrong with JSON?

* JSON is just barely powerful enough to need a library to parse it, but not powerful enough to have comments or trailing commas, so editing is needlessly annoying.

* It's human-readable, but deciphering nested data structures is annoying, especially when things are formatted as long lines. If you have to pipe something to jq to be able to read it, it's broken as a text-based document format.

* JSON is needlessly strict. If I write {foo: 5}, my intent is crystal clear. I shouldn't have to write {"foo": 5}. Come on. Who's really helped by this kind of syntactic hairshirt?

* despite being a strict schoolmarm of text formats, JSON is still vague. Yes, it has numbers. How big can they be? Who knows?

I mean, JSON is fine-ish, I guess, as an interchange format, in which I'm looking at it only for the occasional debugging session. But as a format for documents meant to be read by humans? Ugh. Anything, anything at all but JSON.

dolmen•1d ago

How often do you edit patch files?

When editing a patch file, how often do you edit metadata beyond the file content differences?

It seems that the proposed DiffX is meant to be produced/edited only by machines, so JSON doesn't seem too much of a problem for this use case.

koiueo•1d ago

You missed the parser part and mistakenly focused on editability.

Diffx authors wrote a prototype in Python where JSON support is built-in. Go parse it in bash or C.

Having to parse JSON for the sake of applying a patch is not exactly wise

watusername•1d ago

> editing is needlessly annoying.

Luckily, this does not matter in DiffX because the whole thing goes up in flames when you change the length anyways :)

yu3zhou4•1d ago

I got an xkcd 927 vibe somehow: https://xkcd.com/927/

motorest•1d ago

The name "standard" seems to be too overloaded, and used as a synonym for "specification". There is nothing standard about this proposal.

gjvc•1d ago

they couldn't call it xdiff (which would match the "extensible diff" name) because it would clash with https://github.com/libgit2/xdiff

touristtam•1d ago

Well there are a few "diffx" out there: https://github.com/search?q=%22diffx%22&type=repositories in Java, Typescript, Python and other.

dedicate•1d ago

For stuff like commit histories or complex changes, isn't the real power in the tools around the diff (think Git itself, or code review platforms) rather than trying to cram everything into one super-format?

chipx86•1d ago

It is. However, first the tools need to be able to grab the necessary information from the diff to, say, locate that file or its metadata in a repository.

Git's diff format contains enough information to do this, but many don't. When this happens, tools like ours (a code review platform) are forced to modify the diff format or otherwise attach information to do these sorts of operations.

We've had to do this enough times, and work around enough inconsistencies and breakages in various SCM-specific diff format variants that we decided to address the problem head-on, pulling our experiences and those of some of the SCM vendors we work with to draft a solution to the problem.

KingOfCoders•1d ago

Let's add JSON to everything.

motorest•1d ago

> Let's add JSON to everything.

What's wrong with JSON?

bawolff•1d ago

Nothing is wrong with JSON, but diff files are generally meant to be human readable, and JSON can be annoying to read at a glance.

redleader55•1d ago

What actual problem is this trying to solve? They mention patch/diff format not being good enough, but they don't explain for whom. Are GNU Patch people complaining? What are these people building that needs a better patch format?

genocidicbunny•1d ago

Looks like this is being used by Review Board, which heavily relies on diffs for source code reviews, and supports a whole bunch of version control systems.

chipx86•1d ago

I have a much-too-long-for-one-comment write-up about this, but it's basically for those who build SCMs or tools that need to work with SCMs. End users shouldn't have to care about this.

There's not one patch/diff format. There's often at least one per SCM. A couple are pretty good (Git's), many are okay (Subversion's), and many are really bad or non-existent.

I founded one of the older code review products, Review Board (turning 20 next year), and we deal with the problems this is trying to solve all the time, across over a dozen SCMs. So we're the ones complaining :) And much of this is based on extensive feedback from SCM vendors we've spoken to about this at length.

Most people shouldn't have to care. But it benefits tools like ours that have to deal with the nightmare that is the world of diff formats.

Jean-Papoulos•1d ago

This is trying to do multiple things (commit info & file diff) in one. Not a good idea. Commit info should live in the repo metadata (no matter which form this takes), and diff should be its own thing.

chipx86•1d ago

The repository metadata absolutely should own the commit information.

Diff files are there to represent a delta state of the repository, a difference between a range of changes. Those may be one or more commits, or one or more changes across individual files (not all SCMs manage state in terms of atomic commits). File changes, file attribute changes, SCM-specific metadata changes, and commit history information.

That delta state should be able to be applied to another tree in order to get the same end result. This is what diff files are ultimately there for.

Git diffs do this today, and they do it well (but they're pretty Git-specific). Many SCMs (and there are a lot of them) don't include a format on that level, or a format at all. Hence DiffX.

eqvinox•1d ago

I think there's a confusion here between diffs and patches. I would call the thing you're describing a patch, not a diff, and then everything makes much more sense.

(This is not helped by the fact that diffs often come with a .patch file extension… or that the "patch" tool processes diffs…)

Or anyway the nomenclature really sucks in this field. I guess I have no clue whether I have a minority view here.

chipx86•1d ago

Yeah the nomenclature sucks.

When talking about the file, the two terms are often used interchangeably (and the file usually has a .diff or a .patch extension).

For fun, the GNU Patch manpage says:

"NAME: patch - apply a diff file to an original"

followed by:

"patch takes a patch file patchfile containing ..."

"patch tries to skip any leading garbage, apply the diff, and then ..."

chipx86•1d ago

Hi, one of the authors of DiffX here. I didn't expect to find this on Hacker News tonight.

I'm going to try to address some of the reasoning behind DiffX here, and answer a few things that have come up in the comments. I'll start by saying, the issues we're addressing are more issues encountered by tools that work with these files, not necessarily the end users of these tools. Most people never have to think hard about the structure of these files, but we do.

A little background. I co-founded Review Board, a code review product that's been around since 2006. We work with a wide variety of source code management solutions, and because of this, we deal extensively with .diff/.patch files. And nobody generates these in any consistent way.

If Git were the only SCM out there, there'd be no need for DiffX. But there are at least a dozen SCMs in use in production across companies today, and more being developed. And each does things differently.

Git, Mercurial, Subversion, ClearCase, Perforce, and others all have their own bespoke type of format, built largely (but not always consistently or fully compatibly) on Unified Diffs. These often augment Unified Diffs by injecting:

* Revision identifiers (which can come in all kinds of forms -- numbers, hashes, paths, reserved words -- and sometimes need to be paired with other information to resolve a file). Sometimes these are on the `---`/`+++` lines, sometimes not.

* Symlink information (some tools provide the old and new symlink paths, some just the new, some neither, some the file contents)

* File modes (similarly, out of those that convey file modes, some provide more details than others, and these can impact application or processing of a patch)

* Commit descriptions (and if not done right, a stray `---` or metadata keyword can break some parsers)

Or any number of other common or SCM-specific metadata.

Pretty much all of these represent data in their own ways. This information goes in the "garbage" area of Unified Diffs, which basically means tools like GNU patch ignore it, but tools aware of that specific variant can parse it out.

At this point in my life, I've written a couple of dozen bespoke diff parsers. Depending on the tool generating the diff, there are all sorts of parsing issues that can come up:

* Varying encodings for file paths and text strings (like commit messages), which can mean a patch on one system doesn't apply on another, depending on the tools generating it.

* BOMs that sometimes show up in strings (we hit this with Perforce years back).

* Messages or metadata can sometimes include characters that resemble Unified Diff content or other variant-specific syntax and can break patching/parsing.

* Differences in how information like symlinks/file modes are conveyed (see above).

* Binary files are almost never able to be represented beyond a "Binary files X and Y differ" line.

* Newlines are sometimes outright broken within the diff (particularly with mixed line endings) and can sometimes break patching.

Just to name a few.

Many SCMs don't even have a diff format to begin with, just generating a Unified Diff. These don't contain any revision information needed to locate a file. In these cases, or when important information is unavailable in some diff variant, we're stuck rolling our own.

Also, something you wouldn't normally expect to be a problem, but which comes up in practice more than you'd think: some very large diffs (we've seen ones hundreds of megabytes in size -- don't get me started) are time-consuming to parse. To know everything about the diff, you need to read and scan every line. To generate a list of filenames or stats on a diff, you need to effectively parse all of it.

So we took all those pain points, talked to developers working on a few SCMs, got their pain points and thoughts, and drafted the initial DiffX spec. Went through several rounds with them, iterated until we got where we are today.

The spec had some important goals:

1. Not being vendor-specific.

Git patches were built for Git, and even Git-like patches from Mercurial or Subversion have quirks that can break parsing in a Git-specific patch parser. There's no grammar for how Git stores the metadata and the clients require knowledge up-front of the value types, which isn't a good fit for some of the SCMs out there.

We wanted to draft something that could be used more generically, able to be adopted by newer tools while also being able to represent the information provided from existing tools.

2. Support for arbitrary and injectable metadata.

Some of the formats we work with don't contain enough information to locate the file + revision within a repository. Some require additional information, like an explicit branch or workspace ID or a counterpart changeset number.

Even Git diffs don't always provide enough. They provide a Blob SHA, but not a commit SHA by default. This is a problem when talking to APIs on Git hosting services that require a commit SHA along with (or instead of) a blob SHA.
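For anyone unfamiliar with the distinction: the blob SHA in a Git diff's `index` line identifies file contents only, which is why it can't be mapped back to a commit without more information. A minimal Python sketch of how a blob ID is derived in a SHA-1 repository (`git_blob_sha1` is a name made up for this example; SHA-256 repositories hash differently):

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Compute the object ID Git assigns to a blob in a SHA-1 repository.

    The hash covers a "blob <size>\\0" header plus the raw bytes, and nothing
    about the path, the commit, or the branch the file lives on.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The same bytes always produce the same blob ID, in any commit or repository.
print(git_blob_sha1(b""))
print(git_blob_sha1(b"hello\n"))
```

Two unrelated commits containing an identical file share the blob ID, so an API that wants a commit SHA needs that supplied separately.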

And some have useful data that can't be fetched after the fact.

So a common headache is that we need to inject additional information in the diffs we generate in order to allow the appropriate data to be looked up.

By using a form of metadata storage for the diff file, the commit, and the files within, we have the ability to inject that additional information without worrying about corrupting a state machine or regex or whatever method is used for some parser (or some older version of our own parsers).

We eventually chose JSON here. We initially had a grammar that looked more like Git's format, but found ourselves dealing with some of the same challenges that YAML had. We didn't want the "NO" problem and we didn't want every client to have to decide on what the value of a string in a piece of metadata should be. Some metadata (such as revision IDs in some SCMs) differentiate between a number and a string that may look like a number, and that information is important to know up-front.
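A small Python illustration of the typing point, using made-up field names (`revision`, `changeset`, and `branch` are hypothetical here, not keys defined by the spec): with JSON, the producer decides whether a value is a string or a number, and the consumer gets that decision back unchanged.

```python
import json

# Hypothetical metadata: a revision ID that merely looks numeric, a real
# number, and a branch name that a YAML 1.1 parser could read as a boolean.
raw = '{"revision": "0042", "changeset": 42, "branch": "NO"}'
meta = json.loads(raw)

for key, value in meta.items():
    print(key, type(value).__name__, repr(value))
```

The leading zeros on the revision survive, and "NO" stays a string, without the reader having to know the value types up-front.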

The consensus was that there was more value in JSON than some other format, since it's well-understood, parsers are readily available, and there's no sign that it's disappearing or dropping out of maintenance any time soon.

The format allows for future metadata formats here if, say, json5 or YAML#++ becomes a well-adopted standard 10 years from now.

3. Parsing and mutability.

When working with these files, it's sometimes important to scan for information in the file to do some pre-processing. How many files are in the diff? Is this past some threshold that might trigger a rate limit if we fetch data for each from a repository? Are there binary files in this change?

When these files get very large (which can happen in enterprises when posting a change merging two long-running branches together -- no, people shouldn't ever post 100MB diffs, but they do), these operations can get expensive.

So we built in some parsing aids (content lengths for sections, section hierarchy identifiers) to allow for more efficient parsing. We can read a section header, know where we are and what section we're nested in, and jump to the next piece to read. This is far more efficient than parsing-to-scan and avoids a lot of headaches.
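As a rough sketch of what that jumping looks like, the Python below lists sections by reading each header and seeking past its content. It assumes a simplified layout loosely modelled on the spec's examples: section names such as `file` and `diff` and the rule that `length` counts the bytes right after the header line are assumptions for illustration, and `list_sections` is a hypothetical helper.

```python
import io


def list_sections(stream):
    """Yield (depth, name, length) per header, seeking over section content."""
    while True:
        line = stream.readline()
        if not line:
            break
        if not line.startswith(b"#"):
            continue  # only reached if a length was wrong or missing
        head, _, opts = line[1:].decode("utf-8").strip().partition(":")
        name = head.lstrip(".")
        depth = len(head) - len(name)
        length = 0
        for pair in opts.split(","):
            key, _, value = pair.strip().partition("=")
            if key == "length":
                length = int(value)
        stream.seek(length, io.SEEK_CUR)  # skip the content without scanning it
        yield depth, name, length


# Build a tiny sample so the lengths are guaranteed to match the content.
body = b'{"path": "README"}\n'
diff = b"--- README\n+++ README\n@@ -1 +1 @@\n-old\n+new\n"
sample = (
    b"#diffx: encoding=utf-8, version=1.0\n"
    b"#.change:\n"
    b"#..file:\n"
    + b"#...meta: format=json, length=%d\n" % len(body) + body
    + b"#...diff: length=%d\n" % len(diff) + diff
)

sections = list(list_sections(io.BytesIO(sample)))
for depth, name, length in sections:
    print("  " * depth + name, length)
```

Counting files or checking for binary sections then costs one header read per section rather than a pass over every diff line.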

We also get mutability. Generating a diff and attaching metadata to a file in the middle of a diff becomes a lot faster and safer this way.

A consumer never needs to do this. A tool does.

We figured we'd address the text encoding issues while we were at it, because oh boy can these cause problems. A whole topic of its own.

4. Multi-commit files.

Yep, there's Git format-patch. That works great for Git. If I'm on Perforce or SOS or ClearCase and I want to represent a series of commits, I don't have an equivalent format.

If one wants to be able to send a diff spanning a series of commits somewhere for processing or application, being able to do that with one file is valuable. One file means one thing to upload, no risk of a patch ordering issue or a missing patch in the series. The tool processing the diff file would have all the state it needs up-front.

5. Binary files.

Binary files are important. A lot of projects are more than source code in text format. Images, documents, 3D models.. these get left out of diff files today by default.

The exception is Git, which can represent changes to binary files as Git Literal and Git Delta formats. This is largely undocumented (outside of our spec) and not supported by really anything else.

We review binary files, so we wanted this. Talking to other SCM vendors, some found this a pain point as well but didn't have a solution in place. So we wanted this to be documented and addressed in the spec.

This is already very long, but I wanted to give a bit of insight into the kinds of problems and inconsistencies that tools (not necessarily end users) have to deal with, and how this is meant to address some of those problems.

chipx86•1d ago

Sorry for the absolute wall of text (I tried nesting things under bullet points but that didn't work out well). Hopefully some of that is useful.

Key point: Much of this is about solving issues with tools that work with the varying file formats of diffs. It's not really something end users should ever have to care about.

petepete•1d ago

It's really useful, thanks.

I think the original post left people a bit confused on why it exists but you explained it really clearly here.

chipx86•1d ago

I should also mention, we didn't want to invent a brand-new diff format that required all-new tooling (a replacement for GNU patch, for instance). We want this to be able to work with existing tools that understand Unified Diffs and respect the garbage areas (as most do).

There are a lot of alternative approaches to how one might generate a diff (see VCDIFF for binary files), and much that's worth thinking about for generating diff-like formats that are not line-based. But this is not meant to be those. It is meant to be able to incorporate those as time goes on (as it does with VCDIFF today).

lelandbatey•1d ago

I like that it's a very compatible and text-munging-friendly format. While folks may not like bits such as JSON, I think it's a good decision and leaves the door open for friendly extension.

kvemkon•1d ago

> length=629

If you work often enough with patches, you know it is not so rare you need to modify the patch itself resulting in different length. Please do not hard-code the length.
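To illustrate the cost: after a hand edit, something has to recompute the byte count and rewrite the header, roughly like this Python sketch (`rewrite_length` is a hypothetical helper, and the header layout is taken from the lines quoted in this thread).

```python
import re


def rewrite_length(header: str, content: bytes) -> str:
    """Return the header with its length= option matching the edited content."""
    return re.sub(r"length=\d+", f"length={len(content)}", header, count=1)


edited = b"--- a/file\n+++ b/file\n@@ -1 +1 @@\n-old\n+newer\n"
fixed = rewrite_length("#...diff: length=629", edited)
print(fixed)
```

It's a trivial step for a tool, but it's one more thing to get wrong when editing a patch in a text editor.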

xorcist•1d ago

From the outside, it seems like what you really wanted to write was a diff tool, not a new patch format.

You already are user facing. Why interpret a user facing format behind the scenes? It makes no sense. The document speaks about specifying yet another diff format, but luckily, it does nothing of the sort but specifies a new patch set format. But those are by necessity VCS- and file format specific.

edam30•1d ago

Nice

forrestthewoods•1d ago

> Most people and tools work with Unified Diffs. They look like this:

No I don’t. That plus-minus single column view is complete and total trash. Always has been, always will be.

I’ve used Araxis Merge for almost 20 years. Beyond Compare 3 is also a good choice. Not once in my entire life career have I ever relied on a “unified diff” or a “patch” or any of that garbage.

PhilipRoman•1d ago

What? Diffs and patches are orthogonal to what tools you use to view or merge them. The format defined by TFA is not necessarily meant for human consumption either.

forrestthewoods•1d ago

> Diffs and patches are orthogonal to what tools you use to view or merge them

Diffs and patches are not the foundation. You don’t have to use them at all.

quotemstr•1d ago

> They don’t standardize encodings, revisions, metadata, or even how filenames or paths are represented!

Sure, but some unified diffs, e.g. the ones produced by git, are quite regular. It's also common practice to express diffs as RFC822 email messages (often because they come that way), with headers and descriptive text.

I can't see DiffX getting traction. It's too alien. Too divorced from present practice, no matter how theoretically robust. It's like XHTML2.

Solving the same problems, I'd just establish conventions for sticking the needed metadata in RFC822-style pseudo-headers above the diff. This approach would work with, not against, existing tooling.

Not everything needs to be JSON.
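A sketch of the pseudo-header idea in Python, using the standard library's email parser. The `X-Base-Commit` and `X-Path-Encoding` header names and the file name are invented for this example; they are not an existing convention.

```python
from email.parser import Parser

# Hypothetical pseudo-headers, then a blank line, then an ordinary unified diff.
patch_text = """\
X-Base-Commit: b3c0288
X-Path-Encoding: utf-8

--- a/cdlog.sh
+++ b/cdlog.sh
@@ -1 +1 @@
-old
+new
"""

msg = Parser().parsestr(patch_text)
metadata = dict(msg.items())   # header name -> value
diff_body = msg.get_payload()  # everything after the blank line

print(metadata)
print(diff_body.splitlines()[0])
```

Tools that skip leading garbage, as GNU patch does, would be expected to ignore the header block.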

signa11•1d ago

difftastic: https://difftastic.wilfred.me.uk/ uses tree-sitter for better diff-info, and is, imho, superior to this.

chipx86•1d ago

difftastic is great!

This isn't a tool for viewing changes to files or to ASTs. This is a way of being able to generate a single diff file for processing or patching that addresses the kinds of problems we've encountered in over 20 years of building diff parsing tooling and working with over a dozen SCMs with varying levels of completeness or brokenness of bespoke custom diff formats.

It's not an end user tool, but a useful format for tools like code review products to use.

JanisErdmanis•1d ago

This looks great. The diff is quite inefficient for patching with the C preprocessor branches.

Since it patches the code, looking at its tree structure, is the diff human readable, and can it be edited directly? This is a major contributor to why I opt for sed for patching.

touristtam•1d ago

Seen it before, but I might try this time.

FWIW: mise install is breaking due to the submodule. I had to resort to brew install

HelloNurse•1d ago

A staggering amount of unnecessary and counterproductive scope creep in just 4 items:

    A single diff can’t represent a list of commits

    There’s no standard way to represent binary patches

    Diffs don’t know about text encodings (which is more of a problem than you might think)

    Diffs don’t have any standard format for arbitrary metadata, so everyone implements it their own way.

Of these, only a notation for binary patches would be a reasonable generalization of diff files. Everything else is the internal data structure or protocol of some specific revision control system, only exchanged between its clients and servers and backups.

tankenmate•1d ago

Not so, obviously it is less common these days, but I still use patch(1) and friends enough to run into problems from time to time. This is especially true when you have devs on different platforms (don't even get me started on filename mangling / case-folding issues).

Borg3•1d ago

Oh, then this is a management issue, not tooling. You need to sit down and analyze where your stuff will be developed. Some very basic rules to start with: file names need to be all lower case (they are case-insensitive), use 7bit ASCII encoding for source code files. And voilà :)

bawolff•1d ago

What exactly is the lowest common denominator platform we are trying to target here where we need 7bit ascii? MS-dos?

keybored•1d ago

Could just be Linux. Filenames are just bytes so two equivalent Unicode filenames that have been normalized differently could be confusing. I guess?

I guess since I’m too afraid to use non-ASCII in filenames much.

bawolff•1d ago

I guess that is fair. If i remember right mac uses NFD where literally everyone else in the world uses NFC (linux might not normalize but basically it usually ends up being NFC).

That said, i feel like this is something most tooling could just handle, and not really an issue.

Certainly its not a problem diffX is going to solve since it appears to only store charset and not filename normalization rules.

dotancohen•1d ago

I had this condition a few years ago. A folder shared with Dropbox was then renormalized either by Dropbox or by another system, then when it was synced back to the original machine I had two folders with identical names, normalized differently.

I still have some ls and hd output that I stored in my notes files, if anybody is interested.

dotancohen•1d ago

Here, found it:

  $ ls
  Español  Español  Français  Français
  $ ls | hexdump -C
  00000000  45 73 70 61 6e cc 83 6f  6c 0a 45 73 70 61 c3 b1  |Espan..ol.Espa..|
  00000010  6f 6c 0a 46 72 61 6e 63  cc a7 61 69 73 0a 46 72  |ol.Franc..ais.Fr|
  00000020  61 6e c3 a7 61 69 73 0a                           |an..ais.|
  00000028

bawolff•1d ago

The first one (6e cc 83) is NFD which is used by mac, the second one (c3 b1) is NFC which is used by everyone else.
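The two byte sequences are easy to reproduce with Python's `unicodedata` module, for anyone who wants to check their own filenames:

```python
import unicodedata

name = "Espa\u00f1ol"  # "Español" written with the precomposed ñ

nfc = unicodedata.normalize("NFC", name)
nfd = unicodedata.normalize("NFD", name)

# Both render identically, but the UTF-8 bytes differ.
print(nfc.encode("utf-8").hex(" "))  # 45 73 70 61 c3 b1 6f 6c
print(nfd.encode("utf-8").hex(" "))  # 45 73 70 61 6e cc 83 6f 6c
print(nfc == nfd)                    # False
```

A filesystem that treats names as plain bytes sees these as two different entries, which matches the `ls` output above.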

dotancohen•1d ago

Thanks. I did have a company Mac in 2017, and it was connected to that account.

Joker_vD•1d ago

> I’m too afraid to use non-ASCII in filenames much.

I suggest installing a fresh Linux distribution with e.g. bg_BG.UTF-8 locale and playing with it, especially with XDG directories like "Плот", "Свалени" and "Документи", and apps that should use them by default. Everything should Just Work™.

Although I admit that when reporting bugs for apps that can't handle non-ASCII paths, the responses from the developers (unless they're themselves from non-English speaking countries, but sometimes even then) quite often seem to be very thinly veiled "I can't be bothered to figure out where I botch things, why can't you just speak English like all reasonable people".

bawolff•1d ago

To be fair, as far as unicode goes, cryllic is kind of the easy case (no combining characters, no rtl, etc). In some ways its even easier than (non-english) latin scripts because in latin you can get easily confused with windows-1252 where things sort of work where if you are accidentally using a legacy 8bit encoding with cryllic you are more likely to figure that out quickly.

HelloNurse•15h ago

It's "Cyrillic", named after St. Cyrill.

theamk•17h ago

Any system which uses encodings, including Windows and Linux in non-utf8 locale.

NavinF•1d ago

Poe's law at work. Replies are taking you literally, but I'm almost certain that you're joking. Very few large projects exclusively have lowercase filenames

chipx86•1d ago

We build a code review product that interfaces with over a dozen SCMs. In about 20 years of writing diff parsers, we've encountered all kinds of problems and limitations in SCM-generated diff files (which we have to process) that we wouldn't ever have expected to even consider thinking about before. This all comes from the pain points and lessons learned in that work, and has been a huge help in solving these for us.

These aren't problems end users should hopefully ever need to worry about, but they're problems that tools need to worry about and work around. Especially for SCMs that don't have a diff format of their own, have one that is missing data (in some, not all changes can be represented, e.g. deleted files), or don't include enough information for another tool to identify the file in a repository.

HelloNurse•1d ago

Better file formats cannot, by themselves, improve an inferior SCM tool that, for instance, processes files with the wrong text encoding or forgets deleted and renamed files: they would only have helped you for the purpose of developing your code review tool.

Standards are meant for interchange, like (as mentioned in other comments) producing a patch file by any means and having someone else apply it regardless of what they use for version control.

didsomeonesay•1d ago

If the project owner posted this or is reading here: watch out with the project naming; DivX is a highly litigative brand these days.

touristtam•1d ago

di_ff_x not di_v_x, you div (joking).

On a serious note, I don't see how that would impact them.

bawolff•1d ago

Are these really problems? I feel like i've never really encountered any of these issues and have trouble imagining when they would crop up (except binary files).

- encoding - even if your file is not utf-8, why would that matter? You would still run the patch algorithm the same way. It doesn't really matter if the characters are valid utf-8

- why would i want a single diff to represent multiple commits? Having multiple diffs seems much more natural.

- metadata... i guess, but also the metadata seems like it would mostly only be useful inside a single system.

chipx86•1d ago

Generally-speaking, you probably shouldn't have to deal with these problems unless you're writing a tool that has to interface with certain SCMs or SCMs used in certain environments. I'll give you some examples for each of these points:

1. There are two important areas where encoding can matter: The filename and the diff content.

Git pays attention to filename encoding, but most SCMs don't, so when a diff is generated, it's based on the local encoding. If there are any non-ASCII characters in that filename, a diff generated in one environment with one encoding set can end up not applying to another (or, in our case, not being able to be looked up from a repository). This isn't common but it can happen (we've seen this on Perforce and Subversion).

Then there's the content. Many SCMs will actually give you a representation of a text file and not the raw contents itself. That text file will be re-encoded for your local/preferred encoding, and newlines may be adjusted as well (`\r\n`, `\n`). The text file is then re-encoded back when pushing the change. This allows people in different environments to operate on the same file regardless of what encoding they're working with.

This doesn't necessarily make its way into the diff, though. So when you send a diff from a less-common encoding to a tool to process it, and that tries to apply it to the file checked out with its encoding, it can fail to patch.

The solution is to either know the encoding of the file you're processing, or try to guess it (some tools, like ours, let you specify a list of preferred encodings to try).

It's best if you can know it up-front.
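The guessing fallback amounts to something like the following Python sketch. This is an illustration of the general approach, not our actual code; `decode_with_fallbacks` and the candidate list are made up for the example.

```python
def decode_with_fallbacks(data: bytes, encodings=("utf-8", "latin-1")):
    """Try each candidate encoding in order and return (text, encoding).

    Order matters: latin-1 accepts every byte sequence, so it has to come
    last, and a "successful" decode is still only a guess.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


print(decode_with_fallbacks("Español".encode("utf-8")))
print(decode_with_fallbacks("Español".encode("latin-1")))
```

A declared encoding in the diff removes the guesswork, along with the risk of silently picking the wrong candidate.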

Bonus Fun Fact: On some SCMs (Perforce comes to mind), checking out a file on Windows and then diffing it on Linux via a shared mount can get you a diff with `\r\r\n` newlines. It's a bad time and breaks patching. And it used to come up a lot, until we worked around it.
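One possible shape for such a workaround, as a Python sketch only (not necessarily what we shipped): collapse any run of carriage returns before a newline. It would also rewrite files that legitimately contain those bytes, so it is only safe when the doubled `\r` is known to be an artifact.

```python
import re


def normalize_newlines(data: bytes) -> bytes:
    """Collapse \\r\\n and \\r\\r\\n line endings in diff text to plain \\n."""
    return re.sub(rb"\r+\n", b"\n", data)


broken = b"@@ -1 +1 @@\r\r\n-old\r\r\n+new\r\n"
print(normalize_newlines(broken))
```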

Also, Perforce for a while would sometimes normalize encodings incorrectly and you'd end up with BOMs in the diff, breaking GNU patch.

2. It does when you're working with them directly for applying and patching. If you're handing them off to a tool for processing, if there's any risk of one file in a sequence not being included, you can end up with breakages that maybe you don't see until later processing.

It's also just really nice having all the state and metadata up-front so we can process it in one go in a consistent way without having to sanity-check all the diffs against each other.

When working locally, it also depends on your tooling. `git format-patch` and `git am` are great, but are for Git. If I'm working with (let's just say) Subversion, I need to do my own thing or find another tool.

3. It's critical for the kind of information needed to locate files in a repository. Some systems need a commit-wide identifier. Some need per-file identifiers. Some need a combination of the two. Some need those plus additional data not otherwise represented in the path or revision (generally more enterprise SCMs targeting certain use cases).

It's also critical for representing information that isn't in the Unified Diff format (namely, anything but the filename). So, symlink information, file modes, SCM-specific properties on a file or directory, to name a few. This information needs to live somewhere if a SCM provides it, and it's up to every SCM to choose how and where to store that data (and then how it's encoded, etc.).

account42•1d ago

> Then there's the content. Many SCMs will actually give you a representation of a text file and not the raw contents itself. That text file will be re-encoded for your local/preferred encoding, and newlines may be adjusted as well (`\r\n`, `\n`). The text file is then re-encoded back when pushing the change.

Yeah, don't do that.

> This allows people in different environments to operate on the same file regardless of what encoding they're working with.

No, it causes hard-to-understand bugs because now what people see on their device and what is tracked in source control differs, defeating the entire purpose of having source control in the first place. This isn't theoretical at all btw.

> The solution is to either know the encoding of the file you're processing

In general, there is no such encoding - source control tools need to be able to deal with files not valid in any single encoding.

account42•1d ago

> - encoding - even if your file is not utf-8, why would that matter? You would still run the patch algorithm the same way. It doesn't really matter if the characters are valid utf-8

Yeah, I don't see a use case for a patch encoding either - just treat the patch data as ASCII-delimited binary mystery goo. Patch files need to be able to deal with mixed-encoding text (e.g. to fix it), so you can't really just have one encoding anyway.

rwmj•1d ago

They're not problems at all. They probably should have asked people who regularly use diffs what actual problems they have, rather than trying to reinvent some overengineered yaml in a vacuum.

ris•1d ago

Binary data - definitely a problem.

xiphias2•1d ago

This format may be backwards compatible, but not forwards compatible.

JSON solved this mostly by standardizing what most implementations were already doing, so that would be a great thing to do.

If git diff isn't documented, the solution is not to create a new format, but to go through the source code and document it.

bravesoul2•1d ago

If disk space is no issue, a good difference format is just the entire original file and the entire new file.

If an editor was involved and you want better mergeability, you could include the original file and the sequence of CRDT ops.

SuperNinKenDo•1d ago

Isn't that technically not the case if you wanted to apply a specific patch without every other change that's been made to the same file?

The more changes, the more likely it is for a patch to fail, but in principle it seems like cherry-picking and applying a change is a valid use of a diff.

account42•1d ago

Git already conceptually stores an entire tree of files for each commit and has no problem with rebasing, cherry-picking etc. And the patches it generates for you are derived on the fly from those snapshots - a commit doesn't have a fixed canonical patch text.

SuperNinKenDo•1d ago

But this is a proposal for a diff format, not a specification for a VCS.

aidenn0•1d ago

The original file plus the file with the changes you care about is a superset of the information included in a diff file.

greatgib•1d ago

Extending/reworking the format is probably good, but I don't think that using multiline (indentation-dependent) JSON or YAML would be good for such a thing.

One of the interesting points of diff files is that all commands are on single lines. You can easily parse or manipulate them with simple shell tools, just stripping lines out.
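To illustrate the one-command-per-line point, here is a rough sketch in Python rather than shell. `summarize` is a hypothetical helper, not part of any diff tool, and it assumes plain unified-diff text in which file headers start with `--- ` and `+++ `.

```python
def summarize(diff_text):
    """Return (files, added, removed) by looking only at line prefixes."""
    files, added, removed = [], 0, 0
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            # New-file header; drop an optional tab-separated timestamp.
            files.append(line[4:].split("\t", 1)[0])
        elif line.startswith("--- "):
            # Old-file header; nothing to count.
            continue
        elif line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
    return files, added, removed


sample = (
    "--- a/hello.txt\n"
    "+++ b/hello.txt\n"
    "@@ -1,2 +1,2 @@\n"
    " hello\n"
    "-world\n"
    "+there\n"
)
print(summarize(sample))  # (['b/hello.txt'], 1, 1)
```

The prefix check is deliberately naive: a removed line whose content begins with `-- ` would be mistaken for a file header, so a careful tool would also track hunk boundaries.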

eddd-ddde•1d ago

So much this. At the very least metadata should be inside BEGIN and END markers that allow for easy extraction with something like awk. Not sprinkled around _multiple_ JSON objects you have to merge in manually.

tlb•1d ago

The most general and unambiguous way to represent a diff is to just include the contents of the two files. It's more data, but that's rarely an issue these days.

So instead of `diff a b | patch c`, where the data through the pipe needs to be in some interchange format, you'd run `apply a b c` and the apply command can use whatever internal representation it likes.

Diffs also aren't great for human reading. A color-coded side-by-side view is better. For which you also want to start with the two files.

There's really no need to ever transmit a diff and deal with all the format vagaries when you can just send the two files.

vintermann•1d ago

> It's more data, but that's rarely an issue these days.

I do think it's a bit annoying that a program gets updated, and you have to download the whole 130 GB again.

But especially given how quickly and easily you can compress two almost-identical files, I think your approach has a lot going for it. It may even be possible to get clever and send over just a hash of the original file, and a version of the new file which has been compressed with the original file as prefix (but without the actual compressed data for that prefix).
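One way to sketch the "compress with the original as prefix" idea is zlib's preset dictionary. `pack` and `unpack` are hypothetical names, and this only helps when the shared content falls within zlib's 32 KiB window, so it is an illustration of the idea and not something that would work for a 130 GB download.

```python
import hashlib
import zlib


def pack(original: bytes, new: bytes):
    """Return (sha256 of original, new file compressed against original)."""
    comp = zlib.compressobj(level=9, zdict=original)
    blob = comp.compress(new) + comp.flush()
    return hashlib.sha256(original).hexdigest(), blob


def unpack(original: bytes, digest: str, blob: bytes) -> bytes:
    """Rebuild the new file; the receiver must already hold the same original."""
    if hashlib.sha256(original).hexdigest() != digest:
        raise ValueError("receiver has a different original file")
    decomp = zlib.decompressobj(zdict=original)
    return decomp.decompress(blob) + decomp.flush()
```

Only the hash and the blob travel over the wire. The original is never sent, and the hash lets the receiver refuse to decode against the wrong base file.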

thecupisblue•1d ago

>There's really no need to ever transmit a diff and deal with all the format vagaries when you can just send the two files.

Well, it depends on what you are doing, and in 2025 diffs are more relevant than ever.

Asking an LLM to output a diff for an edit can save you a staggering amount of tokens and cut the latency of its response by 5-10x. I've done it your way, a custom diff way and then added a standard diff one, and even back then with GPT 3.5 there was a huge difference, let alone now with way larger models.

There are a lot of diffs in the dataset, so telling it to create a standard diff is usually no different from asking it to create a whole file in terms of accuracy (depending on the task), but it saves you all the output tokens and reduces the amount of compute/time required to infer all those tokens.

Updating code running in a sandbox on a third machine over the wire, where speed is relevant? You want a diff. I did it your way first for ease, but knowing how much data and compute I was wasting on that, it was a low-hanging optimisation to use a diff, and it worked wonders. Yeah, for most use cases it would be overkill, but for this use case milliseconds were important.

If you have file A and A2 and a diff AxA2, constructing file A2 is easy and saves you all the A-diff data.

Merging is the only potential issue, because of conflicts. That is where a human or LLM has to come in, and where just having a file A2 to overwrite the original would be easy. But conflicts occur only in cases where you might not want that overwrite to happen.

TL;DR: diff good.

layer8•1d ago

A diff between two files isn’t unique, meaning there can be better or worse diffs between the same two versions of a file, depending on the file format and possibly the purpose of the diff. Similarly, there can be different strategies for applying a diff as a patch.

Having a diff format allows decoupling the implementation of diff creation from the implementation of diff application, turning a potential n*m problem into an n+m problem.
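A small sketch of both points, using an ad-hoc edit-script format made up for this example (a list of `(tag, start, end, replacement_lines)` tuples, not any real diff format). Two different producers emit different scripts for the same pair of files, and one consumer applies either.

```python
from difflib import SequenceMatcher


def minimal_ops(a, b):
    """Producer 1: a compact edit script derived from difflib's opcodes."""
    sm = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, i1, i2, b[j1:j2])
        for tag, i1, i2, j1, j2 in sm.get_opcodes()
        if tag != "equal"
    ]


def naive_ops(a, b):
    """Producer 2: a valid but wasteful script that replaces everything."""
    return [("replace", 0, len(a), list(b))]


def apply_ops(a, ops):
    """Consumer: applies any script in this format, whoever produced it."""
    out = list(a)
    # Work from the end so earlier indices stay valid.
    for _tag, i1, i2, new_lines in reversed(ops):
        out[i1:i2] = new_lines
    return out


old = ["a", "b", "c"]
new = ["a", "x", "c"]
assert minimal_ops(old, new) != naive_ops(old, new)
assert apply_ops(old, minimal_ops(old, new)) == new
assert apply_ops(old, naive_ops(old, new)) == new
```

Because both producers target the same format, adding a third producer or a second consumer does not require touching the existing ones, which is the n+m property.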

blacklion•1d ago

It is hard to manipulate pairs of files without a special container. For example, you want to attach changes to an e-mail, and the changes cover 10 files + 1 removed file + 2 added files. Will you pack it into a tar/zip with two folders `old` and `new` inside, or what? Looks like a pre-VCS era solution, when we did manual "version control" by copying `project` to `project-19950112-final-for-sure` :)

account42•1d ago

Congratulations, you have added "standard" n+1 for patch and other tools to deal with.

blacklion•1d ago

So, a self-delimiting format (JSON) is embedded in a format with lengths? I change one space in the JSON, the JSON is still valid, and the whole DiffX file is invalid.

Nice, nice.

The format looks very clunky and messy, to be honest: a mixture of self-invented headers and JSON payloads, a strange structure (without the comments here I would not have noticed the different number of dots in `.meta`), and a need for essentially two parsers.

The idea of having an extended diff with a standard way to include metadata is good.

This implementation looks bad, sorry.
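A toy illustration of the length problem. This is not the DiffX reference parser: `write_section` and `read_section` are hypothetical, the header shape is only loosely modeled on the `#..meta: format=json, length=270` line quoted in this thread, and the length is assumed to count payload bytes.

```python
import json


def write_section(name: str, payload: bytes) -> bytes:
    """Prefix a JSON payload with a header that records its byte length."""
    header = f"#..{name}: format=json, length={len(payload)}\n"
    return header.encode("ascii") + payload


def read_section(data: bytes):
    """Read one section; whatever follows must be empty or another header."""
    header, _, rest = data.partition(b"\n")
    options = dict(
        item.strip().split("=", 1)
        for item in header.decode("ascii").split(":", 1)[1].split(",")
    )
    length = int(options["length"])
    payload, trailing = rest[:length], rest[length:]
    if trailing and not trailing.startswith(b"#"):
        raise ValueError("declared length does not match the payload")
    return json.loads(payload), trailing


doc = write_section("meta", b'{"a": 1}\n')
print(read_section(doc))

# One extra space keeps the JSON valid but makes the declared length stale.
edited = doc.replace(b'"a": 1', b'"a":  1')
try:
    read_section(edited)
except ValueError as exc:
    print("rejected:", exc)
```

Any hand edit to the payload therefore has to be followed by recomputing the header, which is the fragility being described.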

eisbaw•1d ago

Why are diffs still so text-based? That should be a last resort. Typically when you edit a text file, your action has meaning beyond the primitive edit. You probably changed a variable name, moved a function above another. Replaced a whole chunk of stuff with other stuff, etc.

The problem with diffs is that they are not easily portable - because their intent is so (accidentally) low-level. Imbue diffX with semantics, and they can become more readable - by humans and AI alike.

karmakaze•1d ago

Odd that it's a format that couldn't decide on a format: "#..meta: format=json, length=270". The "length=270" also has a redundant/fragile smell to it.

My other thought was: when was the last time I had problems with a diff file? Can't recall (maybe decades ago). It's probably only a problem when working with multiple VCSes, in which case you could make a diff translator that understands each one intimately.

bronlund•1d ago

"We have four different standards, why don't we just make one proper!". And that's why we now have five standards.

Jokes aside, good luck with this one :)

b0a04gl•1d ago

tbh this feels like overengineering the wrong pain. devs aren't begging for more metadata in diffs, they're begging for tooling that doesn't explode when someone renames a file or changes line endings. most teams can't even agree on commit message formats, and now we're gonna throw structured diff metadata into the mix? cool idea, but feels like a clean spec solving a messy human problem. the chaos isn't in the diff format, it's in the humans using it.

WhyNotHugo•1d ago

They could have built on top of git’s header syntax for metadata (which itself is based on email headers) instead of reinventing it in a new flavour of pseudo-JSON.

xkcd927 strikes again!

crabbone•1d ago

> #...diff: length=629

What an awful idea... Now if I have to edit a patch I need to count characters in it? Come on...

looneysquash•1d ago

How does ReviewBoard make use of diffx if none of the existing tools support it?

Is that a chicken-and-egg problem, or is it useful by itself?

diffxx•23h ago

I'm ready for their follow up product.
