This is nothing about diffs, but entirely about patch metadata management. I mean, sure, noble goal, but this is just shuffling bits around. If they proposed that metadata was required to be JSON that would be one thing, but instead it's some weird self-describing length-delimited nonsense that just disguises the same problems that exist today. It's already extensible! Just type words!
I've spent a lot of time parsing things out of git commits and patch files and while some standardization would be neat, this isn't it.
That said I find the argument that git diff style is more or less canonical more compelling than I have in the past. So there's that.
> A single diff can't represent a list of commits
A patch set can! Why on earth you would want that represented by a single diff is beyond me.
I've worked at a place that used Review Board, and SVN as their primary vcs, but many devs used a local git-svn mirror for their work. Sometimes this caused problems with uploading diffs, especially if svn and git-svn were being mixed in one review. Having the Review Board cli generate a common diff format for both would have helped with that.
We needed something for ourselves at the very least. Much of DiffX came from thinking about these pain points and from talking to other SCM vendors whose engineers have also given some thought to these problems.
...and you seriously believe that pushing yet another ad-hoc format, and one which no one at all uses, is a way to address your concern?
I don't really see this as pushing anything, more as documentation of something they did for themselves, but are also willing to provide to anyone else if they want to use it. Same as how the source code for the core Review Board product is available for anyone.
If you're happy with the diff format you're using in your workflow, keep using that. No one's twisting your arm to switch to DiffX.
Which mainstream VCS supports this format?
So to answer your question, any vcs that has had tools written for it to generate this format. And it sounds like it's most of the major ones as far as Review Board is concerned.
As an aside, I applaud you for creating Review Board. I've introduced its usage with several teams that I've worked with, and it really helped change how those teams operated, from a fly-by-night sort of development to actually having a process. The reduction in bugs and improvement in code quality were quite useful too.
It'd have been amazing having something like DiffX when we started building some of these SCM integrations too. It's really saved us a lot of trouble with some of the recent ones we've built (PlasticSCM / Unity Version Control, Keysight SOS, and ClearCase), which didn't have a format to work with and needed a lot of extra metadata for lookups and some other stuff.
I'm sorry, this is simply wrong at so many levels. You're lauding this as a solution in search for a problem. As OP pointed out, this is already a solved problem as proven by Git. Git is not using a proprietary format. The problem of "integrate with many other version control systems" depends on whether those version control systems want to work on adding support for this feature. I guarantee you there isn't a single SVN or Mercurial maintainer complaining that they would love to share patches with Git but they are blocked because they cannot implement, let alone design, a format to exchange patches. That is not the hard part. That doesn't even register as a concern.
The intent here isn't to let you copy changes from one type of repository to another, but to have a format that can be generated from many SCMs that a tool could parse in a consistent way.
Right now, tools working with diffs from multiple types of SCMs need at least one diff parser per SCM (some provide multiple formats, or have significantly changed compatibility between releases).
For SCMs that lack a diff format (there are several) or lack one that contains enough information to identify a file or its changes (there are several), tools also need to choose a method to represent that information. That often means yet another custom diff format that is specific to the tool consuming the diff.
We've spent over 20 years dealing with the headaches and pain points here, giving it a lot of thought. DiffX (which is now a few years old itself) has worked out very well as a solution for us. This wasn't done in a vacuum, but rather has gone through many rounds of discussion with developers at a few different SCM vendors who have given thought to these issues and supplied much valuable feedback and improvements for the spec.
What definition of "proprietary" are you using?
Other SCMs can and do use a Git-style diff format, but as there's no defined grammar, there are sometimes important differences. For example, Mercurial's Git-style diffs represent the revisions in a different format than Git's does with different meanings, reuse Git "index" lines for binary files but include SHAs of the file contents instead of any sort of revision, and have a header block that should be stripped out before sending to a Git-style diff parser.
It's been a few years now, and so far so good for the purposes we built it for. And it's there for any other tool or SCM authors to use if it also happens to be useful to them.
1. For most people using multiple SCMs is just a huge and easily-avoidable mistake. Most people can just mandate a single SCM for a project and then all these problems are moot.
2. For the things listed in TFA:
> A single diff can’t represent a list of commits
That's what "patch" and "patch format" are for. It works great.
> There’s no standard way to represent binary patches
Very unclear why anyone needs this. There's no standard way to code-review a binary diff (it depends what the blob is that you're diffing) so how would it help if you had this standard way to represent the diff?
> Diffs don’t know about text encodings (which is more of a problem than you might think)
This goes away if people on a project agree on a particular encoding (which is going to be UTF-8, let's face it). If someone sends a diff in an incorrect file encoding via DiffX, it will still apply wrong if someone uses a non-DiffX-aware (aka standard) tool to apply it. So DiffX doesn't really fix this problem.
> Diffs don’t have any standard format for arbitrary metadata, so everyone implements it their own way.
This goes away if you just use one SCM for a project, which you should anyway for everyone's sanity.
You talk about SCMs, we're talking about VCSs. Where it's not just source code under control, or even source code with a handful of binary assets. Imagine dealing with a VCS that has to handle 15 years and a few petabytes of binary assets. Or individual files that were multiple gigabytes and had changes made to them several times per day. Can git do that gracefully just by itself? Or SVN? Even Perforce struggled with something like that back in the day.
>Very unclear why anyone needs this. There's no standard way to code-review a binary diff (it depends what the blob is that you're diffing) so how would it help if you had this standard way to represent the diff?
A standard way of handling the binary data doesn't mean understanding the binary data. You can leave that up to specific tools. What you need though is a way to somehow package up and describe those binary diffs enough that you can transport the diff data and pick the right tool to show you the actual differences.
> This goes away if you just use one SCM for a project which you should anyway for everyone's sanity.
And if wishes were fishes, I'd never be hungry again. If you have a lot of history, a lot of data, a lot of workflows and tools built up around multiple VCSs, then changing that to just one VCS is going to be a massive undertaking. And not every VCS can handle all of the kinds of data that might get input into it. Some are going to be good at text data, some might handle binary assets better. Some might have a commit model that makes sense for one type of workflow but not for another. For example, you might be dealing with binary assets where you can only have one person working on a specific file at a time because there's no real way to merge changes from multiple people, so they need to lock it. For text assets though, you might be able to handle having multiple people work on a file. To afford both workflows, your VCS now needs to not only support both locking modes, but be hyper-aware of the specific content to know which kind of locking to permit for specific files.
The world doesn't always fit into the nice little models that the most popular VCSs provide. So if you're trying to not limit your product to supporting just those handful of popular VCSs, you can't just assume everything will fit into one of those models.
Part of the problem is that you're fabricating imaginary problems that no one is actually experiencing, and only to try to argue that the solution for these imaginary problems is a file format.
Does this sound reasonable to you?
15 years is nothing to OS level codebases.
Or a gamedev studio that has been making various MMO's since the early 2000's.
That's a very strong statement. A less aggressive approach to discussion might involve asking for a concrete example of a problem rather than assuming bad faith argument.
Off the top of my head, and just spitballing, I would be more surprised if mature game devs or animation studios didn't want to version control pretty massive asset libraries.
Exactly the sort of thing I had in mind.
I was one of those maintainers. So you're already wrong there. As I described in my parent comment, I've worked somewhere this was an actual problem I encountered. I was responsible for both maintaining our SVN repository, and our Review Board instance, so I have had to actually deal with this.
Please point out exactly where support for DiffX was implemented in either Mercurial or SVN.
Reviewing changes on a long line (like compressed json or long array) is too difficult.
For line diffing, it clips to only show the ~3 lines before and after the change. But the word diff still prints the entire line, and with line wrapping you have to scroll to find the change.
This format is meant to be an extension of Unified Diffs (much like the diff formats of most SCMs), and not something entirely new and focusing on other areas. But if more specific diff formats become widespread, we could directly support encoding them within DiffX as well, as we do for binary diffs formats.
This helps with readability (if one of the “meta” blocks is missing, for example, I could still tell at a glance what it refers to without counting dots), and is less error prone (it makes little sense to me why the metadata associated with a whole diff should have the same fields as the metadata of a file).
Furthermore, why do we have two formats? Json and key=value pairs? Is there any reason to not just use one format because it sounds like the number of things we’d want to annotate is quite small. Having a single structure makes it much easier to write parsers or integrate with existing tooling (grep, sed or jq - but not both at once)
Other notes:
- please allow trailing commas in lists
- diffs are inherently splittable. I can grab half of a diff and apply it. How does your format influence that? I guess it breaks because I would need to copy the preamble, then skip 20 lines, then copy the block I need?
- revisions are a file property? Not a commit checksum? (I might just be dumb here)
The header formats are meant to be very simple key/value pairs that are known by the parser, and not free-form bits of metadata. That's what the "meta" blocks are for. The parsing rules for the header are intentionally very simple.
JSON was chosen after a lot of discussion between us and outside parties and after experimentation with other grammars. The header for a meta block can specify a format used to serialize the data, in case down the road something supplants JSON in a meaningful way. We didn't want to box ourselves in, but we also don't want to just let any format sit in there (as that brings us back to the format compatibility headaches we face today).
For the other notes:
1. Compatibility is a key factor here, so we'd want to go with base-level JSON. I'd prefer being able to have trailing commas in lists, but not enough to make life hard for someone implementing this without access to a JSON5 parser.
2. If your goal is to simply feed to GNU patch (or similar), you can still split it. This extra data is in the Unified Diff "garbage" areas, so they'll be ignored anyway (so long as they don't conflict, and we take care to ensure that in our recommendations on encoding).
If your goal is to split into two DiffX files, it does become more complicated in that you'd need to re-add the leading headers.
That said, not all diff formats used in the wild can be split and still retain all metadata. Mercurial diffs, for example, have a header that must be present at the top to indicate parent commit information. You can remove that and still feed to GNU patch, but Mercurial (or tools supporting the format) will no longer have the information on the parent commit.
3. Revisions depend heavily on the SCM. Some SCMs use a commit identifier. Some use per-file identifiers. Some use a combination of the two. Some use those plus additional information that either gets injected into the diff or needs to be known out-of-band. There's a wide variety of requirements here across the SCM landscape.
Everyone has access to a JSON5 parser. Everyone has to suffer for the sake of a few people who don't want to pay the trifling tax of pip installing something --- when they're using an external library for a novel file format _anyway_?
That's just a lack of imagination. When you're making a product for teams that span everything from a brand new startup using the latest tooling to teams that are working on software that runs on embedded systems from the 90's, you need to consider things like this.
It's also just not that big a deal overall for the intended use of the DiffX format. It's mainly machine-generated and machine-consumed. There's human readability concerns for sure, but the format looks to be designed mainly for tools to create and consume, so missing a few features that JSON5 brings is not that big of a deal.
Why are these people the target market?
I understand it may be important to you, but that isn't the same as "matters to target market/audience".
On top of that, the same constraints you mention here would stop you parsing current git patch formats, and lots of other things anyway. So you were never going to be using modern tools that might care here.
This is all also really meta. Who exactly is writing software with >1% market share, needs to parse the patch format, and can't access a JSON parser.
Instead of this theoretical discussion, let's have a concrete one.
I also think that target of "embedded systems from the 90's" makes no sense because the tooling for the embedded system, which is what would conceivably want to handle patch format, ran on the host, which easily had access to a JSON parser.
But let's assume it does matter - let's be super concrete - assume they want to serve 95-99% of the users of patch format (i doubt it's even that high).
Which exact pieces of software with even >1% market share that need to process patch format don't have access to a JSON parser?
This is what I was referring to. This is not json:
> #..meta: format=json, length=270
> The header formats are meant to be very simple key/value pairs that are known by the parser, and not free-form bits of metadata. That's what the "meta" blocks are for. The parsing rules for the header are intentionally very simple.
Exactly my point. That level of flexibility for a .patch format to support another language embedded in it is overwhelming. Keep in mind that you are proposing a textual format, not a binary format. So people will use 3rd party text parsing tools to play with it. And having 2 distinct languages in there makes that annoying and a pain.
One more thing you should prepare for whenever you have "free-form bits of metadata". They somehow turn into: "some user was storing 100MB blobs in there, and that broke our other thing".
JSON doesn't seem a good choice for representing metadata in a format that aims to be universal. It is unnecessarily complicated for this purpose IMO
That's an odd statement. Can you please explain why you believe that JSON is "unnecessarily complicated" to represent metadata.
* JSON is just barely powerful enough to need a library to parse it, but not powerful enough to have comments or trailing commas, so editing is needlessly annoying.
* It's human-readable, but deciphering nested data structures is annoying, especially when things are formatted as long lines. If you have to pipe something to jq to be able to read it, it's broken as a text-based document format.
* JSON is needlessly strict. If I write {foo: 5}, my intent is crystal clear. I shouldn't have to write {"foo": 5}. Come on. Who's really helped by this kind of syntactic hairshirt?
* despite being a strict schoolmarm of text formats, JSON is still vague. Yes, it has numbers. How big can they be? Who knows?
I mean, JSON is fine-ish, I guess, as an interchange format, in which I'm looking at it only for the occasional debugging session. But as a format for documents meant to be read by humans? Ugh. Anything, anything at all but JSON.
When editing a patch file, how often do you edit metadata beyond the file content differences?
It seems that the proposed DiffX is meant to be produced/edited only by machines, so JSON doesn't seem too much of a problem for this use case.
Diffx authors wrote a prototype in Python where JSON support is built-in. Go parse it in bash or C.
Having to parse JSON for the sake of applying a patch is not exactly wise
Luckily, this does not matter in DiffX because the whole thing goes up in flames when you change the length anyways :)
Git's diff format contains enough information to do this, but many don't. When this happens, tools like ours (a code review platform) are forced to modify the diff format or otherwise attach information to do these sorts of operations.
We've had to do this enough times, and work around enough inconsistencies and breakages in various SCM-specific diff format variants that we decided to address the problem head-on, pulling our experiences and those of some of the SCM vendors we work with to draft a solution to the problem.
What's wrong with JSON?
There's not one patch/diff format. There's often at least one per SCM. A couple are pretty good (Git's), many are okay (Subversion's), and many are really bad or non-existent.
I founded one of the older code review products, Review Board (turning 20 next year), and we deal with the problems this is trying to solve all the time, across over a dozen SCMs. So we're the ones complaining :) And much of this is based on extensive feedback from SCM vendors we've spoken to about this at length.
Most people shouldn't have to care. But it benefits tools like ours that have to deal with the nightmare that is the world of diff formats.
Diff files are there to represent a delta state of the repository, a difference between a range of changes. Those may be one or more commits, or one or more changes across individual files (not all SCMs manage state in terms of atomic commits). That covers file changes, file attribute changes, SCM-specific metadata changes, and commit history information.
That delta state should be able to be applied to another tree in order to get the same end result. This is what diff files are ultimately there for.
Git diffs do this today, and they do it well (but they're pretty Git-specific). Many SCMs (and there are a lot of them) don't include a format on that level, or a format at all. Hence DiffX.
(This is not helped by the fact that diffs often come with a .patch file extension… or that the "patch" tool processes diffs…)
Or anyway the nomenclature really sucks in this field. I guess I have no clue whether I have a minority view here.
When talking about the file, the two terms are often used interchangeably (and are usually a .diff or a .patch extension).
For fun, the GNU Patch manpage says:
"NAME: patch - apply a diff file to an original"
followed by:
"patch takes a patch file patchfile containing ..."
"patch tries to skip any leading garbage, apply the diff, and then ..."
I'm going to try to address some of the reasoning behind DiffX here, and answer a few things that have come up in the comments. I'll start by saying, the issues we're addressing are more issues encountered by tools that work with these files, not necessarily the end users of these tools. Most people never have to think hard about the structure of these files, but we do.
A little background. I co-founded Review Board, a code review product that's been around since 2006. We work with a wide variety of source code management solutions, and because of this, we deal extensively with .diff/.patch files. And nobody generates these in any consistent way.
If Git were the only SCM out there, there'd be no need for DiffX. But there are at least a dozen SCMs in use in production across companies today, and more being developed. And each does things differently.
Git, Mercurial, Subversion, ClearCase, Perforce, and others all have their own bespoke type of format, built largely (but not always consistently or fully compatibly) on Unified Diffs. These often augment Unified Diffs by injecting:
* Revision identifiers (which can come in all kinds of forms -- numbers, hashes, paths, reserved words -- and sometimes need to be paired with other information to resolve a file). Sometimes these are on the `---`/`+++` lines, sometimes not.
* Symlink information (some tools provide the old and new symlink paths, some just the new, some neither, some the file contents)
* File modes (similarly, out of those that convey file modes, some provide more details than others, and these can impact application or processing of a patch)
* Commit descriptions (and if not done right, a stray `---` or metadata keyword can break some parsers)
Or any number of other common or SCM-specific metadata.
Pretty much all of these represent data in their own ways. This information goes in the "garbage" area of Unified Diffs, which basically means tools like GNU patch ignore it, but tools aware of that specific variant can parse it out.
At this point in my life, I've written a couple of dozen bespoke diff parsers. Depending on the tool generating the diff, there are all sorts of parsing issues that can come up:
* Varying encodings for file paths and text strings (like commit messages), which can mean a patch on one system doesn't apply on another, depending on the tools generating it.
* BOMs that sometimes show up in strings (we hit this with Perforce years back).
* Messages or metadata can sometimes include characters that resemble Unified Diff content or other variant-specific syntax and can break patching/parsing.
* Differences in how information like symlinks/file modes are conveyed (see above).
* Binary files are almost never able to be represented beyond a "Binary files X and Y differ" line.
* Newlines are sometimes outright broken within the diff (particularly with mixed line endings) and can sometimes break patching.
Just to name a few.
Many SCMs don't even have a diff format to begin with, just generating a Unified Diff. These don't contain any revision information needed to locate a file. In these cases, or when important information is unavailable in some diff variant, we're stuck rolling our own.
Also, here's something you wouldn't normally expect to be a problem, but which comes up in practice more than you'd think: some very large diffs (we've seen ones hundreds of megabytes in size -- don't get me started) are time-consuming to parse. To know everything about the diff, you need to read and scan every line. To generate a list of filenames or stats on a diff, you need to effectively parse all of it.
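To make the cost concrete, here is a minimal Python sketch of listing the files in a plain Unified Diff. The function name `list_files` is made up for illustration, and it assumes the conventional `+++ ` file header lines. Note that it has to look at every line, and that it can be fooled by content that resembles a header.

```python
def list_files(diff_text: str) -> list:
    """Return the new-side file names found in a plain Unified Diff.

    Illustrative only: there is no way to skip hunks, so every line is
    inspected. An added line whose content starts with "++ " would also
    show up as "+++ ..." and be misread as a file header.
    """
    files = []
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            # Drop an optional tab-separated timestamp after the path.
            files.append(line[4:].split("\t", 1)[0])
    return files
```

A real parser would track hunk line counts from the `@@` headers to avoid the false positive, which is more state to maintain per diff variant.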
So we took all those pain points, talked to developers working on a few SCMs, got their pain points and thoughts, and drafted the initial DiffX spec. Went through several rounds with them, iterated until we got where we are today.
The spec had some important goals:
1. Not being vendor-specific.
Git patches were built for Git, and even Git-like patches from Mercurial or Subversion have quirks that can break parsing in a Git-specific patch parser. There's no grammar for how Git stores the metadata and the clients require knowledge up-front of the value types, which isn't a good fit for some of the SCMs out there.
We wanted to draft something that could be used more generically, able to be adopted by newer tools while also being able to represent the information provided from existing tools.
2. Support for arbitrary and injectable metadata.
Some of the formats we work with don't contain enough information to locate the file + revision within a repository. Some require additional information, like an explicit branch or workspace ID or a counterpart changeset number.
Even Git diffs don't always provide enough. They provide a Blob SHA, but not a commit SHA by default. This is a problem when talking to APIs on Git hosting services that require a commit SHA along with (or instead of) a blob SHA.
And some have useful data that can't be fetched after the fact.
So a common headache is that we need to inject additional information in the diffs we generate in order to allow the appropriate data to be looked up.
By using a form of metadata storage for the diff file, the commit, and the files within, we have the ability to inject that additional information without worrying about corrupting a state machine or regex or whatever method is used for some parser (or some older version of our own parsers).
We eventually chose JSON here. We initially had a grammar that looked more like Git's format, but found ourselves dealing with some of the same challenges that YAML had. We didn't want the "NO" problem and we didn't want every client to have to decide on what the value of a string in a piece of metadata should be. Some metadata (such as revision IDs in some SCMs) differentiate between a number and a string that may look like a number, and that information is important to know up-front.
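A small Python illustration of the typing point, using only the standard `json` module. The `revision` and `country` keys are hypothetical and not taken from the spec. The "NO" problem presumably refers to YAML 1.1 reading a bare `NO` as a boolean; in JSON the quotes settle the question before any client has to guess.

```python
import json

# Hypothetical metadata values, for illustration only.
as_number = json.loads('{"revision": 123}')
as_string = json.loads('{"revision": "123"}')
flag = json.loads('{"country": "NO"}')

# The serialized form itself says which type was meant.
print(type(as_number["revision"]).__name__)  # int
print(type(as_string["revision"]).__name__)  # str
print(flag["country"])  # NO
```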
The consensus was that there was more value in JSON than some other format, since it's well-understood, parsers are readily available, and there's no sign that it's disappearing or dropping out of maintenance any time soon.
The format allows for future metadata formats here if, say, json5 or YAML#++ becomes a well-adopted standard 10 years from now.
3. Parsing and mutability.
When working with these files, it's sometimes important to scan for information in the file to do some pre-processing. How many files are in the diff? Is this past some threshold that might trigger a rate limit if we fetch data for each from a repository? Are there binary files in this change?
When these files get very large (which can happen in enterprises when posting a change merging two long-running branches together -- no, people shouldn't ever post 100MB diffs, but they do), these operations can get expensive.
So we built in some parsing aids (content lengths for sections, section hierarchy identifiers) to allow for more efficient parsing. We can read a section header, know where we are and what section we're nested in, and jump to the next piece to read. This is far more efficient than parsing-to-scan and avoids a lot of headaches.
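As a rough sketch of what length-based skipping can look like, here is a Python scanner that assumes section headers shaped like the `#..meta: format=json, length=270` line quoted elsewhere in this thread. The regex and the `scan_sections` name are assumptions for illustration, not the spec's grammar or the reference implementation.

```python
import re

# Assumed header shape: "#", dots for nesting depth, a name, then options.
HEADER_RE = re.compile(rb"^#(\.*)([a-z]+):\s*(.*)$")


def scan_sections(data: bytes):
    """Yield (depth, name, options) per section, skipping bodies by length."""
    pos = 0
    while pos < len(data):
        end = data.find(b"\n", pos)
        if end == -1:
            end = len(data)
        line = data[pos:end]
        pos = end + 1
        match = HEADER_RE.match(line)
        if match is None:
            continue  # not a section header; ignored in this sketch
        dots, name, raw_opts = match.groups()
        options = {}
        for pair in raw_opts.decode("ascii").split(","):
            if "=" in pair:
                key, value = pair.strip().split("=", 1)
                options[key] = value
        yield len(dots), name.decode("ascii"), options
        # Jump over the section body without reading it line by line.
        pos += int(options.get("length", 0))
```

Counting files then becomes a matter of counting the yielded `file` sections, without touching the diff hunks at all.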
We also get mutability. Generating a diff and attaching metadata to a file in the middle of a diff becomes a lot faster and safer this way.
A consumer never needs to do this. A tool does.
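For the mutability side, a tool that edits a section body also has to keep the recorded length in sync. A minimal Python sketch, assuming the `length=` option counts the bytes of the body that follows the header line; `restamp` is a hypothetical helper, not part of any DiffX library.

```python
import re

LENGTH_RE = re.compile(rb"length=\d+")


def restamp(header: bytes, body: bytes) -> bytes:
    """Return header + body with the header's length= matching the body.

    Assumes the header already carries a length= option and that the
    length is the byte count of the body.
    """
    new_header = LENGTH_RE.sub(b"length=%d" % len(body), header, count=1)
    return new_header + body
```

For example, `restamp(b"#..meta: format=json, length=270\n", b'{"a": 1}\n')` rewrites the header to `length=9`.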
We figured we'd address the text encoding issues while we were at it, because oh boy can these cause problems. A whole topic of its own.
4. Multi-commit files.
Yep, there's Git format-patch. That works great for Git. If I'm on Perforce or SOS or ClearCase and I want to represent a series of commits, I don't have an equivalent format.
If one wants to be able to send a diff spanning a series of commits somewhere for processing or application, being able to do that with one file is valuable. One file means one thing to upload, no risk of a patch ordering issue or a missing patch in the series. The tool processing the diff file would have all the state it needs up-front.
5. Binary files.
Binary files are important. A lot of projects are more than source code in text format. Images, documents, 3D models.. these get left out of diff files today by default.
The exception is Git, which can represent changes to binary files as Git Literal and Git Delta formats. This is largely undocumented (outside of our spec) and not supported by really anything else.
We review binary files, so we wanted this. Talking to other SCM vendors, some found this a pain point as well but didn't have a solution in place. So we wanted this to be documented and addressed in the spec.
This is already very long, but I wanted to give a bit of insight into the kinds of problems and inconsistencies that tools (not necessarily end users) have to deal with, and how this is meant to address some of those problems.
Key point: Much of this is about solving issues with tools that work with the varying file formats of diffs. It's not really something end users should ever have to care about.
I think the original post left people a bit confused on why it exists but you explained it really clearly here.
There are a lot of alternative approaches to how one might generate a diff (see VCDIFF for binary files), and much that's worth thinking about for generating diff-like formats that are not line-based. But this is not meant to be those. It is meant to be able to incorporate those as time goes on (as it does with VCDIFF today).
If you work often enough with patches, you know it is not so rare you need to modify the patch itself resulting in different length. Please do not hard-code the length.
You already are user facing. Why interpret a user facing format behind the scenes? It makes no sense. The document speaks about specifying yet another diff format, but luckily, it does nothing of the sort but specifies a new patch set format. But those are by necessity VCS- and file format specific.
No I don’t. That plus-minus single column view is complete and total trash. Always has been, always will be.
I’ve used Araxis Merge for almost 20 years. Beyond Compare 3 is also a good choice. Not once in my entire career have I ever relied on a “unified diff” or a “patch” or any of that garbage.
Diffs and patches are not the foundation. You don’t have to use them at all.
Sure, but some unified diffs, e.g. the ones produced by git, are quite regular. It's also common practice to express diffs as RFC822 email messages (often because they come that way), with headers and descriptive text.
I can't see DiffX getting traction. It's too alien. Too divorced from present practice, no matter how theoretically robust. It's like XHTML2.
Solving the same problems, I'd just establish conventions for sticking the needed metadata in RFC822-style pseudo-headers above the diff. This approach would work with, not against, existing tooling.
Not everything needs to be JSON.
This isn't a tool for viewing changes to files or to ASTs. This is a way of being able to generate a single diff file for processing or patching that addresses the kinds of problems we've encountered in over 20 years of building diff parsing tooling and working with over a dozen SCMs with varying levels of completeness or brokenness of bespoke custom diff formats.
It's not an end user tool, but a useful format for tools like code review products to use.
Since it patches the code, looking at its tree structure, is the diff human readable, and can it be edited directly? This is a major contributor to why I opt for sed for patching.
FWIW: mise install is breaking due to the submodule. I had to resort to brew install
A single diff can’t represent a list of commits
There’s no standard way to represent binary patches
Diffs don’t know about text encodings (which is more of a problem than you might think)
Diffs don’t have any standard format for arbitrary metadata, so everyone implements it their own way.
Of these, only a notation for binary patches would be a reasonable generalization of diff files. Everything else is the internal data structure or protocol of some specific revision control system, only exchanged between its clients and servers and backups.
I guess since I’m too afraid to use non-ASCII in filenames much.
That said, i feel like this is something most tooling could just handle, and not really an issue.
Certainly it's not a problem DiffX is going to solve, since it appears to store only the charset and not filename normalization rules.
I still have some ls and hd output that I stored in my notes files, if anybody is interested.
$ ls
Español Español Français Français
$ ls | hexdump -C
00000000 45 73 70 61 6e cc 83 6f 6c 0a 45 73 70 61 c3 b1 |Espan..ol.Espa..|
00000010 6f 6c 0a 46 72 61 6e 63 cc a7 61 69 73 0a 46 72 |ol.Franc..ais.Fr|
00000020 61 6e c3 a7 61 69 73 0a |an..ais.|
00000028
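For anyone puzzled by the duplicate names: the byte dump shows the same visible name stored once with a combining mark (NFD) and once precomposed (NFC). A small Python sketch, using only the standard `unicodedata` module, shows the two encodings and one way a tool could compare names (`same_filename` is an illustrative helper, not something any SCM is known to use):

```python
import unicodedata

# Escapes are used so the source does not depend on how the editor stored the text.
NAMES = ["Espa\u00f1ol", "Fran\u00e7ais"]


def same_filename(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


for name in NAMES:
    nfc = unicodedata.normalize("NFC", name)  # precomposed character
    nfd = unicodedata.normalize("NFD", name)  # base letter + combining mark
    print(nfc.encode("utf-8").hex(" "), "|", nfd.encode("utf-8").hex(" "))
    print("equal as raw strings:", nfc == nfd, "| equal after NFC:", same_filename(nfc, nfd))
```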
I suggest installing a fresh Linux distribution with e.g. bg_BG.UTF-8 locale and playing with it, especially with XDG directories like "Плот", "Свалени" and "Документи", and apps that should use them by default. Everything should Just Work™.
Although I admit that when reporting bugs for apps that can't handle non-ASCII paths, the responses from the developers (unless they're themselves from non-English speaking countries, but sometimes even then) quite often seem to be very thinly veiled "I can't be bothered to figure out where I botch things, why can't you just speak English like all reasonable people".
These aren't problems end users should hopefully ever need to worry about, but they're problems that tools need to worry about and work around. Especially for SCMs that don't have a diff format of their own, have one that is missing data (in some, not all changes can be represented, e.g. deleted files), or don't include enough information for another tool to identify the file in a repository.
Standards are meant for interchange, like (as mentioned in other comments) producing a patch file by any means and having someone else apply it regardless of what they use for version control.
On a serious note, I don't see how that would impact them.
- encoding - even if your file is not UTF-8, why would that matter? You would still run the patch algorithm the same way. It doesn't really matter if the characters are valid UTF-8
- why would I want a single diff to represent multiple commits? Having multiple diffs seems much more natural.
- metadata... I guess, but the metadata also seems like it would mostly be useful only inside a single system.
1. There are two important areas where encoding can matter: The filename and the diff content.
Git pays attention to filename encoding, but most SCMs don't, so when a diff is generated, it's based on the local encoding. If there are any non-ASCII characters in that filename, a diff generated in one environment with one encoding set can end up not applying to another (or, in our case, not being able to be looked up from a repository). This isn't common but it can happen (we've seen this on Perforce and Subversion).
Then there's the content. Many SCMs will actually give you a representation of a text file and not the raw contents itself. That text file will be re-encoded for your local/preferred encoding, and newlines may be adjusted as well (`\r\n`, `\n`). The text file is then re-encoded back when pushing the change. This allows people in different environments to operate on the same file regardless of what encoding they're working with.
This doesn't necessarily make its way into the diff, though. So when you send a diff from a less-common encoding to a tool to process it, and that tries to apply it to the file checked out with its encoding, it can fail to patch.
The solution is to either know the encoding of the file you're processing, or try to guess it (some tools, like ours, let you specify a list of preferred encodings to try).
It's best if you can know it up-front.
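The "list of preferred encodings" fallback can be sketched roughly as follows. This is an illustration of the general approach, not Review Board's actual code; the function name and the default list are assumptions.

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "shift_jis", "latin-1")):
    """Try each encoding in order; return (text, encoding_that_worked)."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    raise ValueError("none of the candidate encodings could decode the data")


# b"caf\xe9" is not valid UTF-8, so the guess falls through to latin-1.
# Note that latin-1 accepts any byte sequence, so it only makes sense last.
text, used = decode_with_fallback(b"caf\xe9", ("utf-8", "latin-1"))
print(text, used)
```

Guessing can succeed with the wrong encoding and silently produce the wrong text, which is why knowing the encoding up-front is preferable.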
Bonus Fun Fact: On some SCMs (Perforce comes to mind), checking out a file on Windows and then diffing it on Linux via a shared mount can get you a diff with `\r\r\n` newlines. It's a bad time and breaks patching. And it used to come up a lot, until we worked around it.
Also, Perforce for a while would sometimes normalize encodings incorrectly and you'd end up with BOMs in the diff, breaking GNU patch.
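A workaround for both quirks might look something like the sketch below. It is only an assumption about the general shape of such a fix (the actual workaround is not described here), and it handles just a leading UTF-8 BOM, not BOMs embedded further into the diff.

```python
import codecs


def sanitize_diff(data: bytes) -> bytes:
    """Strip a leading UTF-8 BOM and collapse doubled carriage returns."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # "\r\r\n" shows up when CRLF content is re-translated a second time.
    return data.replace(b"\r\r\n", b"\r\n")


broken = codecs.BOM_UTF8 + b"+first line\r\r\n+second line\r\r\n"
print(sanitize_diff(broken))
```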
2. It does when you're working with them directly for applying and patching. If you're handing them off to a tool for processing, if there's any risk of one file in a sequence not being included, you can end up with breakages that maybe you don't see until later processing.
It's also just really nice having all the state and metadata up-front so we can process it in one go in a consistent way without having to sanity-check all the diffs against each other.
When working locally, it also depends on your tooling. `git format-patch` and `git am` are great, but are for Git. If I'm working with (let's just say) Subversion, I need to do my own thing or find another tool.
3. It's critical for the kind of information needed to locate files in a repository. Some systems need a commit-wide identifier. Some need per-file identifiers. Some need a combination of the two. Some need those plus additional data not otherwise represented in the path or revision (generally more enterprise SCMs targeting certain use cases).
It's also critical for representing information that isn't in the Unified Diff format (namely, anything but the filename). So, symlink information, file modes, SCM-specific properties on a file or directory, to name a few. This information needs to live somewhere if a SCM provides it, and it's up to every SCM to choose how and where to store that data (and then how it's encoded, etc.).
Yeah, don't do that.
> This allows people in different environments to operate on the same file regardless of what encoding they're working with.
No, it causes hard-to-understand bugs, because now what people see on their device and what is tracked in source control differ, defeating the entire purpose of having source control in the first place. This isn't theoretical at all, btw.
> The solution is to either know the encoding of the file you're processing
In general, there is no such encoding - source control tools need to be able to deal with files not valid in any single encoding.
Yeah, I don't see a use case for a patch encoding either - just treat the patch data as ASCII-delimited binary mystery goo. Patch files need to be able to deal with mixed-encoding text (e.g. to fix it), so you can't really just have one encoding anyway.
JSON solved this mostly by standardizing what most implementations were already doing, so that would be a great thing to do.
If git diff isn't documented, the solution is not to create a new format, but to go through the source code and document it.
If an editor was involved and you want better mergability you could include the original file and the sequence of CRDT ops.
The more changes, the more likely it is for a patch to fail, but in principle it seems like cherry-picking and applying a change is a valid use of a diff.
One of the interesting points of diff files is that all commands are on single lines. You can easily parse or manipulate them with simple shell tools, just stripping lines out.
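As an illustration of how far plain line filtering gets you, here is a small sketch that counts added and removed lines in a unified diff. The header check is deliberately naive (a removed line whose content starts with `--` would be miscounted), so treat it as a demonstration rather than a robust parser.

```python
SAMPLE_DIFF = """\
--- a/notes.txt
+++ b/notes.txt
@@ -1,3 +1,4 @@
 unchanged
-old line
+new line
+another new line
 also unchanged
"""


def count_changes(diff_text: str):
    """Return (added, removed) line counts for a unified diff."""
    added = removed = 0
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file header lines, not content changes
        if line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
    return added, removed


print(count_changes(SAMPLE_DIFF))
```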
So instead of `diff a b | patch c`, where the data through the pipe needs to be in some interchange format, you'd run `apply a b c` and the apply command can use whatever internal representation it likes.
Diffs also aren't great for human reading. A color-coded side-by-side view is better. For which you also want to start with the two files.
There's really no need to ever transmit a diff and deal with all the format vagaries when you can just send the two files.
I do think it's a bit annoying that a program gets updated, and you have to download the whole 130 gb again.
But especially given how quickly and easily you can compress two almost-identical files, I think your approach has a lot going for it. It may even be possible to get clever and send over just a hash of the original file, and a version of the new file which has been compressed with the original file as prefix (but without the actual compressed data for that).
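One way to approximate the "compressed with the original as prefix" idea with standard tooling is a zlib preset dictionary. This is a sketch under that assumption, not a claim about how any update system works; the helper names are made up. Deflate only looks back 32 KiB, so for large files only the tail of the original helps, and a real tool would need a different algorithm.

```python
import zlib


def compress_against(original: bytes, new: bytes) -> bytes:
    """Compress `new`, letting the compressor reference `original` for free."""
    comp = zlib.compressobj(zdict=original)
    return comp.compress(new) + comp.flush()


def decompress_against(original: bytes, blob: bytes) -> bytes:
    """Rebuild `new`; the receiver must hold the identical `original`."""
    decomp = zlib.decompressobj(zdict=original)
    return decomp.decompress(blob) + decomp.flush()


# Poorly compressible sample data, changed in one place.
original = b"".join(b"%d\n" % (i * i * 7919 % 1000003) for i in range(2000))
new = original.replace(b"\n", b"\nchanged here\n", 1)

blob = compress_against(original, new)
print(len(new), len(zlib.compress(new)), len(blob))
```

Sending a hash of the original alongside the blob would let the receiver confirm it is decompressing against the same bytes.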
Well, it depends on what you are doing, and in 2025 they are more relevant than ever.
Asking an LLM to output a diff for an edit can save you a staggering amount of tokens and cut the latency of its response by 5-10x. I've done it your way, a custom diff way, and then added a standard diff one, and even back then with GPT 3.5 there was a huge difference, let alone now with way larger models.
There are a lot of diffs in the dataset, so telling it to create a standard diff is usually no different than asking it to create a whole file in terms of accuracy (depending on the task), but it saves you all the output tokens and reduces the amount of compute/time required to infer all those tokens.
Updating code running in a sandbox on a third machine over the wire, where speed is relevant? You want a diff. I did it your way first for ease, but knowing how much data and compute I was wasting on that, it was a low-hanging optimisation to use a diff, and it worked wonders. Yeah, for most use cases it would be overkill, but for this use case milliseconds were important.
If you have file A and A2 and a diff AxA2, constructing file A2 is easy and saves you all the A-diff data.
Merging them is the only potential issue due to conflicts, and that is where a human or LLM has to come in and that is where just having a file A2 to overwrite the original one would be easy, but conflict occurs only in cases where you might not want that to happen.
TLDR; diff good.
Having a diff format allows decoupling the implementation of diff creation from the implementation of diff application, turning a potential n*m problem into an n+m problem.
Nice, nice.
The format looks very clunky and messy, to be honest: a mixture of self-invented headers and JSON payloads, a strange structure (without the comments here I would not have noticed the different number of dots in `.meta`), and a need for essentially two parsers.
The idea of an extended diff with a standard way to carry metadata is good.
This implementation looks bad, sorry.
The problem with diffs is that they are not easily portable - because their intent is so (accidentally) low-level. Imbue diffX with semantics, and they can become more readable - by humans and AI alike.
My other thought: when was the last time I had problems with a diff file? I can't recall (maybe decades ago). It's probably only a problem when working with multiple VCSes, in which case you could make a diff translator that understands each one intimately.
Jokes aside, good luck with this one :)
xkcd927 strikes again!
What an awful idea... Now if I have to edit a patch I need to count characters in it? Come on...
Is that a chicken-and-egg problem, or is it useful by itself?
charcircuit•1d ago
I don't understand why this would be a problem. Each commit can have its own diff.
sedatk•1d ago
kazinator•1d ago
The mails can be catenated together into one file; this is called the mbox format.
Get three commits as three separate files:
Now checkout to before those commits, detaching HEAD:
Now, catenate those files together to create one "patch.mbox" file:
"git am" takes the whole file and applies it:
Clean up: return to master, delete the file:
sedatk•1d ago
anitil•1d ago
chrismorgan•1d ago
mdaniel•1d ago
I am not the target audience for such a thing and thus didn't try to apply said patch but my experience has been that I'm not going to hold my breath
chipx86•1d ago
Having the whole state of a series of commits up-front can help with tools like ours that need that information ideally in one place and need to do analysis across multiple commits.
We also wanted to avoid issues where, due to a bug or network error or whatever, a missing patch in a sequence could result in a broken set of changes to apply. This is obvious when you're patching locally, but less so when you're sending to a tool to process later. Having it up-front reduces the chances of problems and simplifies a lot of edge cases.
procaryote•1d ago
chipx86•1d ago
I have so many stories I could tell at this point.
account42•1d ago
kazinator•1d ago
I have stories also. In one company almost twenty years ago, some Java-spewing troglodytes went on a branch and did stupid things, like simultaneously rename A.java to B.java, B.java to C.java and C.java to A.java! While of course changing all their content too.
When it came time to merge, this was completely beyond the power of Subversion to sort out.
They came to me for help, and I whipped out my Meta-CVS, with excellent support not only for renaming but for doing snapshot imports which figure out renaming by file similarity. We imported all the baselines into it, and it did the merge flawlessly across the rotated renames.
We took the resulting merged baseline out of Meta-CVS and cleanly committed it to the Subversion trunk.
Don't discount other SCMs being useful as tools in SCM problems, even if they are not the main source-of-truth SCM used by the org.
kazinator•1d ago
Who cares?
The solution to SCMs not doing this and that is to implement ones which do. We have done that.
You can unlock the cage, and that is enough; if the apes want to stay, let them stay.