DiffX – Next-Generation Extensible Diff Format

2025-06-04 2:38 · 374152 · diffx.org

If you’re a software developer, you’ve probably worked with diff files. Git diffs, Subversion diffs, CVS diffs... Some kind of diff. You probably haven’t given it a second thought, really. You make some changes, run a command, and a diff comes out. Maybe you hand it to someone, or apply it elsewhere, or put it up for review.

Diff files show the differences between two text files, in the form of inserted (+) and deleted (-) lines. Along with this, they contain some basic information used to identify the file (usually just the name/relative path within some part of the tree), maybe a timestamp or revision, and maybe some other information.

Most people and tools work with Unified Diffs. They look like this:

--- readme 2016-01-26 16:29:12.000000000 -0800
+++ readme 2016-01-31 11:54:32.000000000 -0800
@@ -1 +1,3 @@
 Hello there
+
+Oh hi!

Or this:

Index: readme
===================================================================
RCS file: /cvsroot/readme,v
retrieving version 1.1
retrieving version 1.2
diff -u -p -r1.1 -r1.2
--- readme 26 Jan 2016 16:29:12 -0000 1.1
+++ readme 31 Jan 2016 11:54:32 -0000 1.2
@@ -1 +1,3 @@
 Hello there
+
+Oh hi!

Or this:

diff --git a/readme b/readme
index d6613f5..5b50866 100644
--- a/readme
+++ b/readme
@@ -1 +1,3 @@
 Hello there
+
+Oh hi!

Or even this:

Index: readme
===================================================================
--- (revision 123)
+++ (working copy)
Property changes on: .
-------------------------------------------------------------------
Modified: myproperty
## -1 +1 ##
-old value
+new value

Or this!

==== //depot/proj/logo.png#1 ==A== /src/proj/logo.png ====
Binary files /tmp/logo.png and /src/proj/logo.png differ

Unified Diffs themselves are not a viable standard for modern development. They only standardize parts of what we consider to be a diff, namely the ---/+++ lines for file identification, @@ ... @@ lines for diff hunk offsets/sizes, and +/- for inserted/deleted lines. They don’t standardize encodings, revisions, metadata, or even how filenames or paths are represented!
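As a rough illustration of how little is actually standardized, the hunk header is one of the few pieces every tool can parse the same way. The sketch below is a minimal example, not taken from any particular tool; the name `parse_hunk_header` is hypothetical, and it assumes the common convention that an omitted count means 1.

```python
import re

# Matches "@@ -start[,count] +start[,count] @@"; an omitted count means 1.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return (old_start, old_count, new_start, new_count), or None."""
    m = HUNK_RE.match(line)
    if m is None:
        return None
    old_start, old_count, new_start, new_count = m.groups()
    return (
        int(old_start),
        int(old_count) if old_count is not None else 1,
        int(new_start),
        int(new_count) if new_count is not None else 1,
    )
```

Note that the Subversion property hunk shown earlier ("## -1 +1 ##") does not match this pattern, which is exactly the kind of per-tool variation a parser has to special-case.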

This makes it very hard for patch tools, code review tools, code analysis tools, etc. to reliably parse any given diff and gather useful information, other than the changed lines, particularly if they want to support multiple types of source control systems. And there’s a lot of good stuff in diff files that some tools, like code review tools or patchers, want.

You should see what GNU Patch has to deal with.

Unified Diffs have not kept up with where the world is going. For instance:

  • A single diff can’t represent a list of commits

  • There’s no standard way to represent binary patches

  • Diffs don’t know about text encodings (which is more of a problem than you might think)

  • Diffs don’t have any standard format for arbitrary metadata, so everyone implements it their own way.

We’re long past the point where diffs should be able to do all this. Tools should be able to parse diffs in a standard way, and should be able to modify them without worrying about breaking anything. It should be possible to load a diff, any diff, using a Python module or Java package and pull information out of it.

Unified Diffs aren’t going away, and they don’t need to. We just need to add some extensibility to them. And that’s completely doable, today.

Unified Diffs, by nature, are very forgiving, and they’re everywhere, in one form or another. As you’ve seen from the examples above, tools shove all kinds of data into them. Patchers basically skip anything they don’t recognize. All they really lack is structure and standards.
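The "skip anything unrecognized" behavior can be sketched in a few lines. This is a simplified illustration, not how GNU Patch is implemented; the function name `hunk_lines` is hypothetical. It uses the counts in each hunk header to know how many lines belong to the hunk, and ignores every line outside of a hunk.

```python
import re

HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def hunk_lines(diff_text):
    """Collect (marker, text) pairs from hunks, skipping all other lines."""
    result = []
    old_left = new_left = 0  # lines still expected in the current hunk
    for line in diff_text.splitlines():
        if old_left > 0 or new_left > 0:
            marker, text = line[:1], line[1:]
            if marker in (" ", ""):  # context (some tools strip the space)
                old_left -= 1
                new_left -= 1
            elif marker == "-":
                old_left -= 1
            elif marker == "+":
                new_left -= 1
            else:
                # e.g. "\ No newline at end of file"
                continue
            result.append((marker or " ", text))
            continue
        m = HUNK_RE.match(line)
        if m:
            old_left = int(m.group(1) or 1)
            new_left = int(m.group(2) or 1)
        # Anything else ("Index:", "diff --git", "#diffx:", ...) is ignored.
    return result
```

Under this model, extra header lines placed before a file's hunks do not change the result, which is the property an extensible format can build on.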

Git’s diffs are the closest things we have to a standard diff format (in that both Git and Mercurial support it, and Subversion pretends to, but poorly), and the closest things we have to a modern diff format (as they optionally support binary diffs and have a general concept of metadata, though it’s largely Git-specific).

They’re a good start, though still not formally defined. Still, we can build upon this, taking some of the best parts from Git diffs and from other standards, and using the forgiving nature of Unified Diffs to define a new, structured Unified Diff format.

We propose a new format called Extensible Diffs, or DiffX files for short. These are fully backwards-compatible with existing tools, while also being future-proof and remaining human-readable.

#diffx: encoding=utf-8, version=1.0
#.change:
#..preamble: indent=4, length=319, mimetype=text/markdown
    Convert legacy header building code to Python 3.
    
    Header building for messages used old Python 2.6-era list comprehensions
    with tuples rather than modern dictionary comprehensions in order to build
    a message list. This change modernizes that, and swaps out six for a
    3-friendly `.items()` call.
#..meta: format=json, length=270
{
    "author": "Christian Hammond <christian@example.com>",
    "committer": "Christian Hammond <christian@example.com>",
    "committer date": "2021-06-02T13:12:06-07:00",
    "date": "2021-06-01T19:26:31-07:00",
    "id": "a25e7b28af5e3184946068f432122c68c1a30b23"
}
#..file:
#...meta: format=json, length=176
{
    "path": "/src/message.py",
    "revision": {
        "new": "f814cf74766ba3e6d175254996072233ca18a690",
        "old": "9f6a412b3aee0a55808928b43f848202b4ee0f8d"
    }
}
#...diff: length=629
--- /src/message.py
+++ /src/message.py
@@ -164,10 +164,10 @@
             not isinstance(headers, MultiValueDict)):
             # Instantiating a MultiValueDict from a dict does not ensure that
             # values are lists, so we have to ensure that ourselves.
-            headers = MultiValueDict(dict(
-                (key, [value])
-                for key, value in six.iteritems(headers)
-            ))
+            headers = MultiValueDict({
+                key: [value]
+                for key, value in headers.items()
+            })

         if in_reply_to:
             headers['In-Reply-To'] = in_reply_to

DiffX files are built on top of Unified Diffs, providing structure and metadata that tools can use. Any DiffX file is a complete Unified Diff, and can even contain all the legacy data that Git, Subversion, CVS, etc. may want to store, while also structuring data in a way that any modern tool can easily read from or write to using standard parsing rules.
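To make the "standard parsing rules" idea concrete, here is a minimal sketch of reading the section headers from the example above. It is not the reference implementation: the names `parse_header` and `read_meta` are hypothetical, and it assumes (based on the example) that the dots give the nesting level, that options are comma-separated key=value pairs, and that `length` counts the bytes of content following the header line.

```python
import json
import re

# "#" + zero or more dots (nesting level) + section name + ":" + options.
HEADER_RE = re.compile(rb"^#(\.*)([a-z]+):\s*(.*)$")


def parse_header(line):
    """Split b'#..meta: format=json, length=270' into (level, name, options)."""
    m = HEADER_RE.match(line.rstrip(b"\r\n"))
    if m is None:
        return None
    dots, name, raw_options = m.groups()
    options = {}
    for pair in raw_options.decode("ascii").split(","):
        if pair.strip():
            key, _, value = pair.strip().partition("=")
            options[key] = value
    return len(dots), name.decode("ascii"), options


def read_meta(data, offset=0):
    """Parse one JSON meta section starting at `offset` in `data` (bytes)."""
    end_of_header = data.index(b"\n", offset) + 1
    level, name, options = parse_header(data[offset:end_of_header])
    if name != "meta" or options.get("format") != "json":
        raise ValueError("expected a JSON meta section")
    length = int(options["length"])  # assumed: byte count of the content
    content = data[end_of_header:end_of_header + length]
    return level, json.loads(content.decode("utf-8")), end_of_header + length
```

Because the header states the content length, a reader can jump straight to the next section without scanning the JSON for its closing brace; ordinary Unified Diff lines such as "--- /src/message.py" simply fail to match the header pattern.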

Let’s summarize. Here are some things DiffX offers:

  • Standardized rules for parsing diffs

  • Formalized storage and naming of metadata for the diff and for each commit and file within

  • Ability to extend the format without breaking existing parsers

  • Multiple commits can be represented in one diff file

  • Git-compatible diffs of binary content

  • Knowledge of text encodings for files and diff metadata

  • Compatibility with all existing parsers and patchers (for all standard diff features – new features will of course require support in tools, but can still be parsed)

  • Mutability, allowing a tool to easily open a diff, record new data, and write it back out
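The mutability point can be illustrated with a small sketch: a tool changes a metadata dictionary, re-serializes it, and recomputes the `length` option so the header stays consistent. This is an assumption-laden example rather than the official writer; `write_meta` is a hypothetical name, and the layout (4-space indent, sorted keys, trailing newline, length in bytes) is inferred from the example file above.

```python
import json


def write_meta(level, metadata):
    """Serialize `metadata` as a DiffX-style JSON meta section (bytes)."""
    # Assumed layout: 4-space indent, sorted keys, one trailing newline.
    body = json.dumps(metadata, indent=4, sort_keys=True).encode("utf-8") + b"\n"
    # The length is recomputed from the serialized bytes on every write.
    header = "#%smeta: format=json, length=%d\n" % ("." * level, len(body))
    return header.encode("utf-8") + body
```

Always deriving the length from the serialized bytes, rather than editing the JSON text by hand, is what keeps the header and content from drifting apart.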

DiffX is not designed to:

  • Force all tools to support a brand new file format

  • Break existing diffs in new tools or require tools to be rewritten

  • Create any sort of vendor lock-in

If you want to know more about what diffs are lacking, or how they differ from each other (get it?), then read The Problems with Diffs.

If you want to get your hands dirty, check out the DiffX File Format Specification.

See example DiffX files to see this in action.

Other questions? We have a FAQ for you.

  • Review Board from Beanbag. We built DiffX to solve long-standing problems we’ve encountered with diffs, and are baking support into all our products.


Comments

  • By laserbeam 2025-06-04 4:15, 1 reply

    I really don’t like the highly hierarchical format, that there’s a “..meta” and a “...meta” somewhere else. I can imagine we want to annotate the whole diff, each file and each chunk. That’s a total of 3 levels of depth. Let’s just give them distinct names and not go full yaml with a format for once?

    This helps with readability (if one of the “meta” blocks is missing, for example, I could still tell at a glance what it refers to without counting dots), and is less error prone (it makes little sense to me why the metadata associated with a whole diff should have the same fields as the metadata of a file).

    Furthermore, why do we have two formats? JSON and key=value pairs? Is there any reason not to just use one format? It sounds like the number of things we’d want to annotate is quite small. Having a single structure makes it much easier to write parsers or integrate with existing tooling (grep, sed or jq - but not both at once).

    Other notes:

    - please allow trailing commas in lists

    - diffs are inherently splittable. I can grab half of a diff and apply it. How does your format influence that? I guess it breaks because I would need to copy the preamble, then skip 20 lines, then copy the block I need?

    - revisions are a file property? Not a commit checksum? (I might just be dumb here)

    • By chipx86 2025-06-04 7:02, 4 replies

      In the early drafts, we played with a number of approaches for the structure. Things like "commit-meta", etc. In the end, we broke it down into `#<section_level><section_type>`, just to simplify the parsing requirements. Every meta block is a meta block, and knowing what section level you're supposed to be in and comparing to what section level you get becomes a matter of "count the dots".

      The header formats are meant to be very simple key/value pairs that are known by the parser, and not free-form bits of metadata. That's what the "meta" blocks are for. The parsing rules for the header are intentionally very simple.

      JSON was chosen after a lot of discussion between us and outside parties and after experimentation with other grammars. The header for a meta block can specify a format used to serialize the data, in case down the road something supplants JSON in a meaningful way. We didn't want to box ourselves in, but we also don't want to just let any format sit in there (as that brings us back to the format compatibility headaches we face today).

      For the other notes:

      1. Compatibility is a key factor here, so we'd want to go with base-level JSON. I'd prefer being able to have trailing commas in lists, but not enough to make life hard for someone implementing this without access to a JSON5 parser.

      2. If your goal is to simply feed to GNU patch (or similar), you can still split it. This extra data is in the Unified Diff "garbage" areas, so they'll be ignored anyway (so long as they don't conflict, and we take care to ensure that in our recommendations on encoding).

      If your goal is to split into two DiffX files, it does become more complicated in that you'd need to re-add the leading headers.

      That said, not all diff formats used in the wild can be split and still retain all metadata. Mercurial diffs, for example, have a header that must be present at the top to indicate parent commit information. You can remove that and still feed to GNU patch, but Mercurial (or tools supporting the format) will no longer have the information on the parent commit.

      3. Revisions depend heavily on the SCM. Some SCMs use a commit identifier. Some use per-file identifiers. Some use a combination of the two. Some use those plus additional information that either gets injected into the diff or needs to be known out-of-band. There's a wide variety of requirements here across the SCM landscape.

      • By laserbeam 2025-06-04 10:42

        > The header formats are meant to be very simple key/value pairs that are known by the parser, and not free-form bits of metadata. That's what the "meta" blocks are for.

        One more thing you should prepare for whenever you have "free-form bits of metadata". They somehow turn into: "some user was storing 100MB blobs in there, and that broke our other thing".

      • By laserbeam 2025-06-04 8:13, 1 reply

        > 1. Compatibility is a key factor here, so we'd want to go with base-level JSON. I'd prefer being able to have trailing commas in lists, but not enough to make life hard for someone implementing this without access to a JSON5 parser.

        This is what I was referring to. This is not json:

        > #..meta: format=json, length=270

        > The header formats are meant to be very simple key/value pairs that are known by the parser, and not free-form bits of metadata. That's what the "meta" blocks are for. The parsing rules for the header are intentionally very simple.

        Exactly my point. That level of flexibility for a .patch format to support another language embedded in it is overwhelming. Keep in mind that you are proposing a textual format, not a binary format. So people will use 3rd party text parsing tools to play with it. And having 2 distinct languages in there makes that annoying and a pain.

        • By hdjrudni 2025-06-06 5:36

          How do they reasonably work around that though? If they want the ability to move away from JSON, you have to know that it is JSON before trying to parse it. And then you need to know how much data to read. So I can see why they put those 2 tidbits of info above the data block.

          Maybe they could have said too bad, JSON for life, we'll never change it. OK. But then you still need the length or a delimiter for the "end of json".

      • By WhyNotHugo 2025-06-04 15:58

        What was your reasoning for discarding the existing header format used by git?

      • By quotemstr 2025-06-04 7:16, 1 reply

        > Compatibility is a key factor here, so we'd want to go with base-level JSON. I'd prefer being able to have trailing commas in lists, but not enough to make life hard for someone implementing this without access to a JSON5 parser.

        Everyone has access to a JSON5 parser. Everyone has to suffer for the sake of a few people who don't want to pay the trifling tax of pip installing something --- when they're using an external library for a novel file format _anyway_?

        • By genocidicbunny 2025-06-04 8:03, 2 replies

          > Everyone has access to a JSON5 parser.

          That's just a lack of imagination. When you're making a product for teams that span everything from a brand new startup using the latest tooling to teams that are working on software that runs on embedded systems from the 90's, you need to consider things like this.

          • By roblabla 2025-06-04 8:53, 1 reply

            There are JSON5 parsers written in C89 out there. And your embedded systems from the 90s probably don't have a JSON parser built in at all either... If you're going to build your own JSON parser, adding JSON5 support on top is really trivial.

            • By genocidicbunny 2025-06-04 9:26, 2 replies

              That doesn't mean it's not going to be difficult to use that parser. Not everyone has the luxury of being able to use third-party code, or having the time allotted to write a JSON5 parser. The JSON parser some places are using may have been written two decades ago and works well enough that there's little motivation to implement JSON5 support. Sometimes it's just company policy or internal politics that prevent the usage.

              It's also just not that big a deal overall for the intended use of the DiffX format. It's mainly machine-generated and machine-consumed. There's human readability concerns for sure, but the format looks to be designed mainly for tools to create and consume, so missing a few features that JSON5 brings is not that big of a deal.

              • By DannyBee 2025-06-04 17:47, 1 reply

                "That doesn't mean it's not going to be difficult to use that parser. Not everyone has the luxury of being able to use third-party code, or having the time allotted to write a JSON5 parser."

                Why are these people the target market?

                I understand it may be important to you, but that isn't the same as "matters to target market/audience".

                On top of that, the same constraints you mention here would stop you parsing current git patch formats, and lots of other things anyway. So you were never going to be using modern tools that might care here.

                This is all also really meta. Who exactly is writing software with >1% market share, needs to parse the patch format, and can't access a JSON parser?

                Instead of this theoretical discussion, let's have a concrete one.

                • By genocidicbunny 2025-06-04 19:36

                  In this specific instance, those people are part of the target market because the project chooses to make them part of the target market. It's worked well enough for Review Board.

              • By quotemstr 2025-06-04 15:31, 1 reply

                So the whole world should suffer through vanilla JSON because someone, somewhere, has an overbearing and paranoid software approval process? That's the attitude that delayed universal Unicode adoption by a decade.

                • By genocidicbunny 2025-06-04 19:31

                  That's a bit dramatic. This isn't something as universal as Unicode. You really only need to care about this if you're writing tools that generate or consume the DiffX format, which is not something most people will be doing. The whole world isn't suffering their decision to use JSON instead of JSON5.

          • By DannyBee 2025-06-04 17:44

            I don't think this is true, and honestly, I think it would be a mistake to consider it - they can't serve everyone, down that path is madness. FWIW - I even have a JSON parser in my RTOS-that-must-run-in-less-than-512k.

            I also think that target of "embedded systems from the 90's" makes no sense because the tooling for the embedded system, which is what would conceivably want to handle patch format, ran on the host, which easily had access to a JSON parser.

            But let's assume it does matter - let's be super concrete - assume they want to serve 95-99% of the users of patch format (i doubt it's even that high).

            Which exact pieces of software with even >1% market share that need to process patch format don't have access to a JSON parser?

  • By HelloNurse 2025-06-04 7:13, 2 replies

    A staggering amount of unnecessary and counterproductive scope creep in just 4 items:

        A single diff can’t represent a list of commits
    
        There’s no standard way to represent binary patches
    
        Diffs don’t know about text encodings (which is more of a problem than you might think)
    
        Diffs don’t have any standard format for arbitrary metadata, so everyone implements it their own way.
    
    
    Of these, only a notation for binary patches would be a reasonable generalization of diff files. Everything else is the internal data structure or protocol of some specific revision control system, only exchanged between its clients and servers and backups.

    • By chipx86 2025-06-04 7:23, 1 reply

      We build a code review product that interfaces with over a dozen SCMs. In about 20 years of writing diff parsers, we've encountered all kinds of problems and limitations in SCM-generated diff files (which we have to process) that we wouldn't ever have expected to even consider thinking about before. This all comes from the pain points and lessons learned in that work, and has been a huge help in solving these for us.

      These aren't problems end users should hopefully ever need to worry about, but they're problems that tools need to worry about and work around. Especially for SCMs that don't have a diff format of their own, have one that is missing data (in some, not all changes can be represented, e.g. deleted files), or don't include enough information for another tool to identify the file in a repository.

      • By HelloNurse 2025-06-04 7:55

        Better file formats cannot, by themselves, improve an inferior SCM tool that, for instance, processes files with the wrong text encoding or forgets deleted and renamed files: they would only have helped you for the purpose of developing your code review tool.

        Standards are meant for interchange, like (as mentioned in other comments) producing a patch file by any means and having someone else apply it regardless of what they use for version control.

    • By tankenmate 2025-06-04 7:22, 1 reply

      Not so, obviously it is less common these days, but I still use patch(1) and friends enough to run into problems from time to time. This is especially true when you have devs on different platforms (don't even get me started on filename mangling / case-folding issues).

      • By Borg3 2025-06-04 7:38, 2 replies

        Oh, then this is a management issue, not tooling. You need to sit down and analyze where your stuff will be developed. Some very basic rules to start with: file names need to be all lower case (they are case-insensitive), use 7bit ASCII encoding for source code files. And voilà :)

        • By NavinF 2025-06-04 9:03

          Poe's law at work. Replies are taking you literally, but I'm almost certain that you're joking. Very few large projects exclusively have lowercase filenames

        • By bawolff 2025-06-04 7:50, 2 replies

          What exactly is the lowest common denominator platform we are trying to target here where we need 7bit ASCII? MS-DOS?

          • By theamk 2025-06-05 4:36

            Any system which uses encodings, including Windows and Linux in non-utf8 locale.

          • By keybored 2025-06-04 7:55, 3 replies

            Could just be Linux. Filenames are just bytes so two equivalent Unicode filenames that have been normalized differently could be confusing. I guess?

            I guess since I’m too afraid to use non-ASCII in filenames much.

            • By bawolff 2025-06-04 8:02

              I guess that is fair. If i remember right mac uses NFD where literally everyone else in the world uses NFC (linux might not normalize but basically it usually ends up being NFC).

              That said, i feel like this is something most tooling could just handle, and not really an issue.

              Certainly it's not a problem DiffX is going to solve, since it appears to only store charset and not filename normalization rules.

            • By dotancohen 2025-06-04 8:54, 1 reply

              I had this condition a few years ago. A folder shared with Dropbox was then renormalized either by Dropbox or by another system, then when it was synced back to the original machine I had two folders with identical names, normalized differently.

              I still have some ls and hd output that I stored in my notes files, if anybody is interested.

              • By dotancohen 2025-06-04 13:59, 1 reply

                Here, found it:

                  $ ls
                  Español  Español  Français  Français
                  $ ls | hexdump -C
                  00000000  45 73 70 61 6e cc 83 6f  6c 0a 45 73 70 61 c3 b1  |Espan..ol.Espa..|
                  00000010  6f 6c 0a 46 72 61 6e 63  cc a7 61 69 73 0a 46 72  |ol.Franc..ais.Fr|
                  00000020  61 6e c3 a7 61 69 73 0a                           |an..ais.|
                  00000028

                • By bawolff 2025-06-04 19:17, 1 reply

                  The first one (6e cc 83) is NFD which is used by mac, the second one (c3 b1) is NFC which is used by everyone else.

                  • By dotancohen 2025-06-04 19:20

                    Thanks. I did have a company Mac in 2017, and it was connected to that account.

            • By Joker_vD 2025-06-04 17:03, 1 reply

              > I’m too afraid to use non-ASCII in filenames much.

              I suggest installing a fresh Linux distribution with e.g. bg_BG.UTF-8 locale and playing with it, especially with XDG directories like "Плот", "Свалени" and "Документи", and apps that should use them by default. Everything should Just Work™.

              Although I admit that when reporting bugs for apps that can't handle non-ASCII paths, the responses from the developers (unless they're themselves from non-English speaking countries, but sometimes even then) quite often seem to be very thinly veiled "I can't be bothered to figure out where I botch things, why can't you just speak English like all reasonable people".

              • By bawolff 2025-06-04 19:06, 1 reply

                To be fair, as far as unicode goes, cryllic is kind of the easy case (no combining characters, no rtl, etc). In some ways its even easier than (non-english) latin scripts because in latin you can get easily confused with windows-1252 where things sort of work where if you are accidentally using a legacy 8bit encoding with cryllic you are more likely to figure that out quickly.

                • By HelloNurse 2025-06-05 6:51

                  It's "Cyrillic", named after St. Cyrill.

  • By blacklion 2025-06-04 11:32

    So, a self-delimiting format (JSON) is embedded in a format with lengths? I change one space in the JSON, the JSON is still valid, the whole DiffX file is invalid.

    Nice, nice.

    The format looks very clunky and messy, to be honest: a mixture of self-invented headers and JSON payloads, a strange structure (without the comments here I would not have noticed the different number of dots in `.meta`), and a need for essentially two parsers.

    Idea to have extended diff with standard way to put metadata is good.

    This implementation looks bad, sorry.
