Bzip3: A spiritual successor to BZip2

2025-02-01 16:46 · github.com

A better and stronger spiritual successor to BZip2. - kspalaiologos/bzip3

Build

A better, faster and stronger spiritual successor to BZip2. Features higher compression ratios and better performance thanks to an order-0 context-mixing entropy coder, a fast Burrows-Wheeler transform built on suffix arrays, and an RLE with Lempel-Ziv+Prediction (LZP) pass based on LZ77-style string matching and PPM-style context modeling.

Like its ancestor, BZip3 excels at compressing text or code.

# If using a git clone (not needed for source packages), first...
$ ./bootstrap.sh # All...
$ ./configure
$ make
$ sudo make install

Alternatively, you might be able to install bzip3 using your system's package manager:

Packaging status

On macOS, you can use Homebrew to easily install:
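
Assuming the Homebrew formula is simply named bzip3:

$ brew install bzip3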

To benchmark it, I first downloaded every version of Perl 5 ever released and decompressed the archives.

% wget -r -l1 -nH --cut-dirs=2 --no-parent -A.tar.gz --no-directories https://www.cpan.org/src/5.0/
% for g in *.gz; do gunzip $g; done
% ls -la | wc -l
262

Then, I put all the resulting .tar files in a single .tar file and tried to compress it using various compressors:

xz -T16 -9 -k all.tar  10829.91s user 26.91s system 1488% cpu 14658M memory 12:09.24 total
bzip2 -9 -k all.tar  981.78s user 9.77s system 95% cpu 8M memory 17:16.64 total
bzip3 -e -b 256 -j 12 all.tar  2713.81s user 16.28s system 634% cpu 18301M memory 7:10.10 total
bzip3 -e -b 511 -j 4 all.tar  17.65s user 12.19s system 170% cpu 12178M memory 7:08.65 total
zstd -T12 -16 all.tar  4162.94s user 16.40s system 1056% cpu 687M memory 6:35.62 total

The results follow:

  • LZMA (xz) - 2'056'645'240 bytes
  • bzip2 - 3'441'163'911 bytes
  • bzip3 -b 256 - 1'001'957'587 bytes
  • bzip3 -b 511 - 546'456'978 bytes
  • Zstandard - 3'076'143'660 bytes

Next, the wall-clock decompression times (WD Blue HDD):

  • LZMA (xz) - 4min 40s
  • bzip2 - 9min 22s
  • bzip3 (parallel) - 4min 6s
  • Zstandard - 3min 51s

Then, I used lrzip to perform long-range deduplication on the original .tar file:

% time lrzip -n -o all_none.tar.lrz all.tar
546.17s user 160.87s system 102% cpu 10970M memory 11:28.00 total

% time lrzip --lzma -o all_lzma.tar.lrz all.tar
702.16s user 161.87s system 122% cpu 10792M memory 11:44.83 total

% time lrzip -b -o all_bzip2.tar.lrz all.tar
563.93s user 147.38s system 112% cpu 10970M memory 10:34.10 total

Finally, I compressed the resulting all_none.tar.lrz file using bzip3:

% time bzip3 -e -b 256 -j 2 all_none.tar.lrz
32.05s user 0.76s system 146% cpu 2751M memory 22.411 total

The results follow:

  • lrzip + bzip3 - 60'672'608 bytes.
  • lrzip + lzma - 64'774'202 bytes.
  • lrzip + bzip2 - 75'685'065 bytes.

For further benchmarks against Turbo-Range-Coder and BSC, check powturbo's benchmark of bzip3, bzip2, bsc and others.

I TAKE NO RESPONSIBILITY FOR ANY LOSS OF DATA ARISING FROM THE USE OF THIS PROGRAM/LIBRARY, HOWSOEVER CAUSED.

Every compression of a file implies an assumption that the compressed file can be decompressed to reproduce the original. Great efforts in design, coding and testing have been made to ensure that this program works correctly.

However, the complexity of the algorithms, and, in particular, the presence of various special cases in the code which occur with very low but non-zero probability make it impossible to rule out the possibility of bugs remaining in the program.

DO NOT COMPRESS ANY DATA WITH THIS PROGRAM UNLESS YOU ARE PREPARED TO ACCEPT THE POSSIBILITY, HOWEVER SMALL, THAT THE DATA WILL NOT BE RECOVERABLE.

That is not to say this program is inherently unreliable. Indeed, I very much hope the opposite is true. Bzip3/libbz3 has been carefully constructed and extensively tested.

Bzip3's performance depends heavily on the compiler. On x64 Linux, clang 13 builds can usually reach 17 MiB/s compression and 23 MiB/s decompression per thread. Windows and 32-bit builds may be considerably slower.

Bzip3 has been tested on the following architectures:

  • x86
  • x86_64
  • armv6
  • armv7
  • aarch64
  • ppc64le
  • mips
  • mips64
  • sparc
  • s390x

visualisation of the benchmarks

Check etc/BENCHMARKS.md for more results.

A breakdown of components and their licenses follows:

  • (runtime) The codebase as a whole: Copyright 2022-2023, Kamila Szewczyk (kspalaiologos@gmail.com); LGPL (LICENSE)
  • (runtime) The Burrows-Wheeler transform (libsais) and LZP code: 2021-2022, Ilya Grebnov (ilya.grebnov@gmail.com); Apache 2.0 (3rdparty/libsais-LICENSE)
  • (compile-time) build-aux: Copyright 2011, Daniel Richard G (skunk@iSKUNK.ORG), 2019, Marc Stevens (marc.stevens@cwi.nl), 2008, Steven G. Johnson (stevenj@alum.mit.edu); GPL-3+ with AutoConf exception
  • (compile-time) build-aux/ax_check_compile_flag.m4: Copyright 2008, Guido U. Draheim (guidod@gmx.de), 2011, Maarten Bosmans (mkbosmans@gmail.com); FSFAP
  • (compile-time) build-aux/git-version-gen: Copyright 2007-2012, Free Software Foundation, Inc; GPLv3
  • (runtime) bz3grep: Copyright 2003, Thomas Klausner; BSD-2-clause

bzip3 as a whole is licensed under LGPLv3 only. It is not dual-licensed under LGPLv3 and Apache 2.0.

  • Ilya Grebnov, for his libsais library used for BWT construction in BZip3, and for his LZP encoder, which I used as a reference implementation to improve my own.
  • Caleb Maclennan for configuring autotools as a packaging-friendly build system for BZip3.
  • Ilya Muravyov for his public domain BWT post-coder, a derivative of which is used in this project.




Comments

  • By mmastrac 2025-02-01 17:27 (11 replies)

    I've studied the Burrows-Wheeler Transform, I understand the transformation, I've re-implemented it countless times for kicks, I see how it improves compressibility, but for the life of me the intuition of _why_ it works has never really clicked.

    It's a fantastic bit of algorithmic magic that will always impress me to see it.

    • By unnah 2025-02-01 18:58 (3 replies)

      The Burrows-Wheeler transform has been described as a unique algorithmic idea, in that there are no non-trivial variations or related algorithms, unlike more conventional compression algorithms, which can be tweaked and improved in so many ways. There is no general compression theory in which BWT could be described as a special case.

      It looks to me that the above still holds: Bzip2 and Bzip3 are simply combining more conventional compression algorithms with the BWT, which itself is still the same old transform. Bzip2 does Huffman coding after BWT, and Bzip3 does arithmetic coding.

      • By derf_ 2025-02-02 4:47 (1 reply)

        > There is no general compression theory in which BWT could be described as a special case.

        I would not really say this is true. BWT is spiritually similar to the various Prediction by Partial Matching (PPM) algorithms, except that instead of needing to decide how much context (i.e., preceding symbols) to use to model the next symbol, and carefully learning the probabilities for each unique context, it naturally sorts symbols with the same context (of _any_ length) so they appear next to each other, and relies on adaptation to update your learned statistics as you move from one context to the next, without ever explicitly tracking context boundaries [0].

        [0] N.J. Larsson, "The Context Trees of Block Sorting Compression," 1998. https://ieeexplore.ieee.org/abstract/document/672147

        • By unnah 2025-02-02 9:22

          Thanks for the reference, looks interesting.

      • By thesz 2025-02-02 5:54 (1 reply)

        It definitely is related to prediction by partial match (PPM).

        BWT sorts rotated data, and what is achieved is that the same suffixes group together:

          ...
          "Bzip2 and Bzip3 are simply combining more"
          "Bzip3 are simply combining moreBzip2 and "
        
        The character preceding the suffix goes to the end and is then output. This is much like PPM going backward. There is a PPM* algorithm (unbounded context length) whose authors considered reconstructing contexts from the data, utilizing something like an LZSS search. The same idea is in BWT - the context is reconstructed from the data.

        BWT also breaks near dependencies in data, this is why move-to-front with Huffman or arithmetic encoding works well there.

        • By unnah 2025-02-02 9:21

          Wow, thanks. As always, the best way to learn more on the internet is to be confidently and totally wrong!

      • By altairprime 2025-02-01 19:33 (3 replies)

        Can BWT be combined with zstd, which uses asymmetric numeral systems?

        • By CJefferson 2025-02-01 20:49 (1 reply)

          Yes, it would actually be interesting to just have a BWT pass which does no compression, so we can then try lots of post-compression options.

        • By gopalv 2025-02-02 1:28

          > BWT be combined with zstd

          BWT can be combined with anything that does RLE and get a benefit.

          What it does is give RLE more to work with.

        • By klodolph 2025-02-02 3:39 (1 reply)

          I think you would just need ANS, not the rest of zstd.

          • By altairprime 2025-02-02 18:13 (1 reply)

            I don’t know enough to evaluate that, but it sounds plausible. Apparently modifying or integrating zstd into custom solutions was a common path in submissions to, at the very least, the GDCC 2021 T2 contest. This is all well outside of my learned competence so I’m just here to learn and ask naive questions.

            • By klodolph 2025-02-02 23:41

              Zstd is basically LZ77 + FSE. There are some other pieces to it, but they’re minor. FSE is a form of entropy coding that you can swap out with other entropy coding techniques, like Huffman or arithmetic coding. For most objectives, FSE is the clear winner of those three.

              As people mentioned elsewhere in the thread, Burrows-Wheeler isn’t really composable with other systems. Nobody has figured out a reasonable way to combine LZ77 and Burrows-Wheeler. That’s why zstd + bzip2 does not work. But you do need an entropy coding system to go with Burrows-Wheeler… and FSE fits the bill.

    • By thrtythreeforty 2025-02-01 19:34

      Thank you for the reference. I learned something new today. That algorithm is wild. If you had shown me the transform and asked if it had an inverse, I would have said of course it doesn't, it's too weird.

    • By loxias 2025-02-01 20:07 (2 replies)

      I always understood it as working because of the predictability of a symbol/letter/token given the previous one.

      Sorting all the shifts of a string puts all the characters in order, then looking at the last column shows you all the _preceding_ characters. If there's any predictability there (which there often is), it's now easier to compress. It's sorta like an entropy coder in that way.

      I've never thought of it as being that deep, and have understood it since I was a kid -- building an intuition for "why" the FFT works is much harder -- but that being said, I clicked quickly to reply thinking "that's easy! I can explain this!" and then struggled for a while trying to get the picture in my mind into text. :)
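
      A minimal Python sketch of that view; the string below is an arbitrary toy example:

          # Sort all rotations of a toy string and read off the last column.
          s = "the theory, the thesis, the theme$"   # '$' is a unique end marker
          rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
          last_column = "".join(r[-1] for r in rotations)
          print(last_column)  # characters preceding similar contexts end up clustered together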

      • By crazygringo 2025-02-01 20:13 (4 replies)

        But shouldn't Huffman coding already detect that same predictability and compress it the same?

        What I don't get isn't the benefits of BWT on its own. It's why BWT should add any additional benefit if you're already doing Huffman.

        • By loxias 2025-02-01 20:29 (1 reply)

          > What I don't get isn't the benefits of BWT on its own. It's why BWT should add any additional benefit if you're already doing Huffman.

          Ahhhh. Now we're on the same page. :) Seeing how it helps when combined is somewhat subtle/non-obvious. I believe it relates to BWT and Huffman both being approximations of something more optimal. The two transforms could also have different window sizes -- one rarely does BWT on a whole 1GB file -- which introduce inefficiencies. Huffman coding is also only optimal in the very large alphabet and very long data limits. As your data length and alphabet size decrease, it gets less optimal.

          Put differently, "I think that's a wonderfully phrased question, this _is_ my specialization/subfield, and I'm gonna need to chew on it for a while."

          • By crazygringo 2025-02-01 20:44 (1 reply)

            Thanks. Yeah, I can see how that would make more sense if BWT was redundant under a theoretically perfect Huffman compression, but it happens to pick up some things that real-world Huffman encoders don't, with their practical limits on CPU and memory.

            • By Sesse__ 2025-02-02 13:20

              Nearly any real-world Huffman encoder is trivially optimal, i.e., given a set of symbols with probabilities, it is easy to construct the optimal set of output bits for each symbol. (There are some exceptions in that if you have a huge amount of symbols, or some that are extremely rare, you can bound the upper length and get a non-optimal code. This matters very little in practice, i.e., less than 0.1% in a typical setting. And of course, if you allow fractional bits or probabilities that change based on context, then you can do better, but then it's no longer Huffman.)

              BWT is orthogonal to Huffman; like LZ, it exploits that some symbol _sequences_ are more common than others, while Huffman is about the optimality of coding each symbol on its own.
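
              A small sketch of that last point, using a toy Huffman coder written here just for illustration: the coded size depends only on the symbol histogram, so destroying all sequence structure by shuffling the text changes nothing.

                  import heapq, random
                  from collections import Counter

                  def huffman_bits(text):
                      """Bits used by an optimal per-symbol (Huffman) code, ignoring the code table."""
                      freq = Counter(text)
                      if len(freq) == 1:
                          return len(text)  # degenerate single-symbol case
                      heap = [(f, i, {sym: 0}) for i, (sym, f) in enumerate(freq.items())]
                      heapq.heapify(heap)
                      tiebreak = len(heap)
                      while len(heap) > 1:
                          f1, _, c1 = heapq.heappop(heap)
                          f2, _, c2 = heapq.heappop(heap)
                          merged = {s: d + 1 for s, d in {**c1, **c2}.items()}  # symbols in both subtrees sink one level
                          heapq.heappush(heap, (f1 + f2, tiebreak, merged))
                          tiebreak += 1
                      depths = heap[0][2]
                      return sum(freq[s] * depths[s] for s in freq)

                  text = "the theme of the thesis is the theory of the theatre " * 20
                  shuffled = "".join(random.sample(text, len(text)))
                  print(huffman_bits(text), huffman_bits(shuffled))  # identical totals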

        • By jltsiren 2025-02-01 21:16

          One way to look at data compression is splitting it into a model and an encoder. The model describes the data, while the encoder encodes (or equivalently predicts) the data according to the model. The compressed output consists of the serialized model and the encoded data. BWT is a model, while Huffman is an encoder.

          Huffman takes a probability distribution and a symbol and encodes the symbol according to the distribution. If you encode all symbols independently according to the same distribution, you probably don't get very good compression.

          You get a bit better results with a model that has a separate distribution for each context. If the previous symbols were X, Y, and Z, you encode the next symbol according to the distribution for context XYZ. This approach doesn't really scale, because the size of the model grows rapidly (exponentially in a naive implementation) with context length. You get better compression with an adaptive model. You start with a uniform distribution and update the available contexts and distributions after encoding each symbol. On the one hand, you don't have to store the model explicitly. But on the other hand, updating the model is very slow.

          Burrows-Wheeler transform is an implicit model. It sorts the symbols according to the context that follows them, and it does that simultaneously for each context length. Because you don't have to store the model explicitly, you can effectively use longer context lengths than with a fixed explicit model. And because you don't have to update an explicit model after encoding each symbol, using the BWT is much faster than using an adaptive model.
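
          A tiny sketch of the explicit fixed-context model described above (order-2, counts only), mainly to show that the table is keyed by contexts and grows with context length; the text is an arbitrary toy example:

              from collections import Counter, defaultdict

              text = "the theme of the thesis is the theory of the theatre " * 5
              order = 2
              model = defaultdict(Counter)  # context (previous `order` chars) -> next-symbol counts
              for i in range(order, len(text)):
                  model[text[i - order:i]][text[i]] += 1

              print(len(model), "distinct order-2 contexts")
              print(model["th"].most_common(3))  # 'e' is strongly predicted after "th"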

        • By palaiologos 2025-02-01 21:13 (1 reply)

          Hi, tool author here.

          Huffman coding is a static minimum-redundancy code. What this means is that it finds an optimal assignment of bit sequences to letters in the input alphabet (commonly US-ASCII or extensions). This however means that Huffman coding can not exploit redundancies that stem from the concrete sequence of characters. For example, you could easily predict that an `e` comes after `Th`, but Huffman coding can not know that.

          Hence after applying the Burrows-Wheeler transform you need to have some sort of a higher-order transform (i.e. a transform that considers more than just individual bytes) which somehow benefits from the changed distribution of the result of the algorithm. But we will get to that in a second.

          The joke here is that the Burrows-Wheeler transform is closely related to suffix trees and suffix arrays, which are often used in bioinformatics and HPC for full-text search. If you wanted to find a pattern of length `p` in a text of length `n`, if you already have a suffix tree of the original text, the search is linear in the length /of the pattern/ - i.e. O(p). The suffix tree stores all suffixes of a string in a compressed manner (i.e. it has a linear space overhead, approximately O(20n) as given by Gusfield, D. (1997). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press), so you can search for a word in it by simply traversing from the root node to an internal or leaf node by following a sequence of bytes that comprise the word.

          As such, a suffix tree (and equivalently suffix array and the BWT, which is trivially computed from a suffix array) form something which can be thought of as a static PPM model. Notably real world implementations of PPM use suffix trees as a part of their main storage data structure (e.g. PPMd). What this all means is that given a suffix tree, we can very cheaply give the probability distribution for the next byte that follows a given fixed-order sequence of bytes. This is nice, because then e.g. an order-2 predictor would be able to tell that `Th` is followed by `e` once enough data has been gathered.

          As you can probably guess, the more preceding bytes you know, the better will be your estimate for what is the most likely next byte. But the larger your context, the more expensive the searches and computations become due to pointer chasing in the suffix tree.

          So how do we remedy this? We notice that the Burrows-Wheeler transform essentially clusters similar contexts together, meaning that a low order predictor (= faster, simpler) on BWT compresses as well as a high order predictor (= slow, complicated) on the original data, at the cost of an extra transformation. This is viable, because the Burrows-Wheeler transform can be quickly computed and there have been recent advancements in running it on the GPU. So what this means is that bzip3 uses BWT + a low order predictor with an arithmetic coder to encode the bytes, meaning that it can make use of high order statistics for compression and performs comparably at a faster speed.
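
          A rough way to see that clustering effect numerically, using move-to-front as a simple stand-in for the adaptive modelling a real post-coder would do (bzip3 itself does not work exactly like this; the text and the naive BWT below are purely illustrative):

              from collections import Counter
              from math import log2

              def bwt(s):
                  """Naive BWT via sorted rotations (fine for a demo; bzip3 uses suffix arrays)."""
                  return [r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s)))]

              def mtf(seq):
                  """Move-to-front: recently seen symbols get small indices."""
                  alphabet, out = sorted(set(seq)), []
                  for c in seq:
                      i = alphabet.index(c)
                      out.append(i)
                      alphabet.insert(0, alphabet.pop(i))
                  return out

              def order0_bits(seq):
                  """Shannon bound for a memoryless (order-0) coder, in bits per symbol."""
                  counts, n = Counter(seq), len(seq)
                  return -sum(c / n * log2(c / n) for c in counts.values())

              text = ("to be, or not to be, that is the question " * 40) + "$"
              print("order-0 on MTF(text):      %.2f bits/symbol" % order0_bits(mtf(text)))
              print("order-0 on MTF(BWT(text)): %.2f bits/symbol" % order0_bits(mtf(bwt(text))))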

          • By eru 2025-02-02 10:03

            You are spot on.

            Btw, the Burrows-Wheeler transform is often explained as taking the last column.

            I find it easier to understand why it works if you think of BWT as sorting all rotations of your string from their _second_ character onwards, and then writing down all the first characters.
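
            A quick check of that equivalent view, assuming a string terminated by a unique end marker so the two orderings coincide:

                s = "banana$"  # unique end marker
                rots = [s[i:] + s[:i] for i in range(len(s))]
                standard = "".join(r[-1] for r in sorted(rots))                      # last column, usual definition
                alt_view = "".join(r[0] for r in sorted(rots, key=lambda r: r[1:]))  # first chars, sorted from the 2nd char on
                print(standard, alt_view, standard == alt_view)  # annb$aa annb$aa True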

        • By eru 2025-02-02 10:00

          > But shouldn't Huffman coding already detect that same predictability and compress it the same?

          Huffman coding only works on individual letters. It doesn't know anything about relationships between adjacent letters.

      • By vanviegen 2025-02-01 22:58 (1 reply)

        Understanding why increasing predictability helps with compression is not the hard part though. What's hard to grasp is why the transform is reversible.

        • By gfody 2025-02-01 23:57 (1 reply)

          A word can be factored into the set and frequency of its letters plus the specific permutation. Compressible patterns in either channel seem likely when the underlying words are language-like.

          • By SideQuark 2025-02-02 0:13

            And in general that set of descriptors provides no compressibility. BWT is much richer than that set, in that the way it works performs well for data we care about.

            Describing a multiset takes as much information as the multiset contained to begin with, on average. BWT somehow works better on things of use.

    • By mnw21cam 2025-02-01 18:25 (1 reply)

      I remember the lecturer commenting on what sort of sick and twisted mind could come up with such a ridiculous convoluted notion when I was taught it at university.

      • By ajb 2025-02-02 7:21 (1 reply)

        Wheeler was also one of the inventors of the "closed subroutine" AKA function, which had to be implemented via a hack as machines of the time did not include ISA support for "return":

        https://en.m.wikipedia.org/wiki/Wheeler_Jump

        • By mnw21cam 2025-02-03 8:43

          Yes. He also happened to be in residence just down the hall from the lecture theatre at the time.

    • By BlackLotus89 2025-02-01 23:18

      I really liked the computerphile video about it https://www.youtube.com/watch?v=GYbCttCF25A

    • By maginx 2025-02-02 17:41

      I feel exactly the same, and have also implemented it backwards and forwards. I've thought about it in my sleep, trying to recall how it REALLY works. Happens every few years ;-) I always thought it was probably obvious to everyone else what the "magic" is.

    • By abetusk 2025-02-02 1:34

      Isn't it basically run length encoding but on patterns? Sorting lexicographical on all rotations means blocks of patterns get grouped together, which means you can do compression more easily, right?

    • By duskwuff 2025-02-02 0:00

      > the intuition of _why_ it works has never really clicked.

      In terms of why it aids compression, or why it's reversible?

      As far as the first goes: it transforms n-grams into repeated characters.

    • By forty 2025-02-01 19:47

      Haha, I also studied it as part of a student project and I remember my computer science teacher saying "it's dark magic" ^^

    • By bawolff 2025-02-01 20:12 (1 reply)

      Yeah. BWT and zero knowledge proofs are my goto examples of CS things that seem like pure magic.

      • By jltsiren 2025-02-01 23:30 (2 replies)

        Public-key cryptography is magic. Zero-knowledge proofs are a consequence that's difficult to find on your own but easy to understand once you've seen it.

        I remember seeing someone (probably Avi Wigderson) demonstrating a zero-knowledge proof for graph coloring on an overhead projector when I was starting my second year studying CS. He had a slide with a graph, multiple slides with different permutations of the same valid coloring of the vertices, and a piece of paper with "doors" over the vertices to cover everything. The audience could ask him to open the doors over the vertices connected by any edge, and he would show that the coloring is valid, as far as those two vertices are concerned. And then he would replace the coloring with another permutation for the next iteration. The idea felt clever but kind of obvious in retrospect.
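
        A toy simulation of that protocol, with hash commitments standing in for the paper "doors" (purely illustrative, not a hardened implementation; the graph and colouring are made up):

            import hashlib, random, secrets

            # A small graph with a valid 3-colouring (the prover's secret).
            edges = [(0, 1), (1, 2), (2, 0), (1, 3), (3, 4), (4, 2)]
            colouring = {0: 0, 1: 1, 2: 2, 3: 0, 4: 1}

            def commit(value):
                salt = secrets.token_hex(16)
                return salt, hashlib.sha256((salt + str(value)).encode()).hexdigest()

            def one_round():
                # Prover: permute the colours and commit to every vertex ("close the doors").
                perm = random.sample(range(3), 3)
                permuted = {v: perm[c] for v, c in colouring.items()}
                commitments = {v: commit(c) for v, c in permuted.items()}
                # Verifier: pick one edge and ask for those two doors to be opened.
                u, w = random.choice(edges)
                (salt_u, dig_u), (salt_w, dig_w) = commitments[u], commitments[w]
                ok_u = hashlib.sha256((salt_u + str(permuted[u])).encode()).hexdigest() == dig_u
                ok_w = hashlib.sha256((salt_w + str(permuted[w])).encode()).hexdigest() == dig_w
                return ok_u and ok_w and permuted[u] != permuted[w]

            # A cheating prover is caught with probability at least 1/len(edges) per round,
            # so repeating the round drives the cheating probability toward zero.
            print(all(one_round() for _ in range(100)))  # True for a valid colouring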

        • By bawolff 2025-02-02 1:14 (1 reply)

          Magical things are usually built up from the trivial examples.

          Graph colouring is usually used as a teaching example for ZKP because of how easy it is to understand. It's still amazing how you can go from that example to "Here is a file that anyone can verify (non-interactively) which shows I have a passport and that passport is not from a country sanctioned by the USA, but otherwise does not reveal anything about me" or "I will not reveal my IP or any other identifying/unique info, but I will prove I have not been previously blocked from this website, including during the other times I anonymously accessed this website".

          • By jltsiren 2025-02-02 2:17 (1 reply)

            Things that look magical often stop being magical when you have the right perspective and the right abstractions. The step from a toy example to proving any verifiable statement is just NP-completeness. Which is simple enough that undergraduates are often expected to understand it.

            Interactive zero-knowledge proofs are also technically non-interactive. They are defined in terms of the verifier evaluating a transcript of the protocol. If the verifier accepts some causal assumptions about the provenance of the transcript, they will accept the proof. If they disagree with the assumptions, the proof is indistinguishable from random noise they could generate themself. An interactive commitment – challenge – response protocol is one possible causal assumption. A source of randomness could replace the challenges, making the protocol non-interactive. Or there could be a pre-committed secret, making a single-round protocol effectively non-interactive.

            These are things a sufficiently interested CS undergraduate can prove and understand. Public-key cryptography, on the other hand, remains magical. There are many things people assume to be true. Which need to be true for public-key cryptography to function. Empirically these things seem to be true, but nobody has been able to prove them. And I don't think anyone really understands them.

            • By maginx 2025-02-02 18:00

              Agreed - I've worked with PKI for many years, and know why the various systems work... in terms of "why you can decrypt again", not in terms of why they are secure (no other way to decrypt), which no one really knows. But if we assume for a moment that the systems are secure, it is truly fascinating when thinking about it in abstract terms: who would have thought it is possible to give someone an exact recipe to follow that scrambles a message to be sent - we can consider e.g. an encryption routine where the public key is inlined, so it is really just a standalone scrambling procedure. Even though it is completely public, it can only be used to scramble and not unscramble. The sender who literally went through the steps to scramble the message cannot undo what they just did (though the sender could have saved the original before scrambling!). And it is not because data is thrown away that it can't be unscrambled - all the information is there, fully recoverable, but only by the person who made the scrambling recipe, and there's no practical way to deduce the unscrambling recipe from the scrambling recipe.

        • By toast0 2025-02-02 18:41

          > Public-key cryptography is magic.

          Public key cryptography is kind of magic, but the basic piece that makes everything work isn't too hard. It's been a while since I took discrete math, so this is a bit rusty, but it should get you in the right direction. It's modular multiplication in a field. If you take a small field and make a times table, you can see the magic. All of the factors that are relatively prime to the modulus will have a pattern where F times X is only equal to F times Y if X and Y are equal. There will also be a multiplicative inverse so that F times X times I equals X for all X; because of commutation, I times X times F also equals X. When you've determined an F and an I, you give your correspondent F and keep I --- then if you multiply your message by F, they multiply it by I and get back the message. If they want to send you a message, they multiply it by I and send it, then you multiply it by F.

          There's some other stuff, but that's the magic part, IMHO. You just have to use a bigger field and some other stuff I forgot. ;)
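
          A toy numeric sketch of the multiplicative-inverse mechanics described above (Python 3.8+ for the three-argument pow); note that this on its own is not a public-key scheme, since anyone who knows F and the modulus can compute I:

              m = 2**61 - 1      # a prime modulus, so every nonzero F has an inverse
              F = 123456789      # "scrambling" factor, coprime to m
              I = pow(F, -1, m)  # multiplicative inverse: (F * I) % m == 1
              msg = 42424242
              scrambled = (msg * F) % m
              recovered = (scrambled * I) % m
              print(recovered == msg)  # True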

  • By linsomniac 2025-02-01 17:25 (10 replies)

    Highlight (benchmark of Perl source code):

        The results follow:
        xz -T16 -9 -k - 2'056'645'240 bytes (c=12m09s, d=4m40s)
        bzip2 -9 -k   - 3'441'163'911 bytes (c=17m16s, d=9m22s)
        bzip3 -b 256  - 1'001'957'587 bytes (c=7m10s,  d=4m6s? Unclear on source page)
        bzip3 -b 511  -   546'456'978 bytes (c=7m08s,  d=4m6s? Unclear)
        zstd -T12 -16 - 3'076'143'660 bytes (c=6m32s,  d=3m51s)
    
    edit: Adding times and compression levels

    • By zimpenfish 2025-02-01 18:48 (2 replies)

      Did a test on a 1.3G text file (output of `find -f -printf ...`); Macbook Pro M3 Max 64GB. All timings are "real" seconds from bash's builtin `time`.

          files.txt         1439563776
          bzip2 -9 -k       1026805779 71.3% c=67 d=53
          zstd --long -19   1002759868 69.7% c=357 d=9
          xz -T16 -9 -k      993376236 69.0% c=93 d=9
          zstd -T12 -16      989246823 68.7% c=14 d=9
          bzip3 -b 256       975153650 67.7% c=174 d=187
          bzip3 -b 256 -j12  975153650 67.7% c=46 d=189
          bzip3 -b 511       974113769 67.6% c=172 d=187
          bzip3 -b 511 -j12  974113769 67.6% c=77 d=186
      
      I'll stick with zstd for now (unless I need to compress the Perl source, I guess.)

      (edited to add 12 thread runs of bzip3 and remove superfluous filenames)

      • By zimpenfish 2025-02-01 21:31

        Since I only have 12 perf cores on this Mac, I tried the xz test again with 12 threads.

            xz -T12 -9 -k      993376236 69.0% c=83 d=9
        
        ~10% faster for compression with the same size output.

      • By yencabulator 2025-02-02 20:02

        That d=9 sure wins the day there, for me.

    • By wongarsu 2025-02-01 19:43

      Additional benchmarks on the same dataset:

          uncompressed               - 19'291'709'440
          bzip2 -9                   -  3'491'493'993 (sanity check)
          zstd -16 --long            -    593'915'849
          zstd -16 --long=31         -    122'909'756 (requires equivalent argument in decompressor due to needing ~4GB RAM)
          zstd -19 --long            -    505'728'419
          zstd -19 --long=31         -    106'601'594 (requires equivalent argument in decompressor)
          zstd --ultra -22           -    240'330'522
          zstd --ultra -22 --long=31 -     64'899'008 (requires equivalent argument in decompressor)
          rar a -m5 -md4g -s -mt8    -     64'837'044
          
      
      As you may notice, my sanity check actually has a slightly different size; I'm not sure why. The benchmark is a bit underspecified because new Perl versions were released in the interim. I used all releases up to perl-5.37.1 to get to the correct number of files. Just treat all numbers as having about 2% uncertainty to account for this difference.

      I can't provide compression/decompression times, but the --long or --long=31 arguments should not have a major impact on speed; they mostly affect memory use. --long=31 requires setting the same in the decompressor, making that option mostly useful for internal use, not for archives meant for public consumption.

      As you can see, the benchmark chosen by the author mostly comes down to finding similar data that's far away. I wonder if bzip3 can do this better than other algorithms (especially in less memory) or simply chose default parameters that use more memory.

      Edit: added more benchmarks

    • By linsomniac 2025-02-01 18:02 (3 replies)

      My standard test is compressing a "dd" disc image of a Linux install (I use these for work), with unused blocks being zeroed. Results:

          Uncompressed:      7,516,192,768
          zstd:              1,100,323,366
          bzip3 -b 511 -j 4: 1,115,125,019

      • By palaiologos 2025-02-01 20:59 (3 replies)

        Hi, tool author here!

        Thank you for your benchmark!

        As you may be aware, different compression tools fill different niches for different data types. In particular, less specialised statistical methods (bzip2, bzip3, PPMd) generally perform poorly on vaguely defined binary data due to the unnatural distribution of the underlying data, which at least in bzip3's case does not lend itself well to suffix sorting.

        Conversely, Lempel-Ziv methods usually perform suboptimally on vaguely defined "textual data", because the later stages of compression that involve entropy coding cannot make good use of the information encoded by match offsets while maintaining fast decompression performance - it's a long story that I could definitely go into detail about if you'd like, but I want to keep this reply short.

        All things considered, data compression is more of an art than a science, trying to fit an acceptable spot on the time to compression ratio curve. I created bzip3 as an improvement to the original algorithm, hoping that we can replace some uses of it with a more modern and worthwhile technology as of 2022. I have included benchmarks against LZMA, zstandard, etc. mostly as a formality; in reality, if you were to choose a compression method, it would depend very much on what exactly you're trying to compress, but my personal stance is that bzip3 would likely be strictly better than bzip2 in all of those cases.

        Bzip3 usually operates on bigger block sizes, up to 16 times bigger than bzip2. Additionally, bzip3 supports parallel compression/decompression out of the box. For fairness, the benchmarks have been performed in single-thread mode, but they aren't quite as fair towards bzip3 itself, as it uses a way bigger block size. What bzip3 aims to be is a replacement for bzip2 on modern hardware. What used to not be viable decades ago (arithmetic coding, context mixing, SAIS algorithms for BWT construction) has become viable nowadays, as CPU frequencies don't tend to change much, while caches and RAM keep getting bigger and faster.

        • By linsomniac 2025-02-01 22:10

          Thanks for the reply. I just figured I'd try it and see, and the bzip3 results are extremely good. I figured it was worth trying because a fair bit of the data in that image is non-binary (man pages, config files, shell/python code), but probably the bulk of it is binary (kernel images, executables).

        • By andix 2025-02-01 22:03 (2 replies)

          Shouldn't a modern compression tool, targeting a high compression rate, try to switch its compression method on the fly depending on the input data?

          I have no idea about compression, just a naive thought.

        • By nmz 2025-02-02 23:51

          If the focus is on text, then the best example is probably the sqlite amalgamation file, which is a 9 MB C file.

      • By linsomniac 2025-02-01 22:05

        A couple other data points:

            zstd --long --ultra -2:                                1,062,475,298
            zstd --long=windowLog --zstd=windowLog=31 --ultra -22: 1,041,203,362
        
        So for my use case the additional settings don't seem to make sense.

    • By SeptiumMMX 2025-02-01 19:57 (2 replies)

      Given that it's BWT, the difference should be the most prominent on codebases with huge amounts of mostly equivalent files. Most compression algorithms won't help if you get an exact duplicate of some block when it's past the compression window (and will be less efficient if near the end of the window).

      But here's a practical trick: sort files by extension and then by name before putting them into an archive, and then use any conventional compression. It will very likely put the similar-looking files together, and save you space. Done that in practice, works like a charm.
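
      A sketch of that trick using Python's tarfile module; the src/ directory and the output name are hypothetical, and any archiver that preserves the given file order would do:

          import tarfile
          from pathlib import Path

          root = Path("src")  # hypothetical input tree
          files = sorted((p for p in root.rglob("*") if p.is_file()),
                         key=lambda p: (p.suffix, p.name))  # group by extension, then by name

          with tarfile.open("src-sorted.tar", "w") as tar:
              for p in files:
                  tar.add(str(p), arcname=str(p.relative_to(root)))
          # Then compress src-sorted.tar with any conventional compressor (zstd/xz/bzip3/...).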

    • By sevg 2025-02-01 17:30 (2 replies)

      To make your comment more useful you’ll want to include compression and decompression time.

      Using the results from the readme, it seems like bzip3 performs competitively with zstd on both counts.

      • By idoubtit 2025-02-01 18:30

        I've experimented a bit with bzip3, and I think the results in the readme are not representative. I think it's a hand-picked case, with an uncommon input and unfair choices of parameters. And it's made with a HDD, which skews the results even more.

        For instance, with an 800 MB SQL file, for the same compression time and optimal parameters (within my capacity), bzip3 produced a smaller file (5.7 % compression ratio) than zstd (6.1 % with `--long -15`). But the decompression was about 20× slower (with all cores or just one).

        I'm not claiming my stupid benchmark is better or even right. It's just that my results were very different from bzip3's readme. So I'm suspicious.

      • By dralley 2025-02-01 17:33 (1 reply)

        Also the compression levels..

    • By jl6 2025-02-01 17:46 (1 reply)

      A 4x improvement over lzma is an extraordinary claim. I see the author has also given a result after applying lrzip (which removes long-range redundancies in large files), and the difference isn’t so great (but bzip3 still wins). Does the amazing result without lrzip mean bzip3 is somehow managing to exploit some of that long-range redundancy natively?

      I’d be astonished if such a 4x result generalized to tarballs that aren’t mostly duplicated files.

      • By wongarsu 2025-02-01 18:06

        Currently running my own benchmarks, my preliminary results are that zstd becomes competitive again once you add the --long option (so `zstd --long -16 all.tar` instead of `zstd -16 all.tar`). Which is an option that not everyone might be aware of, but whose usefulness should be intuitive for this benchmark of >200 very similar files.

    • By eqvinox 2025-02-01 23:11

      I'd argue that's actually the lowlight of the README since that is a very poor choice of benchmark. Combining a multitude of versions of the same software massively favors an algorithm good at dealing with this kind of repetitiveness in a way that will not be seen in typical applications.

      The "Corpus benchmarks" further down in the README are IMHO much more practically relevant. The compression ratio of bzip3 is not significantly better, but the runtime seems quite a bit lower than lzma at least.

    • By miohtama 2025-02-01 17:28 (1 reply)

      In the Linux source benchmark the results are interestingly more equal, with LZMA still holding up well.

      What makes Perl source benchmark special? Deduplication?

      • By mmastrac 2025-02-01 17:30 (1 reply)

        An old friend used to say that Perl is line noise that was given sentience.

        • By ape4 2025-02-01 17:43 (1 reply)

          This is the source - which is probably C.

          • By bhaney 2025-02-01 18:13

            It's 246 C files and 3163 perl files

    • By loeg 2025-02-01 18:07 (1 reply)

      Why -T12 for zstd and -T16 for xz? How many threads is bzip3 using?

      • By zimpenfish 2025-02-01 19:07

        From the source, it looks like bzip3 defaults to 1 thread if not explicitly set by arguments.

    • By JoshTriplett 2025-02-01 17:46

      ...using zstd level 16, when zstd goes up to 22. And without turning on zstd's long-range mode.

  • By pxeger1 2025-02-01 18:00 (1 reply)

    The author is super cool. They are one of very few people to write any substantial programs in Malbolge, a programming language designed to be cryptographically hard to use (or something like that)

HackerNews