An ode to bzip

March 12, 2026 · purplesyringa.moe
The story goes like this. ComputerCraft is a mod that adds programming to Minecraft. You write Lua code that gets executed by a bespoke interpreter with access to world APIs, and now you’re writing code instead of having fun. Computers have limited disk space, and my /nix folder is growing out of control, so I need to compress code.

The laziest option would be to use LibDeflate, but its decoder is larger than both the gains from compression and my personal boundary for copying code. So the question becomes: what’s the shortest, simplest, most ratio-efficient compression algorithm?

I initially thought this was a complex question full of tradeoffs, but it turns out it’s very clear-cut. My answer is bzip, even though this algorithm has been critiqued multiple times and has fallen into obscurity since xz and zstd became popular.

I’m compressing a 327 KB file that contains Lua code with occasional English text sprinkled in comments and documentation. This is important: bzip excels at text-like data rather than binary data. However, my results should be reproducible on other codebases, as the percentages seem to be mostly constant within that category.

Let’s compare multiple well-known encoders on this data:

  • uncompressed: 327005
  • (gzip) zopfli --i100: 75882
  • zstd -22 --long --ultra: 69018
  • xz -9: 67940
  • brotli -Z: 67859 (recompiled without a dictionary)
  • lzip -9: 67651
  • bzip2 -9: 63727
  • bzip3: 61067

The bzip family is a clear winner by a large margin. It even beats lzip, whose docs say “‘lzip -9’ compresses most files more than bzip2” (I guess code is not “most files”). How does it achieve this? Well, it turns out that bzip is not like the others.

You see, all other popular compression algorithms are actually the same thing at the core. They’re all based on LZ77, a compression scheme that boils down to replacing repetitive text with short links to earlier occurrences.

The main difference is in how literal strings and backreferences are encoded as bit streams, and this is highly non-trivial. Since links can have wildly different offsets, lengths, and frequencies from location to location, a good algorithm needs to predict and succinctly encode these parameters.

But bzip does not use LZ77. bzip uses BWT, which reorders characters in the text to group them by context – so instead of predicting tokens based on similar earlier occurrences, you just need to look at the last few symbols. And, surprisingly, with the BWT order, you don’t even need to store where each symbol came from!

For example, if the word hello is repeated in text multiple times, with LZ77 you’ll need to find and insert new references at each occurrence. But with BWT, all continuations of hell are grouped together, so you’ll likely just have a sequence of many os in a row, and similarly with other characters, which simple run-length encoding can deal with.
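
To make this concrete, here is a naive BWT sketch – in Python for brevity, since the eventual decoder would be Lua. A real implementation builds a suffix array rather than materializing every rotation, but the output is identical:

```python
def bwt(s: str) -> str:
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    s += "\0"  # unique sentinel, smaller than any real character; makes BWT invertible
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)
```

For example, `bwt("banana")` returns `"annb\0aa"`: both `n`s end up adjacent because each precedes an `a`, and run-length encoding thrives on exactly that kind of clustering.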

BWT comes with some downsides. For example, if you concatenate two texts in different English dialects, e.g. using color vs colour, BWT will mix the continuations of colo in an unpredictable order and you’ll have to encode a weird sequence of rs and us, whereas LZ77 would prioritize recent history. You can remedy this by separating input by formats, but for consistent data like code, it works just fine as is.

bzip2 and bzip3 are both based on BWT and differ mostly in how the BWT output is compressed. bzip2 uses a variation on RLE, while bzip3 tries to be more intelligent. I’ll focus on bzip2 for performance reasons, but most conclusions apply to bzip3, too.

There is another interesting thing about BWT. You might have noticed that I’m invoking bzip3 without passing any parameters like -9. That’s because bzip3 doesn’t take them. In fact, even invoking bzip2 with -9 doesn’t do much.

LZ77-based methods support different compression levels because searching for earlier occurrences is time-consuming, and sometimes it’s preferable to use a literal string instead of a difficult-to-encode reference, so some brute-force search is involved. BWT, on the other hand, is entirely deterministic and free of heuristics.

Furthermore, there is no degree of freedom in determining how to efficiently encode the lengths and offsets of backreferences, since there are none. There are run lengths, but that’s about it – it’s a single number, and it’s smaller than typical offsets.

All of that is to say: if you know what the bzip2 pipeline looks like, you can quickly achieve similar compression ratios without fine-tuning and worrying about edge cases. My unoptimized ad-hoc bzip2-like encoder compresses the same input to about 67 KB – better than lzip and with clear avenues for improvement.

That covers the compression format, but what about the size of the decoder? Measuring ELFs is useless when targeting Lua, and Lua libraries like LibDeflate don’t optimize code size for self-extracting archives, so at risk of alienating readers with fancy words and girl math, I’ll have to eyeball this for everything but bzip2.

A self-extracting executable doesn’t have to decode every archive – just one. We can skip sanity checks, headers, inline metadata into code, and tune the format for easier decoding. As such, I will only look at the core decompression loops.

gzip, zstd, xz, brotli, and lzip all start by doing LZ77. Evaluating “copy” tokens is a simple loop that won’t take much code. Where they differ is in how those tokens are encoded into bits:

  • gzip does some light pre-processing and then applies Huffman coding, which assigns unambiguous bit sequences to tokens and then concatenates them, optimizing for total length based on the token frequency distribution. Huffman codes can be parsed in ~250 bytes, the bit trie might take ~700 bytes, and the glue should fit in ~500 bytes. Let’s say 1.5 KB in total.

  • xz encodes tokens bit-by-bit instead of treating them as atoms, which allows the coder to adjust probabilities dynamically, yielding good ratios without encoding any tables at the cost of performance. Bit-by-bit parsing will take more space than usual, but avoiding tables is a huge win, so let’s put it at 1 KB.

  • lzip is very similar to xz, only slightly changing token encodings, so let’s put it at 1 KB as well.

  • zstd complicates the pre-processing step and uses Finite State Entropy instead of Huffman coding, which effectively allows tokens to be encoded with fractional bit lengths. FSE is simple, but requires large tables, so let’s say ~2000 bytes for storing and parsing them. Adding glue, we should get about 3 KB.

  • brotli keeps Huffman coding, but switches between multiple static Huffman tables on the fly depending on context. I couldn’t find the exact count, but I get 7 tables on my input. That’s a lot of data that we can’t just inline – we’ll need to encode it and parse it. Let’s say ~500 bytes for the parser and ~100 bytes per table. Together with the rest of the code, we should get something like 2.2 KB.
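
All of the Huffman-flavoured decoders above share the same core loop: accumulate bits until they form a complete code, emit the symbol, repeat. A toy sketch, assuming the prefix-free code table has already been parsed out of the stream (real decoders replace the dict lookup with a trie or table walk):

```python
def prefix_decode(bits: str, code_to_symbol: dict[str, str]) -> list[str]:
    """Decode a string of '0'/'1' characters using a prefix-free code table."""
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in code_to_symbol:  # prefix-freeness makes the match unambiguous
            out.append(code_to_symbol[cur])
            cur = ""
    return out
```

With the code table `{"0": "a", "10": "b", "11": "c"}`, the bit string `"0101100"` decodes to `["a", "b", "c", "a", "a"]`.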

For bzip decoders, BWT can be handled in ~250 bytes. As for the unique parts,

  • bzip2 compresses the BWT output with MTF + RLE + Huffman. With the default 6 Huffman tables, let’s assign ~1.5 KB to all Huffman-related code and data and ~400 bytes for MTF, RLE, and glue.

  • bzip3 uses XZ-like bit-by-bit coding with context mixing instead. Let’s say 1 KB for the former and ~500 bytes for the latter.
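
The MTF step in bzip2 is tiny: each symbol is replaced by its index in a self-organizing list, so the long runs that BWT produces become runs of zeros, which the RLE stage then mops up. A sketch (the fixed alphabet parameter is a simplification; bzip2 derives the symbol set from the block):

```python
def mtf_encode(data: str, alphabet: str) -> list[int]:
    """Move-to-front: emit each symbol's current list position, then move it to the front."""
    table = list(alphabet)
    out = []
    for c in data:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))  # recently seen symbols get small indices
    return out
```

`mtf_encode("aaabbb", "ab")` yields `[0, 0, 0, 1, 0, 0]` – runs become zeros regardless of which character is repeating.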

Point is: by dropping compatibility with standard file formats, the decoder can become very small. I might be wrong on some of these figures, but that most likely won’t change the overall picture.

bzip-style methods are in the middle of the pack, but that’s somewhat misleading. While bzip2 typically uses 6 Huffman tables, I got good compression results with just one. With a single table, my bzip-style decoder fits in 1.5 KB – smaller than everything except xz and lzip, while decoding faster and compressing better.

It’s common knowledge that bzip is slow, but there’s nuance.

When compression is used to circumvent a hard limit, the difference between bzip and gzip is not booting slowly or quickly, but between booting and not booting at all. If unpacking initrd takes 0.5 seconds instead of 0.3 seconds, it’s not a big deal compared to being unable to boot.

From this perspective, saying bzip is slower than gzip begs the question. gzip cannot easily produce an optimal output like bzip can, so it searches for earlier occurrences heuristically with a configurable time-ratio tradeoff. gzip is faster not because of different designs per se, but because gzip decided being fast is more important than a good output, even at -9. If you want a good ratio, you have to use zopfli, which tries to encode gzip output more optimally and is an order of magnitude slower than bzip despite producing worse output.

On the decoding side, decoding bzip is slow because inverting BWT requires random access. This is less of an issue in high-level languages like Lua, where all operations are slow anyway. In this situation, most typical compression techniques are even more expensive. You can decode gzip faster than bzip2, but zstd and brotli are likely closer to it in speed.
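
The inversion itself is short – build the last-to-first mapping once, then walk it – but the walk jumps all over the array, which is exactly the random-access pattern in question. A Python sketch, assuming the transform appended a NUL sentinel:

```python
def inverse_bwt(last: str) -> str:
    """Invert a BWT (with a NUL sentinel) by walking the last-to-first mapping."""
    # Stable-sorting indices by character links each symbol of the last column
    # to its occurrence in the sorted first column.
    T = sorted(range(len(last)), key=lambda i: last[i])
    row = last.index("\0")  # the row ending in the sentinel is the original text
    out = []
    for _ in range(len(last)):
        row = T[row]  # this jump is the cache-hostile random access
        out.append(last[row])
    return "".join(out).rstrip("\0")
```

`inverse_bwt("annb\0aa")` recovers `"banana"` – and note that no positions were stored anywhere; the sorted order alone carries enough information.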

I haven’t tried using bzip3 in practice, but I expect decoding bzip3 to be significantly slower than bzip2 due to having to parse tokens bit-by-bit. I’m assuming it’s possible to get bzip3-like ratio without this, but I’ll cross that bridge when I come to it.

So that’s my answer: not a custom compression algorithm, just bzip with bells and whistles.

If you know anything about me, not inventing anything new might sound surprising. Well, I tried to, with varying success.

Here’s an exercise. The core idea behind text compression is that text is repetitive, so instead of repeating the word repeat every time it’s written, we can encode (hopefully shorter) references to the earlier repetitions of repeat. For example, the string:

Rust is an iron oxide, a usually reddish-brown oxide formed by the reaction of iron and oxygen in the catalytic presence of water or air moisture.

…can be encoded as:

  • Write Rust is an iron oxide, a usually reddish-brow
  • Copy 7 bytes from 31 bytes ago (n oxide)
  • Write formed by the
  • Copy 3 bytes from 24 bytes ago ( re)
  • Write acti
  • Copy 4 bytes from 60 bytes ago (on o)
  • Write f
  • Copy 6 bytes from 68 bytes ago ( iron )
  • Write and
  • Copy 3 bytes from 41 bytes ago ( ox)
  • Write yge
  • Copy 3 bytes from 84 bytes ago (n i)
  • Write n
  • Copy 5 bytes from 35 bytes ago ( the )
  • Write catalytic presence
  • Copy 4 bytes from 45 bytes ago ( of )
  • Write water or air moisture.
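
Evaluating such a token stream really is just a simple loop. A minimal sketch with a made-up token format (the tuple encoding here is purely illustrative – real formats pack these into bits, which is where all the complexity lives):

```python
def lz77_decode(tokens) -> bytes:
    """tokens: iterable of ('lit', bytes) or ('copy', length, distance)."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out += tok[1]
        else:
            _, length, dist = tok
            for _ in range(length):
                out.append(out[-dist])  # byte-by-byte, so copies may overlap
    return bytes(out)
```

Overlapping copies are the classic trick: `lz77_decode([("lit", b"hello "), ("copy", 11, 6)])` produces `b"hello hello hello"`, because the copy reads bytes it has just written.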

Since we’re compressing text, you might expect the repeated parts to be words. But in this example, it’s often syllables or random letters. Attempts to compress code based on code structure struggle with this, because they’re only capable of compressing individual tokens.

And while we could pre-process the code and then apply bzip to reap the rest of the benefits, it would complicate the decoder without measurably improving the compression ratio. Luz, for example, does this with gzip without any impact on the ratio, despite wasting many kilobytes on decoding.

This seems to be a common experience in data compression. Real-world data is structured in a way that humans can’t easily guess, but computers can recognize on the fly. Improvements come not from complicating algorithms, but from restructuring data and applying more powerful general-purpose methods.

If this sounds suspiciously like machine learning, you’re on point. The 2009 article Rationale for a Large Text Compression Benchmark describes this connection in more detail. It feels naive today, but it checks out: the top contender of the Large Text Compression Benchmark is nncp, and you can surely guess what “NN” stands for.

  • bzip might be suboptimal as a general-purpose compression format, but it’s great for text and code. One might even say the b in bzip stands for “best”.

  • bzip encoders are less riddled with heuristics and word-of-mouth design decisions than LZ77-based encoders, leaving less room for subtle mistakes that greatly affect the compression ratio.

  • bzip decoding is quite fast when implemented in a high-level language.



Comments

  • By idoubtit 2026-03-14 20:26 (6 replies)

    My experience does not match theirs when compressing text and code:

    > bzip might be suboptimal as a general-purpose compression format, but it’s great for text and code. One might even say the b in bzip stands for “best”.

    I've just checked again with a 1GB SQL file. `bzip2 -9` shrinks it to 83MB. `zstd -19 --long` to 52MB.

    Others have compressed the Linux kernel and found that bzip2's is about 15% larger than zstd's.

    • By mppm 2026-03-15 9:04

      What you are seeing here is probably the effect of window size. BZip has to perform the BWT strictly block-wise and is quite memory-hungry, so `bzip2 -9` uses a window size of 900KB, if I recall correctly. Dictionary-based algorithms are more flexible in this regard, and can gain a substantial advantage on very large and repetitive files. The article kind of forgets to mention this. Not that BZip isn't remarkably efficient for its simplicity, but it's not without limitations.

    • By bigiain 2026-03-15 2:40 (1 reply)

      I suspect the reason for the difference here may be specific use case and the implications there on the size of the files? The author's use case is Lua files to run in Minecraft, and I strongly suspect their example file at 327KB is very much closer to "typical" for that use case than a 1GB SQL file.

      It wouldn’t surprise me at all that “more modern” compression techniques work better on larger files. It also wouldn’t surprise me too much if there was no such thing as a 1GB file when bzip was originally written; according to Wikipedia, bzip2 is almost 30 years old (“Initial release: 18 July 1996”), and there are mentions of the preceding bzip (without the 2), which must have been even earlier than that. In the mid/late 90s I was flying round-the-world trips with a dozen or so 380 or 500MB hard drives in my luggage to screw into our colo boxen in Singapore, London, and San Francisco (because our office only had 56k ADSL internet).

      • By adrian_b 2026-03-15 11:59

        For large files, it is frequent to obtain much higher compression ratios when using a preprocessing method, e.g. by using lrzip (which invokes internal or external standard compressors after preprocessing the input to find long-range similarities).

        For instance, "lrzip -b", which uses bzip2 for compression, typically achieves much higher compression ratios on big files than using either xz or zstd alone. Of course, you can also use lrzip with xz or zstd, with various parameters, but among the many existing possibilities you must find an optimum compromise between compression ratio and compression/decompression times.

    • By ac29 2026-03-15 4:31

      > Others have compressed the Linux kernel and found that bzip2's is about 15% larger than zstd's

      I compressed kernel 6.19.8 with zstd -19 --long and bzip3 (default settings). The latter compressed better and was about 8x faster.

    • By cogman10 2026-03-14 20:55 (2 replies)

      bzip is old and slow.

      It was long surpassed by lzma and zstd.

      But back in roughly the 00s, it was the best standard for compression, because the competition was DEFLATE/gzip.

      • By duskwuff 2026-03-14 21:14

        Also potentially relevant: in the 00s, the performance gap between gzip and bzip2 wasn't quite as wide - gzip has benefited far more from modern CPU optimizations - and slow networks / small disks made a higher compression ratio more valuable.

      • By yyyk 2026-03-14 21:14 (1 reply)

        Even then, there were better options in the Windows world (RAR/ACE/etc.). Also, bzip2 was considered slow even when it was new.

        • By thesz 2026-03-14 22:04 (1 reply)

          RAR/ACE/etc. used continuous compression – all files were concatenated and compressed as if they were one single large file, much like what is done with .tar.bz2. bzip on Windows did not do that; there was no equivalent of .tar.bz2 on Windows.

          You can bzip2 -9 the files in some source code directory and tar the resulting .bz2 files – more or less equivalent to creating a ZIP archive with a BWT compression method. Then compare that with tar-ing the same source directory and running bzip2 -9 on the resulting .tar.

          The continuous mode in RAR was something back then, exactly because RAR had long LZ77 window and compressed files as continuous stream.

          • By yyyk 2026-03-14 22:38

            >continuous compression

            'Solid compression' (as WinRAR calls it) is still optional with RAR. I recall the default is 'off'. At the time, that mode was still pretty good compared to bzip2.

    • By eviks 2026-03-15 2:07

      is SQL file text or code?

    • By 8n4vidtmkvmk 2026-03-14 21:49

      And here I got the best compression out of xz for SQL.

  • By saghm 2026-03-14 17:47 (2 replies)

    Early on, the article mentions that xz and zstd have gotten more popular than bzip, and my admittedly naive understanding is that they’re considered to have better tradeoffs in terms of compression time and overall space saved. The performance section heavily discusses the encoding performance of gzip and bzip, but unless I’m missing something, the only references to xz or zstd in that section are brief handwaving about the decoding times probably being similar.

    My impression is that this article has a lot of technical insight into how bzip compares to gzip, but it fails to actually account for the real cause of the diminished popularity of bzip in favor of the non-gzip alternatives that it admits are the more popular choices in recent years.

    • By 0cf8612b2e1e 2026-03-14 20:05 (1 reply)

      There was an analysis which argued that zstd is pareto optimal.

      https://insanity.industries/post/pareto-optimal-compression/

      • By lucb1e 2026-03-15 5:02 (1 reply)

        > > Pareto optimality is a situation where no […] preference criterion can be better off without making at least one […] preference criterion worse off […].

        (Omissions theirs.)

        Wasn't that zstandard's stated goal? Not very surprising that it has this property, also considering it's much newer (2015) than the established tools like gzip (1992), bzip2 (1996), LZMA as used by xz utils (1999)

        Edit: https://github.com/facebook/zstd/blob/4856a00164c1d7b947bd38... initial commit indeed states it's meant to have good ratios and good (de)compression speeds as compared to other tools, without sacrificing one for the other (»"Standard" translates into everyday situations which neither look for highest possible ratio (which LZMA and ZPAQ cover) nor extreme speeds (which LZ4 covers).«). So Pareto by design, just not using that name

        • By bonzini 2026-03-15 7:24 (1 reply)

          > meant to have good ratios and good (de)compression speeds as compared to other tools

          That does not mean it’s Pareto optimal; Pareto-optimality forms a curve, and while zstd, LZMA, LZ4, and ZPAQ all want to be as close as possible to the curve, they focus on different parts of it. In particular, zstd tries to stay on the middle part of the curve, while LZMA and LZ4 focus on opposite sides.

                    ___.--- higher throughput
                  /     LZ4
                 / zStd
                |
                ; LZMA
               |
               | ZPAQ
              lower size
          
          Also, the Pareto curve is not necessarily known in advance. All you can do is add more and more algorithms or tweaks to understand what it looks like. For example, this blog post [https://insanity.industries/post/pareto-optimal-compression/] shows that prior to zstd, bzip2 and gzip were both pretty much Pareto optimal in the same area for ratio vs. compression speed. LZMA at low settings was a bit better but much slower. There was a huge gap between LZMA and LZ4, and bzip2/gzip filled it as best as they could.

          The same blog post shows that zstd is an absolute speed demon at decompression; while not all zstd settings are Pareto optimal when looking at size vs compression speed (in particular LZMA wins at higher compression ratios, and even considering zstd only there's hardly a reason to use levels 11-15), zstd is pretty much Pareto optimal at all settings when looking at size vs. decompression speed. On the other hand at intermediate settings zstd is faster and produces smaller files than gzip, which therefore is not Pareto optimal (anymore).

          • By rurban 2026-03-15 15:11 (1 reply)

            This misses the very best compressors, by Fabrice Bellard: https://bellard.org/nncp/ and, for text, tm_zip.

            • By lucb1e 2026-03-16 1:35

              Interesting approach. The fastest of the 4 presented compressors ("LSTM (small)") is 24 times slower than xz, and their best compressor ("LSTM (large1)") is 429 times slower than xz. Let alone gzip or, presumably, zstandard (not shown in paper). They also ran the models on different CPUs (a Core i5 and a Xeon E5) so the results are not even comparable within the same paper. A linked webpage lists the author's decompression times, which are even worse: xz decompresses twelve thousand times faster (50MB/s vs. 4kB/s) when nncp has an Nvidia RTX 3090 and 24GB RAM available to it, which apparently speeds it up by 3x compared to the original CPU implementation.

              At half the size of xz's output, there can be applications for this, but you need to:

              - not care about compression time

              - not be constrained on hardware requirements (7.6GB RAM, ideally let it run on a GPU)

              - not care about decompression time either

              - and the data must be text (I can't find benchmarks other than from English Wikipedia text, but various sources emphasize it's a text compressor so presumably this is no good on e.g. a spacecraft needing to transmit sensor/research data over a weak connection, even if the power budget trade-off of running a GPU instead of pumping power into the antenna were the optimal thing to do)

    • By usefulcat 2026-03-15 1:57 (3 replies)

      Bzip is slower than zstd and doesn’t compress as well as xz. There’s no place for it.

      • By darkwater 2026-03-15 8:53

        You can pry bzip from my dead cold hands.

      • By int_19h 2026-03-15 3:49 (2 replies)

        Article has the numbers for their input:

        uncompressed: 327005

        (gzip) zopfli --i100: 75882

        zstd -22 --long --ultra: 69018

        xz -9: 67940

        brotli -Z: 67859

        lzip -9: 67651

        bzip2 -9: 63727

        bzip3: 61067

        • By lucb1e 2026-03-15 3:59 (1 reply)

          This matches my experience compressing structured text btw. Bzip2 will beat every other tool out there, both on compression ratio and, sadly, decompression time

          OP says decompression time is so high because it has similar properties to a memory-hard password hash: it's bandwidth-bound due to the random access requirement. Even xz decompresses 2.5x faster, and I don't find it particularly fast

          This is why I switched away, also for text compression; searching for anything that isn't near the beginning of a large file is tedious. My use-case for compression is generally not like OP's, that is, compressing 100KB so that it can fit into Minecraft (if I understood their purpose correctly); I compress files because they take too much disk space (gigabytes). But if I never wanted to access them, I'd not store them, so decompression speed matters. So I kinda do agree with GP that Bzip2 has limited purposes when Zstd is just a few % more bytes to store for over an order of magnitude more speed (1GB/s instead of 45MB/s)

          Edit: And all that ignores non-json/xml/code/text compression tasks, where Bzip2/LZMA doesn't give you the best compression ratio. I'd argue it is premature optimization to use Bzip2 without a very specific use-case like OP has for very good code compression ratios and a simple decoder

          • By necovek 2026-03-15 10:14 (1 reply)

            I wonder what the combined speed would be for small to mid-sized text files if they were fully loaded into memory first? That swaps multiple random accesses for a single sequential read (even if access is not truly sequential with flash memory, it should still perform better than totally random access), plus random access to memory, which should not be a bottleneck at these speeds.

            Or perhaps this is already being done this way and it is a bottleneck?

            This might not work for multi-gigabyte files, but most of text content is well under 1MiB.

            • By lucb1e 2026-03-15 23:47

              This is essentially already being done. The kernel will keep in RAM as much of the file that the system is operating on as will fit

              Code can care about cache locality and improve upon this default behavior, like if you find a heuristic where you can do another part of the decompression already while some of the data is still in a CPU cache and you don't have to hit main RAM, but whether that's possible will depend on the algorithm

              Being in RAM sadly doesn't mean that random access doesn't matter anymore. Just like how hard drives (compare to, say, tape drives) are random access, you still have swapping times to load the data into the CPU (or arm-moving times in the case of an HDD)

        • By ac29 2026-03-15 4:15 (1 reply)

          I tested bzip3 vs xz compressing enwik8: bzip3 was 7x faster and had ~20% better compression

          • By adrian_b 2026-03-15 11:52

            I have also tested bzip3 and initially I encountered a great number of cases where bzip3 was much better than zstd from all points of view, i.e. compression ratio and execution times, for any zstd options.

            Unfortunately, later I have also found other cases where the bzip3 performance was not so good.

            I spend a lot of time on compression/decompression, but sadly there is no algorithm that can be said to be superior to all others in all circumstances, so for optimum results you must still be prepared to use any of them.

            Even the ancient gzip is preferable in certain cases, when speed is more important than compression ratio, because sometimes it happens to provide a better compromise than newer algorithms.

      • By jgalt212 2026-03-15 12:30 (1 reply)

        10 years ago we looked at replacing gzip archives with bzip. It was just too slow for the incremental space-saving gains offered. Creating bzip archives took forever. I know xz is “better”, but we still just gzip everywhere.

        • By Sesse__ 2026-03-15 15:10

          xz is generally slower-but-more-dense; it’s not meant to be a full gzip replacement. Zstd, on the other hand, aims to be a “better gzip”, in that it compresses about as fast as gzip but packs somewhat denser (and decompresses faster).

  • By thesz 2026-03-14 21:56

    BWT is a prediction by partial match (PPM) in disguise.

    Consider "bananarama":

      "abananaram"
      "amabananar"
      "ananaramab"
      "anaramaban"
      "aramabanan"
      "bananarama"
      "mabananara"
      "nanaramaba"
      "naramabana"
      "ramabanana"
    
    The last symbols on each line get context from first symbols of the same line. It is so due to rotation.

    But, due to sorting, the contexts are not contiguous for the predicted (last) character, and long-range dependencies are broken. This breaking of long dependencies is why MTF, which implicitly transforms raw symbol statistics into something like Zipfian [1] statistics, encodes BWT’s output well.

    [1] https://en.wikipedia.org/wiki/Zipf%27s_law

    Given that, the author may find PPM*-based compressors to be more performant compression-wise. The Large Text Compression Benchmark [2] tells us exactly that: some “durilka-bububu” compressor that uses PPM fares better than BWT, by almost a third.
