
March 12, 2026 Hacker News Lobsters RedditThe story goes like this. ComputerCraft is a mod that adds programming to Minecraft. You write Lua code that gets executed by a bespoke interpreter with…
My experience does not match theirs when compressing text and code:
> bzip might be suboptimal as a general-purpose compression format, but it’s great for text and code. One might even say the b in bzip stands for “best”.
I've just checked again with a 1GB SQL file. `bzip2 -9` shrinks it to 83MB. `zstd -19 --long` to 52MB.
Others have compressed the Linux kernel and found that bzip2's is about 15% larger than zstd's.
What you are seeing here is probably the effect of window size. BZip has to perform the BWT strictly block-wise and is quite memory-hungry, so `bzip2 -9` uses a window size of 900KB, if I recall correctly. Dictionary-based algorithms are more flexible in this regard, and can gain a substantial advantage on very large and repetitive files. The article kind of forgets to mention this. Not that BZip isn't remarkably efficient for its simplicity, but it's not without limitations.
I suspect the reason for the difference here may be specific use case and the implications there on the size of the files? The author's use case is Lua files to run in Minecraft, and I strongly suspect their example file at 327KB is very much closer to "typical" for that use case than a 1GB SQL file.
It wouldn't surprise me at all that "more modern" compression techniques work better on larger files. It also wouldn't surprise me too much if there was no such thing as a 1GB file when bzip was originally written, according to Wikipedia bzip2 is almost 30 years old "Initial releases 18 July 1996". And there are mentions of the preceding bzip (without the 2) which must have been even earlier than that. In the mid/late 90s I was flying round the world trips with a dozen or so 380 or 500MB hard drives in my luggage to screw into our colo boxen in Singapore London and San Francisco (because out office only has 56k adsl internet).
For large files, it is frequent to obtain much higher compression ratios when using a preprocessing method, e.g. by using lrzip (which invokes internal or external standard compressors after preprocessing the input to find long-range similarities).
For instance, "lrzip -b", which uses bzip2 for compression, typically achieves much higher compression ratios on big files than using either xz or zstd alone. Of course, you can also use lrzip with xz or zstd, with various parameters, but among the many existing possibilities you must find an optimum compromise between compression ratio and compression/decompression times.
> Others have compressed the Linux kernel and found that bzip2's is about 15% larger than zstd's
I compressed kernel 6.19.8 with zstd -19 --long and bzip3 (default settings). The latter compressed better and was about 8x faster.
bzip is old and slow.
It was long surpassed by lzma and zstd.
But back in roughly the 00s, it was the best standard for compression, because the competition was DEFLATE/gzip.
Also potentially relevant: in the 00s, the performance gap between gzip and bzip2 wasn't quite as wide - gzip has benefited far more from modern CPU optimizations - and slow networks / small disks made a higher compression ratio more valuable.
Even then, there were better options in the Windows world (RAR/ACE/etc.). Also, bzip2 was considered slow even when it was new.
RAR/ACE/etc used continuous compression - all files were concatenated and compressed as if they were one single large file. Much like what is done with .tar.bz. Bzip on Windows did not do that, there was no equivalent of .tar.bz2 on Windows.
You can bzip2 -9 files in some source code directory and tar these .bz2 files. This would be more or less equivalent to creating ZIP archive with BWT compression method. Then you can compare result with tar-ing the same source directory and bzip2 -9 the resulting .tar.
Then you can compare.
The continuous mode in RAR was something back then, exactly because RAR had long LZ77 window and compressed files as continuous stream.
>continuous compression
'Solid compression' (as WinRAR calls it) is still optional with RAR. I recall the default is 'off'. At the time, that mode was still pretty good compared to bzip2.
is SQL file text or code?
And here i got best compression out of xz for SQL.
Early on the article mentions that xz have zstd have gotten more popular than bzip, and my admitted naive understanding is that they're considered to have better tradeoffs in teems of collision compression time and overall space saved by compression. The performance section heavily discusses encoding performance of gzip and bzip, but unless I'm missing something, the only references to xz or zstd in that section are briefly handwaving about the decoding times probably being similar.
My impression is that this article has a lot of technical insight into how bzip compares to gzip, but it fails actually account for the real cause of the diminished popularity of bzip in favor of the non-gzip alternatives that it admits are the more popular choices in recent years.
There was an analysis which argued that zstd is pareto optimal.
https://insanity.industries/post/pareto-optimal-compression/
> > Pareto optimality is a situation where no […] preference criterion can be better off without making at least one […] preference criterion worse off […].
(Omissions theirs.)
Wasn't that zstandard's stated goal? Not very surprising that it has this property, also considering it's much newer (2015) than the established tools like gzip (1992), bzip2 (1996), LZMA as used by xz utils (1999)
Edit: https://github.com/facebook/zstd/blob/4856a00164c1d7b947bd38... initial commit indeed states it's meant to have good ratios and good (de)compression speeds as compared to other tools, without sacrificing one for the other (»"Standard" translates into everyday situations which neither look for highest possible ratio (which LZMA and ZPAQ cover) nor extreme speeds (which LZ4 covers).«). So Pareto by design, just not using that name
> meant to have good ratios and good (de)compression speeds as compared to other tools
That does not mean it's Pareto optimal; Pareto-optimality forms a curve and while zstd, LZMA, LZ4, ZPAQ all want to be as close as possible to the curve, they focus on different parts of it. In particular zstd tries to stay on the middle part of the curve, while LZMA and LZ4 focus on opposite sides
___.--- higher throughput
/ LZ4
/ zStd
|
; LZMA
|
| ZPAQ
lower size
Also, the Pareto curve is not necessarily known in advance. All you can do is add more and more algorithms or tweaks to understand what it looks like. For example, this blog post [https://insanity.industries/post/pareto-optimal-compression/] shows that prior to zstd, bzip2 and gzip2 were both pretty much Pareto optimal in the same area for ratio vs. compression speed. LZMA at low settings was a bit better but much slower. There was a huge gap between LZMA and LZ4, and bzip2/gzip filled it as best as they could.The same blog post shows that zstd is an absolute speed demon at decompression; while not all zstd settings are Pareto optimal when looking at size vs compression speed (in particular LZMA wins at higher compression ratios, and even considering zstd only there's hardly a reason to use levels 11-15), zstd is pretty much Pareto optimal at all settings when looking at size vs. decompression speed. On the other hand at intermediate settings zstd is faster and produces smaller files than gzip, which therefore is not Pareto optimal (anymore).
This misses the very best compressors by Fabrice Bellard. https://bellard.org/nncp/ and for text tm_zip
Interesting approach. The fastest of the 4 presented compressors ("LSTM (small)") is 24 times slower than xz, and their best compressor ("LSTM (large1)") is 429 times slower than xz. Let alone gzip or, presumably, zstandard (not shown in paper). They also ran the models on different CPUs (a Core i5 and a Xeon E5) so the results are not even comparable within the same paper. A linked webpage lists the author's decompression times, which are even worse: xz decompresses twelve thousand times faster (50MB/s vs. 4kB/s) when nncp has an Nvidia RTX 3090 and 24GB RAM available to it, which apparently speeds it up by 3x compared to the original CPU implementation.
At half the size of xz's output, there can be applications for this, but you need to:
- not care about compression time
- not be constrained on hardware requirements (7.6GB RAM, ideally let it run on a GPU)
- not care about decompression time either
- and the data must be text (I can't find benchmarks other than from English Wikipedia text, but various sources emphasize it's a text compressor so presumably this is no good on e.g. a spacecraft needing to transmit sensor/research data over a weak connection, even if the power budget trade-off of running a GPU instead of pumping power into the antenna were the optimal thing to do)
Bzip is slower than zstd and doesn’t compress as well as xz. There’s no place for it.
You can pry bzip from my dead cold hands.
Article has the numbers for their input:
uncompressed: 327005
(gzip) zopfli --i100: 75882
zstd -22 --long --ultra: 69018
xz -9: 67940
brotli -Z: 67859
lzip -9: 67651
bzip2 -9: 63727
bzip3: 61067
This matches my experience compressing structured text btw. Bzip2 will beat every other tool out there, both on compression ratio and, sadly, decompression time
OP says decompression time is so high because it has similar properties to a memory-hard password hash: it's bandwidth-bound due to the random access requirement. Even xz decompresses 2.5x faster, and I don't find it particularly fast
This is why I switched away, also for text compression; searching for anything that isn't near the beginning of a large file is tedious. My use-case for compression is generally not like OP's, that is, compressing 100KB so that it can fit into Minecraft (if I understood their purpose correctly); I compress files because they take too much disk space (gigabytes). But if I never wanted to access them, I'd not store them, so decompression speed matters. So I kinda do agree with GP that Bzip2 has limited purposes when Zstd is just a few % more bytes to store for over an order of magnitude more speed (1GB/s instead of 45MB/s)
Edit: And all that ignores non-json/xml/code/text compression tasks, where Bzip2/LZMA doesn't give you the best compression ratio. I'd argue it is premature optimization to use Bzip2 without a very specific use-case like OP has for very good code compression ratios and a simple decoder
I wonder what the combined speed would be for small to mid-sized text files if they were fully loaded into memory first? That swaps multiple random-accesses for single sequential read (even if sequential is not really with flash memory, it should still perform better than totally random access), and memory random access which should not be a bottleneck at these speeds.
Or perhaps this is already being done this way and it is a bottleneck?
This might not work for multi-gigabyte files, but most of text content is well under 1MiB.
This is essentially already being done. The kernel will keep in RAM as much of the file that the system is operating on as will fit
Code can care about cache locality and improve upon this default behavior, like if you find a heuristic where you can do another part of the decompression already while some of the data is still in a CPU cache and you don't have to hit main RAM, but whether that's possible will depend on the algorithm
Being in RAM sadly doesn't mean that random access doesn't matter anymore. Just like how hard drives (compare to, say, tape drives) are random access, you still have swapping times to load the data into the CPU (or arm-moving times in the case of an HDD)
I tested bzip3 vs xz compressing enwik8: bzip3 was 7x faster and had ~20% better compression
I have also tested bzip3 and initially I encountered a great number of cases where bzip3 was much better than zstd from all points of view, i.e. compression ratio and execution times, for any zstd options.
Unfortunately, later I have also found other cases where the bzip3 performance was not so good.
I spend a lot of time on compression/decompression, but sadly there is no algorithm that can be said to be superior to all others in all circumstances, so for optimum results you must still be prepared to use any of them.
Even the ancient gzip is preferable in certain cases, when speed is more important than compression ratio, because sometimes it happens to provide a better compromise than newer algorithms.
10 years ago we look at replacing gzip archives with bzip. It was just too slow for the incremental space saving gains offered. Creating bzip archives took forever. I know xz is "better", but we still just gzip everywhere.
xz is generally slower-but-more-dense; it's not meant to be a full gzip replacement. Zstd, on the other hand, aims to be a “better gzip”, in that it's compresses about as fast as gzip but packs somewhat denser (and decompresses faster).
BWT is a prediction by partial match (PPM) in disguise.
Consider "bananarama":
"abananaram"
"amabananar"
"ananaramab"
"anaramaban"
"aramabanan"
"bananarama"
"mabananara"
"nanaramaba"
"naramabana"
"ramabanana"
The last symbols on each line get context from first symbols of the same line. It is so due to rotation.But, due to sorting, contexts are not contiguous for the (last) character predicted and long dependencies are broken. Because of broken long dependencies, it is why MTF, which implicitly transforms direct symbols statistics into something like Zipfian [1] statistics, does encode BWT's output well.
[1] https://en.wikipedia.org/wiki/Zipf%27s_law
Given that, author may find PPM*-based compressors to be more compression-wise performant. Large Text Compression Benchmark [2] tells us exactly that: some "durilka-bububu" compressor that uses PPM fares better than BWT, almost by third.