https://john-millikin.com/
john@john-millikin.com
Currently semi-retired; formerly at Google (2011-2017) and Stripe (2017-2022)
It doesn't matter, though. xxhash is better than crc32 for hashing keys in a hash table, but both of them are inappropriate for file checksums -- especially as part of a data archival/durability strategy.
It's not obvious to me that per-page checksums in an archive format for comic books are useful at all, but if you really wanted them for some reason then crc32 (fast, common, should detect bad RAM or a decoder bug) or sha256 (slower, common, should detect any change to the bitstream) seem like reasonable choices and xxhash/xxh3 seems like LARPing.
> It seems that JPEG can be decoded on the GPU [1] [2]
Sure, but you wouldn't want to. Many algorithms can be executed on a GPU via CUDA/ROCm, but the use cases for on-GPU JPEG/PNG decoding (mostly AI model training? maybe some sort of giant megapixel texture?) are unrelated to anything you'd use CBZ for.

For a comic book the performance-sensitive part is loading the current and adjoining pages, which can be done fast enough to appear instant on the CPU. If the program does bulk loading then it's for thumbnail generation, which would also be done on the CPU.
Loading compressed comic pages directly to the GPU would be if you needed to ... I dunno, have some sort of VR library browser? It's difficult to think of a use case.
> According to smhasher tests [3] CRC32 is not limited by memory bandwidth.
> Even if we multiply CRC32 scores x4 (to estimate 512 bit wide SIMD from 128
> bit wide results), we still don't get close to memory bandwidth.
Your link shows CRC32 at 7963.20 MiB/s (~7.77 GiB/s), which indicates it's either very old or isn't measuring pure CRC32 throughput (I see stuff about the C++ STL in the logs).

Look at https://github.com/corsix/fast-crc32 for example, which measures 85 GB/s (GB, GiB, eh close enough) on the Apple M1. That's fast enough that I'm comfortable calling it limited by memory bandwidth on real-world systems. Obviously if you solder a Raspberry Pi to some GDDR then the ratio differs.
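If you want a quick sanity check on your own machine, a rough throughput measurement is a few lines of Python. Note this uses the stock zlib binding, so the number it prints reflects however your zlib was built (generic table-driven vs. SIMD); a kernel like fast-crc32 will be much faster:

```python
import time
import zlib

# Rough CRC32 throughput over a 64 MiB buffer. The result depends entirely
# on the underlying zlib build; it's a floor, not a ceiling, for what a
# SIMD implementation can do.
buf = bytes(64 * 1024 * 1024)

start = time.perf_counter()
crc = zlib.crc32(buf)
elapsed = time.perf_counter() - start

print(f"crc32=0x{crc:08X}  throughput={len(buf) / elapsed / 2**30:.2f} GiB/s")
```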
> The 32 bit hash of CRC32 is too low for file checksums. xxhash is definitely
> an improvement over CRC32.
You don't want to use xxhash (or crc32, or cityhash, ...) for checksums of archived files; that's not what they're designed for. Use them as the key function for hash tables. That's why their output is 32 or 64 bits: they're designed to fit into a machine integer.

File checksums don't have the same size limit, so it's fine to use 256- or 512-bit checksum algorithms, which means you're not limited to xxhash.
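To make the size distinction concrete, here's a minimal Python sketch (using stdlib `zlib.crc32` as a stand-in for the hash-table-key family, since xxhash isn't in the standard library):

```python
import hashlib
import zlib

data = b"...page bytes..."

# Hash-table key: 32-bit output, fits in a machine integer,
# cheap enough to call on every lookup.
bucket = zlib.crc32(data) % 1024

# File checksum: 256-bit output, sized so that a silent collision
# with corrupted data is implausible.
digest = hashlib.sha256(data).hexdigest()

assert 0 <= bucket < 1024
assert len(digest) == 64  # 256 bits = 64 hex characters
```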
> Why would you need to use a cryptographic hash function to check integrity
> of archived files? A quality non-cryptographic hash function will detect
> corruptions due to things like bit-rot, bad RAM, etc. just the same.
I have personally seen bitrot and network transmission errors that were not caught by xxhash-type hash functions, but were caught by higher-level checksums. The performance properties of hash functions used for hash table keys make those same functions less appropriate for archival.

> And why is 256 bits needed here? Kopia developers, for example, think 128
> bit hashes are big enough for backup archives [4].
The checksum algorithm doesn't need to be cryptographically strong, but if you're using software written in the past decade then SHA256 is supported everywhere by everything, so might as well use it by default unless there's a compelling reason not to.

For archival you only need to compute the checksums on file transfer and/or periodic archive scrubbing, so the overhead of SHA256 vs SHA1/MD5 doesn't really matter.
I don't know what kopia is, but according to your link it looks like their wire protocol involves each client downloading a complete index of the repository content, including a CAS identifier for every file. The semantics would be something like Git? Their list of supported algorithms looks reasonable (blake, sha2, sha3) so I wouldn't have the same concerns as I would if they were using xxhash or cityhash.
I use CBZ to archive both physical and digital comic books so I was interested in the idea of an improved container format, but the claimed improvements here don't make sense.
---
For example they make a big deal about each archive entry being aligned to a 4 KiB boundary "allowing for DirectStorage transfers directly from disk to GPU memory", but the pages within a CBZ are going to be encoded (JPEG/PNG/etc) rather than just being bitmaps. They need to be decoded first, the GPU isn't going to let you create a texture directly from JPEG data.
Furthermore the README says "While folders allow memory mapping, individual images within them are rarely sector-aligned for optimized DirectStorage throughput" which ... what? If an image file needs to be sector-aligned (!?) then a BBF file would also need to be, else the 4 KiB alignment within the file doesn't work, so what is special about the format that causes the OS to place its files differently on disk?
Also in the official DirectStorage docs (https://github.com/microsoft/DirectStorage/blob/main/Docs/De...) it says this:
> Don't worry about 4-KiB alignment restrictions
> * Win32 has a restriction that asynchronous requests be aligned on a
> 4-KiB boundary and be a multiple of 4-KiB in size.
> * DirectStorage does not have a 4-KiB alignment or size restriction. This
> means you don't need to pad your data which just adds extra size to your
> package and internal buffers.
Where is the supposed 4 KiB alignment restriction even coming from?

There are zip-based formats that align files so they can be mmap'd as executable pages, but that's not what's happening here, and I've never heard of a JPEG/PNG/etc image decoder that requires aligned buffers for the input data.
Is the entire 4 KiB alignment requirement fictitious?
---
The README also talks about using xxhash instead of CRC32 for integrity checking (the OP calls it "verification"), claiming this is more performant for large collections, but this is insane:
> ZIP/RAR use CRC32, which is aging, collision-prone, and significantly slower
> to verify than XXH3 for large archival collections.
> [...]
> On multi-core systems, the verifier splits the asset table into chunks and
> validates multiple pages simultaneously. This makes BBF verification up to
> 10x faster than ZIP/RAR CRC checks.
CRC32 is limited by memory bandwidth if you're using a normal (i.e. SIMD) implementation. Assuming 100 GiB/s throughput, a typical comic book page (a few megabytes) will take like ... a millisecond? And there's no data dependency between file content checksums in the zip format, so for a CBZ you can run the CRC32 calculations in parallel for each page just like BBF says it does.

But that doesn't matter, because to actually check the integrity of archived files you want to use something like sha256, not CRC32 or xxhash. Checksum each archive (not each page), store that checksum as a `.sha256` file (or whatever), and now you can (1) use normal tools to check that your archives are intact, and (2) record those checksums as metadata in the blob storage service you're using.
---
The Reddit thread has more comments from people who have noticed other sorts of discrepancies, and the author is having a really difficult time responding to them in a coherent way. The most charitable interpretation is that this whole project (supposed problems with CBZ, the readme, the code) is the output of an LLM.
I'm not (only) talking about the general population, but major sites. As a quick sanity check, the following sites are serving images with the `image/jpeg` content type:
* CNN (cnn.com): News-related photos on their front page
* Reddit (www.reddit.com): User-provided images uploaded to their internal image hosting
* Amazon (amazon.com): Product categories on the front page (product images are in WebP)
I wouldn't expect to see a lot of WebP on personal homepages or old-style forums, but if bandwidth costs were a meaningful budget line item then I would expect to see ~100% adoption of WebP or AVIF for any image that gets recompressed by a publishing pipeline.
Most of the code in WebP and AVIF is shared with VP8/AV1, which means if your browser supports contemporary video codecs then it also gets pretty good lossy image codecs for free. JPEG-XL is a separate codebase, so it's far more effort to implement and merely providing better compression might not be worth it absent other considerations. The continued widespread use of JPEG is evidence that many web publishers don't care that much about squeezing out a few bytes.
Also from a security perspective the reference implementation of JPEG-XL isn't great. It's over a hundred kLoC of C++, and given the public support for memory safety by both Google and Mozilla it would be extremely embarrassing if a security vulnerability in libjxl led to a zero-click zero-day in either Chrome or Firefox.
The timing is probably a sign that Chrome considers the Rust implementation of JPEG-XL to be mature enough (or at least heading in that direction) to start kicking the tires.
This project is an enhanced reader for Y Combinator's Hacker News: https://news.ycombinator.com/.
The interface also allows you to comment, post, and interact with the original HN platform. Credentials are stored locally and are never sent to any server; you can check the source code here: https://github.com/GabrielePicco/hacker-news-rich.
For suggestions and feature requests you can write to me here: gabrielepicco.github.io