tomnicholas1

2026-03-06 9:24

Submitted: "Periodic Labs"

2 points | 0 comments | periodic.com

From bits to atoms.

2026-02-24 7:10

Commented: "A distributed queue in a single JSON file on object storage"

What you describe is very similar to how Icechunk[1] works. It works beautifully for transactional writes to "repos" containing PBs of scientific array data in object storage.

[1]: https://icechunk.io/en/latest/

2026-01-17 6:35

Commented: "Show HN: Streaming gigabyte medical images from S3 without downloading them"

People have literally used Zarr for this - at one point Gemini used Zarr for checkpointing model weights. Not sure what the current fashion in that space is though.

It's definitely one of many fields that see convergent evolution towards something that just looks like Zarr. In fact you can use VirtualiZarr to parse HuggingFace's "SafeTensors" format [0].

[0]: https://github.com/zarr-developers/VirtualiZarr/pull/555

2026-01-17 5:53

Commented: "Show HN: Streaming gigabyte medical images from S3 without downloading them"

IMO Zarr is that newer format. It abstracts over the features of all these other formats so neatly that it can literally subsume them.

I feel that we no longer really need TIFF etc. - for scientific use cases in the cloud Zarr is all that's needed going forwards. The other file formats become just archival blobs that either are converted to Zarr or pointed at by virtual Zarr stores.

2026-01-17 4:22

Commented: "Show HN: Streaming gigabyte medical images from S3 without downloading them"

The generalized form of this range-request-based streaming approach looks something like my project VirtualiZarr [0].

Many of these scientific file formats (HDF5, netCDF, TIFF/COG, FITS, GRIB, JPEG and more) are essentially just contiguous multidimensional array (or "tensor") chunks embedded alongside metadata describing what's in the chunks. Fetching them efficiently from object storage mostly comes down to fetching the metadata up front, so you know the byte ranges of the chunks you want [1].
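To make the "metadata first, then byte ranges" idea concrete, here is a minimal sketch using a made-up toy file layout (a chunk count followed by offset/length pairs, then the chunks). The layout and the helper names `build_toy_file`, `read_range` and `read_index` are assumptions for illustration only, not any real format or library API; `read_range` slices an in-memory blob as a stand-in for an HTTP GET with a `Range` header.

```python
import struct

def build_toy_file(chunks):
    """Pack chunks behind a small index: a count, then (offset, length) pairs."""
    header_size = 4 + 8 * len(chunks)
    index, offset = [], header_size
    for chunk in chunks:
        index.append((offset, len(chunk)))
        offset += len(chunk)
    header = struct.pack("<I", len(chunks))
    for off, length in index:
        header += struct.pack("<II", off, length)
    return header + b"".join(chunks)

def read_range(blob, offset, length):
    """Stand-in for an HTTP GET with 'Range: bytes=offset-(offset+length-1)'."""
    return blob[offset:offset + length]

def read_index(blob):
    """Fetch only the metadata: two small reads, no chunk data touched."""
    (count,) = struct.unpack("<I", read_range(blob, 0, 4))
    raw = read_range(blob, 4, 8 * count)
    return [struct.unpack_from("<II", raw, 8 * i) for i in range(count)]

toy = build_toy_file([b"aaaa", b"bbbbbb", b"cc"])
index = read_index(toy)          # metadata fetched up front
offset, length = index[1]        # locate the one chunk we care about
chunk = read_range(toy, offset, length)
```

Once the index is known, each chunk costs exactly one ranged read, and the rest of the file is never transferred.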

The data model of Zarr [2] generalizes this pattern pretty well, so that when backed by Icechunk [3], you can store a "datacube" of "virtual chunk references" that point at chunks anywhere inside the original files on S3.
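As a rough illustration of what a "virtual chunk reference" is, the sketch below keeps a manifest mapping chunk grid indices to (path, offset, length) triples that point into existing files. This is not the Icechunk or VirtualiZarr API: `ChunkRef`, `write_legacy_file` and `read_chunk` are hypothetical names, and local temp files stand in for objects on S3.

```python
import os
import struct
import tempfile
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical virtual reference: where a chunk lives inside another file."""
    path: str
    offset: int
    length: int

def write_legacy_file(path, header, values):
    """Write a toy 'legacy' file (header + int32 payload); return a ref to the payload."""
    payload = struct.pack(f"<{len(values)}i", *values)
    with open(path, "wb") as f:
        f.write(header)
        f.write(payload)
    return ChunkRef(path, len(header), len(payload))

def read_chunk(ref):
    """Read only the referenced byte range, skipping the file's own header."""
    with open(ref.path, "rb") as f:
        f.seek(ref.offset)
        raw = f.read(ref.length)
    return list(struct.unpack(f"<{ref.length // 4}i", raw))

tmpdir = tempfile.mkdtemp()
# The manifest is the "datacube" of references: chunk index -> location in an original file.
manifest = {
    (0,): write_legacy_file(os.path.join(tmpdir, "day1.bin"), b"HDR-ONE", [1, 2, 3]),
    (1,): write_legacy_file(os.path.join(tmpdir, "day2.bin"), b"LONGER-HEADER", [4, 5, 6]),
}
# Assemble one logical array from chunks that stay inside their original files.
datacube = [v for key in sorted(manifest) for v in read_chunk(manifest[key])]
```

The original files are never rewritten or copied; only the small manifest is new, which is what makes the store "virtual".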

This allows you to stream data out as fast as the S3 network connection allows [4], and then you're free to pull that directly, or build tile servers on top of it [5].

In the Pangeo project and at Earthmover we do all this for weather and climate science data. But the underlying OSS stack is domain-agnostic, so it works for all sorts of multidimensional array data, and VirtualiZarr has a plugin system for parsing different scientific file formats.

I would love to see if someone could create a virtual Zarr store pointing at this WSI data!

[0]: https://virtualizarr.readthedocs.io/en/stable/

[1]: https://earthmover.io/blog/fundamentals-what-is-cloud-optimi...

[2]: https://earthmover.io/blog/what-is-zarr

[3]: https://earthmover.io/blog/icechunk-1-0-production-grade-clo...

[4]: https://earthmover.io/blog/i-o-maxing-tensors-in-the-cloud

[5]: https://earthmover.io/blog/announcing-flux

Hacker News

tomnicholas1

127

2024-08-19
