Availability zones are not durability zones. S3 aims for objects to still be available with one AZ down, but not more than that. That does actually impose a constraint on the ratio relative to the number of AZs you shard across.
If we assume 3 AZs, then you lose 1/3 of your shards when an AZ goes down. With 9 shards you could do at most 6:9, which is a ratio of 1.5 physical bytes per logical byte. But that's unacceptable in practice, because you know you will temporarily lose shards to HDD failure, and a 6:9 scheme tolerates none of that in the AZ-down scenario. So 1.5 is the hard limit, and in practice you can't even reach it.
To lower the ratio from 1.8, you have to increase the denominator (the number of shards needed to reconstruct the object). That isn't possible while preserving the availability guarantees with just 9 shards.
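To make that concrete, here is a minimal sketch that checks every k-of-9 scheme, assuming shards are spread evenly across 3 AZs (3 per AZ, which is my modeling assumption, not a statement about S3's actual placement). It shows that 6:9 is the only scheme reaching 1.5, and it leaves zero spare shards once an AZ is down, while 5:9 at 1.8 leaves one.

```python
# Minimal sketch: with 9 shards spread evenly across 3 AZs (3 per AZ, an
# assumption for illustration), which k-of-9 schemes stay readable with one AZ down?
TOTAL_SHARDS = 9
AZS = 3
SHARDS_PER_AZ = TOTAL_SHARDS // AZS

for k in range(1, TOTAL_SHARDS + 1):          # k = shards needed to reconstruct
    surviving = TOTAL_SHARDS - SHARDS_PER_AZ  # shards left after losing one AZ
    ratio = TOTAL_SHARDS / k                  # physical bytes per logical byte
    spare = surviving - k                     # extra shard losses tolerated during the outage
    print(f"{k}:{TOTAL_SHARDS}  ratio={ratio:.2f}  "
          f"available with one AZ down: {spare >= 0}  spare shards: {max(spare, 0)}")
```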
Note that Cloudflare's R2 makes no such guarantees, and so does achieve a more favorable cost with their erasure coding scheme.
Note also that if you increase the number of shards, it becomes possible to lower the ratio without sacrificing availability. Example: with 18 shards we can choose 11:18, which gives us about 1.64 physical bytes per logical byte. And it still takes 1 AZ + 2 shards to make an object unavailable.
You can extrapolate from there to develop other sharding schemes that improve both the ratio and the availability!
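Here is a rough sketch of that extrapolation, again assuming an even spread across 3 AZs: it searches wider stripes for k:n schemes that beat 1.8x storage while still tolerating one AZ down plus at least one further shard loss. With 18 shards it surfaces 11:18 at about 1.64x, matching the example above.

```python
# Sketch of the extrapolation: for wider stripes spread evenly across 3 AZs
# (even spread is my assumption), find k:n schemes that beat 1.8x storage while
# still tolerating one AZ down plus at least one further shard loss.
AZS = 3

for n in (9, 12, 15, 18, 24):
    per_az = n // AZS
    for k in range(1, n + 1):
        surviving = n - per_az      # shards left after an AZ outage
        spare = surviving - k       # additional shard losses tolerated on top of that
        ratio = n / k
        if spare >= 1 and ratio < 1.8:
            print(f"{k}:{n}  ratio={ratio:.2f}  spare shards with one AZ down: {spare}")
```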
Another key hidden assumption is that you don't worry about correlated shard loss except in the AZ down case. HDDs fail, but these are independent events. So you can bound the probability of simultaneous shard loss using the mean time to failure and the mean time to repair that your repair system achieves.
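A back-of-envelope version of that bound, with made-up numbers (not AWS figures): if failures are independent, a given shard is currently lost and unrepaired with probability roughly MTTR / (MTTF + MTTR), and the number of simultaneously missing shards for one object follows a binomial distribution.

```python
# Back-of-envelope bound on simultaneous shard loss, assuming independent HDD
# failures. MTTF, MTTR, and the 5:9 scheme are illustrative assumptions.
from math import comb

MTTF_HOURS = 300_000    # assumed mean time to failure for one HDD
MTTR_HOURS = 6          # assumed mean time for the repair system to rebuild a shard
N, K = 9, 5             # the article's 5:9 example, for illustration

p_down = MTTR_HOURS / (MTTF_HOURS + MTTR_HOURS)   # chance a given shard is currently lost

# The object is unreadable only if more than N - K shards are down at the same time.
p_unreadable = sum(comb(N, m) * p_down**m * (1 - p_down)**(N - m)
                   for m in range(N - K + 1, N + 1))
print(f"P(one shard down) ≈ {p_down:.1e}, P(object unreadable) ≈ {p_unreadable:.1e}")
```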
I think the article's title question is a bit misleading because it focuses on peak throughput for S3 as a whole. The interesting question is "How can the throughput for a GET exceed the throughput of an HDD?"
If you just replicated, you could still get big aggregate throughput for S3 as a whole by spreading many reads across different HDDs. But each GET would be capped at a single HDD's throughput, so the system tops out at max HDD throughput * number of concurrent GETs. S3 is not so limited, and that's interesting and non-obvious!
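A tiny illustration of the per-GET point, with assumed numbers rather than S3 measurements: a replicated GET streams the whole object from one disk, while a k-of-n erasure-coded GET streams k smaller shards from k disks in parallel.

```python
# Per-GET throughput comparison. HDD_MBPS and K are illustrative assumptions.
HDD_MBPS = 150              # assumed sequential throughput of a single HDD
K = 5                       # shards needed to reconstruct, per the article's 5:9 example

replicated_get_mbps = HDD_MBPS            # bounded by one disk
erasure_coded_get_mbps = K * HDD_MBPS     # k disks stream concurrently (ignoring overheads)
print(f"replicated GET ≈ {replicated_get_mbps} MB/s, "
      f"erasure-coded GET ≈ {erasure_coded_get_mbps} MB/s")
```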
A few factual inaccuracies in here that don't affect the general thrust. For example, the claim that S3 uses a 5:9 sharding scheme. In fact they use many different sharding schemes, and iirc 5:9 isn't one of them.
The main reason is that a ratio of 1.8 physical bytes per logical byte is awful for HDD costs. You can get that down significantly, and you get wider parallelism and better availability guarantees to boot (consider: if a whole AZ goes down, how many shards can you lose before an object is unavailable for GET?).
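For the parenthetical: with 5:9 across 3 AZs, losing one AZ plus one more shard still leaves the 5 needed, so it takes one AZ plus two shards to make an object unreadable, as worked through in the sketches above. To put a rough number on "awful for HDD costs", here is a sketch with a hypothetical raw-disk price (my assumption, not an AWS figure); even a small reduction in the ratio is a large absolute saving at exabyte scale.

```python
# Rough cost illustration with a hypothetical raw-HDD price, showing how the
# physical:logical ratio drives HDD cost.
ASSUMED_USD_PER_RAW_TB = 15.0     # hypothetical bulk HDD cost, not an AWS figure
LOGICAL_TB = 1_000_000            # one exabyte of customer data, expressed in TB

for scheme, ratio in (("5:9", 9 / 5), ("6:9", 9 / 6), ("11:18", 18 / 11)):
    cost = LOGICAL_TB * ratio * ASSUMED_USD_PER_RAW_TB
    print(f"{scheme}: {ratio:.2f}x raw storage -> ${cost:,.0f} per logical EB")
```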
This project is an enhanced reader for Y Combinator's Hacker News: https://news.ycombinator.com/.
The interface also allows you to comment, post, and interact with the original HN platform. Credentials are stored locally and are never sent to any server; you can check the source code here: https://github.com/GabrielePicco/hacker-news-rich.
For suggestions and feature requests, you can write to me here: gabrielepicco.github.io