In normal operation of EC2, the DWFM maintains a large number (~10^6) of active leases against physical servers and a very small number (~10^2) of broken leases, the latter of which the manager is actively attempting to reestablish.
The DynamoDB outage, which lasted almost 3 hours, caused widespread heartbeat timeouts, and thus thousands of leases broke. I’d estimate that the number of broken leases reached at least 10^5, three orders of magnitude larger than normal.
With a huge queue of leases to reestablish, the DWFM system has two possible transitions. One is a slow, gradual burn down of the lease backlog (recovery). The other is a congestive collapse, where the lease queue remains high (a “sustaining effect”) until manual intervention.
Unfortunately, the DWFM entered congestive collapse.
…, due to the large number of droplets, efforts to establish new droplet leases took long enough that the work could not be completed before they timed out. — AWS Summary
It’s most interesting here to consider why collapse occurred. Was the unit of work completion somehow O(queue depth), i.e. instead of processing one lease at a time, did the DWFM process an all-or-nothing block of leases? Or did the DWFM become a thundering herd and overwhelm a downstream component between it and the Droplets?
The congestive collapse of EC2 could only be resolved by manual intervention: engineers restarted DWFM servers, presumably to drop the in-memory queued lease work and restore goodput in the system.
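The collapse dynamic can be made concrete with a toy model. This is my own sketch, not AWS’s actual algorithm: assume each lease re-establishment attempt takes time proportional to the current backlog, and any attempt exceeding a timeout fails and is re-queued. Under those assumptions, a small backlog drains but a large one is stuck forever, which is exactly the bistability of a metastable failure. The constants (`TIMEOUT`, `PER_LEASE_COST`) are hypothetical.

```python
# Toy model of DWFM-style congestive collapse (an assumed mechanism,
# not AWS's published design): re-establishing a lease takes time
# proportional to the backlog, and attempts that exceed a timeout
# fail and are re-queued, so a large backlog can never drain itself.

TIMEOUT = 50           # hypothetical per-attempt deadline
PER_LEASE_COST = 0.01  # attempt time = backlog * this (work is O(queue depth))

def step(backlog: int) -> int:
    """One round: attempt a lease; succeed only if it beats the timeout."""
    attempt_time = backlog * PER_LEASE_COST
    if attempt_time <= TIMEOUT:
        return backlog - 1  # lease re-established
    return backlog          # timed out: work wasted, lease re-queued

def run(backlog: int, rounds: int) -> int:
    """Drive the queue for a bounded number of rounds."""
    for _ in range(rounds):
        if backlog == 0:
            break
        backlog = step(backlog)
    return backlog

# A normal backlog (~10^2 broken leases) drains on its own...
assert run(100, rounds=10_000) == 0
# ...but a post-outage backlog (~10^6) is stuck: every attempt times out.
assert run(1_000_000, rounds=10_000) == 1_000_000
# The manual fix in the model, as in the outage: drop the queued work.
```

The model exhibits both stable states: below the threshold where `backlog * PER_LEASE_COST` exceeds the timeout, the system recovers; above it, goodput is zero until someone resets the queue.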
For a while now I’ve eagerly followed the public blogging of Marc Brooker, a distinguished engineer at AWS with key contributions to Lambda, EBS, and EC2. I’m sure he’s wired up after this outage, because he has been evangelizing the analysis of metastable failure for years now.
The presence and peril of the metastable failure state has likely been widely known within AWS engineering leadership for around 4 years. And yet it bit them in their darling EC2. I eagerly await Marc’s blog post.
NLB
At around 16:00 UTC in the now internally famous #incident-41-hfm_failures incident channel I began investigating a degradation in our Sandbox product. It turned out that AWS NLB trouble was causing Modal clients to fail to establish a gRPC connection with api.modal.com (which points at NLB load balancers).
The NLB service went down because EC2’s Network Manager, which propagates network configuration when new EC2 instances are created or instances transition state (e.g. stopping), fell behind under the weight of backlogged work.
The most interesting bit of the NLB service outage was that the NLB healthcheck system received bad feedback due to stale network configuration and incorrectly performed AZ failovers.
The alternating health check results increased the load on the health check subsystem, causing it to degrade, resulting in delays in health checks and triggering automatic AZ DNS failover to occur.
Under control systems analysis, the potential for bad feedback is exactly the kind of thing that gets designed for. The healthcheck system behaved exactly as intended given the inputs it received: it was a reliable component interacting unsafely with a broken environment.
In their discussion of remediations, the line item for NLB is about control for bad feedback.
For NLB, we are adding a velocity control mechanism to limit the capacity a single NLB can remove when health check failures cause AZ failover.
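A velocity control like the one AWS describes can be sketched simply. This is an assumed design, not AWS’s published implementation: cap the fraction of a load balancer’s capacity that health-check-driven failover may remove within one evaluation window, so that flapping health checks cannot empty the fleet. The names and the `MAX_REMOVAL_FRACTION` value are hypothetical.

```python
# Sketch of a failover velocity control (assumed design): bound how much
# capacity one round of health-check failures can remove, keeping the
# rest in service even when the health signal itself is untrustworthy.

MAX_REMOVAL_FRACTION = 0.33  # hypothetical cap per evaluation window

def apply_failovers(healthy_targets: list[str], failing: set[str]) -> list[str]:
    """Remove failing targets, but never more than the budget allows;
    removals over budget are deferred (e.g. to a later window or a human)."""
    budget = int(len(healthy_targets) * MAX_REMOVAL_FRACTION)
    removed = 0
    kept = []
    for target in healthy_targets:
        if target in failing and removed < budget:
            removed += 1        # within velocity budget: fail it over
        else:
            kept.append(target)  # over budget: keep serving, flag for review
    return kept

targets = [f"az-{i}" for i in range(6)]
# Bad feedback marks every target unhealthy, but the cap (budget = 1 here)
# keeps 5 of 6 in service rather than failing over the whole fleet.
assert apply_failovers(targets, failing=set(targets)) == targets[1:]
```

The design trade-off is deliberate: when the health signal and the velocity cap disagree, the cap wins, on the theory that serving through a possibly-real failure is safer than removing all capacity on a possibly-false one.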
Humble conclusions
You only get so many major AWS incidents in your career, so appreciate the ones that you get. AWS’s reliability is the best in the world. I have used all major US cloud providers for years now, and until Monday’s us-east-1 outage AWS was clearly the most reliable. This is perhaps their biggest reliability failure in a decade. We should learn something from it.
I disagree with the popular early explanations. The “brain drain” theory has a high burden of proof and very little evidence. It’s possible that brain drain delayed remediation (this was a 14-hour outage), but we can’t see their response timeline. There’s also no evidence that us-east-1, being the oldest region, suffered from its age.3 The systems involved (DynamoDB, DNS Enactor, DWFM) are probably running globally. Also, those suggesting AWS’s reliability has fallen behind its competitors are too hasty: GCP had a severe global outage just last June.
My main takeaway is that in our industry the design, implementation, and operation of production systems still regularly falls short of what we think we’re capable of. DynamoDB hit a fairly straightforward race condition and entered into unrecoverable state corruption. EC2 went into congestion collapse. NLB’s healthcheck system got misdirected by bad data. We’ve seen these before, we’ll see them again. We’re still early with the cloud.
Software systems are far more complex and buggy than we realize. Useful software systems, such as EC2, always operate in a degraded state with dozens of present or latent bugs and faults. The presence of constant safety, of the cherished five 9s, is not a miracle but the product of a very challenging design and operational endeavor.
Monday was a bad day for hyper-scaled distributed systems. But in the long run the public cloud industry will root out its unsound design and operation. Today’s advanced correctness and reliability practices will become normal.
