Amazon Web Services (AWS) strives to deliver reliable services that customers can trust completely. This requires maintaining the highest standards of security, durability, integrity, and availability—with systems correctness serving as the cornerstone for achieving these priorities. An April 2015 article in Communications of the ACM, “How Amazon Web Services Uses Formal Methods,” described our approach to ensuring the correctness of critical services that have since become among the most widely used by AWS customers.21
Central to this approach was TLA+,14 a formal specification language developed by Leslie Lamport. Our experience at AWS with TLA+ revealed two significant advantages of applying formal methods in practice. First, we could identify and eliminate subtle bugs early in development—bugs that would have eluded traditional approaches such as testing. Second, we gained the deep understanding and confidence needed to implement aggressive performance optimizations while maintaining systems correctness.
Fifteen years ago, AWS’s software testing practice relied primarily on build-time unit testing, often against mocks, and limited deployment-time integration testing. Since then, we have significantly evolved our correctness practices, integrating both formal and semi-formal approaches into the development process. As AWS has grown, formal methods have become increasingly valuable—not only for ensuring correctness but also for enabling performance improvements, particularly by verifying the correctness of both low- and high-level optimizations. This systematic approach to systems correctness has become a force multiplier at AWS’s scale, improving developer velocity while delivering more cost-effective services to customers.
This article surveys the portfolio of formal methods used across AWS to deliver complex services with high confidence in their correctness. We use an umbrella definition of formal methods that encompasses a range of rigorous techniques—from traditional formal approaches (such as theorem proving,7,10 deductive verification,18 and model checking8,14) to more lightweight semi-formal approaches (such as property-based testing,6,19 fuzzing,9 and runtime monitoring11).
As the use of formal methods expanded beyond the initial teams at AWS in the early 2010s, we discovered that many engineers struggled to learn and become productive with TLA+. This difficulty seemed to stem from TLA+’s defining feature: It is a high-level, abstract language that more closely resembles mathematics than the imperative programming languages most developers are familiar with. While this mathematical nature is a significant strength of TLA+, and we continue to agree with Lamport’s views on the benefits of mathematical thinking,15 we also sought a language that would allow us to model check (and later prove) key aspects of systems design while being more approachable to programmers.
We found this balance in the P programming language.8 P is a state-machine-based language for modeling and analysis of distributed systems. Using P, developers model their system designs as communicating state machines, a mental model familiar to Amazon’s developer population—most of whom develop systems based on microservices and service-oriented architectures (SOAs). P has been developed at AWS since 2019 and is maintained as a strategic open source project.22 Teams across AWS that build some of its flagship products—from storage (for example, Amazon S3, EBS), to databases (for example, Amazon DynamoDB, MemoryDB, Aurora), to compute (for example, EC2, IoT)—have been using P to reason about the correctness of their system designs.
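P’s own state-machine syntax is beyond the scope of this article, but the mental model it is built around is easy to sketch. The following minimal Rust sketch illustrates that model—a client and a server as communicating state machines exchanging request and response events—with machine and event names that are purely illustrative, not P code or the P API.

```rust
use std::sync::mpsc;
use std::thread;

// Messages exchanged between the two machines.
enum Event {
    Request { reply_to: mpsc::Sender<Event> },
    Response,
}

// Server machine: sits in a single state and answers each request.
fn server(inbox: mpsc::Receiver<Event>) {
    while let Ok(Event::Request { reply_to }) = inbox.recv() {
        let _ = reply_to.send(Event::Response);
    }
}

// Client machine: sends a request, then waits until the response arrives.
fn client(server_tx: mpsc::Sender<Event>) {
    let (reply_tx, reply_rx) = mpsc::channel();
    server_tx
        .send(Event::Request { reply_to: reply_tx })
        .unwrap();
    match reply_rx.recv().unwrap() {
        Event::Response => println!("client: response received, done"),
        Event::Request { .. } => unreachable!("client never receives requests"),
    }
}

fn main() {
    let (server_tx, server_rx) = mpsc::channel();
    let server_handle = thread::spawn(move || server(server_rx));
    client(server_tx); // dropping server_tx afterward shuts the server down
    server_handle.join().unwrap();
}
```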
For example, P was used in migrating Simple Storage Service (S3) from eventual to strong read-after-write consistency.1 A key component of S3 is its index subsystem, an object metadata store that enables fast data lookups. To achieve strong consistency, the S3 team had to make several nontrivial changes to the S3 index protocol stack.25 Because these changes were difficult to get right at S3 scale, and the team wanted to deliver strong consistency with high confidence in correctness, they used P to formally model and validate the protocol design. P helped eliminate several design-level bugs early in the development process and allowed the team to deliver risky optimizations with confidence, as they could be validated using model checking.
In 2023, the P team at AWS built PObserve, a new tool for validating the correctness of distributed systems both during testing and in production. PObserve takes structured logs from the execution of a distributed system and validates post hoc that they match behaviors allowed by the formal P specification of the system. This bridges the gap between the P specification of the system design and the production implementation (typically in languages like Rust or Java). While there are significant benefits to verifying protocols at design time, runtime monitoring of the same properties for the implementation makes the investment in formal specification much more valuable and addresses a classic concern with deploying formal methods in practice: connecting design-time validation with the system implementation.
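To illustrate the idea behind this style of post hoc trace validation, here is a minimal Rust sketch: structured log events are replayed through a reference transition relation, and any transition the specification does not allow is flagged. The states, events, and protocol below are hypothetical stand-ins, not PObserve’s actual interface.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum State { Idle, Prepared, Committed, Aborted }

#[derive(Clone, Copy, Debug)]
enum Event { Prepare, Commit, Abort }

// The specification's transition relation: which events are legal in
// which states, and where they lead.
fn step(state: State, event: Event) -> Option<State> {
    match (state, event) {
        (State::Idle, Event::Prepare) => Some(State::Prepared),
        (State::Prepared, Event::Commit) => Some(State::Committed),
        (State::Prepared, Event::Abort) => Some(State::Aborted),
        _ => None, // any other transition violates the spec
    }
}

// Replay a log of events and flag the first disallowed transition.
fn validate(trace: &[Event]) -> Result<State, String> {
    let mut state = State::Idle;
    for (i, &event) in trace.iter().enumerate() {
        state = step(state, event).ok_or_else(|| {
            format!("log entry {i}: event {event:?} not allowed in state {state:?}")
        })?;
    }
    Ok(state)
}

fn main() {
    // A healthy trace, and one that violates the protocol.
    assert_eq!(validate(&[Event::Prepare, Event::Commit]), Ok(State::Committed));
    assert!(validate(&[Event::Commit]).is_err());
}
```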
Another way that AWS has brought formal methods closer to its engineering teams is through the adoption of lightweight formal methods.
Property-based testing. The most notable example of leveraging lightweight formal methods is in Amazon S3’s ShardStore, where the team used property-based testing throughout the development cycle both to test correctness and to speed up development (described in detail by Bornholt et al.4). The key idea in their approach was combining property-based testing with developer-provided correctness specifications, coverage-guided fuzzing (where the distribution of inputs is guided by code-coverage metrics), failure injection (where hardware and other system failures are simulated during testing), and minimization (where counterexamples are automatically reduced to aid human-guided debugging).
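To give a flavor of this style of testing, here is a minimal Rust sketch using the open source proptest crate. The toy key-value store and the round-trip property are illustrative stand-ins, not ShardStore’s actual code or specification; note that proptest also performs the automatic counterexample minimization described above.

```rust
use std::collections::HashMap;
use proptest::prelude::*;

// A toy key-value store standing in for the system under test.
#[derive(Default)]
struct Store {
    map: HashMap<String, Vec<u8>>,
}

impl Store {
    fn put(&mut self, key: &str, value: Vec<u8>) {
        self.map.insert(key.to_string(), value);
    }
    fn get(&self, key: &str) -> Option<&Vec<u8>> {
        self.map.get(key)
    }
}

proptest! {
    // Developer-provided correctness specification: after a put, a get of
    // the same key returns the value most recently written. proptest
    // generates many random keys and values and shrinks any failure.
    #[test]
    fn put_then_get_roundtrips(
        key in "[a-z]{1,8}",
        value in prop::collection::vec(any::<u8>(), 0..64),
    ) {
        let mut store = Store::default();
        store.put(&key, value.clone());
        prop_assert_eq!(store.get(&key), Some(&value));
    }
}
```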
Deterministic simulation. Another lightweight method widely used at AWS is deterministic simulation testing, in which a distributed system is executed on a single-threaded simulator with control over all sources of randomness, such as thread scheduling, timing, and message delivery order. Tests are then written for particular failure or success scenarios, such as the failure of a participant at a particular stage in a distributed protocol. The nondeterminism in the system is controlled by the test framework, allowing developers to specify orderings they believe are interesting (such as ones that have caused bugs in the past). The scheduler in the testing framework can also be extended to fuzz orderings or to exhaustively explore all possible orderings.
Deterministic simulation testing moves testing of system properties, like behavior under delay and failure, from integration time to build time. This accelerates development and provides more complete behavioral coverage during testing. Some of the work done at AWS on build-time testing of thread ordering and system failures has been open sourced as part of the shuttlea and turmoilb projects.
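Here is a minimal sketch of build-time deterministic simulation using the open source shuttle crate mentioned above. Shuttle takes control of thread scheduling and explores many interleavings deterministically; the lock-protected counter is an illustrative stand-in for a real system.

```rust
use shuttle::sync::{Arc, Mutex};
use shuttle::thread;

fn main() {
    // Explore 1,000 randomly chosen (but reproducible) schedules.
    shuttle::check_random(|| {
        let counter = Arc::new(Mutex::new(0u32));
        let handles: Vec<_> = (0..2)
            .map(|_| {
                let counter = Arc::clone(&counter);
                thread::spawn(move || {
                    *counter.lock().unwrap() += 1;
                })
            })
            .collect();
        for handle in handles {
            handle.join().unwrap();
        }
        // The invariant must hold on every explored schedule; a failing
        // schedule can be replayed deterministically from its seed.
        assert_eq!(*counter.lock().unwrap(), 2);
    }, 1000);
}
```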
Continuous fuzzing or random test-input generation. Continuous fuzzing, especially coverage-guided scalable test-input generation, is also effective for testing systems correctness at integration time. During the development of Amazon Aurora’s data-sharding feature (Aurora Limitless Database3), for example, we made extensive use of fuzzing to test two key properties of the system. First, by fuzzing SQL queries (and entire transactions), we validated that the logic partitioning SQL execution over shards is correct. Large volumes of random SQL schemas, datasets, and queries were synthesized and run through the engines under test, and the results were compared with an oracle based on the non-sharded version of the engine (as well as with other approaches to validation, like those pioneered by SQLancer23).
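The structure of such a differential-testing harness can be sketched as follows. The Engine trait, the tiny query grammar, and the stub engine are hypothetical stand-ins, not Aurora’s actual test harness.

```rust
// Tiny deterministic PRNG so failures are reproducible from a seed.
struct Lcg(u64);

impl Lcg {
    fn next(&mut self, bound: usize) -> usize {
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        (self.0 >> 33) as usize % bound
    }
}

// Any SQL engine we can execute a query against and get rows back.
trait Engine {
    fn execute(&self, sql: &str) -> Vec<Vec<String>>;
}

// Synthesize a random query over a known schema from a tiny grammar.
fn random_query(rng: &mut Lcg) -> String {
    let cols = ["id", "price", "qty"];
    let col = cols[rng.next(cols.len())];
    format!("SELECT {col} FROM orders WHERE {col} > {} ORDER BY {col}", rng.next(1000))
}

// One fuzzing iteration: the sharded engine under test must agree with
// the non-sharded oracle on every generated query.
fn fuzz_once(sharded: &dyn Engine, oracle: &dyn Engine, rng: &mut Lcg) -> Result<(), String> {
    let sql = random_query(rng);
    if sharded.execute(&sql) != oracle.execute(&sql) {
        // A real harness would now minimize the failing query.
        return Err(format!("engines diverge on: {sql}"));
    }
    Ok(())
}

// Stub engine so the sketch runs; a real harness drives actual engines.
struct DummyEngine;
impl Engine for DummyEngine {
    fn execute(&self, _sql: &str) -> Vec<Vec<String>> {
        Vec::new()
    }
}

fn main() {
    let mut rng = Lcg(42);
    for _ in 0..1000 {
        fuzz_once(&DummyEngine, &DummyEngine, &mut rng).expect("divergence found");
    }
}
```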
Second, fuzzing combined with fault-injection testing is useful for testing other aspects of database correctness, such as atomicity, consistency, and isolation. In database testing, transactions are automatically generated, their correct behavior is defined using a formally specified correctness oracle, and then all possible interleavings of transactions (and of statements within each transaction) are executed against the system under test. We also use post hoc validation of properties such as isolation (following approaches such as Elle13).
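The following sketch illustrates the interleaving-exploration idea on a deliberately tiny model: two “transactions” over a single-counter database, every order-preserving interleaving of their statements, and a serializability oracle that accepts only results some serial execution could produce. All of it is illustrative, not our actual database-testing infrastructure.

```rust
// Each statement is modeled as an atomic update to the database state.
type Op = fn(i64) -> i64;

// Enumerate every merge of the two transactions' statements that
// preserves each transaction's internal order.
fn interleavings(a: &[Op], b: &[Op], prefix: &mut Vec<Op>, out: &mut Vec<Vec<Op>>) {
    if a.is_empty() && b.is_empty() {
        out.push(prefix.clone());
        return;
    }
    if let Some((&op, rest)) = a.split_first() {
        prefix.push(op);
        interleavings(rest, b, prefix, out);
        prefix.pop();
    }
    if let Some((&op, rest)) = b.split_first() {
        prefix.push(op);
        interleavings(a, rest, prefix, out);
        prefix.pop();
    }
}

// Execute a schedule against the model, starting from an empty database.
fn run(ops: &[Op]) -> i64 {
    ops.iter().fold(0, |state, op| op(state))
}

fn main() {
    let t1: Vec<Op> = vec![|x| x * 2, |x| x * 2]; // transaction 1
    let t2: Vec<Op> = vec![|x| x + 1];            // transaction 2

    // The serializability oracle: a result is correct only if some
    // serial order (t1;t2 or t2;t1) produces it.
    let serial = [
        run(&[t1.clone(), t2.clone()].concat()),
        run(&[t2.clone(), t1.clone()].concat()),
    ];

    let mut schedules = Vec::new();
    interleavings(&t1, &t2, &mut Vec::new(), &mut schedules);
    for schedule in &schedules {
        let result = run(schedule);
        // Without concurrency control, the middle interleaving yields 2,
        // which no serial order allows; a real harness would flag it.
        println!("result {result}, serializable: {}", serial.contains(&result));
    }
}
```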
In early 2021, AWS launched Fault Injection Service (FIS)2 with the goal of making testing based on fault injection accessible to a wide range of AWS customers. FIS allows customers to inject simulated faults, from API errors to I/O pauses and failed instances, into test or production deployments of their infrastructure on AWS. Injecting faults allows customers to validate that the resiliency mechanisms they have built into their architectures (such as failovers and health checks) actually improve availability and do not introduce correctness problems. Fault-injection testing based on FIS is widely used by AWS customers and internally within Amazon. For example, Amazon.com ran 733 FIS-based fault-injection experiments in preparation for Prime Day 2024.
In 2014, Yuan et al. found that 92% of catastrophic failures in tested distributed systems were triggered by incorrect handling of nonfatal errors. Many distributed-systems practitioners who were told about this research were surprised the percentage was not higher. Catastrophic failures in the happy case are rare simply because the happy path is exercised often, is tested better (both implicitly and explicitly), and is significantly simpler than the error paths. Fault-injection testing and FIS make it much easier for practitioners to test the behavior of their systems under faults and failures, closing the gap between happy-case and error-case bug density.
While fault injection is not considered a formal method, it can be combined with formal specifications. Defining the expected behavior with a formal specification, and then comparing the behavior observed during and after fault injection against that specification, catches many more bugs than simply checking for errors in metrics and logs (or having a human look and say, “Yup, that looks about right”).
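The following minimal sketch illustrates that shape: inject a fault, drive the system, and check the observed behavior against a specified expectation (here, that every acknowledged write remains readable) rather than eyeballing logs. The in-memory store and fault hook are hypothetical stand-ins.

```rust
use std::collections::HashMap;

struct Store {
    data: HashMap<String, String>,
    acked: Vec<String>,       // keys whose writes were acknowledged
    inject_write_fault: bool, // simulated I/O fault
}

impl Store {
    fn write(&mut self, key: &str, val: &str) -> Result<(), ()> {
        if self.inject_write_fault {
            return Err(()); // fault injected: the write must NOT be acknowledged
        }
        self.data.insert(key.to_string(), val.to_string());
        self.acked.push(key.to_string());
        Ok(())
    }
    fn read(&self, key: &str) -> Option<&String> {
        self.data.get(key)
    }
}

// Specification: every acknowledged write remains readable, even across
// injected faults. Unacknowledged writes may or may not survive.
fn check_spec(store: &Store) -> bool {
    store.acked.iter().all(|key| store.read(key).is_some())
}

fn main() {
    let mut store = Store {
        data: HashMap::new(),
        acked: Vec::new(),
        inject_write_fault: false,
    };
    store.write("a", "1").unwrap();
    store.inject_write_fault = true;  // begin fault injection
    let _ = store.write("b", "2");    // fails; must not be acked
    store.inject_write_fault = false; // fault clears
    assert!(check_spec(&store), "spec violated under fault injection");
}
```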
Over the past decade, there has been an emerging interest in a particular class of systems failure: those in which some triggering event (like an overload or a cache emptying) causes a distributed system to enter a state from which it does not recover without intervention (such as reducing load below normal). This class of failures, dubbed metastable failures,5 is one of the most important contributors to unavailability in cloud systems. The figure, adapted from Bronson et al.,5 illustrates a common type of metastable behavior: Load increases on the system are initially met with increasing goodput, followed by saturation, followed by congestion and goodput dropping to zero (or near zero). From there, the system cannot return to a healthy state by slightly reducing load. Instead, it must follow the dotted line and may not recover until load is reduced significantly. This type of behavior is present even in simple systems; for example, it can be triggered in most systems with timeout-and-retry client logic.
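The following toy discrete-time simulation, with purely illustrative constants, shows the pattern: a brief load spike fills the queue, timed-out requests are retried, and retry amplification keeps goodput collapsed even after offered load returns well below capacity.

```rust
fn main() {
    let capacity: u64 = 100;        // requests served per tick
    let within_deadline: u64 = 400; // queue depth the server can drain
                                    // before clients give up
    let mut queue: u64 = 0;
    let mut retries: u64 = 0;

    for tick in 0..40 {
        // Steady load fits within capacity; ticks 10-14 are a 3x spike.
        let offered: u64 = if (10..15).contains(&tick) { 300 } else { 90 };
        queue += offered + retries;

        let served = queue.min(capacity);
        // Work served from a queue deeper than the deadline allows has
        // already timed out at the client, so it no longer counts as goodput.
        let goodput = if queue <= within_deadline { served } else { 0 };
        queue -= served;

        // Timed-out requests leave the queue, but their clients retry,
        // and hedged or repeated retries amplify the re-offered load.
        let timed_out = queue.saturating_sub(within_deadline);
        queue -= timed_out;
        retries = timed_out * 2;

        println!("tick {tick:2}: offered {offered:3}, goodput {goodput:3}, queue {queue:4}, retries {retries:6}");
    }
}
```

Running the simulation shows goodput tracking offered load until the spike, then dropping to zero and staying there after the spike ends, because the queue is kept full by retries of requests that have already timed out.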