Maintaining large-scale AI capacity at Meta

2024-06-16 22:18 10659 engineering.fb.com

Meta is currently operating many data centers with GPU training clusters across the world. Our data centers are the backbone of our operations, meticulously designed to support the scaling demands of compute and storage. A year ago, however, as the industry reached a critical inflection point due to the rise of artificial intelligence (AI), we recognized that to lead in the generative AI space we’d need to transform our fleet. 

Our increased focus on AI was driven both by its growing role in business outcomes and by the huge growth in the computational needs of these workloads. In addition to wider use of traditional AI for things like ad targeting, we have also seen increasing numbers of large generative AI models that mimic almost-human intelligence in everything from human verbal interaction to the creation of pictures and other media. These models are huge, with trillions of training parameters, and training them requires vast resources. 

In this process, we’ve built one of the world’s largest AI training infrastructures, and it has been growing exponentially over the last few years. Meta’s training infrastructure comprises dozens of AI clusters of varying sizes, with a plan to scale to 600,000 GPUs in the next year. It runs thousands of training jobs every day from hundreds of different Meta teams. Training job characteristics vary greatly, too. Jobs can be as small as a single GPU running for a couple of minutes, while generative AI jobs can have trillions of parameters and often span thousands of hosts that need to work together and are very sensitive to interruptions. In addition, training jobs are tied much more closely to the hardware than traditional workloads, and that hardware varies greatly. Meta runs different types of backend networks, topologies, and training jobs that have tight dependencies between software and hardware components. 

This transition has not been without its challenges. We had to reconfigure the fleet without disrupting our hypergrowth, a task akin to rebuilding an airplane mid-flight. This pushed us to innovate and collaborate with vendors and utility companies to create a supportive ecosystem. In this blog we will discuss only one of these transformations. We will describe how Meta is maintaining these training clusters and what sets us apart from the average AI environment. And what do we mean by maintaining? Basically, any kind of operation that updates or verifies software and firmware components in the clusters, including the networking path. 

The main characteristics of GPU training

GPU training has some demanding characteristics:

  • Capacity guarantees: While some training jobs can be paused, a lot of Meta jobs are time-critical and recurring or online. This means we cannot take large amounts of capacity offline by default.
  • Bad hosts are very bad: Since many jobs require all hosts to be synchronized, bad hosts that are a bit slower, have non-fatal hardware issues, or have networking issues are extremely damaging.
  • Low interruption rate: Since many hosts work with each other on a shared problem, AI training jobs are sensitive to interruptions. 
  • Rollout safety: The AI software stack is deep, and problems are often hard to pinpoint, so we need to be careful when rolling out new components.
  • Host consistency: AI training jobs are in general cross-host, and while outside of the CUDA version there are rarely hard incompatibilities, we have learned that cluster consistency is highly important for debugging and SEV avoidance. 

What’s special about Meta’s GPU training?

Meta uses bespoke training hardware with the newest chips possible and high-performance backend networks that are highly speed optimized. We also try to stay as current and flexible as possible with the software stack; in the event of firmware upgrades, this allows us to utilize new features or reduce failure rates. 

Together this means we have more than:

  • 30 maintenance operations
  • 50 different components that are updated 
  • Three different host-verification tasks to ensure optimal performance and stability
  • Thousands of disruptive AI host tasks every day 

And we need to do them safely, while guaranteeing capacity. After all, our training clusters are also used flexibly to run a wide variety of workloads, from single-host to some of the biggest training jobs in the world, and from offline tasks to jobs that need to be up and running 24/7.

An overview of different maintenance rollouts happening on Meta capacity over time with overlapping durations.

Given the variety of upgrades, we have a large number of overlapping in-flight changes at any given time, including some that are continuously being applied, such as verification tasks. Accepting this gives Meta the flexibility we need in using cutting-edge hardware, scaling our infrastructure, and using both in flexible ways. In smaller environments it is often possible to keep clusters in a consistent state and upgrade the whole cluster and all of its firmware and software components in the same maintenance window. Doing this in a large, diverse environment like Meta, however, would introduce big risks and be operationally infeasible. Instead, we ensure components are compatible with each other and roll out component upgrades in a sliding fashion. This approach also allows us to guarantee capacity availability. 

Maintenance trains

Meta maintains capacity by using maintenance trains, which involves shutting down small amounts of capacity in a cyclic fashion.

Outside of special cases, Meta maintains its fleet of clusters using a technique called maintenance trains. This is used for all capacity, including compute and storage capacity. A small number of servers are taken out of production and maintained with all applicable upgrades. Trains provide the guarantee that all capacity minus one maintenance domain is up and running 24/7, thus providing capacity predictability. This is mandatory for all capacity that is used for online and recurring training.

Maintenance trains pick up any new upgrade and complete a full-visit cycle within a guaranteed timeframe. Longer-running upgrades can have lower rollout guarantees and may be scheduled to be applied across multiple cycles. So you can have many overlapping upgrades, and, if beneficial, upgrades can be aligned. 

For AI capacity, we have optimized domains that allow for different kinds of AI capacity, very strict SLOs, and a contract with services that allows them to avoid maintenance-train interruptions, if possible. 
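
To make the train mechanics concrete, here is a minimal Python sketch. It is not Meta's implementation: it assumes a hypothetical mapping of maintenance domains to hosts and a per-host dictionary of component versions, and the names run_train_cycle, domains, and all version strings are illustrative. The loop drains exactly one domain at a time, applies every pending upgrade during that visit, and reports the lowest number of hosts that stayed in production during the cycle.

```python
def run_train_cycle(domains, host_versions, target_versions):
    """Visit every maintenance domain once, one domain at a time.

    domains:         dict of domain name -> list of host names (hypothetical layout)
    host_versions:   dict of host -> dict of component -> installed version
    target_versions: dict of component -> version the train should apply
    Returns the minimum number of hosts in production at any point in the cycle.
    """
    total_hosts = sum(len(hosts) for hosts in domains.values())
    min_in_production = total_hosts
    for hosts in domains.values():
        # Drain exactly one domain; every other domain keeps serving.
        in_production = total_hosts - len(hosts)
        min_in_production = min(min_in_production, in_production)
        for host in hosts:
            # Pick up all applicable upgrades during the same visit.
            host_versions[host].update(target_versions)
        # The domain returns to production before the next one is drained.
    return min_in_production


# Illustrative data: three domains of two hosts each.
domains = {"d0": ["h0", "h1"], "d1": ["h2", "h3"], "d2": ["h4", "h5"]}
host_versions = {
    host: {"firmware": "1.0", "driver": "old"}
    for hosts in domains.values()
    for host in hosts
}
floor = run_train_cycle(domains, host_versions, {"firmware": "1.1", "driver": "new"})
print(floor)  # capacity floor: all hosts minus one maintenance domain
```

In this toy model the capacity floor is always the total host count minus the largest domain, which mirrors the "all capacity minus one maintenance domain" guarantee described above.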

Gradual rollouts

An illustration of the distinction between higher-level components of the AI stack, such as the CUDA drivers, and the lower-level components involved in the training job.

Because of the scale of our infrastructure, we had to ensure that all disruptive rollouts outside of special cases happen in a gradual fashion. This means different servers in a cluster can run a different host stack for a short period of time. This is quite normal in traditional capacity but challenging in AI training, since AI jobs are very closely tied to the hardware. 

At Meta, we upgrade lower-level components in a gradual fashion while ensuring that each job runs on a consistent stack. The AI job itself, which includes the CUDA library, is always consistent. This distinction is necessary because lower-level components often require hours to install and configure or require rebooting the host, while higher-level components in the job container itself can be restarted fluidly.

This sounds simple, but because of the tight integration of AI with hardware, we have needed to do a lot of development, including careful testing on all lower levels, special monitoring, and tight work with vendors.

By and large, this has been very successful. The AI stack in general has matured a lot over the past three years. We also added tooling for rare compatibility-breaking upgrades. 

Selecting the correct maintenance domains 

Maintenance domains are sized by trading off the amount of buffer-reserved capacity (smaller domains are better) against the number of interruptions we cause to training jobs (bigger domains are better).

One way to ensure optimal AI performance was to work with AI teams to design the optimal size of maintenance domains. A maintenance domain is the percentage of capacity we take down in one go, and selecting the optimal size is a function of both the cost of interruptions and the capacity that is lost during the maintenance duration. Since interruption costs are high for AI jobs, optimizing this relationship allowed us to significantly reduce the maintenance overhead for AI capacity.
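
The tradeoff can be illustrated with a toy cost model. This is a sketch under stated assumptions, not the model Meta uses: it assumes the buffer cost grows linearly with the domain size, that a job's hosts are packed into as few domains as possible, and that every domain visit touching the job interrupts it once. The function names and all cost values are hypothetical.

```python
import math


def maintenance_cost(domain_hosts, job_hosts, buffer_cost_per_host, interruption_cost):
    """Cost of one full train cycle for a single job under the toy model."""
    # Reserved buffer must cover one whole domain being offline.
    buffer_cost = buffer_cost_per_host * domain_hosts
    # Each domain that overlaps the job interrupts the job once per cycle.
    interruptions = math.ceil(job_hosts / domain_hosts)
    return buffer_cost + interruption_cost * interruptions


def best_domain_size(candidates, job_hosts, buffer_cost_per_host, interruption_cost):
    """Pick the candidate domain size with the lowest toy cost."""
    return min(
        candidates,
        key=lambda size: maintenance_cost(
            size, job_hosts, buffer_cost_per_host, interruption_cost
        ),
    )


sizes = [16, 64, 256, 1024]
# Cheap interruptions favor a smaller domain (less reserved buffer).
print(best_domain_size(sizes, job_hosts=1024, buffer_cost_per_host=1.0, interruption_cost=10))
# Expensive interruptions, as with large AI jobs, favor a bigger domain.
print(best_domain_size(sizes, job_hosts=1024, buffer_cost_per_host=1.0, interruption_cost=200))
```

With these made-up numbers the preferred domain grows from 64 to 256 hosts once interruptions become expensive, which is the direction of the effect described above.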

OpsPlanner: Meta’s disruptive-work orchestrator

Critical to AI capacity are the consistency requirements. For example, if you want to move to a new CUDA version, you may need all of the capacity on a new driver version. This becomes really difficult in an environment with thousands of hosts and lots of planned and unplanned operations that may overlap with each other. To do this safely and guarantee hosts have the correct upgrades applied before entering production, Meta has unified these operations in the OpsPlanner work orchestrator. Not only can it work on overlapping scopes of operations and correctly serialize them, it also safely takes hosts out of and back into production. In addition, it has a built-in handover flow that ensures correct escalation behavior and avoids overlaps and deadlocks. OpsPlanner also owns the planned-maintenance and failure buffers and safeguards them. Furthermore, it’s highly effective and efficient: OpsPlanner currently handles a million operations per day. 

Example scenarios illustrating the disruptive-work scheduling that OpsPlanner needs to handle to ensure host consistency.
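
Serializing operations with overlapping scopes can be sketched as follows. OpsPlanner's internals are not described here, so this is only an illustration of the general idea: the Operation type, the serialize function, and the notion of "waves" are assumptions made for the example. Operations are processed in submission order, and each one is placed in the earliest wave in which all of its hosts are free, so two operations that share a host never run at the same time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    name: str
    hosts: frozenset  # scope of the operation (hypothetical host names)


def serialize(operations):
    """Map each operation name to a wave index; shared hosts force later waves."""
    host_next_free = {}  # host -> first wave in which the host is free again
    schedule = {}
    for op in operations:
        # The operation must wait for the busiest host in its scope.
        wave = max((host_next_free.get(h, 0) for h in op.hosts), default=0)
        schedule[op.name] = wave
        for h in op.hosts:
            host_next_free[h] = wave + 1
    return schedule


ops = [
    Operation("firmware", frozenset({"h1", "h2"})),
    Operation("driver", frozenset({"h2", "h3"})),
    Operation("kernel", frozenset({"h4"})),
    Operation("verify", frozenset({"h1", "h3"})),
]
print(serialize(ops))
# {'firmware': 0, 'driver': 1, 'kernel': 0, 'verify': 2}
```

Here "kernel" can run alongside "firmware" because their scopes are disjoint, while "driver" and "verify" are pushed to later waves because they share hosts with earlier operations.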

Safety and failure scenarios 

Meta has a deep stack of safety features that includes:

  • Autostop of maintenance trains if maintenance or failure buffers are exhausted;
  • Automatic offboarding of failing upgrades; and
  • Rollout phases for upgrades, so that only well-tested changes reach global systems.

If something does go wrong, however, we can react quickly, depending on the needed fix, with emergency trains, large-scale maintenance for breaking upgrades, and more.
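
The autostop rule in the first bullet can be sketched as a simple guard that runs before the train drains its next domain. This is an illustrative assumption about how such a check might look, not Meta's code: the function name, its parameters, and the exact thresholds are hypothetical.

```python
def train_may_proceed(domain_size, in_maintenance, failed,
                      maintenance_buffer, failure_buffer):
    """Return True if the next domain may be drained, False to autostop."""
    failure_headroom = failure_buffer - failed
    maintenance_headroom = maintenance_buffer - in_maintenance
    if failure_headroom <= 0:
        # Unplanned failures have exhausted their buffer: stop the train.
        return False
    # Only proceed if the whole next domain fits in the maintenance buffer.
    return domain_size <= maintenance_headroom


print(train_may_proceed(8, in_maintenance=0, failed=2,
                        maintenance_buffer=8, failure_buffer=5))  # True
print(train_may_proceed(8, in_maintenance=0, failed=5,
                        maintenance_buffer=8, failure_buffer=5))  # False
```

Keeping the two buffers separate in the check means a burst of unplanned failures stops planned work without the train ever borrowing from the failure buffer.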

Rapidly moving to the future of generative AI

At Meta, we believe in moving fast and learning by doing. Rapid innovation is in our ethos. This is what fundamentally shaped our journey as we continually innovated towards building the foundational infrastructure that makes us leaders in generative AI. We will remain dedicated to creating technologies that not only benefit Meta but also have a positive impact on society as a whole. 

As we move forward, we invite you to join us on this journey. Together, we can shape a future where AI is not just a tool but a force for good, transforming industries, empowering individuals, and creating a more sustainable world.

The best is yet to come, and we are excited to pioneer tomorrow’s possibilities in generative AI.


Comments

  • By nomilk 2024-06-17 0:36 (6 replies)

    Meta is going hard into AI (both hardware and software), which is great to see. Something that's not super obvious is what specific features of existing apps require AI, that is, how will Meta get return on investment?

    Two uses I can think of are i) text and image content moderation on fb and instagram (won't need as many human reviewers if bots are as/more effective), and ii) chatbots for businesses (businesses could provide their business documentation to a meta LLM which could handle customer inquiries via messenger and whatsapp).

    Anything else?

    • By aprilthird2021 2024-06-17 0:47

      Meta actually has a whole separate, existing AI research and use case for targeting ads that has seen much better results as their AI capabilities have improved. I don't think gen AI is used for this in the way most commenters think, but the improvements in AI architecture / infra, training, etc. are all helpful to the AI which helps ad targeting while simultaneously building more powerful Gen AI

    • By candiddevmike 2024-06-17 0:40 (1 reply)

      Fake profiles to boost engagement/DAU? Grandma is lonely on FB now that no one is on there anymore.

      • By aprilthird2021 2024-06-17 0:53

        This isn't it. Actually FB, the app specifically, is having an unexpected uptick in users in the younger demographics, partly from their strategy of using FB marketplace and other ancillary services to bring people back to the main app.

        EDIT: Also, it's very easy to create fake profiles on FB, and other people do it all the time. Meta don't need to do it themselves

    • By dudus 2024-06-17 2:18

      Meta is building and scaling the infrastructure for AI and then they can resell that at a premium later. All these articles do is highlight the challenges of rolling out your own infra. If Meta solves these issues, it becomes a reference and gets a bigger piece of the AI pie.

      They want to become the AI backend for the Fortune 500

    • By ldjkfkdsjnv 2024-06-17 0:37 (1 reply)

      Their whole advertising business model gets better with LLM understanding of text. They can target ads better.

      • By Mehdi2277 2024-06-17 0:53 (1 reply)

        This is a fair guess on intuition, but working in the recommender space on both content and ad recommendation, content understanding signals have pretty consistently, across two companies and many projects, tended to underwhelm. The key signals are generally engagement signals (including event sequences) and many embeddings (user embedding, creator embedding, ad embedding, etc).

        The main place I’ve seen content understanding help is cold start, especially for new items by new creators.

        • By xwolfi 2024-06-17 3:26

          And you wonder sometimes if the products being advertised, themselves, couldn't matter more than the targets reading them. We've learned to recognize crappy offering and AI can try to make me read more and more ads relevant to what I'm saying, if it's a crap product, I won't pay anyway :(

    • By xyzzy4747 2024-06-17 1:19

      They could make customer facing support bots for every business with a Facebook page

    • By mistermann 2024-06-17 0:49

      Generative environments on demand for their VR goggles seems like something that could drive net new revenue some day, both hardware and subscriptions. If they can find a PR safe way to provide adult content that would be even better.

      Or, virtual conversational humans for their boomer userbase could be a hit.

  • By joaquincabezas 2024-06-16 23:14 (1 reply)

    Even if this scale is massive and second-to-none, it’s funny how some issues are the same for all of us. In particular “Bad hosts are very bad” aka “a chain is only as strong as its weakest link” can happen with as little as a few (4) machines, and then ruin your day.

    • By loeg 2024-06-17 1:12 (2 replies)

      This is really unique to the AI training clusters for reasons I'm not super clear on. Most other types of horizontally scaled workloads can sort of tolerate a slightly underperforming host, or hosts going bad every so often, with little P99/P99.9 impact. For some reason, AI training workloads really cannot.

      • By chaos_emergent 2024-06-17 1:44 (1 reply)

        When training a large model you're doing a single forward pass + backprop over multiple Infiniband-connected nodes for a single model instance, so if one node goes down it takes a logical unit of nodes down with it. For reference, GPT-4 was rumored to be around 1.7T, and doing some back-of-the-hand math[1], that's like 500-700 H100 GPUs per model instance, which means you need a multiple of that for any training parallelism whatsoever.

        [1] back-of-the-hand-math: 1.7T * 4 bytes = 6.8 TB; 3-4x that for activation + gradients = 27.2 TB; 27.2TB / (80GB / H100) = 349 H100s; 1.5-2x conservative multiplier accounting for not fully using node resources + memory overhead in the machine = ~500-700 H100s.

        truly insane numbers.

        • By furyofantares 2024-06-17 2:12 (1 reply)

          That trillion+ parameter count is the sum of each of the "experts", right?

          • By oersted 2024-06-17 7:23

            The ever-circulating rumour is 1.7T - 1.8T for the whole thing. But it is not very substantiated, mostly started by SemiAnalysis and geohot based on rather loose speculation (such as API latency and price), and not much solid evidence to confirm it after that.

            And of course, it must have changed substantially with GPT-4-Turbo and GPT-4o. It would make sense if the cost reduction was larger than the price reduction, they probably have a higher profit margin now, and the price reduction has been very significant since GPT-4 release.

      • By eugenhotaj 2024-06-17 1:29 (1 reply)

        This is because everyone is training with synchronous sgd. all gpus need to synchronize on each gradient step so tail latency will kill you.

        • By Mehdi2277 2024-06-17 2:38

          I’ve worked at companies with async training. Async training does help on fault tolerance and also can assist with training throughput by being less reliant on the slowest machine. It does add meaningful training noise, and when we did experiments against sync training we got much more stable results with sync training. Some of our less stable models would even sometimes have loss explosions/divergence issues with async training but be fine with sync training.

          Although even for async training, generally I see the dataset just sharded, and if a worker goes down then its shard of data may be lost/skipped, rather than some kind of smarter dynamic file assignment that factors in when workers go down. Even basic things like "job fails, continue from last checkpoint with the same dataset state" are messy for a large epoch when major libraries like tensorflow lack a good dataset checkpointing mechanism.

  • By realreality 2024-06-16 23:52 (2 replies)

    I’m old enough to remember when companies were eager to claim that their data centers (or some aspect) were finally “carbon neutral”.

    Now, with the enormous data center growth for AI purposes, companies don’t even bother pretending that any of this is sustainable.

    At best, they might delude themselves into believing that a glorified text autocomplete program will magically solve the world’s problems, including the unsustainability of the machines running the program.

    • By 123yawaworht456 2024-06-17 0:05 (3 replies)

      we could exist without any of the modern conveniences. let's tear down the electric grid and return to monke.

      • By AYBABTME 2024-06-17 0:20 (1 reply)

        We're way past that. Global warming requires (or will, soon enough) heat pumps for survival in many regions of the world. Plenty of regions require large amount of electricity for life critical functions. Degrowth isn't the answer.

        • By elcomet 2024-06-17 0:56 (3 replies)

          Or alternatively, people will need to move to colder regions.

          • By FuckButtons 2024-06-17 5:24

            Yes, they will, which will drive conflict, xenophobia and economic destabilization in the countries those people move to, which will exacerbate global political tensions and probably wind up with us all getting wiped out in a nuclear conflagration sometime before the century is over, so we might as well have really nice autocomplete before we get there.

          • By KoolKat23 2024-06-17 8:57

            Driving up concrete production, a significant carbon contributor as it stands (low carbon concrete is still in its infancy).

          • By killingtime74 2024-06-17 1:07

            Or there are just less people (as we see in developed countries birthrates)

      • By dorkwood 2024-06-17 5:40

        Could you really survive without trucks delivering food to stores in your city? I couldn't.

      • By realreality 2024-06-17 0:20

        That’s right (even though you’re probably being facetious).

    • By bloodyplonker22 2024-06-17 3:35 (1 reply)

      It's just that investors are now completely over the whole "DEI and ESG" alphabet investing type of phase after seeing that it has not helped companies produce returns at all.

      • By Spivak 2024-06-17 15:12

        ESGs are still going strong, the whole point is potentially accepting lower returns in exchange for voting with your wallet on what companies you support. Investors, the people who collect money for a living, have never been the target for ESGs.

HackerNews