In February, we experienced six incidents that resulted in degraded performance across GitHub services.

We understand the impact outages have on our customers and are sharing details on the stabilization work we’re prioritizing right now.
Over the past several weeks, GitHub has experienced significant availability and performance issues affecting multiple services. Three of the most impactful incidents happened on February 2, February 9, and March 5.
First and foremost, we take responsibility. We have not met our own availability standards, and we know that reliability is foundational to the work you do every day. We understand the impact these outages have had on your teams, your workflows, and your confidence in our platform.
Here, we’ll unpack what’s been causing these incidents and what we’re doing to make our systems more resilient moving forward.
These incidents have occurred during a period of extremely rapid usage growth across our platform, exposing scaling limitations in parts of our current architecture. Specifically, we’ve found that recent platform instability was primarily driven by rapid load growth, architectural coupling that allowed localized issues to cascade across critical services, and the system’s inability to adequately shed load from misbehaving clients.
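To make “shedding load from misbehaving clients” concrete, here is a minimal sketch of per-client load shedding in Go. It illustrates the general technique only, not our actual implementation; the `X-Client-ID` header and the budget values are assumptions made for the example. The important property is that a client exceeding its budget is rejected at the edge rather than consuming shared backend capacity.

```go
// A minimal sketch of per-client load shedding, not GitHub's actual
// implementation. Callers are identified by a hypothetical X-Client-ID
// header; any client that exceeds its request budget is rejected with a
// 429 before its traffic can reach shared backends.
package main

import (
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

type shedder struct {
	mu        sync.Mutex
	limiters  map[string]*rate.Limiter
	perClient rate.Limit // steady-state requests per second allowed per client
	burst     int        // short bursts tolerated above the steady rate
}

func newShedder(perClient rate.Limit, burst int) *shedder {
	return &shedder{limiters: map[string]*rate.Limiter{}, perClient: perClient, burst: burst}
}

func (s *shedder) limiterFor(client string) *rate.Limiter {
	s.mu.Lock()
	defer s.mu.Unlock()
	if l, ok := s.limiters[client]; ok {
		return l
	}
	l := rate.NewLimiter(s.perClient, s.burst)
	s.limiters[client] = l
	return l
}

func (s *shedder) Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		client := r.Header.Get("X-Client-ID") // hypothetical client identifier
		if client == "" {
			client = r.RemoteAddr
		}
		if !s.limiterFor(client).Allow() {
			// Shed the excess: better to reject here than to overload the database tier.
			http.Error(w, "per-client budget exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```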
Before we cover what we are doing to prevent these issues going forward, it is worth diving into the details of the most impactful incidents.
On Monday, February 9, we experienced a high-impact incident when a core database cluster that supports authentication and user management became overloaded. The mistakes that led to the problem were made days and weeks earlier.
In early February, new versions of two very popular client-side applications that make a significant number of API calls to our servers were released, with unintentional changes driving a more-than-tenfold increase in the read traffic they generated. Because users update these applications gradually over time, the increase in load doesn’t become evident right away; it only appears as enough users upgrade.
On Saturday, February 7, we deployed a new model. Because capacity was limited, the model was released to a narrower set of customers, and to get it to them as quickly as possible we reduced the refresh TTL on a cache storing user settings from 12 hours to 2 hours. At that point, everything appeared to be operating normally because weekend load is significantly lower, and we didn’t have sufficiently granular alarms to detect the looming issue.
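To see why a refresh TTL change like this matters, a back-of-envelope sketch helps. The entry count below is made up for illustration; only the ratio matters: cutting the refresh interval from 12 hours to 2 hours means each cached entry is refreshed roughly six times as often, and every refresh is extra work for the backing database.

```go
// Back-of-envelope illustration of how shortening a cache refresh TTL
// multiplies refresh traffic against the backing store. The entry count
// is invented for the example; only the 6x ratio matters.
package main

import (
	"fmt"
	"time"
)

// refreshesPerDay assumes every active entry is refreshed once per TTL window.
func refreshesPerDay(activeEntries int, ttl time.Duration) float64 {
	return float64(activeEntries) * (24 * time.Hour).Hours() / ttl.Hours()
}

func main() {
	const activeEntries = 1_000_000 // illustrative, not a real figure

	before := refreshesPerDay(activeEntries, 12*time.Hour) //  2,000,000 refreshes/day
	after := refreshesPerDay(activeEntries, 2*time.Hour)   // 12,000,000 refreshes/day

	fmt.Printf("before: %.0f/day, after: %.0f/day (%.0fx)\n", before, after, after/before)
}
```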
Three things then compounded on February 9: our regular peak load, many customers updating to the new versions of the client apps as they started their week, and another new model release. At that point, the write volume from the shortened TTL and the read volume from the client apps combined to overwhelm the database cluster. While the TTL change was quickly identified as a culprit, it took much longer to understand why the read load kept increasing, which prolonged the incident. Further, because of the interactions between services after the database cluster became overwhelmed, we had to block the extra load further up the stack, and we didn’t have sufficiently granular switches to isolate the traffic that needed to be blocked at that level.
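The kind of granular switch we were missing can be thought of as an operator-controlled block list keyed by client and route, evaluated at the edge. The sketch below is a simplified illustration of that idea, not GitHub’s traffic-management tooling; the client header and the route strings are hypothetical.

```go
// A minimal sketch of granular, operator-controlled traffic switches,
// not GitHub's implementation. Flipping a switch for a (client, route)
// pair rejects matching requests at the edge instead of letting them
// reach an overloaded database cluster downstream.
package main

import (
	"net/http"
	"sync"
)

type killSwitches struct {
	mu      sync.RWMutex
	blocked map[string]bool // key: "<client>|<route>", e.g. "client-app-v2|/api/settings" (hypothetical)
}

func (k *killSwitches) Block(client, route string)   { k.set(client, route, true) }
func (k *killSwitches) Unblock(client, route string) { k.set(client, route, false) }

func (k *killSwitches) set(client, route string, v bool) {
	k.mu.Lock()
	defer k.mu.Unlock()
	if k.blocked == nil {
		k.blocked = make(map[string]bool)
	}
	k.blocked[client+"|"+route] = v
}

func (k *killSwitches) Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		client := r.Header.Get("X-Client-ID") // hypothetical client identifier
		k.mu.RLock()
		blocked := k.blocked[client+"|"+r.URL.Path]
		k.mu.RUnlock()
		if blocked {
			http.Error(w, "temporarily blocked while we recover capacity", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```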
The investigation into the February 9 incident raised important questions about why the user settings were stored in this particular database cluster and in this particular way. The architecture was originally selected for simplicity at a time when there were very few models and very few governance controls and policies related to those models. But over time, something that was a few bytes per user grew into kilobytes. We didn’t catch how dangerous that was because the load was visible only during new model or policy rollouts and was masked by the TTL. And since this database cluster houses data for authentication and user management, every service that depends on them was impacted.
We also had two significant instances where our failover solution was either insufficient or didn’t function correctly:
For both of these incidents, the investigations surfaced unexpected single points of failure that we need to protect against, and showed that we need to dry-run failover procedures in production more rigorously.
Across these incidents, contributing factors made the impact much broader and longer-lasting than it needed to be, including:
Our engineering teams are fully engaged in both near-term mitigations and durable, longer-term architecture and process investments. We are addressing two common themes: managing rapidly increasing load by focusing on the resilience and isolation of critical paths, and preventing localized failures from ever causing broad service degradation.
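One generic example of the isolation theme is a circuit breaker: a caller stops issuing requests to a dependency after repeated failures and serves a degraded default instead, so an overloaded component cannot drag down everything that depends on it. The sketch below illustrates the pattern in simplified form; it is not a description of our internal systems.

```go
// A minimal circuit breaker sketch illustrating the isolation theme above.
// It is a generic example of the pattern, not GitHub's implementation.
package main

import (
	"errors"
	"sync"
	"time"
)

var errCircuitOpen = errors.New("circuit open: serving degraded default")

type breaker struct {
	mu        sync.Mutex
	failures  int           // consecutive failures observed
	threshold int           // failures before the breaker opens
	cooldown  time.Duration // how long to back off before retrying
	openedAt  time.Time
}

// Call runs fn unless the breaker is open, in which case it fails fast so
// callers can fall back to a cached or default value instead of piling more
// load onto a struggling dependency.
func (b *breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.failures >= b.threshold && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return errCircuitOpen
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.threshold {
			b.openedAt = time.Now() // (re)open the breaker and start the cooldown
		}
		return err
	}
	b.failures = 0 // success closes the breaker
	return nil
}
```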
In the near term, we are prioritizing stabilization work to reduce the likelihood and impact of incidents. This includes:
In parallel, we are accelerating deeper platform investments to deliver on GitHub’s commitment to supporting sustained, high-rate growth with high availability. These include:
We are also continuing tactical repair work from every incident.
We recognize that it’s important to provide you with clear communication and transparency when something goes wrong. We publish summaries of all incidents that result in degraded performance of GitHub services on our status page and in our monthly availability reports. The February report will publish later today with a detailed explanation of incidents that occurred last month, and our March report will publish in April.
Given the scope of recent incidents, we felt it was important to address them with the community today. We know GitHub is critical digital infrastructure, and we are taking urgent action to ensure our platform is available when and where you need it. Thank you for your patience as we strengthen the stability and resilience of the GitHub platform.