The Cloudflare outage might be a good thing

2025-11-24 3:04 · gist.github.com

Cloudflare, the CDN provider, suffered a massive outage today. Some of the world's most popular apps and web services were left inaccessible for several hours whilst the Cloudflare team scrambled to fix a whole swathe of the internet.

And that might be a good thing.

The proximate cause of the outage was pretty mundane: a bad config file triggered a latent bug in one of Cloudflare's services. The file was too large (details still hazy), and this led to a cascading failure across Cloudflare's operations. There is probably some useful post-mortem material here about canary releases and staged rollouts.
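
Since the details are still hazy, the following is purely illustrative, not Cloudflare's actual code: a minimal Python sketch (with a made-up limit and file format) contrasting a config loader that kills the process on an oversized file with one that falls back to the last known good config.

```python
import json

MAX_FEATURES = 200  # hypothetical hard limit compiled into the service

def load_features_strict(path):
    """The failure mode: an oversized or malformed file crashes the process."""
    with open(path) as f:
        features = json.load(f)
    if len(features) > MAX_FEATURES:
        raise RuntimeError(f"{len(features)} features exceeds limit {MAX_FEATURES}")
    return features

def load_features_safe(path, last_known_good):
    """The safer pattern: reject a bad file and keep serving the old config."""
    try:
        return load_features_strict(path)
    except (RuntimeError, OSError, json.JSONDecodeError):
        # Propagating the error here would take down every node that receives
        # the bad file; falling back contains the blast radius instead.
        return last_known_good
```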

But the bigger problem, the ultimate cause behind today's chaos, is the creeping centralisation of the internet and a society that is sleepwalking into assuming the net is always on and always working.

It's not just "trivial" stuff like Twitter and League of Legends that was affected, either. A friend of mine remarked caustically about his experience this morning:

I couldn't get air for my tyres at two garages because of cloudflare going down. Bloody love the lack of resilience that goes into the design when the machine says "cash only" and there's no cash slot. So flat tyres for everyone! Brilliant.

We are living in a society where every part of our lives is increasingly mediated through the internet: work, banking, retail, education, entertainment, dating, family, government ID and credit checks. And the internet is increasingly concentrated into fewer and fewer points of failure.

It's ironic, because the internet was actually designed for decentralisation: a system that governments could use to coordinate their response in the event of nuclear war. But due to the economics of the internet and the challenges of things like bots and scrapers, more and more web services are holed up in citadels like AWS or behind content delivery networks like Cloudflare.

Outages like today's are a good thing because they're a warning. They can force redundancy and resilience into systems. They can make the pillars of our society - governments, businesses, banks - provide reliable alternatives when things go wrong.

(Ideally ones that are completely offline)

You can draw a parallel to how COVID-19 shook up global supply chains: the logic up until 2020 was that you wanted your system to be as lean and efficient as possible, even if it meant relying totally on international supplies or keeping as little spare inventory as possible. After 2020 businesses realised they needed to diversify and build slack in the system to tolerate shocks.

In the same way that growing one kind of banana nearly resulted in bananas going extinct, we're drifting towards a society that can't survive without digital infrastructure, and a digital infrastructure that can't operate without two or three key players. One day there's going to be an outage, a bug, or a cyberattack from a hostile state that demonstrates just how fragile that system is.

Embrace outages, and build redundancy.

Comments

  • By krick 2025-11-24 4:43

    It would be a good thing, if it would cause anything to change. It obviously won't. As if a single person reading this post wasn't aware that the Internet is centralized, and couldn't name specifically a few sources of centralization (Cloudflare, AWS, Gmail, Github). As if it's the first time this happens. As if after the last time AWS failed (or the one before that, or one before…) anybody stopped using AWS. As if anybody could viably stop using them.

    • By ectospheno 2025-11-24 16:14

      I’m pretty cloudflare centric. I didn’t start that way. I had services spread out for redundancy. It was a huge pain. Then bots got even more aggressive than usual. I asked why I kept doing this to myself and finally decided my time was worth recapturing.

      Did everything become inaccessible the last outage? Yep. Weighed against the time it saves me throughout the year I call it a wash. No plans to move.

      • By tracker1 2025-11-25 0:23

        I'm of a similar mindset... yeah, it's inconvenient when "everything" goes down... but realistically so many things go down now and then, it just happens.

        Could just as easily be my home's internet connection, or a service I need from/at work, etc. It's always going to be something, it's just more noticeable when it affects so many other things.

        • By bobbob27 2025-11-25 15:54

          To be honest, it's MUCH easier to have one source to blame when things go down. If a small-to-medium vendor's website goes down on a normal day, some poor IT guy is going to be fielding calls all day.

          If that same vendor goes down because Cloudflare went down, oh well. Most already know and won't bother to ask when your site will be back up

    • By captainkrtek 2025-11-24 4:53

      > It would be a good thing, if it would cause anything to change. It obviously won't.

      I agree wholeheartedly. The only change is internal to these organizations (e.g. Cloudflare, AWS): improvements will be made to the relevant systems, and some teams will also audit for similar behavior, add tests, and fix some bugs.

      However, nothing external will change. The cycle of pretending like you are going to implement multi-region fades after a week. And each company goes on continuing to leverage all these services to the Nth degree, waiting for the next outage.

      Not advocating that organizations should/could do much, it's all pros/cons. But the collective blast radius is still impressive.

      • By chii 2025-11-24 5:26

        the root cause is customers refusing to punish these downtimes.

        Check out how hard customers punish blackouts from the grid - both via wallet and via voting/gov't. It's why grids are now more reliable.

        So unless the backbone infrastructure gets the same flak, nothing is going to change. After all, any change is expensive, and the cost of that change needs to be worth it.

        • By tjwebbnorfolk 2025-11-24 18:40

          > Check out how hard customers punish blackouts from the grid - both via wallet and via voting/gov't.

          ok how do I punish cloudflare -- build my own globally-distributed content-delivery network just for myself so that I can be "decentralized"?

          Or should I go to one of their even-larger competitors like AWS or GCP?

          What exactly do you propose?

        • By MikeNotThePope 2025-11-24 5:48

          Is a little downtime such a bad thing? Trying to avoid some bumps and bruises in your business has diminishing returns.

          • By Xelbair 2025-11-24 7:20

            Even more so when most of the internet is also down.

            What are customers going to do? Go to a competitor that's also down?

            It is extremely annoying and will ruin your day, but as the movie quote goes: if everyone is special, no one is.

            • By throwaway0352 2025-11-24 12:50

              I think you’re viewing the issue from an office worker’s perspective. For us, downtime might just mean heading to the coffee machine and taking a break.

              But if a restaurant loses access to its POS system (which has happened), or you’re unable to purchase a train ticket, the consequences are very real. Outages like these have tangible impacts on everyday life. That’s why there’s definitely room for competitors who can offer reliable backup strategies to keep services running.

              • By mallets 2025-11-24 13:46

                Those are examples where they shouldn't be using public cloud in the first place. Should build those services to be local-first.

                Using a different, smaller cloud provider doesn't improve reliability (it likely makes it worse) if the architecture itself is wrong.

                • By esseph 2025-11-25 5:19

                  It makes credit card transactions risky (when offline)

                  • By mallets 2025-11-25 13:06

                    I'm talking more about some unrelated function taking down the whole system, not advocating for "offline" credit card transactions (is that even a thing these days?). For example: if a transaction needs to be logged somewhere, the system can be built to sync whenever possible rather than blocking all transactions when the central service is down.

                    Payment processor being down is payment processor being down.
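
                    A minimal sketch of that local-first pattern (the names are hypothetical, and a real POS would persist the queue to disk rather than memory): writes succeed locally straight away, and a background step drains them whenever the central service answers.

                    ```python
                    from collections import deque

                    class LocalFirstLog:
                        """Record transactions locally first; sync opportunistically."""

                        def __init__(self, upload):
                            self.pending = deque()  # in practice: a local file or SQLite
                            self.upload = upload    # callable that raises OSError while offline

                        def record(self, txn):
                            self.pending.append(txn)  # never blocks on the network

                        def sync(self):
                            """Call periodically; drains whatever the central service accepts."""
                            while self.pending:
                                try:
                                    self.upload(self.pending[0])
                                except OSError:
                                    return  # still offline; retry on the next cycle
                                self.pending.popleft()  # drop only after a confirmed upload
                    ```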

              • By wongarsu 2025-11-24 14:12

                Do any of those competitors actually have meaningfully better uptime?

                At a societal level, having everything shut down at once is an issue. But if you only have one POS system targeting only one backend URL (and that backend has to be online for the POS to work), then Cloudflare seems like one of the best choices.

                If the uptime provided by Cloudflare isn't enough, then the solution isn't a Cloudflare competitor; it's the ability to operate offline (which many POS systems have, including for card purchases), or at least multiple backends with different DNS, CDN, server location, etc.
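
                A minimal sketch of that multi-backend idea (the hostnames are made up): the client walks a list of backends sitting behind different DNS names and CDNs, and fails only if every one of them is down.

                ```python
                import urllib.request

                # Hypothetical backends behind different DNS providers / CDNs / hosts.
                BACKENDS = [
                    "https://api.example.com",      # primary, behind CDN A
                    "https://api-alt.example.net",  # secondary, different DNS and CDN
                    "https://origin.example.org",   # last resort: straight to origin
                ]

                def fetch(path, timeout=3):
                    """Try each backend in order; raise only if all of them fail."""
                    last_exc = None
                    for base in BACKENDS:
                        try:
                            with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                                return resp.read()
                        except OSError as exc:  # DNS failure, timeout, refused connection
                            last_exc = exc      # this backend is down; try the next one
                    raise last_exc
                ```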

            • By immibis 2025-11-24 8:50

              They could go to your competitor that's up. If you choose to be up, your competitor's customers could go to you.

              • By dewey 2025-11-24 9:02

                If it’s that easy to get the exact same service/product from another vendor, then maybe your competitive advantage isn’t so high. If Amazon were down I’d just wait a few hours, as I don’t want to sign up on another site.

                • By MikeNotThePope 2025-11-24 10:23

                  I agree. These days it seems like everything is a micro-optimization to squeeze out a little extra revenue. Eventually most companies lose sight of the need to offer a compelling product that people would be willing to wait for.

          • By krige 2025-11-24 6:14

            What's "a little downtime" to you might be work ruined and day wasted for someone else.

            • By bloppe 2025-11-24 9:07

              I remember a Google Cloud outage years ago that happened to coincide with one of our customers' massively expensive TV ads. All the people who normally would've gone straight to their website instead got a 502. Probably a 1M+ loss for them, all things considered.

              We got an extremely angry email about it.

            • By fragmede 2025-11-24 7:24

              It's 2025. That downtime could be the difference between my cat pics not loading fast enough and someone's teleoperated robot surgeon glitching out.

            • By cactusplant7374 2025-11-24 14:42

              I have a lot of bad days every year. More than I can count. It's just part of living.

          • By aaron_m04 2025-11-24 6:08

            Depends on the business.

        • By whatevaa 2025-11-24 6:54

          Grid reliability depends on where you live. In some places, UPS or even a generator is a must have. So it's a bad example, I would say.

        • By LoganDark 2025-11-24 12:21

          > Check out how hard customers punish blackouts from the grid - both via wallet and via voting/gov't.

          What? Since when has anyone ever been free to just up and stop paying for power from the grid? Are you going to pay $10,000 - $100,000 to have another power company install lines? Do you even have another power company in the area? State? Country? Do you even have permission for that to happen near your building? Any building?

          The same is true for internet service, although personally I'd gladly pay $10,000 - $100,000 to have literally anything else at my location, but there are no proper other wired providers and I'll die before I ever install any sort of cellular router. Also this is a rented apartment so I'm fucked even if there were competition, although I plan to buy a house in a year or two.

          • By heartbreak 2025-11-24 13:25

            The hyperscalers definitely vote with their wallets.

        • By mopsi 2025-11-24 6:15

          Downtimes happen one way or another. The upside of using Cloudflare is that bringing things back online is their problem, not mine as it would be if I self-hosted. :]

          Their infrastructure went down for a pretty good reason (let the one who has never caused that kind of error cast the first stone) and was brought back within a reasonable time.

      • By tracker1 2025-11-25 0:24

        And even in multi-region, you experience a DNS failure and it all goes up in flames anyway. There's always going to be something.

    • By GuB-42 2025-11-24 14:58

      Same idea with the CrowdStrike bug: it seems like it didn't have much of an effect on their customers, certainly not on my company at least, and the stock quickly recovered, in fact it's doing very well. To me it looks like nothing changed, no lessons learned.

      • By beanjuiceII 2025-11-24 16:14

        what do you mean no lesson learned? seems like you haven't been paying attention... there's always a lesson learned

        • By peaseagee 2025-11-24 16:59

          I believe they mean that CrowdStrike learned that they could screw up on this level and keep their customers....

          • By thewebguyd 2025-11-24 17:15

            That's true of a lot of "Enterprise" software. Microsoft enjoys success despite abusing their enterprise customers on what seems like a daily basis at this point.

            For bigger firms, the reality is that it would probably cost more to switch EDR vendors than the outage itself cost them, and up to that point, CrowdStrike was the industry standard and enjoyed a really good track record and reputation.

            Depending on the business, there are long-term contracts and early-termination fees, there's the need to run your new solution alongside the old one during migration, and there's probably years of telemetry and incident data that you need to keep on the old platform, so even if you switch, you're still paying for CrowdStrike for the retention period. It was one (major) issue over 10+ years.

            Just like with Cloudflare, the switching costs are higher than the outage cost, unless there were major outages of that scale multiple times per year.

          • By beanjuiceII 2025-11-24 22:54

            that IS the lesson! there are a million questions i can ask myself about those incidents. What dictates that they can't ever screw up? sure, it was a big screw-up, but understanding the tolerances for screw-ups is important to understanding how fast and loose you can play it. AWS has at least one big outage a year; what's the breaking point? risk and reward etc.

            I've worked places where every little thing is yak-shaved, and places where no one is even sure if the servers are up during working hours. Both jobs paid well... both jobs had enough happy customers

    • By stingraycharles 2025-11-24 7:42

      It’s just a function of costs vs benefits. For most people, building redundancy at this layer costs far more than the benefits are worth.

      If Cloudflare or AWS go down, the outage is usually so big that smaller players have an excuse and people accept that.

      It’s as simple as that.

      “Why isn’t your site working?” “Half the internet is down, here read this news article: …” “Oh, okay, let me know when it’s back!”

    • By ehhthing 2025-11-24 5:33

      With the rise in unfriendly bots on the internet as well as DDoS botnets reaching 15 Tbps, I don’t think many people have much of a choice.

      • By swiftcoder 2025-11-24 7:41

        The cynic in me wonders how much blame the world's leading vendor of DDoS prevention might share in the creation of that particular problem

        • By immibis 2025-11-24 8:51

          They provide free services to DDoS-for-hire services and do not terminate the services when reported.

          • By zamadatix 2025-11-24 12:55

            Not that I doubt examples exist (I've yet to be at a large place with zero failures in responding to such reports over the years), but it'd be nice if you'd share the specific examples you have in mind if you're going to bother commenting about it. It helps people understand how much of this is a systemic problem worth caring about, versus a comment that more easily falls into many other buckets. I'd try to build trust off the user profile as well, but it proclaims you're shadowbanned for two different reasons - despite me seeing your comment.

            One related topic I've seen brought up is Workers abuse https://www.fortra.com/blog/cloudflare-pages-workers-domains..., but that goes against the claim that they do nothing when reported.

    • By philipallstar 2025-11-25 12:57

      > As if anybody could viably stop using them.

      It is as easy to not use them as it ever was. There has been no actual centralisation. Everything is done using open protocols. I don't know what more you could want.

      Compare it to Windows, where there is deep volume discounting and salespeople schmoozing CTOs and getting in with schools, healthcare providers, etc. That's actual lock-in.

      • By kordlessagain 2025-11-25 16:13

        When the tubes go down the tubes it will be the fault of those who are complacent.

    • By testdelacc1 2025-11-24 6:53

      If anything, centralisation shields companies using a hyperscaler from criticism. You’ll see downtime no matter where you host. If you self host and go down for a few hours, customers blame you. If you host on AWS and “the internet goes down”, then customers treat it akin to an act of God, like a natural disaster that affects everyone.

      It’s not great being down for hours, but that will happen regardless. Most companies prefer the option that helps them avoid the ire of their customers.

      Where it’s a bigger problem is when a critical industry like retail banking in a country all choose AWS. When AWS goes down all citizens lose access to their money. They can’t pay for groceries or transport. They’re stranded and starving, life grinds to a halt. But even then, this is not the bank’s problem because they’re not doing worse than their competitors. It’s something for the banking regulator and government to worry about. I’m not saying the bank shouldn’t worry about it, I’m saying in practice they don’t worry about it unless the regulator makes them worry.

      I completely empathise with people frustrated with this status quo. It’s not great that we’ve normalised a few large outages a year. But for most companies, this is the rational thing to do. And barring a few critical industries like banking, it’s also rational for governments to not intervene.

      • By graemep 2025-11-24 9:50

        > If anything, centralisation shields companies using a hyperscaler from criticism. You’ll see downtime no matter where you host. If you self host and go down for a few hours, customers blame you.

        Not just customers. Your management take the same view. Using hyperscalers is great CYA. The same for any replacement of internally provided services with external ones from big names.

        • By testdelacc1 2025-11-24 10:31

          Exactly. No one got fired for using AWS. Advocating for self-hosting or a smaller provider means you get blamed when the inevitable downtime comes around.

      • By BlackFly 2025-11-24 12:09

        I think this really depends on your industry.

        If you cannot give a patient life saving dialysis because you don't have a backup generator then you are likely facing some liability. If you cannot give a patient life saving dialysis because your scheduling software is down because of a major outage at a third party and you have no local redundancy then you are in a similar situation. Obviously this depends on your jurisdiction and probably we are in different ones, but I feel confident that you want to live in a district where a hospital is reasonably responsible for such foreseeable disasters.

        • By testdelacc1 2025-11-24 16:33

          Yeah, I mentioned banking because it’s what I’m familiar with, but the medical industry is going to be similar.

          But they do differ - it’s never ok for a hospital to be unable to dispense care. But it is somewhat ok for one bank to be down. We just assume that people have at least two bank accounts. The problem the banking regulator faces is that when AWS goes down, all banks go down simultaneously. Not terrible for any individual bank, but catastrophic for the country.

          And now you see what a juicy target an AWS DC is for an adversary. They go down on their own now, but surely Russia or others are looking at this and thinking “damn, one missile at the right data center and life in this country grinds to a halt”.

      • By DeathArrow 2025-11-24 6:58

        >If anything, centralisation shields companies using a hyperscaler from criticism. You’ll see downtime no matter where you host. If you self host and go down for a few hours, customers blame you.

        What if you host on AWS and only you go down? How does hosting on AWS shield you from criticism?

        • By testdelacc1 2025-11-24 7:08

          This discussion is assuming that the outage is entirely out of your control because the underlying datacenter you relied on went down.

          Outages because of bad code do happen and the criticism is fully on the company. They can be mitigated by better testing and quick rollbacks, which is good. But outages at the datacenter level - nothing you can do about that. You just wait until the datacenter is fixed.

          This discussion started because companies are actually fine with this state of affairs. They are risking major outages but so are all their competitors so it’s fine actually. The juice isn’t worth the squeeze to them, unless an external entity like the banking regulator makes them care.

    • By markus_zhang 2025-11-24 11:31

      These outages are too few and far between to change anything. If businesses started losing connectivity for 8 hours every month, maybe the bigger ones would go for self-hosting, or at least some capacity for self-hosting.

      • By mkornaukhov 2025-11-24 18:23

        Yeah, agree. But even in the case of an 8-hour downtime (that's almost a 99% SLA), self-hosting isn't beneficial for really small firms.

    • By tcfhgj 2025-11-24 8:30

      > As if anybody could viably stop using them.

      You can, and even save money.

    • By sjamaan 2025-11-24 6:00

      Same with the big CrowdStrike fail of 2024. Especially when everyone kept repeating the laughable statement that these guys have their shit in order, so it couldn't possibly be a simple fuckup on their end. Guess what: they don't, and it was. And nobody has realized the importance of diversity for resilience, so all the major stuff is still running on Windows and using CrowdStrike.

      • By c0l0 2025-11-24 8:15

        I wrote https://johannes.truschnigg.info/writing/2024-07-impending_g... in response to the CrowdStrike fallout, and was tempted to repost it for the recent Cloudflare whoopsie. It's just too bad that publishing rants won't change the darned status quo! :')

        • By graemep 2025-11-24 9:47

          People will not do anything until something really disastrous happens. Even afterwards, memories can fade. CrowdStrike has not lost many customers.

          Covid is a good parallel. A pandemic was always possible; there is always a reasonable chance of one over the course of decades. However, people did not take it seriously until it actually happened.

          A lot of Asian countries are a lot better prepared for a tsunami than they were before 2004.

          The UK was supposed to have emergency plans for a pandemic, but they were for a flu variant, and I suspect even those plans were under-resourced and not fit for purpose. We are supposed to have plans for a solar storm, but when another Carrington event occurs I very much doubt we will deal with it smoothly.

    • By fragmede 2025-11-24 9:02

      > It obviously won't.

      Here's where we separate the men from the boys, the women from the girls, the Enbys from the enbetts, and the SREs from the DevOps. If you went down when Cloudflare went down, do you go multicloud so that can't happen again, or do you shrug your shoulders and say "well, everyone else is down"? Have some pride in your work, do better, be better, and strive for greatness. Have backup plans for your backup plans, and get out of the pit of mediocrity.

      Or not, shit's expensive and kubernetes is too complicated and "no one" needs that.

      • By rkomorn 2025-11-24 9:04

        You make the appropriate cost/benefit decision for your business and ignore apathy on one side and dogma on the other.

  • By notepad0x90 2025-11-24 7:30

    Does the author of this post not see the irony of posting this content on Github?

    My counter argument is that "centralization" in a technical sense isn't about what company owns things but how services are operated. Cloudflare is very decentralized.

    Furthermore, I've seen regional outages caused by things like anchors dropped by ships in the wrong place or a shark eating a cable, and regional power outages caused by squirrels, etc. Outages happen.

    If everyone ran their own server from their own home, AT&T or Level3 could have an outage and still take out similar swathes of the internet.

    With CDNs like Cloudflare, if Level3 had an outage, your website won't be down just because your home's or VPS host's upstream transit happens to be Level3 (or whatever they call themselves these days), since your content (at least the static parts) is cached globally.

    The only really reasonable alternative is something like IPFS, web3, and similar ideas.

    Cloudflare has always called itself a content transport provider; think of it as such. But Cloudflare is also just one player; there are several very big players. Every big cloud provider has a competing product, not to mention companies like Akamai.

    People are rage-posting about Cloudflare especially because it has made CDNs accessible to everyone. You can easily set up a free Cloudflare account and be on your merry way. This isn't something you should be angry about. You're free to pay for any number of other CDNs; many do.

    If you don't like how much market share Cloudflare has, then come up with a similarly competitive alternative and profit. Just this HN thread alone is enough for me to think there is a market for more players. Or just spread the word about the competition that exists today. Use frontdoor, cloudfront, netlify, flycdn, akamai, etc. It's hardly a monopoly.

  • By miki123211 2025-11-24 9:16

    I don't know how many times I need to say this, but I will die on this hill.

    Centralized services don't decrease redundancy. They're usually far more redundant than whatever homegrown solution you can come up with.

    The difference between centralized and homegrown is mostly psychological. We notice the outages of centralized systems more often, as they affect everything at the same time instead of different systems at different times. This is true even if, in a hypothetical world with no centralization, we'd have more total outage time than we do now.

    If your gas station says "closed" due to a problem that only affects their own networks, people usually go "aah they're probably doing repairs or something", and forget about the problem 5 minutes later. If there's a Cloudflare outage... everybody (rightly) blames the Cloudflare outage.

    Where this becomes a problem is when correlated failures are actually worse than uncorrelated ones. If Visa goes down, it's better if Mastercard stays up, because many customers have both and can use the other when one doesn't work. In some ways, it's better to have 30 mins of Visa outages today and 30 mins of Mastercard outages tomorrow, than to have just 15 mins of correlated outages in one day.
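
    Back-of-envelope arithmetic for the card example (assuming a customer carries both cards and that uncorrelated outages land at independent random times):

    ```python
    DAY = 24 * 60  # minutes in a day

    # Uncorrelated: Visa down 30 min, Mastercard down 30 min, independently timed.
    # The chance both are down at any given instant is the product of two small odds.
    p_both_down = (30 / DAY) * (30 / DAY)
    print(p_both_down * DAY)  # expected total blackout: ~0.6 minutes

    # Correlated: both ride the same provider and fail together for 15 min,
    # so the customer sees the full 15 minutes with no way to pay.
    print(15)
    ```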

    • By lloeki 2025-11-24 14:32

      "redundancy" might not be there correct word. If we had a single worldwide mega-entity serving 100% of the internet it would be both a monopoly and would have tons of redundant infrastructure.

      But it would also be quite unified; the system, while full of redundancies, as a whole is a unique one operated the same way end to end; by virtue of it being a single system handled in a uniform way, a single glitch could bring it all down. There is no diversity in the system's implementation, the monoculture itself makes it vulnerable.

    • By freeplay 2025-11-24 16:20

      The problem is creating a single point of failure.

      There's no doubt a VM in AWS is exponentially more redundant than my VM running on a couple of Intel NUCs in my closet.

      The difference is, when I have a major outage, my blog goes down.

      When EC2 has a major outage, all of the blogs go down. Along with Wikipedia, Starbucks, and half the internet.

      That single point of failure is the issue.

      • By YetAnotherNick 2025-11-24 17:07

        Single point of failure means exactly the opposite of what you think it means. If my work depends on 5 services being up, each service is a single point of failure, and correlation between their failures is good for the probability that I can do my work.

        • By freeplay 2025-11-24 20:02

          I see what you're saying but I have to push back.

          "If one thing I need is going to be down, everything might as well be down."

          If I have a product with 5 dependencies and one of them is down, there are things I can do to partially mitigate it. A circuit breaker (sketched below) would allow my thing to at least stay up and responsive. Maybe I could get a status message up and turn off a feature flag to disable whatever calls that dependency.

          On the other hand, if all my dependencies are down AND the management layer is down AND the AWS portal is not functioning correctly, I'm pretty much SOL.

          Massive centralization is never, ever a good thing for anyone other than the ones who are doing the centralizing.
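
          That circuit breaker is only a few lines; a minimal sketch (the failure threshold and cool-off period here are arbitrary placeholders):

          ```python
          import time

          class CircuitBreaker:
              """Stop calling a failing dependency; probe again after a cool-off."""

              def __init__(self, call, max_failures=3, reset_after=30.0):
                  self.call = call                  # the flaky dependency
                  self.max_failures = max_failures  # failures before opening
                  self.reset_after = reset_after    # seconds before probing again
                  self.failures = 0
                  self.opened_at = None

              def request(self, *args, fallback=None):
                  if self.opened_at is not None:
                      if time.monotonic() - self.opened_at < self.reset_after:
                          return fallback    # open: fail fast, stay responsive
                      self.opened_at = None  # cool-off elapsed: probe again
                      self.failures = 0
                  try:
                      result = self.call(*args)
                  except OSError:
                      self.failures += 1
                      if self.failures >= self.max_failures:
                          self.opened_at = time.monotonic()
                      return fallback
                  self.failures = 0
                  return result
          ```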

        • By smj-edison 2025-11-24 18:32

          This is a really interesting point, because I could see a situation where your application requires integration with say 10 services. If they all run on AWS, they either all go down or all run together. If they're all self-hosted, there's a good chance that at any time one of the ten is down, and so your service can't run.
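
          Back-of-envelope, assuming independence and that any one dependency being down blocks the application: ten services at 99.9% availability each leave everything up only 0.999^10 ≈ 99.0% of the time, roughly 87 hours a year with something down, versus about 9 hours a year if all ten sit on one provider and fail (and recover) together.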

    • By dgan 2025-11-24 9:40

      > Centralized services don't decrease redundancy

      Alright, but it creates a failure correlation where previously there was none

    • By masfuerte 2025-11-24 12:38

      In my experience services aren't failing due to a lack of redundancy but due to an excess of complexity. With the move to the cloud we are continually increasing both redundancy and complexity and this is making the problem worse.

      I have a cheap VPS that has run reliably for a decade except for a planned hour of downtime. Which was in the middle of the night when no-one cared. Amazon is more reliable in theory. My cheap VPS is more reliable in practice.
