Comments

  • By abelanger 2025-11-1813:578 reply

    If anyone needs commands for turning off the CF proxy for their domains and happens to have a Cloudflare API token, here's how.

    First you can grab the zone ID via:

        curl -X GET "https://api.cloudflare.com/client/v4/zones" -H "Authorization: Bearer $API_TOKEN" -H "Content-Type: application/json" | jq -r '.result[] | "\(.id) \(.name)"'
    
    And a list of DNS records using:

        curl -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" -H "Authorization: Bearer $API_TOKEN" -H "Content-Type: application/json"
    
    Each DNS record will have an associated ID. Finally, patch the relevant records:

        curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" -H "Authorization: Bearer $API_TOKEN" -H "Content-Type: application/json" --data '{"proxied":false}'
    
    Copying from a sibling comment - some warnings:

    - SSL/TLS: You will likely lose your Cloudflare-provided SSL certificate. Your site will only work if your origin server has its own valid certificate.

    - Security & Performance: You will lose the performance benefits (caching, minification, global edge network) and security protections (DDoS mitigation, WAF) that Cloudflare provides.

    - This will also reveal your backend internal IP addresses. Anyone can find permanent logs of public IP addresses used by even obscure domain names, so potential adversaries don't necessarily have to be paying attention at the exact right time to find it.
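
    The three calls above can be strung together so you don't have to patch records one by one. This is only a sketch, untested against the live API: it assumes jq is installed, $API_TOKEN is exported, and that the list endpoint's proxied=true filter and per_page parameter behave as Cloudflare's API docs describe. Defining the function runs nothing until you call it with a zone ID.

```shell
# Sketch: turn off the proxy on every proxied record in one zone.
# Assumes $API_TOKEN is set and jq is installed; the proxied=true
# filter and per_page parameter are assumptions taken from Cloudflare's
# API docs -- verify against your account before relying on this.
cf_api="https://api.cloudflare.com/client/v4"

unproxy_zone() {
  zone_id=$1
  # List only proxied records, then PATCH each one to proxied=false.
  curl -s "$cf_api/zones/$zone_id/dns_records?proxied=true&per_page=100" \
       -H "Authorization: Bearer $API_TOKEN" \
    | jq -r '.result[].id' \
    | while read -r record_id; do
        curl -s -X PATCH "$cf_api/zones/$zone_id/dns_records/$record_id" \
             -H "Authorization: Bearer $API_TOKEN" \
             -H "Content-Type: application/json" \
             --data '{"proxied":false}'
      done
}

# usage: unproxy_zone "$ZONE_ID"
```

    If you're unsure, run the list call on its own first and eyeball the records before patching anything.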

    • By duggan 2025-11-1814:33

      Also, for anyone who only has an old global API key lying around instead of the more recent tokens, you can set:

        -H "X-Auth-Email: $EMAIL_ADDRESS" -H "X-Auth-Key: $API_KEY"
      
      instead of the Bearer token header.

      Edit: and in case you're like me and thought it would be clever to block all non-Cloudflare traffic hitting your origin... remember to disable that.

    • By sam-cop-vimes 2025-11-1814:052 reply

      This is exactly what we've decided we should do next time. Unfortunately we didn't generate an API token so we are sitting twiddling our thumbs.

      Edit: seems like we are back online!

      • By napsterbr 2025-11-1814:111 reply

        Took me ~30 minutes but eventually I was able to log in, get past the 2FA screen and change a DNS record.

        I surely missed a valid API token today.

        • By firecall 2025-11-1814:172 reply

          I'm still trying.

          Still can't load the Turnstile JS :-/

          • By biinjo 2025-11-1814:41

            Turnstile is back up (for now). Go refresh. I just managed to make an API key and turn off proxied DNS.

          • By fragmede 2025-11-1817:47

            Install the Tweak Chrome extension, MITM yourself, and force the JS to load from somewhere else.

      • By basch 2025-11-1815:10

        I'm able to generate keys right now through WARP. Login takes forever but it is working.

    • By mig4ng 2025-11-1814:00

      Awesome! I did it via the Terraform provider, but for anyone else without access to the dashboard this is great. Thank you!

    • By basch 2025-11-1814:361 reply

      If anyone needs the internet to work again (or to get into your CF dashboard to generate API keys), if you have Cloudflare WARP installed, turning it on appears to fix otherwise broken sites. Maybe using 1.1.1.1 does too, but flipping the radio box was faster. Some parts of sites are still down, even after tunneling into CF.

      • By adi_kurian 2025-11-1814:431 reply

        Super helpful, thanks!

        Looks like I can get everywhere I couldn't, except my Cloudflare dash.

        • By basch 2025-11-1814:47

          It's absurdly slow (multiple minutes for the login page to fully load so the login button becomes pressable, due to the captcha...), but I was able to log into the dashboard. It's throwing lots of errors once inside, but I can navigate around some of it. YMMV.

          My profile (including API tokens) and websites pages all work; the accounts tab above websites on the left does not.

    • By jlundberg 2025-11-1817:46

      Good advice!

      And there's no need for -X GET to make a GET request with curl; GET is the default HTTP method if you don't send any content.

      If you do send content with, say, -d, curl will do a POST request, so no need for -X then either.

      For PATCH, though, it is the right curl option.

    • By czue 2025-11-197:44

      Thanks for this! Just expanded on it a bit and published a write-up here so it's easier to find in the future: https://www.coryzue.com/writing/cloudflare-dns/

    • By JoeOfTexas 2025-11-1819:181 reply

      I would advise against this action. Just ride the crash.

      • By RKFADU_UOFCCLEL 2025-11-1820:59

        If people knew how to play the 5 hour long game they wouldn't have been using Cloudflare in the first place.

    • By gtrealejandro 2025-11-197:39

      [dead]

  • By itzjacki 2025-11-1811:3716 reply

    A colleague of mine just came bursting through my office door in a panic, thinking he brought our site down since this happened just as he made some changes to our Cloudflare config. He was pretty relieved to see this post.

    • By arbuge 2025-11-1813:151 reply

      Tell him it's worse than he thinks. He obviously brought the entire Cloudflare system down.

      • By mlrtime 2025-11-1813:207 reply

        You joke and I think it's funny, but as a junior engineer I would be quite proud if some small change I made was able to take down the mighty Cloudflare.

        • By throwup238 2025-11-1813:273 reply

          If I were Cloudflare it would mean an immediate job offer well above market. That junior engineer is either a genius or so lucky that they must be bred by Pierson’s Puppeteers or such a perfect manifestation of a human fuzzer that their skills must be utilized.

          • By anvuong 2025-11-1818:441 reply

            This reminds me of a friend I had in college. We were assigned to the same group coding an advanced calculator in C. This guy didn't know anything about programming (he was mostly focused on his side biz of selling collector sneakers), so we assigned him to do all the testing: his job was to come up with weird equations and weird but valid ways to present them to the calculator. And this dude somehow managed to crash almost all of our iterations except the last few. Really put the joke about a programmer, a tester, and a customer walking into a bar into perspective.

            • By jrochkind1 2025-11-1819:31

              I love that he ended up making a very valuable contribution despite not knowing how to program -- other groups would have just been mad at him, had him do nothing, or had him do programming and gotten mad when it was crap or not finished.

          • By ethmarks 2025-11-1813:522 reply

            A Ringworld reference in the wild?

            • By throwup238 2025-11-1814:011 reply

              I never thought I'd get the chance, but then my Claude Code on Web credits ran out and I had to find another way to entertain myself.

              • By UltraSane 2025-11-1820:561 reply

                Even after 20 projects I have only used $60 of my $250

                • By throwup238 2025-11-190:28

                  I think the rate limits for Claude Code on the Web include VM time in general and not just LLM tokens. I have a desktop app with a full end to end testing suite which the agent would run for every session that probably burned up quite a bit.

            • By zarathustreal 2025-11-1813:59

              Internet points demand obscure references these days. My system prompt has its own area code

          • By kkkqkqkqkqlqlql 2025-11-1919:04

            > If I were Cloudflare it would mean an immediate job offer well above market.

            And not a lawsuit? Cause I've read more about that kind of reaction than of job offers. Though I guess lawsuits are more likely to be controversial and talked about.

        • By methyl 2025-11-1813:54

          I kind of did that back in the day when they released Workers KV. I tried to bulk upload a lot of data and it brought the whole service down. Can confirm I was proud :D

        • By amalcon 2025-11-1815:30

          It's also not exactly the least common way that this sort of huge multi-tenant service goes down. It's only as rare as it is because more or less all of them have had such outages in the past and built generic defenses (e.g. automated testing of customer changes, gradual rollout, automatic rollback, there are others but those are the ones that don't require any further explanation).

        • By zidad 2025-11-1813:36

          You might want to consider migrating to Azure Front Door if that's a feature you like: https://www.infoq.com/news/2025/11/azure-afd-control-plane-f...

        • By aws_ls 2025-11-1813:321 reply

          Well, it's easy to cause damage by messing up the `rm` command, especially with the `-fr` options. So don't take it as a proxy for some great skill being required to cause damage.

          • By ethmarks 2025-11-1813:56

            You could easily cause great damage to your Cloudflare setup, but CF has measures to prevent random customers deleting stuff from taking down the entire service globally. Unless you have admin access to the entire CF system, you can't really cause much damage with rm.

        • By amypetrik8 2025-11-194:231 reply

          >You joke and I think it's funny, but as a junior engineer I would be quite proud if some small change I made was able to take down the mighty Cloudflare.

          I mean, with Cloudflare's recent (lack of) uptime, I would argue there's a degree of crashflation happening such that the prestige is less in doing so. I mean nowadays if a lawnmower drives by cloudflare and backfires that's enough to collapse the whole damn thing

          • By DonHopkins 2025-11-1916:351 reply

            Are you actually so mind-numbingly ignorant that you think Rebecca Heineman had a brother named Bill, that you would rudely and incorrectly try to correct people who knew her story well, during a memorial discussion of her life and death?

            Or were you purposefully going out of your way to perpetrate performative ignorance and transphobic bullying, just to let everyone know that you're a bigoted transphobic asshole?

            I don't buy that it was an innocent mistake, given the context of the rest of the discussion, and your pretending to know her family better than the poster you were replying to and everyone else in the discussion, falsely denying her credit for her own work. Do you really think dang made the Hacker News header black because he and everyone else was confused and you were right?

            Do you like to show up at funerals of people you don't know, just to interrupt the eulogy with insults, stuff pennies up your ass (as you claim to do), then shit and piss all over the coffin in front of their family and friends?

            How long did you have to wait until she died before you had the courage to deadname, misgender, and punch down at her in a memorial, out of hate and cowardice and a perverse desire to show everyone what kind of a person you really are?

            Next time, can you at least wait until after the funeral before committing your public abuse?

            https://news.ycombinator.com/item?id=45975524

            amypetrik8 13 hours ago [flagged] [dead] | parent | context | flag | vouch | favorite | on: Rebecca Heineman has died

            The work you're outlining here is was performed by "Bill Heineman" - maybe you are mixing up Bill with his sister Rebecca?!?

            • By throwmoe 2025-11-1920:422 reply

              Can you calm down with the absolutely mental rants against people in unrelated threads. Cuckoo crazy behavior.

              • By DonHopkins 2025-11-209:201 reply

                Posting abusive bigoted bullshit in a memorial thread is cuckoo crazy behavior. Calling it out and describing it isn't. You're confusing describing the abuse with committing the abuse. Direct your scorn at the person I'm criticizing, unless you agree with what they did, in which case my criticism also applies directly and personally to you, so no wonder you created a throw away sock puppet account just to attempt to defend your own bigotry and abuse.

                • By throwmoe 2025-11-2022:28

                  Have you ranted at John Carmack yet?

    • By sakisv 2025-11-1811:595 reply

      Well, you can never be sure that he didn't:

      https://www.fastly.com/blog/summary-of-june-8-outage

      • By nevf1 2025-11-1813:391 reply

        It's also what caused the Azure Front Door global outage two weeks ago - https://aka.ms/air/YKYN-BWZ

        "A specific sequence of customer configuration changes, performed across two different control plane build versions, resulted in incompatible customer configuration metadata being generated. These customer configuration changes themselves were valid and non-malicious – however they produced metadata that, when deployed to edge site servers, exposed a latent bug in the data plane. This incompatibility triggered a crash during asynchronous processing within the data plane service. This defect escaped detection due to a gap in our pre-production validation, since not all features are validated across different control plane build versions."

        • By boruto 2025-11-195:13

          It's actually pretty nice and amazing that they publish video-format incident retrospectives.

      • By itzjacki 2025-11-1812:011 reply

        Oh don't you worry. We are very much talking about the global outage as if he was the root cause. Like good colleagues :)

        • By rapnie 2025-11-1812:411 reply

          Hmm, wait a minute.. maybe he was the cause! (no, kidding. just upping the pressure as a good peer :)

          • By bryanrasmussen 2025-11-1813:091 reply

            are we truly good if we don't start a class action suit against this hapless scapegoat?!

            • By conorcleary 2025-11-1814:16

              Just join the one we've started over in this cubicle!

      • By srmarm 2025-11-1812:55

        > May 12, we began a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances.

        I'd love to know more about what those specific circumstances were!

      • By Bloomy22 2025-11-1813:39

        I'm pretty sure I crashed Gmail using something weird in its filters. It was a few years ago. Every time I did something specific (I don't remember what), it would freeze and then display a 502 error for a while.

      • By CableNinja 2025-11-1813:022 reply

        Damn, imagine being the customer responsible for that, oof

        • By WJW 2025-11-1813:14

          What do you imagine would be the result if you brought down cloudflare with a legitimate config update (ie not specifically crafted to trigger known bugs) while not even working for them? If I were the customer "responsible" for this outage, I'd just be annoyed that their software is apparently so fragile.

        • By whstl 2025-11-1813:061 reply

          I would be fine if it was my "fault", but I'm sure people in business would find a way to make me suffer.

          But on a personal level, this is like ordering something at a restaurant and the cook burning down the kitchen because they forgot to take your pizza out of the oven or something.

          I would be telling it to everyone over beers (but not my boss).

          • By sakisv 2025-11-1815:10

            I would be tempted to put it on my CV :D

    • By Freak_NL 2025-11-1811:4014 reply

      Is there a word for that feeling of relief when someone else fucked up after initially thinking it was you?

      • By spamizbad 2025-11-1812:522 reply

        What's funny is that as I get older this feeling of relief turns into more of a feeling of dread. The nice thing about problems that you cause is that you have considerable autonomy to fix them. When Cloudflare goes down, you're sitting and waiting for a third party to fix something.

        • By mewpmewp2 2025-11-1813:122 reply

          Why is it dread? I always feel good when big players mess up, as it makes me feel better about my own mess ups in life previously.

          • By twodave 2025-11-1813:281 reply

            Can’t speak for GP but ultimately I’d rather it be my fault or my company’s fault so I have something I can directly do for my customers who can’t use our software. The sense of dread isn’t about failure but feeling empathy for others who might not make payroll on time or whatever because my service that they rely on is down. And the second order effects, like some employee of a customer being unable to make rent or be forced to take out a short term loan or whatever. The fallout from something like this can have an unexpected human cost at times. Thankfully it’s Tuesday, not a critical payroll day for most employees.

            • By mewpmewp2 2025-11-1813:321 reply

              But why does this case specifically matter? What if their system was down due to their WiFi or other layers beyond your software? Would you feel the same as well?

              What about all the other systems and people suffering elsewhere in the World?

              • By twodave 2025-11-1817:18

                I don't understand what point you're trying to make. Are you suggesting that if I can't feel empathy for everybody at once, or in every one of their circumstances, that I should not feel anything at all for anyone? That's not how anything works. Life (or, as I believe, God) brings us into contact with all kinds of people experiencing different levels of joy and pain. It's natural to empathize with the people you're around, whatever they're feeling. Don't over-complicate it.

          • By kasey_junk 2025-11-1813:162 reply

            Because my customers don't (and shouldn't) care that it's a third party. If I caused it there is a chance I can fix it.

            • By _factor 2025-11-1813:331 reply

              So you would rather be incompetent than powerless? Choice of third party vendor on client facing services is still on you, so maybe you prefer your incompetence be more direct and tangible?

              Even still, you should have policies in place to mitigate such eventualities, that way you can focus the incompetence into systematic issues instead. The larger the company, the less acceptable these failures become. Lessons learned is a better excuse for a shake and break startup than an established player that can pay to be secure.

              At some point, the finger has to be pointed. Personally, I don't dread it pointing elsewhere. Just means I've done my due D and C.

              • By shufflerofrocks 2025-11-1814:34

                Your priority (in this comment at least) is the finger-pointing, while the parent's priority is wanting a fix to the issue at hand.

            • By mewpmewp2 2025-11-1813:27

              If customers expected third-party downtime not to affect their thing, then you shouldn't have picked a third-party provider, or should have spent extra resources on avoiding a single point of failure. If they were happy choosing the third party, knowing the dependence on said provider, then it was an accepted risk.

        • By sys_64738 2025-11-1813:20

          When others cause problems then you can put your feet up and surf the web waiting for resolution. Oh, wait.

      • By jspash 2025-11-1812:054 reply

        The problem is, I still get the wrong end of the stick when AWS or CF go down! Management doesn't care, understandably. They just want the money to keep coming in. It's hard to convince them that this is a pretty big problem. The only thing that will calm them down a bit is to tell them Twitter is also down. If that doesn't get them, I say ChatGPT is also down. Now NOBODY will get any work done! lol.

        • By hylaride 2025-11-1813:08

          This is why you ALWAYS have a proposal ready. I literally had my ass saved by having tickets with reliability/redundancy work clearly laid out with comments by out of touch product/people managers deprioritizing the work after attempts to pull it off the backlog (in one infamous case for a notoriously poorly conceived and expensive failure of a project that haunted us again with lost opportunity cost).

          The hilarious part of the whole story is that the same PMs and product managers were (and I cannot overemphasize this enough) absolutely militant orthodox agile practitioners with jira.

        • By aurareturn 2025-11-1813:031 reply

          Every time a major cloud goes down, management tells us why don't we have a backup service that we can switch to. Then I tell them that a bunch of services worth a lot more than us are also down. Do you really want to spend the insane amount of resources to make sure our service stays up when the global internet is down?

          • By ec109685 2025-11-1813:21

            Having an alt to Cloudflare isn’t preposterous.

        • By graemep 2025-11-1812:37

          Who decided to go with AWS or CF? If it's a management decision, tell them you need the resources to have a fallback if they want their system to be more reliable than AWS or CF.

        • By adriand 2025-11-1812:51

          Haha yeah I just got off the phone and I said, look, either this gets fixed soon or there's going to be news headlines with photographs of giant queues of people milling around in airports.

      • By shortrounddev2 2025-11-1812:001 reply

        When I'm debugging something, I'm not usually looking for the solution to the problem; I'm looking for sufficient evidence that I didn't cause the problem. Once I have that, the velocity at which I work slows down

        • By sys_64738 2025-11-1813:22

          My manager once asked if he could have a "quick word". I said "velocity".

      • By jpmonette 2025-11-1811:444 reply

        phewphoria

      • By mcphage 2025-11-1812:483 reply

        Maybe this isn’t great, but I get a hint of that feeling when I’m on an airplane and hear a baby crying. For a number of years, if I heard a baby crying, it was probably my baby and I had to deal with it. But now my kids are past that phase, so when I hear the crying, after that initial jolt of panic I realize that it isn’t my problem, and that does give me the warm fuzzies. Even though I do feel bad for the baby and their parents.

        • By adriand 2025-11-1812:53

          Related situation: you're at a family gathering and everyone has young kids running around. You hear a thump, and then some kid starts screaming. Conversation stops and every parent keenly listens to the screams to try and figure out whose kid just got hurt, then some other parent jumps up - it's not your kid! #phewphoria

        • By RhysU 2025-11-1813:22

          You're not alone in this feeling. I occasionally smile when it's not my kid.

        • By hackeraccount 2025-11-1814:03

          This is one of the secret joys of being a parent.

      • By bookofjoe 2025-11-1813:061 reply

        The German word “schadenfreude” means taking pleasure in someone else’s misfortune; enjoyment rather than relief.

        • By bryanrasmussen 2025-11-1813:151 reply

          since schaden is damage and freude is joy, not sure what it should be - maybe Schadeleichtig hmm...

          • By tauchunfall 2025-11-1813:281 reply

            >maybe Schadeleichtig

            Maybe "Erleichterung" (relief)? But as a German "Schadenserleichterung" (also: notice the "s" between both compound word parts) rather sounds like a reduction of damage (since "Erleichterung" also means mitigation or alleviation).

            • By bryanrasmussen 2025-11-190:09

              Right, I thought of that at first and discarded it for that reason. The real problem is that in the usual story of how Schadenfreude works, as a bit of German-language how-to, the component that it is other people's damage sparking the joy is missing from the word itself; that interpretation has to be known by the word's user. If you were just creating the word and nobody had heard it before, it would be pretty reasonable for people to think you had coined a new word for masochism.

      • By Rooster61 2025-11-1815:50

        Schadenfriend?

        You gain relief, but you don't exactly derive pleasure as it's someone you know that's getting the ass end of the deal

      • By stonecharioteer 2025-11-1812:54

        It's close enough to Schadenfreude but not really.

      • By StanAngeloff 2025-11-1811:432 reply

        Schadenfreude

        • By gnfargbl 2025-11-1811:472 reply

          Nah, that's delight in someone else's misfortune. This is delight that the misfortune wasn't yours, which is slightly different.

          • By StanAngeloff 2025-11-1811:481 reply

            4 years of German and I still don't quite "get" it :^) TY!

            • By tagyro 2025-11-1811:57

              We have a saying:

              You know how you measure eternity?

              When you finish learning German.

          • By namblooc 2025-11-1812:541 reply

            Katastrophenverursachererleichterung

        • By simonklitj 2025-11-1811:461 reply

          Not quite, that’s more like taking pleasure in the misfortune of someone else. It’s close, but the specific relief bit that it is not _your_ misfortune is not captured

      • By cromka 2025-11-1812:00

        vindication?

      • By hoistbypetard 2025-11-1822:10

        schadenfuckup

    • By nrhrjrjrjtntbt 2025-11-1812:32

      The company where this colleague works? Cloudflare.

    • By sefke 2025-11-1813:37

      I woke up getting bombarded by messages from multiple clients about sites not working. I shat my pants because I'd changed the config just yesterday. When I saw the status message "cloudflare down" I was so relieved.

    • By disconnection 2025-11-1812:29

      Good that he worked it out so quick. I recently spent a day debugging email problems on Railway PaaS, because they silently closed an SMTP port without telling anyone.

    • By bamboozled 2025-11-1812:041 reply

      How do we know your colleague's changes didn't take down Cloudflare though?

      • By itzjacki 2025-11-1812:081 reply

        Good point. We should probably assume they did, until proven otherwise.

        • By puilp0502 2025-11-1813:08

          Guilty until proven innocent.

    • By 0xblinq 2025-11-1813:15

      Do you guys work at Cloudflare? Do you mind reverting that change just in case?

    • By ants_everywhere 2025-11-1813:24

      Chances are still good that somewhere within Cloudflare someone really did do a global configuration push that brought down the internet.

      When aliens study humans from this period, their book of fairy tales will include several where a terrible evil was triggered by a config push.

    • By 0xblinq 2025-11-1813:15

      Plot twist: They work at Cloudflare

    • By dcjdfvk 2025-11-1813:151 reply

      Even Pornhub is down because it uses Cloudflare.

      • By carlos_rpn 2025-11-1813:29

        Is Cloudflare being down the work of conservative hackers and the rest of the internet is just collateral damage?

    • By belter 2025-11-1812:03

      Wait for the post mortem ... It is a technical possibility: a race condition propagates one customer config to all nodes... :-)

    • By raxxorraxor 2025-11-1813:53

      Did your colleague perhaps change the Cloudflare config again right now? Seems to be down again.

    • By theoldgreybeard 2025-11-1814:23

      You should tell him his config change took down half the internet.

    • By NitpickLawyer 2025-11-1811:56

      You missed a great opportunity to dead-pan him with something like "No, Bob, not just our site, you brought down the entire Internet, look at this post!"

  • By aavshr 2025-11-1816:143 reply

    > In short, a latent bug in a service underpinning our bot mitigation capability started to crash after a routine configuration change we made. That cascaded into a broad degradation to our network and other services. This was not an attack.

    From the CTO, Source: https://x.com/dok2001/status/1990791419653484646

    • By __turbobrew__ 2025-11-1816:396 reply

      It still astounds me that the big dogs do not phase config rollouts. Code is data, configs are data; they are one and the same. It was the same issue with the giant CrowdStrike outage last year: they were rawdogging configs globally, and a bad config made it out there and everything went kaboom.

      You NEED to phase config rollouts like you phase code rollouts.
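
      For illustration, a phased rollout gate doesn't have to be elaborate. Here's a hypothetical sketch where push_config, healthy, and rollback_config are stand-in names for whatever your config system actually provides, not a real API; defining the function runs nothing by itself:

```shell
# Hypothetical sketch of a phased config rollout: push to progressively
# larger slices of the fleet, gated by a health check after each phase.
# push_config, healthy, and rollback_config are stand-ins, not a real API.
rollout_config() {
  config=$1
  for phase in canary 5 25 100; do      # canary first, then % of fleet
    push_config "$config" "$phase" || return 1
    sleep 300                           # bake time before widening
    healthy "$phase" || { rollback_config "$config"; return 1; }
  done
}
```

      The point is the shape: each widening of the blast radius is gated on the health of the previous phase, and a failure rolls back instead of propagating further.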

      • By crazygringo 2025-11-1818:501 reply

        The big dogs absolutely do phase config rollouts as a general rule.

        There are still two weaknesses:

        1) Some configs are inherently global and cannot be phased. There's only one place to set them. E.g. if you run a webapp, this would be configs for the load balancer as opposed to configs for each webserver.

        2) Some configs have a cascading effect -- even though a config is applied to 1% of servers, it affects the other servers they interact with, and a bad thing spreads across the entire network

        • By creatonez 2025-11-1821:121 reply

          > Some configs are inherently global and cannot be phased

          This is also why "it is always DNS". It's not that DNS itself is particularly unreliable, but rather that it is the one area where you can really screw up a whole system by running a single command, even if everything else is insanely redundant.

          • By __turbobrew__ 2025-11-1821:282 reply

            I don't believe there is anything that necessarily requires DNS configs to be global.

            You can shard your service behind multiple names:

            my-service-1.example.com

            my-service-2.example.com

            my-service-3.example.com …

            Then you can create smoke tests which hit each phase of the DNS, and if you start getting errors you stop the rollout of the service.
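
            A sketch of that smoke-test gate, assuming a hypothetical /healthz endpoint on each shard (nothing runs until the function is called):

```shell
# Sketch: probe each DNS shard's endpoint and halt the rollout on the
# first failure. The /healthz path is a hypothetical health endpoint.
smoke_test_shards() {
  for host in "$@"; do
    if ! curl -fsS "https://$host/healthz" >/dev/null; then
      echo "smoke test failed for $host; halting rollout" >&2
      return 1
    fi
  done
}

# usage: smoke_test_shards my-service-1.example.com my-service-2.example.com
```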

            • By creatonez 2025-11-1821:451 reply

              Sure, but that doesn't really help for user-facing services where people expect to either type a domain name in their browser or click on a search result, and end up on your website every time.

              And the access controls of DNS services are often (but not always) not fine-grained enough to actually prevent someone from ignoring the procedure and changing every single subdomain at once.

              • By __turbobrew__ 2025-11-1822:042 reply

                > Sure, but that doesn't really help for user-facing services where people expect to either type a domain name in their browser or click on a search result, and end up on your website every time.

                It does help. For example, at my company we have two public endpoints:

                company-staging.com and company.com

                We roll out changes to company-staging.com first and have smoke tests which hit that endpoint. If the smoketests fail we stop the rollout to company.com.

                Users hit company.com

                • By cowsandmilk 2025-11-1822:541 reply

                  That doesn't help with rolling out updates to the DNS for company.com, which is the point here. It's always DNS because your pre-production smoke tests can't test your production DNS configuration.

                  • By FLHerne 2025-11-1917:371 reply

                    If I'm understanding it right, the idea is that the DNS configuration for company-staging.com is identical to that for company.com - same IPs and servers, DNS provider, domain registrar. Literally the only differences are s/company/company-staging/, all accesses should hit the same server with the same request other than the Host header.

                    Then you can update the DNS configuration for company-staging.com, and if that doesn't break there's very little scope for the update to company.com to go differently.
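                    One way to verify the two zones really are mirrored is to compare what they resolve to (a hypothetical sketch: the resolver is mocked so it runs offline; a real `lookup` would be `dig +short "$1" A | sort`):

                        #!/bin/sh
                        # Sketch: assert company-staging.com and company.com
                        # resolve identically, so a DNS change validated on
                        # staging should behave the same on production.
                        lookup() {
                          # Mocked resolver; real version: dig +short "$1" A | sort
                          case "$1" in
                            company-staging.com|company.com) printf '192.0.2.10\n192.0.2.11\n' ;;
                          esac
                        }

                        if [ "$(lookup company-staging.com)" = "$(lookup company.com)" ]; then
                          echo "zones match"
                        else
                          echo "zones differ -- staging is not a faithful mirror" >&2
                          exit 1
                        fi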

                    • By creatonez 2025-11-2015:26

                      The purpose of a staged rollout is to test things with some percentage of actual real-world production traffic, after having already thoroughly tested things in a private staging environment. Your staging URL doesn't see that traffic, unless the public happens to know about it.

                      The scope for it to go wrong is the differences in real-world and simulation.

                      It's a good thing to have, but not a replacement for the concept of staged rollout.

            • By crazygringo 2025-11-1823:06

              But users are going to example.com. Not my-service-33.example.com.

              So if you've got some configuration that has a problem that only appears at the root-level domain, no amount of subdomain testing is going to catch it.

      • By siegecraft 2025-11-1818:183 reply

        I think it's uncharitable to jump to the conclusion that just because there was a config-based outage they don't do phased config rollouts. And even more uncharitable to compare them to crowdstrike.

        • By __turbobrew__ 2025-11-1819:47

          I have read several cloudflare postmortems and my confidence in their systems is pretty low. They used to run their entire control plane out of a single datacenter which is amateur hour for a tech company that has over $60 billion in market cap.

          I also don’t understand how it is uncharitable to compare them to crowdstrike as both companies run critical systems that affect a large number of people’s lives, and both companies seem to have outages at a similar rate (if anything, cloudflare breaks more often than crowdstrike).

        • By __turbobrew__ 2025-11-190:37

          https://blog.cloudflare.com/18-november-2025-outage/

          > The larger-than-expected feature file was then propagated to all the machines that make up our network

          > As a result, every five minutes there was a chance of either a good or a bad set of configuration files being generated and rapidly propagated across the network.

          I was right. Global config rollout with bad data. Basically the same failure mode of crowdstrike.

        • By cyberpunk 2025-11-1818:46

          It seems fairly logical to me? If a config change causes services to crash then the rollout stops … at least in every phased rollout system I've ever built…

      • By JohnMakin 2025-11-1816:44

        In a company I am no longer with I argued much the same when we rolled out "global CI/CD" on IAC. You made one change, committed and pushed, wham it's on 40+ server clusters globally. I hated it. The principal was enamored with it, "cattle not pets" and all that, but the result was things slowed down considerably because anyone working with it became so terrified of making big changes.

      • By wbl 2025-11-1817:15

        Then you get customer visible delays.

      • By immibis 2025-11-1820:30

        Because adversaries adapt quickly, they have a system that deploys their counter-adversary bits quickly without phasing - no matter whether they call them code or configs. See also: Crowdstrike.

      • By himinlomax 2025-11-1822:24

        You can't protect against _latent bugs_ with phased rollouts.

    • By JohnMakin 2025-11-1816:42

      Wish this could rocket to the top of the comment thread, digging through hundreds of comments speculating about a cyberattack to find this felt silly

    • By imdsm 2025-11-1816:192 reply

      Configuration changes are dangerous for CF it seems, and knocked down $NET almost 4% today. I wonder what the industry wide impact is for each of these outages?

      • By sammy2255 2025-11-1818:401 reply

        Pre-market was red for all tech stocks today, before the outage even happened.

        • By hbbio 2025-11-190:34

          Yes, if anything it's bullish on CloudFlare because many investors don't realize how pervasive it is.

      • By nobody9999 2025-11-190:32

        >Configuration changes are dangerous for CF it seems, and knocked down $NET almost 4% today. I wonder what the industry wide impact is for each of these outages?

        This is becoming the "new normal." It seems like every few months, there's another "outage" that takes down vast swathes of internet properties, since they're all dependent on a few platforms and those platforms are, clearly, poorly run.

        This isn't rocket surgery here. Strong change management, QA processes and active business continuity planning/infrastructure would likely have caught this (or not), as is clear from other large platforms that we don't even think about because outages are so rare.

        Like airline reservations systems[0], credit card authorization systems from VISA/MasterCard, American Express, etc.

        Those systems (and others) have outages in the "once a decade" or even much, much, longer ranges. Are the folks over at SABRE and American Express that much smarter and better than Cloudflare/AWS/Google Cloud/etc.? No. Not even close. What they are is careful as they know their business is dependent on making sure their customers can use their services anytime/anywhere, without issue.

        It amazes me the level of "Stockholm Syndrome"[1] expressed by many posting to this thread, expressing relief that it wasn't "an attack" and essentially blaming themselves for not having the right tools (API keys, etc.) to recover from the gross incompetence of, this time at least, Cloudflare.

        I don't doubt that I'll get lots of push back from folks claiming, "it's hard to do things at scale," and/or "there are way too many moving parts," and the like.

        Other organizations, like the ones I mention above, don't screw their customers every 4-6 months with (clearly) insufficiently tested configuration and infrastructure changes.

        Yet many here seem to think that's fine, even though such outages are often crushing to their businesses. But if the customers of these huge providers don't demand better, they'll only get worse. And that's not (at least in my experience) a very deep or profound idea.

        [0] https://en.wikipedia.org/wiki/Airline_reservations_system

        [1] https://en.wikipedia.org/wiki/Stockholm_syndrome

HackerNews