It looks like a central service @ Google called Chemist is down.
"Chemist checks the project status, activation status, abuse status, billing status, service status, location restrictions, VPC Service Controls, SuperQuota, and other policies."
-> This would totally explain the error messages "visibility check (of the API) failed" and "cannot load policy" and the wide range of services affected.
cf. https://cloud.google.com/service-infrastructure/docs/service...
EDIT: Google says "(Google Cloud) is down due to Identity and Access Management Service Issue"
There are multiple internet services down, not just GCP. It's possible that this "Chemist" service is particularly exposed to whatever external issue is going on, which is why the failures are propagating to their internal GCP network services.
Absolutely possible. Though there is something curious:
https://www.cloudflarestatus.com/
At Cloudflare it started with: "Investigating - Cloudflare engineering is investigating an issue causing Access authentication to fail.".
So this would somewhat validate the theory that auth/quotas started failing right after Google, but what happened next?! Pure snowballing? That sounds a bit crazy.
From the Cloudflare incident:
> Cloudflare’s critical Workers KV service went offline due to an outage of a 3rd party service that is a key dependency. As a result, certain Cloudflare products that rely on KV service to store and disseminate information are unavailable [...]
Surprising, but not entirely implausible for a GCP outage to spread to CF.
> outage of a 3rd party service that is a key dependency.
Good to know that Cloudflare has services seemingly based on GCP with no redundancy.
Probably unintentional. "We just read this config from this URL at startup" can easily snowball into "if that URL is unavailable, this service will go down globally, and all running instances will fail to restart when the devops team try to do a pre-emptive rollback"
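Roughly the failure mode, as a minimal Python sketch (the URL and names are made up, not anyone's actual service):

    import json
    import sys
    import urllib.request

    CONFIG_URL = "https://config.internal.example/service.json"  # hypothetical endpoint

    def load_config_or_die():
        # The fragile pattern: fetch the config over the network at startup and
        # refuse to start if the fetch fails. Every instance that restarts during
        # the outage hits the same wall, so a pre-emptive rollback can't come up either.
        with urllib.request.urlopen(CONFIG_URL, timeout=5) as resp:
            return json.load(resp)

    if __name__ == "__main__":
        try:
            config = load_config_or_die()
        except Exception as exc:
            print(f"config fetch failed, not starting: {exc}", file=sys.stderr)
            sys.exit(1)  # an outage at the config host becomes an outage of this service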
After reading about Cloudflare's infra in post mortems, it has always been surprising how immature their stack is. Like, they used to run their entire global control plane in a single failure domain.
I'm not sure who is running the show there, but the whole thing seems kinda shoddy given Cloudflare's position as the backbone of a large portion of the internet.
I personally work at a place with a smaller market cap than Cloudflare, and we were hit by the exact same incident (datacenter power went out) and had almost no downtime, whereas the entire Cloudflare API was down for nearly a day.
Nice job keeping your app up during the outage but I'm not sure you can say "the whole thing seems kinda shoddy" when they're handling the amount of traffic they are.
What's the alternative here? Do you want them to replicate their infrastructure across different cloud providers with automatic fail-over? That sounds -- heck -- I don't know if modern devops is really up to that. It would probably cause more problems than it would solve...
They're a company that has to run their own datacenters, you'd expect them to not fall over when a public cloud does.
I was really surprised. Depending on another enterprise's cloud services is risky in general, I think, but pretty much everyone does it these days; I just didn't expect Cloudflare to be among them.
Well, at some level you can also contract to deploy private instances of clouds.
AWS has Outpost racks that let you run AWS instances and services in your own datacenter managed like the ones running in AWS datacenters. Neat but incredibly expensive.
> What's the alternative here? Do you want them to replicate their infrastructure
Cloudflare advertises themselves as _the_ redundancy / CDN provider. Don't ask me for an "alternative"; tell them to get their backend infra shit in order.
There are roughly 20-25 major IaaS providers in the world that should have close to no dependency on each other. I'm almost certain Cloudflare believed that was their posture, and that the action items coming out of this post mortem will be to make sure that it actually is.
I would expect them to not rely on GCP at all
Google is an advertising company not a tech company. Do not rely on them performing anything critical that doesn't depend on ad revenue.
Redundancy ≠ immune to failure.
Content Delivery Thread
Doesn't Cloudflare have its own infrastructure? It's wild to me that both of these things are down, presumably together, with this size of blast radius.
Cloudflare isn't a cloud in the traditional sense; it's a CDN with extra smarts in the CDN nodes. CF's comparative advantage is in doing clever things with just-big-enough shared-nothing clusters deployed at every edge POP imaginable; not in building f-off huge clusters out in the middle of nowhere that can host half the Internet, including all their own services.
As such, I wouldn't be overly surprised if all of CF's non-edge compute (including, for example, their control plane) is just tossed onto a "competitor" cloud like GCP. To CF, that infra is neither a revenue center, nor a huge cost center worth OpEx-optimizing through vertical integration.
But then you do expose yourself to huge issues like this if your control plane is dependent on a single cloud provider, especially for a company that wants to be THE reverse proxy and CDN for the internet, no?
Cloudflare does not actually want to reverse proxy and CDN the whole internet. Their business model is B2B; they make most of their revenue from a set of companies who buy at high price points and represent a tiny percentage of the total sites behind CF.
Scale is just a way to keep costs low. In addition to economies of scale, routing tons of traffic puts them in position to negotiate no-cost peering agreements with other bandwidth providers. Freemium scale is good marketing too.
So there is no strategic reason to avoid dependencies on Google or other clouds. If they can save costs that way, they will.
Well I mean most of the internet in terms of traffic, not in terms of the corpus of sites. I agree the long-tail of websites is probably not profitable for them.
True, but how often do outages like this happen? And when outages do happen, does Cloudflare have any more exposure than Google? I mean, if Google can’t handle it, why should Cloudflare be expected to? It also looks like the Cloudflare services have been somewhat restored, so whatever dependency there is looks like it’s able to be somewhat decoupled.
So long as the outages are rare, I don’t think there is much downside for Cloudflare to be tied to Google cloud. And if they can avoid the cost of a full cloud buildout (with multiple data centers and zones, etc…), even better.
They're pushing Workers more as a compute platform.
Plus their past outage reports indicate they should be running their own DC: https://blog.cloudflare.com/major-data-center-power-failure-...
Latest Cloudflare status update basically confirms that there is a dependency on GCP in their systems:
"Cloudflare’s critical Workers KV service went offline due to an outage of a 3rd party service that is a key dependency. As a result, certain Cloudflare products that rely on KV service to store and disseminate information are unavailable"
They lightly mentioned it in this interview a few weeks ago as well - I was surprised! https://youtu.be/C5-741uQPVU?t=1726s
Yeah I saw that now too. Interesting, I'm definitely a little surprised that they have this big of an external dependency surface.
Definitely very surprised to see that so many of the CF products that are there to compete with the big cloud providers have such a dependence on GCP.
You'd think so wouldn't you?
Down Detector also reports Azure and Oracle Cloud; I can't see them also being dependent on GCP...
I guess Down Detector isn't a full source of truth though.
https://ocistatus.oraclecloud.com/#/ https://azure.status.microsoft/en-gb/status
Both green
Down Detector has a problem when whole clouds go down: unexpected dependencies. You see an app on a non-problematic cloud having trouble and report it to Down Detector, but that cloud is actually fine; their own stuff is running fine. What is really happening is that the app you are using depends on a different SaaS provider who runs on the problematic cloud, and that is what is killing it.
It's often things like "we got backpressure like we're supposed to, so we gave the end user an error because the processing queue had built up above threshold, but it was because waiting for the timeout from SaaS X slowed down the processing so much that the queue built up." (Have the scars from this more than once.)
Surely if you build a status detector you realize that colo or dedicated are your only options, no? Obviously you cannot host such a service in the cloud.
I'm not even talking about Down Detector's own infra being down; I'm talking about actual legitimate complaints from real users (which is the data that Down Detector collates and displays), because the app they are trying to use on an unaffected cloud is legitimately sending them an error. Because of SaaS dependencies and the nature of distributed systems, one cloud going down can have a blast radius such that even apps on unaffected clouds will have elevated error rates, and that can end up confusing the displays on Down Detector when large enough things go down.
My apps run on AWS, but we use third parties for logging, auth support, billing, things like that. Some of those could well be on GCP, though we didn't see any elevated error rates. Our system is resilient against those being down: after a couple of failed tries to connect, it will dump what it was trying to send into a dump file for later re-sending. Most engineers will do that. But I've learned after many bad experiences that past a certain threshold of failures to connect to one of these outside systems, my system should just skip calling out except for once every retryCycleTime, because otherwise all it does is add two connectionTimeouts to every processing loop, building up messages in the processing queue, which eventually creates backpressure up to the user. If you don't have that level of circuit breaker built in, you can cause your own systems to give out higher error rates even if you are on an unaffected cloud.
So today a whole lot of systems that are not on GCP discovered the importance of the circuit breaker design pattern.
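A minimal Python sketch of that kind of circuit breaker (hypothetical names; retry_cycle_time plays the role of the retryCycleTime mentioned above, and the threshold is a placeholder, not anyone's production setting):

    import time

    class CircuitBreaker:
        """Skip calls to a flaky outside service instead of eating its timeouts."""

        def __init__(self, failure_threshold=3, retry_cycle_time=60.0):
            self.failure_threshold = failure_threshold  # failures before we stop calling out
            self.retry_cycle_time = retry_cycle_time    # seconds between probes once open
            self.failures = 0
            self.opened_at = None                       # None means the circuit is closed

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                # Circuit is open: only probe once per retry_cycle_time; otherwise
                # fail fast without paying the connection timeout.
                if time.monotonic() - self.opened_at < self.retry_cycle_time:
                    raise RuntimeError("circuit open; skipping call")
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()
                raise
            else:
                # A success closes the circuit again.
                self.failures = 0
                self.opened_at = None
                return result

The processing loop would wrap its SaaS calls in breaker.call(...) and, when the circuit is open, dump the payload to a file for later re-sending instead of paying two connection timeouts per message.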
Down Detector can have a poor signal to noise ratio given from what I am assuming is users submitting "this is broken" for any particular app. Probably compounded by many hearing of a GCP issue, checking their own cloud service, and reporting the problem at the same time.
Using Azure here, no issues reported so far.
perhaps the person who maintains Chemist took the buyout
https://www.businessinsider.com/google-return-office-buyouts...
I use Expo as an intermediary for notifications, but given this Google context, I imagine FCM is also suffering. Is that possible?
Very likely. Firebase Auth is down for sure (though unreported yet), so most likely FCM too
The whole of Firebase was actually down.
[dead]
Getting a lot of errors for Claude Sonnet 4 (Cursor) and Gemini Pro.
Nooooo I'm going to have to use my brain again and write 100% of my code like a caveman from December 2024.
Same here. Getting this in AI Studio: Failed to generate content: user has exceeded quota. Please try again later.
[flagged]
Don't forget "barely intelligible"
generating computer code, duh.
95% of enterprise software coding is molding received data into a schema acceptable to be sent further.
that said, coding is like 15% (or 0% in some cases) of an enterprise software engineer's workload.
I was in the middle of testing Cloud Storage file uploads, so I guess this is a good time to go for a walk.
A good excuse for adding error handling, which otherwise is often overlooked, heh.
Cursor throwing some errors for me in Auto Agent mode too.
Devs before June 12, 2025: "Ai? Pfft, hallucination central. They'll never replace me!"
Devs during June 12, 2025 GCP outage: "What, no AI?! Do you think I'm a slave?!"
100% agree... I even thought "ok, maybe I'll clean up the backlog while I wait", but I'm so used to using AI even to clean up my JIRA backlog (via the Atlassian MCP) that it feels weird to click into each ticket the way I used to do it TWO MONTHS AGO.
This is a good wake-up call on how easily (and quickly) we can all become pretty dependent on these tools.
Local LLMs would work.
It appears that "Devs" is not a homogeneous mass.
Goomba fallacy
So true
openrouter.ai is down for me
GPT is working in agent mode, which kind of confirms that Claude is hosted on Google and GPT probably on MSFT servers / self-hosted.
If you want a stronger confirmation about Claude being hosted on GCP, this is about as authoritative as it gets: https://www.anthropic.com/news/anthropic-partners-with-googl...
That's nearly 2.5 years old, an eternity in this space. It may still be true, but that article is not good evidence.
Claude runs on AWS afaik. And OAI on Azure. Edit: oh okay maybe GCP too then. I’m personally having no problem using Claude Code though.
lmao i refuse to write code by hand anymore too. WHAT IS THIS
Google's local models as well (Gemini Nano/Gemma 3n)
Cloudflare is down too. From https://www.cloudflarestatus.com:
Update - We are seeing a number of services suffer intermittent failures. We are continuing to investigate this and we will update this list as we assess the impact on a per-service level.
Impacted services: Access, WARP, Durable Objects (SQLite-backed Durable Objects only), Workers KV, Realtime, Workers AI, Stream, parts of the Cloudflare dashboard.
Jun 12, 2025 - 18:48 UTC
Seems like a major wtf if Cloudflare is using GCP as a key dependency.
Some day Cloudflare will depend on GCP and GCP will depend on Cloudflare and AWS will rely on one of the two being online and Cloudflare will also depend on AWS and the internet will go down and no one will know how to restart it
Supposedly something like this already happened inside Google. There's a distributed data store for small configs read frequently. There's another for larger configs that are rarely read. The small data store depends on a service that depends on the large data store. The large data store depends on the small data store.
Supposedly there are plans for how to conduct a "cold" start of the system, but as far as I know it's never actually been tried.
The trick there is you take the relevant configs and serialize them to disk periodically, and then in a bootstrap scenario you use the configs on disk.
Presumably you could do this for the infrequently read configs, so the store holding the frequently read configs can bootstrap without the store holding the infrequently read ones.
Yes, this is how I have set up systems to bootstrap.
For example a service discovery system periodically serializes peers to disk, and then if the whole thing falls down we have static IP addresses for a node and the service discovery system can use the last known IPs of peers to bring itself back up.
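A rough Python sketch of that bootstrap path (the cache path and function names are made up; just the shape of the idea, not any specific discovery system):

    import json
    import os

    PEER_CACHE = "/var/cache/myservice/last_known_peers.json"  # hypothetical path

    def save_peers(peers):
        # Periodically serialize the current peer list so it survives a restart.
        os.makedirs(os.path.dirname(PEER_CACHE), exist_ok=True)
        tmp = PEER_CACHE + ".tmp"
        with open(tmp, "w") as f:
            json.dump(peers, f)
        os.replace(tmp, PEER_CACHE)  # atomic swap: a crash can't leave a half-written cache

    def bootstrap_peers(discovery_lookup):
        # Normal path: ask the service discovery system and refresh the on-disk cache.
        # Cold-start path: if discovery itself is down, fall back to the last known
        # peer IPs on disk and bring ourselves up from those.
        try:
            peers = discovery_lookup()
            save_peers(peers)
            return peers
        except Exception:
            with open(PEER_CACHE) as f:
                return json.load(f)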
Just put them in Workers KV... oh wait
That's what IRC is for.
(Its Finnish inventor is incidentally working for Google in Stockholm, as per https://en.wikipedia.org/wiki/Jarkko_Oikarinen)
Don’t worry, we’ll just ask Chat-GPT.
Broken link? EDIT: Weird, definitely was just empty
Should work, but it's also on the front page.