Comments

  • By aeldidi 2025-11-1822:333 reply

    I'm becoming concerned with the rate at which major software systems seem to be failing as of late. For context, last year I only logged four outages that actually disrupted my work; this quarter alone I'm already on my fourth, all within the past few weeks. This is, of course, just an anecdote and not evidence of any wider trend (not to mention that I might not have even logged everything last year), but it was enough to nudge me into writing this today (helped by the fact that I suddenly had some downtime). Keep in mind, this isn't necessarily specific to this outage, just something that's been on my mind enough to warrant writing about it.

    It feels like resiliency is becoming a bit of a lost art in networked software. I've spent a good chunk of this year chasing down intermittent failures at work, and I really underestimated how much work goes into shrinking the "blast radius", so to speak, of any bug or outage. Even though we mostly run a monolith, we still depend on a bunch of external pieces like daemons, databases, Redis, S3, monitoring, and third-party integrations, and we generally assume that these things are present and working in most places, which wasn't always the case. My response was to better document the failure conditions, and once I did, I realized there were many more than we initially thought. Since then we've done things like: move some things to a VPS instead of cloud services, automate deployment more than we already had, greatly improve the test suite and docs to include these newly considered failure conditions, and generally cut down on moving parts. It was a ton of effort, but the payoff has finally shown up: our records show fewer surprises, which means fewer distractions and a much calmer system overall. Without that unglamorous work, things would've only grown more fragile as complexity crept in. And I worry that, more broadly, we're slowly un-learning how to build systems that stay up even when the inevitable bug or failure shows up.

    For completeness, here are the outages that prompted this: the AWS us-east-1 outage in October (took down the Lightspeed R series API), the Azure Front Door outage (prevented Playwright from downloading browsers for tests), today’s Cloudflare outage (took down Lightspeed’s website, which some of our clients rely on), and the Github outage affecting basically everyone who uses it as their git host.

    • By HardwareLust 2025-11-1822:376 reply

      It's money, of course. No one wants to pay for resilience/redundancy. I've launched over a dozen projects going back to 2008, clients simply refuse to pay for it, and you can't force them. They'd rather pinch their pennies, roll the dice and pray.

      • By stinkbeetle 2025-11-1823:545 reply

        > It's money, of course.

        100%

        > No one wants to pay for resilience/redundancy. I've launched over a dozen projects going back to 2008, clients simply refuse to pay for it, and you can't force them. They'd rather pinch their pennies, roll the dice and pray.

        Well, fly by night outfits will do that. Bigger operations like GitHub will try to do the math on what an outage costs vs what better reliability costs, and optimize accordingly.

        Look at a big bank or a big corporation's accounting systems, they'll pay millions just for the hot standby mainframes or minicomputers that, for most of them, would never be required.

        • By solid_fuel 2025-11-192:48

          > Bigger operations like GitHub will try to do the math on what an outage costs vs what better reliability costs, and optimize accordingly.

          Used to, but it feels like there is no corporate responsibility in this country anymore. These monopolies have gotten so large that they don't feel any impact from these issues. Microsoft is huge and doesn't really have large competitors. Google and Apple aren't really competing in the source code hosting space in the same way GitHub is.

        • By collingreen 2025-11-194:09

          > Take the number of vehicles in the field, A, multiply it by the probable rate of failure, B, then multiply it by the result of the average out of court settlement, C. A times B times C equals X. If X is less than the cost of a recall, we don't do one.

          https://youtu.be/SiB8GVMNJkE

        • By closeparen 2025-11-1918:33

          Modern internet company backends are very complex, even on a good day they're at the outer limits of their designers' and operators' understanding, & every day they're growing and changing (because of all the money and effort that's being spent on them!). It's often a short leap to a state that nobody thought of as a possibility or fully grasped the consequences of. It's not clear that it would be practical with any amount of money to test or rule out every such state in advance. Some exciting techniques are being developed in that area (Antithesis, formal verification, etc) but that stuff isn't standard of care for a working SWE yet. Unit tests and design reviews only get you so far.

        • By csomar 2025-11-1910:34

          > Look at a big bank or a big corporation's accounting systems

          Not my experience. Every bank I've used, in multiple countries, has had multiple significant outages, some where their cards stopped working entirely. Do a search for "U.S. Bank outage" to see how many outages have happened so far this year.

        • By Jenk 2025-11-190:061 reply

          I've worked at many big banks and corporations. They are all held together with the proverbial sticky tape, bubblegum, and hope.

          They do have multiple layers of redundancy, and thus the big budgets, but those layers won't be kept hot, or there will be some critical flaws that all of the engineers know about but haven't been given permission/funding to fix, and the engineers are so badly managed by the firm that they dgaf either and secretly want the thing to burn.

          There will be sustained periods of downtime if their primary system blips.

          They will all still be dependent on some hyper-critical system that nobody really understands, where the last change was introduced in 1988 and which (probably) requires a terminal emulator to operate.

          • By stinkbeetle 2025-11-190:28

            I've worked on software used by these and have been called in to help support from time to time. One customer which is a top single digit public company by market cap (they may have been #1 at the time, a few years ago) had their SAP systems go down once every few days. This wasn't causing a real monetary problem for them because their hot standby took over.

            They weren't using mainframes, just "big iron" servers, but each one would have been north of $5 million for the box alone, I guess on a 5ish year replacement schedule. Then there's all the networking, storage, licensing, support, and internal administration costs for it which would easily cost that much again.

            Now people will say SAP systems are made entirely of duct tape and bubblegum. But it all worked. This system ran all their sales/purchasing sites and portals and was doing a million dollars every couple of minutes, so it all paid for itself many times over during the course of that bug. Cold standby would not have cut it. Especially since these big systems take many minutes to boot and HANA takes even longer to load from storage.

      • By lopatin 2025-11-1823:031 reply

        I agree that it's all money.

        That's why it's always DNS right?

        > No one wants to pay for resilience/redundancy

        These companies do take it seriously, on the software side, but when it comes to configurations, what are you going to do:

        Either play it by ear, or literally double your cloud costs for a true, real prod-parallel to mitigate that risk. It looks like even the most critical and prestigious companies in the world are doing the former.

        • By macintux 2025-11-190:521 reply

          > Either play it by ear, or literally double your cloud costs for a true, real prod-parallel to mitigate that risk.

          There's also the problem that doubling your cloud footprint to reduce the risk of a single point of failure introduces new risks: more configuration to break, new modes of failure when both infrastructures are accidentally live and processing traffic, etc.

          Back when companies typically ran their own datacenters (or otherwise heavily relied on physical devices), I was very skeptical about redundant switches, fearing the redundant hardware would cause more problems than it solved.

          • By paulddraper 2025-11-195:53

            Complexity breeds bugs.

            Which is why the “art” of engineering is reducing complexity while retaining functionality.

      • By 1718627440 2025-11-1919:37

        I'm not sure it's only money. People could have a lot of simpler, cheaper software by relying on core (OS) features instead of rolling their own or relying on bloated third parties, but a lot don't, due to cargo culting.

      • By Wowfunhappy 2025-11-1916:44

        …can I make the case that this might be reasonable? If you’re not running a hospital†, how much is too much to spend to avoid a few hours of downtime roughly once a year?

        † Hopefully there aren’t any hospitals that depend on GitHub being continuously available?

      • By raxxorraxor 2025-11-1914:39

        And tech hype. The infrastructure to mitigate this isn't expensive; in many cases it's quite the opposite. The expensive thing is that you made yourself dependent on these services. Sometimes this is inevitable, but hosting on GitHub is a choice.

      • By ForHackernews 2025-11-1823:051 reply

        Why should they? Honestly most of what we do simply does not matter that much. 99.9% uptime is fine in 99.999% of cases.

        • By porridgeraisin 2025-11-190:55

          This is true. But unfortunately the exact same process is used even for critical stuff (the CrowdStrike incident, for example). Maybe there needs to be a separate SWE process for those things as well, just like there is for aviation. That means not using the same dev tooling, which is a lot of effort.

    • By roxolotl 2025-11-191:15

      To agree with the comments it seems likely it's money which has begun to result in a slow "un-learning how to build systems that stay up even when the inevitable bug or failure shows up."

    • By suddenlybananas 2025-11-1822:433 reply

      To be deliberately provocative, LLMs are being more and more widely used.

      • By zdragnar 2025-11-191:012 reply

        Word on the street is github was already a giant mess before the rise of LLMs, and it has not improved with the move to MS.

        • By dsagent 2025-11-191:24

          They are also in the process of moving most of the infra from on-prem to Azure. I'm sure we'll see more issues over the next couple of months.

          https://thenewstack.io/github-will-prioritize-migrating-to-a...

        • By array_key_first 2025-11-1914:01

          I don't know anything about GitHub's codebase, but as a user, their software has many obvious deficiencies, the most glaring being performance. Oh my God, GitHub performs like absolute shit on large repos and big diffs.

          Performance issues always scare me. A lot of the time it's indicative of fragile systems. Like with a lot of banking software - the performance is often bad because the software relies on 10 APIs to perform simple tasks.

          I doubt this is the case with GitHub, but it still makes you wonder about their code and processes. Especially when it's been a problem for many years, with virtually no improvement.

      • By Tadpole9181 2025-11-195:47

        To be deliberately provocative, so is offshoring work.

      • By blibble 2025-11-1822:59

        imagine what it'll be like in 10 years time

        Microsoft: the film Idiocracy was not supposed to be a manual

  • By mandus 2025-11-1821:069 reply

    Good thing git was designed as a decentralized revision control system, so you don’t really need GitHub. It’s just a nice convenience

    • By jimbokun 2025-11-1821:272 reply

      As long as you didn't go all in on GitHub Actions. Like my company has.

      • By IshKebab 2025-11-1821:4310 reply

        Do you think you'd get better uptime with your own solution? I doubt it. It would just be at a different time.

        • By wavemode 2025-11-1821:552 reply

          Uptime is much, much easier at low scale than at high scale.

          The reason for buying centralized cloud solutions is not uptime, it's to save the headache of developing and maintaining the thing.

          • By manquer 2025-11-192:15

            It is easier until things go down.

            Meaning the cloud may go down more frequently than small-scale self-deployments, but downtimes are on average much shorter on the cloud. A lot of money is at stake for cloud providers, so GitHub et al. have far more resources to throw at fixing a problem than you or I do when self-hosting.

            On the other hand, when things go down self-hosted, it is far more difficult or expensive to have on-call engineers who can actually restore services quickly.

            The skill needed to understand and fix a problem is limited, so it takes semi-skilled talent longer to do so, even though the failure modes are simpler (but not simple).

            The skill difference between setting up something that works locally and something that works reliably is vast, and the talent for the latter is scarce and hard to find or retain.

          • By tyre 2025-11-1822:261 reply

            My reason for centralized cloud solutions is also uptime.

            Multi-AZ RDS is 100% higher availability than me managing something.

            • By wavemode 2025-11-1822:401 reply

              Well, just a few weeks ago we weren't able to connect to RDS for several hours. That's way more downtime than we ever had at the company I worked for 10 years ago, where the DB was just running on a computer in the basement.

              Anecdotal, but ¯\_(ツ)_/¯

              • By sshine 2025-11-190:02

                An anecdote that repeats.

                Most software doesn’t need to be distributed. But it’s the growth paradigm where we build everything on principles that can scale to world-wide low-latency accessibility.

                A UNIX pipe gets replaced with a $1200/mo. maximum IOPS RDS channel, bandwidth not included in price. Vendor lock-in guaranteed.

        • By jakewins 2025-11-1821:552 reply

          “Your own solution” should be that CI isn’t doing anything you can’t do on developer machines. CI is a convenience that runs your Make or Bazel or Just builds (or whatever you prefer), and your production systems work fine without it.

          I’ve seen that work first hand to keep critical stuff deployable through several CI outages, and it also has the upside of making it trivial to debug “CI issues”, since you can run the same target locally.
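
          A minimal sketch of that shape (the script names are hypothetical): the CI config's only job is to call the same entry point any developer can run locally.

              #!/bin/sh
              # ci.sh -- the only thing the CI system ever invokes
              set -e
              ./scripts/lint.sh    # hypothetical lint step
              ./scripts/test.sh    # hypothetical test step
              ./scripts/build.sh   # hypothetical build step
              # The CI config then reduces to a single step ("run: ./ci.sh"),
              # and during a CI outage any developer can run the same gate locally:
              #   $ ./ci.sh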

          • By IshKebab 2025-11-198:291 reply

            > should be that CI isn’t doing anything you can’t do on developer machines

            You should aim for this but there are some things that CI can do that you can't do on your own machine, for example running jobs on multiple operating systems/architectures. You also need to use CI to block PRs from merging until it passes, and for merge queues/trains to prevent races.

          • By CGamesPlay 2025-11-191:171 reply

            Yes, this, but it’s a little more nuanced because of secrets. Giving every employee access to the production deploy key isn’t exactly great OpSec.

            • By 1718627440 2025-11-1914:10

              Every Linux desktop system has a keychain implementation. You can of course always use your own system, if you don't like that. You can use different keys and your developers don't need access to the real key, until all the CI servers are down.

        • By deathanatos 2025-11-1823:26

          Yes. I've quite literally run a self-hosted CI/CD solution, and yes, in terms of total availability, I believe we outperformed GHA when we did so.

          We moved to GHA b/c nobody ever got fired ^W^W^W^W leadership thought eng running CI was not a good use of eng time. (Without much question into how much time was actually spent on it… which was pretty close to none. Self-hosted stuff has high initial cost for the setup … and then just kinda runs.)

          Ironically, one of our self-hosted CI outages was caused by Azure — we have to get VMs from somewhere, and Azure … simply ran out. We had to swap to a different AZ to merely get compute.

          The big upside to a self-hosted solution is that when stuff breaks, you can hold someone over the fire. (Above, that would be me, unfortunately.) With Github? Nobody really cares unless it is so big, and so severe, that they're more or less forced to, and even then, the response is usually lackluster.

        • By tcoff91 2025-11-1821:461 reply

          Compared to 2025 GitHub, yeah, I do think most self-hosted CI systems would be more available. GitHub goes down weekly lately.

          • By Aperocky 2025-11-1822:061 reply

            Aren't they halting all work to migrate to Azure? That does not sound like an easy thing to do, and it feels quite likely to cause unexpected problems.

            • By macintux 2025-11-190:531 reply

              I recall the Hotmail acquisition and the failed attempts to migrate the service to Windows servers.

        • By Borg3 2025-11-199:08

          10:08:19 up 2218 days, 22:11, 4 users, load average: 0.00, 0.00, 0.00

          It just workz [;

        • By nightski 2025-11-1821:50

          I mean yes. We've hosted internal apps that have four nines reliability for over a decade without much trouble. It depends on your scale of course, but for a small team it's pretty easy. I'd argue it is easier than it has ever been because now you have open source software that is containerized and trivial to spin up/maintain.

          The downtime we do have each year is typically also on our terms, not in the middle of a work day or at a critical moment.

        • By prescriptivist 2025-11-1822:30

          It's fairly straightforward to build resilient, affordable and scalable pipelines with DAG orchestrators like tekton running in kubernetes. Tekton in particular has the benefit of being low level enough that it can just be plugged into the CI tool above it (jenkins, argo, github actions, whatever) and is relatively portable.

        • By davidsainez 2025-11-1821:47

          Doesn’t have to be an in-house system; just basic redundancy is fine, e.g. a simple hook that pushes to both GitHub and GitLab.
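
          A minimal sketch of one way to get that, using a second push URL rather than a hook (the repo URLs and branch are placeholders); a single push then updates both hosts:

              # add both hosts as push URLs on the existing "origin" remote
              git remote set-url --add --push origin git@github.com:example/repo.git
              git remote set-url --add --push origin git@gitlab.com:example/repo.git

              # one push now goes to GitHub and GitLab
              git push origin main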

        • By 1718627440 2025-11-1914:06

          With a build system that can run on any Linux machine, and is only invoked by the CI configuration? Even if all your servers go down, you just run it on any developer's machine.

        • By Zambyte 2025-11-2014:02

          Reproducible builds have a pretty good track record for uptime :-)

      • By esafak 2025-11-1821:35

        Then your CI host is your weak point. How many companies have multi-cloud or multi-region CI?

    • By __MatrixMan__ 2025-11-1821:462 reply

      This escalator is temporarily stairs, sorry for the convenience.

      • By Akronymus 2025-11-1822:041 reply

        Tbh, I personally don't trust a stopped escalator. Some of the videos of brake failures on them scared me off of ever going on them.

        • By collingreen 2025-11-1822:122 reply

          You've ruined something for me. My adult side is grateful but the rest of me is throwing a tantrum right now. I hope you're happy with what you've done.

          • By rvnx 2025-11-1822:242 reply

            I read a book about elevator accidents; don't.

            • By Akronymus 2025-11-1911:09

              With people properly using them or not?

              I am fairly certain that the vast majority comes from improper use (bypassing safety measures, like riding on top of the cabin) or from something going wrong during maintenance.

            • By yjftsjthsd-h 2025-11-1822:541 reply

              elevator accidents or escalator accidents?

              • By rvnx 2025-11-1823:141 reply

                elevators. for escalators, make sure not to watch videos of people falling in "the hole".

          • By Akronymus 2025-11-1823:32

            I am genuinely sorry about that. And no, I am not happy about what I've done.

      • By fishpen0 2025-11-1823:022 reply

        Not really comparable at any compliance or security oriented business. You can't just zip the thing up and sftp it over to the server. All the zany supply chain security stuff needs to happen in CI and not be done by a human or we fail our dozens of audits

        • By goku12 2025-11-1912:52

          While true, the mistake we made was to centralize them. Just imagine if git were centralized software with millions of users connecting through a single domain. I don't care how much easier or flashier it would be, I much prefer to struggle with the current incarnation rather than deal with headaches like these. Sadly, progress towards decentralized alternatives for discussions, issue tracking, patch sharing and CI is rather slow (though they all do exist), because no big investor puts money into them.

        • By __MatrixMan__ 2025-11-1823:112 reply

          Why is it that we trust those zany processes more than each other again? Seems like a good place to inject vulnerabilities to me...

          • By cyberax 2025-11-196:17

            Hi! My name is Jia Tan. Here's a nice binary that I compiled for you!

          • By goku12 2025-11-1912:331 reply

            This isn't really a trust issue. People tend to take shortcuts and commit serious mistakes in the process. Humans are incredibly creative (no, LLMs are nowhere close). But for that, we need the freedom to make mistakes without serious consequences. Automation exists to take away the fatigue of trying to not commit mistakes.

            • By __MatrixMan__ 2025-11-1913:49

              I'm not against automation at all. But if all of the devs build it and get one hash and CI runs it through some gauntlet involving a bunch of third party software that I don't have any reason to trust and out pops an artifact with a different hash, then the CI has interfered with the chain of trust between myself and my user.

              Maybe I've just been unlucky, but so far my experience with CI pipelines that have extra steps in them for compliance reasons is that they are full of actual security problems (like curl | bash, or like how you can poison a CircleCI cache using a branch nobody reviewed and pick up the poisoned dependency on a branch which was reviewed but didn't contain the poison).

              Plus, it's a high value target with an elevated threat model. Far more likely to be attacked than each separate dev machine. Plus, a motivated user might build the software themselves out of paranoia, but they're unlikely to securely self host all the infra necessary to also run it through CI.

              If we want it to be secure, the automation you're talking about needs to be runnable as part of a local build with tightly controlled inputs and deterministic output; otherwise it breaks the chain of trust between user and developer by being a hop in the middle which is more about a pinky promise and less about something you can verify.

    • By keybored 2025-11-1822:16

      I don’t use GitHub that much. I think the “oh no, you have centralized on GitHub” point is a bit exaggerated.[1] But generally, thinking beyond just pushing blobs to the Internet, “decentralization” as in software that lets you do everything that is Not Internet Related locally is just a great thing. So I can never understand people who scoff at Git being decentralized just because “um, actually you end up pushing to the same repository”.

      It would be great to also have the continuous build and test and whatever else you “need” to keep the project going as local alternatives as well. Of course.

      [1] Or maybe there is just that much downtime on GitHub now that it can’t be shrugged off

    • By lopatin 2025-11-1821:381 reply

      The issue is that GitHub is down, not that git is down.

      • By drob518 2025-11-192:09

        Aren’t they the same thing? /sarc

    • By ElijahLynn 2025-11-1821:121 reply

      You just lose the "hub" part: connecting with others and providing a way to collaborate through rich discussions.

      • By parliament32 2025-11-1821:214 reply

        All of those sound achievable by email, which, coincidentally, is also decentralized.

        • By Aurornis 2025-11-1821:331 reply

          Some of my open source work is done on mailing lists through e-mail

          It's more work and slower. I'm convinced half of the reason they keep it that way is because the barrier to entry is higher and it scares contributors away.

        • By awesome_dude 2025-11-1821:41

          Wait, email is decentralised?

          You mean, assuming everyone in the conversation is using different email providers. (ie. Not the company wide one, and not gmail... I think that covers 90% of all email accounts in the company...)

        • By drykjdryj 2025-11-192:371 reply

          Email at a company is very much not decentralized. Most use Microsoft 365, which is also hosted in Azure, i.e. the same cloud GitHub is trying to host its stuff in.

          • By parliament32 2025-11-2017:33

            365 is not hosted in Azure. Some of the admin portals and workflows are, but the normal-employee-facing applications and APIs have their own datacenters.

    • By paulddraper 2025-11-195:581 reply

      For sure.

      You can commit, branch, tag, merge, etc and be just fine.

      Now, if you want to share that work, you have to push.

      • By PhilippGille 2025-11-197:271 reply

        You can push to any other Git server during a GitHub outage to still share work, trigger a CI job, deploy etc, and later when GitHub is reachable again you push there too.

        Yes, you lose some convenience (GitHub's pull request UI can't be used, for example, but you can temporarily use the other Git server's UI for that).

        I think their point was that you're not fully locked in to GitHub. You have the repo locally and can mirror it on any Git remote.

        • By paulddraper 2025-11-1913:541 reply

          For sure, you don’t have to use GitHub to be that shared server.

          It is awfully convenient: web interface, per-branch permissions, and such.

          But you can choose a different server.

          • By 1718627440 2025-11-1914:14

            If your whole network is down, and you also don't want to connect the hosts with an Ethernet cable, you can even just push to a USB stick.
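
            Roughly, as a sketch (the mount point and branch name are placeholders):

                # one-time: create a bare repo on the stick and point a remote at it
                git init --bare /mnt/usb/project.git
                git remote add usb /mnt/usb/project.git

                # share work over the filesystem while the network is down
                git push usb main

                # the other machine pulls or clones from the same stick
                git clone /mnt/usb/project.git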

    • By Conscat 2025-11-1821:422 reply

      I'm on HackerNews because I can't do my job right now.

      • By brokenmachine 2025-11-193:16

        I'm on HN because I don't want to do my job right now.

      • By y42 2025-11-1821:46

        I work in the wrong time zone. Good night.

    • By ramon156 2025-11-1821:133 reply

      SSH also down

      • By gertlex 2025-11-1821:16

        My pushing was failing for reasons I hadn't seen before. I then tried my sanity check of `ssh git@github.com` (I think I'm supposed to throw a -t flag there, but never care to), and that worked.

        But yes, SSH pushing was down; that was my first clue.

        My work laptop had just been rebooted (it froze...) and the CPU was pegged by security software doing a scan (insert :clown: emoji), so I just wandered over to HN and learned of the outage at that point :)

      • By blueflow 2025-11-1821:141 reply

        SSH is as decentralized as git - just push to your own server? No problem.
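
        A minimal sketch, with placeholder host, paths, and branch:

            # on any box you can ssh into: create a bare repo to push to
            ssh you@yourserver 'git init --bare ~/repos/project.git'

            # locally: add it as a fallback remote and keep collaborating
            git remote add fallback you@yourserver:repos/project.git
            git push fallback main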

        • By jimbokun 2025-11-1821:25

          Well, sure, but you can't get any collaborators' commits that were only pushed to GitHub before it went down.

          Well you can with some effort. But there's certainly some inconvenience.

      • By kragen 2025-11-1821:21

        SSH works fine for me. I'm using it right now. Just not to GitHub!

    • By stevage 2025-11-1821:371 reply

      Curious whether you actually think this, or was it sarcasm?

      • By 0x457 2025-11-1821:451 reply

        It was sarcasm, but git itself is a decentralized VCS. Technically speaking, every git checkout is a full repo in itself. GitHub doesn't stop me from having the entire repo history up to the last pull, and I can still push either to the company backup server or to a coworker directly.

        However, since we use github.com for more than just git hosting, it is a SPOF in most cases, and we treat it as a snow day.

        • By stevage 2025-11-1823:14

          Yep, agreed - Issues being down would be a bit of a killer.

  • By grepfru_it 2025-11-191:16

    There was a comment on another GitHub thread that I replied to. I got a response saying it’s absurd how unreliable Gh is when people depend on it for CI/CD. And I think this is the problem. At GitHub the developers think it’s only a problem because their ci/cd is failing. Oh no, we broke GitHub actions, the actions runners team is going to be mad at us! Instead of, oh no, we broke GitHub actions, half the world is down!

    The fact that that larger view is held only by a small sliver of employees is likely why reliability is not a concern. It leads to an every-team-for-themselves mentality: “It’s not our problem, and we won’t make it our problem so we don’t get dinged at review time” (ok, that is the Microsoft attitude leaking).

    Then there’s their entrenched status. Real talk, no one is leaving GitHub. So customers will suck it up and live with it while angry employees grumble on an online forum. I saw this same attitude in major companies like Verio and Verisign in the early 2000s. “Yeah we’re down but who else are you going to go to? Have a 20% discount since you complained. We will only be 1% less profitable this quarter due to it” The kang and kodos argument personified.

    These views are my own and not related to my employer or anyone associated with me.

HackerNews