Rare things become common at scale (2014)

2024-05-21 12:55 10547 longform.asmartbear.com

Software doesn't scale through architecture and automation alone. New, more difficult problems appear that didn't exist before, causing new downstream consequences.


[Cartoon]

Something interesting happens when you run more than 1,000 servers, as we do at WP Engine, powering hundreds of thousands of websites.

Suppose that on average a server experiences one fatal failure every three years. The kernel panics (the Linux equivalent of the Blue Screen of Death), or both the main and redundant power supplies fail, or some other rare event causes an outage. This isn’t a quality issue; this is normal. This isn’t something to “fix.”

Windows NT crashed.
I am the Blue Screen of Death.
No one hears your screams.

—Haiku from FSF

But remember, we have 1,000 servers. Three years is about 1,000 days. So that means, on average, every single day we have a fatal server error.
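
The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration that assumes failures are independent and spread evenly over time, which the article does not state; the function name and the fleet sizes other than 1,000 are chosen purely for illustration.

```python
DAYS_PER_YEAR = 365


def expected_failures_per_day(servers: int, years_between_failures: float = 3.0) -> float:
    """Expected fatal failures per day across a fleet.

    Assumes each server fails independently, on average once every
    `years_between_failures` years (three years, as in the article).
    """
    per_server_daily_rate = 1.0 / (years_between_failures * DAYS_PER_YEAR)
    return servers * per_server_daily_rate


# Compare small fleets with large ones (sizes are illustrative).
for fleet in (10, 50, 1000, 2000):
    rate = expected_failures_per_day(fleet)
    print(f"{fleet:>5} servers: {rate:.2f} failures/day, one every {1 / rate:.1f} days")
```

With 1,000 servers this comes to roughly 0.9 expected failures per day, which matches the article's rounding of three years to about 1,000 days.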

Not to mention 10 minor incidents with degraded performance, or a DDoS attack somewhere in the data center affecting our network traffic, or some other thing that sets pagers a-buzzing in our DevOps team and mobilizes our Customer Support team to notify and help customers.

“Well sure,” you say, “that’s normal as you grow. If you had just 10 servers and 100 customers, you’d have fewer problems and many fewer employees. Today you have more customers, more servers, and more employees. What’s so hard about that?”

The insight is that scale causes rare events to become common. Things happen with 2,000 servers that you never saw even once with 50 servers, and things that used to happen once in a blue moon, where a shrug and a manual reboot every six months was in fact an appropriate “process,” now happen every week, or even every day.

Things as rare as, well, you know…

[Cartoon]

It’s not only problems that morph with scale, but your ability to handle problems.

For example, a dozen minor and major events every day means 20-50 customers affected every day. Now consider what happens as we try to inform 50 customers. For some we won’t have current email addresses, so they don’t get notified. Some of those will notice the problem and create extra customer support load; at worst they’ll post on Twitter about how their website was slow or offline today and WP Engine “didn’t even know it.” Then our social media team has to piece all this together, attempt to respond, maybe put together a special phone call with that customer, and so on. Those customers are also more likely to leave a bad review on some review site, compared with the 99.99% of customers who experience no such incident, but also had no reason to decide that “today is the day I will go to a review site and leave a good review.”

Or consider the scale ramifications of on-boarding 1,000 new customers a month. In that case, it’s likely that any given server issue will affect a customer who has only been with us for a month or two. Thus the issue causes a “bad first impression,” which is harder to repair than an incident with a customer who has been with us for three years and has built up a bank account of patience.

So rare things becoming common isn’t just difficult on the operational side. It is also difficult when you handle those problems with customers and deal with the other downstream consequences, all of which take far more work than when the company was small.

The usual response to this is “automate everything.”

As with most knee-jerk responses, there’s truth in it, but it’s not the whole story.

Sure, without automated monitoring we’d be blind, and without automated problem-solving we’d be overwhelmed. So yes, “automate everything.”

But some things you can’t automate. You can’t “automate” a knowledgeable, friendly customer support team. You can’t “automate” responding to a complaint on social media. You can’t “automate” the recruiting, training, rapport, culture, and downright caring of teams of human beings who are awake 24/7/365, with skills ranging from multi-tasking on support chat to communicating clearly and professionally over the phone to logging into servers and identifying and fixing issues as fast as (humanly?) possible.

And you can’t “automate” away the rare things, even the technical ones. By their nature they’re difficult to define, hence difficult to monitor, and difficult to repair without the forensic skills of a human engineer.

Does this mean all our customers have a worse experience? No, just the opposite. Any one customer of ours has fewer problems per year now than a year ago, because we’re constantly improving our processes, automation, hardware, and human service. It’s when you look across the entire company, and the non-linear additional effort it takes to not just improve the average experience, but to manage the worst-case experience, that you appreciate the difficulties.

This explains the common effect where people complain about a company every day on Twitter, yet you yourself have never had an issue with them. The paradox is solved by realizing that “rare things” means you probably never experienced it, but at scale, someone is experiencing it each day.
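
This paradox can be made concrete with a little probability. The sketch below assumes each customer independently has the same small chance of hitting an incident on a given day; the 1-in-10,000 figure and the customer counts are illustrative assumptions, not WP Engine data.

```python
def prob_someone_affected(customers: int, p_per_customer: float) -> float:
    """Probability that at least one of `customers` is affected today,
    assuming independent incidents with per-customer probability p."""
    return 1.0 - (1.0 - p_per_customer) ** customers


p = 1 / 10_000  # hypothetical daily chance that any one customer has an incident

for n in (100, 10_000, 100_000):
    chance = prob_someone_affected(n, p)
    print(f"{n:>7} customers: P(someone affected today) = {chance:.3f}")
```

Under these assumptions an individual customer almost never sees a problem, yet with 100,000 customers it is nearly certain that someone does on any given day.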

Does that give high-scale companies like WP Engine an excuse to have problems? No way! In fact, if we’re not constantly improving on all fronts, the scale will catch up and overtake us.

But for those of you in the earlier stages of your companies, when you project 5x growth against 5x costs (or only 3x the costs because you’ll get cost-savings at scale), you’re guessing low. When you show 5x growth in projections but don’t budget for new hires in areas like security, technical automation, specialized customer service areas, and managers and executives who have trod this path before and come battle-hardened with play-books on how to tackle all this, you’re heading for an ugly surprise.

And with high growth, the surprise appears quickly, and recovery means acting twice as fast again to claw out of the hole and then finally get ahead of it.

Scaling is hard!



Comments

  • By AnotherGoodName 2024-05-21 15:21 (4 replies)

    This is also largely the answer to why your weekend garage project doesn't need the 1000+ developers that an actual in-production project needs. At scale every edge case is a bug that someone has to deal with.

    Not just the hard crash software/hardware edge cases either. Regulatory edge cases, abuse vector edge cases, localization and accessibility edge cases will all need to be dealt with.

    I remember people exclaiming 'Musk is right, Twitter doesn't need 1000s of developers' only for Twitter to start failing a lot of regulatory and abuse vector edge cases shortly after their layoffs. It's no good judging business needs based on a basic garage implementation of Twitter. The real world and a few billion users make for an entirely different set of problems.

    • By nsguy 2024-05-21 18:23

      Your garage project also doesn't need to support 100's of random features some product manager thought are critical but nobody uses. Quality issues are exponential to the complexity of the software. Out of those 1000's of engineers there will also be a lot of variability in terms of productivity and quality. Some engineers are likely doing most of the heavy lifting. Some spend all their time in meetings. Some engineers might be working on their novel or startup while sitting in the office. Extremely unlikely they're all operating at the same level. With 1000's of engineers there's a lot of effort synchronizing those engineers, communication overhead, people stepping on each other's work etc. The mythical tower of babel: https://en.wikipedia.org/wiki/Tower_of_Babel

      Lots of reliable/production software was built by teams much smaller than 1000's. Things like operating systems (e.g. the Linux kernel), compilers, tools or libraries, come to mind. Some even by single people.

      That's not to say that all weekend garage projects are better than what 1000+ developer teams produce but don't knock the ability of small teams to do amazing things. Twitter isn't exactly the pinnacle of engineering accomplishment either. I'm pretty sure Twitter suffered from bloat similar to many other large tech companies. Elon taking a sledgehammer to that is probably not the best approach though. That said some people were saying the whole thing will fall apart in days, and it didn't.

      [EDITed for typo]

    • By compiler-guy 2024-05-21 16:11 (1 reply)

      Such scale also creates opportunities.

      If you have 10 servers, a developer can't spend a couple of months to get a 0.1% performance boost--the benefit will never cover the cost. If you have 100,000 servers, it just might. If you have 1,000,000 servers, it almost certainly is worth an entire team looking for that much performance, year in and year out.

      This is true of performance, avoiding downtime, or whatever else. You could never justify such an expense at a startup, but with sufficient scale they pay for themselves many times over.

      • By closeparen 2024-05-21 17:12 (1 reply)

        A large company's million servers are going to be partitioned across thousands of separate applications. The biggest ones do benefit from performance work, but a large portion of the server count is going to be a long tail of services that aren't individually worth that much effort.

        • By vlovich123 2024-05-21 17:57

          They would rarely be partitioned because of all the waste - workloads are typically placed co-tenant on one machine (VMs, containers, normal multi-processing, etc). But I believe you’re right that cumulatively there would be a lot of long tail services that consume resources and the cumulative inefficiencies aren’t being optimized because it’s work that doesn’t scale (i.e. you’d have to fix too many performance issues on services no one cares about to make a visible overall dent). Not sure why you’re downvoted though - it’s an astute observation.

    • By pessimizer 2024-05-21 16:49 (1 reply)

      > I remember people exclaiming 'Musk is right, Twitter doesn't need 1000s of developers' only for Twitter to start failing a lot of regulatory and abuse vector edge cases shortly after their layoffs. It's no good judging business needs based on a basic garage implementation of Twitter. The real world and a few billion users make for an entirely different set of problems.

      It's strange this is the lesson you took from Twitter. Twitter fired 90% of their developers, and almost nothing changed. Twitter is a site that exists in the real world, at scale. People don't even talk about the fact that Twitter fired 90% of their devs anymore, they just began firing devs themselves.

      • By MrDarcy 2024-05-21 17:14 (3 replies)

        Elon Musk fired Twitter employees. Have you not noticed how the site has deteriorated technically? I personally know people who have suffered abuse on Twitter, reported the abuse, and had no action taken because Elon fired the trust and safety org.

        • By 4hg4ufxhy 2024-05-21 17:34 (1 reply)

          Do you also personally know people who reported it and had action taken prior?

        • By thfuran 2024-05-21 17:50

          Even if it did, that's not really evidence that it'd take nearly as many people to run a Twitter clone at Twitter scale. It may well, but suddenly losing 90% of personnel is some serious bus factor woes.

        • By canoebuilder 2024-05-21 17:54 (2 replies)

          If reading something is hurting your feelings, you can stop reading it.

          Twitter even provides mute, block, and whatnot functionality to prevent specified things from even showing up in your line of sight to begin with. And if the app is really bothering you, you can always set it down and go outside, take a walk, meet somebody new, do something that will put a smile on your face on your deathbed.

          Lumping in mean comments online, with actual abuse is approaching risible. Words have meanings, we shouldn’t dilute or distort them.

          By Twitter “not taking action,” sounds like your friend is upset that he or she can no longer co-opt the proprietors of the site into enacting punitive measures on people who draw his or her ire.

          Maybe some mean things were said or whatever, but at the end of the day it’s just text on a screen isn’t it? And there’s a lot more to life than text on a screen, isn’t there?

          It’s also weird how you mention the technical functioning of the site, then bring up the “Trust & Safety Org” when the legacy of “Trust & Safety” is a small cabal with extremist views arbitrarily deciding what information to censor and suppress based on their own viewpoints, whims, and influence from government agencies.

          That has nothing to do with the technical functioning of the site which is a matter of reproducible, specifiable, determinate functions implemented in computer code to produce a useful product. The kind of thing that really turns the mind of an autist on.

          P.S. Not to be too blasé about your friend, mean words can be an issue, especially an ongoing pattern, but anonymous strangers online seems like less of an issue than irl, and was this really an issue where block or mute wasn’t sufficient? How so?

          • By petsfed 2024-05-21 18:38

            Man, if you can't see the difference between "so-and-so called me a mean name" and "1000 strangers all knocked on my door just to tell me, in excruciating detail, how they wish my children were raped and murdered", I don't know what to tell you.

            X's systems for block and mute require the abuse to occur before you have an avenue to respond. Considering that all you need to get an X account is an email account, it's a pretty low bar for brigading. And that's to say nothing about organized campaigns to falsely report an account for abuse.

            For individuals, I suppose you can make some kind of argument that those tools are sufficient, but if you're the poor social media manager for some township or minor government agency that draws the ire of the internet hate machine, you have to deal with all the abuse that goes with it. You are barred by the constitution from blocking people (and rightly so), and you have no real power to prevent them from creating sock puppet accounts to continue the abuse. PTSD is pretty common amongst (former, since they fired them all) twitter content moderators, because being consistently exposed to that stuff can eventually be pretty traumatizing.

          • By Atotalnoob 2024-05-22 19:34

            “Mean words” are a small part of what trust and safety does.

            CSAM, beheadings, videos of the worst things imaginable are what trust and safety deal with on a daily basis.

    • By brabel 2024-05-21 16:19 (2 replies)

      Do you have more information about the Twitter abuse vector edge cases?

      And how would developers mitigate that? Isn't that the kind of thing you need human content reviewers to handle (assuming any automated tool Twitter had was still there after the layoffs)?

      • By AnotherGoodName 2024-05-21 16:31

        The development work comes in stopping repeated abuse. E.g. a block of IP addresses known to be bad should not be able to repeatedly sign up for new accounts after a ban. You may need to add captchas for users who meet certain criteria. Defenses against call pumping need to be constantly re-scripted to stop bad phone companies tricking your systems into sending SMS to expensive numbers (Twitter turned off all SMS two factor auth to stop this which is telling of the gap they left here).
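
        The first of those checks, refusing signups from address blocks known to be bad, might look roughly like the sketch below. It uses Python's standard `ipaddress` module; the banned ranges and the `signup_allowed` name are hypothetical, and nothing here describes how Twitter actually does it.

```python
import ipaddress

# Hypothetical banned ranges. A real list would come from abuse reports or
# threat feeds; these are reserved documentation ranges used as stand-ins.
BANNED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def signup_allowed(client_ip: str) -> bool:
    """Return False if the signup request comes from a banned address block."""
    address = ipaddress.ip_address(client_ip)
    return not any(address in network for network in BANNED_NETWORKS)


print(signup_allowed("203.0.113.7"))  # address inside a banned block
print(signup_allowed("192.0.2.1"))    # address outside every banned block
```

        A static list like this is only a starting point, since the comment's point is that such rules need constant upkeep as abusers adapt.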

      • By closeparen 2024-05-21 17:18

        Rules engines and ML models to escalate interesting Tweets, accounts, and networks of related accounts for human review. Data pipelines to feed them. Change management and observability so that analysts can safely keep them up to speed with threats. Case management systems to put the findings into. Reviews, approvals, deviations, escalations, M of N reviewer consensus. Each one of these things is at least as complex as Twitter's core product.

  • By petsfed 2024-05-21 16:53 (2 replies)

    I feel like this is a huge digression, but this is basically the perfect counter-argument to "it's never been illegal to write down license plate numbers/stake out people's houses/write down the addresses of mail delivered to a home/etc, why should we make new rules now that AI is doing it?"

    Obviously manually interceding whenever you have a server failure is unsustainable when you have 1000 (or more) servers. Why do people somehow believe that manually interceding to prevent bad action with surveillance is somehow sustainable, when previously the rate limiter was that the surveillance itself was manual?

    • By janalsncm 2024-05-21 17:43

      Definitely. There are emergent properties of technology when applied at scale that make something that was just weird and annoying before into something truly dangerous.

      There are a lot of things that are unintuitive at larger scales. AI and intellectual property is another.

      And I have sympathy for people who said that algorithmically amplified speech on large platforms is qualitatively different from regular town square speech. It is. I don’t know how the laws should change in response but to pretend that Twitter is just like a bar is pretty naive imo.

    • By Gunax 2024-05-21 21:51 (1 reply)

      I appreciate your comment because I think I am usually on the other side.

      • By petsfed 2024-05-21 22:22

        Considering that I'm usually on the side of "it's already illegal, making it more illegal won't help", I've struggled for a long time to express what exactly was the problem with e.g. automated surveillance, because I couldn't express it in a way that was convincing even to myself.

  • By d1sxeyes 2024-05-21 16:42 (2 replies)

    While the sentiment is lovely, the conclusions ignore business reality. I did some back-of-an-envelope calculations.

    A CS rep who “truly cares” is gonna set you back around 50K[0] in salary, call it 75K total cost to employ. I don’t know what their average customer value is, but seems they start doing phone support at 40USD, and 24x7 chat support for all customers. Let’s be generous and assume 50USD/customer average.

    That means it takes at least 125 average customers to pay a CS rep’s salary.

    Now bear in mind it’s 24x7, so you need at least 8 CS reps. That means you need to be retaining 1000 customers per year just to break even on your CS team. That’s around 20 a week.
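
    The back-of-an-envelope figures above can be reproduced with the short sketch below. It assumes the 50 USD average is monthly revenue per customer, which is the reading that yields the 125 quoted above; the variable names are illustrative.

```python
import math

cost_per_rep = 75_000           # USD per year, total cost to employ (from the comment)
revenue_per_customer = 50 * 12  # USD per year, assuming 50 USD is a monthly figure
reps_for_24x7 = 8               # minimum headcount suggested in the comment

# Customers needed to cover one rep, then the whole round-the-clock team.
customers_per_rep = math.ceil(cost_per_rep / revenue_per_customer)
customers_for_team = customers_per_rep * reps_for_24x7
customers_per_week = customers_for_team / 52

print(customers_per_rep, customers_for_team, round(customers_per_week, 1))
```

    On these inputs the result is 125 customers per rep, 1,000 for the team, and a little over 19 per week, in line with the "around 20 a week" estimate.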

    The best customer support is the one your users never have to contact. If you can automate fixes (bonus points for automated pro-rated credits on bills based on downtime impact), or improve reliability, the customers with the highest value to you will stick with you.

    That’s not to say “give a shitty support experience because it doesn’t matter”, it’s just that it’s a solution to a different problem than the one presented here.

    • By petsfed 2024-05-21 17:04 (2 replies)

      I think your math is not quite right here.

      You can probably get away with 1-2 good CS reps, provided the lower tiers have the right tools to triage things effectively. Put another way, if you have 1 CS rep that really cares, that does the followup calls, then you can employ 7-8 front-line CS reps at half the cost who just take diligent notes, and handle the common errors that customers make. I sincerely doubt that the person I have to call to help me connect my cable modem to my ISP makes even 40K a year. But all they have to do is follow a script, write down my answers to their questions, and escalate if the script calls for it.

      I agree, completely, however, that ideally, every real issue happens precisely once and results in a change in the automated failure response.

      • By d1sxeyes 2024-05-21 17:44 (1 reply)

        > I sincerely doubt that the person I have to call to help me connect my cable modem to my ISP makes even 40K a year.

        That’s sort of my point… do you feel as though that rep really cares? Is that an outstanding customer support experience for you? Many folks would say ISPs and telcos have the worst customer service.

        You’re not wrong though that a good script can help a lot of users with simple issues, and as it turns out, that’s the kind of thing that’s actually pretty easy and effective to automate.

        • By petsfed 2024-05-21 18:20

          It was actually one of the best Telco CS experiences I've ever had, and I think it's because the gist of the call was "here are the numbers from my modem, so that you can authorize it to talk to your network, do you see it? Yes? Ok, we're done now".

          The main reason bad customer service happens is that the customer has a problem that the CS rep is not empowered to fix or escalate. It's not a question of "caring", it's a question of results. It's really hard to care about your job when faced with unrealistic expectations from the customer, and insufficient resources from the employer. A lot of folks with customer service departments could save a bundle on labor and earn a lot of goodwill if they would recognize that.

      • By deathanatos 2024-05-21 19:20 (2 replies)

        > then you can employ 7-8 front-line CS reps at half the cost who just take diligent notes

        … the actual reality is that companies employ 7-8 front-line CS reps that can't even read the ticket, and end up asking questions already asked & answered by the form I was required to fill out during the ticket!

        I'm not buying upthread's math either, though. Either the support plan is paid, or not. In the case of paid support plans, AFAICT, given the support vs. the pricing for plans I've been a part of, a customer is just profitable. I don't get a full time, 40h/wk support agent, I only get them for the duration of time they're attending my tickets, and that proportional yearly salary cost is < the support plan's price in the cases I've seen; support plans are just ludicrously expensive, but so many companies feel they're obligated to have them, whether for DR, compliance, or whatever, that they don't question it. There are some companies that just do support as part of the included purchase (this is the right way, IMO), in which case, yeah, it takes a few customers. But in this case, how many customers/support rep is dependent on how much you can or are spending on operating costs.

        > I agree, completely, however, that ideally, every real issue happens precisely once and results in a change in the automated failure response.

        Today's customer support zeitgeist is utterly opposed to this idea. I agree that it's the correct response, but support's goal (that is, the metric which they are apparently measured by), in nearly every case I've seen, is to close the ticket as fast as possible. This means not waiting around for permanent fixes: if a problem can be kludged around & the ticket closed faster, that wins. But then the extra due diligence of leaning on the engineering side becomes optional, extra … and just doesn't happen.

        I've literally had tickets closed because "there hasn't been a response on this ticket for some time" (with the ticket plainly in the provider's court) and "we don't want to waste your time with long running tickets", and with no fix to my problem, clearly.

        Customer Support is Goodhart's Law.

        The latest change is that now I have to fight an LLM that cannot address my problem to get to that front-line rep who won't read my problem. Yay, progress! /s

        • By petsfed 2024-05-21 19:50

          >Today's customer support zeitgeist is utterly opposed to this idea. I agree that it's the correct response, but support's goal (that is, the metric which they are apparently measured by), in nearly every case I've seen, is to close the ticket as fast as possible.

          First, I can't believe it's taken me this long to start italicizing when I quote somebody. I'm going to do that for all future quotations.

          Second, I can't easily imagine a better illustration of Goodhart's law than this. I can certainly see how they got to "average time to close", even though obviously the important metric is "how many customers eventually reached a satisfactory conclusion". It's just that it's hard to answer that without bugging a bunch of people, reminding them of that time when the product shit the bed and they had to call support.

          And "time to close" is not a bad metric, but it can't be the only metric. This reminds me of something that seems like a corollary of Goodhart's law: the easier it is to measure something, the less likely it is to be useful as a metric.

        • By d1sxeyes 2024-05-22 7:50

          Support is not paid exactly if I understand correctly. According to the pricing, even the lowest tier includes '24/7 WordPress technical expertise', which includes chat support. The second tier includes phone support.

          I don't even see any upsell to 'premium support' or anything like that for retail users. However, probably enterprise customers do indeed get separate pricing for support.

          All told though, I'm not really saying that my maths was supposed to be on the money, just that the costs of running a CS team like the article suggests are high, and having an expensive CS team is only justified if they drive enough revenue, either through attracting new customers or retaining existing ones.

          And to be clear, having outstanding CS is indeed a differentiator. But it's not a differentiator compared to reliability/automation, it's a differentiator compared to other companies' CS teams.

    • By eszed 2024-05-21 19:38

      I agree with the sibling posts about escalation and so forth, but your numbers also don't capture customer recruitment. I'm fortunate enough to live within Sonic's catchment area in the Bay Area, and have just moved the company for which I'm responsible over to 100% Sonic. My advocacy for them is 85% based on their superlative tech support, and our new business account will pay for half of one of those "truly cares" engineers. I'm damn happy to do it. Incidentally, I think I've persuaded three or four neighbors to switch to (residential) Sonic, and one professional contact to set up a business account for his company.

      I think good customer support/relations has to be committed to for reasons that don't immediately show up on spreadsheets, with the trust that they will pay long-term dividends. I realize that is antithetical to The Way Things Work Now, but in both my personal and professional lives I try only to do business with companies that agree.
