The future of software engineering is SRE

swizec.com · 2026-01-25 22:18

When code gets cheap, operational excellence wins. Anyone can build a greenfield demo, but it takes engineering to run a service.

You may be wondering: With all the hype about agentic coding, will we even need software engineers anymore? Yes! We'll need more.

Writing code was always the easy part of this job. The hard part was keeping your code running over the long haul. Software engineering is programming over time. It's about how systems change.

Lessons from the no-code and spreadsheets era

Let's take no-code and spreadsheets as an example of the kind of software people say is the future – custom-built, throwaway, built by non-experts to solve specific problems.

Joe Schmoe from accounting takes 10 hours to do a thing. He does this every week and it feels repetitive, mechanical, and boring. Joe could do the work in his sleep.

But he can't get engineering resources to build a tool. The engineers are busy building the product. No worries, Joe is a smart dude. With a little Googling, a few no-code tools, and good old spreadsheet macros he builds a tool.

Amazing.

Joe's tool is a little janky but his 10 hour weekly task now takes 1 hour! 🎉 Sure, he finds a new edge case every week and there's constant tinkering, but he's having a lot more fun.


Time passes, the business changes, accounting rules are in constant flux, and let's never talk about timezones or daylight savings ever again. Joe is sick of this bullshit.

All he wanted was to make his job easier and now he's shackled to this stupid system. He can't go on vacation, he can't train anyone else to run this thing successfully, and it never fucking works right.

Joe can't remember the last time running his code didn't fill him with dread. He spends hours carefully making sure it all works.

The computer disease

Feynman called this the computer disease.

The problem with computers is that you tinker. Automating things is fun! You might forget you don't need to 😆

The part that's not fun is running things. Providing a service. Reliably, at scale, for years on end. A service that people will hire to do their jobs.

Why operational excellence is the future

People don't buy software, they hire a service.

You don't care how iCloud works, you just want your photos to magically show up across devices every time. You don't care about Word or Notion or gDocs, you just want to write what's on your mind, share it with others, and see their changes. And you definitely don't care how a payments network, a point-of-sale terminal, and your bank talk to each other, you just want your $7 matcha latte to get you through the week.

Good software is invisible.

And that takes work. A lot of work. Because the first 90% to get a working demo is easy. It's the other 190% that matters.

  • What's your uptime?
  • Defect rate?
  • How quickly do you recover from defects?
  • Do I have to reach out or will you know before me?
  • Can you own upstream dependencies?
  • When a vendor misbehaves, will you notice or wait until your users complain?
  • When users share ideas, how long does it take to ship them?
  • How do you keep engineers from breaking each other's systems?
  • Do you have systems to keep engineers moving without turning your app into a disjointed mess?
  • Can you build software bigger than fits in 1 person's brain?
  • When I'm in a timezone 12 hours away, your engineers are asleep, and there's a big issue ... will it be fixed before I give up?
  • Can you recover from failures, yours and upstream, or does important data get lost?
  • Are you keeping up with security updates?
  • Will you leak all my data?
  • Do I trust you?
  • Can I rely on you?
  • How can you be so sure?
  • Will you sign a legally binding guarantee that your software works when I need it? 😉
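
Questions like "what's your uptime?" hide concrete arithmetic. As a quick sketch of what an availability number actually promises (the targets below are examples, not anyone's SLA):

```python
# How much downtime a given availability target allows per year.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability):
    """availability as a fraction, e.g. 0.999 for 'three nines'."""
    return MINUTES_PER_YEAR * (1 - availability)


for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} uptime -> "
          f"{downtime_minutes_per_year(target):.0f} min/year of downtime")
```

Two nines sounds impressive until you notice it permits roughly 3.6 days of outage a year; each extra nine cuts the budget tenfold.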

Those are the ~~fun~~ hard engineering challenges. Writing code is easy.

Cheers,
~Swizec

Published on January 24th, 2026 in Software Engineering, SRE, DevOps, Scaling Fast Book

Comments

  • By v_CodeSentinal 2026-01-26 14:00 (6 replies)

    Hard agree. As LLMs drive the cost of writing code toward zero, the volume of code we produce is going to explode. But the cost of complexity doesn't go down—it actually might go up because we're generating code faster than we can mentally model it.

    SRE becomes the most critical layer because it's the only discipline focused on 'does this actually run reliably?' rather than 'did we ship the feature?'. We're moving from a world of 'crafting logic' to 'managing logic flows'.

    • By ottah 2026-01-26 15:54 (1 reply)

      I dunno, I don't think in practice SRE or DevOps is really different from what we used to call sysadmins (former sysadmin myself). I think the future of mediocre companies is SRE chasing after LLM fires, but I think a competitive business would have a much better strategy for building systems. Humans are still by far the most efficient and generalized reasoners, and putting an energy-intensive, brittle AI model in charge of most implementation is setting yourself up to fail.

      • By stvvvv 2026-01-26 18:16 (1 reply)

        Former sysadmin and I've been an SRE for >15 years now.

        They are very different. If your SREs are spending much of their time chasing fires, they are doing it wrong.

        • By ottah 2026-01-26 22:46 (1 reply)

          Unfortunately sometimes it's more of a title than a job description. Companies define the job and call it whatever they feel like.

          • By stvvvv 2026-01-28 12:51

            Right, unfortunately too many people re-brand their ops team as SREs and expect things to be different.

    • By mupuff1234 2026-01-26 15:10 (2 replies)

      > But the cost of complexity doesn't go down

      But how much of current day software complexity is inherent in the problem space vs just bad design and too many (human) chefs in the kitchen? I'm guessing most of it is the latter category.

      We might get more software but with less complexity overall, assuming LLMs become good enough.

      • By legorobot 2026-01-26 15:35

        I agree that there's a lot of complexity today due to the process in which we write code (people, lack of understanding the problem space, etc.) vs the problem itself.

        Would we say that we as humans have captured the "best" way to reduce complexity and write great code? Maybe there are patterns and guidelines but no hard and fast rules. Until we have better understanding around that, LLMs may not arrive at those levels either. Most of that knowledge is gleaned by sticking with a system -- dealing with past choices and making changes and tweaks to the code, complexity, and solution over time. Maybe the right "memory" or compaction could help LLMs get better over time, but we're just scratching the surface there today.

        LLMs output code as good as their training data. They can reason about the parts of code they are prompted with and offer ideas, but they're inherently based on the data and concepts they've trained on. And unfortunately... it's likely that much more average code than highly respected code floods the training data, at least for now.

        Ideally I'd love to see better code written and complexity driven down by _whatever_ writes the code. But there will always be verification required when using a writer that is probabilistic.

      • By oblio 2026-01-26 16:56

        That probably requires superhuman AI, though.

    • By wavemode 2026-01-26 17:35 (2 replies)

      By "SRE", are people actually talking about "QA"?

      SREs usually don't know the first thing about whether particular logic within the product is working according to a particular set of business requirements. That's just not their role.

      • By stvvvv 2026-01-26 18:19 (1 reply)

        Good SREs at a senior level do. They are familiar with the product, and the customers and the business requirements.

        Without that it's impossible to correctly prioritise your work.

        • By wavemode 2026-01-26 18:29 (1 reply)

          Any SRE who does that is really filling a QA role. It's not part of the SRE job title, which is more about deployments/monitoring/availability/performance, than about specific functional requirements.

          In a well-run org, the software engineers (along with QA if you have them) are responsible for validation of requirements.

          • By stvvvv 2026-01-28 12:53

            well-run ops requires knowing the business. It's not enough to know "This rpc is failing 100%", but also what the impact on the customer is, and how important to the business it is.

            Mature SRE teams get involved with the development of systems before they've even launched, to ensure that they have reliability and supportability baked in from the start, rather than shoddily retrofitted.

      • By zeroCalories 2026-01-26 17:44

        Most companies don't have QA anymore, just their CI/CD's automated tests.

    • By belter 2026-01-26 18:44

      >> As LLMs drive the cost of writing code toward zero

      And they drive the cost of validating the correctness of such code towards infinity...

    • By storystarling 2026-01-26 18:18

      I see it less as SRE and more about defensive backend architecture. When you are dealing with non-deterministic outputs, you can't just monitor for uptime, you have to architect for containment. I've been relying heavily on LangGraph and Celery to manage state, basically treating the LLM as a fuzzy component that needs a rigid wrapper. It feels like we are building state machines where the transitions are probabilistic, so the infrastructure (Redis, queues) has to be much more robust than the code generating the content.
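
      A minimal sketch of that "rigid wrapper around a fuzzy component" idea in plain Python (no LangGraph/Celery; every name here is hypothetical): validate each output against a contract and retry a bounded number of times, so the non-determinism never leaks past the wrapper.

```python
# Sketch: treat an LLM as a fuzzy component behind a rigid contract.
# All names are illustrative, not a real LangGraph/Celery API.
import json


class FuzzyComponentError(Exception):
    """Raised when the fuzzy component never produces valid output."""


def call_llm_guarded(llm_call, required_keys, max_attempts=3):
    """Call a non-deterministic producer until its JSON output satisfies
    a rigid contract, or fail loudly after max_attempts."""
    for _ in range(max_attempts):
        raw = llm_call()  # the probabilistic step
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: retry instead of propagating it
        if all(key in data for key in required_keys):
            return data  # contract met; downstream logic is deterministic
    raise FuzzyComponentError(f"no valid output after {max_attempts} attempts")
```

      In a real system the retries and queueing would live in infrastructure like Celery and the transitions in a graph runner; the point is that the wrapper, not the model, enforces determinism.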

    • By franktankbank 2026-01-26 20:36

      This sounds like the most min maxed drivel. What if I took every concept and dialed it to either zero or 11 and then picked a random conclusion!!!??

  • By solatic 2026-01-26 7:35 (3 replies)

    I think there's two kinds of software-producing-organizations:

    There's the small shops where you're running some kind of monolith generally open to the Internet, maybe you have a database hooked up to it. These shops do not need dedicated DevOps/SRE. Throw it into a container platform (e.g. AWS ECS/Fargate, GCP Cloud Run, fly.io, the market is broad enough that it's basically getting commoditized), hook up observability/alerting, maybe pay a consultant to review it and make sure you didn't do anything stupid. Then just pay the bill every month, and don't over-think it.

    Then you have large shops: the ones where you're running at the scale where the cost premium of container platforms is higher than the salary of an engineer to move you off it, the ones where you have to figure out how to get the systems from different companies pre-M&A to talk to each other, where you have N development teams organizationally far away from the sales and legal teams signing SLAs yet need to be constrained by said SLAs, where you have some system that was architected to handle X scale and the business has now sold 100X and you have to figure out what band-aids to throw at the failing system while telling the devs they need to re-architect, where you need to build your Alertmanager routing tree configuration dynamically because YAML is garbage and the routing rules change based on whether or not SRE decided to return the pager, plus ensuring that devs have the ability to self-service create new services, plus progressive rollout of new alerts across the organization, etc., so even Alertmanager config needs to be owned by an engineer.

    I really can't imagine LLMs replacing SREs in large shops. SREs debugging production outages to find a proximate "root" technical cause is a small fraction of the SRE function.

    • By ffsm8 2026-01-26 9:42 (2 replies)

      > SREs debugging production outages to find a proximate "root" technical cause is a small fraction of the SRE function.

      According to the stated goals of SRE, this is actually not just a small fraction - it's something that shouldn't happen. To be clear, I'm fully aware that this will always be necessary - but whenever it happens, it's because the site reliability engineer (SRE) overlooked something.

      Hence if that's considered a large part of the job... then you're just not an SRE as Google defined that role

      https://sre.google/sre-book/table-of-contents/

      Very little connection to the blog post we're commenting on though - at least as far as I can tell.

      At least I didn't find any focus on debugging. It put forward that the capability to produce reliable software is what will distinguish companies in the future, and I think this holds up and is in line with the official definition of SRE.

      • By ottah 2026-01-26 15:59

        I don't think people really adhere to Google's definition; most companies don't even have nearly similar scale. Most SREs I've seen are running from one PagerDuty alert to the next and not really doing much of a deep dive into understanding the problem.

      • By bigDinosaur 2026-01-26 12:27 (1 reply)

        This makes sense - as an analogy, the flight crash investigator is presumably a very different role from the engineer designing flight safety systems.

        • By arcbyte 2026-01-26 13:05

          I think you've identified analogous functions, but I don't think your analogy holds as you've written it. A more faithful analogy to OP is that there is no better flight crash investigator than the aviation engineer who designed the plane, but a flight crash investigation is an actual failure of their primary duty of engineering safe planes.

          Still not a great rendition of this thought, but closer.

    • By tryauuum 2026-01-26 23:27 (1 reply)

      those alertmanager descriptions feel scary. I'm stuck in the zabbix era.

      what do you mean "progressive rollout of new alerts across the organization"? what kind of alerts?

      • By solatic 2026-01-27 5:54

        Well, all kinds. Alerting is a really great way to track things that need to change, tell people about that thing along established channels, and also tell them when it's been addressed satisfactorily. Alertmanager will already be configured with credentials and network access to PagerDuty, Slack, Jira, email, etc., and you can use something like Karma to give people interfaces to the different Alertmanagers and manage silences.

        If you're deploying alerts, then yeah you want a progressive rollout just like anything else, or you run the risk of alert fatigue from false positives, which is Really Bad because it undermines faith in the alerting system.

        For example, say you want to start to track, per team, how many code quality issues they have, and set thresholds above which they will get alerted. The alert will make a Jira ticket - getting code quality under control can be afforded to be scheduled into a sprint. You probably need different alert thresholds for different teams, and you want to test the waters before you start having Alertmanager make real Jira issues. So, yeah, progressive rollout.
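
        A sketch of what such a ticket-only, progressively rolled-out alert might look like as a Prometheus rule (the metric name, threshold, and labels here are made up for illustration): the `severity` label routes to a Jira-creating receiver instead of a pager, and a `rollout` label lets the Alertmanager routing tree send only canary teams to the real receiver at first.

```yaml
# Hypothetical rule: metric, threshold, and label names are illustrative.
groups:
  - name: code-quality-rollout
    rules:
      - alert: CodeQualityDebtHigh
        expr: sum by (team) (code_quality_open_issues) > 50
        for: 24h
        labels:
          severity: ticket   # routed to a Jira-creating receiver, never a pager
          rollout: canary    # only canary teams match the live Jira route
        annotations:
          summary: "{{ $labels.team }} has over 50 open code quality issues"
```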

    • By weitendorf 2026-01-26 11:04 (5 replies)

      Having worked on Cloud Run/Cloud Functions, I think almost every company that isn't itself a cloud provider could be in category 1, with moderately more featureful implementations that actually competed with K8s.

      Kubernetes is a huge problem. It's IMO a shitty prototype that industry ran away with (because Google tried to throw a wrench at Docker/AWS when containers and cloud were the hot new things, pretending Kubernetes is basically the same as Borg). Then the community calcified around the prototype state and bought all this SAAS/structured their production environments around it, and now all these SAAS providers and Platform Engineers/DevOps people who make a living off of milking money out of Kubernetes users are guarding their gold mines.

      Part of the K8s marketing push was rebranding Infrastructure Engineering = building atop Kubernetes (vs operating at the layers at and beneath it), and K8s leaks abstractions/exposes an enormous configuration surface area, so you just get K8s But More Configuration/Leaks. Also, You Need A Platform, so do Platform Engineering too, for your totally unique use case of connecting git to CI to slackbot/email/2FA to our release scripts.

      At my new company we're working on fixing this but it'll probably be 1-2 more years until we can open source it (mostly because it's not generalized enough yet and I don't want to make the same mistake as Kubernetes. But we will open source it). The problem is mostly multitenancy, better primitives, modeling the whole user story in the platform itself, and getting rid of false dichotomies/bad abstractions regarding scaling and state (including the entire control plane). Also, more official tooling, and you have to put on a dunce cap if YAML gets within 2 network hops of any zone.

      In your example, I think

      1. you shouldn't have to think about scaling and provisioning at this level of granularity, it should always be at the multitenant zonal level, this is one of the cardinal sins Kubernetes made that Borg handled much better

      2. YAML is indeed garbage, but availability reporting and alerting need better official support; it doesn't make sense for every ecommerce shop and bank to build this stuff

      3. a huge amount of alerts and configs could actually be expressed in business logic if cloud platforms exposed synchronous/real-time billing with the scaling speed of Cloud Run.

      If you think about it, so so so many problems devops teams deal with are literally just

      1. We need to be able to handle scaling events

      2. We need to control costs

      3. Sometimes these conflict and we struggle to translate between the two.

      4. Nobody lets me set hard billing limits/enforcement at the platform level.

      (I implemented enforcement for something close to this for Run/Appengine/Functions, it truly is a very difficult problem, but I do think it's possible. Real time usage->billing->balance debits was one of the first things we implemented on our platform).

      5. For some reason scaling and provisioning are different things (partly because the cloud provider is slow, partly because Kubernetes is single-tenant)

      6. Our ops team's job is to translate between business logic and resource logic, and half our alerts are basically asking a human to manually make some cost/scaling analysis or tradeoff, because we can't automate that, because the underlying resource model/platform makes it impossible.

      You gotta go under the hood to fix this stuff.

      • By spockz 2026-01-26 13:49 (1 reply)

        Since you are developing in this domain. Our challenge with both lambdas and cloud run type managed solutions is that they seem incompatible with our service mesh. Cloud run and lambdas can not be incorporated with gcp service mesh, but only if it is managed through gcp as well. Anything custom is out of the question. Since we require end to end mTLS in our setup we cannot use cloud run.

        To me this shows that cloud run is more of an end product than a building block and it hinders the adoption as basically we need to replicate most of cloud run ourselves just to add that tiny bit of also running our Sidecar.

        How do you see this going in your new solution?

        • By weitendorf 2026-01-26 15:07

          > Cloud run and lambdas can not be incorporated with gcp service mesh, but only if it is managed through gcp as well

          I'm not exactly sure what this means, a few different interpretations make sense to me. If this is purely a run <-> other gcp product in a vpc problem, I'm not sure how much info about that is considered proprietary and which I could share, or even if my understanding of it is even accurate anymore. If it's that cloud run can't run in your service mesh then it's just, these are both managed services. But yes, I do think it's possible to run into a situation/configuration that is impossible to express in run that doesn't seem like it should be inexpressible.

          This is why designing around multitenancy is important. I think with hierarchical namespacing and a transparent resource model you could offer better escape hatches for integrating managed services/products that don't know how to talk to each other. Even though your project may be a single "tenant", because these managed services are probably implemented in different ways under the hood and have opaque resource models (ie run doesn't fully expose all underlying primitives), they end up basically being multitenant relative to each other.

          That being said, I don't see why you couldn't use mTLS to talk to Cloud Run instances, you just might have to implement it differently from how you're doing it elsewhere? This almost just sounds like a shortcoming of your service mesh implementation that it doesn't bundle something exposing run-like semantics by default (which is basically what we're doing), because why would it know how to talk to a proprietary third party managed service?

      • By linuxftw 2026-01-26 13:56

        There are plenty of PaaS components that run on k8s if you want to use them. I'm not a fan, because I think giving developers direct access to k8s is the better pattern.

        Managed k8s services like EKS have been super reliable the last few years.

        YAML is fine, it's just a configuration language.

        > you shouldn't have to think about scaling and provisioning at this level of granularity, it should always be at the multitenant zonal level, this is one of the cardinal sins Kubernetes made that Borg handled much better

        I'm not sure what you mean here. Managed k8s services, and even k8s clusters you deploy yourself, can autoscale across AZs. This has been a feature for many years now. You just set a topology key on your pod template spec and your pods will spread across the AZs, easy.
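
        For reference, that topology-key setting is the `topologySpreadConstraints` field on the pod spec. A minimal sketch (the deployment name and labels are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web          # illustrative name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                                  # keep zones within 1 pod of each other
          topologyKey: topology.kubernetes.io/zone    # spread across AZs
          whenUnsatisfiable: ScheduleAnyway           # prefer spreading, don't block scheduling
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: example/web:latest
```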

        Most tasks you would want to do to deploy an application, there's an out of the box solution for k8s that already exists. There have been millions of labor-hours poured into k8s as a platform, unless you have some extremely niche use case, you are wasting your time building an alternative.

      • By firesteelrain 2026-01-26 13:30

        Lots to unpack here.

        I will just say, based on recent experience: the fix is not "Kubernetes bad." It's that Kubernetes is not a product platform; it's a substrate, and most orgs actually want a platform.

        We recently ripped out a barebones Kubernetes product (like Rancher but not Rancher). It was hosting a lot of our software development apps like GitLab, Nexus, Keycloak, etc.

        But in order to run those things, you have to build an entire platform and wire it all together. This is on premises running on vxRail.

        We ended up discovering that our company had an internal software development platform based on EKS-A and it comes with auto installers with all the apps and includes ArgoCD to maintain state and orchestrate new deployments.

        The previous team did a shitty job DIY-ing the prior platform. So we switched to something more maintainable.

        If someone made a product like that then I am sure a lot of people would buy it.

      • By solatic 2026-01-26 15:47 (1 reply)

        > real-time usage -> billing

        This is one of the things that excites me about TigerBeetle; the reason why so much billing by cloud providers is reported only on an hourly granularity at best is because the underlying systems are running batch jobs to calculate final billed sums. Having a billing database that is efficient enough to keep up with real-time is a game-changer and we've barely scratched the surface of what it makes possible.

        • By weitendorf 2026-01-26 17:03

          Thanks for mentioning them, we're doing quite similar debit-credit stuff as https://docs.tigerbeetle.com/concepts/debit-credit/ but reading https://docs.tigerbeetle.com/concepts/performance/ they are definitely thinking about the problem differently from us. You need much more prescribed entities (eg resources and skus) on the modelling side and different choices on the performance side (for something like a usage pricing system) for a cloud platform.

          This feels like a single-tenant, centralized ACH but I think what you actually want for a multitenant, multizonal cloud platform is not ACH but something more capability-based. The problem is that cloud resources are billed as subscriptions/rates and you can't centralize anything on the hot-path (like this does) because it means that zone/any availability interacting with that node causes a lack of availability for everything else. Also, the business logic and complexity for computing an actual final bill for a cloud customer's usage is quite complex because it's reliant on so many different kinds of things, including pricing models which can get very complex or bespoke, and it doesn't seem like tigerbeetle wants calculating prices to be part of their transactions (I think)

          The way we're modelling this is with hierarchical sub-ledgers (eg per-zone, per-tenant, per-resourcegroup) and something which you could think of as a line of credit. In my opinion the pricing and resource modelling + integration with the billing tx are much more challenging because they need to be able to handle a lot of business logic. Anyway, if someone chooses to opt-in to invoice billing there's an escape hatch and way for us to handle things we can't express yet.

      • By vrosas 2026-01-26 11:23 (1 reply)

        Every time I’ve pushed for cloud run at jobs that were on or leaning towards k8s I was looked at as a very unserious person. Like you can’t be a “real” engineer if you’re not battling yaml configs and argoCD all day (and all night).

        • By weitendorf 2026-01-26 12:07

          It does have real tradeoffs/flaws/limitations, chief among them, Run isn't allowed to "become" Kubernetes, you're expected to "graduate". There's been an immense marketing push for Kubernetes and Platform Engineering and all the associated SAAS sending the same message (also, notice how much less praise you hear about it now that the marketing has died down?).

          The incentives are just really messed up all around. Think about all the actual people working in devops who have their careers/job tied to Kubernetes, and how many developers get drawn in by the allure and marketing because it lets them work on more fun problems than their actual job, and all the provisioned instances and vendor software and certs and conferences, and all the money that represents.

  • By augusteo 2026-01-26 2:39 (1 reply)

    stackskipton makes a good point about authority. SRE works at Google because SREs can block launches and demand fixes. Without that organizational power, you're just an on-call engineer who also writes tooling.

    The article's premise (AI makes code cheap, so operations becomes the differentiator) has some truth to it. But I'd frame it differently: the bottleneck was never really "writing code." It was understanding what to build and keeping it running. AI helps with one of those. Maybe.

    • By nasretdinov 2026-01-26 8:54

      > because SREs can block launches and demand fixes

      I didn't find that particularly true during my tenure, but obviously Google is huge, so there probably exist teams that actually can afford to behave this way...

HackerNews