Don't rent the cloud, own instead

2026-02-05 · blog.comma.ai

Data centers are cool, everyone should have one.

These days it seems you need a trillion fake dollars, or lunch with politicians to get your own data center. They may help, but they’re not required. At comma we’ve been running our own data center for years. All of our model training, metrics, and data live in our own data center in our own office. Having your own data center is cool, and in this blog post I will describe how ours works, so you can be inspired to have your own data center too.

Our data center

Why no cloud?

If your business relies on compute, and you run that compute in the cloud, you are putting a lot of trust in your cloud provider. Cloud companies generally make onboarding very easy, and offboarding very difficult. If you are not vigilant you will sleepwalk into a situation of high cloud costs and no way out. If you want to control your own destiny, you must run your own compute.

Self-reliance is great, but there are other benefits to running your own compute. It inspires good engineering. Maintaining a data center is mostly about solving real-world challenges. The cloud requires expertise in company-specific APIs and billing systems; a data center requires knowledge of Watts, bits, and FLOPs. I know which one I'd rather think about.

Avoiding the cloud for ML also creates better incentives for engineers. Engineers generally want to improve things. In ML many problems go away by just using more compute. In the cloud that means improvements are just a budget increase away. This locks you into inefficient and expensive solutions. Instead, when all you have available is your current compute, the quickest improvements are usually speeding up your code, or fixing fundamental issues.

Finally, there's cost: owning a data center can be far cheaper than renting in the cloud. This is especially true if your compute or storage needs are fairly consistent, which tends to be the case if you are in the business of training or running models. In comma's case I estimate we've spent ~$5M on our data center, and we would have spent $25M+ had we done the same things in the cloud.
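Those figures imply a simple break-even model. The sketch below is illustrative only: the capex/opex split and the cloud rate are my own assumed numbers, chosen to roughly match the ~$5M vs $25M+ estimate above.

```python
# Back-of-envelope owned-vs-cloud comparison. All inputs are
# illustrative assumptions, not comma's actual accounting.
def cumulative_cost(capex, opex_per_year, years):
    """Total spend after `years` of operation."""
    return capex + opex_per_year * years

# Hypothetical split: $3.5M of hardware up front plus ~$0.5M/year of
# power and maintenance, vs ~$5M/year of equivalent cloud rental.
owned = cumulative_cost(capex=3_500_000, opex_per_year=500_000, years=5)
cloud = cumulative_cost(capex=0, opex_per_year=5_000_000, years=5)

print(f"owned: ${owned / 1e6:.1f}M, cloud: ${cloud / 1e6:.1f}M")
```

With these assumed inputs the gap after five years is roughly the 5x the post describes; the crossover point moves with utilization, which is why consistent load favors owning.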

What’s all needed?

Our data center is pretty simple. It's built and maintained by only a couple of engineers and technicians. Your needs may be slightly different, but our implementation should provide useful context.

Power

To run servers you need power. We currently draw about 450kW at peak. Operating a data center exposes you to many fun engineering challenges, but procuring power is not one of them. San Diego power costs over 40c/kWh, ~3x the global average. It's a ripoff, overpriced largely due to political dysfunction. We spent $540,112 on power in 2025, a big part of the data center's cost. In a future blog post I hope to tell you about how we produce our own power and why you should too.

data center power usage
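The bill and the rate pin down the average draw. A quick sanity check, assuming a flat ~$0.40/kWh (the post only says "over 40c"):

```python
# Implied average draw from the annual power bill.
annual_bill = 540_112   # USD spent on power in 2025
rate = 0.40             # USD per kWh, approximate
hours = 365 * 24

kwh = annual_bill / rate   # ~1.35M kWh consumed over the year
avg_kw = kwh / hours       # average draw

print(f"{kwh / 1e6:.2f}M kWh -> average {avg_kw:.0f} kW (vs ~450 kW peak)")
```

An average around a third of peak is consistent with training load that ebbs and flows rather than running flat out.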

Cooling

Data centers need cool, dry air. Typically this is achieved with a CRAC system, but those are power-hungry. San Diego has a mild climate, so we opted for pure outside-air cooling. This gives us less control over temperature and humidity, but uses only a couple dozen kW. We have dual 48” intake fans and dual 48” exhaust fans to keep the air cool. To ensure low humidity (<45%) we use recirculating fans to mix hot exhaust air into the intake air. One server is connected to several sensors and runs a PID loop that controls the fans to optimize temperature and humidity.

Filtered intake fan on right, 2 recirculating fans at the top
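For flavor, here is a minimal PID sketch in Python. The gains, setpoint, and the mapping from controller output to fan duty are all made-up assumptions; comma's actual control loop isn't published.

```python
class PID:
    """Textbook PID: output = kp*e + ki*(integral of e) + kd*(de/dt)."""
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement, dt):
        error = self.setpoint - measurement
        self.integral += error * dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

pid = PID(kp=8.0, ki=0.5, kd=1.0, setpoint=27.0)   # hypothetical 27 C target
out = pid.update(measurement=35.0, dt=1.0)         # room too hot
# Output is negative when too hot; a wrapper maps that to "spin fans
# faster", clamped to [0, 100]% duty.
fan_duty = min(100.0, max(0.0, -out))
print(f"controller output {out:.1f} -> fan duty {fan_duty:.0f}%")
```

In practice the same structure runs per sensor, with the humidity target deciding how much hot exhaust to recirculate.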

Servers

The majority of our current compute is 600 GPUs in 75 TinyBox Pro machines. They were built in-house, which saves us money and ensures they suit our needs. Our self-built machines fail at a similar rate to pre-built machines we’ve bought, but we’re capable of fixing them ourselves quickly. They have 2 CPUs and 8 GPUs each, and work as both training machines and general compute workers.

Breaker panels for all the computers, that’s a lot of breakers!

For data storage we have a few racks of Dell machines (R630 and R730), filled with SSDs for a total of ~4PB of storage. We use SSDs for reliability and speed. Our main storage arrays have no redundancy, and each node needs to be able to saturate the network with random-access reads. For the storage machines this means serving up to 20Gbps of random reads from each ~80TB node.

Other than storage and compute machines we have several one-off machines to run services. This includes a router, climate controller, data ingestion machine, storage master servers, metric servers, redis servers, and a few more.

Running the network requires switches, but at this scale we don't need complicated switch topologies. Three interconnected 100Gbps Z9264F switches serve as the main Ethernet network. Two more InfiniBand switches interconnect the two TinyBox Pro groups for training all-reduce.

The software

To effectively use all these compute and storage machines you need some infra. At this scale, services don’t need redundancy to achieve 99% uptime. We use a single master for all services, which makes things pretty simple.

Setup

All servers get Ubuntu installed via PXE boot and are managed by Salt.

Distributed storage: minikeyvalue

All of our storage arrays use mkv. The main array is 3PB of non-redundant storage hosting the driving data we train on. We can read from this array at ~1TB/s, which means we can train directly on the raw data without caching. Redundancy is not needed since no specific piece of data is critical.

Storage nodes

We have an additional ~300TB non-redundant array to cache intermediate processed results. And lastly, we have a redundant mkv array to store all of our trained models and training metrics. Each of these 3 arrays has its own single master server.
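The core trick in a store like this is a deterministic key-to-volume mapping, so any client can find data without a lookup round-trip. The sketch below uses rendezvous hashing as a stand-in; minikeyvalue's actual placement logic lives in its source and may differ. The key name is hypothetical.

```python
import hashlib

def pick_volume(key, volumes):
    """Highest-random-weight (rendezvous) hashing: every client
    computes the same winner with no coordination."""
    def score(vol):
        return int.from_bytes(
            hashlib.md5(f"{vol}:{key}".encode()).digest(), "big")
    return max(volumes, key=score)

vols = ["http://node1:3001", "http://node2:3001", "http://node3:3001"]
# Hypothetical key naming; any deterministic string works.
print(pick_volume("route_2024/segment_42/qlog", vols))
```

Because the winner depends only on the key and the volume list, adding a node remaps only the keys that now score highest on it, which keeps rebalancing cheap.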

Workload management: slurm

We use Slurm to manage the compute nodes and compute jobs. We schedule two types of distributed work: PyTorch training jobs and miniray workers.

Distributed training: pytorch

To train models across multiple GPU nodes we use torch.distributed with FSDP. We have 2 separate training partitions, each interconnected with InfiniBand for training across machines. We wrote our own training framework to handle the training-loop boilerplate, but it's mostly just PyTorch.

reporter: comma's experiment tracking service

We have a custom model experiment tracking service (similar to wandb or tensorboard). It provides a dashboard for tracking experiments, and shows custom metrics and reports. It is also the interface for the mkv storage array that hosts the model weights. The training runs store the model weights there with a uuid, and they are available to download for whoever needs to run them. The metrics and reports for our latest models are also open.

Distributed compute: miniray

Besides training we have many other compute tasks, anything from running tests, running models, and pre-processing data to running agent rollouts for on-policy training. We wrote a lightweight open-source task scheduler called miniray that lets you run arbitrary Python code on idle machines. It's a simpler version of dask, with a focus on extreme simplicity. Slurm schedules any idle machine as an active miniray worker, which accepts pending tasks. All task information is hosted in a central redis server.

Our main training/compute machines. Notice the 400Gbps switch in the center.
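This is not miniray's actual API (see the repo for that), but the central-queue pattern it is built on can be sketched in a few lines, with an in-process queue standing in for redis:

```python
import pickle, queue, threading

tasks, results = queue.Queue(), queue.Queue()  # redis stand-ins

def worker():
    # An idle machine loops, pulling serialized (function, args) pairs.
    while True:
        item = tasks.get()
        if item is None:          # shutdown sentinel
            return
        fn, args = pickle.loads(item)
        results.put(fn(*args))

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

for i in range(10):               # "arbitrary python code" as pickled callables
    tasks.put(pickle.dumps((pow, (i, 2))))
for _ in threads:
    tasks.put(None)
for t in threads:
    t.join()

squares = sorted(results.get() for _ in range(10))
print(squares)  # -> [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The real system adds durability, result storage, and Slurm-driven worker lifecycle on top of this basic loop.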

Miniray workers with GPUs spin up a Triton inference server to run model inference with dynamic batching. A miniray worker can thus easily and efficiently run any of the models hosted in the model mkv storage array.
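Dynamic batching itself is simple to picture: queue individual requests and run the model once per batch. A toy sketch of the idea (Triton's real scheduler also handles timeouts, priorities, and tensor shapes):

```python
# Dynamic batching in miniature: amortize one model call over a batch
# of individually submitted requests. Illustrative only.
class DynamicBatcher:
    def __init__(self, model_fn, max_batch):
        self.model_fn = model_fn
        self.max_batch = max_batch
        self.pending = []

    def submit(self, x):
        self.pending.append(x)
        if len(self.pending) >= self.max_batch:
            return self.flush()
        return []

    def flush(self):
        batch, self.pending = self.pending, []
        return self.model_fn(batch) if batch else []

# A stand-in "model" that doubles its inputs.
b = DynamicBatcher(lambda xs: [2 * x for x in xs], max_batch=4)
out = []
for x in range(6):
    out += b.submit(x)
out += b.flush()      # the timeout path: flush the stragglers
print(out)  # -> [0, 2, 4, 6, 8, 10]
```

The win is throughput: GPU kernels over a batch of N requests cost far less than N separate single-item invocations.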

Miniray makes it extremely easy to scale parallel tasks to hundreds of machines. For example, the controls challenge record was set by just having ~1hr of access to our data center with miniray.

Code NFS monorepo

All our code is in a monorepo that we have cloned on our workstations. The monorepo is kept small (<3GB), so it can easily be copied around. When a training job or distributed miniray job is started from any workstation, the local monorepo, including all local changes, is cached on a shared NFS drive. Training jobs and miniray tasks are pointed at this cache, so all distributed work uses the exact codebase you have locally. Even the Python packages are identical: uv on the worker/trainer syncs the packages specified in the monorepo before starting any work. This entire process of copying your local codebase and syncing all the packages takes only ~2s, and is well worth it to prevent the issues mismatches can cause.
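One way to key such a cache is by a content hash of the working tree, so identical code is copied only once. A sketch of the idea, not comma's actual tooling:

```python
import hashlib, os, tempfile

def tree_hash(root):
    """Deterministic digest over relative paths and file contents."""
    paths = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    h = hashlib.sha256()
    for path in sorted(paths):
        h.update(os.path.relpath(path, root).encode())
        with open(path, "rb") as f:
            h.update(f.read())
    return h.hexdigest()[:16]

repo = tempfile.mkdtemp()  # stand-in for a local monorepo checkout
with open(os.path.join(repo, "train.py"), "w") as f:
    f.write("print('hello')\n")
# A worker would sync to a cache dir named by this hash only when
# that exact snapshot isn't already present.
print(tree_hash(repo))
```

Any local edit changes the digest, so workers can never silently run stale code against your job.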

All together now

The most complex thing we do at comma is train driving models on-policy. These training runs require training data to be generated during training by running simulated driving rollouts with the most recent model weights. Here's a real-world command we just used to train such a model. This training run uses all of the infrastructure described above. While only this small command is needed to kick everything off, it orchestrates a lot of moving parts.

./training/train.sh N=4 partition=tbox2 trainer=mlsimdriving dataset=/home/batman/xx/datasets/lists/train_500k_20250717.txt vision_model=8d4e28c7-7078-4caf-ac7d-d0e41255c3d4/500 data.shuffle_size=125k optim.scheduler=COSINE bs=4

Diagram of all infrastructure involved in training an on-policy driving model.

Like this stuff?

Does all this stuff sound exciting? Then build your own data center for yourself or your company! You can also come work here.

Harald Schäfer
CTO @ comma.ai



Comments

  • By adamcharnock 2026-02-05 8:57 (23 replies)

    This is an industry we're[0] in. Owning is at one end of the spectrum, with cloud at the other, and broadly a couple of options in-between:

    1 - Cloud – This minimises cap-ex, hiring, and risk, while largely maximising operational costs (it's expensive) and cost variability (usage based).

    2 - Managed Private Cloud - What we do. Still minimal-to-no cap-ex, hiring, risk, and medium-sized operational cost (around 50% cheaper than AWS et al). We rent or colocate bare metal, manage it for you, handle software deployments, deploy only open-source, etc. Only really makes sense above €$5k/month spend.

    3 - Rented Bare Metal – Let someone else handle the hardware financing for you. Still minimal cap-ex, but with greater hiring/skilling and risk. Around 90% cheaper than AWS et al (plus time).

    4 - Buy and colocate the hardware yourself – Certainly the cheapest option if you have the skills, scale, cap-ex, and if you plan to run the servers for at least 3-5 years.

    A good provider for option 3 is someone like Hetzner. Their internal ROI on server hardware seems to be around the 3 year mark. After which I assume it is either still running with a client, or goes into their server auction system.

    Options 3 & 4 generally become more appealing either at scale, or when infrastructure is part of the core business. Option 1 is great for startups who want to spend very little initially, but then grow very quickly. Option 2 is pretty good for SMEs with baseline load, regular-sized business growth, and maybe an overworked DevOps team!

    [0] https://lithus.eu, adam@

    • By torginus 2026-02-05 11:09 (9 replies)

      I think the issue with this formulation is that what drives the cost at cloud providers isn't necessarily that their hardware is too expensive (which it is), but that they push you towards overcomplicated and inefficient architectures that cost too much to run.

      At the core of this are all the 'managed' services - if you have a server box, it's in your financial interest to squeeze as much perf out of it as possible. If you're using something like ECS or serverless, AWS gains nothing by optimizing the servers to make your code run faster - their hard work results in fewer billed infrastructure hours.

      This 'microservices' push usually means that instead of having an on-server session where you can serve stuff from a temporary cache, all the data that persists between requests needs to be stored in a db somewhere, all the auth logic needs to re-check your credentials, and something needs to direct the traffic and load balance these endpoints - and all this stuff costs money.

      I think if you have 4 Java boxes as servers with a redundant DB with read replicas on EC2, your infra is so efficient and cheap that even paying 4x for it rather than going for colocation is well worth it because of the QoL and QoS.

      These crazy AWS bills usually come from using every service under the sun.

      • By bojangleslover 2026-02-05 12:23 (6 replies)

        The complexity is what gets you. One of AWS's favorite situations is

        1) Senior engineer starts on AWS

        2) Senior engineer leaves because our industry does not value longevity or loyalty at all whatsoever (not saying it should, just observing that it doesn't)

        3) New engineer comes in and panics

        4) Ends up using a "managed service" to relieve the panic

        5) New engineer leaves

        6) Second new engineer comes in and not only panics but outright needs help

        7) Paired with some "certified AWS partner" who claims to help "reduce cost" but who actually gets a kickback from the extra spend they induce (usually 10% if I'm not mistaken)

        Calling it ransomware is obviously hyperbolic but there are definitely some parallels one could draw

        On top of it all, AWS pricing is about to massively go up due to the RAM price increase. There's no way it can't since AWS is over half of Amazon's profit while only around 15% of its revenue.

        • By Aurornis 2026-02-05 16:19 (3 replies)

          One of the biggest problems with the self-hosted situations I’ve seen is when the senior engineers who set it up leave and the next generation has to figure out how to run it all.

          In theory with perfect documentation they’d have a good head start to learn it, but there is always a lot of unwritten knowledge involved in managing an inherited setup.

          With AWS the knowledge is at least transferable and you can find people who have worked with that exact thing before.

          Engineers also leave for a lot of reasons. Even highly paid engineers go off and retire, change to a job for more novelty, or decide to try starting their own business.

          • By strobe 2026-02-05 19:52 (1 reply)

            >With AWS the knowledge is at least transferable

            Unfortunately there are a lot of things in AWS that can also be messed up, so it might be really hard to research what is going on. For example, you could have hundreds of Lambdas running with no idea where the original sources are or how they connect to each other, or complex VPC network routing where rules and security groups are shared randomly between services, so a small change can degrade a completely different service (like you were hired to help with service X, but after your change some service Y went down that you weren't even aware existed).

            • By Hikikomori 2026-02-05 20:24

              Not much different from how it worked in companies I used to work for, except the situation was even worse as we had no API or UI to probe for information.

          • By ethbr1 2026-02-05 18:58

            There are many great developers who are not also SREs. Building and operating/maintaining have their different mindsets.

          • By Breza 2026-02-06 3:31

            In my experience, a back end on the cloud isn't necessarily any less complex than something self hosted.

        • By coliveira 2026-02-05 13:05 (2 replies)

          The end result of all this is that the percentage of people who know how to implement systems without AWS/Azure will be a single digit. From that point on, this will be the only "economic" way, it doesn't matter what the prices are.

          • By couscouspie 2026-02-05 13:33 (3 replies)

            That's not a factual statement about reality, but more of a normative judgement to justify resignation. Yes, professionals who know how to actually do these things are not abundantly available, but they are available enough to achieve the transition. The talent exists and is absolutely passionate about software freedom, and hence highly intrinsically motivated to work on it. The only thing lacking so far is demand; the available talent will skyrocket when the market starts demanding it.

            • By eitally 2026-02-05 14:51 (2 replies)

              They actually are abundantly available and many are looking for work. The volume of "enterprise IT" sysadmin labor dwarfs that of the population of "big tech" employees and cloud architects.

              • By organsnyder 2026-02-05 15:07 (1 reply)

                I've worked with many "enterprise IT" sysadmins (in healthcare, specifically). Some are very proficient generalists, but most (in my experience) are fluent in only their specific platforms, no different than the typical AWS engineer.

                • By toomuchtodo 2026-02-05 15:23 (2 replies)

                  Perhaps we need bootcamps for on prem stacks if we are concerned about a skills gap. This is no different imho from the trades skills shortage many developed countries face. The muscle must be flexed. Otherwise, you will be held captive by a provider "who does it all for you".

                  "Today, we are going to calculate the power requirements for this rack, rack the equipment, wire power and network up, and learn how to use PXE and iLO to get from zero to operational."

                  • By organsnyder 2026-02-05 19:02 (1 reply)

                    This might be my own ego talking (I see myself as a generalist), but IMHO what we need are people that are comfortable jumping into unfamiliar systems and learning on-the-fly, applying their existing knowledge to new domains (while recognizing the assumptions their existing knowledge is causing them to make). That seems much harder to teach, especially in a boot camp format.

                    • By toomuchtodo 2026-02-05 20:10

                      As a very curious autodidact, I strongly agree, but this talent is rare and can punch its own ticket (broadly speaking). These people innovate and build systems for others to maintain, in my experience. But, to your point, we should figure out the sorting hat for folks who want to radically own these on-prem systems [1] if they are needed.

                      [1] https://xkcd.com/705/

                  • By eitally 2026-02-06 16:38

                    I don't really think so. That was a ship that sailed ten years ago and nearly every sysadmin who is still proficient with managing on-prem stacks has adapted to also learn how to manage VPCs in an arbitrary cloud. It's not like this is a recent change.

              • By torginus 2026-02-05 15:55

                Yeah, anyone who has >10 years experience with servers/backend dev has almost certainly managed dedicated infra.

            • By friendzis 2026-02-05 14:04 (3 replies)

              > and the talent available will skyrocket, when the market starts demanding it.

              Part of what clouds are selling is experience. A "cloud admin" bootcamp graduate can be a useful "cloud engineer", but it takes some serious years of experience to become a talented on prem sre. So it becomes an ouroboros: moving towards clouds makes it easier to move to the clouds.

              • By phil21 2026-02-05 18:02 (1 reply)

                > A "cloud admin" bootcamp graduate can be a useful "cloud engineer",

                If by useful you mean "useful at generating revenue for AWS or GCP" then sure, I agree.

                These certificates and bootcamps are roughly equivalent to the Cisco CCNA certificate and training courses back in the 90's. That certificate existed to sell more Cisco gear - and Cisco outright admitted this at the time.

                • By friendzis 2026-02-06 6:52

                  In part - yes. Useful as in capable of spinning up services without opening glaring security holes or bringing half of the infra down. Like with any tech, it takes experience and guardrails to use it efficiently and effectively.

              • By SahAssar 2026-02-05 15:16

                > A "cloud admin" bootcamp graduate can be a useful "cloud engineer"

                That is not true. It takes a lot more than a bootcamp to be useful in this space, unless your definition is to copy-paste some CDK without knowing what it does.

              • By selimthegrim 2026-02-06 3:41

                Moving towards the brothel makes it easier to get away from the brothel.

            • By bix6 2026-02-05 13:41 (1 reply)

              > The only thing that is lacking so far is the demand and the talent available will skyrocket, when the market starts demanding it.

              But will the market demand it? AWS just continues to grow.

              • By bluGill 2026-02-05 14:05 (2 replies)

                Only time will tell. It depends on when someone with an MBA starts asking questions about cloud spending and runs the real numbers. People promoting self-hosting often don't count all the costs of self-hosting (AWS has people working 24x7 so that if something fails, someone is there to take action).

                • By cheema33 2026-02-05 17:45 (1 reply)

                  > AWS has people working 24x7 so that if something fails someone is there to take action..

                  The number of things that these 24x7 people from AWS will cover for you is small. If your application craps out for any number of reasons that doesn't have anything to do with AWS, that is on you. If your app needs to run 24x7 and it is critical, then you need your own 24x7 person anyway.

                  • By bluGill 2026-02-05 17:47 (1 reply)

                    All the hardware and network issues are on them. I agree that you still need your own people to support your applications, but that is only part of the problem.

                    • By iso1631 2026-02-05 19:58

                      I've got thousands of devices over hundreds of sites in dozens of countries. The number of hardware failures is tiny, and certainly doesn't need 24/7 response.

                      Meanwhile AWS breaks once or twice a year.

                • By misir 2026-02-05 15:36 (1 reply)

                  From what I've seen, if you're depending on AWS and something fails, you too need someone 24x7 who can take action. Sometimes magic happens and systems recover after AWS restarts their DNS, but usually the combination of events puts the application into an unrecoverable state that needs manual action. It doesn't always happen, but you need someone to be there if it ever does, or at bare minimum to evaluate whether the underlying issue is really caused by AWS or something else has to be done on top of waiting for them to fix it.

                  • By bluGill 2026-02-05 15:44 (2 replies)

                    How many problems is AWS able to handle for you that you are never aware of though?

                    • By Symbiote 2026-02-05 16:23

                      How many problems do you think there are?

                      I've only had one outage I could attribute to running on-prem, meanwhile it's a bit of a joke with the non-IT staff in the office that when "The Internet" (i.e. Cloudflare, Amazon) goes down with news reports etc our own services are all running fine.

                    • By ragall 2026-02-05 23:40

                      Distributed systems can partly fail in many subtly different ways, and you almost never notice it because there are people on-call taking care of them.

          • By ragall 2026-02-06 4:30

            It already is like that, but not because of the cloud. Those of us who began with computers in the era of the command line were forced to learn the internals of operating systems, and many ended up turning this hobby into a job.

            Youngsters nowadays start with very polished interfaces and smartphones, so even if the cloud wasn't there it would take them a decade to learn systems design on-the-job, which means it wouldn't happen anyway for most. The cloud nowadays mostly exists because of that dearth of system internals knowledge.

            While there still are around people who are able to design from scratch and operate outside a cloud, these people tend to be quite expensive and many (most?) tend to work for the cloud companies themselves or SaaS businesses, which means there's a great mismatch between demand and supply of experienced system engineers, at least for the salaries that lower tier companies are willing to pay. And this is only going to get worse. Every year, many more experienced engineers are retiring than the noobs starting on the path of systems engineering.

        • By infecto 2026-02-05 12:54 (3 replies)

          It's all anecdotal, but in my experience it's usually the opposite: a bored senior engineer wants to use something new and picks a bespoke AWS service for a new project.

          I am sure it happens a multitude of ways but I have never seen the case you are describing.

          • By alpinisme 2026-02-05 13:02

            I’ve seen your case more than the ransom scenario too. But also even more often: early-to-mid-career dev saw a cloud pattern trending online, heard it was a new “best practice,” and so needed to find a way to move their company to using it.

          • By walt_grata 2026-02-05 17:58

            Is that what I should be doing? I'm just encouraging the devs on my team to read designing data intensive apps and setting up time for group discussions. Aside from coding and meetings that is.

          • By asimeqi 2026-02-06 3:54 (1 reply)

            Since when is "bored" a synonym for "dishonest"?

            • By infecto 2026-02-06 13:12

              Please elaborate. Such a bold statement with zero logic around it.

        • By antonvs 2026-02-05 16:15

          > 3) New engineer comes in and panics

          > 4) Ends up using a "managed service" to relieve the panic

          It's not as though this is unique to cloud.

          I've seen multiple managers come in and introduce some SaaS because it fills a gap in their own understanding and abilities. Then when they leave, everyone stops using it and the account is cancelled.

          The difference with cloud is that it tends to be more central to the operation, so can't just be canceled when an advocate leaves.

        • By antonvs 2026-02-05 16:43 (2 replies)

          > One of AWS's favorite situations

          I'll give you an alternative scenario, which IME is more realistic.

          I'm a software developer, and I've worked at several companies, big and small and in-between, with poor to abysmal IT/operations. I've introduced and/or advocated cloud at all of them.

          The idea that it's "more expensive" is nonsense in these situations. Calculate the cost of the IT/operations incompetence, and the cost of the slowness of getting anything done, and cloud is cheap.

          Extremely cheap.

          Not only that, it can increase shipping velocity, and enable all kinds of important capabilities that the business otherwise just wouldn't have, or would struggle to implement.

          Much of the "cloud so expensive" crowd are just engineers too narrowly focused on a small part of the picture, or in denial about their ability to compete with the competence of cloud providers.

          • By acdha 2026-02-05 17:53 (1 reply)

            > Much of the "cloud so expensive" crowd are just engineers too narrowly focused on a small part of the picture, or in denial about their ability to compete with the competence of cloud providers

            This has been my experience as well. There are legitimate points of criticism, but every time I've seen someone try to make that argument it's been comparing significantly different levels of service (e.g. a storage comparison equating S3 with tape) or leaving out entire categories of cost, like the time someone claimed their bare-metal costs for a two-server database cluster were comparable to RDS without even accounting for things like power or backups.

            • By Symbiote 2026-02-07 0:27 (1 reply)

              You are welcome to criticise my DB cluster comparison: https://news.ycombinator.com/item?id=46910521

              • By acdha 2026-02-07 16:04 (1 reply)

                That leaves out staffing, backups, development and testing of a multi-location failover mechanism as robust as the RDS one, and a bunch of security compliance work if that's relevant.

                It's totally possible to beat AWS, and volume is the way to do it since your admin's salary doesn't scale linearly with storage. But every time I've tried to account for all of the costs it's been close enough that it made sense to put people on things which can't be outsourced.

                • By Symbiote 2026-02-08 12:01

                  If this database is a large portion of the infrastructure required then the fixed-ish costs don't scale so well, but a smaller cloud/hosting company should be considered.

                  But I have over 60 servers. Using the pricing calculator for the two AWS SaaS services that closely align with our primary service (40+ of those servers), we'd face a cost of over $1.2M/year if reserved for 3 years and paid upfront — that's for the service alone, I haven't added any bandwidth costs, or getting the data into those systems, and I've picked the minimum values for storage and throughput as I don't know what these should be. (Probably not the minimum.)

                  Add the remaining compute (~20 decent servers), a petabyte-scale storage pool, and all the rest, and the bill would likely exceed our entire IT budget including hardware, hosting, cloud services we do use, and all the salaries.

                  My rough estimate is our infrastructure costs would increase 8-10 times using AWS, our staff costs wouldn't reduce, and the risk to the budget would increase with variable usage.

                  This is tax money being spent, so I am asked every few years to justify why we aren't using cloud. (That's why I'm putting this much effort into a HN reply, the question was asked again recently.)

                  I know someone working in another country on essentially the same system for that country. They went all-in on AWS and pay every 1-2 months what we spend in a year, but have a fraction of our population/data.

          • By rcxdude 2026-02-06 11:53 (1 reply)

            From what I've seen this can work as a stopgap until IT gets its hooks into the cloud system, in which case you circle back to paying the costs of incompetence plus the costs of the cloud (sometimes stacked on top of each other).

            • By antonvs 2026-02-06 16:45

              There's still a benefit in terms of infrastructure reliability. Recovery times are faster, backups more reliable, etc. Basically, vendor managed is better than customer managed in most situations, assuming a competent vendor.

              Also, if the cloud systems are architected properly before IT gets hold of them, then they tend to retain their good properties for a long time, especially if others are paying attention to e.g. gitops pull requests.

              My current company ended up replacing its (small) operations team in order to get people with cloud expertise. We hired the new team for the skills we needed. It's worked out well.

        • By themafia 2026-02-05 23:34

          > 7) Paired with some "certified AWS partner"

          What do you think RedHat support contracts are? This situation exists in every technology stack in existence.

      • By mrweasel 2026-02-05 12:08 (3 replies)

        Just this week a friend of mine was spinning up some AWS managed service, complaining about the complexity and how any reconfiguration took 45 minutes to reload. It's a service you can just install with apt; the default configuration is fine. Not only are many services no longer cheaper in the cloud, the management overhead also exceeds that of on-prem.

        • By mystifyingpoi 2026-02-05 12:22, 2 replies

          I'd gladly use (and maybe even pay for!) an open-source reimplementation of AWS RDS Aurora. All the bells and whistles with failover, clustering, volume-based snaps, cross-region replication, metrics etc.

          As far as I know, nothing comes close to Aurora's functionality, even in the vibecoding world. No, 'apt-get install postgres' is not enough.

          • By SOLAR_FIELDS 2026-02-05 13:17

            Serverless v2 is one of the products that I was skeptical about, but it is genuinely one of the most robust solutions out there in that space. It has its warts, but I usually default to it for fresh installs because you get so much out of the box with it.

          • By sgarland 2026-02-05 14:53, 1 reply

            Nitpick (I blame Amazon for their horrible naming): Aurora and RDS are separate products.

            What you’re asking for can mostly be pieced together, but no, it doesn’t exist as-is.

            Failover: this has been a thing for a long time. Set up a synchronous standby, then add a monitoring job that checks heartbeats and promotes the standby when needed. Optionally use something like heartbeat to have a floating IP that gets swapped on failover, or handle routing with pgbouncer / pgcat etc. instead. Alternatively, use pg_auto_failover, which does all of this for you.

            Clustering: you mean read replicas?

            Volume-based snaps: assuming you mean CoW snapshots, that’s a filesystem implementation detail. Use ZFS (or btrfs, but I wouldn’t, personally). Or Ceph if you need a distributed storage solution, but I would definitely not try to run Ceph in prod unless you really, really know what you’re doing. Lightbits is another solution, but it isn’t free (as in beer).

            Cross-region replication: this is just replication? It doesn’t matter where the other node[s] are, as long as they’re reachable, and you’ve accepted the tradeoffs of latency (synchronous standbys) or potential data loss (async standbys).

            Metrics: Percona Monitoring & Management if you want a dedicated DB-first, all-in-one monitoring solution, otherwise set up your own scrapers and dashboards in whatever you’d like.

            What you will not get from this is Aurora’s shared cluster volume. I personally think that’s a good thing, because I think separating compute from storage is a terrible tradeoff for performance, but YMMV. What that means is you need to manage disk utilization and capacity, as well as properly designing your failure domain. For example, if you have a synchronous standby, you may decide that you don’t care if a disk dies, so no messing with any kind of RAID (though you’d then miss out on ZFS’ auto-repair from bad checksums). As long as this aligns with your failure domain model, it’s fine - you might have separate physical disks, but co-locate the Postgres instances in a single physical server (…don’t), or you might require separate servers, or separate racks, or separate data centers, etc.

            tl;dr you can fairly closely replicate the experience of Aurora, but you'll need to know what you're doing. And frankly, if you don't, then even if someone built an OSS product that does all of this, you shouldn't be running it in prod - how will you fix issues when they crop up?
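            A minimal sketch of the promote-or-not decision such a monitoring job has to make (names are hypothetical; pg_auto_failover, mentioned above, implements this properly):

            ```python
            # Hypothetical promote decision for a Postgres standby monitor.
            # A real setup would use pg_auto_failover rather than hand-rolling this.

            def should_promote(missed_heartbeats: int, required_misses: int = 3) -> bool:
                """Promote only after several consecutive missed heartbeats,
                so a single dropped check doesn't trigger a needless failover."""
                return missed_heartbeats >= required_misses

            # The surrounding loop would ping the primary, count consecutive
            # failures, and run `pg_ctl promote` on the standby when this
            # returns True (then swap the floating IP / pgbouncer routing).
            ```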

            • By vel0city 2026-02-05 15:03, 1 reply

              > you can fairly closely replicate the experience of Aurora

              Nobody doubts one could build something similar to Aurora given enough budget, time, and skills.

              But that's not replicating the experience of Aurora. The experience of Aurora is that I can have all of that in like 30 lines of Terraform and a few minutes. And then I don't need to worry about managing the zpools, I don't need to ensure the heartbeats are working fine, I don't need to worry about hardware failures (to a large extent), I don't need to drive to multiple different physical locations to set up the hardware, I don't need to worry about handling patching, etc.

              You might replicate the features, but you're not replicating the experience.

              • By sgarland 2026-02-05 16:09, 1 reply

                The person I replied to said they wanted an open-source reimplementation of Aurora. My point - which was probably poorly-worded, or just implied - was that there's a lot of work that goes into something like that, and if you can't put the pieces together on your own, you probably shouldn't be running it for anything you can't afford downtime on.

                Managed services have a clear value proposition. I personally think they're grossly overpriced, but I understand the appeal. Asking for that experience but also free / cheap doesn't make any sense.

                • By nextaccountic 2026-02-07 8:59

                  > Asking for that experience but also free / cheap doesn't make any sense.

                  Things that used to be very expensive suddenly become available for free after someone builds an open-source version. That's just the nature of open source.

                  It's unreasonable to demand it from someone, but people do build things and release them for free all the time! Indeed, it makes plenty of sense to imagine that at some point in time, open source offerings of Postgres will be comparable to Aurora in ease of use.

        • By infecto 2026-02-05 12:55, 2 replies

          What managed service? Curious - I don’t use the full suite of AWS services, but I’m wondering what would take 45 minutes. Maybe it was a large cluster of some sort that needed rolling changes?

          • By coliveira 2026-02-05 13:07, 1 reply

            My observation is that all these services are exploding in complexity, and they justify it by saying that there are more features now, so everyone needs to accept spending more and more time and effort for the same results.

            • By patrick451 2026-02-05 13:34

              It's basically the same dynamic as hedonic adjustment in CPI calculations: cars may cost twice as much, but now they have USB chargers built in, so inflation isn't really that bad.

          • By mrweasel 2026-02-05 13:12

            I think this was MWAA

        • By ragall 2026-02-05 23:41

          Cloud was never cheaper. It was just more convenient.

      • By coredog64 2026-02-05 14:54, 1 reply

        > If you're using something like ECS or serverless, AWS gains nothing by optimizing the servers to make your code run faster - their hard work results in less billed infrastructure hours.

        If ECS is faster, then you're more satisfied with AWS and less likely to migrate. You're also open to additional services that might bring up the spend (e.g. ECS Container Insights or X-Ray)

        Source: Former Amazon employee

        • By torginus 2026-02-05 16:01, 1 reply

          We did some benchmarks and ECS was definitely quite a bit more expensive for a given capacity than just running docker on our own EC2 instances. It also bears pointing out that a lot of applications (either in-house or off-the-shelf) expect a persistent mutable config directory or sqlite database.

          We used EFS to solve that issue, but it was very awkward, expensive, and slow; it's certainly not meant for that.

          • By coredog64 2026-02-09 18:39

            Wait until you run the smallest Fargate capacities and then Container Insights costs more in CloudWatch than your infrastructure costs :)

      • By parentheses 2026-02-05 17:46

        Fully agree with this. I find the cost of cloud providers is mostly driven by architecture. If you're cost-conscious, cloud architectures need to be designed up-front with this in mind.

        Microservices are a cost killer. For each microservice pod you're often running a bunch of sidecars (Datadog, auth, ingress), and you pay massive workload-separation overhead in orchestration, management, monitoring, and of course complexity.

        I am just flabbergasted that this is how we operate as a norm in our industry.

      • By lumost 2026-02-05 17:21

        I don’t understand why most cloud backend designs seem to strive for maximizing the number of services used.

        My biggest gripe with this is async tasks where the app does numerous hijinks to avoid a 10 minute lambda processing timeout. Rather than structure the process to process many independent and small batches, or simply using a modest container to do the job in a single shot - a myriad of intermediate steps are introduced to write data to dynamo/s3/kinesis + sqs/and coordination.

        A dynamically provisioned, serverless container with 24 cores and 64 GB of memory can happily process GBs of data transformations.

      • By re-thc 2026-02-05 12:38

        > your infra is so efficient and cheap that even paying 4x for it rather than going for colocation is well worth it because of the QoL and QoS.

        You don’t need colocation to save 4x though. Bandwidth pricing is 10x. EC2 is 2-4x, especially outside the US. EBS, for its IOPS, is just bad.

      • By jdmichal 2026-02-05 15:12

        It's about fitting your utilization to the model that best serves you.

        If you can keep 4 "Java boxes" fed with work 80%+ of the time, then sure EC2 is a good fit.

        We do a lot of batch processing and save money over having EC2 boxes always on. Sure we could probably pinch some more pennies if we managed the EC2 box uptime and figured out mechanisms for load balancing the batches... But that's engineering time we just don't really care to spend when ECS nets us most of the savings advantage and is simple to reason about and use.

      • By BobbyTables2 2026-02-06 2:33

        Also don’t forget the layers of REST services, ORMs, and other lunacy.

        It runs slower than a bloated pig, especially on a shared hosting node, so now needs Kubernetes and cloud orchestration to make it “scalable” - beyond a few requests per second.

      • By nthdesign 2026-02-05 14:57

        Agreed. There is a wide price difference between running a managed AWS or Azure MySQL service and running MySQL on a VM that you spin up in AWS or Azure.

    • By bojangleslover 2026-02-05 12:20, 1 reply

      Great comment. I agree it's a spectrum, and for those of us who are comfortable at (4), like yourself and probably us at Carolina Cloud [0] as well, (4) seems like a no-brainer. But there's a long tail of semi-technical users who are more comfortable in 2-3 or even 1, which is what ultimately traps them in the ransomware-adjacent situation that is a lot of the modern public cloud. I would push back on "usage-based". Yes, it is technically usage-based, but the base fee also goes way up, and there are sometimes retainers on these services (i.e. minimum spend). So "usage-based" is not wrong, but what it usually means is "more expensive, and potentially far more expensive".

      [0] https://carolinacloud.io, derek@

      • By spwa4 2026-02-05 12:43

        The problem is that clouds have easily become 3 to 5 times the price of managed services, 10x the price of option 3, and 20x the price of option 4. To say nothing of the fact that almost all businesses can run fine on a "PC under the desk" type of setup.

        So in practice cloud has become the more expensive option the second your spend goes over the price of 1 engineer.

    • By Lucasoato 2026-02-05 9:28, 6 replies

      Hetzner is definitely an interesting option. I’m a bit scared of managing the services on my own (like Postgres, Site2Site VPN, …), but the price difference makes it so appealing. From our financial models, Hetzner can win over AWS when you spend over 10-15K per month on infrastructure and you’re hiring really well. It’s still a risk, but a risk that can definitely be worth it.

      • By mrweasel 2026-02-05 12:13, 1 reply

        > I’m a bit scared of managing the services on my own

        I see it from the other direction: if something fails, I have complete access to everything, meaning I have a chance of fixing it - down to the hardware, even. When I run stuff in the cloud, things get abstracted away, hidden behind APIs, and data lives beyond my reach.

        Security and regular mistakes are much the same in the cloud, but there I also have to layer whatever complications the cloud provider comes with on top. The cost has to be much, much lower if I'm going to trust a cloud provider over running something in my own data center.

        • By iso1631 2026-02-05 23:28

          Do you want the power to fix it, or do you want paper to wave so you aren't held accountable?

          The main benefit of outsourcing to aws etc is that the CEO isn't yelling at you when it breaks, because their golf buddies are in the same situation.

      • By adamcharnock 2026-02-05 9:33

        You sum it up very neatly. We've heard this from quite a few companies, and that's kind of why we started ours.

        We figured, "Okay, if we can do this well, reliably, and de-risk it; then we can offer that as a service and just split the difference on the cost savings"

        (plus we include engineering time proportional to cluster size, and also do the migration on our own dime as part of the de-risking)

      • By wulfstan 2026-02-05 11:54, 2 replies

        I've just shifted my SWE infrastructure from AWS to Hetzner (literally in the last month). My current analysis looks like it will be about 15-20% of the cost - £240 vs 40-50 euros.

        Expect a significant exit expense, though, especially if you are shifting large volumes of S3 data. That's been our biggest expense. I've moved this to Wasabi at about 8 euros a month (vs about $70-80 a month on S3), but I've paid transit fees of about $180 - and it was more expensive because I used DataSync.

        Retrospectively, I should have just DIYed the transfer, but maybe others can benefit from my error...

        • By adamcharnock 2026-02-05 12:30, 2 replies

          FYI, AWS offers free egress when leaving them (because they were forced to by EU regulation, but they chose to offer it globally):

          https://aws.amazon.com/blogs/aws/free-data-transfer-out-to-i...

          But. Don't leave it until the last minute to talk to them about this. They don't make it easy, and require some warning (think months, IIRC)

          • By sciencejerk 2026-02-05 16:18

            Thank God for the EU regulations. USA has been too lax about cracking down on anti-competitive market practices

          • By wulfstan 2026-02-05 13:31

            Extremely useful information - unfortunately I just assumed this didn't apply to me because I am in the UK and not the EU. Another mistake, though given it's not huge amounts of money I will chalk it up to experience.

            Hopefully someone else will benefit from this helpful advice.

      • By iso1631 2026-02-05 10:47, 3 replies

        > I’m a bit scared of managing the services on my own (like Postgres, Site2Site VPN, …)

        Out of interest, how old are you? This was a quite normal expectation of a technical department even 15 years ago.

        • By christophilus 2026-02-05 12:07

          I’m curious to know the answer, too. I used to deploy my software on-prem back in the day, and that always included an installation of Microsoft SQL Server. So, all of my clients had at least one database server they had to keep operational. Most of those clients didn’t have an IT staff at all, so if something went wrong (which was exceedingly rare), they’d call me and I’d walk them through diagnosing and fixing things, or I’d Remote Desktop into the server if their firewalls permitted and fix it myself. Backups were automated and would produce an alert if they failed to verify.

          It’s not rocket science, especially when you’re talking about small amounts of data (small credit union systems in my example).

        • By infecto 2026-02-05 13:00, 2 replies

          No, it was not. 15 years ago Heroku was all the rage. Even the places that had bare metal usually had someone running something similar to devops, and at least core infra was not being touched. I am sure such places existed, but 15 years ago, while far away, we were already pretty far along from what you describe. At least in SV.

          • By acdha 2026-02-05 13:07, 1 reply

            Heroku was popular with startups who didn’t have infrastructure skills but the price was high enough that anyone who wasn’t in that triangle of “lavish budget, small team, limited app diversity” wasn’t using it. Things like AWS IaaS were far more popular due to the lower cost and greater flexibility but even that was far from a majority service class.

            • By infecto 2026-02-05 13:21, 3 replies

              I am not sure if you are trying to refute my lived experience, or what exactly the point is. Heroku was wildly popular with startups at the time, not just those with lavish budgets. I was already touching RDS at this point, and even before RDS came around, no organization I worked at had me jumping on bare metal to provision services myself. There was always a system in place where someone helped engineering deploy systems. I know this was not always the case, but the person I was responding to made it sound like 15 years ago all engineers were provisioning their own databases and doing other kinds of dev/sysops on a regular basis. That's not true, at least in SV.

              • By acdha 2026-02-05 17:43, 1 reply

                I have no doubt that was your experience. My point was that it wasn't even common in SV as a whole, just the startup scene. Think about headcount: how many times fewer people worked at your startup than at any one of Apple, Oracle, HP, Salesforce, Intuit, eBay, Yahoo, etc.? Then think about how many other companies there are just in the Bay Area with large IT investments, even if they're not tech companies.

                Even at their peak, Heroku was a niche. If you'd gone to conferences like WWDC or PyCon at the time, they'd be well represented, yes, and plenty of people liked them, but it wasn't a secret that they didn't cover everyone's needs or that the pricing was off-putting for many people - and that tended to go up the bigger the company you talked to, because larger organizations have more complex needs, and they use enough stuff that they already have teams of people with those skills.

                • By infecto 2026-02-06 13:08

                  I think we are talking past each other here. Your language is a bit provocative, but you're not wrong about the original point; I have already agreed about startups, absolutely. Lavish budgets, though? No, that's just silly.

                  Again, 15 years ago, even in moderately large organizations, it was quite common for a product engineer not to be responsible for provisioning all the required services for whatever they were building. And again, it's not the rule, but it's far from being an exception. Not sure what you're trying to prove or disprove.

              • By sanderjd 2026-02-05 15:39, 1 reply

                A tricky thing on this site is that there are lots of different people with very different kinds of experience, which often results in people talking past each other. A lot of people here have experience as zero-to-one early startup engineers, and yep, I share your experience that Heroku was very popular in that space. A lot of other people have experience at later growth and infrastructure focused startups, and they have totally different experiences. And other people have experience as SREs at big tech, or doing IT / infrastructure for non-tech fortune 500 businesses. All of these are very different experiences, and very different things have been popular over the last couple decades depending on which kind of experience you have.

                • By infecto 2026-02-05 17:32, 1 reply

                  Absolutely true, but I also think it's a fair callout when the intent was to disprove the original post, which asked how old someone was because supposedly 15 years ago everyone was stringing together their own services - which is absolutely not true. There were many shades of gray at that time, both in my experience of either having a sysops/devops team to help or deploying to Heroku, as well as folks who were indeed stringing together services.

                  I find it equally disingenuous to suggest that Heroku was only for startups with lavish budgets. Absolutely not true. That’s my only purpose here. Everyone has different experiences but don’t go and push your own narrative as the only one especially when it’s not true.

                  • By sanderjd 2026-02-05 19:28

                    I kind of thought the "15 years" was just one of those things where people kind of forget what year it is. Wow, 2010 was already over 15 years ago?? That kind of mistake. I think this person was thinking pre-2005. I graduated college just after that, and that's when all this cloud and managed services stuff was just starting to explode. I think it's true that before that, pretty much everyone was maintaining actual servers somewhere. (For instance, I helped out with the physical servers for our CS lab some when I was in college. Most of what we hosted on those would be easier to do on the cloud now, but that wasn't a thing then.)

              • By iso1631 2026-02-05 17:21, 1 reply

                > Heroku was wildly popular with startups

                The world's a lot bigger than startups

                • By infecto 2026-02-05 17:35

                  Did you fail to finish reading the rest? At the same time, I had contact with organizations that were still in data centers, but I as an engineer never touched the bare metal, and ticket systems were in place to help provision necessary services. I was not deploying my own Postgres database.

                  Your original statement is factually incorrect.

          • By unethical_ban 2026-02-05 17:21, 1 reply

            SV and financial services are quite different.

            It's 2026 and banks are still running their mainframes, running Windows VMs on VMware, and building their enterprise software with Java.

            The big boys still have their own datacenters they own.

            Sure, they try dabbling with cloud services; maybe they've pushed their edge out there, plus some minor services they can afford to experiment with.

            • By infecto 2026-02-05 17:49, 1 reply

              If you are working at a bank, you are most likely not standing up your own Postgres and related services - even 15 years ago. I am not saying it never happened; I am saying that even 15 years ago, even large orgs with data centers often had sysops and devops in place that helped with provisioning resources. Obviously not the rule, but also not an exception.

              • By unethical_ban 2026-02-05 18:04

                True. We had separate teams for Oracle and MSSQL management. We had 3 teams each for Windows, "midrange" (Unix) and mainframe server management. That doesn't include IAM.

        • By Lucasoato 2026-02-05 16:53

          Haha, I'm 31, but deciding whether it makes sense to manage your own DB doesn't depend on the age of the CTO.

          See, spinning up a VM and installing and running Postgres is easy.

          The hard part is keeping it updated, keeping the OS updated, automating backups, deploying replicas, encrypting the volumes and the backups, demonstrating all of the above to a third-party auditor... and mind that there might be many other things I'm honestly not even aware of!

          I'm not saying I won't go that path, it might be a good idea after a certain scale, but in the first and second year of a startup your mind should 100% be on "How can I make my customer happy" rather than "We failed again the audit, we won't have the SOC 2 Type I certification in time to sign that new customer".

          If deciding between Hetzner and AWS was so easy, one of them might not be pricing its services correctly.

      • By baby 2026-02-05 10:49, 1 reply

        I’m wondering if it makes sense to distribute your architecture so that workers who do most of the heavy lifting are in hetzner, while the other stuff is in costly AWS. On the other hand this means you don’t have easy access to S3, etc.

        • By rockwotj 2026-02-05 10:56, 1 reply

          Networking costs are so high in AWS that I doubt this makes sense.

          • By mattbillenstein 2026-02-05 20:52

            Depends on how data-heavy the work is. We run a bunch of GPU training jobs on other clouds, with the data ending up in S3 - the extra transfer costs are small relative to what we save by getting the GPUs from the cheapest cloud available, so it makes a lot of sense.

            Also, plain availability of these things on AWS has been a real pain - I think every startup got a lot of credits there, so there's a flood of people trying to use them.

      • By objektif 2026-02-05 13:04, 2 replies

        No amount of money will make me maintain my own dbs. We tried it at first and it was a nightmare.

        • By g8oz 2026-02-05 13:41, 2 replies

          It's worth becoming good at.

          • By sanderjd 2026-02-05 15:41, 2 replies

            Is it though? This is a genuine question. My intuition is that the investment of time / stress / risk to become good at this is unlikely to have high ROI to either the person putting in that time or to the business paying them to do so. But maybe that's not right.

            • By g8oz 2026-02-05 21:52, 1 reply

              It's more nuanced for sure than my pithy comment suggests. I've done both self-managed and managed and felt it was a good use of my time to self-manage given the size of the organizations, the profile of the workloads and the cost differential. There is a whole universe of technology businesses that do not earn SV/FAANG levels of ROI - for them, self-managed is a reasonable allocation of effort.

              One point to keep in mind is that the effort is not constant. Once you reach a certain level of competency and stability in your setup, there is not much difference in time spent. I also felt that self-managed gave us more flexibility in terms of tuning.

              My final point is that any investment in databases whether as a developer or as an ops person is long-lived and will pay dividends for a longer time than almost all other technologies.

              • By sanderjd 2026-02-05 22:18

                I feel like you and I have similar experiences, but have drawn entirely opposite conclusions from them :)

            • By Symbiote 2026-02-05 16:33, 3 replies

              Managing the PostgreSQL databases is a medium to low complexity task as I see it.

              Take two equivalent machines, set up with streaming replication exactly as described in the documentation, add Bacula for backups to an off-site location for point-in-time recovery.

              We haven't felt the need to set up auto fail-over to the hot spare; that would take some extra effort (and is included with AWS equivalents?) but nothing I'd be scared of.

              Add monitoring that the DB servers are working, replication is up-to-date and the backups are working.
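              Checking that replication is up-to-date boils down to comparing write-ahead-log positions. Postgres reports these as LSNs like `16/B374D848`, which are 64-bit numbers written as two hex halves, so the lag in bytes is a subtraction (a sketch; in practice you'd read the two LSNs from `pg_stat_replication`):

              ```python
              def parse_lsn(lsn: str) -> int:
                  """Convert a PostgreSQL LSN such as '16/B374D848' to a 64-bit int.
                  The part before the slash is the high 32 bits; the part after
                  is the low 32 bits."""
                  high, low = lsn.split("/")
                  return (int(high, 16) << 32) | int(low, 16)

              def replication_lag_bytes(primary_lsn: str, standby_lsn: str) -> int:
                  """Bytes of WAL the standby still has to receive/replay."""
                  return parse_lsn(primary_lsn) - parse_lsn(standby_lsn)

              # A monitoring check would alert when the lag exceeds some threshold.
              ```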

              • By riku_iki 2026-02-05 18:50, 1 reply

                > We haven't felt the need to set up auto fail-over to the hot spare; that would take some extra effort (and is included with AWS equivalents?) but nothing I'd be scared of.

                This part is actually the scariest, since there are like 10 different third-party solutions of unknown stability and maintainability.

                • By Symbiote 2026-02-06 9:19

                  I think if something like that is worrisome, I'd contact a PostgreSQL consultant for advice.

                  AWS charge about $500/month for this, so there's plenty of room to pay a consultant and still come out way ahead.

              • By cheema33 2026-02-05 18:09, 3 replies

                > Managing the PostgreSQL databases is a medium to low complexity task as I see it.

                Same here. But I assume you have managed PostgreSQL in the past. I have. There are a large number of software devs who have not. For them, it is not a low-complexity task. And I can understand that.

                I am a software dev for our small org and I run the servers and services we need. I use ansible and terraform to automate as much as I can. And recently I have added LLMs to the mix. If something goes wrong, I ask Claude to use the ansible and terraform skills that I created for it, to find out what is going on. It is surprisingly good at this. Similarly I use LLMs to create new services or change configuration on existing ones. I review the changes before they are applied, but this process greatly simplifies service management.

                • By Dylan16807 2026-02-05 19:44, 2 replies

                  > Same here. But, I assume you have managed PostgreSQL in the past. I have. There are a large number of people software devs who have not. For them, it is not a low complexity task. And I can understand that.

                  I'd say needing to read the documentation for the first time is what bumps it up from low complexity to medium. And then at medium you should still do it if there's a significant cost difference.

                  • By Symbiote 2026-02-06 9:06

                    You can ask an AI or StackOverflow or whatever for the correct way to start a standby replica, though I think the PostgreSQL documentation is very good.

                    But if you were in my team I'd expect you to have read at least some of the documentation for any service you provision (self-hosted or cloud) and be able to explain how it is configured, and to document any caveats, surprises or special concerns and where our setup differs / will differ from the documented default. That could be comments in a provisioning file, or in the internal wiki.

                    That probably increases our baseline complexity since "I pressed a button on AWS YOLO" isn't accepted. I think it increases our reliability and reduces our overall complexity by avoiding a proliferation of services.

                  • By sanderjd 2026-02-05 22:18, 1 reply

                    But is there a significant cost difference? I'm skeptical.

                    • By Symbiote 2026-02-06 8:48

                      I hope this is the correct service, "Amazon RDS for PostgreSQL"? [1]

                      The main pair of PostgreSQL servers we have at work each have two 32-core (64-vthread) CPUs, so I think that's 128 vCPU each in AWS terms. They also have 768GiB RAM. This is more than we need, and you'll see why at the end, but I'll be generous and leave this as the default the calculator suggests, which is db.m5.12xlarge with 48 vCPU and 192GiB RAM.

                      That would cost $6559/month, or less if reserved which I assume makes sense in this case — $106400 for 3 years.

                      Each server has 2TB of RAID disk, of which currently 1TB is used for database data.

                      That is an additional $245/month.

                      "CloudWatch Database Insights" looks to be more detailed than the monitoring tool we have, so I will exclude that ($438/month) and exclude the auto-failover proxy as ours is a manual failover.

                      With the 3-year upfront cost this is $115000, or $192000 for 5 years.

                      Alternatively, buying two of yesterday's [2] list-price [3] Dell servers which I think are close enough is $40k with five years warranty (including next-business-day replacement parts as necessary).

                      That leaves $150000 for hosting, which as you can see from [4] won't come anywhere close.

                      We overprovision the DB server so it has the same CPU and RAM as our processing cluster nodes — that means we can swap things around in some emergency as we can easily handle one fewer cluster node, though this has never been necessary.

                      When the servers are out of warranty, depending on your business, you may be able to continue using them for a non-prod environment. Significant failures are still very unusual, but minor failures (HDDs) are more common and something we need to know how to handle anyway.

                      [1] https://calculator.aws/#/createCalculator/RDSPostgreSQL

                      [2] https://news.ycombinator.com/item?id=46899042

                      [3] There are significant discounts if you order regularly, buy multiple servers, or can time purchases when e.g. RAM is cheaper.

                      [4] https://www.voxility.com/colocation/prices

                • By sanderjd 2026-02-05 19:22 · 1 reply

                  For what it's worth, I have also managed my own databases, but that's exactly why I don't think it's a good use of my time. Because it does take time! And managed database options are abundant, inexpensive, and perform well. So I just don't really see the appeal of putting time into this.

                  • By mattbillenstein 2026-02-05 20:54 · 1 reply

                    If you have a database, you still have work to do - optimizing, understanding indexes, etc. Managed services don't magically solve these problems for you, and once you've done them, just running the db itself isn't such a big deal - and it's probably easier to tune for what you want to do.

                    • By sanderjd 2026-02-05 21:08 · 1 reply

                      Absolutely yes. But you have to do this either way. So it's just purely additive work to run the infrastructure as well.

                      I think if it were true that the tuning is easier if you run the infrastructure yourself, then this would be a good point. But in my experience, this isn't the case for a couple reasons. First of all, the majority of tuning wins (indexes, etc.) are not on the infrastructure side, so it's not a big win to run it yourself. But then also, the professionals working at a managed DB vendor are better at doing the kind of tuning that is useful on the infra side.

                      • By mattbillenstein 2026-02-06 1:02 · 1 reply

                        Maybe, but you're paying through the nose continually for something you could learn to do once - or someone on your team could easily learn with a little time and practice. If this is a tradeoff you want to make, that's fine, but learning that extra 10% can halve your hosting costs, so at some point it's well worth it.

                        • By sanderjd 2026-02-06 16:13

                          It's not the learning, it's the ongoing commitment of time and energy into something that is not a differentiator for the business (unless it is actually a database hosting business).

                          I can see how the cost savings could justify that, but I think it makes sense to bias toward avoiding investing in things that are not related to the core competency of the business.

                • By objektif 2026-02-05 19:48

                  How do you manage availability zones in your fully self managed setup?

              • By sanderjd 2026-02-05 17:17 · 1 reply

                This sounds medium to high complexity to me. You need to do all those things, and also have multiple people who know how to do them, and also make sure that you don't lose all the people who know how to do them, and have one of those people on call to be able to troubleshoot and fix things if they go wrong, and have processes around all that. (At least if you are running in production with real customers depending on you, you should have all those things.)

                With a managed solution, all of that is amortized into your monthly payment, and you're sharing the cost of it across all the customers of the provider of the managed offering.

                Personally, I would rather focus on things that are in or at least closer to the core competency of our business, and hire out this kind of thing.

                • By objektif 2026-02-05 17:34 · 1 reply

                  You are right. Are you actually seriously considering whether to go fully managed or self managed at this point? Pls go AWS route and thank me later :)

                  • By sanderjd 2026-02-05 19:23 · 1 reply

                    No not at all, I have the same opinion as you! But I'm curious to understand the opposite view.

                    • By Symbiote 2026-02-06 9:14 · 1 reply

                      I ran through roughly our numbers here [1]; it looks like self-hosted costs us about 25% of AWS.

                      I didn't include labour costs, but the self-hosted tasks (set up of hardware, OS, DB, backup, monitoring, replacing a failed component which would be really unusual) are small compared to the labour costs of the DB generally (optimizing indices, moving data around for testing etc, restoring from a backup).

                      [1] https://news.ycombinator.com/item?id=46910521

                      • By sanderjd 2026-02-06 16:15

                        Yes thank you for that. I always feel like these up front cost analyses miss (or underrate) the ongoing operational cost to monitor and be on call to fix infrastructure when problems occur. But I do find it plausible that the cost savings are such that this can be a wise choice nonetheless.

          • By objektif 2026-02-05 17:37

            I really do not think so. Most startups should rather focus on their core competency and direct engineering resources to their edge. When you are $100 mln ARR then feel free to mess around with whatever db setup you want.

        • By dev_l1x_be 2026-02-05 20:47

          Or CDN, queues, log service, observability, distributed storage. I am not even sure what the people in the on-prem vs cloud argument think. If you need a highly specialised infra with one or two core services and a lower-tier network is OK, then on-prem is OK. Otherwise it is a never-ending quest to re-discover the millions of engineering hours that went into building something like AWS.

    • By ibejoeb 2026-02-05 16:04

      Dead on. Recently, 3 and 4 have been compelling. Cloud costs have rocketed up. I started my casual transition to colo 2 years ago and just in December finished everything. I have more capacity at about 30% of the cost. If you go with option 3, you even get the benefit of 6+ month retro pricing for RAM/storage. I'm running all DDR4, but I have so much of it I don't know what to do with it.

      The flip side is that compliance is a little more involved. Rather than, say, carve out a whole swathe of SOC-2 ops, I have to coordinate some controls. It's not a lot, and it's still a lot lighter than I used to do 10+ years ago. Just something to consider.

    • By boplicity 2026-02-05 14:06 · 1 reply

      I don't know. I rent a bare metal server for $500 a month, which is way overkill. It takes almost no time to manage -- maybe a few hours a year -- and can handle almost anything I throw at it. Maybe my needs are too simple though?

      • By edge17 2026-02-05 14:20 · 1 reply

        Just curious, what is the spec you pay $6000/year for? Where/what is the line between rent vs buy?

        • By boplicity 2026-02-05 14:53 · 1 reply

          It's a server with:

          - 2x Intel Xeon 5218

          - 128GB RAM

          - 2x 960GB SSD

          - 30TB monthly bandwidth

          I pay around an extra $200/month for "premium" support and Acronis backups, both of which have come in handy, but are probably not necessary. (Automated backups to AWS are actually pretty cheap.) It definitely helps with peace of mind, though.

          • By cheema33 2026-02-05 17:54 · 1 reply

            I have a similar system from Hetzner. I pay around $100 for it. No bandwidth cap.

            I have setup encrypted backups to go to my backup server in the office. We have a gigabit service at the office. Critical data changes are backed up every hour and full backup once a day.
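            A schedule like that can be sketched as two cron entries - this assumes restic (which encrypts client-side by default) and placeholder paths/hostnames, not the commenter's actual setup:

```
# Hourly: snapshot only the critical data
0 * * * *  restic -r sftp:backup@office.example.com:/srv/restic backup /var/lib/critical

# Daily at 03:00: full backup (restic deduplicates, so unchanged data isn't resent)
0 3 * * *  restic -r sftp:backup@office.example.com:/srv/restic backup /srv /etc /home
```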

            • By boplicity 2026-02-05 20:05

              Yeah -- I know I could probably get a better deal. I pay more for premium support ($200), as well as a North American location. Plus, probably an additional premium for not wanting to go through the effort of switching servers.

    • By mgaunard 2026-02-05 9:07 · 1 reply

      You're missing 5: what they are doing.

      There is a world of difference between renting some cabinets in an Equinix datacenter and operating your own.

      • By adamcharnock 2026-02-05 9:20 · 4 replies

        Fair point!

        5 - Datacenter (DC) - Like 4, except also take control of the space/power/HVAC/transit/security side of the equation. Makes sense either at scale, or if you have specific needs. Specific needs could be: specific location, reliability (higher or lower than a DC), resilience (conflict planning).

        There are actually some really interesting use cases here. For example, reliability: If your company is in a physical office, how strong is the need to run your internal systems in a data centre? If you run your servers in your office, then there's no connectivity reliability concerns. If the power goes out, then the power is out to your staff's computers anyway (still get a UPS though).

        Or perhaps you don't need as high reliability if you're doing only batch workloads? Do you need to pay the premium for redundant network connections and power supplies?

        If you want your company to still function in the event of some kind of military conflict, do you really want to rely on fibre optic lines between your office and the data center? Do you want to keep all your infrastructure in such a high-value target?

        I think this is one of the more interesting areas to think about, at least for me!

        • By jermaustin1 2026-02-05 13:09

          When I worked IT for a school district at the beginning of my career (2006-2007), I was blown away that every school had a MASSIVE server room (my office at each school - the MDF). 3-5 racks filled (depending on school size and connection speed to the central DC - data closet) 50-75% was networking equipment (5 PCs per class hardwired), 10% was the Novell Netware server(s) and storage, the other 15% was application storage for app distributions on login.

        • By mgaunard 2026-02-05 9:31 · 1 reply

          Personally I haven't seen a scenario where it makes sense beyond a small experimental lab where you value the ability to tinker physically with the hardware regularly.

          Offices are usually very expensive real estate in city centers and with very limited cooling capabilities.

          Then again the US is a different place, they don't have cities like in Europe (bar NYC).

          • By kryptiskt 2026-02-05 13:26 · 2 replies

            If you are a bank or a bookmaker or similar you may well want to have total control of physical access to the machines. I know one bookmaker I worked with had their own mini-datacenter, mainly because of physical security.

            • By tomcam 2026-02-05 13:57

              I am pretty forward-thinking but even when I started writing my first web server 30+ years ago I didn’t foresee the day when the phrase “my bookie’s datacenter” might cross my lips.

            • By mgaunard 2026-02-05 17:06

              Most trading venues are in Equinix data centers.

        • By direwolf20 2026-02-05 10:25

          If you have less than a rack of hardware, if you have physical security requirements, and/or your hardware is used in the office more than from the internet, it can make sense.

        • By noosphr 2026-02-05 10:02

          5 was a great option for ML work last year, since rented colo didn't come with a 10kW cable. With RAM, SSD and GPU prices the way they are now, I have no idea what you'd need to do.

          Thank goodness we did all the capex before the OpenAI ram deal and expensive nvidia gpus were the worst we had to deal with.

    • By weavie 2026-02-05 9:51 · 2 replies

      What is the upper limit of Hetzner? Say you have an AWS bill in the $100s of millions, could Hetzner realistically take on that scale?

      • By adamcharnock 2026-02-05 10:11 · 1 reply

        An interesting question, so time for some 100% speculation.

        It sounds like they probably have revenue in the €500mm range today. And given that the bare metal cost of AWS-equivalent bills tends to be a 90% reduction, we'll say a €10mm+ bare metal cost.

        So I would say a cautious and qualified "yes". But I know even for smaller deployments of tens or hundreds of servers, they'll ask you what the purpose is. If you say something like "blockchain," they're going to say, "Actually, we prefer not to have your business."

        I get the strong impression that while they naturally do want business, they also aren't going to take a huge amount of risk on board themselves. Their specialism is optimising on cost, which naturally has to involve avoiding or mitigating risk. I'm sure there'd be business terms to discuss, put it that way.

        • By StilesCrisis 2026-02-05 12:37 · 1 reply

          Why would a client who wants to run a Blockchain be risky for Herzner? I'm not a fan, I just don't see the issue. If the client pays their monthly bill, who cares if they're using the machine to mine for Bitcoin?

          • By Symbiote 2026-02-05 12:48 · 1 reply

            They are certain to run the machines at 100% continually, which will cost more than a typical customer who doesn't do this, and leave the old machines with less second-hand value for their auction thing afterwards.

            • By mbreese 2026-02-05 13:14 · 1 reply

              I’d bet the main reason would be power. Running machines at 100% doesn’t add much wear, but a server running hard 24 hours a day uses more power than a bursty workload.

              (While we’re all speculating)

              • By ndriscoll 2026-02-05 15:46

                Also very subject to wildly unstable market dynamics. If it's profitable to mine, they'll want as much capacity as they can get, leading Hetzner to over provision. Then once it becomes unprofitable, they'll want to stop all mining, leaving a ton of idle, unpaid machines. Better to have stable customers that don't swing 0-100 utilization depending on ability to arbitrage compute costs.

                I wouldn't be surprised if mining is also associated with fraud (e.g. using stolen credit cards to buy compute).

      • By geocar 2026-02-05 10:20 · 5 replies

        Who are you thinking of?

        Netflix might be spending as much as $120m (but probably a little less), and I thought they were probably Amazon's biggest customer. Does someone (a single buyer) spend more than that with AWS?

        Hetzner's revenue is somewhere around $400m, so it would probably be a little scary taking on an additional 30% of revenue from a single customer, and Netflix's shareholders would probably be worried about the risk of relying on a vendor that is much smaller than them.

        Sometimes, if the companies are friendly to the idea, they could form a joint venture, or maybe Netflix could just acquire Hetzner (and compete with Amazon?), but I think it unlikely Hetzner could take on a Netflix-sized customer, for nontechnical reasons.

        However, increasing PoP capacity by 30% within 6 months is pretty realistic, so I think they'd probably be able to physically service Netflix without changing too much, if management could get comfortable with the idea.

        • By phiresky 2026-02-05 10:30 · 1 reply

          A $120M spend on AWS is equivalent to around a $12M spend on Hetzner Dedicated (likely even less, the factor is 10-20x in my experience), so that would be 3% of their revenue from a single customer.
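          Spelling out that estimate (the 10x factor and both dollar figures are this thread's rough numbers, not quoted prices):

```python
aws_spend = 120_000_000              # hypothetical Netflix-scale AWS bill
hetzner_equivalent = aws_spend / 10  # "10-20x" cheaper on bare metal, per the comment
hetzner_revenue = 400_000_000        # rough figure from elsewhere in the thread

share = hetzner_equivalent / hetzner_revenue
print(f"{share:.0%}")  # -> 3%
```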

          • By geocar 2026-02-05 16:54 · 1 reply

            > A $120M spend on AWS is equivalent to around a $12M spend on Hetzner Dedicated (likely even less, the factor is 10-20x in my experience), so that would be 3% of their revenue from a single customer.

            I'm not convinced.

            I assume someone at Netflix has thought about this, because if it were true, and as simple as you say, Netflix would simply buy Hetzner.

            I think there are lots of reasons you could have this experience, and it still wouldn't be Netflix's experience.

            For one, big applications tend to get discounts. A decade ago, when I (the company I was working for) was paying Amazon a mere $0.2M a month, I was getting much better prices from my account manager than were posted on the website.

            There are other reasons (mostly from my own experiences pricing/costing big applications, but also due to some exotic/unusual Amazon features I'm sure Netflix depends on) but this is probably big enough: Volume gets discounts, and at Netflix-size I would expect spectacular discounts.

            I do not think we can estimate the factor better than 1.5-2x without a really good example/case-study of a company someplace in-between: How big are the companies you're thinking about? If they're not spending at least $5m a month I doubt the figures would be indicative of the kind of savings Netflix could expect.

            • By varsketiz 2026-02-05 20:18 · 1 reply

              We run our own infrastructure, sometimes with our own financing (4), sometimes external (3). The cost is in the tens of millions per year.

              When I used to compare to AWS, egress alone at list price cost as much as my whole infra hosting. All of it.

              I would be very interested to understand why Netflix does not go the 3/4 route. I would speculate that they get more return from putting money into optimising the costs of creating original content, rather than the cloud bill.

              • By geocar 2026-02-06 7:23 · 1 reply

                > I would be very interested to understand why netflix does not go 3/4 route. I would speculate that they get more return from putting money in optimising costs for creating original content, rather than cloud bill.

                I invest in Netflix, which means I'm giving them some fast cash to grow that business.

                I'm not giving them cash so that they can have cash.

                If they share a business plan that involves them having cash to do X, I wonder why they aren't just taking my cash to do X.

                They know this. That's why on the investors calls they don't talk about "optimising costs" unless they're in trouble.

                I understand self-hosting and self-building saves money in the long-long term, and so I do this in my own business, but I'm also not a public company constantly raising money.

                > When I used to compare to aws, only egress at list price costs as much as my whole infra hosting. All of it.

                I'm a mere 0.1% of your spend, and I get discounts.

                You would not be paying "list price".

                Netflix definitely would not be.

                • By varsketiz 2026-02-06 8:04

                  Of course Netflix is optimising costs, otherwise it would not be a business; I just think they put much more effort elsewhere. They could be using other words, like "financial discipline" :)

                  My point is that even if I get a 20x discount on egress it's still nowhere close, since I have to buy everything else - compute and storage are more expensive, and even with 5-10x discounts from list price it's not worth it.

                  (Our cloud bills are in the millions as well, I am familiar with what discounts we can get)

        • By objektif 2026-02-05 13:08 · 1 reply

          Figma apparently spends around 300-400k/day on AWS. I think this puts them up there.

          • By mbreese 2026-02-05 13:17

            How is this reasonable? At what point do they pull a Dropbox and de-AWS? I can’t think of why they would gain with AWS over in house hosting at that point.

            I’m not surprised, but you’d think there would be some point where they would decide to build a data center of their own. It’s a mature enough company.

        • By direwolf20 2026-02-05 10:23

          That $120m will become $12m when they're not using AWS.

        • By Quarrel 2026-02-05 12:23

          > Hetzner's revenue is somewhere around $400m, so probably a little scary taking on an additional 30% revenue from a single customer

          A little scary for both sides.

          Unless we're misunderstanding something I think the $100Ms figure is hard to consider in a vacuum.

        • By weavie 2026-02-05 18:05

          I'm largely just thinking $HUGE when throwing out that number, but there are plenty of companies that have cloud costs in that range. A quick search brings up Walmart, Meta, Netflix, Spotify, Snap, JP Morgan.

    • By DyslexicAtheist 2026-02-05 9:16 · 2 replies

      This is what we did in the '90s into the mid-2000s:

      > Buy and colocate the hardware yourself – Certainly the cheapest option if you have the skills

      Back then this type of "skill" was abundant. You could easily get sysadmin contractors who would drive down to the data center (probably rented facilities in real estate that belonged to a bank or insurer) to exchange some disks that had died for some reason. Such a person was full-stack in the sense that they covered backups, networking, firewalls, and knew how to source hardware.

      The argument was that this was too expensive and the cloud was better. So hundreds of thousands of SMEs embraced the cloud - most of them never needed Google-type scale, but got sucked into the "recurring revenue" grift that is SaaS.

      If you opposed this mentality you were basically saying "we as a company will never scale this much" which was at best "toxic" and at worst "career-ending".

      The thing is these ancient skills still exist. And most orgs simply do not need AWS type of scale. European orgs would do well to revisit these basic ideas. And Hetzner or Lithus would be a much more natural (and honest) fit for these companies.

      • By belorn 2026-02-05 10:00 · 2 replies

        I wonder how much companies pay yearly in order to avoid having an employee pick up a drive from a local store, drive to the data center, pull the disk drive, screw out the failing hard drive and put in the new one, add it in the raid, verify the repair process has started, and then return to the office.
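        For Linux software RAID, the swap itself is only a handful of mdadm commands - a sketch with hypothetical device names (/dev/md0, /dev/sdb), not a prescription for any particular setup:

```
# Mark the failing drive as failed and remove it from the array
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md0 --remove /dev/sdb1

# ...physically swap the drive, then partition it like a healthy member...
sfdisk -d /dev/sda | sfdisk /dev/sdb

# Add the new drive back and verify the rebuild has started
mdadm --manage /dev/md0 --add /dev/sdb1
cat /proc/mdstat
```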

        • By Symbiote 2026-02-05 10:42 · 1 reply

          I don't think I've ever seen a non-hot-swap disk in a normal server. The oldest I dealt with had 16 HDDs per server; only 12 were accessible from the outside, but the 4 internal ones were still hot-swap after taking the cover off.

          Even some really old (2000s-era) junk I found in a cupboard at work was all hot-swap drives.

          But more realistically in this case, you tell the data centre "remote hands" person that a new HDD will arrive next-day from Dell, and it's to go in server XYZ in rack V-U at drive position T. This may well be a free service, assuming normal failure rates.

          • By belorn 2026-02-05 14:02

            Yes, I wrote that a bit hastily. I changed the above to the normal process. As it happened, we just installed a server without hot-swap disks, but to be fair that is the first one I have personally seen in the last 20 years.

            Remote hands is a thing indeed. Servers also tend to be mostly pre-built nowadays by server retailers, even when buying more custom-made ones like Supermicro where you pick each component. There aren't that many parts to a generic server purchase: a chassis, motherboard, CPU, memory, and disks. The PSU tends to be determined by the motherboard/chassis choice, same with disk backplanes/RAID/IPMI/network/cables/ventilation/shrouds. The biggest work is in making the correct purchase, not in the assembly. Once delivered, you put on the rails, install any additional items not pre-built, put it in the rack, and plug in the cables.

        • By amluto 2026-02-05 10:38

          In the Bay Area there are little datacenters that will happily colocate a rack for you and will even provide an engineer who can swap disks. The service is called “remote hands”. It may still be faster to drive over.

      • By theodric 2026-02-05 9:59 · 1 reply

        > ancient skills https://youtu.be/ZtYU87QNjPw?&t=10

        It baffles me that my career trajectory somehow managed to insulate me from ever having to deal with the cloud, while such esoteric skills as swapping a hot swap disk or racking and cabling a new blade chassis are apparently on the order of finding a COBOL developer now. Really?

        I can promise you that large financial institutions still have datacenters. Many, many, many datacenters!

        • By direwolf20 2026-02-05 10:34

          We had two racks in our office of mostly developers. If you have an office, you already have a rack for switches and patch panels. Adding a few servers is obvious.

          Software development isn't a typical SME however. Mike's Fish and Chips will not buy a server and that's fine.

    • By sanderjd 2026-02-05 16:51 · 1 reply

      This space of #2 like Lithus is not something I'm very familiar with, so thank you for the comment that piqued my interest!

      If you're willing to share, I'm curious who else you would describe as being in this space.

      My last decade and a half or so of experience has all been in cloud services, and prior to that it was #3 or #4. What was striking to me when I went to the Lithus website was that I couldn't figure out any details without hitting a "Schedule a Call" button. This makes it difficult for me to map my experiences in using cloud services onto what Lithus offers. Can I use terraform? How does the kubernetes offering work? How does the ML/AI data pipelines work? To me, it would be nice if I could try it out in a very limited way as self-service, or at least read some technical documentation. Without that, I'm left wondering how it works. I'm sure this is a conscious decision to not do this, and for good reasons, but I thought I'd share my impressions!

      • By adamcharnock 2026-02-05 17:06 · 1 reply

        Hello! I think this is a fair question, and improving the communication on the website is something that is steadily climbing up our priority list.

        We're not really that kind of product company; we're more of a services company. What we do is deploy Kubernetes clusters onto bare metal servers. That's the core technical offering. However, everything beyond that is somewhat per-client. Some clients need a lot of compute. Some clients need a custom object storage cluster. Some clients need a lot of high-speed internal networking. Which is why we prefer to have a call to figure out specifically what your needs are. But I can also see how this isn't necessarily satisfying if you're used to just grabbing the API docs and having a look around.

        What we will do is take your company's software stack, migrate it off AWS/Azure/Google, and deploy it onto our new infrastructure. We will then become (or work with) your DevOps team to support you. This can be anything from containerising workloads to diagnosing performance issues to deploying a new multi-region Postgres cluster - whatever you need done on your hardware that we feel we can reasonably support. We are the ones on call should NATS fall over at 4am.

        Your team also has full access to the Kubernetes cluster to deploy to as you wish.

        I think the pricing page is the most concrete thing on our website, and it is entirely accurate. If you were to phone us and say, "I want that exact hardware," we would do it for you. But the real value we also offer is in the DevOps support we provide, actually doing the migration up-front (at our own cost), and being there working with your team every week.

        • By sanderjd 2026-02-05 17:28

          This makes total sense to me. I'm thinking through the flow that would lead me to be a customer of yours.

          In my current job, I think we're honestly a bit past the phase where I would want to take on a migration to a service like yours. We already have a good team of infrastructure folks running our cloud infrastructure, and we have accepted the lock-in of various AWS managed services. So the high-touch devops support doesn't sound that useful to me (we already have people who are good at this), and replacing all the locked-in components seems unlikely to have good ROI. I think we'd be more likely to go straight to #3 if we decided to take that on to save money.

          But I'll probably be a founder or early employee at a new startup again someday, and I'm intrigued by your offering from that perspective. But it seems pretty clear to me that I shouldn't call you up on day 1, because I'm going to be nowhere near $5k a month, and I want to move faster than calling someone up to talk about my needs. I want to self-serve a small amount of usage, and cloud services seem really great for that. But this is how they get you! Once you've started with a particular cloud service, it's always easiest to take on more lock-in.

          At some point between these two situations, though, I can see where your offering would be great. But the decision point isn't all that clear to me. In my experience, by the time you start looking at your AWS bill and thinking "crap, that seems pretty expensive", you have better things to do than an infrastructure migration, and you have taken on some lock-in.

          I do like the idea of high-touch services to solve the breaking-the-lock-in challenge! I'll certainly keep this in mind next time I find myself in this middle ground where the cloud starts feeling more expensive than it's worth, but we don't want to go straight to #3.

    • By whiplash451 2026-02-05 14:50

      > Option 1 is great for startups

      Unfortunately, (successful) startups can quickly get trapped in this option. If they're growing fast, everyone on the board will ask why you'd move to another option in the first place. The cloud becomes a very deep local minimum that's hard to get out of.

    • By eru 2026-02-05 11:41 · 2 replies

      > 4 - Buy and colocate the hardware yourself – Certainly the cheapest option if you have the skills, scale, cap-ex, and if you plan to run the servers for at least 3-5 years.

      Is it still the cheapest after you take into account that skills, scale, cap-ex and long term lock-in also have opportunity costs?

      • By graemep 2026-02-05 12:04

        That is why the second "if" is there.

        You can get locked into the cloud too.

        The lock-in is not really long term, as it is easy to migrate off.

      • By chrisandchris 2026-02-06 12:10

        Personal experience: I did some cloud stuff for an SME, and later on started colocation. I think my learning curve for all the cloud stuff was the same as for all the colocation stuff, except the cloud does not rid you of firewalls, NAT, DHCP and all that. Cloud isn't that much easier, it's just a little bit different. IMHO, the largest disadvantage of colocation is that it (sometimes) requires physical presence at a datacenter.

    • By LeFantome 2026-02-06 18:06

      This is a good summary and my take as well.

      The cloud is great when you just need to start and when you do not know what scale you will need. Minimal initial cost and no wasted time over planning things you do not know enough about.

      The cloud is horrible for steady-state demand. You are over-paying for your base load. If your demand does not scale that much, you do not benefit from the flexibility. Distance from the edge can cause performance problems. In an effort to “save money” you will chase complexity and bake in total reliance on your cloud provider.

      Starting with the cloud makes sense. Just make sure not to engineer a solution you cannot take somewhere else.

      As you scale and demand becomes known, you can start to migrate some stuff on premises or to other managed providers.

      The great thing about “cloud architecture” is that you can use a hybrid model. You can selectively move parts of the stack. You can host your baseline demand and still rely on the cloud for scalability.

      Where you need to spend the money and gain the expertise is in design. Not a giant features waterfall but rather knowing how to build an application and infrastructure that is adaptable and portable as you scale.

      Keep it simple but also keep it modular.

      At least, that has been my experience.

    • By themafia 2026-02-0523:33

      > while largely maximising operational costs

      The core services are cheap. S3 is cheap. Dynamo is cheap. Lambda is exceedingly cheap. Not understanding these services on their own terms and failing to read the documentation can lead one to use them in highly inefficient ways.

      The "cloud" isn't just "another type of server." It's another type of /service/. Every costly stack I've seen fails to accept this truth.

    • By preisschild 2026-02-05 9:44 · 1 reply

      Been using Hetzner Cloud for Kubernetes and generally like it, but it has its limitations. The network is highly unpredictable: you get 2 Gbit/s at best, but a few hundred Mbit/s at worst.

      https://docs.hetzner.com/cloud/technical-details/faq/#what-k...

      • By victorbjorklund 2026-02-05 11:28 · 1 reply

        Is that for the virtual private network? I heard some people say that you actually get higher bandwidth if you're using the public network instead of the private network within Hetzner, which is a little bit crazy.

        • By direwolf20 2026-02-0511:36

          Hetzner dedicated is pretty bad at private networks, so bad you should use a VPN instead. Don't know about the cloud side of things.

    • By Schlagbohrer 2026-02-05 10:55 · 2 replies

      Can someone explain #2 to me? How is a managed private cloud different from full cloud? Like, you are still using AWS or Azure, but you keep your whole operation in a bundled, portable form so you can leave that provider easily at any time, rather than becoming very dependent on them? Is it like staying provider-agnostic but still cloud-based?

      • By adamcharnock 2026-02-05 11:20 · 2 replies

        To put it plainly: We deploy a Kubernetes cluster on Hetzner dedicated servers and become your DevOps team (or a part thereof).

        It works because bare metal is about 10% the cost of cloud, and our value-add is in 1) creating a resilient platform on top of that, 2) supporting it, 3) being on-call, and 4) being or supporting your DevOps team.

        This starts with us providing a Kubernetes cluster which we manage, but we also take responsibility for the services run on it. If you want Postgres, Redis, Clickhouse, NATS, etc, we'll deploy it and be SLA-on-call for any issues.

        If you don't want to deal with Kubernetes then you don't have to. Just have your software engineers hand us the software and we'll handle deployment.

        Everything is deployed on open-source tooling, and you have access to all the configuration for the services we deploy. You have server root access. If you want to leave, you can.

        Our customers have full root access, and our engineers (myself included) are in a Slack channel with your engineers.

        And, FWIW, it doesn't have to be Hetzner. We can colocate or use other providers, but Hetzner offer excellent bang-per-buck.

        Edit: And all this is included in the cluster price, which comes out cheaper than the same hardware on the major cloud providers

        • By Annatar 2026-02-0512:43

          [dead]

        • By mancerayder 2026-02-05 13:27 · 1 reply

          You give customers root but you're on call when something goes tits up?

          You're a brave DevOps team. That would cause a lot of friction in my experience, since people with root or other administrative privileges do naughty things, but others are getting called in on Saturday afternoon.

          • By belthesar 2026-02-05 14:16 · 1 reply

            From a platform risk perspective, each tenant has dedicated resources, so it's their platform to blow up. If a customer with root access blows up their own system, then the resources from the MSP to fix it are billable, and the after-action meetings would likely include a review of whether that access is appropriate, whether additional training is needed to prevent those issues in the future (also billable), or whether the customer-provider relationship is the right fit.

            Will the on-call resource be having a bad time fixing someone else's screw-up? Yeah, and having been that guy before, I empathize. The business can and should manage this relationship, however, so that it doesn't become an undue burden on their support teams. A customer platform that is always getting broken at 4pm on a Friday, by an overzealous customer admin running arbitrary kubectl commands, takes support capacity away from other customers when a major incident happens, regardless of how much you're making in support billing.

            • By adamcharnock 2026-02-0515:36

              This is essentially how it is. Additionally, the reality is that our customers don't often even need to think about using root access, but they have it if they want it. They are putting a lot of trust in us, so we also put trust in them.

      • By victorbjorklund 2026-02-0511:25

        Instead of using the cloud's own Kubernetes service, for example, you just buy the compute and run your own Kubernetes cluster. At a certain scale that is going to be cheaper, if you have the know-how. And since you are no longer tied to whichever services are provided, and just need access to compute and storage, you can also shop around for better prices than Amazon or Azure: you can go to any VPS provider.

    • By CrzyLngPwd 2026-02-0511:04

      #2.5ish

      We rent hardware and some VPSes, and use AWS for cheap things such as S3 fronted by Cloudflare, and SES for priority emails.

      We have other services we pay for, such as AI content detection, disposable email detection, a small postal email server, and more.

      We're only a small business, so having predictable monthly costs is vital.

      Our servers are far from maxed out, and we process ~4 million dynamic page and API requests per day.

    • By megggan 2026-02-05 17:54 · 1 reply

      Getting rid of a bureaucratic internal IT department is a game changer for productivity. That alone is worth 10x the infra costs, especially at big companies where work can grind to a halt dealing with obstructionists through ServiceNow. Good leaders understand this.

      • By bell-cot 2026-02-0520:15

        Sadly true. Or, the so-called internal IT Dept. can be a shambolic mess of PHB's, Brunchlords, Catberts, metric maximizers, and micromanagers, presiding over the hollowed-out and burned out remains of the actual workforce that you'd need to reliably do the job.

    • By Archelaos 2026-02-0512:33

      I am using something in between 2 and 3: a hosted website and database service with excellent customer support. On shared hardware it is €22/month. A managed server on dedicated hardware starts at about €50/month.

    • By doctorpangloss 2026-02-0519:40

      Where do AWS reserved instances come into your hierarchy? What if there existed a “perpetual” reserved instance? Is cap-ex vs. op-ex really the key distinction?

    • By jgalt212 2026-02-0514:11

      We looked at option 4, and colocation is not cheap. It was cheaper for us to lease VMs from Hetzner than to buy boxes and colocate at Equinix.

    • By rcpt 2026-02-0517:45

      5. On-premise and engineers touch the wires every few days.

    • By adarsh2321 2026-02-0516:02

      [dead]

    • By bpavuk 2026-02-05 9:10 · 5 replies

      if someone on the DevOps team knows Nix, option 3 becomes a lot cheaper time-wise! yeah, Nix flakes still need maintenance, especially on the `nixos-unstable` branch, but you get the quickest disaster recovery route possible!

      plus, infra flexibility removes random constraints that e.g. Cloudflare Workers have

      • By slyall 2026-02-0510:02

        There are a bunch of ways to manage bare-metal servers apart from Nix; people have been doing it for years: Kickstart, Foreman, MAAS, etc. [0]. Many to choose from according to your needs and the layers you want them to manage.

        The reality is that these days you just boot a basic image that runs containers.

        [0] Longer list here: https://github.com/alexellis/awesome-baremetal

      • By adamcharnock 2026-02-059:23

        Indeed! We've yet to go down this route, but it's something we're thinking on. A friend and I have been talking about how to bring Nix-like constructs to Kubernetes as well, which has been interesting. (https://github.com/clotodex/kix, very much in the "this is fun to think about" phase)

      • By aequitas 2026-02-059:58

        This is what we do, I gave a talk about our setup earlier this week at CfgMgmtCamp: https://www.youtube.com/watch?v=DBxkVVrN0mA&t=8457s

      • By muvlon 2026-02-05 9:37 · 1 reply

        Option 4 as well; that's how we do it at work, and it's been great. However, it can't really be just "someone on the team knows Nix": anyone working on ops will need Nix skills in order to be effective.

        • By lstodd 2026-02-05 21:05 · 1 reply

          Why this fixation on Nix? You don't need Nix to run bare metal.

          • By bpavuk 2026-02-088:25

            Nix makes sure that everything is exactly as you declared, and that in case of [INSERT APOCALYPTIC EVENT], you'll be able to recover much faster

      • By preisschild 2026-02-059:45

        I'm a NixOS fan, but I've been using Talos Linux on Hetzner nodes (via Cluster API) to form a Kubernetes cluster. Works great too!

  • By speedgoose 2026-02-05 8:11 · 6 replies

    I would suggest using both on-premise hardware and cloud computing, which is probably what comma is doing.

    For critical infrastructure, I would rather pay a competent cloud provider than be responsible for reliability issues myself. Maintaining one server room at the headquarters is something, but two server rooms in different locations, with resilient power and network, is a bit too much effort IMHO.

    For running many Slurm jobs on good servers, cloud computing is very expensive, and owning can pay for itself in a matter of months. And who cares if the server room is a total loss after a while; worst case, you write some more YAML and Terraform and deploy a temporary replacement in the cloud.

    Another thing in between is colocation, where you put hardware you own in a managed data center. It's a bit old-fashioned, but it may make sense in some cases.

    I can also mention that research HPC clusters may be worth considering. In research, we have some of the world's fastest computers at a fraction of the cost of cloud computing. It's great as long as you don't mind not being root and having to use Slurm.

    I don't know about the USA, but in Norway you can run your private company's Slurm AI workloads on research HPCs, though you will pay quite a bit more than universities and research institutions do. You can also have joint research projects with universities or research institutions, and everyone will be happy if your business benefits a lot from the collaboration.

    • By epolanski 2026-02-05 9:21 · 1 reply

      > but two server rooms in different locations, with resilient power and network, is a bit too much effort IMHO

      I worked at a company with two server farms (essentially a main one and a backup) in Italy, located in two different regions, and we had a total of 5 employees taking care of them.

      We didn't hear about them, we didn't know their names, but we had almost 100% uptime and terrific performance.

      There was one single person, out of 40 developers, whose main responsibility was deploys, and that's it.

      It cost my company 800k euros per year to run both server farms (hardware, salaries, energy), and it spared the company around 7-8M in cloud costs.

      Now I work for clients that spend multiple millions on cloud for a fraction of the output and traffic, and who employ, I think, around 15+ DevOps engineers.
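      Those figures pencil out to roughly a 9-10x multiple. A quick sanity check, using only the numbers stated above:

```python
# Figures from this comment: EUR 800k/year to run both server
# farms (hardware, salaries, energy) vs. an estimated EUR 7-8M
# in yearly cloud spend for the same workload.
on_prem = 800_000
cloud_low, cloud_high = 7_000_000, 8_000_000

savings_low = cloud_low - on_prem     # EUR saved per year, low end
savings_high = cloud_high - on_prem   # EUR saved per year, high end
multiple = cloud_low / on_prem        # how much more the cloud costs

print(f"Savings: {savings_low:,}-{savings_high:,} EUR/year "
      f"(cloud is at least {multiple:.1f}x on-prem)")
```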

      • By riku_iki 2026-02-05 19:53 · 1 reply

        it depends on complexity of your infra.

        Running full-scale Kubernetes, with multiple databases and services and an expected 99.99% uptime, likely can't be handled by one person.

        • By lstodd 2026-02-0521:13

          Takes a team of 3-4 in my experience. One person doesn't cut it when the talk of percents of uptime starts no matter what scale. (and no matter cloud, dedicated or on-premises).

    • By olavgg 2026-02-05 8:41 · 7 replies

      > I would rather pay a competent cloud provider than be responsible for reliability issues.

      Why do so many developers and sysadmins think they're not competent to host services? It is a lot easier than you think, and it's also fun to solve the technical issues you may have.

      • By pageandrew 2026-02-05 8:45 · 3 replies

        The point was about redundancy / geo spread / HA. It’s significantly more difficult to operate two physical sites than one. You can only be in one place at a time.

        If you want true reliability, you need redundant physical locations, power, networking. That’s extremely easy to achieve on cloud providers.

        • By PunchyHamster 2026-02-058:54

          You can just rent rack space in a datacenter and have that covered. It's still much cheaper than running it in the cloud.

          It doesn't make sense if you only have a few servers, but if you are renting the equivalent of multiple racks of servers from a cloud provider and running them for most of the day, on-prem is staggeringly cheaper.

          We have a few racks, and we redo the "move to cloud" calculation every few years; without fail it comes out at least 3x the cost.

          And before the "but you need to do more work" whining I hear from people who have never done it: it's not much more work than navigating the forest of cloud APIs and dealing with random black-box issues in the cloud that you can't really debug, only work around.
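          That "staggeringly cheaper" intuition is just a break-even calculation. A sketch with made-up numbers (not the commenter's; the crossover month matters, not the exact figures):

```python
# Hypothetical: buy and colocate a rack vs. renting the same
# always-on capacity in the cloud. All numbers are illustrative.
server_capex = 200_000   # up-front hardware purchase
colo_monthly = 3_000     # rack space, power, bandwidth
cloud_monthly = 25_000   # equivalent always-on cloud capacity

def owned_cost(months: int) -> int:
    """Cumulative cost of owning after a number of months."""
    return server_capex + colo_monthly * months

def cloud_cost(months: int) -> int:
    """Cumulative cost of renting the same capacity in the cloud."""
    return cloud_monthly * months

# First month in which owning is the cheaper cumulative option
break_even = next(m for m in range(1, 121) if owned_cost(m) < cloud_cost(m))
print(f"Owning wins from month {break_even} onward")
```

          With these inputs owning wins in under a year; the pattern only holds when utilization is high and sustained, which is the commenter's premise.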

        • By direwolf20 2026-02-0510:37

          How much does your single site go down?

          On cloud it's out of your control when an AZ goes down. When it's your own server, you can do things to increase reliability. Most colos have redundant power feeds and internet. On-prem that's a bit harder, but you can buy a UPS.

          If your head office is hit by a meteor, your business is over anyway. You don't need to prepare for that.

        • By account42 2026-02-05 8:52 · 1 reply

          You don't need full "cloud" providers for that, colocation is a thing.

          • By nicman23 2026-02-059:41

            or just be good at hiding the round-trip latency

      • By jim180 2026-02-058:50

        I'd also add this question: why do so many developers and sysadmins think that cloud companies always hire competent, non-lazy, non-pissed-off employees?

      • By rvz 2026-02-058:52

        > Why do so many developers and sysadmins think they're not competent to host services?

        Because those services solve the problem for them. It is the same thing with GitHub.

        However, as predicted half a decade ago [0], GitHub has become unreliable, and as price increases begin to happen you can see that self-hosting starts to make more sense: you get complete control of the infrastructure, it has never been easier to self-host, and you keep costs under control.

        > its also fun to solve technical issues you may have.

        The same is going to happen with coding agents: "developers" will see their skills decline the moment they become over-reliant on them, and won't be able to write a single line of code to fix a problem they don't fully understand.

        [0] https://news.ycombinator.com/item?id=22867803

      • By faust201 2026-02-05 9:44 · 2 replies

        > Why do so many developers and sysadmins think they're not competent to host services? It is a lot easier than you think, and it's also fun to solve the technical issues you may have.

        It is a different skillset. SRE is also under-valued and under-paid (unless one is in FAANG).

        • By clickety_clack 2026-02-05 13:46 · 1 reply

          It’s all downside. If nothing goes wrong, then the company feels like they’re wasting money on a salary. If things go wrong they’re all your fault.

        • By sgarland 2026-02-0516:34

          SRE has also lost nearly all meaning at this point, and more or less is equivalent to "I run observability" (but that's a SaaS solution too).

      • By infecto 2026-02-0513:13

        Maybe you find it fun. I don't; I prefer building software, not setting up and running servers.

        It's also nontrivial once you go past some level of complexity and volume. I have made my career out of building software, and part of that requires understanding the limitations and specifics of the underlying hardware, but at the end of the day I simply want to provision and run a container. I don't want to think about the security and networking setup; it's not worth my time.

      • By tomcam 2026-02-05 14:05 · 1 reply

        Because when I’m running a busy site and I can’t figure out what went wrong, I freak out. I don’t know whether the problem will take 2 hours or 2 days to diagnose.

        • By MaKey 2026-02-05 14:21 · 1 reply

          Usually you can figure out what went wrong pretty quickly. Freaking out doesn't help with the "quickly" part though.

          • By tomcam 2026-02-062:32

            I’m not as smart as you

      • By speedgoose 2026-02-059:05

        At a previous job, the company had its critical IT infrastructure in its own data centers. It was not in the IT industry, but the company was large and rich enough to justify two small data centers. They notably had batteries, diesel generators, 24/7 teams, and some advanced security (for valid reasons).

        I agree that solving technical issues is very fun, and hosting services is usually easy, but having resilient infrastructure is costly and I simply don't like to be woken up at night to fix stuff while the company is bleeding money and customers.

    • By bigfatkitten 2026-02-059:10

      > Maintaining one server room at the headquarters is something, but two server rooms in different locations, with resilient power and network, is a bit too much effort IMHO.

      Speaking as someone who does this, it is very straightforward. You can rent space from people like Equinix or Global Switch for very reasonable prices. They then take care of power, cooling, cabling plant etc.

    • By Torq_boi 2026-02-0518:06

      Yes, we still use Azure for user-facing services and the website. They don't need GPUs or expensive resources, so it's not as worth it to bring those in-house.

      We also rely on GitHub. It has historically been a good service, but it's getting worse.

    • By lstodd 2026-02-0521:09

      I don't get why almost everyone insists on comparing cloud to on-premises and not to dedicated. Why would anyone run their own DC infra when there's Hetzner and many others?

    • By Schlagbohrer 2026-02-0511:12

      Unfortunately we experienced an issue where our Slurm pool was contaminated by a misconstrained Postgres Daemon. Normally the contaminated slurm pool would drain into a docker container, but due to Rust it overloaded and the daemon ate its own head. Eventually we returned it to a restful state so all's well that ends well.

      (hardware engineer trying to understand wtaf software people are saying when they speak)

  • By scalemaxx 2026-02-05 16:25 · 5 replies

    Everything comes full circle. Back in my day, we just called it a "data center". Or on-premise. You know, before the cloud even existed. A 1990s VP of IT would look at this post and say, what's new? Better computing, for sure. Better virtualization and administration software, definitely. Cooling and power and racks? More of the same.

    The argument made 2 decades ago was that you shouldn't own the infrastructure (capital expense) and instead just account for the cost as operational expense (opex). The rationale was you exchange ownership for rent. Make your headache someone else's headache.

    The ping-pong between centralized vs. decentralized, owned vs. rented, will just keep going. It's never either-or, but when companies make it all-or-nothing you have to really examine the specifics.

    • By IG_Semmelweiss 2026-02-05 16:47 · 1 reply

      There's a very interesting insight in your message.

      Cloud providers made a lot of sense to finance departments since, aside from the promised savings, you could take that cloud expense now and lower your tax bill.

      After the passing of the One Beautiful Bill ("OBB"), the law allows you to accelerate CapEx to instead expense it[1], similar to the benefit given by cloud service providers.

      This puts way more wind in the sails of the on-prem movement, for sure.

      [1] https://www.iqxbusiness.com/big-beautiful-bill-impact-on-cap...

      • By conductr 2026-02-05 22:35 · 1 reply

        CFO here, and I capex everything I can; I never understood why you'd want to opex this. I'm trying to make EBITDA as enticing as possible for investors and anyone else who cares. I also want to show we have control over technology cost and that it grows as a step function instead of linearly. Capex spending is usually large and planned, so we monitor it more closely and need to see a good reason to approve a large new purchase. Giving AWS a credit card is giving devs a blank check.

        • By IG_Semmelweiss 2026-02-06 1:43 · 1 reply

          It's quite simple.

          If you are a profitable company paying taxes, you 100% want to defer taxes (part of EBITDA) and thus trade earnings for market share.

          This is exactly what TCI did [1] with cable

          [1] https://www.colinkeeley.com/blog/john-malone-operating-manua...

          • By conductr 2026-02-06 2:15 · 2 replies

            I believe you're talking about conserving cash through reduced taxes, since this guy was against paying taxes.

            However, spending a premium on cloud services over what you could with an on-prem capital investment does not help your cash position.

            His tenet of frugality would have conflicted with that, especially since the cloud premium can easily exceed the tax rate - which is to say, paying the taxes would have been cheaper.

            Section in your linked article about frugality https://www.colinkeeley.com/blog/john-malone-operating-manua...

            In any case, spending on this as either opex or capex doesn't help you gain or lose market share. Conserving cash can help, so you'd want to employ the lower-cost option regardless of which line of the financial statements it hits - and if you follow that thought through, it's not going to be cloud.

            If costs were equal, opex would give a tax advantage; but most companies are valued on EBITDA, so optimizing tax spend still may not be their priority, and there are a lot of other ways to avoid taxes. In the environment I've operated in, I choose to capex because it conserves cash (it's cheaper) and improves EBITDA optics (it's excluded).
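            The "paying taxes would have been cheaper" point is easy to put in numbers. An illustrative sketch (all figures invented; it assumes the hardware capex can be fully expensed in year one, per the accelerated-depreciation discussion above):

```python
# Invented figures: same company, two infrastructure choices.
# Assumes hardware capex is fully deductible in year one.
revenue = 10_000_000
other_costs = 6_000_000
tax_rate = 0.21

def after_tax_cash(infra_spend: float) -> float:
    """Cash left over after infrastructure spend and taxes."""
    profit = revenue - other_costs - infra_spend  # infra fully deductible
    tax = max(profit, 0.0) * tax_rate
    return profit - tax

cloud_case = after_tax_cash(2_500_000)  # pricier opex, smaller tax bill
owned_case = after_tax_cash(1_000_000)  # cheaper capex, bigger tax bill
print(f"cloud: {cloud_case:,.0f}  owned: {owned_case:,.0f}")
```

            Owning pays the larger tax bill yet leaves roughly twice the cash, because here the cloud premium exceeds the tax saved by deducting it.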

            • By tedd4u 2026-02-06 4:05 · 1 reply

              Probably depends on where your gross margins would be with cloud and if you're higher or lower growth. If cloud will let you grow faster (HA/DR on-prem is hard) and you'll still have 75-80%+ gross margins, why slow top-line growth to do on-prem?

              • By conductr 2026-02-067:59

                It's not a real concern for the vast majority of businesses. It's a common excuse, but practically no business is outgrowing a cheaper-than-cloud solution. Maybe on-prem isn't the right first step, but that doesn't force you to the cloud; there are dedicated servers and everything in between.

                On-prem is maybe not the best first step, but colo or dedicated servers give you a cleaner path to going on-prem if you ever decide to. The cost of growth is too high in the cloud.

                Learning how to run servers is actually less complicated than all the cloud-architecture stuff, and it doesn't have to be slower. There's no one-size-fits-all, but I believe old, boring solutions should be employed first and can run most applications. Technology has a way of getting more complex every year just to accomplish the same tasks. But that's largely optional.

            • By IG_Semmelweiss 2026-02-06 16:57 · 1 reply

              I didn't say conserve cash.

              I say lower your tax bill.

              "not the same ting" - nnt

              • By conductr 2026-02-0620:33

                What you said doesn’t make sense. Make it make sense. I was left guessing.

                In other words, please explain how it makes sense to lower the tax bill by shifting expenses to opex, when that process involves paying more for the same utility.

                The only reason to lower the tax bill is to conserve cash. The article you linked to explains it that way too.

    • By re-thc 2026-02-05 16:44 · 3 replies

      > you shouldn't own the infrastructure (capital expense) and instead just account for the cost as operational expense (opex)

      That was part of the reason.

      The real reason was that the internal infrastructure team in many orgs got nowhere. There was a huge queue, and many teams instead had to find infinite workarounds, including standing up their own infrastructure. The "cloud" provided a standardized way to at least deal with this mess, e.g. a single source of billing.

      > A 1990s VP of IT would look at this post and say, what's new?

      Speed. The US lives in luxury, but outside of it, getting proper servers often takes a LONG time. You don't just order online. There are many places where you have to talk to a vendor with no list price, and the drama continues from there. Being out of capacity can mean weeks to months before you get anywhere.

      • By sanderjd 2026-02-0517:06

        Yep! The biggest win for me when AWS came out was that I could self-serve what I needed and put it on a credit card, rather than filing a ticket and waiting some number of days / weeks / months to get a new VM approved and deployed.

      • By scalemaxx 2026-02-0517:36

        I agree - my reference to the 1990s VP of IT was about the post, which is about on-premise data centers, not the cloud. I don't think there's a speed advantage for on-premise data centers now vs. the 1990s, but if there is, let me know. Otherwise, indeed, it's a 1990s-era blast from the past.

      • By ragall 2026-02-0615:58

        > There are many places where you have to talk to a vendor with no list price

        Which many places?

    • By conductr 2026-02-05 22:31 · 1 reply

      Curious question: if cloud opex is so exceedingly high that it pushes people back to capex to save money, why has no cloud entrant come along with a price-competitive alternative?

      It seems the main issue is that everyone is anchored to AWS, so they have no incentive to reduce their prices. Probably the same for Azure. I think Google is just risky because they kill products so easily.

      • By nine_k 2026-02-06 1:29 · 1 reply

        It's just, like, not very easy? Say, DigitalOcean is one such entrant. Hetzner Cloud is too, but it offers far fewer services than AWS. If all you want is spinning up instances, attaching storage, and maybe running a managed database and a managed k8s, it may be adequate. If you want DNS, queue services, email services, OCR, etc., AWS has the widest assortment, and uniform access controls.

        • By conductr 2026-02-0620:39

          Once you build on AWS it's hard to decouple; I get that. But companies that focus on reducing complexity still seem to find a way to do all these things for cheaper. I can't find the link, but yesterday there was a front-page article about how you can just use Postgres instead of 7 other specialty databases. The reduced complexity of using one tool is a net gain. People just aren't trying. It's the whole "nobody ever got fired for buying IBM" thing again.

    • By the_af 2026-02-0516:32

      Agreed. Also, a realistic assessment should not downplay the very real overhead and headache of managing your own on-premise data center. It comes at a cost in engineering and firefighting hours; it's not painless. There's a reason this eternal ping-pong keeps going!

    • By adolph 2026-02-0516:35

      Yeah, I think the major improvement of cloud services was the rationalization of them into services with a cost, instead of "ask that person for a whatsit" and "hopefully the associate goomba will approve."

      > All teams will henceforth expose their data and functionality through service interfaces

      https://gist.github.com/chitchcock/1281611

HackerNews