Several years ago I actually cared about the differences between AWS CloudFormation and Terraform. Namely, that Terraform did not provide the wait conditions and helper scripts that CloudFormation did, a gap that mattered in 2017.
In 2017, I was a bit apprehensive about picking up a new tool that didn't seem as "AWS-native" as CloudFormation, and I was all-in on AWS.
Things are different now. AWS services have evolved quite a bit since then, and most companies I lend my services to don't even run EC2 instances anymore. Many of my clients run container-based or serverless (Lambda) workloads.
In 2021, you should not use AWS CloudFormation.
Why?
First and foremost, indirection.
Let’s start with the way CloudFormation works, compared to how Terraform works. With CloudFormation, you make requests through one API — the CloudFormation API. This API takes a JSON or YAML-formatted document, validates it, and then passes calls to the other APIs on your behalf. If you are creating an RDS cluster, CloudFormation will call the RDS REST endpoint for you.
With Terraform, your local executable makes REST calls to each service's API for you, meaning no intermediary sits between you and the service you're controlling. Want an RDS instance? Terraform will make calls directly to the RDS API.
This is *most* important while troubleshooting. With CloudFormation, you'll get back whatever CloudFormation reports from the target service, and most of the time its error messages are not helpful. With Terraform, the error messages you get are usually much more informative and mirror the error messages you would get from the service itself. These error messages are easier to search for online, sometimes appearing in AWS's own documentation.
CloudFormation being a layer of indirection also makes it difficult to work with in multi-region/multi-account scenarios. With CloudFormation you have to create Stack Sets and IAM policies that allow the CloudFormation service to impersonate other roles. Worse, the same prerequisite steps required for multi-account use must be taken just to have CloudFormation spin up resources in multiple regions within a single account.
With Terraform, you need to use valid credentials that grant access to multiple accounts. For instance, if you were spinning up CloudTrails in multiple accounts within an organization, then sure: you have to either impersonate a role or use direct credentials with the required access in each account. But operating in multiple regions within an account is as simple as aliasing a provider:
provider "aws" {
  alias  = "us-east-1"
  region = "us-east-1"
}

provider "aws" {
  alias  = "us-east-2"
  region = "us-east-2"
}
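Resources then pin themselves to a region by referencing an alias. A minimal sketch (the resource and bucket names are hypothetical):

resource "aws_s3_bucket" "logs_east_1" {
  provider = aws.us-east-1
  bucket   = "example-logs-us-east-1" # hypothetical name
}

resource "aws_s3_bucket" "logs_east_2" {
  provider = aws.us-east-2
  bucket   = "example-logs-us-east-2" # hypothetical name
}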
Obviously this is a much lower barrier to orchestrating your architecture in more than one region at once.
Logic
CloudFormation's pseudo parameters and intrinsic functions are cruel jokes. By contrast, Terraform offers a rich set of data sources, and transforming data with Terraform is a breeze.
Even accounting for the shortcomings of older Terraform versions, the HashiCorp DSL that Terraform employs is vastly more capable than AWS's JSON/YAML implementation.
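To illustrate both points, a minimal sketch (the AMI name filter and the tag values here are illustrative assumptions):

# Data source: look up the newest Amazon Linux 2 AMI instead of hard-coding an ID.
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

# Transformation: reshape a map of tags with a single for expression.
locals {
  raw_tags = { team = "platform", env = "prod" }
  tags     = { for k, v in local.raw_tags : k => upper(v) }
}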
Speed
AWS CloudFormation is S-L-O-W. Maybe it's because it mostly executes actions sequentially (or at least in a severely rate-limited fashion; I'm not sure I care about the difference). It is not uncommon to wait hours at a time for AWS CloudFormation to exit whatever transient state it's in and free up your stack. Until then, development on that stack is blocked.
Sync vs Async
AWS CloudFormation transactions are asynchronous, while Terraform’s interactions are almost always done synchronously. Wrapping CI/CD tooling around AWS CloudFormation is, as a result, a difficult matter of polling stack updates until the right stack state bubbles up from an aws cloudformation CLI query.
Terraform, on the other hand, will occupy your shell until the directly-involved AWS service coughs up an error. No additional tooling is required. Terraform will just relay the error message from the affected service indicating what you’ve done wrong.
AWS’s concept of stack events is just awful, too.
Portability
If you learn AWS CloudFormation, then guess what: you can't take your skills with you. If you put forth the effort to learn Terraform, then you can take your skills to any other cloud (or any other provider; there are hundreds of providers that are useful outside of the major cloud ecosystems).
Useful providers you can run on your own laptop include TLS key generators and random string generators. These two local providers alone make SSH key generation and password generation a trivial matter. AWS CloudFormation can't do either of these things.
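A rough sketch of what those two providers look like in practice (resource and key names are arbitrary):

# Generate an SSH keypair locally; no cloud API is involved.
resource "tls_private_key" "ssh" {
  algorithm = "RSA"
  rsa_bits  = 4096
}

# Generate a random password locally.
resource "random_password" "db" {
  length  = 24
  special = true
}

# The results feed straight into real infrastructure, e.g. an EC2 key pair.
resource "aws_key_pair" "deployer" {
  key_name   = "deployer" # hypothetical name
  public_key = tls_private_key.ssh.public_key_openssh
}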
Importing Existing Resources
I can’t even begin to tell you how to import resources with AWS CloudFormation, and neither can AWS’s engineers. Chime in if you can.
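For contrast, a sketch of the Terraform flow: describe the existing resource in code, then adopt it into state with terraform import (the bucket name is hypothetical):

# 1. Describe the resource that already exists in AWS.
resource "aws_s3_bucket" "legacy" {
  bucket = "legacy-assets-bucket" # hypothetical, created outside Terraform
}

# 2. Adopt it into Terraform state from the command line:
#    terraform import aws_s3_bucket.legacy legacy-assets-bucket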
Service Quotas
The list of quotas for the AWS CloudFormation service is just hilarious. Meanwhile (for good or ill) I’ve had no problem spinning up over 1,000 resources with Google Cloud Platform from a single Terraform state.
The problem with both is that you quickly rack up weeks or months of coding time, which costs a pretty penny at market freelance rates. Spending a few hundred K on devops is routine for a lot of companies.
My main issue with this is that a lot of that coding is just reinventing the same wheels over and over again, jumping through hoops along the way like a trained poodle.
It's stupid. Why is this so hard and tedious? I've seen enough infrastructure-as-code projects over the last fifteen years to know that actually very little has changed in terms of the type of thing people build with these tools. There's always a vpc with a bunch of subnets for the different AZs. Inside those go VMs that run whatever (typically dockerized these days) with a load balancer in front of it and some cruft on the side for managed services like databases, queues, etc. The LB needs a certificate. I've seen some minor variations of this setup, but it basically boils down to that plus maybe some misc cruft in the form of lambdas, buckets, etc.
So why is it so hard to get all of that orchestrated? Why do we have to boil the oceans to get some sane defaults for all of this? Why do we need to micromanage absolutely everything here? Monitoring, health checks, logging systems, backups, security and permissions: all of this needs to be micromanaged. All of it is a disaster waiting to happen if you get it wrong. All of it is bespoke cruft that every project is reinventing from scratch.
All I want is "hey AWS, Google, MS, ... go run this thing over there and tell me when it's done. This is the domain it needs to run over. It's a bog-standard web server expecting to serve traffic via http/websockets. Give me something with sane defaults, not a disassembled jigsaw puzzle with thousands of pieces." This stuff should not be this hard in 2021.
PaaS has existed since the mid-2000s. It turns out people don't want it -- none of them ever got more than a minuscule fraction of the market for the boring workloads that 90% of companies are using IaaS for. People want knobs and levers. Just look at the popularity of Kubernetes; it is nothing but knobs and levers.
Kubernetes, once deployed, has surprisingly few knobs to tweak for the end user. You might have to pick between a StatefulSet and a Deployment depending on your workload, but that's about it.
Kubernetes cleanly separates responsibility between maintainers of the platform (who have to make decisions on how to deploy it; in cloud environments it's the cloud provider's job to do this) and users of the platform (who use a fairly high-level API that's universal across clusters and cluster flavours). It's usually the former that people complain about being difficult and complex: picking the networking stack, the storage stack, implementing cluster updates... that matters if you, for some reason, want to run a k8s cluster from scratch. But given something like a GKE cluster and a locally running kubectl pointing to it, it takes much less effort to deploy a production workload there than on a newly created AWS account. And there are far fewer individual resources and less state involved.
Aren't Kubernetes' capabilities something that cloud providers should have made available from the beginning? Meaning its only possible future is no future at all: those capabilities either should have been there from the start, or will be offered natively in the (very) near future?
Different cloud providers did different things. Google's cloud offerings started with things like GAE, a very high-level service that many people ignored because it was too alien. AWS, on the other hand, provided fairly low-level abstractions like VMs (with some aaS services sprinkled in, but distinctly still 'a thing you access over HTTP over the network'). Both offerings reflect the companies' internal engineering culture, and AWS' was much less alien and more understandable to the outside. Now every other provider basically clones AWS' paradigm, as that's where big enterprise contract money is, not in doing things well but different.
With Kubernetes we actually have something close to a high-level API for provisioning cloud workloads (it's still no GAE, as the networking and authentication problems are still there, but those can be solved in the long term), and the hope is that cloud providers will implement Kubernetes APIs as a first-class service that allows people to truly not care about underlying compute resources. Automatically managed workloads from container images are effectively the middle ground between 'I want a linux VM' pragmatists and 'this shouldn't be this much work' idealists.
With GKE you can spin up a production HA cluster in a click [1], but you still have to think about how many machines you want (there's also Autopilot, but it's expensive and I have my problems with it). AWS's EKS is a shitshow though; it basically requires interacting with the same networking/IAM/instance boilerplate as any other AWS setup [2].
[1] - https://cloud.google.com/kubernetes-engine/docs/how-to/creat...
[2] - https://docs.aws.amazon.com/eks/latest/userguide/getting-sta...
It might also be the wrong incentives being passed around. I mean, if you're hired and paid to push knobs and levers, you'll choose a tool with knobs and levers. Even with more of them.
GAE did this way back when. They give you a high level Python/Java API that works both locally and on prod, and you just push a newer version of your codebase with a command line tool - no need for containers, build steps, dealing with machines and operating systems, configurable databases, certificates, setting up traces or monitoring... No need to set anything up for that particular GAE, just create a new project, point a domain to it if you're feeling fancy, and off you go.
But in the end, the industry apparently prefers low-level AWS bullshit, where you have to manually manage networks, virtual machines, NAT gateways, firewall rules, load balancers, individually poke eight different subsystems just to get the basic thing running … It’s just like dealing with physical hardware, just 10x more expensive and exploiting the FOMO of ‘but what happens if we need to scale’.
I've been working with AWS CDK for a little while now, and it kind of has some of what you want.
In my case, I wanted a scheduled container to run once per day. Setting it up manually with CF or Terraform would have been a lot of work defining various resources, but CDK comes with a higher-level construct[1] that can be parameterized and which will assemble a bunch of lower level resources to do what I want. It was a pretty small amount of Python code to make it happen.
[1] https://docs.aws.amazon.com/cdk/api/latest/python/aws_cdk.aw...
What you're asking for, as someone said, is pretty much Heroku.
Heroku is so expensive up front that it usually seems cheaper to DIY on AWS directly though.
AWS Elastic Beanstalk is a reasonable choice for most cases.
I think this is the motivation for https://darklang.com/, trying to solve a lot of the complexity of infra. That said it’s a huge undertaking.
The AWS CDK is getting closer to this. Your standard VPC setup is extremely simple now. Like
new Vpc()
simple. The tooling definitely has its quirks, but is steadily improving, and you can drop into CloudFormation from CDK (with TypeScript type checking) when needed.

Although your headline is correct imho, I think there are lots of things that the orchestrator might need to do above this, especially if you want the tool to be cloud agnostic so that everyone can use it, which just makes things a little more complicated.
You might want to add something to the existing system, like another web server. These tools have to "add item and wire it up to load balancer".
You might want to scale up. This might work natively or it might require creating new larger instances and then getting rid of the old ones.
You might want to update the images to newer versions.
You might need more public IPs.
You might be adding something to an existing larger network so you need to reference existing objects.
You might need to "create if not exists"
etc. I think your argument covers the initial use case in most places, but any system used over time will need the other stuff done to it, hence the "complexity" of the tools. tbf, I don't think Terraform is that complex in itself; I think because it is in config files, it can be more complex to understand and work with.
Still, the argument stands. Also, the points you listed should be expected as standard - everybody will need scaling or image updates sooner or later, right?
I agree. And you just described PaaS like Render.com and Heroku, and DIY PaaS like Convox, and I'm sure others. I don't understand why companies, especially young ones, mess with all the low-level infra stuff. It's such a time suck, and you end up with a fragile system.
So... heroku?
> So why is it so hard to get all of that orchestrated? Why do we have to boil the oceans to get some sane defaults for all of this? Why do we need to micromanage absolutely everything here?
This post truly resonates with me. However, I don't think we appreciate just how many things are necessary to run a web application and do it well. There is an incredible amount of complexity that we attempt to abstract away.
Sometimes I wish there were a tool that could tell me just how many active lines of code are responsible for the processes currently running on any of the servers, and in which languages. Off the top of my head, here's what's necessary to ship an enterprise web app in 2021:
RUNTIMES - No one* writes web applications in assembler code or a low level language like C with no dependencies - there is usually a complex runtime like JVM (for Java), CLR (for .NET), or whatever Python or Ruby implementations are used, which are already absolutely huge.
LIBRARIES - Then there are libraries for doing common tasks in each language, be it serving web requests, serving files, processing JSON data, doing server-side rendering, doing RPC or some sort of message queueing etc., in part due to there not being just one web development language, but many. Whether this is a good thing or a bad thing, I'm not sure. Oh, and the front end can also be really complex, since there are numerous libraries/frameworks out there for getting stuff rendering in a browser in an interactive way (Angular, Vue, React, jQuery), each with their own toolchains.
PACKAGING - But then there are also all the ways to package software, be it Docker containers, other OCI compatible containers (ones that have nothing to do with the Docker toolchain, like buildah + podman), approaches like using Vagrant, or shipping full size VMs, or just copying over some files on a server and either using Ansible, Chef, Puppet, Salt or manually configuring the environment. Automating this can also be done in any number of ways, be it GitLab CI, GitHub Actions, Jenkins, Drone or something else.
RUNNING - When you get to actually running your apps, what you have to manage is an entire operating system, from the network stack, to resource management, to everything else. And, of course, there are multiple OS distributions that have different tools and approaches to a variety of tasks (for example, OpenRC in Alpine vs systemd in Debian/Ubuntu).
INGRESS - But these OSes also don't live in a vacuum so you end up needing a point of ingress, possible load balancing or rate limiting, so eventually you introduce something like Apache, Nginx, Caddy, Traefik and optionally something like certbot for the former two. Those are absolutely huge dependencies as well, just have a look at how many modules the typical Apache installation has, all to make sure that your site can be viewed securely, do any rate limiting, path rewriting etc.!
DATA - And of course you'll also need to store your data somewhere. You might manage your databases with the aforementioned approaches to automate configuration and even running them, but at the end of the day you are still running something that has decades of research and updates behind them, regardless of whether it's SQLite, MariaDB, MySQL, PostgreSQL, SQL Server, S3, MongoDB, Redis or anything else. All of which have their own ways of interacting with them and different use cases, for example, you might use MariaDB for data storage, S3 for files and Redis for cache.
SUPPORT - And that's still not it! You also probably want some analytics, be it Google Analytics, Matomo, or something else. And monitoring, something like Nagios, Zabbix, or a setup with Prometheus and Grafana. Oh and you better run something for log aggregation, like ELK or Graylog. And don't forget about APM as well, to see what's going on in your app in depth, like Apache Skywalking or anything else.
OTHERS - There can be additional solutions in there as well, such as a service mesh to aid with discoverability of services, circuit breakers to route traffic appropriately, security solutions like Vault to make sure that your credentials aren't leaked, sometimes an auto scaling solution as well etc.
In summary, it's not just that there are a lot of tools for doing any single thing, but rather that there are far too many concerns to be addressed in the first place. To that end, it's really amazing that you can even run things on a Raspberry Pi at all, and that many of the tools can scale from a small VPS to huge servers that would handle millions of requests.
That said, it doesn't have to always be this complex. If you want a maximally simple setup, just use something like PHP with an RDBMS like MariaDB/MySQL and server-side rendering. Serve it out of a cheap VPS (I have been using Time4VPS, affiliate link in case you want to check them out: https://www.time4vps.com/?affid=5294, though DigitalOcean, Vultr, Hetzner, Linode and others are perfectly fine too), and maybe use some super minimal CI like GitLab CI, Drone, or whatever your platform of choice supports.
That should be enough for most side projects and personal pages. I also opted for a Docker container with Docker Swarm + Portainer, since that's the simplest setup I can use for a large variety of software and my own projects in different technologies, though that's a personal preference. Of course, not every project needs to scale to serving millions of users, so it's not like I need something advanced like Kubernetes (well, Rancher + K3s can also be good, though many people also enjoy Nomad).
Edit: there are PaaS out there that make things noticeably easier by doing some of the above for you, but that can lead to vendor lock-in, so be careful with those. Regardless, solutions like Heroku or Fly.io may be worth checking out as well, though I'd suggest you read this article: https://www.gnu.org/philosophy/who-does-that-server-really-s...
* with very few exceptions
Google Cloud Run basically does this
I prefer using Digital Ocean App Platform for the simplicity it provides.
Starts out with a noble goal to save the world but ends up being co-opted by grifters, managers and consulting companies just like “Agile”
Some people just love coding.
Spring Boot for devops?
> There's always a vpc with a bunch of subnets for the different AZs.
It's funny because you're already out of touch with how a lot of people would avoid having to do this in 2021. If your stack were simpler, you might not have such infrastructure-as-code dependencies.
When you say “if your stack was simpler” do you mean “if your problem was trivial”? I’m always interested in simpler ways to do things, but solving distributed system and data governance issues tends to involve putting things in different places.
If you're not going to mention what it is then you come across as snarky.
Deploy everything using a cloud-native architecture and have all services internet-facing; read about "zero trust networks" to understand more about securing such things.
Maybe there are "data governance issues" stopping the internet-facing thing from happening. But if not, that's a more modern approach than three-tier network segmentation.
Counterpoint... Use CloudFormation!
Managed services offer big benefits over software. With CF, new stacks, change sets, updates, rollbacks and drift detection are an API call away.
Managed service providers offer big benefits over software. With CF and AWS support, help with problems is a support ticket away.
Using a single cloud provider has a big benefit over multi-cloud tooling. I only run workloads on AWS, so the CF syntax, specs and docs unlock endless first party features. A portable Terraform + Kubernetes contraption is a lowest common denominator approach.
Of course everything depends.
I've configured literally 1000s of systems with CloudFormation with very few problems.
I have seen Terraform turn into a tire-fire of migrations from state files to Terraform enterprise to Atlantis that took an entire DevOps team to care for.
The main use case of Terraform is not portability. Have fun porting your SQS queue, DynamoDB table or VPC config to GCP or Azure equivalents; it won't look similar at all except for the resource name. However, if you are only running containers and virtual machines, sure, you can benefit from portability.
CloudFormation lags features so badly that you end up with hacks or Lambda functions backing custom resources. DynamoDB Global tables took 4/5 years to be available in CloudFormation.
I've also seen wrongly constructed CloudFormation delete critical databases, hang (often), time out, and fail to roll back, so it's not always rainbows and sunshine there either. I also don't like Terraform for its excessive use of state files, handling state files with DynamoDB locks, and having them on S3.
I won't deny its good features; being managed is a huge plus. But it's so slow and lagging behind, its YAML is so verbose, and it has stack size limits; it's always a workaround with CloudFormation. My company uses it by abstraction for an internal PaaS and deployment automation, and it takes a lot for trivial changes to complete.
So in short, neither is perfect, but for me Terraform is easier to use, easier to debug, and faster, and its features don't lag nearly as much as CF's. Those are good enough reasons for me to avoid CloudFormation. I also don't like CDK because it's too verbose and it's still CF, and I would rather generate Terraform JSON/HCL myself if I need more logic.
Terraform also helps when you need to configure multiple stacks, e.g. for a service you can have a module that reflects your organizational best practices: a Fargate service for running the container, automatic Datadog dashboards, CloudWatch alarms connected to Opsgenie/PagerDuty, etc.
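A hypothetical sketch of such a module call (the module path, inputs, and variable names are all assumptions, not a published module):

module "payments_service" {
  source = "./modules/org-fargate-service" # hypothetical internal module

  name            = "payments"
  container_image = "example/payments:1.4"
  alarm_sns_arn   = var.opsgenie_sns_arn # wires CloudWatch alarms to Opsgenie
}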
> hang (often), time out, and fail to roll back
The timeouts in CF are ridiculous. Especially with app deployments. I can't remember which service it is, but something can wait up to 30min on a failed `apply` and then wait the same on a failed revert. Only then you can actually deploy the next version (as long as it wasn't part of the first deploy, then you get to wait until everything gets deleted as well).
(yes, in many cases you can override the timeouts, but let's be honest - who remembers them all on the first run or ever?)
I've been using CF for a few years now with minimal complaints but I just hit a create changeset endless timeout (2 days to finally time out).
The worst part is that there are no error messages. When it fails and I click "Details", it takes me to the stack page and shows 100% green. Support ticket seems slow to get a response too.
That aside, my overall experience has been positive!
I regularly see 20m timeouts using Terraform on AWS. Well in the TF camp, but just saying.
These are AWS service timeouts and they definitely exist too. But with CF you get CF retries/timeouts on top of those.
Terraform has its fair share of lag as well. One particular case that irks me is that the "meta" field on Vault tokens is unsupported. Vault of course being another first-class Hashicorp product makes this particularly odd.
That being said the Vault provider is open source and it's quite easy to add it and roll your own.
If I'm remembering correctly, I'm pretty sure the Vault provider for Terraform was originally contributed by an outside company rather than inside HashiCorp. My guess would be that it has encountered what seems to be a common fate for software that doesn't cleanly fit into the org chart: HashiCorp can't decide whether the Vault team or the Terraform team ought to own it, and therefore effectively nobody owns it, and so it falls behind.
Pretty sure that's the same problem with CloudFormation, albeit at a different scale: these service teams are providing APIs for this stuff already, so do they really want to spend time writing another layer of API on top? Ideally yes, but when you have deadlines you've gotta prioritize.
> DynamoDB Global tables took 4/5 years to be available in CloudFormation.
DynamoDB global tables launched in November 2017; it isn't four years old yet.
CFN lag is an issue for sure but not quite that much of an issue...
So it's 3.5 years old, and CF support only became available this May. And it is a very widely used feature among many organizations.
If you want to use a new resource or feature, it's 90% likely to be an issue. There was even a public project to track the gaps: https://github.com/cfntools/cloudformation-gaps/projects/1
You want to have an IAM role, but you cannot tag it with CloudFormation. These minor frustrations quickly add up. And see what you need to do to add a custom resource: https://shouldroforion.medium.com/aws-cloudformation-doesnt-...
I really don't get why features come so late to CloudFormation - I guess AWS don't use much CloudFormation internally then, but surely they're not stringing together AWS cli calls? CDK is reasonably new too, unless they waited a long time to go public with it.
Many/most teams internally using AWS use CloudFormation; an AWS service I was a part of was almost entirely built on top of other AWS services, and the standard development mechanism is to use CFN to maintain your infra. You only do drastic things like "stringing CLI calls" if there's something missing from CFN and not coming out soon enough, in which case maybe someone writes a custom CFN resource and you run it in your service account.
Depending on how old the service is, the ownership of the CFN resource may be on the CFN service team (from back when they white-gloved everything) in which case there are massive development bottlenecks (there are campaigns to migrate these to service team ownership) or more often the resource is maintained by the service team itself, in which case the team may not be properly prioritizing CFN support. There can be a whole lot of crap to deal with for releasing a new CFN _resource_, though features within a resource are relatively easy.
On my last team, we did not consider an API feature delivered until it was available in CFN, and we typically turned it around within a couple of weeks of general API availability.
Ah right - yeah, it seems like some services are up to date but others really do lag, so different ownership explains it.
CDK is a higher-level (and awesome in my experience) way to just generate CloudFormation specs. In other words, you need both CloudFormation and CDK support for features to become available there.
In terms of getting new features fast CDK is strictly worse than CloudFormation.
Agree. CF is not a magic bullet, but neither is ansible or terraform.
We used ansible heavily with AWS for 2 years. Then we decided to gut it out and do CF directly. Why? If we want to switch clouds, it's not like the ansible or terraform modules are transferable ... So we might as well go the natively supported route.
I agree with the article that messages can be cryptic, but at the end of the day, I have a CF stack that represents an entity. I can blow away the stack, and if there's any failure or issue, I can escalate my permissions and kill it again. Still a problem? Then it's AWS's fault and a ticket away (though I've only had to do this once in 5 years and > 150,000 CF stacks).
I also would argue, if a stack deletion stalls development, you are probably using hard-coded stack names, which isn't wise. Throw in a "random" value like a commit or pipeline identifier.
I've had far fewer issues with CF than terraform or ansible. I have yet to see CF break backward compatibility, while I had a nightmare day when I couldn't run any playbooks in ansible because a module had a new required parameter on a minor or patch version bump (which was when I called it quits on ansible; I then relooked at terraform, and decided to go native).
I will caveat that our use case for AWS involves LOTS of creation and deletion, so I find it super helpful to manage my infrastructure in "stacks" that are created and deleted as a unit. I don't need to worry about partial creations or deletions... like, ever... It basically never fails redoing known-working stuff... only the "first time", and usually because we follow least-privilege heavily.
I'm confused. Aren't Ansible and CloudFormation apples and oranges, with completely different use cases and purposes?
One is a configuration management and deployment tool.
The other is a cloud resource provisioning service.
They’re meant to work in tandem, not one to replace another.
I think Ansible has extensions which allow for managing infra such as AWS. See https://docs.ansible.com/ansible/latest/collections/amazon/a... for example.
Yes Ansible does have extensions and can be used to provision AWS services.
The approach between Cloudformation/Terraform/Pulumi and Ansible are entirely different though.
The former are declarative: they define how the end state should look. Ansible is a task runner: you define a set of tasks it needs to execute to get to the end state.
I strongly advise against using Ansible for provisioning resources. It's idempotent by convention only. When I reluctantly had to use it for jobs, it was extremely difficult to get a repeatable, deterministic environment set up. Each execution led to a different state, and it was just a nightmare to deal with.
Cloudformation/Terraform/Pulumi are much better in that regard, as they generate a graph of what the end state should be, check the current state, and generate an execution plan for making the current state match the target state.
Where Ansible is better than Cloudformation/Terraform/Pulumi is when you have a bunch of servers already set up and you want to change the software or configuration on them. That's a bit of an anti-pattern these days, changing config/provisioning at runtime. You can change that slightly and use Ansible with Packer to generate pre-baked images, which works OK if you don't mind lots of YAML. This isn't too bad and plays to Ansible's strengths, although these days most people don't pre-bake images given containerization. Also, if you are only using Ansible for provisioning config on a host, Nix achieves this much more elegantly and reliably.
Generally speaking, Ansible is not used for clouds, especially if you're leveraging things like spot instances, because Ansible needs a trigger to start configuring the server.
The most common practice now is to use cloud-init (https://cloudinit.readthedocs.io/en/latest/).
For example, this is how you set it up on AWS: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-dat...
I think cloud-init is packaged by default on all providers' OS images.
Basically, yes.... But essentially:
We already used ansible for other things, so it wasn't too hard to swap over to the AWS modules... (except they were inconsistent and poorly supported, we ultimately found out).
Someone at Hashicorp then convinced mgmt that terraform is almost a write-once system and that we could jump from AWS to Azure or GCP easily ("just change the module!")... When actual engineers looked at it, after 3 days there was almost a mutiny, and we rejected terraform mostly based on the fact that someone lied to our managers to try and get us to adopt it. I know someone who is very happy with terraform nowadays, but that ship sailed for us.
Those were basically the only people in this space, so we started rewriting ansible to CloudFormation. Since we mostly use lambdas to trigger the creation of CF stacks, this really works well for us: our lambdas exist for less than a second to execute, and then we can check in later to see if there are issues (which happens less than 1 in 50,000? 100,000? of the time in my experience... except for core AWS outages, which are irrespective of CF). Compare that to our ansible (and limited terraform) setups, which required us to run servers or ECS tasks to manage the deploy. We can currently auto-scale at lambda scale-up speed to create up to 30 stacks a second if demand surges (a stack might take 2-3 minutes to be ready, but it's now async). Under ansible/terraform we had to make more servers and worker nodes to watch the processes, and our deployment rate was 0.3-0.4 stacks per minute per worker (and scaling up required us to make more workers before we could scale up for incoming requests).
If I were building today, I'd probably revisit terraform, but I think the CDK or CF are still what I'd recommend unless there's a need for more than AWS, e.g. multi-cloud deployments, or post-creation steps that can't be passed in by userdata/cloud-init, in which case CF can't do the job alone and might not be the right tool.
I'm a big proponent of CF when you are using AWS, but if you are on GCP, don't even bother with their managed tool, just go straight to TF. Their Deployment Manager is very buggy (or at least it was 2 years ago).
CloudFormation/Terraform/etc are also configuration management programs. They just work on the APIs of cloud vendors, rather than a package management tool's command-line options. They've been given a new name because people want to believe they're not just re-inventing the wheel, or that operating on cloud resources makes their tool magically superior.
> We used ansible heavily with AWS for 2 years. Then we decided to gut it out and do CF directly. Why? If we want to switch clouds, it's not like the ansible or terraform modules are transferable ... So we might as well go the natively supported route. > > I agree with the article that messages can be cryptic, but at the end of the day, I have a CF stack that represents an entity. I can blow away the stack, and if there's any failure or issue, I can escalate my permissions and kill it again. Still a problem? Then it's AWS's fault and a ticket away (though I've only had to do this once in 5 years and > 150,000 CF stacks).
This, 100%.
Another killer feature is StackSets. I managed to rewrite the Datadog integration CF (their version required manual steps) into a template that contained custom resources making calls to DD to do the registration on their side.
I then deployed that template through StackSets and bam: every account in a specific OU automatically configures itself without any manual steps.
> Managed services offer big benefits over software. With CF, new stacks, change sets, updates, rollbacks and drift detection are an API call away. > > Managed service providers offer big benefits over software. With CF and AWS support, help with problems is a support ticket away.
The problem is when those help tickets get responses like "try deleting everything by hand and see if it recreates without an error next time". They've worked on CloudFormation over the last year or so, but everyone I've known who switched to tools like Terraform did so after getting tired of unpredictable deployment times or hitting the many cases where CloudFormation gets itself into an irrecoverable state. I can count on no fingers the number of development teams who used CF and didn't ask for help recovering from an error state in CF that required out-of-band remediation.
I believe they've also gotten better at tracking new AWS features but there were multiple cases where using Terraform got you the ability to use a feature 6+ months ahead of CF.
> A portable Terraform + Kubernetes contraption is a lowest common denominator approach.
Terraform is much, much richer than CloudFormation so I'd compare it to CDK (with the usual aesthetic debate over declarative vs. procedural models) and it doesn't really make sense to call it LCD in the same way that you might use that to describe Kubernetes because it's not trying to build an abstraction which covers up the underlying platform details. Most of the Terraform I've written controls AWS but there's a significant value to also being able to use the same tool to control GCP, GitLab, Cloudflare, Docker, various enterprise tools, etc. with full access to native functionality.
Terraform (and Kubernetes) themselves aren't a lowest common denominator; however, I believe the comment alludes to an approach where you try to abstract away cloud features. This can (kind of) reasonably be done with Terraform and Kubernetes by avoiding vendor-specific services such as the various ML services, DynamoDB, etc.
However, you can use Terraform just fine while still leveraging vendor-specific services that actually offer added value, like DynamoDB or Lambda. CloudFormation doesn't really offer much added value (if any) over Terraform, so using Terraform isn't an LCD approach per se.
Yes — that's basically what I was thinking: you could make an argument that using Kubernetes inherently adds an abstraction layer which might not be preferable to using platform-native components but it sounded like the person I was responding to was making the argument that using Terraform requires that approach.
I found that especially puzzling because one of the reasons why we switched to Terraform was because it let us take advantage of new AWS features on average much faster than CloudFormation.
I once got really stuck with CloudFormation not deleting a resource that also didn't show up in the AWS console - not fun.
> Managed services offer big benefits over software.
TF can be used as a managed service.
> Managed service providers offer big benefits over software. With CF and AWS support, help with problems is a support ticket away.
The same is true with TF, except 100000% better unless you're paying boatloads of money for higher tiered support.
> I only run workloads on AWS, so the CF syntax, specs and docs unlock endless first party features.
CF syntax is an abomination. Lots of the bounds of CF are dogmatic and unhelpful.
> I have seen Terraform turn into a tire-fire of migrations from state files to Terraform enterprise to Atlantis that took an entire DevOps team to care for.
CF generally takes an entire DevOps team to care for, for any substantial project.
> TF can be used as a managed service.
Sure, but I've never seen that myself. Where TF was used, it was always self-managed infrastructure at best.
> The same is true with TF, except 100000% better unless you're paying boatloads of money for higher tiered support.
Again, all places I worked had enterprise support and even a rep assigned. I think I only used support for CF early on; I don't know if it was buggier back then or I just understood it better and didn't run into issues with it.
> CF syntax is an abomination. Lots of the bounds of CF are dogmatic and unhelpful.
I would agree with you if you were talking about JSON, but since they introduced YAML it is actually better than HCL. One great thing about YAML is that it can be easily generated programmatically without using templates. Things like Troposphere make it even better.
> CF generally takes an entire DevOps team to care for, for any substantial project.
Over nearly 10 years of experience, I've never seen that to be the case. I'm currently in a place that has an interesting approach: you're responsible for deployment of your app, so you can use whatever you want, but you're responsible for it.
So now I'm working with both. And IMO I see a lot of resources that are not being cleaned up (because there's no page like CF has, people often forget to deprovision stuff), and I'm also seeing bugs, like TF needing to be run twice (I think the last time I saw it fail, it was trying to set tags on a resource that wasn't fully created yet).
There are also situations where CF is just plain better. I mentioned in another comment how I managed to get the Datadog integration through a single CF file deployed through StackSets (this basically ensured that any new account is properly configured). If I had to use TF for this, I would likely have to write some kind of service that listens for events from Control Tower and, whenever a new account was added to the OU, runs Terraform to configure resources on our side and makes an API call to DD to configure it to use them.
All I did was write code that generated CF via troposphere and deploy it to a StackSet in the master account once.
> Sure, but I've never seen that myself.
Right, your post is mostly "I like the thing that I've used, and I do not like the thing I haven't used". They're apples and different apples.
> Again, all places I worked had enterprise support and even a rep assigned
So, again, you've worked at places that were deeply invested in CF workflows.
> but since they introduced YAML it is actually better than HCL. One great thing about YAML is that it can be easily generated programmatically without using templates.
Respectfully, this is the first-ever "yaml is good" post I think I've ever seen.
> Over nearly 10 years of experience, I've never seen that to be the case. I'm currently in a place that has an interesting approach: you're responsible for deployment of your app, so you can use whatever you want, but you're responsible for it.
I'd love to hear more about this.
> And IMO I see a lot of resources that are not being cleaned up (because there's no page like CF has, people often forget to deprovision stuff), and I'm also seeing bugs, like TF needing to be run twice (I think the last time I saw it fail, it was trying to set tags on a resource that wasn't fully created yet).
I guess we're just ignoring CF's rollback failures/delete failures/undeletable resources that require support tickets then?
> There are also situations where CF is just plain better. I mentioned in another comment how I managed to get the Datadog integration through a single CF file deployed through StackSets (this basically ensured that any new account is properly configured). If I had to use TF for this, I would likely have to write some kind of service that listens for events from Control Tower and, whenever a new account was added to the OU, runs Terraform to configure resources on our side and makes an API call to DD to configure it to use them.
Again respectfully, yes, the person that both doesn't like and hasn't invested time into using Terraform at scale probably isn't going to find good solutions for complicated problems with it.
> you're responsible for deployment of your app, so you can use whatever you want, but you're responsible for it.
What happens when you leave the project?
> help with problems is a support ticket away
While this is true and AWS support is very responsive and useful, it doesn't mean they solve all the problems. Sometimes their help is: "I'll note that as a feature request, in the meantime you can implement this yourself using lambdas".
CloudFormation can be very flexible, and with tools like Sceptre it can work very well. A huge issue is that WITHOUT tools like Sceptre, you can't really use stacks except as dumb silos. You already need additional tooling (Sceptre, CDK, SAM, ...) to make CF workable. I think that most people who despise CF haven't got good tooling.
The issue with CloudFormation is that it quite often lags behind all the other AWS services. It seems to be maintained by a separate team. I realize that getting state management for complex interdependent resources right requires time and diligence, BUT it's a factor in driving adoption.
- New EKS feature? Sorry, no knob in CF for MONTHS.
- New EBS gp3 volume type available? Sorry, our AWS::Another::Service resource does not accept gp3 values for months after feature release.
- An AWS API exposes information that you would like to use as an input to another resource or stack. SURPRISE, CloudFormation does not return that attribute to you, despite it being available. SORRY, NOT FIXABLE.
- Try refactoring and moving resources in/out of stacks while preserving state? Welcome to a fresh hell.
- Quality of CloudFormation "modules" varies. AWS::Elasticsearch::Domain used to suck a whole lot, while AWS::S3::Bucket, as a core service, was always very friendly.
- CloudFormation custom resources are a painful way of "extending" CF to suit your needs. So painful that I refuse to pay the cost of AWS not keeping their stuff up to date and well integrated.
This kind of lag, this kind of incompleteness in making information from AWS APIs available, has driven me to Terraform for things that are best done in Terraform and require flexibility, and to CloudFormation for things that work well with it.
At the end of the day, CF is a closed-source flavour of Terraform + the AWS provider. I would like to have gone all in, but it just doesn't work, and it costs hacks, time and flexibility.
That being said, if you have no idea how to work with TF, tell the devs to use CF.
The experience of AWS support is probably also very different when it comes to feature requests. An org that spends half a billion with AWS will get their requests implemented ASAP whereas small fish have to swim with the flow and hope it works for them.
+1 for Sceptre. So much nicer for stack orchestration.
Please give me a simple, type-safe, easy-to-use language (hint: Dhall) to represent my infra.
YAML programming is not something that makes people happy. K8s is even worse in this regard: YAML documents embedded into YAML documents.
https://github.com/awslabs/aws-cloudformation-templates/blob...
Pulumi with Go or TypeScript? I'm not sure why people are so hung up on using custom languages for their infrastructure.
Do you understand the difference between the things you can express with Dhall or ML vs Go or TS?
People are so hung up because we could do so much better at expressing only valid states; you shouldn't need to deploy your infra to figure out that an S3 bucket cannot have a redirect_only flag set and a website index document set at the same time.
Ooh, Atlantis, thanks for telling me.
TFA is written by someone who just discovered IAM roles a few days back (see his old blogs) :eye roll:
> I've configured literally 1000s of systems with CloudFormation with very few problems.
This is a great way of saying "I've never used CloudFormation" without stating it directly.
> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.
Hacker News is not a site for snarky accusations of deception.
That's right - use AWS CDK instead. You don't have to worry about the low-level CloudFormation syntax and details. I switched a few years ago and haven't looked back. CDK keeps getting better and better, also handling things like asset deployments (Docker images, S3 content, bundling), quick Lambda updates with --hotswap, quick stack debugging with --no-rollback, etc.
I've been developing with CDK for about 6 months now, and it seems to me like a great idea.
However, the implementation is quite lackluster.
- It's very slow. Deploying our app from scratch takes >30 minutes. An update of a single lambda (implementation, not permissions) takes about 3-5 minutes.
- Bad error information. On an error, cdk usually just reports which component couldn't be created, not why it couldn't be created. That information you 'might' find in the AWS CloudFormation console, or not.
- It's buggy. Some deployments randomly fail because cdk couldn't find your CLI credentials. Sometimes resources can't be created. In each case it works if you deploy again...
I suspect that most shortcomings that I experience are due to cloudformation. I'd really be interested in a 2.0 version that is either highly optimized or polished, or works with a different intermediate abstraction entirely. Updating and validating a tree of resources and policies should not be that difficult, at least not for AWS.
I think that the --hotswap option does what you are hoping for speedier updates. It allows CDK to update some cloud resources directly (e.g. Lambda function code) so that they "drift" away from the CloudFormation template definition. It's good for quick development.
CDK is terrible and way too low-level for most apps, and the worst of both worlds for code/declarative IaC (a verbose, cognitive-overhead-heavy way to try to author code that generates a CF template that still may not work).
Use AWS Copilot, it's like a reboot of Elastic Beanstalk complete with good CLI, made for the Fargate age and lets you extend it with bits of CF. (https://aws.github.io/copilot-cli/)
It sounds like your development is focused on containers, and perhaps Copilot is best for that. AWS CDK works very well when your application is based on many different kinds of cloud resources that need to be connected together, utilizing all the serverless and managed features of AWS.
Copilot supports serverless too (App Runner, serverless Aurora, DynamoDB, S3, CodePipeline etc). But if your app is using a huge number of AWS features, you're probably over-engineering it.
Sure it's opinionated enough that it won't fit every use case people have on AWS, but in most cases for new apps it's good.
I agree, and have the same experience. CDK is so much easier, much less verbose, and unit testable (at least to some degree).
Since resource importing is possible in CDK (not nice, but possible) you can even start using it if you already have resources that you do not want to recreate.
I'm always surprised that more people aren't aware of CDK. It's an extremely powerful way to write software, especially once you get good at it. CFN pales in comparison; CDK to me feels like the future of software development.
My comments are my own and don't represent my employer.
Almost every new service built inside Amazon uses CDK. I too am surprised that more people aren't aware of it. And you're right once you get good at it you can spin up infrastructure incredibly quickly with minimal guess work.
CDK was not used extensively at my org a couple of months ago. I agree it’s light years ahead of everything else though.
Oh that's interesting; in my experience across the Books Org and InfoSec it's being widely used.
I was at one point willing to try it, and what dissuaded me was just that I also needed to have nodejs added as a dependency to my project.
Yeah, the nodejs dependency is an abomination. The CDK also requires a nodejs version newer than what comes on many Linux distributions, so you end up installing a custom nodejs version just to run CDK.
I don't know why they picked nodejs over a more sensible language such as python3, which comes installed almost anywhere, or golang, which is easy to distribute in binary form and relatively backward compatible.
The aws cli really got it right: just let me install a standalone binary and call it a day.
AWS CDK also works with Python. However, in my experience, TypeScript-based CDK projects are more productive to develop, due to all the built-in error checking and autocompletion features in Visual Studio Code. Node.js also provides a more coherent environment with the package.json structure, which defines all the scripts and dependencies etc. Python-based CDK projects seem to include more "custom hacks". Because of all this, I prefer to use TypeScript for CDK even if the actual application to deploy is based on Python.
Regardless of what language you use to write your CDK stacks, the cdk CLI application, which transpiles your CDK stacks into CloudFormation, requires nodejs. So in reality CDK requires nodejs — always — and then whatever language you are writing the stacks in as well. It becomes a pain in the ass when you have a Python application defined in a Python CDK stack, but you still need nodejs to transpile that stack into CloudFormation.
My main gripe is that the cdk cli should be distributed as a standalone binary so I don’t need to install nodejs to deploy my project.
Ok thanks for clarifying. I guess we already have Node.js everywhere where CDK is used, so never noticed it. I can see it's uncomfortable when that is not the case.
Maybe CloudFormation is too entrenched at the moment? I think CDK became publicly available in late 2019 or early 2020. I haven't used it because I'm currently working on a project using Serverless (thin layer on top of CloudFormation) with no easy way to slip any CDK in.
Terraform CDK does this, but instead of just AWS, it does it for everything.
CDK is great but because it compiles down to CloudFormation it is slow slow slow.
Yuck - the CDK is everything I dislike about IaC. Not everything needs to be actual code.
Complexity management is an often overlooked parameter of reliable infrastructure.
Yeah, but have you ever managed a large CFN project? You'll wish you had something more expressive than YAML.
Yes, I have. Strongly prefer plain CF.
The purpose of CDK is exactly to reduce the complexity of your IaC. It automatically sets up all kinds of defaults, such as required permissions between different cloud resources that need to talk to each other. The underlying complex things are handled by AWS-managed library code.
This doesn't even include one of the most useful features of CDK: truly modular components. I can create a module used 100 times and all you have to worry about is the constructor and inputs instead of the nightmarish lack of reusability in other IaC.
You can also dump the CloudFormation template to a file to inspect it or feed it into other tooling.
How does that differ from terraform modules?
Same here. I used TF for about 2 years and switched to CDK early 2020.
It used to have a lot of rough edges, but that's (mostly) not the case anymore.
Pulumi is also nice for non-AWS related stuff.