AWS in 2025: Stuff you think you know that's now wrong

2025-08-20 15:30 · www.lastweekinaws.com


One of the neat things about AWS is that it’s almost twenty years old. One of the unfortunate things about AWS is… that it’s almost twenty years old. If you’ve been using the platform for a while, it can be hard to notice the pace of change in the underlying “foundational” services. More worryingly, even if you’re not an old hand at AWS scrying, it’s still easy to stumble upon outdated blog posts that speak to the way things used to be, rather than the way they are now. I’ve gathered some of these evolutions that may help you out if you find yourself confused.

In EC2, you can now change security groups and IAM roles without shutting the instance down to do it. 
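
Both changes are plain API calls against the live instance. A minimal boto3-style sketch (the `ec2` client is passed in, e.g. `boto3.client("ec2")`; all IDs and ARNs here are placeholders):

```python
def swap_security_groups(ec2, instance_id, group_ids):
    # Replaces the security groups on the instance's primary network
    # interface in place; no stop/start required.
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=group_ids)

def swap_instance_profile(ec2, instance_id, profile_arn):
    # Look up the instance's current IAM instance profile association,
    # then replace it with a new profile -- also with zero downtime.
    assoc = ec2.describe_iam_instance_profile_associations(
        Filters=[{"Name": "instance-id", "Values": [instance_id]}]
    )["IamInstanceProfileAssociations"][0]
    ec2.replace_iam_instance_profile_association(
        AssociationId=assoc["AssociationId"],
        IamInstanceProfile={"Arn": profile_arn},
    )
```

(If the instance has no profile attached yet, you'd call `associate_iam_instance_profile` instead of replacing one.)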

You can also resize, attach, or detach EBS volumes from running instances. 
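
Growing a volume under a running instance is one call, though the guest OS still has to be told about it. A hedged sketch (boto3-style, client passed in, IDs are placeholders):

```python
def grow_volume(ec2, volume_id, new_size_gib):
    # Volumes can be grown (never shrunk) while attached to a running
    # instance; the instance keeps serving IO throughout.
    ec2.modify_volume(VolumeId=volume_id, Size=new_size_gib)
    # The guest still needs to grow the partition and filesystem
    # afterwards (e.g. growpart, then resize2fs or xfs_growfs).

def hot_attach_volume(ec2, volume_id, instance_id, device="/dev/sdf"):
    # Attach an extra volume to a running instance; no stop/start needed.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
```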

As of very recently, you can also force EC2 instances to stop or terminate without waiting for a clean shutdown or a ridiculous timeout, which is great for things you’re never going to spin back up. 
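
The long-standing `Force` flag on stop already skips the graceful-shutdown wait (the newer skip-OS-shutdown option for terminate is exposed in recent SDK versions too). A minimal sketch with an injected boto3-style client:

```python
def force_stop(ec2, instance_ids):
    # Force=True tells EC2 not to wait for a clean OS shutdown --
    # handy for instances you're never going to spin back up.
    return ec2.stop_instances(InstanceIds=instance_ids, Force=True)
```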

They also added the ability to live-migrate instances to other physical hosts; the most visible effect is that instance degradation notices are much rarer nowadays. 

Similarly, instances have gone from an “expect this to disappear out from under you at any time” level of reliability to that being almost unheard of in the modern era. 

Spot instances used to be much more of a bidding war / marketplace. These days the shifts are way more gradual, and you get to feel a little bit less like an investment banker watching the numbers move on your dashboards in realtime. 

You almost never need dedicated instances for anything. It’s been nearly a decade since they stopped being required for HIPAA BAAs. 

AMI Block Public Access is now the default for new accounts, and was turned on for any accounts that hadn’t owned a public AMI for 90 days back in 2023.

S3 isn’t eventually consistent anymore; it’s read-after-write consistent.

You don’t have to randomize the first part of your object keys to ensure they get spread around and avoid hotspots. 

ACLs are deprecated and off by default on new buckets.

Block Public Access is now enabled by default on new buckets.
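
For the rare bucket that genuinely should serve public objects, the modern recipe is to relax only the policy-related Block Public Access flags and then attach a public-read bucket policy. A hedged sketch (boto3-style `s3` client passed in; the bucket name is a placeholder):

```python
def allow_public_policy(s3, bucket):
    # New buckets ship with all four Block Public Access flags on.
    # Relax only the two policy-related flags; leave ACL blocking on,
    # since ACLs are legacy and off by default anyway.
    s3.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": False,     # permit a public-read bucket policy
            "RestrictPublicBuckets": False,
        },
    )
    # After this call, a bucket policy allowing s3:GetObject to "*"
    # will actually take effect.
```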

New buckets are transparently encrypted at rest. 

Once upon a time Glacier was its own service that had nothing to do with S3. If you look closely (hi, billing data!) you can see vestiges of how this used to be, before the S3 team absorbed it as a series of storage classes. 

Similarly, there used to be truly horrifying restore fees for Glacier that were also very hard to predict. That got fixed early on, but the scary stories left scars to the point where I still encounter folks who think restores are both fiendishly expensive as well as confusing. They are not.

Glacier restores are also no longer painfully slow.

Obviously EC2-Classic is gone, but that was a long time ago. One caveat that does come up a lot is that public IPv4 addresses are no longer free; they cost the same as Elastic IP addresses. 

VPC peering used to be annoying; now there are better options like Transit Gateway, VPC sharing between accounts, resource sharing between accounts, and Cloud WAN. 

VPC Lattice exists as a way for things to talk to one another and basically ignore a bunch of AWS networking gotchas. So does Tailscale.

CloudFront isn’t networking but it has been in the AWS “networking” section for ages so you can deal with it: it used to take ~45 minutes for an update, which was terrible. Nowadays it’s closer to 5 minutes—which still feels like 45 when you’re waiting for CloudFormation to finish a deployment.

ELB Classic (“classic” means “deprecated” in AWS land) used to charge for cross-AZ data transfer, on top of the load balancer’s “data has passed through me” fee, when sending to backends in a different availability zone. 

ALBs, which always have cross-zone load balancing enabled, do not charge additional data transfer fees for cross-AZ traffic, just their LCU fees. The same is true for Classic Load Balancers, but be warned: Network Load Balancers still charge cross-AZ fees!

Network Load Balancers didn’t support security groups at launch, but they do now. 

Availability Zones used to be randomized between accounts (my us-east-1a was your us-east-1c); you can now use Resource Access Manager to get zone IDs to ensure you’re aligned between any given accounts.
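
The EC2 API also exposes the stable zone IDs directly, which makes alignment checks scriptable. A small sketch with an injected boto3-style client:

```python
def zone_ids(ec2):
    # Zone IDs (e.g. use1-az4) refer to the same physical AZ in every
    # account, while names like us-east-1a are shuffled per account.
    resp = ec2.describe_availability_zones()
    return {z["ZoneName"]: z["ZoneId"] for z in resp["AvailabilityZones"]}
```

Comparing the returned maps across two accounts tells you which zone names actually share hardware.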

Originally Lambda had a 5 minute timeout and didn’t support container images. Now you can run them for up to 15 minutes, use Docker images, use shared storage with EFS, give them up to 10GB of RAM (for which CPU scales accordingly and invisibly), and give /tmp up to 10GB of storage instead of just half a gig.
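
All of those limits are per-function configuration. A sketch of maxing them out (boto3-style Lambda client passed in; the function name is a placeholder):

```python
def max_out_function(lam, function_name):
    # Current ceilings: 15-minute timeout, 10,240 MB memory (vCPU scales
    # with memory automatically), 10,240 MB ephemeral /tmp (up from 512 MB).
    lam.update_function_configuration(
        FunctionName=function_name,
        Timeout=900,                       # seconds
        MemorySize=10240,                  # MB
        EphemeralStorage={"Size": 10240},  # MB of /tmp
    )
```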

Invoking a Lambda in a VPC is no longer dog-slow.

Lambda cold-starts are no longer as big of a problem as they were originally.

You no longer have to put a big pile of useless data on an EFS volume to get your IO allotment to something usable; you can adjust that separately from capacity now that they’ve added a second knob.
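
That second knob is the file system's throughput mode. A hedged one-liner sketch (boto3-style EFS client passed in):

```python
def decouple_throughput(efs, file_system_id):
    # "elastic" throughput scales IO with demand, independent of how much
    # data is stored -- no more parking junk data just to earn burst credits.
    efs.update_file_system(FileSystemId=file_system_id, ThroughputMode="elastic")
```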

You get full performance on new EBS volumes that are empty. If you create an EBS volume from a snapshot, you’ll want to read the entire disk with dd or similar because it lazy-loads snapshot data from S3 and the first read of a block will be very slow. If you’re in a hurry, there are more expensive and complicated options.
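
The warm-up is just a full sequential read of the device. A sketch equivalent to the classic dd trick (the device path is a placeholder and needs appropriate privileges on a real instance):

```python
def warm_volume(device_path, chunk_bytes=1024 * 1024):
    # Read every block once so EBS pulls all the lazily-loaded snapshot
    # data down from S3; later reads then run at full speed.
    total = 0
    with open(device_path, "rb") as dev:
        while True:
            buf = dev.read(chunk_bytes)
            if not buf:
                break
            total += len(buf)
    return total  # bytes touched
```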

EBS volumes can be attached to multiple EC2 instances at the same time (assuming io1), but you almost certainly don’t want to do this.

In DynamoDB, you can now have empty fields in an item (the newsletter publication system for “Last Week in AWS” STILL uses a field designator of empty because it predates that change). 

Performance has gotten a lot more reliable, to the point where you don’t need to use support-only tools locked behind NDAs to see what your hot key problems look like. 

With DynamoDB’s pricing changes, you almost certainly want to run everything in On-Demand capacity mode unless you’re in a very particular space.

Reserved Instances are going away for EC2, slowly but surely. Savings Plans are the path forward. The savings rates on these have diverged, to the point where they no longer offer as deep of a discount as RIs once did, which is offset by their additional flexibility. Pay attention!

EC2 charges by the second now, so spinning one up for five minutes over and over again no longer costs you an hour each time.
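
The arithmetic is simple: Linux on-demand instances bill per second with a 60-second minimum per run. A small sketch with a hypothetical hourly rate:

```python
def ec2_linux_cost(seconds_running, hourly_rate):
    # Per-second billing with a 60-second minimum per run: a five-minute
    # instance costs five minutes, not the full hour it used to.
    billable = max(seconds_running, 60)
    return billable / 3600 * hourly_rate

# A five-minute run at a hypothetical $0.10/hr is about $0.0083,
# versus $0.10 under the old round-up-to-the-hour model.
five_min = ec2_linux_cost(300, 0.10)
```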

Cost Anomaly Detection has gotten very good at flagging sudden changes in spend patterns. It is free. 

The Compute Optimizer also does EBS volumes and other things. Its recommendations are trustworthy, unlike “Trusted” Advisor’s various suggestions. 

The Trusted Advisor recommendations remain sketchy and self-contradictory at best, though some of their cost checks can now route through Compute Optimizer.

IAM roles are where permissions should live. IAM users are strictly for legacy applications rather than humans. The IAM Identity Center is the replacement for “AWS SSO” and it’s how humans should engage with their AWS accounts. This does cause some friction at times.

You can have multiple MFA devices configured for the root account. 

You also do not need to have root credentials configured for organization member accounts.

us-east-1 is no longer a merrily burning dumpster fire of sadness and regret. This is further true across the board; things are a lot more durable these days, to the point where outages are noteworthy rather than “it’s another given Tuesday afternoon.”

While deprecations remain rare, they’re definitely on the rise; if an AWS service sounds relatively niche or goofy, consider your exodus plan before building atop it. None of the services mentioned thus far qualify. 

CloudWatch no longer shows an artificially low last datapoint due to data inconsistency, so if your graphs suddenly drop to zero for the last datapoint, your app just shit itself. 

You can close AWS accounts in your organization from the root account rather than having to log into each member account as their root user.
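
Account closure is now a single Organizations API call from the management account. A hedged sketch (boto3-style Organizations client passed in; the account ID is a placeholder):

```python
def close_member_account(org, account_id):
    # Called from the organization's management account; no member-account
    # root login required anymore.
    org.close_account(AccountId=account_id)
```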

My thanks to folks on LinkedIn and BlueSky for helping come up with some of these. You’ve lived the same pain I have.



Comments

  • By simonw 2025-08-20 16:18

    S3: "Block Public Access is now enabled by default on new buckets."

    On the one hand, this is obviously the right decision. The number of giant data breaches caused by incorrectly configured S3 buckets is enormous.

    But... every year or so I find myself wanting to create an S3 bucket with public read access so I can serve files out of it. And every time I need to do that I find something has changed and my old recipe doesn't work any more and I have to figure it out again from scratch!

    • By sylens 2025-08-20 18:10

      The thing to keep in mind with the "Block Public Access" setting is that it's a redundancy built in to save people from making really big mistakes.

      Even if you have a terrible and permissive bucket policy or ACLs (legacy but still around) configured for the S3 bucket, if you have Block Public Access turned on - it won't matter. It still won't allow public access to the objects within.

      If you turn it off but you have a well scoped and ironclad bucket policy - you're still good! The bucket policy will dictate who, if anyone, has access. Of course, you have to make sure nobody inadvertently modifies that bucket policy over time, or adds an IAM role with access, or modifies the trust policy for an existing IAM role that has access, and so on.

      • By simonw 2025-08-20 19:20

        I think this is the key of why I find it confusing: I need a very clear diagram showing which rules override which other rules.

        • By saghm 2025-08-20 23:41

          My understanding is that there isn't actually any "overriding" in the sense of two rules conflicting and one of them having to "win" and take effect. I think it's more that an enabled rule always is in effect, but it might overlap with another rule, in which case removing one of them still won't remove the restrictions on the area of overlap. It's possible I'm reading too much into your choice of words, but it does sound like there's a chance that the confusion is stemming from an incorrect assumption of how various permissions interact.

          That being said, there's certainly a lot more that could go into making a system like that easier for developers. One thing that springs to mind is tooling that can describe what rules are currently in effect that limit (or grant, depending on the model) permissions for something. That would make it more clear when there are overlapping rules that affect the permissions of something, which in turn would make it much more clear why something is still not accessible from a given context despite one of the rules being removed.

          • By jagged-chisel 2025-08-21 11:40

            If one rule explicitly restricts access and another explicitly grants access, which one is in effect? Do restrictions override grants? Does a grant to GroupOne override a restriction to GroupAlpha when the authenticated user is in both groups? Do rules set by GodAdmin override rules set by AngelAdmin?

            • By saghm 2025-08-21 15:21

              It's possible I'm making the exact mistake that the article describes and relying on outdated information, but my understanding is that pretty much all of the rules are actually permissions rather than restrictions. "Block public access" is an unfortunate exception to this, and I suspect that it's probably just a poorly named inversion of an "allow public access" permission. You're 100% right that modeling permissions like this requires having everything in the same "direction", i.e. either all permissions or all restrictions.

              After thinking about this sort of thing a lot when designing a system for something sort of similar to this (at a much smaller scale, but with the intent to define it in a way that could be extended to define new types of rules for a given set of resources), I feel pretty strongly that the perspectives of security, ease of implementation, and intuitiveness for users are all aligned in requiring every rule to explicitly be defined as a permission rather than representing any of them as restrictions (both in how they're presented to the user and how they're modeled under the hood). With this model, verifying whether an action is allowed can be implemented by mapping an action to the set of accesses (or mutations, as the case may be) it would perform, and then checking that each of them has a rule present that allows it. This makes it much easier to figure out whether something is allowed or not, and there's plenty of room for quality of life things to help users understand the system (e.g. being able to easily show a user what rules pertain to a given resource with essentially the same lookup that you'd need to do when verifying an action in it). My sense is that this is actually not far from how AWS permissions are implemented under the hood, but they completely fail at the user-facing side of this by making it much harder than it needs to be to discover where to define the rules for something (and by extension, where to find the rules currently in effect for it).

        • By luluthefirst 2025-08-21 13:51

          They don't really override each other but they act like stacked barriers, like a garage door blocking access to an open or closed car. Access is granted if every relevant layer allows it.

    • By andrewmcwatters 2025-08-20 17:37

      This sort of thing drives me nuts in interviews, when people are like, are you familiar with such-and-such technology?

      Yeah, what month?

      • By tester756 2025-08-20 20:08

        If you're aware of changes, then explain that there were changes over time, that's it

        • By andrewmcwatters 2025-08-20 21:31

          You seem to be lacking the experience of what actually happens in interviews.

        • By reactordev 2025-08-21 2:32

          You say this, someone challenges you, now you're on the defensive during an interview and everyone has a bad taste in their mouth. Yeah, that's how it goes.

          • By pas 2025-08-21 7:38

            That's just the taste of iron from the blood after the duel. But this is completely normal after a formal challenge! Companies want real cyberwarriors, and the old (lame) rockstar ninjas that they hired 10 years ago are very prone to issuing these.

            • By reactordev 2025-08-21 14:58

              I don’t want to go to war, I just want a quiet house in the mountains and a career that allows me to think about things.

    • By crinkly 2025-08-20 17:23

      I just stick CloudFront in front of those buckets. You don't need to expose the bucket at all then and can point it at a canonical hostname in your DNS.

      • By hnlmorg 2025-08-20 18:14

        That’s definitely the “correct” way of doing things if you’re writing infra professionally. But I do also get that more casual users might prefer not to incur the additional costs nor complexity of having CloudFront in front. Though at that point, one could reasonably ask if S3 is the right choice for casual users.

        • By gchamonlive 2025-08-20 18:52

          S3 + cloudfront is also incredibly popular so you can just find recipes for automating that in any technology you want, Terraform, ansible, plain bash scripts, Cloudformation (god forbid)

          • By gigatexal 2025-08-20 18:59

            Yeah holy crap why is cloud formation so terrible?

            • By gchamonlive 2025-08-20 19:34

              It's designed to be a declarative DSL, but then you have to do all sorts of filters and maps in any group of resources and suddenly you are programming in yaml with both hands tied behind your back

              • By gigatexal 2025-08-20 21:48

                Yeah it’s just terrible. If Amazon knew what was good they’d just replace it with almost anything else. Heck, just go all in on terraform and call it a day.

                • By mdaniel 2025-08-21 4:24

                  This may be heresy in an AWS thread, but as a concept Bicep actually isn't terrible: https://github.com/Azure/bicep/blob/v0.37.4/src/Bicep.Cli.E2...

                  It does compile down to Azure Resource Manager's json DSL, so in that way close to Troposphere I guess, only both sides are official and not just some rando project that happens to emit yaml/json

                  The implementation, of course, is ... very Azure, so I don't mean to praise using it, merely that it's a better idea than rawdogging json

                  • By hnlmorg 2025-08-21 7:49

                    I’ve heard so many bad things about bicep on Azure that I’m not convinced it’s an upgrade over TF.

                    The syntax does look nicer but sadly that’s just a superficial improvement.

                • By hnlmorg 2025-08-21 7:44

                  They do contribute to the AWS provider for Terraform.

                  Also they have CDK, which is a framework for writing IaC in Java/TypeScript, Go, Python, etc.

                • By mdaniel 2025-08-21 4:28

                  As for "go all in on terraform," I pray to all that is holy every night that terraform rots in the hell that spawned it. And that's not even getting into the rug pull parts, I mean the very idea of

                  1. I need a goddamn CLI to run it (versus giving someone a URL they can load in their tenant and have running resources afterward)

                  1. the goddamn CLI mandates live cloud credentials, but then straight-up never uses them to check a goddamn thing it intends to do to my cloud control plane

                  You may say "running 'plan' does" and I can offer 50+ examples clearly demonstrating that it does not catch the most facepalm of bugs

                  1. related to that, having a state file that believes it knows what exists in the world is just ludicrous and pain made manifest

                  1. a tool that thinks nuking things is an appropriate fix ... whew. Although I guess in our new LLM world, saying such things makes me the old person who should get onboard the "nothing matters" train

                  and the language is a dumpster, imho

                  • By hnlmorg 2025-08-21 8:04

                    There's a lot wrong with Terraform but I don't think you're being at all fair with your specific criticisms here:

                    > 1. I need a goddamn CLI to run it (versus giving someone a URL they can load in their tenant and have running resources afterward)

                    CloudFormation is the only IaC that supports "running as a URL" and that's only because it's an AWS native solution. And CloudFormation is a hell of a lot more painful to write and slower to iterate on. So you're not any better off for using CF.

                    What usually happens with TF is you'd build a deploy pipeline. Thus you can test via the CLI then deploy via CI/CD. So you're not limited to just the CLI. But personally, I don't see the CLI as a limitation.

                    > the goddamn CLI mandates live cloud credentials, but then stright-up never uses them to check a goddamn thing it intends to do to my cloud control plane

                    All IaC requires live cloud credentials. It would be impossible for them to work without live credentials ;)

                    Terraform does do a lot of checking. I do agree there is a lot that the plan misses though. That's definitely frustrating. But it's a side effect of cloud vendors having arbitrary conditions that are hard to define and forever changing. You run into the same problem with any tool you'd use to provision. Heck, even manually deploying stuff from the web console sometimes takes a couple of tweaks to get right.

                    > 1. related to that, having a state file that believes it knows what exists in the world is just ludicrous and pain made manifest

                    This is a very strange complaint. Having a state file is the bare minimum any IaC NEEDS for it to be considered a viable option. If you don't like IaC tracking state then you're really little better off than managing resources manually.

                    > a tool that thinks nuking things is an appropriate fix ... whew.

                    This is grossly unfair. Terraform only destroys resources when:

                    1. you remove those resources from the source. Which is sensible because you're telling Terraform you no longer want those resources

                    2. when you make a change that AWS doesn't support doing on live resources. Thus the limitation isn't Terraform, it is AWS

                    In either scenario, the destroy is explicit in the plan and expected behaviour.

                    • By mdaniel 2025-08-21 14:40

                      > CloudFormation is the only IaC that supports "running as a URL"

                      Incorrect, ARM does too, they even have a much nicer icon for one click "Deploy to Azure" <https://learn.microsoft.com/en-us/azure/azure-resource-manag...> and as a concrete example (or whole repo of them): <https://github.com/Azure/azure-quickstart-templates/tree/2db...>

                      > All IaC requires live cloud credentials. It would be impossible for them to work without live credentials ;)

                      Did you read the rest of the sentence? I said it's the worst of both worlds: I can't run "plan" without live creds, but then it doesn't use them to check jack shit. Also, to circle back to our CF and Bicep discussion, no, I don't need cloud creds to write code for those stacks - I need only creds to apply them

                      I don't need a state file for CF nor Bicep. Mysterious about that, huh?

                      • By hnlmorg 2025-08-21 18:57

                        > Incorrect, ARM does too, they even have a much nicer icon for one click "Deploy to Azure"

                        That’s Azure, not AWS. My point was to have “one click” HTTP installs you need native integration with the cloud vendor. For Azure it’s the clusterfuck that is Bicep. For AWS it’s the clusterfuck that is CF

                        > I don't need a state file for CF nor Bicep.

                        CF does have a state file, it’s just hidden from view.

                        And bicep is shit precisely because it doesn’t track state. In fact the lack of a state file is the main complaint against bicep and thus the biggest thing holding it back from wider adoption — despite being endorsed by Microsoft Azure.

                  • By gchamonlive 2025-08-21 13:56

                    All Terraform does is build a DAG, compare it with the current state file and pass the changes down to the provider so it can translate to the correct sequence of interactions with the upstream API. Most of your criticism boils down to limitations of the cloud provider API and/or Terraform provider quality. It won't check for naming collision for instance, it assumes you know what you are doing.

                    Regarding HCL, I respect their decision to keep the language minimal, and for all it's worth you can go very, very far with the language expressions and using modules to abstract some logic, but I think it's a fair criticism for the language not to support custom functions and higher level abstractions.

                  • By SvenL 2025-08-21 4:50

                    Amen, and I would add to that list “no, just because you use terraform doesn’t mean you can simply switch between cloud providers”.

                    • By hnlmorg 2025-08-21 8:29

                      Is there any IaC solutions where you can “simply switch between cloud providers”?

                      This isn’t a limitation of TF, it’s an intended consequence of cloud vendor lock in

                      • By mdaniel 2025-08-21 14:55

                        I believe the usual uninformed thinking is "terraform exists outside of AWS, so I can move off of AWS" versus "we have used CF or Bicep, now we're stuck" kind of deal

                        Which is to say both of you are correct, but OP was highlighting the improper expectations of "if we write in TF, sure it sucks balls but we can then just pivot to $other_cloud" not realizing it's untrue and now you've used a rusty paintbrush as a screwdriver

                        • By hnlmorg 2025-08-21 19:53

                          I don’t think that expectation exists with anyone with even the slightest understanding of IaC and systems.

                          But maybe I’ve just been blessed to work with people who aren’t complete idiots?

                • By stogot 2025-08-21 2:22

                  Isn’t that what CDK was for?

            • By SteveNuts 2025-08-20 19:12

              Last time I tried to use CF, the third party IAC tools were faster to release new features than the functionality of CF itself. (Like Terraform would support some S3 bucket feature when creating a bucket, but CF did not).

              I'm not sure if that's changed recently, I've stopped using it.

              • By tkjef 2025-08-21 14:09

                I have been on the terraform side for 7 years-ish.

                eksctl just really impressed me with its eks management, specifically managed node groups & cluster add-ons, over terraform.

                that uses cloudformation under the hood. so i gave it a try, and it’s awesome. combine with github actions and you have your IAC automation.

                nice web interface for others to check stacks status, events for debugging and associated resources that were created.

                oh, ever destroy some legacy complex (or not that complex) aws shit in terraform? it’s not going to be smooth. site to site connections, network interfaces, subnets, peering connections, associated resources… oh, my.

                so far cloudformation has been good at destroying, but i haven’t tested that with massive legacy infra yet.

                but i am happily converted tf>cf.

                and will happily use both alongside each other as needed.

            • By dragonwriter 2025-08-21 0:53

              Because it's an old, early IaC language, but it works and lots depends on it, so instead of dumping or retooling it, AWS keeps it around as a compilation target, while pushing other solutions (years ago, the SAM transform on top of it, more recently CDK) as the main thing for people to actually use directly.

            • By baby_souffle 2025-08-21 0:48

              > Yeah holy crap why is cloud formation so terrible?

              I can't confirm it, but I suspect that it was always meant to be a sales tool.

              Every AWS announcement blog has a "just copy this JSON blob, and paste it $here to get your own copy of the toy demo we used to demonstrate in this announcement blog" vibe to it.

        • By damieng 2025-08-20 21:36

          I'd argue putting CloudFront on top of S3 is less complex than getting the permissions and static sharing setup right on S3 itself.

          • By hnlmorg 2025-08-21 8:12

            I do get where you're coming from, but I don't agree. With the CF+S3 combo you now need to choose which sharing mode to work with S3 (there are several different ways you can link CF to S3). Then you have the wider configuration of CF to manage too. And that's before you account for any caching issues you might run into when debugging your site.

            If you know what you're doing, as it sounds like you and I do, then all of this is very easy to get set up (but then aren't most things easy when you already know how? hehe). However we are talking about people who aren't comfortable with vanilla S3, so throwing another service into the mix isn't going to make things easier for them.

        • By crinkly 2025-08-20 18:49

          It's actually incredibly cheap. I think our software distribution costs, in the account I run, are around $2.00 a month. That's pushing out several thousand MSI packages a day.

          • By hnlmorg 2025-08-21 8:19

            S3 is actually quite expensive compared to the competition for both storage costs and egress costs. At a previous start-up, we had terabytes of data on S3 and it was our second largest cost (after GPUs) and by some margin.

            For small scale stuff, S3s storage and egress charges are unlikely to be impactful. But it doesn’t mean they’re cheap relative to the competition.

            There are also ways you can reduce S3 costs, but then you're trading the costs received from AWS with the costs of hiring competent DevOps. Either way, you pay.

          • By oblio 2025-08-20 21:56

            With CloudFront?

        • By tayo42 2025-08-20 20:14

          > S3 is the right choice for casual users.

          It's so simple for storing and serving a static website.

          Are there good and cheap alternatives?

          • By MaKey 2025-08-20 20:27

            Yeah, your classic web hoster. Just today I uploaded a static website to one via FTP.

            • By fodkodrasz 2025-08-20 20:49

              Really? If I remember correctly, my static website served from S3 + CF + R53 runs about $0.67/mo: $0.50 of that is R53, $0.16 is CF, and $0.01 is S3 for my page.

              BTW: Is GitHub Page still free for custom domains? (I don't know the EULA)

              • By daydream 2025-08-20 22:22

                GitHub Pages are still free but commercial websites are forbidden.

      • By herpderperator 2025-08-20 19:24

        For the sake of understanding, can you explain why putting CloudFront in front of the buckets helps?

        • By bhattisatish 2025-08-20 20:32

          CloudFront allows you to map your S3 bucket with both

          - signed URLs, in case you want session-based file downloads

          - default public files, e.g. for a static site.

          You can also map a domain (sub-domain) to CloudFront with a CNAME record and serve the files via your own domain.

          CloudFront distributions are also CDN-based. This way you serve files local to the user's location, thus increasing the speed of your site.

          For low to mid range traffic, CloudFront with S3 is cheaper as the network cost of CloudFront is lower. But for large network traffic, CloudFront costs can balloon very fast. But in those scenarios S3 costs are prohibitive too!

      • By dcminter 2025-08-21 11:50

        Not always that simple - for example if you want to automatically load /foo/index.html when the browser requests /foo/ you'll need to either use the web serving feature of S3 (bucket can't be private) or set up some lambda at edge or similar fiddly shenanigans.

    • By cedws 2025-08-21 3:37

      I’m getting deja vu, didn’t they already do this like 10 years ago because people kept leaving their buckets wide open?

    • By awongh 2025-08-20 23:22

      This is exactly what I use LLMs for. To just read the docs for me and pull out the base level demo code that's buried in all the AWS documentation.

      Once I have that I can also ask it for the custom tweaks I need.

      • By jiggawatts 2025-08-21 10:49

        Back when GPT4 was the new hotness, I dumped the markdown text from the Azure documentation GitHub repo into a vector index and wrapped a chatbot around it. That way, I got answers based on the latest documentation instead of a year-old LLM model's fuzzy memory.

        I now have the daunting challenge of deploying an Azure Kubernetes cluster with... shudder... Windows Server containers on top. There's a mile-long list of deprecations and missing features that were fixed just "last week" (or whatever). That is just too much work to keep up with for mere humans.

        I'm thinking of doing the same kind of customised chatbot but with a scheduled daily script that pulls the latest doco commits, and the Azure blogs, and the open GitHub issue tickets in the relevant projects and dumps all of that directly into the chat context.

        I'm going to roll up my sleeves next week and actually do that.

        Then, then, I'm going to ask the wizard in the machine how to make this madness work.

        Pray for me.
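        A toy sketch of that retrieval step, with plain bag-of-words cosine similarity standing in for a real embedding model and vector database, and invented doc snippets:

```python
# Toy retrieval over doc chunks: vectorize text as word counts, rank
# chunks by cosine similarity to the query, and return the top hits to
# stuff into the chat prompt. A real pipeline would use an embedding
# model and a vector store; the doc snippets below are made up.
import math
import re
from collections import Counter

def vectorize(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=1):
    qv = vectorize(query)
    ranked = sorted(chunks, key=lambda c: cosine(qv, vectorize(c)), reverse=True)
    return ranked[:k]

# Hypothetical chunks from a daily scrape of docs, blogs, and issues.
chunks = [
    "AKS Windows Server node pools require a Windows admin password at creation.",
    "Azure Blob Storage lifecycle rules can tier blobs to the archive tier.",
    "Kubernetes taints and tolerations control which pods land on which nodes.",
]

# The retrieved chunk(s) get prepended to the LLM prompt as context.
context = retrieve("windows node pool on aks", chunks)
```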

        • By elcritch 2025-08-22 2:41

          I just want a service that does this. Pulls in the latest docs into a vector db with a chat or front-end. Not the windows containers bit.

      • By dcminter 2025-08-21 6:46

        This could not possibly go wrong...

        You're braver than me if you're willing to trust the LLM here - fine if you're ready to properly review all the relevant docs once you have code in hand, but there are some very expensive risks otherwise.

        • By awongh 2025-08-21 10:29

          This is LLM as semantic search, so it's way, way easier to start from the basic example code and google to confirm that it's correct than it is to read the docs from scratch and piece together the basic example code, especially for things like configuration and permissions.

          • By dcminter 2025-08-21 11:27

            Sure, if you do that second part of verifying it. If you just get the LLM to spit it out then yolo it into production it is going to make you sad at some point.

        • By simianwords 2025-08-21 7:01

          There’s nothing brave in this. It generally works the way it should and even if it doesn’t - you just go back to see what went wrong.

          I take code from stack overflow all the time and there’s like a 90% chance it can work. What’s the difference here?

          • By jcattle 2025-08-21 7:52

            However, on AWS the difference between "generally working the way it should" and "not working the way it should" can be a $30,000 cloud bill racked up in a few hours with EC2 going full speed ahead mining bitcoin.

            • By simianwords 2025-08-21 7:56

              For those high stakes cases maybe you can be more careful. You can still use an LLM to search and get references to the appropriate place and do your own verification.

              But for low stakes an LLM works just fine; not everything is going to blow up into a $30,000 bill.

              In fact I'll take the complete opposite stance - verifying your design with an LLM will help you _save_ money more often than not. It knows things you don't and has awareness of concepts that you might have not even read about.

          • By dcminter 2025-08-21 7:41

            Well, the "accidentally making the S3 bucket public" scenario would be a good one. If you review carefully with full understanding of what e.g. all your policies are doing then great, no problem.

            If you don't do that will you necessarily notice that you accidentally leaked customer data to the world?

            The problem isn't the LLM it's assuming its output is correct just the same as assuming Stack Overflow answers are correct without verifying/understanding them.

            • By simianwords 2025-08-21 7:52

              I agree, but it's about the extent. I'm willing to accept the risk of occasionally making S3 public but getting things done much faster, much like I don't meticulously read documentation when I can get the answer from stackoverflow.

              If you are comparing with stackoverflow then I guess we are on the same page - most people are fine with taking stuff from stackoverflow and it doesn't count as "brave".

              • By dcminter 2025-08-21 8:09

                I think anyone who just copies and pastes from SO is indeed "brave" for pretty much exactly the same reason.

                > I'm willing to accept the risk of occasionally making S3 public

                This is definitely where we diverge. I'm generally working with stuff that legally cannot be exposed - with hefty compliance fines on the horizon if we fuck up.

                • By simianwords 2025-08-21 8:39

                  That's fair - I would definitely use stackoverflow liberally and dive into documentation when the situation demands it.

              • By awongh 2025-08-21 10:31

                The thing is that you can now ask the LLM for links and you can ask it to break down why it thinks a piece of code, for example, protects the bucket from being public. Things that are easy to verify against the actual docs.

                I feel like this workflow is still less time, easier and less error prone than digging out the exact right syntax from the AWS docs.

    • By reactordev 2025-08-21 2:30

      They'll teach you how for $250 and a certification test...

    • By SOLAR_FIELDS 2025-08-20 16:19

      I honestly don't mind that you have to jump through hurdles to make your bucket publicly available and that it's annoying. That to me seems like a feature, not a bug.

      • By dghlsakjg 2025-08-20 16:39

        I think the OPs objection is not that hurdles exist but that they move them every time you try and run the track.

      • By simonw 2025-08-20 16:36

        Sure... but last time I needed to jump through those hurdles I lost nearly an hour to them!

        I'm still not sure I know how to do it if I need to again.

  • By viccis 2025-08-20 20:29

    >In EC2, you can now change security groups and IAM roles without shutting the instance down to do it.

    Hasn't it been this way for many years?

    >Spot instances used to be much more of a bidding war / marketplace.

    Yeah, because there's no bidding at all any more, which is great: you don't get those super-high price spikes as availability drops, when only the ones who bid super high to make sure they wouldn't be priced out could still get instances.

    >You don’t have to randomize the first part of your object keys to ensure they get spread around and avoid hotspots.

    This one was a nightmare, and it took ages to convince some of my more pig-headed coworkers in the past that they didn't need to do it any more. The funniest part is that they were storing their data as millions and millions of 10-100 KB files, so the S3 backend scaling wasn't the thing bottlenecking performance anyway!

    >Originally Lambda had a 5 minute timeout and didn’t support container images. Now you can run them for up to 15 minutes, use Docker images, use shared storage with EFS, give them up to 10GB of RAM (for which CPU scales accordingly and invisibly), and give /tmp up to 10GB of storage instead of just half a gig.

    This was/is killer. It used to be such a pain to have to manage pyarrow's package size if I wanted a Python Lambda function that used it. One thing I'll add that took me an embarrassingly long time to realize is that your Python global scope is actually persisted, not just the /tmp directory.
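    A minimal illustration of that global-scope persistence, with a counter standing in for an expensive client or dataset (all names here are invented for the example):

```python
# Anything created at module level in a Lambda function (SDK clients,
# loaded pyarrow tables, parsed config) survives across warm invocations
# on the same worker, so expensive setup runs once per container rather
# than once per request.
INIT_COUNT = 0  # stands in for an expensive boto3 client / loaded dataset

def _expensive_setup():
    global INIT_COUNT
    INIT_COUNT += 1
    return {"client": "ready"}  # hypothetical cached resource

# Module scope: executed once per cold start, reused while the worker is warm.
CACHED = _expensive_setup()

def handler(event, context):
    # Reuses CACHED instead of rebuilding it on every invocation.
    return {"init_count": INIT_COUNT, "client": CACHED["client"]}
```

    Calling the handler repeatedly (simulating warm invocations) leaves the setup counter at one.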

    • By Inufu 2025-08-23 18:17

      > You don’t have to randomize the first part of your object keys to ensure they get spread around and avoid hotspots.

      Sorry, this is absolutely still the case if you want to scale throughput beyond the few thousand IOPS a single shard can serve. S3 will automatically reshard your key space, but if your keys are sequential (e.g. a leading timestamp), all your writes will still hit the same shard.

      Source: direct conversations with AWS teams.
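      The workaround being described is the classic prefix-randomization trick: derive a short deterministic hash prefix so sequential keys fan out across partitions instead of hammering one shard (the function name and prefix length are illustrative):

```python
# Prepend a short hash of the key so sequential keys (e.g. leading
# timestamps) spread across S3's internal key-space partitions.
import hashlib

def spread_key(key, prefix_len=4):
    prefix = hashlib.md5(key.encode()).hexdigest()[:prefix_len]
    return f"{prefix}/{key}"

# "2025/08/21/event-0001.json" and "2025/08/21/event-0002.json" get
# unrelated prefixes, so back-to-back writes land on different shards.
```

      The prefix must be deterministic so readers can recompute it from the logical key, which is also why range listing by date becomes awkward once you do this.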

    • By indigodaddy 2025-08-21 15:57

      Re: SGs, yeah, I wasn't doing any cloud stuff when that was the case. I've never had to restart anything for an SG change, and that must be at least 5-6 years now.

      • By buzzdenver 2025-08-23 18:09

        IAM role changes are more recent, though.

  • By jp57 2025-08-20 21:09

    > Glacier restores are also no longer painfully slow.

    I had a theory (based on no evidence I'm aware of, except knowing how Amazon operates) that the original Glacier service operated out of an Amazon fulfillment center somewhere. When you put in a request for your data, a picker would go to a shelf, pick up some removable media, take it back, and slot it into a drive in a rack.

    This, BTW, is how tape backups on timesharing machines used to work once upon a time. You'd put in a request for a tape and the operator in the machine room would have to go get it from a shelf and mount it on the tape drive.

    • By danudey 2025-08-20 22:47

      The most likely explanation is that they used a tape robot, such as the one seen here:

      https://www.reddit.com/r/DataHoarder/comments/12um0ga/the_ro...

      Which is basically exactly what you described but the picker is a robot.

      Data requests go into a queue; when your request comes up, the robot looks up the data you requested, finds the tape and the offset, fetches the tape and inserts it into the drive, fast-forwards it to the offset, reads the file to temporary storage, rewinds the tape, ejects it, and puts it back. The latency of offline storage is in fetching/replacing the cassette and in forwarding/rewinding the tape, plus waiting for an available drive.

      Realistically, the systems probably fetch the next request from the queue, look up the tape it's on, and then process every request from that tape so they're not swapping the same tape in and out twenty times for twenty requests.
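      That grouping step can be sketched as follows (tape IDs and offsets invented for the example):

```python
# Group pending restore requests by the tape they live on, then service
# each tape with a single mount, reading in offset order to minimize
# fast-forwarding and rewinding.
from collections import defaultdict

def batch_by_tape(requests):
    """requests: list of (tape_id, offset, object_id) tuples."""
    by_tape = defaultdict(list)
    for tape_id, offset, object_id in requests:
        by_tape[tape_id].append((offset, object_id))
    # One mount per tape; reads sorted by offset within each tape.
    return {tape: sorted(reads) for tape, reads in by_tape.items()}

queue = [("T7", 900, "a"), ("T3", 100, "b"), ("T7", 50, "c")]
plan = batch_by_tape(queue)
# plan["T7"] serves both of T7's reads with a single mount.
```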

      • By philistine 2025-08-21 0:21

        I've read very definitive discussions on here that Glacier never used tape. It has always been powered off hard disks.

        • By UltraSane 2025-08-21 1:05

          For truly write-once, read-never data, tape is the optimal storage method. It is exactly what the LTO standard was designed for, and it does it very well. You can be confident that you will be able to read every bit of data from a 30-year-old tape, probably even a 50-year-old one. It has the lowest bit error rate of any technology I am aware of: LTO-9 is better than 1 uncorrectable bit error in 10^20 user bits, which is 1 bit error in 12.5 exabytes. There is also the substantial advantage that tapes on a shelf are completely immune to ransomware. As a sysadmin, I get that warm fuzzy feeling when critical data is backed up in a good LTO tape library.
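          A quick check of that figure's arithmetic:

```python
# 1 uncorrectable bit error per 10^20 user bits, expressed in decimal exabytes.
bits = 10 ** 20
bytes_ = bits / 8           # 1.25e19 bytes
exabytes = bytes_ / 1e18    # 12.5 EB between uncorrectable bit errors
```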

          • By dabiged 2025-08-21 1:15

            As someone who does tape recovery on very very old tape I largely concur with this with a couple of caveats.

            1. Do not encrypt your tapes if you want the data back in 30/50 years. We have had so many companies lose encryption keys and turn their tapes into paperweights because the company they bought out 17 years ago had poor key management.

            2. The typical failure case on tape is physical damage not bit errors. This can be via blunt force trauma (i.e. dropping, or sometimes crushing) or via poor storage (i.e. mould/mildew).

            3. Not all tape formats are created equal. I have seen far higher failure rates on tape formats that are repeatedly accessed, updated, ejected, than your old style write once, read none pattern.

          • By count 2025-08-21 1:11

            Call it bad luck, but I’ve never had a fully successful restore. Drives eat tapes, drives are damaged and write bad data, robot arms die or malfunction. Tapes have NEVER worked for me. SANs and remote disk though, rock solid.

            That said, I don’t miss any of that stuff, gimme S3 any day :)

            • By UltraSane 2025-08-21 6:17

              You do realize that that isn't normal at all? LTO tape is still used by thousands of companies to back up many exabytes of data. I know it once saved Google from permanent loss of Gmail data caused by a bug. You should really get a refund for your tape drives.

          • By meepmorp 2025-08-21 4:09

            Aren't LTO formats only backward compatible with the immediate prior version?

            • By UltraSane 2025-08-21 6:20

              They can write one version back and read two versions back. For really long-term data storage you have to also store the read/write hardware.

              • By Dylan16807 2025-08-21 6:54

                By the time it's hard to get a compatible LTO drive, I'd be very suspicious of a mothballed drive working either. If you want reliable long term storage you're going to have to update it every couple decades.

        • By danudey 2025-08-22 18:46

          That's... interesting. I wonder what the wear and tear on an HDD is from spinning it up and powering it back down again.

    • By Twirrim 2025-08-21 0:22

      I can't talk about it, but I've yet to see an accurate guess at how Glacier was originally designed. I think I'm in safe territory to say Glacier operated out of the same data centers as every other AWS service.

      It's been a long time, and features launched since I left make clear some changes have happened, but I'll still tread a little carefully (though no one probably cares there anymore):

      One of the most crucial things in all walks of engineering and product management is learning to manage customer expectations. If you say customers can only upload 10 images, and then allow them to upload 12, they will come to expect that you will always let them upload 12. Sometimes it's really valuable to manage expectations so that you give yourself space for future changes you may want to make. It's a lot easier to go from supporting 10 images to 20 than the reverse.

      • By donavanm 2025-08-21 7:05

        I'm like 90% sure I've seen folks (unofficially) disclose the original storage and API decisions over the years, in roughly accurate terms. Personally I think the multi-dimensional striping/erasure-coding ideas are way more interesting than the “it's just a tape library” speculation/arguments. That, and the real lessons learned around product differentiation as supporting technologies converge.

      • By kelnos 2025-08-21 3:54

        > I can't talk about it, but I've yet to see an accurate guess at how Glacier was originally designed.

        It feels odd that this is some sort of secret. Why can't you talk about it?

        • By Twirrim 2025-08-21 4:01

          I signed NDAs. I wish Glacier was more open about their history, because it's honestly interesting, and they have a number of notable innovations in how they approach things.

          • By Dylan16807 2025-08-21 7:02

            Well assuming your NDA is a reasonable length I hope you talk about it later.

            (And if Amazon is making unreasonable length NDAs I hope they lose a lot of money over it.)

      • By mh- 2025-08-21 2:47

        ..oh. That's clever. Thanks for posting this.

    • By jp57 2025-08-21 21:50

    I think folks have missed what I think would have been clever about the implementation I (apparently) dreamt up. It's not that "it's just a tape library"; it's that it would have used the existing FC and picker infrastructure that Amazon had already built, with some racks containing drives for removable media. I was thinking that it would not have been some special facility purely for Glacier, but rather that one or more regular FCs would just have had some shelves with Glacier media (not necessarily tapes).

      Then the existing pickers would get special instructions on their handhelds: Go get item number NNNN from Row/shelf/bin X/Y/Z and take it to [machine-M] and slot it in, etc.

    • By browningstreet 2025-08-20 21:27

      Yeah, but they've been robotic for decades since.

    • By christina97 2025-08-20 23:14

      They would definitely be using robots, given how uniform hard drives are. The only reason warehouses still have humans is the heterogeneity of products (different sizes, different textures, different squishiness, etc.).

HackerNews