Self-hosting a NAT Gateway

2025-11-17 17:31 | www.awsistoohard.com

Society would have you believe that self-hosting a NAT Gateway is “crazy”, “irresponsible” and potentially even “dangerous”. But in this post I hope to shed some light on why someone would go down this path, the benefits, and my experience implementing this in a real engineering organization.

What even is a NAT Gateway?

It's important to start with why. Why would someone even think about replacing a core part of AWS infrastructure? What does a NAT Gateway even do? For those unfamiliar, a NAT Gateway acts as a one-way door: it lets resources in your private subnet reach the internet without allowing connections to be initiated from the outside. This is an important part of good network design. If inbound traffic were allowed, it would pose a massive security issue - anyone on the internet could reach your internal services. A NAT Gateway is a bouncer at a club - but this club only lets people out; no one can enter.
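
To make the one-way door concrete, here is a toy model of the translation table a NAT device keeps. It is a sketch for intuition only, not how AWS implements NAT Gateway: the class name, the port allocation scheme, and the example addresses are all made up for illustration.

```python
class ToyNat:
    """Toy NAT: remembers outbound flows and translates replies back."""

    def __init__(self, public_ip, first_port=1024):
        self.public_ip = public_ip
        self.next_port = first_port
        self.port_for_flow = {}  # (private_ip, private_port) -> public port
        self.flow_for_port = {}  # public port -> (private_ip, private_port)

    def outbound(self, private_ip, private_port):
        """A private host sends a packet out; return the public source it appears as."""
        flow = (private_ip, private_port)
        if flow not in self.port_for_flow:
            # First packet of a new flow: allocate a public port and remember it.
            self.port_for_flow[flow] = self.next_port
            self.flow_for_port[self.next_port] = flow
            self.next_port += 1
        return (self.public_ip, self.port_for_flow[flow])

    def inbound(self, public_port):
        """A packet arrives from the internet; return the private target, or None."""
        return self.flow_for_port.get(public_port)


nat = ToyNat("203.0.113.10")
print(nat.outbound("10.0.1.5", 44321))  # the flow is given a public address and port
print(nat.inbound(1024))                # a reply to that flow is translated back
print(nat.inbound(5555))                # no matching flow, so there is no target
```

The last call is the "no one can enter" half: an unsolicited packet has no table entry to match. On a general-purpose router, actually dropping such packets is the job of the firewall rules that sit alongside the NAT.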

[Figure: NAT Gateway diagram]

The problem that this creates is a bottleneck - your internal services have to talk to the internet (think any API call ever). Your entire infrastructure relies on the NAT Gateway to handle outbound internet traffic.

AWS has entered the chat

AWS is primed for this - folks need a high availability, high uptime NAT Gateway in order to function. And due to this requirement they can charge (in my opinion) an exorbitant amount to provide this service. What are you going to do? They can guarantee that this critical piece of infrastructure will scale & be highly available while your ChatGPT wrapper blows up!

DevOps & Infrastructure engineers know the pain of seeing the NAT Gateway hours & NAT Gateway bytes line items on the AWS bill, with society breathing down your neck saying “There's nothing you can do about it” and “Think of it as the cost of doing business”. To them I say: you’re wrong, you can do anything you set your mind to.

Why would you even think of this?

Before diving into my implementation, I think it's important to state that this is not a one-size-fits-all solution. I recently worked with Vitalize to speed up some of their GitHub Actions. We decided to self-host GitHub runners in their private subnet, along with a very robust and deep set of integration tests that run on every PR. Because of this, the dominant cost tended to be NAT Gateway bytes, as there was an enormous amount of traffic going through their private subnets.

This was the major motivation for exploring alternatives. We still run NAT Gateways in production (for now), but in lower-risk environments with major cost upside, the ability to delete potentially 10-15% of your daily AWS bill is quite appetizing (depending on how much NAT Gateway costs contribute to your overall spend).

Options

So you've made it this far - now it's time to start shopping around. The good news is that there are heroes in the open source community who have done most of the hard parts for you. In my research, I came across two major options.

Option 1: Fck-NAT

This is the main alternative that folks find when first looking this up. Essentially this boils down to a purpose-built AMI that Andrew Guenther has been maintaining. There are some limitations, as noted in the public docs, but in general it is quite straightforward. There is a Terraform module that makes things fairly intuitive to set up, which I dive into deeper under the implementation section below.

Option 2: AlterNAT

This is another alternative that I thought deserved a special mention. Maintained by Chime, this is a much more in-depth and ‘production’ alternative to Fck-NAT.

As mentioned above, your entire infrastructure relies on the NAT Gateway, so a self-hosted one (running on an EC2 instance) becomes a single point of failure if anything happens to that instance. The way that AlterNAT/Chime has gotten around that is quite clever (and complex). From my rudimentary understanding, they use a mix of instances across availability zones (similar to Fck-NAT) to get ahead of downtime in a certain AZ. But they take it a step further by also employing Lambdas that constantly poll to ensure the EC2 instance is behaving as expected. In conjunction with standby NAT Gateways, this allows you to quickly fail over to the AWS-managed NAT Gateway if an EC2 instance ever fails. While this will not result in zero downtime, it can drastically reduce any disruption by automatically updating route tables.
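
The failover idea can be sketched as a small decision function. This is only an illustration of the approach described above, not AlterNAT's actual code: the function name, the failure threshold, and the target labels are all hypothetical, and a real implementation would call the AWS APIs to probe connectivity and replace the route.

```python
def choose_route_target(recent_checks, current_target, max_failures=2):
    """Pick where the private subnet's default route should point.

    recent_checks: list of booleans, newest last, one per health probe
                   of the NAT instance (True = probe succeeded).
    current_target: "nat-instance" or "nat-gateway" (hypothetical labels).
    """
    # Count consecutive failures at the end of the probe history.
    failures = 0
    for ok in reversed(recent_checks):
        if ok:
            break
        failures += 1

    if failures >= max_failures:
        # The instance looks unhealthy: fall back to the standby managed gateway.
        return "nat-gateway"
    # Otherwise keep whatever is in place. Failing back is left to an operator
    # or to instance replacement in this sketch.
    return current_target


print(choose_route_target([True, True, False], "nat-instance"))   # one failure: stay put
print(choose_route_target([True, False, False], "nat-instance"))  # two in a row: fail over
```

Requiring a few consecutive failures before switching is a design choice in this sketch: it avoids flapping the route on a single dropped probe, at the cost of a slightly longer outage when the instance really is down.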

[Figure: AlterNAT network diagram]

I encourage folks to look at this repo as it is quite feature-rich. I’ve also included their network diagram. We did not end up using it since it was a bit overkill for the objective we had. Additionally, it relies on standby NAT Gateways, which I was trying to fully eliminate. If I ever rolled this out to production, this would be the approach I would take.

Implementation

Since this was primarily an exercise in cost cutting, I decided to go with Fck-NAT. If this were a production environment, the fallback mechanism and robustness of AlterNAT would be much more appealing. But in this case I truly wanted to delete the NAT Gateway cost completely from our development environment.

I ended up going with the official Terraform module suggested by Fck-NAT. You can see an excerpt from our network module below.

module "fck_nat" {
  source  = "RaJiska/fck-nat/aws"
  version = "1.3.0"
  count   = var.use_fck_nat ? 2 : 0

  name          = "${var.company_name}-fck-nat-${count.index + 1}"
  vpc_id        = aws_vpc.main.id
  subnet_id     = module.subnets.public_subnet_ids[count.index]
  instance_type = var.fck_nat_instance_type

  tags = {
    Name        = "${var.company_name}-fck-nat-${count.index + 1}"
    Environment = var.env
  }
}

We deployed two t4g.nano instances. The cutover caused about 15-30 seconds of downtime in our development environment; we did it in the middle of the night to avoid any angry devs.

Results

In our case, the results were quite dramatic. To start, we were able to cut NATGateway-Hours by 50%: we maintain a development and a production environment, and we fully killed the NAT Gateway in dev:

[Figure: NATGateway-Hours cost results]

But the more surprising, and dramatic, cost savings were in NATGateway-Bytes. As mentioned, in this case we had self-hosted GitHub runners and preview environments that pushed a lot of traffic when developers were active. During the week, we would routinely see upwards of $30-$40 of traffic per day. After rolling out this change, the highest we’ve seen is about $6.

In this case, I think a lot of this was driven by two main factors:

  1. Every PR we had would create a preview environment that would then run a whole suite of Playwright tests. This would run for every PR, for every commit. Though the compute overhead was minimal since the tests were not very demanding, I believe the amount of traffic they generated contributed to this.
  2. I believe the main cost from the self-hosted runners was actually streaming the logs back to GitHub. I spot-checked a few of our tests (unit, integration, etc.), and almost every single log file I downloaded from GitHub was ~40-50 MB in size. Doing some math, about 5-6 tests per commit per PR means about 250 MB per commit, and assuming the average PR has about 5 commits, that's about 1.25 GB of data being streamed back to GitHub (and through our AWS NAT Gateway) per PR. That can easily start adding up, and I believe it also contributed to our high costs.
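
The back-of-the-envelope math from the second point, written out. The inputs are the rough figures quoted above (log size, tests per commit, commits per PR); the function name is just for illustration.

```python
def log_traffic_per_pr_gb(log_mb, tests_per_commit, commits_per_pr):
    """Estimate GB of logs streamed out per PR (using 1 GB = 1000 MB)."""
    mb_per_commit = log_mb * tests_per_commit
    return mb_per_commit * commits_per_pr / 1000


# Round numbers from the list above: ~50 MB logs, 5 tests per commit, 5 commits per PR.
print(log_traffic_per_pr_gb(50, 5, 5))  # 1.25

# Low and high ends of the quoted ranges (40-50 MB logs, 5-6 tests per commit).
print(log_traffic_per_pr_gb(40, 5, 5))  # 1.0
print(log_traffic_per_pr_gb(50, 6, 5))  # 1.5
```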

[Figure: NATGateway-Bytes cost results]

Another interesting data point that might be relevant for anyone thinking about implementing this: in our implementation, as mentioned, we went with two t4g.nano instances. During the week, we would see peaks of 800GB-900GB of traffic daily. But these two instances have been able to easily handle this load, with no degradation that can be felt by developers.
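
As a rough sanity check, that traffic volume lines up with the daily NATGateway-Bytes charges quoted earlier. The sketch below assumes a data processing price of $0.045 per GB, which is the published NAT Gateway rate in us-east-1 at the time of writing; that this rate applies to this particular account and region is an assumption, so treat the output as an estimate.

```python
# Assumed NAT Gateway data processing price (USD per GB); it varies by region.
PRICE_PER_GB = 0.045


def daily_processing_cost(gb_per_day, price_per_gb=PRICE_PER_GB):
    """Estimate the daily NATGateway-Bytes charge for a given traffic volume."""
    return gb_per_day * price_per_gb


for gb in (800, 900):
    print(f"{gb} GB/day -> ${daily_processing_cost(gb):.2f}/day")
```

At the assumed rate, 800-900 GB per day works out to roughly $36 to $40 per day, in the same range as the $30-$40 daily charges seen before the change.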

[Figure: data transfer results]

In total, across these two major costs, we've seen roughly a 70% reduction in NAT Gateway costs, which has been quite impactful for our total daily spend in this organization.

Conclusion

It may not be for all organizations, but if you find yourself bleeding money into NAT Gateways, and you happen to have environments where the stakes are low (e.g. a development or staging environment), self-hosting a NAT Gateway is a lot simpler than you'd expect. The open source community has made this really simple with out-of-the-box Terraform modules.

Sometimes, society can be wrong. Change requires risk takers - bold humans who choose not to listen to the status quo. You only live once - self-host a NAT Gateway.



Comments

  • By Arch-TK 2025-11-22 6:20 (3 replies)

    The article seems to perpetuate one of those age old myths that NAT has something to do with protection.

    Yes, in a very superficial sense, you can't literally route a packet over the internet backwards to a host behind NAT without matching a state entry or explicit port forwarding. But implementing NAT on its own says nothing about the behavior of your router firewall with regards to receiving Martians, or with regards to whether the router firewall itself accepts connections and if the router firewall itself isn't running some service which causes exposure.

    To actually protect things behind NAT you still need firewall rules and you can keep those rules even when you are not using NAT. Thus those rules, and by extension the protection, are separable from the concept of NAT.

    This is the kind of weird argument that has caused a lot of people who hadn't ever used IPv6 to avoid trying it.

    • By mzhaase 2025-11-22 10:14 (2 replies)

      If you think about it, NAT offers pretty much the same protection as a default stateful firewall. Only allowing packets from the outside related to a connection initiated from the inside.

      • By lloeki 2025-11-22 10:33 (1 reply)

        > Only allowing packets from the outside related to a connection initiated from the inside.

        NAT a.k.a IP masquerading does not do that, it only figures out that some ingress packets whose DST is the gateway actually map to previous packets coming from a LAN endpoint that have been masqueraded before, performs the reverse masquerading, and routes the new packet there.

        But plop in a route to the network behind and unmatched ingress packets definitely get routed to the internal side. To have that not happen you need to drop those unmatched ingress packets, and that's the firewall doing that.

        Fun fact: some decade ago an ISP where I lived screwed that up. A neighbour and I figured out the network was something like that:

            192.168.1.x --- 192.168.1.1 --
                                          \
                                           10.0.0.x ----> WAN
                                          /
            192.168.2.x --- 192.168.2.1 --
        
        192.168.1 and 192.168.2 would be two ISP subscribers and 10.0.0.x some internal local haul. 192.168.x.1 would perform NAT but not firewall.

        You'd never see that 10.0.0.x usually as things towards WAN would get NAT'd (twice). But 10.0.0.x would know about both of the 192, so you just had to add respective routes to each other in the 192.168.x.1 and bam you'd be able to have packets fly through both ways, NAT be damned.

        Network Address Translation is not a firewall and provides no magically imbued protection.

        • By grosswait 2025-11-22 13:17 (3 replies)

          I have never seen a NAT implementation that forwarded every packet sent to it. As you stated in your first sentence, NAT forwards packets that match previous packets. Assuming it does that job well, that’s filtering right there.

          • By dijit 2025-11-22 15:06

            It's pretty common to have the NAT gateway also be a stateful firewall (you’re tracking state, after all) but they’re not the same and you can have one without the other.

            It's just uncommon in consumer or prosumer devices.

            A similar allegory is perhaps industrial washing machines vs consumer ones or that printer/scanner combos are common (even in offices) but print shops and people who actually need a lot of paper would have dedicated equipment that does either scanning or copying better.

            It’s also like a leatherman, they all have some commonality (the need to be gripped) so there's a lot of combination; but a tradie would only use one as a last resort- often preferring a proper screwdriver.

      • By eqvinox 2025-11-22 16:06

        > NAT offers pretty much the same protection as a default stateful firewall

        Most NAT requires itself to include a stateful firewall; it's the same thing as the NAT flow table. This whole trope is mostly getting into people's heads to not forget about actually configuring that "free" firewall properly, since it'll just be a poor one otherwise.

    • By gldrk 2025-11-22 6:52 (1 reply)

      >Yes, in a very superficial sense, you can't literally route a packet over the internet backwards to a host behind NAT without matching a state entry or explicit port forwarding.

      Don’t forget source routing. That said, depending on your threat model, it’s not entirely unreasonable to just rely on your ISP’s configuration to protect you from stuff like this, specifically behind an IANA private range.

      • By sedawkgrep 2025-11-22 17:49

        I don't think source routing is a thing anymore. At least if you're talking about the ability of a source to specify a path to its destination.

        The last time I heard about source routing actually being a useful feature or a vulnerability used by hackers was the 1990s.

    • By globular-toast 2025-11-22 9:35 (1 reply)

      Yeah, I keep meaning to write something about this. I've definitely noticed people wary of IPv6 because their machines get "real" IP addresses rather than the "safe" RFC1918 ones. Of course, having a real IP address is precisely the point of IPv6.

      It's like we've been collectively trained to think of RFC1918 as "safe" and forgotten what a firewall is. It's one of those "a little knowledge is a dangerous thing" things.

      • By sshine 2025-11-22 10:53 (2 replies)

        In a world where people think NAT addresses are safe because you don’t need to know anything else about firewalls, IPv6 _is_ fundamentally less secure.

        • By throw0101a 2025-11-22 14:06 (2 replies)

          > In a world where people think NAT addresses are safe because […]

          The vast, vast majority of people do not know what NAT is: ask your mom, aunt, uncle, grandma, cousin(s), etc. They simply have a 'magic box' (often from the ISP) that "connects to Internet". People connect to it (now mostly via Wifi) and they are "on the Internet".

          They do not know about IPv4 or IPv6 (or ARP, or DHCP, or SLAAC).

          As long as the magic box is statefully inspecting traffic, which is done for IPv4-NAT, and for IPv6 firewalls, it makes no practical difference which address family you are using from a security perspective.

          The rending of garments over having a globally routable IPv6 address (but not globally reachable, because of SPI) on your home is just silliness.

          If you think NAT addresses are safe because… of any reason whatsoever really… simply shows a lack of network understanding. You might as well be talking to a Flat Earther about orbital mechanics.

          • By mqus 2025-11-22 14:43 (2 replies)

            > which is done for IPv4-NAT, and for IPv6 firewalls

            Are internet routers that do ipv4 NAT usually also doing an IPv6 firewall (meaning they only let incoming connections in if they are explicitly allowed by some configuration)? Maybe that's the point where the insecurity comes from. A home NAT cannot work any other way (it fails "safely"), while a firewall being absent usually means everything just gets through.

            • By throw0101a 2025-11-22 15:39

              > Are internet routers that do ipv4 NAT usually also doing an IPv6 firewall (meaning they only let incoming connections in if they are explicitly allowed by some configuration)?

              Consider the counter-factual: can you list any home routers/CPEs that do not do SPI, regardless of protocol? If someone found such a thing, IMHO there would be a CVE issued quite quickly for it.

              And not just residential stuff: $WORK upgraded firewalls earlier in 2025, and in the rules table of the device(s) there is an entry at the bottom that says "Implicit deny all" (for all protocols).

              So my question to NAT/IPv6 Truthers is: what are the devices that allow IPv6 connections without SPI?

              And even if such a thing exists, a single IPv6 /64 subnet is as large as four billion (2^32) IPv4 Internets (2^32 addresses): good luck trying to find a host to hit in that space (RFC 7721).

            • By globular-toast 2025-11-22 15:13

              All the ones I've had have had a firewall by default for IPv4 and IPv6, yes. If ISPs are shipping stuff without a firewall by default I'd consider that incompetence given people don't understand this stuff and shitty IoT devices exist.

              I do wonder how real the problem is, though. How are people going to discover a random IPv6 device on the internet? Even if you knew some /64 is residential it's still impractical to scan and find anything there (18 quintillion possible addresses). If you scanned an address per millisecond it would take 10^8 years, or about 1/8 the age of the earth, to scan a /64.

              Are we just not able to think in such big numbers?

          • By thayne 2025-11-22 17:00 (2 replies)

            There is one practical difference. IPv6 without a NAT exposes information about different devices inside the private network. A NAT (whether ipv4 or ipv6) will obfuscate how many devices are on the network. Whether that is desirable depends on the circumstances.

            • By throw0101a 2025-11-22 23:36

              > A NAT (whether ipv4 or ipv6) will obfuscate how many devices are on the network. Whether that is desirable depends on the circumstances.

              "Revisiting IoT Fingerprinting behind a NAT":

              * https://par.nsf.gov/servlets/purl/10332218

              "Study on OS Fingerprinting and NAT/Tethering based on DNS Log Analysis":

              * https://www.irtf.org/raim-2015-papers/raim-2015-paper21.pdf

              Also:

              > […] In this paper, we design an efficient and scalable system via spatial-temporal traffic fingerprinting from an ISP’s perspective in consideration of practical issues like learning-testing asymmetry. Our system can accurately identify typical IoT devices in a network, with the additional capability of identifying what devices are hidden behind NAT and the number of each type of device that share the same IP address. […]

              * https://www.thucloud.com/zhenhua/papers/TON'22%20Hidden_IoT....

              Thinking you're hiding things because you're behind a NAT is security theatre.

            • By labcomputer 2025-11-22 17:46

              > IPv6 without a NAT exposes information about different devices inside the private network.

              In practice this has not been true for over 20 years.

              IPv6 devices on SLAAC networks (which is to say, almost all of them) regularly rotate their IPv6 address. The protocol also explicitly encourages (actually, requires) hosts to have more than one IPv6 address active at any given time.

              You are also making a wrong assumption that the externally visible address and port ranges chosen by the NAT device do not make the identity of internal devices easily guessable.

        • By zamadatix 2025-11-22 11:49

          In both cases the only consumer security comes from "the home router defaults to being a stateful firewall". The only difference between the two is whether it also defaults to doing NAT with that state, which is not what was making IPv4 secure for people unaware either.

  • By kenrose 2025-11-22 3:26 (1 reply)

    We did this at OpsLevel a few years back. Went from AWS managed NAT gateway to fck-nat (Option 1 in the article).

    It’s a (small) moving part we now have to maintain. But it’s very much worth the massive cost savings in NATGateway-Bytes.

    A big part of OpsLevel is we receive all kinds of event and payload data from prod systems, so as we grew, so did our network costs. fck-nat turned that growing variable cost into an adorably small fixed one.

  • By tonymet 2025-11-22 2:50 (2 replies)

    In AWS you can use IPv6 with either security groups or EIGW to avoid NAT fees altogether (you still pay transfer fees).

    Death, taxes and transfer fees

    • By t0mas88 2025-11-22 7:40 (1 reply)

      That's quite recent. There was some time after AWS started charging for IPv4 addresses where you could not realistically go for an IPv6-only setup behind CloudFront because it would, for example, not connect to a v6-only origin.

      This is probably a result of all AWS services being independent teams with their own release schedule. But it would have made sense for AWS to coordinate this better.

      • By tonymet 2025-11-22 17:41

        You’re right, IPv6 has compatibility issues. But instances needing a NAT gateway (no public IP) are often good candidates for IPv6 egress.

    • By mannyv 2025-11-22 7:39

      Moving to IPv6 works until it doesn't.
