Immutable Infrastructure as Code

Developers and operators came together in the DevOps revolution. Now a new boundary - Docker Containers - delineates these disciplines in a way that allows each to concentrate on their own skills again. Containers enable immutable infrastructure and drive modern continuous delivery…

Inside the container is the developer’s responsibility and outside the container is the operator’s job. This is where we started before the developers and operators united to form the short-lived DevOps revolution. First there was tin: physical servers, routers and other network hardware. We would configure these things manually and stand them up in a datacenter with lots of wiring, cooling and teams of people to keep all the plates spinning.

Then we created virtual representations of these physical things. And started to write software tools to help us manage them. Manually. We’d still jump into these virtual machines using ssh or remote desktop and set them up or adjust them ourselves.

Then we automated some of this with tools like Puppet, Chef and Ansible, trying to eliminate human error and provide some consistency. DevOps was born. But the tools just allowed us to modify infrastructure components in a more predictable way. This is mutation, and it is very hard to do safely - there are too many edge cases and what-if scenarios, making the tools complicated and often fragile.

Then along came Linux Containers and Docker. All of a sudden we had a way to embrace immutability for our applications. Instead of using tools to mutate our application’s environments we could replace them completely with new environments when something needed to change. DevOps died.

Now the operators don’t need to know or care about what’s inside the container. It’s just a container. It has a standard size and shape.

Containers are as revolutionary as their real-world counterparts in the shipping industry. They taught us that instead of mutating a server's configuration, it's safer to treat servers as immutable and to replace them with brand-new instances when something needs to change. When anything needs to change.

A page needs a new button? Create new containers from a new image and replace the running ones. Want your application to run against a new version of Node.js? Modify the Dockerfile, create a new image and spin up new containers. This is “immutable infrastructure as code”.
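As a sketch of what that looks like, here is a minimal Dockerfile for a hypothetical Node.js service (the base image tag and file names are illustrative, not from the original):

```dockerfile
# Illustrative only: a minimal image for a hypothetical Node.js service.
# Bumping the base image tag (e.g. node:18-alpine -> node:20-alpine) is
# the whole "upgrade": build a new image, start new containers, retire
# the old ones. Nothing is mutated in place.
FROM node:20-alpine

WORKDIR /app

# Install dependencies first so this layer is cached between builds
COPY package.json package-lock.json ./
RUN npm ci --omit=dev

# Copy the application source
COPY . .

CMD ["node", "server.js"]
```

Upgrading Node.js is then a one-line change to `FROM`, followed by a rebuild and a rolling replacement of the running containers.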

But we can take this a lot further.

You can divide your infrastructure into components that are either stateful or stateless. In modern scalable web application architectures, state should live in only 2 places - the very front (the browser) and the very back (the database). Everything in between should be stateless so that it can scale horizontally. And all stateless components should be immutable. You should never change them directly. Indeed, the only thing you should be allowed to do is create new ones to replace the old ones. New containers to replace old containers. New servers to replace old servers. New load balancers to replace old load balancers.

This means that if the component is a container, or a VM, you can remove SSH access! And create security rules that forbid traffic on port 22! What? That sounds pretty draconian. What if there’s a problem that I need to jump in and diagnose?
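As a sketch of what “no port 22” looks like in practice, here is a hypothetical AWS security group in Terraform - note that there is simply no ingress rule for SSH (resource names and CIDR ranges are illustrative):

```hcl
# Illustrative Terraform: a security group with no SSH ingress at all.
resource "aws_security_group" "web" {
  name   = "web-no-ssh"
  vpc_id = aws_vpc.main.id

  # Allow HTTP in from inside the VPC; nothing is allowed on port 22
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/16"]
  }

  # Allow all outbound traffic (for pulling images, sending logs, etc.)
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

Because SSH is never opened, nobody can log in and mutate the instance - the only way to change it is to replace it.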

This is where runtime logging and monitoring come in. Instrument everything. Install APM everywhere (New Relic is awesome!). Gather logs and store them for analysis, either now or in the future. When you find a problem, make a change to the source and redeploy.

So what is this source? Well, you can store all the application source code, and all the container configuration, and all the infrastructure scripts in a single GitHub repository (a mono-repo). That repo then contains everything that describes your application, including the environment it runs in and the infrastructure it runs on. And you can make transactional (atomic) commits across all 3 areas (app source, env config, infrastructure), easily keeping them in sync. You get an automatic audit trail of every change that has ever been made to any part of the application and infrastructure.
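A hypothetical mono-repo layout along these lines (all directory and file names here are illustrative) might be:

```
my-app/
├── app/                 # application source code
├── docker/              # Dockerfiles and container configuration
│   └── Dockerfile
└── infrastructure/      # Terraform scripts describing the environment
    ├── main.tf
    └── variables.tf
```

A single commit can touch `app/`, `docker/` and `infrastructure/` together, which is what makes those atomic, cross-cutting changes possible.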

If you use a tool like Terraform to provision the infrastructure, you not only have code you can run to idempotently update your infrastructure, but you can use it in an immutable way (i.e. replacing stateless components). Importantly, you can also recreate any previous version from scratch at a moment’s notice (although it’s often better to “fix forward” rather than “roll back” as it’s now so easy and fast to do).

Terraform is great - you just describe what your infrastructure should look like in code. You can adjust this declaration and then ask Terraform to show you a plan of what it’s going to do. If you’re happy with this you can ask it to apply the change. You can store the resultant state file somewhere like S3 so that you have a record of the change and a starting point for further changes.
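Storing the state file in S3 is a small backend configuration block. A sketch, with a hypothetical bucket and key:

```hcl
# Illustrative: keep Terraform state in S3 so every change is recorded
# and any machine (or CI job) can pick up where the last run left off.
terraform {
  backend "s3" {
    bucket = "my-company-terraform-state" # hypothetical bucket name
    key    = "my-app/terraform.tfstate"
    region = "eu-west-1"
  }
}
```

The day-to-day workflow is then `terraform plan` to preview the change and `terraform apply` to make it, with the resulting state written back to S3.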

Hang on a second - you said “changes”. I thought we weren’t allowed to “change” anything. Everything has to be re-created, doesn’t it? Well, yes and no. Terraform makes a decision about what it can adjust and what it needs to re-create. Servers, for example, would be recreated. I actually think that the default way of using Terraform can be improved upon…

What if every change to the infrastructure requires a new instance of the whole infrastructure to be created? At Red Badger, we have a set of Terraform scripts that can create a VPC in AWS with 2 public and 4 private subnets across 2 availability zones, an Internet gateway, NAT servers, route tables, security groups, multiple load balanced ECS clusters of autoscaled EC2 instances, with private DNS for service discovery. And it can stand all that infrastructure up in less than 3 minutes. And it can destroy it again in less than 2 minutes. So we can afford to create a whole new infrastructure every time we need to make a “change”. (When I started my software career in 1988 it took 3 minutes to spin up a 40MB CDC hard drive that was the size of a washing machine - now I can spin up a whole datacentre in the same time).
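A heavily simplified sketch of the very top of such a stack - the CIDR ranges and names are illustrative, and the real scripts described above layer NAT, route tables, security groups, ECS clusters and DNS on top of this:

```hcl
# Illustrative: the foundation of a VPC stack. The full version repeats
# the subnet pattern across 2 availability zones (2 public, 4 private).
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_internet_gateway" "gw" {
  vpc_id = aws_vpc.main.id
}

resource "aws_subnet" "public_a" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "eu-west-1a"
}

resource "aws_subnet" "private_a" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.101.0/24"
  availability_zone = "eu-west-1a"
}
```

Because the whole stack is just declarations like these, `terraform apply` can stand it all up, and `terraform destroy` can tear it all down, in minutes.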

So now the whole infrastructure is immutable. You can test new versions thoroughly before gradually moving users over, 1 percent at a time.

Because it’s so cheap to stand up completely new infrastructure, we can afford not only permanent environments but also ephemeral ones: environments created on demand for specific tasks, like load and performance testing, or penetration testing (because you know these versioned ephemeral environments are going to be absolutely identical to each other). We can even create environments for testing features automatically when a pull request is opened, and destroy them automatically when the pull request is closed.
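One way to sketch this is to parameterise the Terraform stack on an environment name, so a CI job can create and destroy a whole environment per pull request (the variable name and tagging scheme here are hypothetical):

```hcl
# Illustrative: name every environment, so a CI job can run
#   terraform apply  -var="env_name=pr-123"   when a PR opens, and
#   terraform destroy -var="env_name=pr-123"  when it closes.
variable "env_name" {
  description = "Unique name for this environment, e.g. pr-123 or staging"
  type        = string
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"

  tags = {
    Name = "vpc-${var.env_name}"
  }
}
```

Every environment built this way is created from the same code, which is exactly why they are identical to each other.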

And we can evolve all these environments, in an immutable way, by creating new ones. All the time.

No configuration drift, no outdated operating systems or environmental software, no server rot, no surprises, no human error. Instead, you can trickle changes into the infrastructure using continuous delivery. Exactly the same as for your application itself. It’s the same thing. It’s just code, being deployed continuously from a single repo into the big wide world.

Artwork by Nathalie Goepel