Terraform

Terraform is an "infrastructure as code" tool written in Go.  I use it to automate spinning up extra streaming capacity during peak times and to run live video transcoders (taking a single high-quality video input and generating multi-rate output, as well as simulcasting to Facebook and YouTube).  I have also implemented Terraform code for adding extra VOD transcoding capacity.

Behind the scenes, our automation is handled by a combination of Terraform code, PHP scripts and cron (on the controller) and shell scripts (on the instances).  We also use Pushover to notify us when VMs are created and to send high-priority, confirmation-required alerts when there is a failure (more on that later).  On Linode, our typical provisioning time is about 90 seconds.  On Vultr and Hetzner, it is usually under 150 seconds.  This is total provisioning time, including the external provider/API actions (create VM, install CentOS 7, boot, update DNS) and our own actions (install OS updates and any extra repositories, install our custom packages and code, start services).  We don't use Docker or any similar container system, as it tends to massively increase provisioning time without any tangible benefit.

Problems

Not all Terraform "providers" are the same, even though we are mostly using base-level functionality that is not specific to any one service.  This requires quite a few differences in our initialization scripts between providers.

Differences in Terraform providers

The Linode provider we are using exposes the IP address as ip_address, whereas Vultr and Hetzner use ipv4_address.  This is likely due to older code in the Linode provider.  It's a minor inconvenience, but it breaks scripts if not accounted for.
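One way to keep this difference out of the rest of the scripts is a small lookup that maps each provider to its attribute name.  The following is only a sketch in Python: the provider labels, the function name and the shape of the attributes mapping are assumptions for illustration, and only the two attribute names come from the text above.

```python
# Attribute names as described above; the provider labels used as keys are
# hypothetical and would need to match whatever the calling scripts use.
IP_ATTRIBUTE = {
    "linode": "ip_address",
    "vultr": "ipv4_address",
    "hetzner": "ipv4_address",
}


def instance_ip(provider, attributes):
    """Return the public IPv4 address from a resource's attribute mapping.

    `attributes` is assumed to be a plain dict of the resource's exported
    attributes, however the caller obtained them.
    """
    try:
        key = IP_ATTRIBUTE[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider}") from None
    ip = attributes.get(key)
    if not ip:
        raise ValueError(f"{provider} instance has no {key} attribute")
    return ip
```

With this in place, a provider that renames its attribute later only needs a one-line change in the table.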

Because these are different companies with different service offerings, we implemented our own methods for handling data center regions and plans.

For our purposes, we key on airport codes in the rough area of the data centers.  When multiple providers match a location, we set priority based on our internal ranking of each provider's network quality at that location.  When we get failover working properly, we will probably want to add either a geographic lookup by distance or a custom algorithm based on expected ping times between data centers.
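The airport-code lookup described above can be sketched as a table of ranked providers per location.  This is an illustrative Python sketch only: the table contents are made-up placeholders rather than actual rankings, and the function names are hypothetical.

```python
# Hypothetical ranking table: airport code -> providers, best network first.
# The entries are placeholders for illustration, not real rankings.
REGION_PRIORITY = {
    "LAX": ["vultr", "linode"],
    "FRA": ["hetzner", "linode", "vultr"],
}


def providers_for(airport_code, unavailable=()):
    """Return providers serving a location, in priority order.

    Providers listed in `unavailable` (for example, ones that just failed)
    are skipped.
    """
    ranked = REGION_PRIORITY.get(airport_code.upper(), [])
    return [p for p in ranked if p not in unavailable]


def pick_provider(airport_code, unavailable=()):
    """Return the highest-priority available provider, or None if none match."""
    candidates = providers_for(airport_code, unavailable)
    return candidates[0] if candidates else None
```

Passing the failed provider in `unavailable` gives the next-best choice for the same location, which is one possible starting point for the failover discussed later.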

We key on the number of CPU cores for standard plan types.  This isn't precise, but it is close enough for our purposes.  For video encoding, Vultr tends to perform quite a bit better because x264 takes advantage of specific features in their newer Skylake processors.
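Keying plans on core count could look roughly like the sketch below.  The plan IDs are invented placeholders (real provider plan identifiers differ), and choosing the smallest plan that meets the requested core count is an assumption about the desired behaviour.

```python
# Hypothetical plan catalogue: provider -> {CPU cores: plan ID}.
# The plan IDs are placeholders, not real provider identifiers.
PLANS = {
    "linode": {1: "linode-plan-1c", 2: "linode-plan-2c", 4: "linode-plan-4c"},
    "vultr": {2: "vultr-plan-2c", 4: "vultr-plan-4c", 8: "vultr-plan-8c"},
}


def plan_for(provider, min_cores):
    """Return the smallest plan at `provider` with at least `min_cores` cores."""
    catalogue = PLANS[provider]
    eligible = [cores for cores in catalogue if cores >= min_cores]
    if not eligible:
        raise LookupError(f"{provider} has no plan with {min_cores}+ cores")
    return catalogue[min(eligible)]
```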

No one has written a Terraform provider for Constellix yet.  For speed of implementation, we use a simple custom-written PHP CLI script to change DNS.  Ideally we would write a provider ourselves and release it as open source, but developer time is very limited.

Unsolved problems

Failover

We often spin up multiple VMs simultaneously at different providers and data centers.  Oddly, Terraform doesn't provide a good way of determining which instance(s) have failed to start.  When terraform apply completes, we check its return code for errors and send a Pushover notification containing the last 3 lines of the log.  This gives us a count of the instances changed, broken down by added, removed or failed, but no clear-cut way of telling which ones failed or why.  Although failure is uncommon, the most common cause is lack of availability of large instances in the Vultr LAX region.  For now, we respond to these failures manually by changing the region ID.  Ideally, we would catch these errors and switch providers as well as data centers in the event of either capacity issues or an API failure at a particular provider.
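The log-tail notification, plus one rough way to narrow down which instances are missing, can be sketched as follows.  This is a Python illustration rather than our actual PHP, and all function names are hypothetical.  The list of instances actually present is assumed to come from somewhere like the output of terraform state list; a partially created resource may still appear in state, so the comparison is only a first approximation.

```python
def log_tail(log_text, count=3):
    """Return the last `count` non-blank lines of a log as one string."""
    lines = [line for line in log_text.splitlines() if line.strip()]
    return "\n".join(lines[-count:])


def missing_instances(expected, present):
    """Return the expected instance names that are not present, sorted."""
    return sorted(set(expected) - set(present))


def failure_message(return_code, log_text, expected, present):
    """Build an alert body, or return None when apply exited cleanly."""
    if return_code == 0:
        return None
    missing = missing_instances(expected, present)
    summary = ", ".join(missing) if missing else "unknown"
    return f"apply failed (exit {return_code}); missing: {summary}\n" + log_tail(log_text)
```

The returned string would then be handed to whatever sends the high-priority alert.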

Overlap with timed shutdown

Our transcoders often contain data we want to keep.  When we start a VM (initiated, depending on the purpose, by the client, by our automation systems or by a schedule set by the client), we set some internal metadata with an expiration.  For live transcoders, we record the incoming stream to the local disk of the VM.  When broadcasting stops, this recording is uploaded back to the client's account.  If the client forgets to stop the broadcast, or the upload takes too long or fails, we need a mechanism to keep the VM alive, stop the broadcast, start and complete the upload, or allow manual intervention for errors not fully handled by automation.  Our VMs are named for the client.  I may be missing something here, but renaming the VM results in tainting (destroy / recreate).  Timing can be an issue, so we can't simply leave the VM running (with services shut down), as it could conflict with the next service start time.
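One possible shape for the decision at expiration time is sketched below in Python.  Nothing here reflects an existing implementation: the state flags, the action names and the 30-minute grace period are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def shutdown_action(now, expires_at, broadcasting, upload_pending,
                    grace=timedelta(minutes=30)):
    """Decide what to do with a transcoder VM (hypothetical action names).

    now, expires_at: timezone-aware datetimes.
    broadcasting:    True while the incoming stream is still live.
    upload_pending:  True until the recording has been uploaded.
    grace:           assumed extra time allowed for the upload to finish.
    """
    if now < expires_at:
        return "keep"
    if broadcasting:
        # Client forgot to stop: end the broadcast so the upload can start.
        return "stop_broadcast"
    if upload_pending:
        if now < expires_at + grace:
            return "wait_for_upload"
        # Upload is taking too long or has failed: ask a human.
        return "alert_operator"
    return "destroy"
```

Running a check like this from cron before any destroy would keep the VM alive while data is still at risk, although it does not by itself solve the naming conflict with the next scheduled start.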