Terraform Revisited

It has now been several months since our Terraform-based scaling system has needed any manual intervention.

A Recap

Among other things, we use Terraform to scale CDN edge capacity, as well as custom compute instances for live video transcoding.

We use multiple cloud providers with overlapping locations, with preferred data centers for each location.  The preference is based on a combination of factors, including transit providers at the location, long-term reliability, and cost.  In most cases we use Linode, Vultr, or Hetzner, although we are configured to work with additional providers.

Linode is generally our top choice, primarily because our data transfer usage is high and they pool all allocated transfer.  Linode's host machines are on 40G links, all VMs include a fair amount of data transfer (per hour/month), and the per-VM port speeds are also generous.

We do have issues with Linode, as CPU steal is sometimes high.  This is mostly an issue with their standard plans.  In the past, I've also been disappointed with the N Cal data center, due to both power problems and transit (at the time, mostly HE).

We have seen times when steal bumps up a bit, even on their dedicated CPU instances.  I suspect these are sold by the thread, not by the core, so on very rare occasions you may end up on a host that is heavily loaded.

Commands to start and stop instances can be triggered by our clients, by our backend systems or through scheduling.  All of this is implemented through our own scripts.
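For illustration, the entry point looks roughly like the sketch below.  The script name, paths, and arguments here are hypothetical, not our actual code, but every trigger (client, backend, or schedule) ends up calling something of this shape.

    #!/bin/sh
    # Rough shape of the entry point (names and arguments are hypothetical).
    # Every trigger funnels into a script like this, which regenerates the
    # .tf files for the requested capacity and lets terraform reconcile.

    ROLE=$1     # e.g. "transcode" or "edge"
    COUNT=$2    # desired number of instances

    ./generate-tf.sh "$ROLE" "$COUNT"   # hypothetical generator script
    terraform apply -auto-approve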

Troubleshooting and Improvements

When we first started offering services that included live transcoding, we occasionally got complaints of video breakup, dropped frames, etc.  It didn't take long to confirm that this was caused by steal (oversold CPU).  Buying a larger plan doesn't work, as a single low-performance core breaks things.  For most applications, slight inconsistencies in performance aren't a big problem; for video transcoding, they make a huge difference.

When we first started doing this, Linode didn't offer 'Dedicated CPU' plans, either.

The first step was a simple warning.  I added a method in the provisioning scripts to run mpstat and send an alert to my devices through Pushover if the steal rate at provisioning time was too high (>1%).  That 1% steal has virtually no effect on a web server, but these aren't web servers.
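A minimal sketch of that check is below.  It assumes sysstat's mpstat is installed and that PUSHOVER_TOKEN and PUSHOVER_USER are set in the environment; the 1% threshold matches the one described above, but the rest of the details are simplified.

    #!/bin/sh
    # Minimal sketch of the steal warning, not our exact script.
    # Assumes sysstat's mpstat and a Pushover API token/user key.

    THRESHOLD=1.0

    # Sample for a few seconds and read the average %steal.  The offset
    # from the end of the line is taken from the header so the parse
    # works regardless of how the timestamp column is formatted.
    STEAL=$(mpstat 1 5 | awk '
        /%steal/   { for (i = 1; i <= NF; i++) if ($i == "%steal") off = NF - i }
        /^Average/ { print $(NF - off) }')

    # Floating point comparison via awk (plain sh only handles integers).
    if awk -v s="$STEAL" -v t="$THRESHOLD" 'BEGIN { exit !(s > t) }'; then
        curl -s \
            --form-string "token=$PUSHOVER_TOKEN" \
            --form-string "user=$PUSHOVER_USER" \
            --form-string "message=High CPU steal (${STEAL}%) on $(hostname) at provision time" \
            https://api.pushover.net/1/messages.json > /dev/null
    fi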

The second step was having the provisioning script return an error code if the steal was too high.  To prevent infinite loops (and large bills), I added a safeguard in our provisioning script that keeps a count of retries (which resets after a full success).  The script also parses the output to determine which VM is failing; in our case, there are often multiple VMs provisioning at any given time.  After too many retries, it sends a high-priority alert via Pushover and blocks further runs until a manual reset.  This made a huge difference in the amount of manual effort involved, but there were still occasions where, even after a few retries, it could not get a host with a clean CPU.
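In outline, the safeguard works something like this (the paths, the failure message format, and the retry limit are all assumptions for the sake of the example):

    #!/bin/sh
    # Sketch of the retry safeguard, not the real wrapper.  Assumes a
    # provision script that exits non-zero and names the failing VM when
    # it sees excessive steal.

    STATE_DIR=/var/lib/scaling          # hypothetical state directory
    MAX_RETRIES=3

    mkdir -p "$STATE_DIR"

    # Hard stop: once halted, nothing runs again until someone manually
    # removes the flag file.
    if [ -f "$STATE_DIR/halted" ]; then
        echo "provisioning halted, manual reset required" >&2
        exit 2
    fi

    RETRIES=$(cat "$STATE_DIR/retries" 2>/dev/null || echo 0)

    if OUTPUT=$(./provision.sh 2>&1); then
        echo 0 > "$STATE_DIR/retries"   # full success resets the counter
        exit 0
    fi

    # Pull the failing VM's name out of the output (message format assumed).
    FAILED_VM=$(printf '%s\n' "$OUTPUT" | awk '/steal too high on/ { print $NF }')

    RETRIES=$((RETRIES + 1))
    echo "$RETRIES" > "$STATE_DIR/retries"

    if [ "$RETRIES" -ge "$MAX_RETRIES" ]; then
        touch "$STATE_DIR/halted"
        curl -s \
            --form-string "token=$PUSHOVER_TOKEN" \
            --form-string "user=$PUSHOVER_USER" \
            --form-string "priority=1" \
            --form-string "message=Provisioning halted after $RETRIES failed retries (last: $FAILED_VM)" \
            https://api.pushover.net/1/messages.json > /dev/null
    fi
    exit 1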

Failover

The final solution builds on the above, but adds failover between plans and providers.  Our generated .tf files now include a comment line with JSON-encoded data containing important metadata such as the provider used, data center, CPU/plan type, and retry count.  Our wrapper parses the output from terraform apply and keeps track of which VMs failed due to CPU steal.  After terraform completes its operation, this information is used to run a failover function.
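As an illustration, the metadata comment and the way the failover step reads it back might look like this (the JSON layout, file name, and the use of jq are assumptions; the real generator and wrapper differ in the details):

    #!/bin/sh
    # Each generated .tf file carries a one-line comment such as:
    #
    #   # meta: {"provider":"linode","datacenter":"us-east","plan":"standard-6","retries":1}
    #
    # When the wrapper sees a steal failure for that VM in the terraform
    # apply output, it reads the metadata back to decide the next attempt.

    TF_FILE="transcode-03.tf"           # hypothetical generated file name

    META=$(grep '^# meta:' "$TF_FILE" | sed 's/^# meta: //')

    PROVIDER=$(printf '%s' "$META" | jq -r .provider)
    PLAN=$(printf '%s' "$META" | jq -r .plan)
    RETRIES=$(printf '%s' "$META" | jq -r .retries)

    echo "VM in $TF_FILE used $PROVIDER plan $PLAN, retry count $RETRIES"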

We have an internal flow used to determine a failover plan.  Here's an example...

The initial configuration was for a standard 6-core Linode instance.  On the first try, the wrapper spots CPU steal, so it regenerates the configuration with an 8-core dedicated plan (there is no 6-core dedicated).  If there is still a failure (this is very rare), it regenerates the configuration at Vultr.  If there is still a failure (this hasn't happened in the real world, only in our simulated environment), it tries a different Vultr data center.
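A simplified version of that ladder, with placeholder plan and region names rather than real provider plan IDs, could be expressed like this:

    #!/bin/sh
    # Sketch of the failover ladder from the example above.  The plan and
    # region labels are placeholders, not actual provider plan IDs.

    # Given "provider plan" of the failed attempt, print the next one to try.
    next_attempt() {
        case "$1:$2" in
            # Standard 6-core Linode showed steal: move up to an 8-core
            # dedicated plan, since no 6-core dedicated plan exists.
            linode:standard-6core)   echo "linode dedicated-8core same-datacenter" ;;
            # Dedicated Linode still showed steal (very rare): switch provider.
            linode:dedicated-8core)  echo "vultr dedicated-8core same-metro" ;;
            # Vultr failed too (only ever seen in simulation): try another
            # Vultr data center.
            vultr:*)                 echo "vultr $2 alternate-datacenter" ;;
            *)                       echo "halt-and-alert" ;;
        esac
    }

    next_attempt linode standard-6core    # -> linode dedicated-8core same-datacenter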