Cloud Computing Management

I previously wrote about our use of Terraform to manage cloud/VPS instances.  Here are more updates and notes on what I've run into.

Consistency and Availability

There is no cloud, it's just someone else's computer

One of the things I've spent quite a bit of time dealing with is keeping performance consistent.  How important this is depends on your usage scenario, but in our case (streaming media) it's vital.  Be aware that most of these problems are the exception rather than the rule.  Other than hyper-budget providers, most VPS/cloud providers manage resources well, but when problems do happen, they can ruin your day.

We take a hybrid approach to deployments – a combination of colo, dedicated servers, VPS/cloud computing and SaaS, using a variety of providers.  Our colo systems are mirrored to multiple providers, and we try to keep anything on a VPS/CCI set up for automated deployment, but not all instances are created equal – even when on the same plan with the same provider.

CPU Steal

CPU steal can be caused by a number of factors, but it's effectively either a noisy neighbor problem (Linode) or throttling (AWS T2/Lightsail).  Since multiple clients share the same host node, your processes may slow down significantly because of what other clients are doing.  In my experience, both networking and encoding performance are significantly affected when CPU steal is higher than 2%.  Our provisioning scripts check the kernel steal value and exit, reporting a failed installation, if it is higher than 1%.
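As a rough illustration of that kind of check, here is a minimal Python sketch, not our actual provisioning script.  It assumes a Linux guest, where the eighth value on the aggregate "cpu" line of /proc/stat is steal time.  The function names, the 5 second sampling interval and the default threshold argument are all assumptions made for the example.

```python
import time


def read_cpu_times(stat_text):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate cpu line found")


def steal_percent(before, after):
    """Steal time as a percentage of all CPU time between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    if len(deltas) < 8:
        return 0.0  # very old kernels do not report steal at all
    # Field order: user nice system idle iowait irq softirq steal ...
    # guest and guest_nice are already counted in user and nice,
    # so only the first eight fields are summed.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


def check_steal(threshold=1.0, interval=5.0, path="/proc/stat"):
    """Sample twice and return (ok, steal_pct); ok is False above threshold."""
    with open(path) as stat_file:
        before = read_cpu_times(stat_file.read())
    time.sleep(interval)
    with open(path) as stat_file:
        after = read_cpu_times(stat_file.read())
    pct = steal_percent(before, after)
    return pct <= threshold, pct
```

A provisioning script could call check_steal() early on and exit with a non-zero status when the first element of the result is False.  A longer interval smooths out short spikes at the cost of a slower install.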

Availability

On rare occasions, cloud provisioning systems become unavailable.  This is why we aren't dependent on a single provider.

We can easily switch a task to a different provider or data center in about 3 minutes.

We have also seen times where larger VMs are temporarily unavailable at a particular data center (this mostly happens with Vultr).  For this case, we run a preflight check that confirms availability before generating provisioning commands for Terraform.  If the provisioning still fails, we switch the VM to a different data center.
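The preflight idea can be sketched roughly as follows.  This is an illustration under stated assumptions, not our tooling: the availability mapping stands in for whatever the provider's API reports, and the region codes, plan names and the "plan" / "region" variable names are hypothetical.  The output is JSON of the kind Terraform can read from a .tfvars.json file.

```python
import json


def pick_region(plan, preferred_regions, availability):
    """Return the first preferred region where the plan is available, else None.

    availability maps region -> set of plan ids currently offered there.
    In practice it would be filled from the provider's API (not shown).
    """
    for region in preferred_regions:
        if plan in availability.get(region, set()):
            return region
    return None


def render_tfvars(plan, preferred_regions, availability):
    """Build the JSON body for a .tfvars.json file, or fail the preflight."""
    region = pick_region(plan, preferred_regions, availability)
    if region is None:
        raise RuntimeError(f"plan {plan!r} is unavailable in all preferred regions")
    # "plan" and "region" are hypothetical Terraform variable names.
    return json.dumps({"plan": plan, "region": region}, indent=2)
```

If provisioning still fails after the preflight passes, the same function can be called again with the failed region removed from preferred_regions, which moves the VM to the next data center in the list.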

Some provider details...

This covers only some of the cloud providers we use.  Later, I may cover my experiences with dedicated and colo providers.

Linode

Performance here is usually good, but we have to monitor things closely.  Linode host nodes are all connected at 40 Gbps and plans offer generous bandwidth caps, which is fantastic for our purposes, and we seldom see any real network issues.  Linode also pools transfer quotas across instances, which is great for our relatively high volume, high burst usage scenario.  We have, however, seen problems with CPU congestion/overselling, though we have never run into throttling.  Here are some of the processors we have encountered with Linode.  Note that I count threads here rather than cores, as all providers I have encountered have HT enabled and assign CPU based on threads rather than running without HT and locking CPU affinity to specific cores.

CPU              Clock (GHz)  Threads  Max per server
E5-2680v2        2.8          20       2
E5-2680v3        2.5          24       2
E5-2697v4        2.3          36       2
Xeon Gold 6148   2.4          40       4
AMD Epyc 7451    2.3          48       2
AMD Epyc 7501    2.0          64       2

Vultr

CPU performance has been more consistent with Vultr.  I believe each host is connected at 10 Gbps.  To the best of my knowledge, Vultr does not publish available port speeds/caps, although 1 Gbps is a pretty safe assumption.  Since VPS nodes support AVX-512 and I haven't had a problem with CPU steal, they have so far worked well for transcoder instances.

Vultr also offers 'bare metal' instances, which are great all around at the current price point.  They connect at 10 Gbps and are dedicated, so there are no concerns about noisy neighbors.  Bare metal instances take a little longer to provision, typically around 5 minutes (while VPS provisioning at most providers is under 2 minutes).

Type        CPU              Clock (GHz)  Threads  Max per server
VPS         Xeon Gold 5120   2.6          28       4
Bare Metal  Xeon E3-1270v6   3.8          8        1

Amazon

We mostly use Amazon for AI services.  We seldom use their cloud computing products.  The price points for EC2 are not competitive with other options, especially for high bandwidth consumption.  Lightsail is priced appropriately but has poor performance (these are speculated to be T2 burstable EC2 instances under the hood).  Both CPU and network are heavily throttled after about 20-30 minutes.  This can be detected both with benchmark software and with kernel CPU steal information.
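For the benchmark side of that detection, one simple approach is to repeat a fixed CPU-bound workload over time and compare recent throughput against the early baseline.  The sketch below is only an illustration: the SHA-256 loop stands in for real benchmark software, and the window size and 30% drop threshold are arbitrary assumptions rather than measured values.

```python
import hashlib
import statistics
import time


def benchmark_once(iterations=200_000):
    """Return hashes per second for a fixed CPU-bound workload."""
    data = b"x" * 64
    start = time.perf_counter()
    for _ in range(iterations):
        data = hashlib.sha256(data).digest()
    elapsed = time.perf_counter() - start
    return iterations / elapsed


def is_throttled(samples, window=3, max_drop=0.3):
    """Compare the median of the last `window` samples to the first `window`.

    Returns True when recent throughput has fallen more than max_drop
    (as a fraction) below the baseline. With too few samples to form
    both windows, it returns False.
    """
    if len(samples) < 2 * window:
        return False
    baseline = statistics.median(samples[:window])
    recent = statistics.median(samples[-window:])
    return recent < baseline * (1.0 - max_drop)
```

To catch throttling that starts after 20-30 minutes, benchmark_once() would need to be sampled every minute or so for at least that long, with the results appended to a list that is passed to is_throttled().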

Digital Ocean

We seldom use Digital Ocean, mostly due to network performance.  The listed port speed is 1 Gbps, but performance is inconsistent, likely due to other clients using the same host node.  We keep them configured as another backup provider for transcoding.

Hetzner Cloud

Hetzner's offering is very nice and the price is fantastic, although data centers are only in Germany and Finland.  We mostly use Hetzner as overflow for on-demand video encoding.  I believe each host is connected at 10 Gbps.  The CPU is a Xeon Scalable (Skylake-SP) at 2.1 GHz; I am unsure of the specific model.  Performance is great for encoding, since Skylake-SP supports AVX-512.