Caught Between Cost Efficiency and High Availability
How Distributed Services Across Three Zones Beat Two Zones in Cost and Reliability
Different Stakeholders, Different Priorities
One of the key tasks of IT and solutions architects is to build systems that are highly available and, at the same time, as cost efficient as possible. However, the various stakeholders in IT projects tend to focus more strongly on one of these two areas: Product Management and Finance often care more about the cost of running a service, whereas Site Reliability Engineering or Operations teams are more focused on the reliability and availability of a given service.
This article aims to show that cost efficiency and high availability are not mutually exclusive and that a more distributed system can actually save cost.
Availability Targets & Service Level Agreements
The availability target for any service being built primarily depends on what downtime is acceptable to the customer for the type of service that is offered. The overall availability target is defined as a percentage and documented in the Service Level Agreement (SLA). Not meeting the committed service level usually entitles the customer to claim a credit towards the price they have paid for that service in a given month.
Many cloud providers classify the availability targets for their services by how many 9s they provide. For example, three 9s refers to 99.9%, four 9s to 99.99%, and so on. The table below shows the most typical service levels and the maximum allowable downtime per month that still meets each target.
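The downtime figures behind such a table follow directly from the target percentage. A minimal sketch, assuming a 30-day month (the function name is illustrative, not from any provider's SLA):

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

def max_downtime_minutes(availability_percent: float) -> float:
    """Maximum downtime per month that still meets the target."""
    return MINUTES_PER_MONTH * (1 - availability_percent / 100)

for nines, target in [("two 9s", 99.0), ("three 9s", 99.9),
                      ("four 9s", 99.99), ("five 9s", 99.999)]:
    print(f"{nines} ({target}%): {max_downtime_minutes(target):.1f} min/month")
# three 9s works out to 43.2 min/month, matching the ~43 minutes cited below
```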
The number of 9s that is offered for a service correlates with how distributed and highly available that service needs to be. Running a single VM or server usually has the lowest availability whereas globally distributed systems can accomplish four 9s and beyond.
Infrastructure can fail at multiple levels. Typical failure points for a VM are the virtualization layer itself, the hardware the VM runs on, the subnet the VM runs in, or even the entire data center. This is why it is a best practice to distribute services and systems across availability zones.
Availability zones are physically separate data centers with independent power sources, networking and infrastructure. This ensures that outages are isolated to a single availability zone and that systems distributed across availability zones do not suffer a full outage from a single-zone failure.
While a service level of 99.9% and a downtime of 43 minutes per month might be acceptable to some services and applications, business critical functions often require an availability of four or even five 9s.
Reliability of Distributed Systems
The most common service level offered by cloud providers for single-zone services is 99.9%, meaning the provider guarantees that the purchased service is available 99.9% of the time.
So what would the impact to our overall availability be if we distributed our services across multiple nodes in different availability zones? If a service provider offers 99.9% for a server in a single zone, we need to distribute our service across multiple availability zones to accomplish higher availability targets.
The above table shows that if we want to provide more than 99.9% availability for our service, we need to be resilient to a single-zone failure. Whether the service is distributed across two zones or three, we can accomplish more than five 9s by being fault tolerant to the loss of one zone.
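Under the common simplifying assumption that zone failures are independent, the combined availability can be estimated with basic probability; here p is the chance that a single 99.9% zone is unavailable at any given moment:

```python
p = 0.001  # probability that a single 99.9% zone is down

# Two zones, fault tolerant to losing one: the service is down
# only if both zones fail at the same time.
two_zone = 1 - p**2

# Three zones, fault tolerant to losing one: the service is down
# only if two or more of the three zones fail at the same time.
three_zone = 1 - (3 * p**2 * (1 - p) + p**3)

print(f"two zones:   {two_zone:.6%}")    # 99.999900%
print(f"three zones: {three_zone:.6%}")  # 99.999700%
```

Both configurations exceed five 9s (99.999%) under this independence assumption, which is what makes the capacity comparison in the next section an apples-to-apples one.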
As a result, we need to ensure that the capacity planning for our service considers failure tolerance to the loss of an availability zone.
Capacity Planning and Failure Scenarios
We established that a service needs to be distributed across multiple nodes in multiple availability zones to accomplish high availability. Distributing the service keeps it available even if one of those nodes, or an entire availability zone, fails.
A node in this context can be a traditional bare-metal server, a virtual machine, a Kubernetes worker node or any compute service or a set thereof.
The overall concept applies to any type of node, but it is most easily applied with containerized services that run on an orchestration engine, such as Kubernetes, Docker Swarm or Mesos.
If a node or an availability zone fails, the orchestration engine automatically redistributes its workload to the remaining healthy nodes. Hence, sufficient spare capacity needs to be available in our cluster of nodes to accommodate the workload of a failing node or zone.
For the examples below, let us assume that a service requires 32G of memory and that 100% of each node's capacity is available for the service to use. To survive the outage of a node or zone, the capacity remaining after the failure needs to be at least 32G for the service to stay available.
In a scenario of a service distributed across two nodes as shown above, the total capacity that needs to be provisioned in our example would be 64G of memory, so that 32G still remains available in a failure scenario. If one of the two nodes fails, this ensures that the healthy node is able to host the workload of the failing node with its remaining capacity.
As we need to leave 50% of the capacity of each node unused for high availability purposes, the resources a service requires need to be provisioned with a factor of 2 to be highly available.
If the system is distributed across three availability zones, we can load each node to 66% of its capacity and still have sufficient capacity to accommodate the redistributed workload if one node fails.
In this scenario, it would be sufficient to provision three nodes with 16G of memory each to accommodate a service that requires 32G of memory with sufficient spare capacity for high availability.
The total capacity provisioned in a three node scenario is 48G of memory, or 1.5 times the 32G that the service requires to run. If one of the three 16G nodes fails, we still have 32G (2 x 16G) available, as the service requires.
Consequently, the three node set-up requires 25% less capacity than the two node scenario with its 64G.
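The capacity arithmetic above generalizes: to tolerate the loss of one zone, the surviving zones must jointly hold the full requirement. A short sketch of that calculation (the function name is illustrative):

```python
def provisioned_capacity(required_gb: float, zones: int) -> float:
    """Total capacity to provision across `zones` zones so that the
    surviving `zones - 1` zones still hold `required_gb` after one
    zone is lost."""
    per_node = required_gb / (zones - 1)  # size each node can absorb losing one
    return zones * per_node

two_zone = provisioned_capacity(32, 2)    # 64.0 GB, factor of 2
three_zone = provisioned_capacity(32, 3)  # 48.0 GB, factor of 1.5
print(f"savings: {1 - three_zone / two_zone:.0%}")  # 25%
```

The same formula shows the trend continuing: four zones would need only a factor of 4/3, though at the price of more nodes to operate.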
Some orchestration solutions like Kubernetes also provide auto-scaling capabilities. Auto-scaling automatically increases or decreases the number of nodes depending on the load on the system. As it is important to keep the overall load below 66% to preserve sufficient failover capacity, it is critical to scale the number of nodes up once the system reaches 60–65% of its total capacity. Nodes have to be added in multiples of three to keep the capacity across all zones even.
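As a hypothetical illustration of that scaling rule (this is not an actual Kubernetes autoscaler API; the names and the 65% trigger are assumptions), the decision boils down to:

```python
def nodes_to_add(load_fraction: float, trigger: float = 0.65) -> int:
    # Add one node per zone (three in total) once cluster-wide load
    # crosses the trigger, keeping capacity balanced across zones.
    return 3 if load_fraction >= trigger else 0

print(nodes_to_add(0.70))  # 3
print(nodes_to_add(0.50))  # 0
```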
The general assumption that distributing a service across more availability zones and nodes implies higher cost is not warranted in most scenarios. A service distributed across three availability zones beats one distributed across two zones in cost by around 25% while providing similar availability.
Explaining these concepts of high availability, distributed systems and fault tolerance to stakeholders can be critical in achieving alignment on a financial and operational plan for a system or service being built.
Thanks to Stephen Kawaguchi for reviewing and providing valuable feedback!