I recently created some guidance for my customer around isolation and resiliency and figured I would share if helpful for others. Warning, this gets kind of complex 🙂
Let's get to it!
Isolation is a key component in deploying resilient services. Understanding the various isolation options in Azure is critical to deploying services that are resilient to outages of various scale.
Figure 1 – Azure isolation constructs
Azure services are provided across a number of regions, which are listed at https://azure.microsoft.com/en-us/global-infrastructure/regions/. A region is a set of physical locations that exist within a 2ms latency envelope. Many services are deployed to and exist within a specific region, including core resources such as virtual networks. Regions are also paired to enable certain services to survive a regional outage by replicating data between regions, for example geo-redundant storage (GRS). Additionally, customers can choose multi-region deployments using these pairings, which ensures similar Azure services are available in both regions and that fabric updates are not applied to both regions of a pair simultaneously. The pairings are documented at https://docs.microsoft.com/en-us/azure/best-practices-availability-paired-regions. For the greatest resiliency, architectures should include at least two regions, used either in a disaster recovery pattern with an active and a passive deployment, or in an active-active pattern with traffic distributed between the regional deployments. The pattern chosen will depend on the workload and the data platform used (many databases cannot support active-active across locations without a significant performance penalty).
Availability Zones (AZs) are available in many regions. An AZ is a physical location in a region that has independent power, cooling and networking from other AZs in the region. A failure in one AZ will not impact other AZs in the region which means having a service deployed across multiple AZs provides protection from an AZ failure.
Each Azure subscription has three logical AZs exposed for resource placement. Note that AZs are not consistent between subscriptions; for example, AZ1 in one subscription is not the same physical location as AZ1 in a different subscription. For services that require the greatest resiliency within a region, architectures should deploy instances across AZs. This ensures resources within an AZ are isolated from any issue in a different AZ. As an example, VMs distributed across AZs have an SLA of 99.99%, the highest available for VMs in Azure. Different services utilize AZs differently. Some are zonal, which means they are deployed to a single AZ that you specify. Others are zone-redundant, which means the service automatically spans multiple AZs, providing resiliency from any single AZ failure. When architecting a solution it is important to:
- Identify all the Azure components/services that will be used. For example, an internal standard load balancer, VMSS, Azure SQL Database and NAT Gateway.
- Identify the AZ support options for each service. In order of resiliency these are zone-redundant (deploys in a resilient manner across AZs), zonal (deploys in a specific AZ) or regional (no AZ interaction).
- Architect to the lowest common denominator. For example, all the aforementioned services are zone-redundant except for NAT Gateway, which is zonal. Because NAT Gateway is zonal and is configured at the subnet level, the architecture needs a zonal foundation for services that interact directly with subnets in order to keep the zonal promise, i.e. resources deployed to subnets must be AZ aligned, with one AZ-aligned subnet per AZ.
- Resources that are NOT directly linked to a zonal resource can still leverage their zone-redundant capabilities. For example, using NAT Gateway requires a deployment per AZ, which in turn requires a separate VMSS deployed in each AZ in its own subnet (since a VMSS deployment targets a single subnet). The Standard Load Balancer, however, can be deployed in a zone-redundant manner and have all three VMSS instances as part of a single backend pool.
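To make the lowest-common-denominator step concrete, here is a minimal Python sketch. The service-to-capability mapping is an illustrative assumption matching the example above, not an authoritative capability matrix; always verify each service's current AZ support in the Azure documentation.

```python
# Resiliency ordering: zone-redundant > zonal > regional.
AZ_SUPPORT_RANK = {"zone-redundant": 2, "zonal": 1, "regional": 0}

def lowest_common_denominator(services: dict) -> str:
    """Return the weakest AZ support level across all services used."""
    return min(services.values(), key=lambda level: AZ_SUPPORT_RANK[level])

# Example matching the architecture discussed above (capabilities assumed).
services = {
    "Standard Load Balancer": "zone-redundant",
    "VMSS": "zone-redundant",
    "Azure SQL Database": "zone-redundant",
    "NAT Gateway": "zonal",
}

print(lowest_common_denominator(services))  # "zonal" -- the design foundation
```

Because one zonal service (NAT Gateway) is in the mix, the subnet/compute foundation must be zonal even though everything else could be zone-redundant.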
Availability Sets (AS) exist within a single physical facility. When deploying workloads to an Availability Set the workload is automatically distributed among three Fault Domains. A Fault Domain can be thought of as a rack within the datacenter which has its own network switch, power supply unit, etc. By deploying workloads to an AS the workload is resilient to any single rack-level failure, such as a PSU or switch failure (providing you have two or more instances deployed to the AS). Likewise, since hosts live in a particular rack (fault domain), using availability sets also ensures workloads run on multiple nodes, protecting from any single node or VM failure. Additional storage resiliency can be achieved by combining availability sets with managed disks in an aligned mode; each fault domain then uses a different storage cluster from the other fault domains in the availability set, which also helps protect from any single storage cluster failure. VMs deployed to an AS have an SLA of 99.95%. Availability sets also have an update domain property. This controls how workloads are further distributed and determines what percentage is impacted during an update of the application (if PaaS) or the fabric itself (IaaS and PaaS). An update domain count of 5 means the workloads are spread over up to 5 update domains, so for any update only 20% (1/5) would be impacted at a time. Figure 2 shows this. Note that when using Availability Zones each AZ acts as a fault and update domain, and updates across zones do not happen at the same time.
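The 20% figure above generalizes: the worst-case blast radius of an update is the most heavily loaded update domain. A small sketch of that arithmetic, assuming instances are spread as evenly as the platform allows:

```python
import math

def max_update_impact(instance_count: int, update_domain_count: int) -> float:
    """Worst-case fraction of instances offline during an application or
    fabric update, assuming even spread across update domains."""
    domains_used = min(instance_count, update_domain_count)
    # Largest number of instances that can land in a single update domain.
    largest_domain = math.ceil(instance_count / domains_used)
    return largest_domain / instance_count

# 10 instances over 5 update domains: 2 per domain, so 20% impacted at once.
print(max_update_impact(10, 5))  # 0.2
```

With fewer instances than update domains, some domains simply go unused, which is why the text says "up to 5" update domains.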
Figure 2 – Fault and update domains in an Availability Set
Note you cannot ordinarily use AZ and AS together; however, it is possible to pin an AS to a specific AZ by utilizing a proximity placement group (PPG), which is used to ensure proximity between services. Because of the increased SLA and zero cost difference, AZ is preferred over AS where the target resource supports it.
In summary, within a region the use of AZs provides the greatest resiliency from various types of failure and should be used across services. If AZs are not usable then AS should be utilized. Avoid "regional" deployments if using AZs, as you have no control over where the deployment will land and which failures may impact it.
In addition to the use of AZ or AS within a region, deployments to multiple regions should also be architected for the highest level of resiliency in an active-passive or active-active configuration. Solutions like Azure Traffic Manager (DNS-based) and Azure Front Door (HTTP-based) can be used to balance external traffic between regions if required.
Below are some additional considerations and capabilities for various fabric layers.
Virtual networks are deployed to a region and are available across the entire region, i.e. they span AZs. Virtual networks are broken into subnets that are also regional and are available across AZs. There is no concept of deploying a subnet to a single AZ. If a subnet needs to be aligned to a particular AZ this would have to be achieved by logically allocating subnets to AZs and then ensuring resources placed in the subnet are deployed to the corresponding AZ. Any communication to the vnet, for example connections to on-premises via ExpressRoute, would be available to the entire vnet regardless of the AZ. Zone-redundant gateway options would be leveraged to ensure the vnet connectivity could tolerate any single zone failure.
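Since subnets have no zone property, AZ alignment is purely a planning convention you enforce yourself. A sketch of such a plan, carving one subnet per zone from a hypothetical address space (all names and ranges are illustrative):

```python
import ipaddress

# Hypothetical vnet address space; subnets have no zone property in Azure,
# so the zone mapping below is a naming/placement convention only.
vnet = ipaddress.ip_network("10.0.0.0/16")
zones = ["1", "2", "3"]

# Carve one /24 per zone and record the convention in a simple map.
subnet_plan = {
    f"snet-workload-az{zone}": str(subnet)
    for zone, subnet in zip(zones, vnet.subnets(new_prefix=24))
}

print(subnet_plan)
# {'snet-workload-az1': '10.0.0.0/24',
#  'snet-workload-az2': '10.0.1.0/24',
#  'snet-workload-az3': '10.0.2.0/24'}
```

The convention only holds if every resource placed into `snet-workload-az1` is actually deployed zonally to AZ1, and so on; nothing in the platform enforces it for you.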
Certain network resources support AZs, primarily the standard SKUs, for example the Standard Load Balancer, Standard Public IP, Standard IP prefix and the App Gateway v2. These can be deployed zone-redundant, and some also support zonal deployment. ExpressRoute Gateway can also deploy in a zone-redundant or zonal mode. When using a zone-redundant service it is automatically made resilient across zones by the Azure fabric and no manual steps are required once the zone-redundant option is configured. For example, a standard SKU public IP being used as the front end with a Standard Load Balancer will span zones and be resilient to any single zone failure. Services like NAT Gateway can be deployed regionally (no zonal promise is made and the deployment can land in any datacenter in the region) or zonally, but do not support zone-redundant deployments. When using combinations of solutions, it is important to architect accordingly. When you use a zone-redundant component a single instance of the resource is deployed. When you use a zonal component and want the service in each AZ you must deploy an instance into each AZ, i.e. to cover 3 AZs you would deploy 3 instances of a zonal resource.
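The instance-count rule at the end of that paragraph can be captured in a couple of lines; a trivial sketch for planning purposes:

```python
def instances_required(az_model: str, zone_count: int = 3) -> int:
    """How many resource instances must be deployed to cover all zones.
    A zone-redundant resource is a single instance spanning zones; a
    zonal resource needs one instance pinned to each zone."""
    if az_model == "zone-redundant":
        return 1
    if az_model == "zonal":
        return zone_count
    raise ValueError(f"no zone coverage possible for model {az_model!r}")

print(instances_required("zone-redundant"))  # 1, e.g. one Standard Load Balancer
print(instances_required("zonal"))           # 3, e.g. one NAT Gateway per AZ
```

This asymmetry is exactly why mixing zonal services into an otherwise zone-redundant design multiplies the deployment count, as Figure 3 illustrates.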
Figure 3 shows an example deployment combining a single zone-redundant SLB front end with zonal NAT Gateway deployments. Note here the subnets are logically mapped to AZs, implemented using zonal VMSS deployments with each VMSS deploying to the logically mapped subnet for its AZ. Each AZ's NAT Gateway is then connected to its corresponding subnet. Note that zonal rather than zone-redundant VMSS deployment is used to target specific subnets, enabling the mapping to the NAT Gateway. This also means three separate VMSS deployments are used, one for each AZ, instead of a single AZ-spanning VMSS instance. In this example, however, all 3 VMSS instances are part of the same SLB backend pool and are all distributed from a single front-end IP. This model would apply to any other type of compute service.
Figure 3 – Example network solution using combinations of zonal and zone-redundant solutions.
If the additional complexity of multiple zonal compute deployments to enable NAT Gateway is not desirable, the standard internal load balancer can instead have a public IP added to enable outbound internet connectivity for the backend pool members, with outbound rules used to control SNAT behavior.
Given the additional complexity NAT Gateway introduces because of its zonal deployment it is important to understand why you would use this instead of just adding a public IP to the internal SLB along with outbound rules to control the SNAT (note that WITHOUT a public IP or NAT Gateway at the subnet, any machine behind an internal SLB has no Internet access). Key benefits of NAT Gateway over public IP on an internal SLB are described at https://docs.microsoft.com/en-us/azure/virtual-network/nat-gateway-resource#source-network-address-translation however some of the key points are:
- LB SNAT requires ahead-of-time knowledge, planning, and tweaking of the worst-case scenario for any of the VMs, whereas NAT Gateway allocates SNAT ports on demand as needed.
- Dynamic workloads or workloads which diverge from each other in intensity are difficult to accommodate with LB SNAT.
- You must explicitly join every VM NIC to the backend pool of the load balancer to use the SLB SNAT.
- Some customers also object to using a component that provides inbound functionality to deliver outbound functionality.
- NAT is designed to be a much simpler outbound solution for entire subnet(s) of a virtual network.
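The "worst-case planning" point can be made concrete. With default LB SNAT, each backend instance is pre-allocated a fixed number of SNAT ports based on backend pool size tiers; the tier table below reflects the documented defaults at the time of writing, but verify against the current Azure Load Balancer outbound-connections documentation before relying on it:

```python
# Default SNAT port pre-allocation tiers: (max backend pool size, ports per VM).
# Assumed from the Azure Load Balancer outbound-connections docs -- verify.
DEFAULT_SNAT_TIERS = [
    (50, 1024), (100, 512), (200, 256), (400, 128), (800, 64), (1000, 32),
]

def default_snat_ports_per_vm(pool_size: int) -> int:
    """Ports pre-allocated to each VM under default LB SNAT allocation."""
    for max_vms, ports in DEFAULT_SNAT_TIERS:
        if pool_size <= max_vms:
            return ports
    raise ValueError("pool larger than 1000 VMs")

# Every VM gets the same fixed allocation, whether it needs it or not:
print(default_snat_ports_per_vm(10))   # 1024 ports each
print(default_snat_ports_per_vm(300))  # 128 ports each
```

A chatty VM in a large pool can exhaust its fixed allocation while idle VMs hold ports they never use; NAT Gateway avoids this by drawing ports from a shared pool on demand.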
Azure Storage accounts have different redundancy options. The base level is Locally Redundant Storage (LRS), which keeps three copies of the data within a single physical location. Zone Redundant Storage (ZRS) distributes the three copies across three AZs in the region. Geo-redundant/Geo-zone-redundant (GRS/GZRS) adds an additional three copies of the data at the paired region.
If utilizing GRS/GZRS for storage resiliency to another region, it is important that other services replicate and fail over to the same paired region, to ensure that in the event of failover the services end up together in a single region with functional latency between them.
Database services have different options, for example Azure SQL Database premium and business critical are deployed in a zone-redundant configuration. Cosmos DB has an optional zone redundancy option to provide zone-redundant deployments. These configurations are transparent to the application using the service and are accessed through a single endpoint/DNS name which is prescribed by the data service. Additionally, Cosmos DB has different consistency models enabling multi-master configurations which would allow writable replicas in multiple regions.
For basic VMs it is possible to deploy to a region (99.9% SLA when using Premium SSD or Ultra Disk), an availability set (99.95% SLA with 2+ instances) or availability zones (99.99% SLA with 2+ instances). When deploying to AZs it is important the VMs are distributed over multiple AZs.
Virtual Machine Scale Sets support both zonal and cross-zone deployments. When deploying across zones, different balance options are available: best-effort zone balance or strict zone balance. Best effort will attempt to keep balance across zones but will allow temporary imbalance. Strict will not allow scale actions if the balance would be broken. For most scenarios best effort suffices.
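To illustrate what "strict" means in practice, here is a sketch of the balance check, assuming the commonly documented definition that zone counts may differ by at most one instance (only zones that currently hold instances are considered in this simplified version):

```python
from collections import Counter

def is_zone_balanced(vm_zones: list) -> bool:
    """Strict zone balance: the most and least populated zones differ
    by at most one instance."""
    counts = Counter(vm_zones)
    return max(counts.values()) - min(counts.values()) <= 1

def strict_scale_in_allowed(vm_zones: list, zone: str) -> bool:
    """Would removing one instance from this zone keep strict balance?"""
    remaining = vm_zones.copy()
    remaining.remove(zone)
    return is_zone_balanced(remaining)

vms = ["1", "1", "2", "2", "3", "3"]
print(strict_scale_in_allowed(vms, "3"))  # True: 2/2/1 is within one instance
```

Under strict balance a scale-in that would leave, say, 2/2/0 would be refused; best effort would permit it and rebalance later, which is why best effort is the safer default for autoscaling workloads.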
AKS node pools can be deployed across zones at time of creation. The default node pool deployment controls whether the AKS control plane components are deployed across zones. Note that while cross-zone node pools can be configured, they are not recommended for large stateful deployments, as race conditions can occur where compute tries to start in AZ1 while the storage may be in AZ3 because of scaling limits. The best practice is to create a node pool per zone for stateful workloads. Assuming networking for AKS is provided using Azure CNI (which enables integration with existing vnets), separate node pools per zone also allow a different subnet to be configured for each node pool, which can then be AZ aligned. Deploy the service into each node pool, which ensures the compute and storage scale together. For stateless services cross-zone node pools are fine, unless you wish to use NAT Gateway, in which case you will once again need a node pool per zone so that separate subnets can keep the NAT Gateway zonal promise. When using NAT Gateway, the NAT Gateway is aligned to the subnet and not to AKS node pools directly: if NAT Gateway 1 is zonally deployed to AZ1 and linked to subnet 1, and subnet 1 is used by node pool 1, then node pool 1 would be a zonal deployment to AZ1, ensuring alignment. NAT Gateway 2 and node pool 2 would be in AZ2 with subnet 2, and so on.
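The per-zone alignment chain described above (NAT Gateway → subnet → node pool → zone) is easy to lose track of at three zones, so it can help to write it down as a plan before touching IaC. A sketch, with all names hypothetical; actual creation would use the Azure CLI or a tool such as Bicep or Terraform:

```python
# Hypothetical per-zone alignment plan: each zone gets its own zonal node
# pool, its own Azure CNI subnet, and its own zonal NAT Gateway on that
# subnet, keeping the NAT Gateway zonal promise end to end.
zones = ["1", "2", "3"]

plan = [
    {
        "zone": z,
        "node_pool": f"nodepool{z}",   # zonal node pool pinned to AZ z
        "subnet": f"snet-aks-az{z}",   # subnet logically mapped to AZ z
        "nat_gateway": f"natgw-az{z}", # zonal NAT Gateway linked to that subnet
    }
    for z in zones
]

for row in plan:
    print(row)
```

Every row is an independent zonal stack; a failure in AZ1 takes out only the first row's resources while the other two zones continue serving.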
App Service Environments can be deployed as zonal. Behind the scenes, ASE uses zone-redundant storage (ZRS) for the remote web application file storage. At time of writing, App Service Plans do not support AZs. The best option would be to deploy to two nearby regions, e.g. East US and East US 2, and balance across them using Azure Front Door (or Azure Traffic Manager).
A full list can be found at https://docs.microsoft.com/en-us/azure/availability-zones/az-overview.