Isolation and Resiliency Guidance in Azure

I recently created some guidance for my customer around isolation and resiliency and figured I would share if helpful for others. Warning, this gets kind of complex 🙂

Let's get to it!


Isolation is a key component in deploying resilient services. Understanding the various isolation options in Azure is critical to deploying services that are resilient to outages of various scale.

Figure 1 – Azure isolation constructs

Azure services are provided across a number of regions. A region is a set of physical locations that exist within a 2ms latency envelope. Many services are deployed to and exist within a specific region, including core resources such as virtual networks. Regions are also paired, which lets certain services gain resiliency from a regional outage by replicating data between the paired regions, for example geo-redundant storage (GRS). Additionally, customers can choose to build multi-region deployments using these pairings, which ensure similar Azure services are available in both regions and that fabric updates are not made to both regions of a pair simultaneously.

For the greatest resiliency, architectures should include at least two regions. These may be used in a disaster recovery pattern with an active and passive deployment across regions, or in an active-active pattern with traffic distributed between regional deployments. The pattern chosen will depend on the workload and data platform used (many databases cannot support active-active across locations without a significant performance penalty).

Availability Zones (AZs) are available in many regions. An AZ is a physical location in a region that has independent power, cooling and networking from other AZs in the region. A failure in one AZ will not impact other AZs in the region which means having a service deployed across multiple AZs provides protection from an AZ failure.

Each Azure subscription has three logical AZs exposed for resource placement. Note that AZs are not consistent between subscriptions, for example AZ1 in one subscription is not the same as AZ1 in a different subscription. For services that require the greatest resiliency within a region, architectures should deploy instances across AZs. This ensures resources within an AZ are isolated from any issue in a different AZ. As an example, VMs that use AZs have an SLA of 99.99%, the highest available for VMs in Azure.

Different services utilize AZs differently. Some are zonal, which means they are deployed to a single, specific AZ that you specify. Others are zone-redundant, which means the service automatically spans multiple AZs, providing resiliency from any single AZ failure. When architecting a solution it is important to:

  1. Identify all the Azure components/services that will be used. For example, an internal standard load balancer, VMSS, Azure SQL Database and NAT Gateway.
  2. Identify the AZ supported options for each service. In order of resiliency these are zone-redundant (deploys in a resilient manner across AZs), zonal (deploys in a specific AZ) or regional (no AZ interaction).
  3. You will need to architect to the lowest common denominator. For example, all the aforementioned services are zone-redundant except for NAT Gateway which is zonal. Therefore, the architecture will need to have a foundation of zonal for services that interact directly with subnets to keep the zonal promise of the zonal services. For example, NAT Gateway is zonal and is configured at a subnet level. Therefore, the architecture will require that resources deployed to subnets are AZ aligned, i.e. an AZ aligned subnet per AZ.
  4. Resources that are NOT directly linked to a zonal resource that have zone-redundant capabilities can still be leveraged. For example, NAT Gateway may be used which will require a deployment per AZ which in turn requires a separate VMSS deployed in each AZ in its own subnet (since VMSS deployments deploy to a single subnet). The Standard Load Balancer however can be deployed in a zone redundant manner which can then have all 3 VMSS instances as part of a single backend set.
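
The "lowest common denominator" step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ZoneMode` enum and the helper names are hypothetical, and the classification of each service is taken from the example in step 3 rather than looked up from Azure.

```python
from enum import IntEnum


class ZoneMode(IntEnum):
    """AZ support levels, ordered from least to most resilient."""
    REGIONAL = 0
    ZONAL = 1
    ZONE_REDUNDANT = 2


def foundation_mode(services: dict[str, ZoneMode]) -> ZoneMode:
    """Return the lowest common denominator across all services."""
    return min(services.values())


def instances_needed(mode: ZoneMode, zones: int = 3) -> int:
    """A zonal resource needs one instance per AZ; other modes need one in total."""
    return zones if mode is ZoneMode.ZONAL else 1


# Classification as described in step 3 (an assumption of this sketch).
services = {
    "Standard Load Balancer (internal)": ZoneMode.ZONE_REDUNDANT,
    "VMSS": ZoneMode.ZONE_REDUNDANT,
    "Azure SQL Database": ZoneMode.ZONE_REDUNDANT,
    "NAT Gateway": ZoneMode.ZONAL,
}

print("Foundation:", foundation_mode(services).name)
for name, mode in services.items():
    print(f"{name}: {mode.name}, instances to cover 3 AZs = {instances_needed(mode)}")
```

The sketch only captures what each service is capable of. As step 4 explains, a zone-redundant capable service such as VMSS may still need to be deployed zonally when it is tied to a zonal resource through a subnet.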

Availability Sets (AS) exist within a single physical facility. When deploying workloads to an Availability Set the workload is automatically distributed among three Fault Domains. A Fault Domain can be thought of as a rack within the datacenter which has its own network switch, power supply unit etc. A workload deployed to an AS is therefore resilient to any single rack-level failure such as a PSU or switch failure (providing you have two or more instances deployed to the AS). Likewise, since hosts live in a particular rack (fault domain), using availability sets also ensures workloads are on multiple nodes, protecting from any single node or VM failure. Additional storage resiliency can be achieved with availability sets by combining with managed disks and using an aligned mode. Here each fault domain will use a different storage cluster from other fault domains in the availability set, helping to also protect from any single storage cluster failure. VMs that deploy to an AS have an SLA of 99.95%.

Availability sets also have an update domain property. This controls how workloads are distributed further and affects the percentage that is impacted during an update of the application (if PaaS) or the fabric itself (IaaS and PaaS). An update domain count of 5 means the workloads are spread over up to 5 update domains, meaning for any update only 20% (1/5) would be impacted at a time. Figure 2 shows this. Note that when using Availability Zones each AZ acts as a fault and update domain, which ensures updates across zones do not happen at the same time.

Figure 2 – Fault and update domains in an Availability Set
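
The 20% figure for update domains can be checked with a little arithmetic. The Python sketch below assumes instances are spread as evenly as possible across update domains (a simplifying assumption, not a statement of how the fabric places them); the function name is hypothetical.

```python
import math


def worst_case_share(instances: int, update_domains: int = 5) -> float:
    """Largest fraction of instances that sit in a single update domain,
    assuming an even spread of instances across the domains."""
    if instances < 1 or update_domains < 1:
        raise ValueError("instances and update_domains must be positive")
    domains_in_use = min(instances, update_domains)
    largest_domain = math.ceil(instances / domains_in_use)
    return largest_domain / instances


for count in (2, 5, 10, 12):
    print(f"{count} instances: {worst_case_share(count):.0%} impacted per update")
```

With 10 instances and 5 update domains each domain holds 2 instances, so one update touches 20%. With fewer instances than update domains the share is higher, which is one reason to think about instance count as well as domain count.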

Note you cannot ordinarily use AZ and AS together; however, it is possible to pin an AS to a specific AZ by utilizing a proximity placement group (PPG), which is used to ensure proximity between services. Because of the increased SLA and zero cost difference, AZ is preferred over AS if it can be used by the target resource.

In summary within a region the use of AZ provides the greatest resiliency from various types of failure and should be used across services. If AZs are not usable then AS should be utilized. Avoid the use of “regional” deployments if using AZs as you have no control where the actual deployment will land and what failures may impact it.

In addition to the use of AZ or AS within a region, deployments to multiple regions should also be architected for the highest level of resiliency in an active-passive or active-active configuration. Solutions like Azure Traffic Manager (DNS-based) and Azure Front Door (HTTP-based) can be used to balance external traffic between regions if required.

Below are some additional considerations and capabilities for various fabric layers.

Network Considerations

Virtual networks are deployed to a region and are available across the entire region, i.e. they span AZs. Virtual networks are broken into virtual subnets that are also regional and are available across AZs. There is no concept of deploying a subnet to a single AZ. If a subnet needs to be aligned to a particular AZ this would have to be achieved by logically allocating subnets to AZs and then ensuring resources placed in the subnet are deployed to the corresponding AZ. Any communication to the vnet, for example connections to on-premises via ExpressRoute, would be available to the entire vnet regardless of the AZ. Zone-redundant gateway options would be leveraged to ensure the vnet connectivity could tolerate any single zone failure.

Certain network resources support AZs, primarily the standard SKUs, for example the Standard Load Balancer, Standard Public IP, Standard IP prefix and the App Gateway v2. These provide the ability to be zone-redundant and some also support zonal deployment. ExpressRoute Gateway can also deploy in a zone-redundant or zonal mode. When using a zone-redundant service it is automatically made resilient across zones by the Azure fabric and no manual steps are required once the zone-redundant option is configured. For example, a standard SKU public IP being used as the front end with a Standard Load Balancer will span zones and be resilient to any single zone failure. Services like NAT Gateway can be deployed regionally, where no zonal promise is made and the deployment can be to any datacenter in the region, or zonally, but they do not support zone-redundant deployments.

When using combinations of solutions, it is important to architect accordingly. When you use a zone-redundant component a single instance of the resource is deployed. When you use a zonal component and want the service in each AZ you must deploy an instance into each AZ, i.e. to cover 3 AZs you would deploy 3 instances of a zonal resource.

Figure 3 shows an example deployment combining a single zone-redundant SLB front-end with zonal NAT Gateway deployments. Note here logically the subnets are mapped to AZs and implemented by using zonal VMSS deployments with each VMSS deploying to the logically mapped subnet for the AZ. Each AZ's NAT Gateway is then connected to its corresponding subnet. Note zonal as opposed to zone-redundant VMSS deployment is used to target specific subnets to enable the mapping of the NAT Gateway. This also means three separate VMSS deployments are used, one for each AZ, instead of one AZ-spanning VMSS instance. In this example however all 3 VMSS instances are part of the same SLB back-end set and all receive traffic distributed from a single front-end IP. This model would apply to any other type of compute service.

Figure 3 – Example network solution using combinations of zonal and zone-redundant solutions.
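
The layout in Figure 3 can also be written down as data, which is handy when feeding a template or script. The Python sketch below is a planning aid only; the `ZoneSlice` type, the `plan_zonal_layout` helper and the resource naming scheme are all hypothetical and nothing here calls Azure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZoneSlice:
    """Everything that must be aligned to one availability zone."""
    zone: int
    subnet: str
    nat_gateway: str
    vmss: str


def plan_zonal_layout(zones=(1, 2, 3), prefix="app"):
    """Build one subnet, one zonal NAT Gateway and one zonal VMSS per AZ."""
    return [
        ZoneSlice(
            zone=z,
            subnet=f"{prefix}-subnet-az{z}",
            nat_gateway=f"{prefix}-natgw-az{z}",
            vmss=f"{prefix}-vmss-az{z}",
        )
        for z in zones
    ]


layout = plan_zonal_layout()

# The single zone-redundant SLB takes every zonal VMSS in one backend set.
backend_pool = [zone_slice.vmss for zone_slice in layout]

for zone_slice in layout:
    print(f"AZ{zone_slice.zone}: {zone_slice.vmss} -> {zone_slice.subnet} -> {zone_slice.nat_gateway}")
print("SLB backend set:", backend_pool)
```

Keeping the zone number in every name makes it easy to spot a resource that has been attached to the wrong subnet.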

If the additional complexity of requiring multiple zonal compute deployments to enable the use of NAT Gateway is not desirable the standard internal load balancer can have a public IP added to enable outbound internet connectivity for the backend set members and outbound rules used to control NAT behavior.

Given the additional complexity NAT Gateway introduces because of its zonal deployment, it is important to understand why you would use it instead of just adding a public IP to the internal SLB along with outbound rules to control the SNAT (note that WITHOUT a public IP or NAT Gateway at the subnet, any machine behind an internal SLB has no Internet access). Some of the key benefits of NAT Gateway over a public IP on an internal SLB are:

  1. LB SNAT requires ahead-of-time knowledge, planning, and tweaking for the worst-case scenario of any of the VMs, whereas NAT Gateway allocates on-demand as needed.
  2. Dynamic workloads or workloads which diverge from each other in intensity are difficult to accommodate with LB SNAT.
  3. You must explicitly join every VM NIC to the backend pool of the load balancer to use the SLB SNAT.
  4. Some customers also object to using something that also provides inbound functionality for outbound functionality.
  5. NAT is designed to be a much simpler outbound solution for entire subnet(s) of a virtual network.

Storage Overview

Azure Storage accounts have different redundancy options. The base level of redundancy is Locally Redundant Storage (LRS), which keeps three copies of the data within a physical location. Zone Redundant Storage (ZRS) distributes the three copies of the data across three AZs in the region. Geo-redundant/Geo-zone-redundant (GRS/GZRS) adds an additional three copies of the data at the paired region.
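
As a quick reference, the copy counts described above can be tabulated. The Python sketch below is just a lookup table; the names `REDUNDANCY` and `total_copies` are hypothetical, and it relies on GRS building on LRS and GZRS building on ZRS in the primary region.

```python
COPIES_PER_REGION = 3

# option -> (placement of the primary copies, extra copies in the paired region)
REDUNDANCY = {
    "LRS": ("single physical location", 0),
    "ZRS": ("spread across three AZs", 0),
    "GRS": ("single physical location", 3),
    "GZRS": ("spread across three AZs", 3),
}


def total_copies(option: str) -> int:
    """Total number of copies kept for a redundancy option."""
    _, paired_region_copies = REDUNDANCY[option]
    return COPIES_PER_REGION + paired_region_copies


for option, (placement, paired) in REDUNDANCY.items():
    print(f"{option}: {COPIES_PER_REGION} copies ({placement}), "
          f"{paired} in paired region, {total_copies(option)} total")
```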

If utilizing GRS/GZRS for storage resiliency to another region, it is important that replication and failover for other services target the same paired region. This ensures that after a failover the services are together in a single region with a functional latency between them.

Database services have different options, for example Azure SQL Database premium and business critical can be deployed in a zone-redundant configuration. Cosmos DB has an optional zone redundancy setting to provide zone-redundant deployments. These configurations are transparent to the application using the service and are accessed through a single endpoint/DNS name which is prescribed by the data service. Additionally, Cosmos DB has different consistency models and supports multi-master configurations which allow writable replicas in multiple regions.

Compute Overview

For basic VMs it is possible to deploy to a region (99.9% SLA when using premium SSD or Ultra Disk), an availability set (99.95% SLA with 2+ instances) or availability zone (99.99% SLA with 2+ instances). When deploying to AZs it is important VMs are distributed over multiple AZs.
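
To make the SLA percentages more tangible they can be converted into permitted downtime. The Python sketch below assumes a 30-day month for simplicity; the function name is hypothetical and the SLA values are the ones quoted above.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted by an SLA over the given number of days."""
    return (1 - sla_percent / 100) * days * 24 * 60


vm_slas = {
    "Single VM (premium SSD or Ultra Disk)": 99.9,
    "Availability set (2+ instances)": 99.95,
    "Availability zones (2+ instances)": 99.99,
}

for deployment, sla in vm_slas.items():
    minutes = allowed_downtime_minutes(sla)
    print(f"{deployment}: {sla}% allows about {minutes:.1f} minutes per 30 days")
```

That works out to roughly 43 minutes, 22 minutes and 4 minutes per 30-day month respectively, which shows how much each step up in isolation buys.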

Virtual Machine Scale Sets support both zonal and cross-zone deployments. When deploying across zones different balance options are available which includes a best effort zone balance or strict zone balance. Best effort will attempt to keep balance across zones but will allow temporary imbalance. Strict will not allow scale actions if the balance would be broken. For most scenarios best-effort suffices.
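
The difference between the two balance modes can be modelled roughly as follows. This Python sketch assumes that "balanced" means no zone differs from another by more than one VM; that definition and the function names are assumptions of the sketch, and it models only the decision, not the scale set itself.

```python
def is_balanced(counts: dict[int, int]) -> bool:
    """Assumed definition: zone VM counts differ by at most one."""
    return max(counts.values()) - min(counts.values()) <= 1


def can_scale(counts: dict[int, int], zone: int, delta: int, strict: bool) -> bool:
    """Decide whether adding (or removing) `delta` VMs in `zone` is allowed."""
    after = dict(counts)
    after[zone] = after.get(zone, 0) + delta
    if after[zone] < 0:
        return False
    # Strict mode refuses actions that break balance; best effort tolerates them.
    return is_balanced(after) or not strict


current = {1: 2, 2: 2, 3: 2}
print(can_scale(current, zone=1, delta=1, strict=True))
print(can_scale(current, zone=1, delta=2, strict=True))
print(can_scale(current, zone=1, delta=2, strict=False))
```

In this model strict mode rejects the second request because zone 1 would end up two VMs ahead of the others, while best effort accepts the temporary imbalance.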

AKS node pools can be deployed across zones at time of creation. The default node pool deployment controls if the AKS control plane components are deployed across zones. Note that while cross-zone node pools can be configured, they are not recommended for large stateful deployments, as race conditions can occur where the compute tries to start in AZ1 while storage may be in AZ3 because of scaling limits. The best practice is to create a node pool per zone for stateful workloads. Assuming networking for AKS is provided using Azure CNI (which enables integration with existing vnets), separate node pools per zone also allow a different subnet to be configured for each node pool, which can then be AZ aligned. Deploy the service into each node pool, which ensures the compute and storage scale together.

Note for stateless services cross-zone node pools are fine unless you wish to use NAT Gateway, in which case once again you will need a node pool per zone to enable separate subnets to be configured to keep the NAT Gateway zonal promise. When using NAT Gateway, the NAT Gateway is aligned to the subnet and not to AKS node pools directly. If NAT Gateway 1 is zonally deployed to AZ1 and linked to subnet 1, and subnet 1 is used by node pool 1, then node pool 1 would be a zonal deployment to AZ1, thus ensuring alignment. NAT Gateway 2 and node pool 2 would be AZ2 with subnet 2 and so on.

App Service Environments can be deployed as zonal. Behind the scenes ASE uses zone redundant storage (ZRS) for the remote web application file storage. At time of writing App Service Plans do not support AZs. The best option would be to deploy to two nearby regions, e.g. East US and East US 2, then balance across them using Azure Front Door (or Azure Traffic Manager).

A full list can be found at

Azure Infrastructure Update – March 2020

Just recorded the latest Azure Infrastructure update and below are the key updates. Note a weekly update is available via the subscribe action on the right-hand side of this site (hint: click Subscribe).

Let's get to it.

Azure AD B2B – Unmanaged/viral tenants will not be created after 3/31/2021. Make sure you turn on one-time-passcode (OTP).

Azure Security Center

  • Now integrated with Windows Admin Center
    • Onboard OS instances to ASC via the WAC extension
    • Security alert and recommendations surfaced
  • Identity and Access recommendations in free tier
  • Azure Container Registry scanning
  • Azure Kubernetes Service (AKS) protection

Azure Cloud Shell – Now has additional regions (secondary) for the storage of the shell persistent data. The compute will still be in one of the primary regions but now the data-at-rest can be in a region that may help you meet certain compliance requirements.

Azure Networking

  • NAT Gateway GA
  • Azure Storage and Azure SQL Database Private Link GA
  • Azure Data Explorer cluster deployment into custom virtual network now possible providing integration with NSG and other connectivity to the vnet.

Azure Front Door

  • Wildcard hosts/domains
  • Configurable idle timeout
  • Configurable minimum TLS versions
  • Health probe configuration
  • Lockdown with new X-Azure-FDID
  • Disable backend certificate name check

Azure Storage – Blob immutability has GA’d. Enables WORM to blobs (write once, read many) based on legal or time locks.

Azure Dedicated Host – Now has additional hardware types available focused on general use, memory optimized and storage-intensive. All based on the AMD EPYC processor except the Msv2 new addition.

Well that’s it! Please subscribe to the YouTube channel and take care in these crazy times and see you soon!


Mid-March 2020 Azure Infrastructure Update

Just recorded the latest Azure Infrastructure update and below are the key updates. Note a weekly update is available via the subscribe action on the right-hand side of this site (hint: click Subscribe).

Let's get to it.

PowerShell 7 was released.

New Azure AD App experience is available. It can be enabled for all users or specific groups of users.

Azure Security Center has two updates.

  • Continuous export available
    • Trigger alerts via Azure Monitor through Log Analytics export
    • Can also send to 3rd party SIEM via event hub
  • Just-in-time experience updates
    • Justification field available during access request
    • Cleanup of redundant rules that used to be left behind as part of NSGs

Azure Backup now has Backup Reports to provide tracking and auditing of backup and restore jobs.

Two new Virtual Machine SKUs.

  • NDv2 GPU VMs GA
    • High-end deep learning training and HPC
    • 8 NVIDIA Tesla V100 NVLink interconnected GPUs
  • NVv4 VMs GA
    • GPU accelerated graphics applications and virtual desktops
    • AMD Radeon Instinct MI25 GPU
    • Various sizes with partial GPU support

VM Scale Sets (VMSS) has a number of new capabilities.

  • Automatic repair based on application health via a load balancer health probe or the application health VMSS extension
  • Scale-in policy to control how scale-in is performed for example default (based on balancing first), newestVM or oldestVM
  • Instance protection to protect specific instances during scale-in or other types of action
  • Instance termination notifications through scheduled events along with configurable delay.

And finally Cosmos DB introduces a free tier that is available once per subscription giving 400 RU/s and 5GB of storage. Additionally, dynamic scale is now available via autopilot, which makes whatever RUs are needed available in real time up to a defined limit. Autopilot can work with the free tier, meaning you would only pay for RU/s over 400.
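
The interaction between autopilot and the free tier is simple arithmetic. The Python sketch below is a deliberately simplified model that ignores any minimum throughput or pricing detail; the function names are hypothetical and only the 400 RU/s figure comes from the announcement.

```python
FREE_TIER_RUS = 400  # RU/s covered by the free tier


def autopilot_rus(demand_rus: int, autopilot_max_rus: int) -> int:
    """Throughput made available: whatever is needed, up to the defined limit."""
    return min(demand_rus, autopilot_max_rus)


def billable_rus(demand_rus: int, autopilot_max_rus: int) -> int:
    """RU/s you would pay for once the free tier allowance is subtracted."""
    return max(0, autopilot_rus(demand_rus, autopilot_max_rus) - FREE_TIER_RUS)


for demand in (300, 1000, 10000):
    print(f"demand {demand} RU/s -> billable {billable_rus(demand, 4000)} RU/s")
```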

Well that’s it! Please subscribe to the YouTube channel and take care in these crazy times and see you soon!


Small Script to Grant Azure AD Roles to Groups

Today it is not possible to grant roles in Azure AD to groups, and support for dynamic groups is not likely anytime soon. I created a little script that grants a role to all users in a group. It checks and only adds the role to users in the group who don’t already have it (by using the PowerShell Compare-Object command). Simply call the function, passing the name of the role to grant and the group whose members should be assigned the role. Note I do not remove the role if someone is removed from the group. That would be easy to do, however it would remove anyone not in the group, which may not be what you want since you may assign roles in other ways and not just via a single group membership.

For example:

Add-RoleToGroup "global reader" "group name"

Function Add-RoleToGroup
{
    param (
        [Parameter(Mandatory=$true, Position=0)]
        [string]$RoleName,
        [Parameter(Mandatory=$true, Position=1)]
        [string]$GroupName
    )
    #$roleName = "global reader"
    #$groupName = "group name"

    Write-Output "Granting $RoleName to $GroupName"

    $errorFound = $false

    # Note that only roles that are enabled, i.e. have at least one person in them, will be found using this command so ensure at least one person already holds the desired role
    $roleObject = Get-AzureADDirectoryRole | Where-Object {$_.DisplayName -eq $RoleName}
    if($null -eq $roleObject)
    {
        Write-Output "Cannot find role $RoleName, it may be it is not enabled. Ensure the role already has at least one person in it"
        $errorFound = $true
    }

    $groupObject = Get-AzureADGroup -SearchString "$GroupName"
    if($null -eq $groupObject)
    {
        Write-Output "Cannot find group $GroupName"
        $errorFound = $true
    }

    # Only continue if both the role and the group were found
    if(-not $errorFound)
    {
        $groupMembers = Get-AzureADGroupMember -ObjectId $groupObject.ObjectId -All $true #| Select-Object -ExpandProperty UserPrincipalName
        $roleMembers = Get-AzureADDirectoryRoleMember -ObjectId $roleObject.ObjectId #| Select-Object -ExpandProperty UserPrincipalName

        # "<=" marks objects found only in the first collection, i.e. group members without the role
        $userDifferences = Compare-Object $groupMembers $roleMembers

        foreach($userDifference in $userDifferences)
        {
            # if need to add
            if($userDifference.SideIndicator -eq "<=")
            {
                Write-Output "Adding $($userDifference.InputObject.UserPrincipalName) to role"
                try
                {
                    Add-AzureADDirectoryRoleMember -ObjectId $roleObject.ObjectId -RefObjectId $userDifference.InputObject.ObjectId
                }
                catch { "Error adding role" }
            }
        }
    }
}
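
The heart of the script is a one-sided difference: take everyone in the group and keep only those who do not yet hold the role. The same idea is sketched below in Python, offline and with made-up data, purely to show the logic. The function name, the dictionary shape and the sample users are hypothetical and no Azure AD call is made.

```python
def users_to_add(group_members: list[dict], role_members: list[dict]) -> list[dict]:
    """Return group members missing from the role (the '<=' side of the comparison).

    Users who hold the role but are not in the group are deliberately ignored,
    mirroring the decision above not to remove roles.
    """
    role_ids = {member["ObjectId"] for member in role_members}
    return [member for member in group_members if member["ObjectId"] not in role_ids]


# Hypothetical sample data standing in for the directory query results.
group = [
    {"ObjectId": "1", "UserPrincipalName": "ann@example.com"},
    {"ObjectId": "2", "UserPrincipalName": "bob@example.com"},
]
role = [
    {"ObjectId": "2", "UserPrincipalName": "bob@example.com"},
    {"ObjectId": "3", "UserPrincipalName": "cat@example.com"},
]

for user in users_to_add(group, role):
    print("Would add", user["UserPrincipalName"])
```

Comparing on the object ID rather than on the whole object keeps the result stable even if other attributes of a user differ between the two queries.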


Full Azure Data Engineer Associate Learning Track Available

Over Q2 and Q3 of 2019 I have been working on a series of courses to cover the content required to pass exams DP-200 and DP-201 that once passed award the Azure Data Engineer Associate certification. I completed the final two courses and now the 11 part learning track is available. It will shortly be available as a learning path on both Pluralsight and but wanted to link to all 10 courses and the order that I recommend you watch them in now. They are all free. You just need to sign up for a free Pluralsight account via

Note, I felt I should put my money where my mouth is so took both the exams yesterday back-to-back and am pleased to say I passed them both (otherwise this would be a much sadder blog post) ;-). Also during the 45 minutes it took me to complete DP-200 there was a fire alarm test of my facility by the fire marshal and so the entire time I had a siren blaring and strobe light while trying to focus. That was fun. I felt like Robin Hood where he is trying to make the shot and everyone is distracting him 🙂 Also they shut down the AC during the tests so it was 85 degrees.


Hell’s Cloud Ops

Been watching Hell’s Kitchen in the background while working on some projects and I think it would make an awesome cloud operations show and a fun way to communicate some core concepts. Imagine…..

Chef in calm voice – OK team, today we are working on providing a tasty SQL service for our customer that will be used from a fairly basic application. Off you go.

<contestants scurry off to their workstation areas>

<chef wanders over to Bob>

Chef angry – Bob, WHAT ARE YOU DOING?

Bob – I’m creating each VM to be part of the SQL cluster I’m creating

Chef furious – You’re creating each VM one at a time in the portal???? Oh my god! Is your computer made of red and yellow plastic with “My first” written on the top of it? At least I see you’re using Availability Sets for some resiliency but this is ridiculous. How will you ensure consistency? How will you scale to creating 50 instances of this? How would this integrate with DevOps? Start again, use Infrastructure as Code and if I see you in a portal that mouse will be going where the sun doesn’t shine.

Bob – Yes chef!

<15 minutes later Bob presents his template>

Chef – OK, nice template, good resources. Oh no no no no. What have you done????? WHY HAVE YOU HARD CODED values in the resources section??? WHERE IS THE PARAMETER FILE?? How are you going to change control this? How will you deploy this between different environments, deploy between different instances? You donkey! Take environment specific values out of the template and get them in a parameter file! Then you have one change-controlled template. Environment and instance specific values are completely separate! IDIOT! FIX THIS!

<5 minutes later Bob returns>

Chef – Let's see. Good parameter use, let's look at the parameter file. DONKEY! Are you here to destroy the company??? WHYYYYY do you have the administrator password in the parameter file???

Bob – I needed it to join the machines to the domain via the domain join extension chef

Chef – And you felt the best way to do that was to place that password in the file that you then uploaded to a repository??? Your company's most important password is now known to everyone and a group of teenagers has taken over your company, your wife has left you and your kids pretend they are adopted they are so embarrassed. Good luck stocking vending machines after destroying your company. IDIOT! Where would be a better place do you think? CAN YOU THINK?

Bob – Azure Key Vault chef

Chef – Can you do that? Are you capable? DO IT! And heaven help you if you forget to update the vault’s advanced access policy to allow use of the secret from ARM template deployments.

<5 minutes and Bob returns>

Chef – Let's see how you can ruin my day now. This is acceptable. Will work well. Nice use of secrets. I see you even created a release pipeline. Now tell me, why didn’t you just use Azure SQL Database?

<A small tear rolls down Bob’s cheek and credits roll>

Using AD extensionAttributes in Azure AD

I had a value in one of my extensionAttributes in AD populated with data I needed to leverage in Azure AD dynamic groups. The specific attribute was extensionAttribute5. Without doing anything else this attribute is replicated to Azure AD and can be used as part of a dynamic group. For example I created a rule:

(user.extensionAttribute5 -contains "Chief Technical Architect")

However I was unable to see this value by looking at users through the PowerShell AzureAD module. The values are visible through the Exchange Online PowerShell environment, however I wanted to leverage Azure AD PowerShell. I therefore added the attributes as part of the Azure AD Connect replication (note I also added one of the msDS-cloudExtensionAttributes to show another attribute available):



Once replicated you are now able to view the values as shown:

PS Azure:\> Get-AzureADUser -ObjectId | Select-Object -ExpandProperty ExtensionProperty

Key                                                  Value
---                                                  -----
odata.type                                           Microsoft.DirectoryServices.User
createdDateTime                                      9/26/16 6:32:37 PM
userIdentities                                       []
extension_391c602828_msDS_cloudExtensionAttribute1   Chief Technical Architect
extension_391c602828_extensionAttribute5             Chief Technical Architect

If you need a specific value then reference it by its full name as shown above (note your name will be different), for example:

(Get-AzureADUserExtension -ObjectId“extension_391c602828_extensionAttribute5”)

Deploying Agents to Azure IaaS VMs using the Custom Script Extension

In an ideal world organizations should try to avoid creating custom images with their own special agents and configurations. Custom images mean a lot of image management, as each time an agent is updated the image has to be updated in addition to the normal patching of OS instances. The Azure marketplace has a large number of OS images that are kept up-to-date which should be used if possible, with any customization performed on top of that. I recently had a Proof of Concept where a number of agents needed to be deployed post VM deployment along with other configurations. Items such as domain join can be done with the domain join extension but for the other agent installs we decided to use the Custom Script Extension to call a bootstrap script which would do nothing other than pull down all content from a certain container using azcopy.exe and then launch a master script. The master script would be part of the downloaded content and would then perform all the silent installations and customizations required.

A storage account is utilized with two containers:

  • Artifacts – This contains the master script and all the agent installers etc. This could use a zip file to enable a structure to be maintained of the various agents and the master script could unzip at the start
  • Bootstrap – This contains azcopy.exe (in my case version 10) and the bootstrap.ps1 file that does nothing other than call azcopy to copy everything from the artifacts container to the c:\ root, then launch the master script from the local copy

Below is my example bootstrap.ps1 file. Notice it has one parameter, the URI of the container which will be the shared access signature enabling access.

Azcopy.exe was downloaded and copied to the bootstrap container along with the bootstrap.ps1 file. In my case there is nothing sensitive in the file and so I made the container public. This avoids having to have an access key as part of my ARM template that would ultimately call this script.

All the installers and the master script were uploaded to the artifacts container. For this container I wanted a shared access signature (SAS) that would give read and list rights. The idea would be some automation would generate a new SAS each week and write it to a secret in key vault that only people who should deploy had access to. The SAS would have a lifetime of 2 weeks to overlap with the newly generated one. In addition to generating and storing the complete SAS I needed a second version that was escaped for cmd.exe. This is because the SAS has & in it, which was being interpreted during my testing, breaking its use. I tried to use no parse (--%) but this did not work since it was being called by cmd.exe, therefore the escape is to use ^&. The script below generates the SAS and the escaped SAS and writes both versions as secrets to key vault.
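
Separately from the original script, the escaping and rotation logic described above can be illustrated offline. The Python sketch below only handles the & character mentioned in the text; the function names and the sample SAS URL (including its placeholder signature) are hypothetical, and nothing here talks to Azure Storage or Key Vault.

```python
from datetime import datetime, timedelta, timezone


def escape_for_cmd(sas_url: str) -> str:
    """Prefix each & with ^ so cmd.exe passes it through literally."""
    return sas_url.replace("&", "^&")


def sas_expiry(issued: datetime, lifetime_days: int = 14) -> datetime:
    """Expiry time for a SAS issued at `issued`."""
    return issued + timedelta(days=lifetime_days)


def overlap_days(lifetime_days: int = 14, rotation_days: int = 7) -> int:
    """Days during which the previous SAS is still valid after a new one is issued."""
    return lifetime_days - rotation_days


# Hypothetical SAS URL with a placeholder signature.
sas = "https://example.blob.core.windows.net/artifacts?sv=2019-02-02&sr=c&sp=rl&sig=PLACEHOLDER"

print(escape_for_cmd(sas))
print(sas_expiry(datetime(2020, 3, 1, tzinfo=timezone.utc)).date())
print("Overlap in days:", overlap_days())
```

Storing both the raw and the escaped value, as described above, means the raw SAS stays usable from tools that do not go through cmd.exe.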

Once this was done I now had a SAS available in the key vault that would give read and list to the artifacts container. Remember to configure the Access Policy on the vault to enable use of secrets from ARM templates (advanced settings) and additionally for the users/groups to have access to the secret. A test of this process to my local machine worked, i.e.

Next I tried calling as I would via the Custom Script Extension, which with the escaped version worked great (note it's the escaped URL as this will get expanded in the template).

Initially my test was to an existing Azure VM so I used the following (note I’m getting the escaped version of the secret from Key Vault):

Once this worked I finally created an ARM template that included a reference to the secret and all worked as planned.

The parameter file (note I also get a secret to join the domain even though I’m not using the domain join extension in this example):

The actual template (note in the CSE extension at the end I need the single quotes around the URI or it once again tries to interpret it, so you have to use two, i.e. '', to get one ' when it actually executes):
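
The quote doubling from the note above can be demonstrated on its own, independently of the original template. The Python sketch below builds an ARM-style concat expression as a string; the helper names, the command line and the `artifactsUri` parameter name are hypothetical and are not taken from the author's template.

```python
def arm_literal(text: str) -> str:
    """Wrap text as an ARM expression string literal, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def command_expression(script: str, parameter_name: str) -> str:
    """Build a concat() expression that wraps a parameter value in single quotes."""
    prefix = f"powershell -ExecutionPolicy Unrestricted -File {script} '"
    closing_quote = "'"
    return (
        f"[concat({arm_literal(prefix)}, "
        f"parameters({arm_literal(parameter_name)}), "
        f"{arm_literal(closing_quote)})]"
    )


print(command_expression("bootstrap.ps1", "artifactsUri"))
```

A lone quote therefore shows up in the expression as four quote characters: two to delimit the literal and two for the doubled quote inside it.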

And the execution (note the network and RG already existed in my environment).

Hope this helps!