Optimizing Your Azure Spend!

Also known as “how to save money” or if you are Mr Krabs “Me Money!!!!!”. Note, the video version of this is over at https://youtu.be/RjuTQvGm1zQ.

Spending money wisely is always important for every company, but in challenging times it is even more critical to ensure we are spending as optimally as possible everywhere we can, including the cloud. In this blog I want to walk through ways to think about optimizing your spend in Azure.


Azure uses a consumption-based billing model: you pay based on what you use. A key aspect of cloud use is its elasticity, and as will be explored throughout this document you should always be looking at ways to dynamically scale your resource use based on that moment in time's workload. There are many service options and configuration choices that will influence how consumption is tracked and the associated billing. Many organizations migrating to Azure will initially perform an analysis of the current workload, looking for peaks and averages of resource utilization to best size the cloud counterparts. This step, while very important, is just that: a first step on a journey that never ends as requirements change and service options evolve.

A key aspect of the cloud is scale up vs scale out, i.e. vertical vs horizontal scaling, to achieve resource elasticity. Often to provide a service with more capacity we make an instance bigger: more CPU, more memory etc. While this increases scale, it does not increase resiliency, and this type of vertical scaling often requires restarts of the instance, causing downtime. The preferred approach is to scale out/horizontally by adding additional instances. This increases scale while also increasing resiliency by having multiple instances distributed over racks/datacenters. This type of scaling can also be done without downtime as instances can be added/removed from the set and, if used, added/removed from a load balancer backend set to handle incoming requests. When optimizing compute, the focus is often on horizontal scaling and automations to drive the scale actions via scheduling or certain metric thresholds such as processor use percentages or queue depths. Scaling vertically can be an option in certain situations. For example, you may have a single-VM workload that is a DR instance just receiving application transactions via replication. In this case it could be very small. In the event of a disaster, when it becomes active, it could be resized to provide greater capacity, and the few minutes of additional downtime may be worth the cost saving for the majority of the time. Likewise, some data workloads can be resized, which may make sense if their workload is usually light but an end-of-month batch requires 10x the normal processing. In this case resize the instance ahead of the large batch process and then shrink it once complete.

When evaluating architecture take care to understand all the options available and use the Azure Pricing Calculator (http://aka.ms/azurepricingcalculator) to get a high-level idea of what the costs will look like to aid in initial planning. When calculating costs make sure to consider all factors including monitoring, backups and disaster recovery (DR). Note that the consumption-based nature of Azure provides a compelling DR story as most of the DR footprint would only be started and therefore billed during an actual failover (test or real).

How to View Azure Spend

The first step is to understand how you are spending money in Azure today. There are a number of ways to identify spending, including the billing APIs and exporting data for use in visualization and analysis tools (like Power BI), however an initial starting point that gives great insight and investigation capabilities is the Azure Portal, specifically the cost analysis view that is a core part of Azure Cost Management. This shows the current cost and a trend analysis of what will be spent based on the current trend. Costs are broken down by service, location and resource group with the ability to add additional filters and groupings to best visualize for your needs.


Also within Azure Cost Management are budgets, which allow a dollar (and metric) threshold to be created for a scope such as a subscription, resource group or even management group. Alerts can be configured to trigger at certain thresholds of that budget; for example, at 80% an alert is generated that could send an email, and at 90% an email plus a function trigger to send the owner a report of their spending. Action groups are used for the threshold actions, which allows the full capability set of action groups to be utilized. This can be very useful to bring awareness. Note that stopping resources at 100% is typically avoided as this would impact the services provided, but it may be an option for test/dev and could be achieved by triggering an action group that calls an automation.
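To make the threshold behavior concrete, here is a minimal sketch of the alert logic a budget expresses. This is purely illustrative (in Azure you configure this in Cost Management, not code), and the threshold percentages and actions are hypothetical examples.

```python
# Illustrative sketch of budget alert thresholds (not an Azure API).
def triggered_alerts(budget: float, spend: float, thresholds: dict) -> list:
    """Return the actions for every threshold the current spend has crossed."""
    pct = spend / budget * 100
    return [action for level, action in sorted(thresholds.items()) if pct >= level]

# Example: a $10,000 monthly budget with the 80%/90% alerts from the text.
alerts = triggered_alerts(
    budget=10_000,
    spend=9_200,
    thresholds={80: "email owner", 90: "email + trigger spend-report function"},
)
print(alerts)  # both thresholds crossed at 92% of budget
```

In the real service each triggered threshold fires an action group, which is where the email, function call or automation is actually wired up.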

Notice that Azure Cost Management can scope to all the key organizational structure components such as management groups, subscriptions and resource groups, which may influence how you structure resources. Additionally, tags are a key attribute that can be used in cost analysis, which means your structure does not have to match your billing analysis requirements; just ensure you have a good tagging standard that provides metadata to meet filtering and identification requirements.

Always Free Services

Azure AD forms the foundation of authentication and authorization in Azure and a free SKU exists for users. For organizations using the “security defaults” configuration on their Azure AD tenant even free users can leverage MFA in a pre-configured manner. For more granular control the premium SKUs would be required.

Resources live in Resource Groups that reside in Subscriptions, which in turn can be arranged in a Management Group hierarchy. These constructs have no associated costs and can be used for role-based access control (RBAC), policy assignment and budgets, which also have no associated costs, i.e. RBAC, Azure Policy and Azure Cost Management are all free for use with Azure. Note that while there are no costs associated with subscriptions, how they are used will impact resource architecture and deployment, which could impact the overall cost of solutions. As an example, a virtual network is bound by region and subscription. If an architecture used many subscriptions, resulting in many virtual networks, then communication between those virtual networks would require network peering, which has ingress and egress charges that vary depending on whether the networks are in the same or different regions (https://azure.microsoft.com/en-us/pricing/details/virtual-network/). If a single subscription was used, then a single virtual network could be used within the region with no ingress/egress charges for communication within that virtual network.

There are other services that are free, or have a certain amount of free usage, perpetually (unlike offers that are only free for a limited time such as 12 months). These always-free services are outlined at https://azure.microsoft.com/en-us/free/ and include (but are not limited to):

  • 5 GB network egress
  • 1 million requests and 400,000 GB-s of resource consumption with Azure Functions
  • 100,000 operations for event publishing and delivery with Event Grid
  • 50,000 active B2C users
  • 5 users of Azure DevOps
  • 400 RUs and 5 GB storage with a free Cosmos DB account (1 per subscription)
  • Free policy assessment and recommendations with Azure Security Center
  • Recommendations and best practices via Azure Advisor

Migration Planning

One of the first steps to optimizing costs happens before moving anything to Azure. While familiarizing yourself with the service options in Azure is critical, so too is having a good understanding of the workloads moving to Azure, both to ensure compatibility and to right-size. Very often on-premises workloads are provisioned beyond their actual requirements, as hypervisors allow over-subscription of resources such as CPU, memory, storage and networking, and tools may not exist to gain insight into true requirements.

Azure Migrate (http://aka.ms/azuremigrate) provides support for the entire migration workflow. This includes discovery of workloads, their dependencies, resource utilization and then tools to migrate OS instances and SQL databases.

Note that this initial rightsizing is just the first step of cost optimization. Once the resources are in Azure, whether migrated or newly deployed, the optimization effort continues for their entire lifetime as usage patterns and service options change.

Reserved Instances and Azure Hybrid Benefit

There are many ways to optimize the use of resources and therefore cost however there are two mechanisms directly related to pricing of services that should be understood and used where possible.

The first is the Azure Hybrid Benefit. Several Azure services leverage Microsoft products, such as VMs running Windows Server and/or SQL Server, in addition to Azure SQL Database, and many organizations already have licenses procured for on-premises use. Applicable products that have Software Assurance enable the Azure Hybrid Benefit, which reduces pricing on the corresponding Azure service. For example, a 16-core Windows Server Datacenter license enables two 8-core Windows Server Azure VMs to save up to 40%, essentially billing the same as a Linux VM as the Windows Server license cost is removed. Note that Datacenter licenses (unlike Standard) can still be used on-premises in addition to providing the Azure cost saving. Likewise, SQL Server licenses with Software Assurance enable reduced rates on vCore-based Azure SQL Database offerings. Make sure that if you have hybrid benefits you are using them with applicable Azure resources. For more information see https://azure.microsoft.com/en-us/pricing/hybrid-benefit/
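The licensing math above can be sketched as follows. The 40% figure is the "up to" saving quoted in the text; actual savings vary by VM series, region and current pricing, so treat the rate below as an assumption for illustration.

```python
# Rough illustration of the Azure Hybrid Benefit arithmetic (assumed rates).
def windows_vms_covered(datacenter_license_cores: int) -> int:
    """A 16-core Windows Server Datacenter license covers two VMs of up to
    8 cores each, i.e. one 8-core VM per 8 licensed cores."""
    return datacenter_license_cores // 8

def hourly_cost_with_ahb(windows_rate: float, saving_pct: float = 0.40) -> float:
    """Hybrid Benefit removes the Windows license portion of the VM rate.
    The 40% default is the 'up to' saving, not a guaranteed figure."""
    return windows_rate * (1 - saving_pct)

print(windows_vms_covered(16))     # 2 VMs of up to 8 cores each
print(hourly_cost_with_ahb(0.50))  # an assumed $0.50/hr Windows VM billed at $0.30/hr
```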

The second vehicle for pricing reduction is Azure Reservations (formerly known as Reserved Instances). The cloud is great for dynamic workloads where you pay only for what you use, however most organizations will still have a certain amount of “steady state” resource that is always running. Reservations enable an organization to commit to using a certain quantity of a certain series/type of resource over a 1- or 3-year period. While initially this was only for virtual machines, it now covers many types of resource including compute, storage and database offerings, as detailed in the link below. The commitment leads to a pricing discount (based on the length of commitment) on running resources of the covered series, up to the quantity committed to (additional resource use beyond that would not have the discount applied). There is no configuration required; it is purely a billing mechanism that runs on the hour and applies the discount automatically. Note that if you do NOT have the committed quantity running you are still billed for the reserved amount each hour, which is why it is important to take time to identify the optimal commitment numbers. It is, however, possible to convert reservations between series (where applicable) as needs evolve. You can track the utilization of reservations via the Reservations page in the Azure Portal. More information at https://docs.microsoft.com/en-us/azure/cost-management-billing/reservations/save-compute-costs-reservations
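The billing behavior described above can be sketched as a simple hourly model: the reserved quantity is charged every hour whether or not matching instances run, and the discount only covers usage up to that quantity. The rates here are invented for illustration.

```python
# Sketch of hourly reservation billing (assumed, made-up rates).
def hourly_bill(running: int, reserved: int, payg_rate: float, ri_rate: float) -> float:
    covered = min(running, reserved)             # discounted by the reservation
    uncovered = max(running - reserved, 0)       # overflow bills at pay-as-you-go
    idle_reserved = max(reserved - covered, 0)   # still billed even if unused
    return covered * ri_rate + idle_reserved * ri_rate + uncovered * payg_rate

# 10 instances reserved at an assumed $0.06/hr vs $0.10/hr pay-as-you-go:
print(hourly_bill(running=12, reserved=10, payg_rate=0.10, ri_rate=0.06))  # 0.80
print(hourly_bill(running=6,  reserved=10, payg_rate=0.10, ri_rate=0.06))  # 0.60 - full reservation billed
```

Note how the second case still pays for the full reserved quantity, which is why sizing the commitment to your true steady state matters.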


Tagging and Accountability

It is important to be able to associate resources with their owner, project, and cost center for accountability, tracking and potentially charge-back purposes. Azure has several mechanisms to enable this. While subscriptions can be used as a boundary, resource groups are, by design, the best container for resources with a commonality; for example, they are parts of the same application and will be created together, run together and ultimately be decommissioned together. To enable assignment of resources, metadata should be configured on them, i.e. tags. Tags in Azure are key:value pairs that can be configured on all Azure (ARM) resources. They were previously mentioned as an aid to cost filtering but can be used for much more. Very common uses of tags include storing metadata about:

  • Owner
  • Application Name
  • Project
  • Cost Center
  • Environment (for example dev, test, production)
  • Resource state (e.g. malware definition version, backup agent etc)

Full guidance can be found at https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/ready/azure-best-practices/naming-and-tagging#metadata-tags in the Azure Cloud Adoption Framework. Using tags, optionally in combination with resource groups/subscriptions, it is possible to track resource ownership for accountability purposes such as cost and adherence to regulatory/organizational requirements. To ensure required tags are populated, Azure Policy should be used at parent management groups (to be inherited by child resources) to mandate that key tags are populated, along with allowed values where appropriate. Azure Policy can also be used to configure resources to inherit tags from their resource group if required.
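The kind of check Azure Policy performs for mandatory tags looks roughly like this. In practice you would assign a built-in or custom policy definition rather than write code; the tag names and allowed values here are hypothetical examples.

```python
# Sketch of a mandatory-tag check (hypothetical tag standard, not Azure Policy itself).
REQUIRED_TAGS = {"Owner", "CostCenter", "Environment"}
ALLOWED = {"Environment": {"dev", "test", "production"}}

def tag_violations(resource_tags: dict) -> list:
    """Return a list of problems: missing required tags and disallowed values."""
    issues = [f"missing tag: {t}" for t in sorted(REQUIRED_TAGS - resource_tags.keys())]
    for key, allowed in ALLOWED.items():
        if key in resource_tags and resource_tags[key] not in allowed:
            issues.append(f"disallowed value for {key}: {resource_tags[key]}")
    return issues

print(tag_violations({"Owner": "jane", "Environment": "prod"}))
# flags the missing CostCenter tag and the non-standard 'prod' value
```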

Optimizing Service Use

In this section a number of key Azure services will be explored along with some key ways to optimize their use.

Note, where possible we think about moving workloads along the following progression:

VM -> VMSS -> Containers -> AKS with containers -> App Service Plans -> Serverless (Functions/Logic Apps)

The further to the right we move, the less responsibility we have for OS components, platform components and eventually, with serverless, even resource instances. Moving right also exposes additional opportunities for resource optimization in addition to indirect cost savings, as less responsibility reduces operational expenses.

Virtual Machines and Virtual Machine Scale Sets

There are many different series and sizes of virtual machines available (https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes). Each series has different characteristics: some offer a balanced blend of CPU, memory, network and storage, some are compute focused, some memory focused, some storage focused, and some have special hardware such as GPUs. Take time to understand the combinations available that best suit your requirements. Additionally, if you have workloads that normally run at a fairly low steady state but need to burst for short periods of time to higher resource utilization, the B, or burstable, series may be a good fit. The B series provides a portion of processor resources to the instance, and if less than that portion is used, credit is accrued which can then be spent in burst situations. For example, a domain controller typically has low resource utilization but will have a burst of activity at the start of the day when users log in.
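A toy model makes the credit mechanism clearer: below the baseline the VM banks credits, above it the VM spends them. The baseline, usage figures and credit cap below are illustrative assumptions, not actual B-series numbers.

```python
# Toy model of B-series CPU credit accrual/spend (illustrative numbers only).
def simulate_credits(baseline_pct: float, usage_pcts: list, max_credits: float) -> float:
    """Track the credit balance hour by hour, clamped to [0, max_credits]."""
    credits = 0.0
    for used in usage_pcts:
        credits += baseline_pct - used   # accrue below baseline, spend above
        credits = min(max(credits, 0.0), max_credits)
    return credits

# Quiet overnight hours bank credits; the morning login burst spends them.
remaining = simulate_credits(baseline_pct=20, usage_pcts=[5, 5, 5, 50], max_credits=100)
print(remaining)  # 15.0 credits left after the burst hour
```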

In addition to picking the correct size for workloads, ensure they are only running when required. If it is a test/dev workload, make sure it shuts down at night and over weekends. This can be automated through various Azure automation capabilities or directly as part of the VM configuration through the auto-shutdown setting (which leverages Azure DevTest Labs behind the scenes and enables broader shutdown policies to be leveraged).

Remember that a virtual machine uses multiple resources: a network adapter, an OS managed disk, optionally several data managed disks, perhaps a public IP. Just deleting the VM will not delete these resources, which means you are still paying for them. Make sure all associated resources are deleted once you are sure they are no longer required. Well-organized resource groups and naming will help you identify related resources and those that may now be orphaned, i.e. not used. I have a basic script, Get-AzureUnusedResources.ps1, available on my GitHub that will identify unused managed disks and public IPs and, once complete, provide commands to remove them. Note you should check to ensure the resource really is unused! Some services like ASR will have unmapped disks (but they should have ASR in the name), and you may have a public IP you are keeping because you want to retain its static IP address.

Below are two Azure Resource Graph queries that find unattached managed disks and public IPs not mapped to a VM/load balancer (but still check they are not used by a NAT gateway).

Resources
| where type =~ 'Microsoft.Compute/disks'
| where managedBy =~ ''
| project name, resourceGroup, subscriptionId, location, tags, sku.name, id

Resources
| where type =~ 'Microsoft.Network/publicIPAddresses'
| where properties.ipConfiguration =~ ''
| project name, resourceGroup, subscriptionId, location, tags, id

When using scale-out and multiple instances of a service to provide scale and resiliency ensure you are performing scaling actions to add and remove instances based on need. Many types of service have peak hours, days and even times of the year. Ensure the number of instances match the demand on the service. While automations can be created to perform this on regular VMs by looking at host and/or guest resource utilization and other metrics (such as queue depths), the built-in solution in Azure is the VM Scale Set or VMSS.

VM Scale Sets are a configuration that includes:

  • A base gold image, for example the base OS
  • Configuration of the VM, for example series, size and any extensions such as custom script extension to perform initial configuration
  • Instance and scale configuration which specifies minimum and maximum instance numbers in addition to triggers for scale activities to occur such as metric thresholds, like a CPU percentage, a schedule or other external trigger. More advanced scaling is possible through various engines which can include integration with technologies like cloud-init to customize instances during scale provisioning actions
  • IP configuration for linked load balancer
  • Optional advanced configurations such as model to use for scale-in actions, protection of certain instances and more
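The metric-threshold scaling described in the instance and scale configuration above boils down to logic like the following sketch: scale out above a CPU threshold, scale in below one, clamped to the configured minimum and maximum instance counts. The thresholds here are example values, not defaults.

```python
# Sketch of a VMSS metric-based autoscale rule (example thresholds).
def next_instance_count(current: int, cpu_pct: float,
                        scale_out_above: float = 75, scale_in_below: float = 25,
                        minimum: int = 2, maximum: int = 10) -> int:
    """Return the instance count after one evaluation of the scale rules."""
    if cpu_pct > scale_out_above:
        current += 1
    elif cpu_pct < scale_in_below:
        current -= 1
    return max(minimum, min(current, maximum))

print(next_instance_count(current=4, cpu_pct=85))  # 5 - scale out
print(next_instance_count(current=2, cpu_pct=10))  # 2 - already at the minimum
```

Real autoscale rules add cool-down periods and can step by more than one instance, but the clamp-to-bounds shape is the same.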

Understanding the criticality of different workloads is also important when optimizing cost. For example, there may be computation that needs to be completed but is not time critical and you would like to complete as cheaply as possible. Azure often has spare capacity and rather than have the capacity sitting idle it is offered at reduced rates. The exact rate depends on the amount of spare capacity of that VM series, size, region, time of day and other factors. This means the price is variable and constantly changing. This spare capacity is exposed through the Azure Spot instance option for VMs and VMSS. You can specify a maximum price you are willing to pay, and your spot instances will run until that price is exceeded or you can say run until the capacity is needed for regular workloads. Note you cannot mix spot instances and non-spot within a single VMSS. Instead you would have two VMSS instances that could be part of the same load balancer service. Azure Batch is also transitioning to enable the use of spot instances.
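The two eviction behaviors described above can be summarized in a few lines: evict when the variable spot price exceeds your maximum, or (with a max price of -1, meaning pay up to the pay-as-you-go rate) only when Azure needs the capacity back. The prices below are invented for the example.

```python
# Sketch of spot instance eviction logic (illustrative prices).
def spot_evicted(spot_price: float, max_price: float, capacity_reclaimed: bool) -> bool:
    """True if the spot instance would be evicted under the given conditions."""
    if capacity_reclaimed:
        return True       # always evicted when Azure needs the capacity back
    if max_price == -1:
        return False      # -1 = run until capacity is reclaimed, price-capped at PAYG
    return spot_price > max_price

print(spot_evicted(spot_price=0.045, max_price=0.04, capacity_reclaimed=False))  # True
print(spot_evicted(spot_price=0.045, max_price=-1,   capacity_reclaimed=False))  # False
```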

For stateless VMSS deployments that require shared storage you have several choices including Azure Files, Azure NetApp Files, shared managed disks and even Azure Blob (using Blobfuse to mount on Linux). As always you should examine the options to understand what meets requirements at the optimal price point. Additionally, if using stateless workloads you may be able to utilize ephemeral OS disks that removes the cost of the OS disk and instead uses the local storage on the host (like the temporary disk) but it means the state of the OS is lost on deprovision. For more information see https://docs.microsoft.com/en-us/azure/virtual-machines/windows/ephemeral-os-disks.

Application Platforms

Azure has several platforms for applications, i.e. Platform as a Service (PaaS). A key benefit of PaaS is your responsibility moves up from the operating system to the application. This means tasks (and associated costs) related to OS management are no longer your responsibility which can be a huge cost saving both in terms of effort and tooling required such as patching solutions.

Containers are often a common evolution from virtual machines for application hosting, as the virtualization shifts from the hardware to the operating system. There are a number of container solutions available in Azure, however the first component to look at is the orchestration, which provides the overall management of a container solution including the provisioning and high availability of container instances, integration with load balancers, health and remediation, resource control, monitoring, network and storage integration, balancing of workloads and more. The most commonly used orchestration solution for containers is Kubernetes. Kubernetes consists of a number of components broken into the control plane (including the etcd data store, API server, scheduler and controller manager), which could be self-deployed into virtual machines, incurring resource and support costs, especially when deployed in a highly available configuration, and the actual (worker) nodes. Azure Kubernetes Service (AKS) provides a per-tenant, dedicated and managed control plane at no cost. As the customer you only pay for the worker nodes that run the container instances. This means straight away you save the money commonly associated with hosting and maintaining the Kubernetes control plane.

Scaling of the worker nodes can be automated using the cluster autoscaler which, based on the load (specifically pods that cannot be scheduled on the current nodes), will dynamically scale the number of nodes within a specified minimum and maximum.
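A very simplified view of that behavior: add nodes while pods do not fit, remove them when capacity sits idle, within the configured node bounds. The pods-per-node capacity is an assumed figure; the real autoscaler works from resource requests and scheduling simulation, not a flat pod count.

```python
import math

# Highly simplified cluster-autoscaler sizing sketch (assumed pod capacity).
def desired_nodes(pending_pods: int, running_pods: int, pods_per_node: int,
                  min_nodes: int, max_nodes: int) -> int:
    """Nodes needed to fit all pods, clamped to the configured bounds."""
    needed = math.ceil((pending_pods + running_pods) / pods_per_node)
    return max(min_nodes, min(needed, max_nodes))

# 7 unschedulable pods on top of 20 running, ~8 pods per node:
print(desired_nodes(pending_pods=7, running_pods=20, pods_per_node=8,
                    min_nodes=3, max_nodes=12))  # 4 nodes
```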

Additional capacity can also be achieved using Azure Container Instances which provide a “container as a service” offering and can be seen as an infinite scale node through a virtual kubelet that enables management via Kubernetes and therefore AKS.

App Service Plans were one of the first ever Azure services and have a number of different plans that control the features available and associated resources. Take time to review the plans available at https://azure.microsoft.com/en-us/pricing/details/app-service/plans/ and pick the plan that meets your requirements. Also look at the limits, as the maximum number of instances varies based on the plan; these can be found at https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/azure-subscription-service-limits#app-service-limits. Note that auto-scale is included in the Standard and above plans, which enables the number of instances to be increased and decreased based on actual load. Additionally, within each plan there are different instance sizes. Take time to understand the various combinations of plan, scale parameters and instance size to find the optimal configuration for your application.

Serverless offerings generally lend themselves to the most optimized solution as you are billed based on what is used. Logic Apps and Functions are two key examples of Azure serverless compute services. For Azure Functions you essentially pay for the CPU and memory used during executions (remember you get a certain amount free each month). For Azure Logic Apps you pay for executions and for the connectors that enable Logic Apps to integrate with the various services utilized.
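A back-of-the-envelope consumption model for Azure Functions, using the monthly free grants listed earlier (1 million executions and 400,000 GB-s). The unit prices below are assumptions for illustration; check the current price sheet for real figures.

```python
# Rough Azure Functions consumption-plan cost model (assumed unit prices).
PRICE_PER_MILLION_EXECUTIONS = 0.20   # assumed $/1M executions
PRICE_PER_GB_SECOND = 0.000016        # assumed $/GB-s

def monthly_cost(executions: int, gb_seconds: float) -> float:
    """Only usage above the monthly free grants is billed."""
    billable_exec = max(executions - 1_000_000, 0)
    billable_gbs = max(gb_seconds - 400_000, 0)
    return (billable_exec / 1_000_000) * PRICE_PER_MILLION_EXECUTIONS \
        + billable_gbs * PRICE_PER_GB_SECOND

print(round(monthly_cost(3_000_000, 900_000), 2))  # 8.4 - only the overage bills
print(monthly_cost(500_000, 100_000))              # 0.0 - fully inside the free grant
```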

Data Services

The most fundamental data service in Azure is the Azure Storage account, which offers blob, queue, table and file services. Azure Storage accounts bill based on the amount of data stored in the service, which helps ensure you pay only for what you are using; however, there are additional options to further optimize cost. For blob storage there are access tiers of hot, cool and archive. The storage cost decreases as you move from hot to cool to archive, however transaction costs increase, and with archive the data is not available live and must be moved back to hot or cool to access. Lifecycle management enables the use of policies to automatically move data between tiers, and even delete it, based on attributes such as last accessed time. This enables data to be kept as required while paying the lowest applicable price for the storage.

For blob and files there are also premium performance tiers that offer higher performance at increased cost. Once again, ensure deployments match requirements. Another aspect is resiliency. At minimum, storage is locally redundant (LRS), which means there are 3 copies within the region, however additional resiliency through copies stored across availability zones and even asynchronously replicated to a paired region (resulting in 6 total copies of the data) is available. As resiliency levels are increased, so too does the cost. Ensure the resiliency maps to requirements. For example, if an application has instances in multiple regions and the application itself replicates the data between instances before writing to local storage, it is unlikely that storage needs to also be replicated across regions (since the application is already doing that), which means LRS (or ZRS, redundant across zones in a region) would meet requirements while optimizing cost.

For organizations with large file shares on-premises, Azure File Sync can be used to replicate, and even tier less used data, to Azure Files as part of a complete Windows file server solution.
More information on storage accounts can be found at https://docs.microsoft.com/en-us/azure/storage/common/storage-account-overview
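The tier trade-off above (storage price falls from hot to cool to archive while per-operation cost rises) means the cheapest tier depends on access frequency. This toy comparison uses invented placeholder prices, not real rates, purely to show the shape of the decision.

```python
# Toy blob access-tier cost comparison (all prices are invented placeholders).
TIERS = {  # tier: (storage $/GB/month, $ per 10k read operations) - assumed values
    "hot":     (0.0184, 0.004),
    "cool":    (0.0100, 0.010),
    "archive": (0.0010, 5.000),
}

def tier_monthly_cost(tier: str, gb_stored: float, reads: int) -> float:
    per_gb, per_10k_reads = TIERS[tier]
    return gb_stored * per_gb + (reads / 10_000) * per_10k_reads

def cheapest_tier(gb_stored: float, reads: int) -> str:
    return min(TIERS, key=lambda t: tier_monthly_cost(t, gb_stored, reads))

print(cheapest_tier(1_000, reads=100_000_000))  # heavily read data favours hot
print(cheapest_tier(1_000, reads=0))            # untouched data favours archive
```

Lifecycle management policies automate exactly this kind of decision by moving blobs down the tiers as they go cold.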

Building on Azure Storage are Azure Managed Disks, which are used by VMs, VMSS, containers and more. An Azure Managed Disk abstracts the underlying storage account to remove the associated management and consideration of individual account limits. Managed Disks come in a number of types and sizes, from Standard HDD through to Ultra SSD. Understand the performance characteristics of the different types and pick to meet requirements. Within a disk type the size is based around capacity and performance needs, which for all types except Ultra scale together, i.e. performance increases with size. If you would otherwise need a larger disk purely because higher performance is required, ensure you understand the bursting option for Premium SSD disks. For disk sizes P20 and below it is possible to accrue performance credit when the regular provisioned performance is not used which, like the B-series VM, can then be spent to burst for up to 30 minutes, far exceeding the normal provisioned performance. For example, a 32GB P4 disk normally has provisioned performance of 120 IOPS and 25 MB/second but with credit accrued can burst to 3,500 IOPS and 170 MB/sec. If you have this type of short burst requirement you may be able to purchase smaller disks instead of a larger disk provisioned to meet the burst peak. https://azure.microsoft.com/en-us/pricing/details/managed-disks/

When looking at database services, it is possible to install databases in VMs, where all the resource considerations of a regular VM apply, or to leverage managed PaaS offerings. The PaaS offerings will generally provide a cheaper solution at equal or higher resiliency and performance while also reducing the overall management cost. For example, instead of installing PostgreSQL into IaaS VMs, which would require two instances for high availability, a deployment to the managed PostgreSQL offering provides a cheaper solution with a 99.99% SLA. For the open-source databases, hyperscale offerings with automated scale options are also available.

For SQL Server based solutions there are numerous offerings based on requirements such as pools of resource multiple databases can share to optimize use, hyperscale for large deployments and even serverless offerings.

For Cosmos DB a common pain in the past was ascertaining the correct number of Request Units (RUs) to provision. Cosmos DB now features an autopilot mode. With autopilot the RUs will be automatically scaled to meet demand up to a maximum you specify. This ensures you pay only for what you use. Note that care should still be taken in how data is partitioned and how operations against it are formed to ensure the RUs are used in the most efficient way. If operations must span multiple partitions because of poor design, the RU cost will be higher than if the operation can run against a smaller number of partitions. https://docs.microsoft.com/en-us/azure/cosmos-db/provision-throughput-autopilot

Azure Advisor

There are many ways to continue the optimization process but a major part of this should be using Azure Advisor. Azure Advisor brings best practice guidance around a number of key areas: performance, operational excellence, high availability, security and cost. It is the cost recommendations that will be the focus of this section and details can be found at https://docs.microsoft.com/en-us/azure/advisor/advisor-cost-recommendations.

Azure Advisor constantly evaluates the resources deployed and their use. After a period of time the evaluation model provides recommendations, based on the observed usage, to save money, including the estimated annual saving. Some types of remediation may be as simple as deleting an unused resource while others may require additional checking and process, but all should be explored. At minimum, review Azure Advisor weekly to ensure you are doing everything possible to eliminate unnecessary spending.

Below is an example screen shot to get an idea of what may be exposed and note the quick fix option for resources that can easily be removed.



This was not exhaustive; there are other considerations such as network costs, gateways and more that will be influenced by architecture, but the key items touched on can greatly help optimize your cost. Optimization is an ongoing effort and is something to be considered during initial planning and for the lifetime of a workload. Additionally, while consideration of all the types of optimization discussed is critical, easy wins are exposed by Azure Advisor, so ensure its review is a core part of any weekly process. Note that more advanced optimizations, such as a move from an IaaS-based solution to a managed data offering, will not be recommended by Azure Advisor, which is why it should be considered just one piece of the overall ongoing optimization effort.

Isolation and Resiliency Guidance in Azure

I recently created some guidance for my customer around isolation and resiliency and figured I would share if helpful for others. Warning, this gets kind of complex 🙂

Let's get to it!


Isolation is a key component in deploying resilient services. Understanding the various isolation options in Azure is critical to deploying services that are resilient to outages of various scale.

Figure 1 – Azure isolation constructs

Azure services are provided across a number of regions which are listed at https://azure.microsoft.com/en-us/global-infrastructure/regions/. A region is a set of physical locations that exist within a 2ms latency envelope. Many services are deployed to and exist within a specific region including core resources such as virtual networks. Regions are also paired to enable certain services to have resiliency from a regional outage by replicating data between regions, for example geo-redundant storage (GRS). Additionally, customers can choose to deploy multi-region deployments using these pairings which ensure similar Azure services are available in both regions and any fabric updates are not made simultaneously for regional pairs. These pairings are documented at https://docs.microsoft.com/en-us/azure/best-practices-availability-paired-regions. For the greatest resiliency architectures should include at least two regions which may be used in a disaster recovery pattern with an active and passive deployment across regions, or in an active-active pattern with traffic distributed between regional deployments. The pattern chosen will depend on the workload and data platform used (many databases cannot support active-active across locations without a significant performance penalty).

Availability Zones (AZs) are available in many regions. An AZ is a physical location in a region that has independent power, cooling and networking from other AZs in the region. A failure in one AZ will not impact other AZs in the region which means having a service deployed across multiple AZs provides protection from an AZ failure.

Each Azure subscription has three logical AZs exposed for resource placement. Note that AZs are not consistent between subscriptions; for example, AZ1 in one subscription is not the same as AZ1 in a different subscription. For services that require the greatest resiliency within a region, architectures should deploy instances across AZs. This ensures resources within an AZ are isolated from any issue in a different AZ. As an example, VMs that use AZs have an SLA of 99.99%, the highest available for VMs in Azure. Different services utilize AZs differently. Some are zonal, which means they are deployed to a single, specific AZ that you specify. Others are zone-redundant, which means the service automatically spans multiple AZs, providing resiliency from any single AZ failure. When architecting a solution it is important to:

  1. Identify all the Azure components/services that will be used. For example, an internal standard load balancer, VMSS, Azure SQL Database and NAT Gateway.
  2. Identify the AZ supported options for each service. In order of resiliency these are zone-redundant (deploys in a resilient manner across AZs), zonal (deploys in a specific AZ) or regional (no AZ interaction).
  3. Architect to the lowest common denominator. For example, all the aforementioned services are zone-redundant except for NAT Gateway, which is zonal. Therefore, the architecture needs a zonal foundation for services that interact directly with subnets, to keep the promise of the zonal services. For example, NAT Gateway is zonal and is configured at a subnet level, so the architecture requires that resources deployed to subnets are AZ aligned, i.e. an AZ-aligned subnet per AZ.
  4. Resources with zone-redundant capabilities that are NOT directly linked to a zonal resource can still be leveraged. For example, NAT Gateway may be used, which requires a deployment per AZ, which in turn requires a separate VMSS deployed in each AZ in its own subnet (since a VMSS deployment deploys to a single subnet). The Standard Load Balancer, however, can be deployed in a zone-redundant manner and can then have all three VMSS deployments as part of a single backend set.
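To make the zonal vs zone-redundant distinction concrete, here is a minimal ARM template sketch (resource names, region, and API versions are illustrative placeholders, not from a real deployment): a NAT Gateway pinned to a single zone alongside a Standard Public IP requested across all three zones.

```json
{
  "resources": [
    {
      "comments": "Zonal: pinned to AZ1; deploy one per zone for full coverage",
      "type": "Microsoft.Network/natGateways",
      "apiVersion": "2020-05-01",
      "name": "natgw-az1",
      "location": "eastus2",
      "sku": { "name": "Standard" },
      "zones": [ "1" ],
      "properties": { "idleTimeoutInMinutes": 4 }
    },
    {
      "comments": "Zone-redundant: a single resource resilient to any one AZ failure",
      "type": "Microsoft.Network/publicIPAddresses",
      "apiVersion": "2020-05-01",
      "name": "pip-slb-frontend",
      "location": "eastus2",
      "sku": { "name": "Standard" },
      "zones": [ "1", "2", "3" ],
      "properties": { "publicIPAllocationMethod": "Static" }
    }
  ]
}
```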

Availability Sets (AS) exist within a single physical facility. When deploying workloads to an Availability Set the workload is automatically distributed among three Fault Domains. A Fault Domain can be thought of as a rack within the datacenter which has its own network switch, power supply unit etc. By deploying workloads to an AS they are resilient to any single rack-level failure such as a PSU or switch failure (providing you have two or more instances deployed to the AS). Likewise, since hosts live in a particular rack (fault domain), using availability sets also ensures workloads run on multiple nodes, protecting from any single node or VM failure. Additional storage resiliency can be achieved by combining availability sets with managed disks in an aligned mode. Here each fault domain uses a different storage cluster from the other fault domains in the availability set, helping to also protect from any single storage cluster failure. VMs deployed to an AS have an SLA of 99.95%. Availability sets also have an update domain property. This controls how workloads are distributed further and what percentage is impacted during an update of the application (if PaaS) or the fabric itself (IaaS and PaaS). An update domain count of 5 means the workloads are spread over up to 5 update domains, meaning for any update only 20% (1/5) would be impacted at a time. Figure 2 shows this. Note that when using Availability Zones each AZ acts as a fault and update domain, and updates across zones will not happen at the same time.
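As a sketch, an Availability Set with the fault and update domain counts described above could be declared like this (name, region and API version are placeholders; the "Aligned" SKU enables the managed disk alignment mentioned):

```json
{
  "type": "Microsoft.Compute/availabilitySets",
  "apiVersion": "2019-07-01",
  "name": "as-app1",
  "location": "eastus2",
  "sku": { "name": "Aligned" },
  "properties": {
    "platformFaultDomainCount": 3,
    "platformUpdateDomainCount": 5
  }
}
```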

Figure 2 – Fault and update domains in an Availability Set

Note you cannot ordinarily use AZ and AS together; however, it is possible to pin an AS to a specific AZ by utilizing a proximity placement group (PPG), which is used to ensure proximity between services. Because of the increased SLA and zero cost difference, AZ is preferred over AS if it can be used by the target resource.

In summary within a region the use of AZ provides the greatest resiliency from various types of failure and should be used across services. If AZs are not usable then AS should be utilized. Avoid the use of “regional” deployments if using AZs as you have no control where the actual deployment will land and what failures may impact it.

In addition to the use of AZ or AS within a region, deployments to multiple regions should also be architected for the highest level of resiliency in an active-passive or active-active configuration. Solutions like Azure Traffic Manager (DNS-based) and Azure Front Door (HTTP-based) can be used to balance external traffic between regions if required.

Below are some additional considerations and capabilities for various fabric layers.

Network Considerations

Virtual networks are deployed to a region and are available across the entire region, i.e. they span AZs. Virtual networks are broken into subnets that are also regional and available across AZs. There is no concept of deploying a subnet to a single AZ. If a subnet needs to be aligned to a particular AZ this has to be achieved by logically allocating subnets to AZs and then ensuring resources placed in the subnet are deployed to the corresponding AZ. Any communication to the vnet, for example connections to on-premises via ExpressRoute, is available to the entire vnet regardless of the AZ. Zone-redundant gateway options should be leveraged to ensure the vnet connectivity can tolerate any single zone failure.

Certain network resources support AZs, primarily the standard SKUs, for example the Standard Load Balancer, Standard Public IP, Standard IP prefix and the App Gateway v2. These provide the ability to be zone-redundant, and some also support zonal deployment. ExpressRoute Gateway can also deploy in a zone-redundant or zonal mode. When using a zone-redundant service it is automatically made resilient across zones by the Azure fabric and no manual steps are required once the zone-redundant option is configured. For example, a standard SKU public IP being used as the front end of a Standard Load Balancer will span zones and be resilient to any single zone failure. Services like NAT Gateway can be deployed regionally (no zonal promise is made and the deployment may land in any datacenter in the region) or zonally, but do not support zone-redundant deployment. When combining solutions, it is important to architect accordingly. When you use a zone-redundant component a single instance of the resource is deployed. When you use a zonal component and want the service in each AZ you must deploy an instance into each AZ, i.e. to cover 3 AZs you would deploy 3 instances of a zonal resource.

Figure 3 shows an example deployment combining a single zone-redundant SLB front end with zonal NAT Gateway deployments. Note that the subnets are logically mapped to AZs, implemented by using zonal VMSS deployments with each VMSS deploying to the subnet logically mapped to its AZ. Each AZ's NAT Gateway is then connected to its corresponding subnet. Zonal rather than zone-redundant VMSS deployment is used to target specific subnets and enable the mapping of the NAT Gateway. This also means three separate VMSS deployments are used, one for each AZ, instead of one AZ-spanning VMSS instance. In this example, however, all three VMSS deployments are part of the same SLB backend set and are all served from a single front-end IP. This model would apply to any other type of compute service.

Figure 3 – Example network solution using combinations of zonal and zone-redundant solutions.

If the additional complexity of requiring multiple zonal compute deployments to enable the use of NAT Gateway is not desirable, the standard internal load balancer can instead have a public IP added to enable outbound internet connectivity for the backend set members, with outbound rules used to control NAT behavior.

Given the additional complexity NAT Gateway introduces because of its zonal deployment it is important to understand why you would use this instead of just adding a public IP to the internal SLB along with outbound rules to control the SNAT (note that WITHOUT a public IP or NAT Gateway at the subnet, any machine behind an internal SLB has no Internet access). Key benefits of NAT Gateway over public IP on an internal SLB are described at https://docs.microsoft.com/en-us/azure/virtual-network/nat-gateway-resource#source-network-address-translation however some of the key points are:

  1. LB SNAT requires ahead-of-time knowledge, planning, and tweaking of the worst-case scenario for any of the VMs, whereas NAT Gateway allocates on demand as needed.
  2. Dynamic workloads or workloads which diverge from each other in intensity are difficult to accommodate with LB SNAT.
  3. You must explicitly join every VM NIC to the backend pool of the load balancer to use the SLB SNAT.
  4. Some customers also object to using something that also provides inbound functionality for outbound functionality.
  5. NAT is designed to be a much simpler outbound solution for entire subnet(s) of a virtual network.
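For comparison, if you do opt for SLB SNAT instead of NAT Gateway, the outbound behavior is declared explicitly via outbound rules on the load balancer. A hedged sketch of such a rule follows (load balancer, frontend and pool names are placeholders, and allocatedOutboundPorts must be sized up front per backend instance, which is exactly the planning burden NAT Gateway removes):

```json
{
  "outboundRules": [
    {
      "name": "outbound-internet",
      "properties": {
        "protocol": "All",
        "allocatedOutboundPorts": 1024,
        "idleTimeoutInMinutes": 4,
        "frontendIPConfigurations": [
          { "id": "[resourceId('Microsoft.Network/loadBalancers/frontendIPConfigurations', 'slb1', 'publicFrontend')]" }
        ],
        "backendAddressPool": {
          "id": "[resourceId('Microsoft.Network/loadBalancers/backendAddressPools', 'slb1', 'backendPool1')]"
        }
      }
    }
  ]
}
```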

Storage Overview

Azure Storage accounts have different redundancy options. The base level of redundancy is Locally Redundant Storage (LRS) which has 3-copies of the data within a physical location. Zone Redundant Storage (ZRS) can be used to have the 3-copies of the data distributed across three AZs in the region. Geo-redundant/Geo-zone-redundant (GRS/GZRS) adds an additional 3 copies of the data at the paired region.
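The redundancy option is simply the storage account SKU. A minimal fragment as a sketch (account name, region and API version are placeholders); swap the sku name for Standard_LRS, Standard_GRS or Standard_GZRS as required:

```json
{
  "type": "Microsoft.Storage/storageAccounts",
  "apiVersion": "2019-06-01",
  "name": "stappdata001",
  "location": "eastus2",
  "kind": "StorageV2",
  "sku": { "name": "Standard_ZRS" }
}
```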

If utilizing GRS/GZRS for storage resiliency to another region, it is important that other services' replication and failover target the same paired region, ensuring that after a failover the services remain together in a single region with workable latency between them.

Database services have different options, for example Azure SQL Database premium and business critical are deployed in a zone-redundant configuration. Cosmos DB has an optional zone redundancy option to provide zone-redundant deployments. These configurations are transparent to the application using the service and are accessed through a single endpoint/DNS name which is prescribed by the data service. Additionally, Cosmos DB has different consistency models enabling multi-master configurations which would allow writable replicas in multiple regions.

Compute Overview

For basic VMs it is possible to deploy to a region (99.9% SLA when using premium SSD or Ultra Disk), an availability set (99.95% SLA with 2+ instances) or availability zone (99.99% SLA with 2+ instances). When deploying to AZs it is important VMs are distributed over multiple AZs.

Virtual Machine Scale Sets support both zonal and cross-zone deployments. When deploying across zones different balance options are available which includes a best effort zone balance or strict zone balance. Best effort will attempt to keep balance across zones but will allow temporary imbalance. Strict will not allow scale actions if the balance would be broken. For most scenarios best-effort suffices.
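A sketch of a cross-zone VMSS showing the balance setting (the virtualMachineProfile is omitted for brevity, and names, sizes and API version are placeholders). Setting zoneBalance to true requests strict balance; false requests best effort:

```json
{
  "type": "Microsoft.Compute/virtualMachineScaleSets",
  "apiVersion": "2019-12-01",
  "name": "vmss-app1",
  "location": "eastus2",
  "zones": [ "1", "2", "3" ],
  "sku": { "name": "Standard_D2s_v3", "capacity": 6 },
  "properties": {
    "zoneBalance": false,
    "platformFaultDomainCount": 1
  }
}
```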

AKS node pools can be deployed across zones at time of creation. The default node pool deployment controls whether the AKS control plane components are deployed across zones. Note that while cross-zone node pools can be configured, this is not recommended for large stateful deployments, as race conditions can occur where the compute tries to start in AZ1 while storage may be in AZ3 because of scaling limits. The best practice is to create a node pool per zone for stateful workloads. Assuming networking for AKS is provided using Azure CNI (which enables integration with existing vnets), separate node pools per zone also allow a different subnet to be configured for each node pool, which can then be AZ aligned. Deploy the service into each node pool, which ensures the compute and storage scale together. For stateless services cross-zone node pools are fine, unless you wish to use NAT Gateway, in which case once again you will need a node pool per zone to enable separate subnets and keep the NAT Gateway zonal promise. When using NAT Gateway, the NAT Gateway is aligned to the subnet and not to AKS node pools directly. If NAT Gateway 1 is zonally deployed to AZ1 and linked to subnet 1, and subnet 1 is used by node pool 1, then node pool 1 would be a zonal deployment to AZ1, ensuring alignment. NAT Gateway 2 and node pool 2 would be AZ2 with subnet 2, and so on.
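The node-pool-per-zone pattern above might look like the following agentPoolProfiles fragment of a managed cluster resource (vnet and subnet names are placeholders, and only two of the three pools are shown). Each pool is pinned to one AZ and to the subnet logically mapped to that AZ:

```json
{
  "agentPoolProfiles": [
    {
      "name": "npaz1",
      "count": 3,
      "vmSize": "Standard_DS2_v2",
      "type": "VirtualMachineScaleSets",
      "availabilityZones": [ "1" ],
      "vnetSubnetID": "[resourceId('Microsoft.Network/virtualNetworks/subnets', 'vnet1', 'subnet-az1')]"
    },
    {
      "name": "npaz2",
      "count": 3,
      "vmSize": "Standard_DS2_v2",
      "type": "VirtualMachineScaleSets",
      "availabilityZones": [ "2" ],
      "vnetSubnetID": "[resourceId('Microsoft.Network/virtualNetworks/subnets', 'vnet1', 'subnet-az2')]"
    }
  ]
}
```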

App Service Environments can be deployed as zonal. Behind the scenes, ASE uses zone-redundant storage (ZRS) for the remote web application file storage. At time of writing, App Service Plans do not support AZs. The best option is to deploy to two nearby regions, e.g. East US and East US 2, then balance across them using Azure Front Door (or Azure Traffic Manager).

A full list can be found at https://docs.microsoft.com/en-us/azure/availability-zones/az-overview.

Azure Infrastructure Update – March 2020

Just recorded the latest Azure Infrastructure update and below are the key updates. Note a weekly update is available via the subscribe action on the right-hand side of this site (hint: click Subscribe) or at https://savilltech.wordpress.com/azure-weekly-update-subscribe/.

Let's get to it.

Azure AD B2B – Unmanaged/viral tenants will not be created after 3/31/2021. Make sure you turn on one-time-passcode (OTP).

Azure Security Center

  • Now integrated with Windows Admin Center
    • Onboard OS instances to ASC via the WAC extension
    • Security alert and recommendations surfaced
  • Identity and Access recommendations in free tier
  • Azure Container Registry scanning
  • Azure Kubernetes Service (AKS) protection

Azure Cloud Shell – Now has additional regions (secondary) for the storage of the shell persistent data. The compute will still be in one of the primary regions but now the data-at-rest can be in a region that may help you meet certain compliance requirements.

Azure Networking

  • NAT Gateway GA (https://youtu.be/c685a1CiaIs)
  • Azure Storage and Azure SQL Database Private Link GA
  • Azure Data Explorer cluster deployment into custom virtual network now possible providing integration with NSG and other connectivity to the vnet.

Azure Front Door

  • Wildcard hosts/domains
  • Configurable idle timeout
  • Configurable minimum TLS versions
  • Health probe configuration
  • Lockdown with new X-Azure-FDID
  • Disable backend certificate name check

Azure Storage – Blob immutability has GA’d. Enables WORM to blobs (write once, read many) based on legal or time locks.

Azure Dedicated Host – Now has additional hardware types available, focused on general use, memory-optimized and storage-intensive workloads. All are based on the AMD EPYC processor except the new Msv2 addition.

Well that’s it! Please subscribe to the YouTube channel and take care in these crazy times and see you soon!


Mid-March 2020 Azure Infrastructure Update

Just recorded the latest Azure Infrastructure update and below are the key updates. Note a weekly update is available via the subscribe action on the right-hand side of this site (hint: click Subscribe) or at https://savilltech.wordpress.com/azure-weekly-update-subscribe/.

Let's get to it.

PowerShell 7 was released.

New Azure AD App experience is available. It can be enabled for all users or specific groups of users.

Azure Security Center has two updates.

  • Continuous export available
    • Trigger alerts via Azure Monitor through Log Analytics export
    • Can also send to 3rd party SIEM via event hub
  • Just-in-time experience updates
    • Justification field available during access request
    • Cleanup of redundant rules that used to be left behind as part of NSGs

Azure Backup now has Backup Reports to provide tracking and auditing of backup and restore jobs.

Two new Virtual Machine SKUs.

  • NDv2 GPU VMs GA
    • High-end deep learning training and HPC
    • 8 NVIDIA Tesla V100 NVLink interconnected GPUs
  • NVv4 VMs GA
    • GPU accelerated graphics applications and virtual desktops
    • AMD Radeon Instinct MI25 GPU
    • Various sizes with partial GPU support

VM Scale Sets (VMSS) has a number of new capabilities.

  • Automatic repair based on application health via load balancer health probe of the application health VMSS extension
  • Scale-in policy to control how scale-in is performed for example default (based on balancing first), newestVM or oldestVM
  • Instance protection to protect specific instances during scale-in or other types of action
  • Instance termination notifications through scheduled events along with configurable delay. https://docs.microsoft.com/en-us/azure/virtual-machines/windows/scheduled-events

And finally, Cosmos DB introduces a free tier that is available once per subscription, giving 400 RU/s and 5GB of storage. Additionally, dynamic scale is now available via autopilot, which makes whatever RU/s are needed available in real time up to a defined limit. Autopilot can work with the free tier, meaning you would only pay for RU/s over 400.

Well that’s it! Please subscribe to the YouTube channel and take care in these crazy times and see you soon!


Small Script to Grant Azure AD Roles to Groups

Today it is not possible to grant Azure AD roles to groups, and dynamic group support is unlikely anytime soon. I created a little script that grants a role to all users in a group. It checks and only adds the role to users in the group who don't already have it (by using the PowerShell Compare-Object command). Simply call the function, passing the name of the role to grant and the group whose members should be assigned the role. Note I do not remove the role if someone is removed from the group. That would be easy to do; however, it would remove the role from anyone not in the group, which may not be what you want, since you may assign roles in other ways and not just via a single group membership.

For example:

Add-RoleToGroup "global reader" "group name"

Function Add-RoleToGroup
{
    param (
        $roleName,   # e.g. "global reader"
        $groupName   # e.g. "group name"
    )

    Write-Output "Granting $roleName to $groupName"

    $errorFound = $false

    # Note that only roles that are enabled, i.e. have at least one member, will be
    # found using this command, so ensure the role has already been assigned at least once
    $roleObject = Get-AzureADDirectoryRole | Where-Object {$_.DisplayName -eq $roleName}
    if($null -eq $roleObject)
    {
        Write-Output "Cannot find role $roleName; it may not be enabled. Ensure the role already has at least one member"
        $errorFound = $true
    }

    $groupObject = Get-AzureADGroup -SearchString "$groupName"
    if($null -eq $groupObject)
    {
        Write-Output "Cannot find group $groupName"
        $errorFound = $true
    }

    if(-not $errorFound)
    {
        $groupMembers = Get-AzureADGroupMember -ObjectId $groupObject.ObjectId -All $true
        $roleMembers = Get-AzureADDirectoryRoleMember -ObjectId $roleObject.ObjectId

        # Compare on ObjectId; -PassThru keeps the full user objects with SideIndicator added
        $userDifferences = Compare-Object $groupMembers $roleMembers -Property ObjectId -PassThru

        foreach($userDifference in $userDifferences)
        {
            # "<=" means the user is in the group but does not yet have the role
            if($userDifference.SideIndicator -eq "<=")
            {
                Write-Output "Adding $($userDifference.UserPrincipalName) to role"
                try
                {
                    Add-AzureADDirectoryRoleMember -ObjectId $roleObject.ObjectId -RefObjectId $userDifference.ObjectId
                }
                catch { Write-Output "Error adding role" }
            }
        }
    }
}


Full Azure Data Engineer Associate Learning Track Available

Over Q2 and Q3 of 2019 I worked on a series of courses covering the content required to pass exams DP-200 and DP-201, which once passed award the Azure Data Engineer Associate certification. I completed the final two courses and now the 11-part learning track is available. It will shortly be available as a learning path on both Pluralsight and Azure.com, but I wanted to link to all the courses now in the order I recommend watching them. They are all free; you just need to sign up for a free Pluralsight account via https://www.pluralsight.com/partners/microsoft/azure.

Note, I felt I should put my money where my mouth is, so I took both exams yesterday back-to-back and am pleased to say I passed them both (otherwise this would be a much sadder blog post) ;-). Also, during the 45 minutes it took me to complete DP-200 there was a fire alarm test of my facility by the fire marshal, so the entire time I had a siren blaring and a strobe light while trying to focus. That was fun. I felt like Robin Hood when he is trying to make the shot and everyone is distracting him 🙂 Also, they shut down the AC during the tests so it was 85 degrees.


Hell’s Cloud Ops

Been watching Hell’s Kitchen in the background while working on some projects and I think it would make an awesome cloud operations show and a fun way to communicate some core concepts. Imagine…..

Chef in calm voice – OK team, today we are working on providing a tasty SQL service for our customer that will be used from a fairly basic application. Off you go.

<contestants scurry off to their workstation areas>

<chef wanders over to Bob>

Chef angry – Bob, WHAT ARE YOU DOING?

Bob – I’m creating each VM to be part of the SQL cluster I’m creating

Chef furious – You’re creating each VM one at a time in the portal???? Oh my god! Is your computer made of red and yellow plastic with “My first” written on the top of it? At least I see you’re using Availability Sets for some resiliency, but this is ridiculous. How will you ensure consistency? How will you scale to creating 50 instances of this? How would this integrate with DevOps? Start again, use Infrastructure as Code, and if I see you in a portal that mouse will be going where the sun doesn’t shine.

Bob – Yes chef!

<15 minutes later Bob presents his template>

Chef – OK, nice template, good resources. Oh no no no no. What have you done????? WHY HAVE YOU HARD CODED values in the resources section??? WHERE IS THE PARAMETER FILE?? How are you going to change control this? How will you deploy this to different environments, to different instances? You donkey! Take environment-specific values out of the template and get them in a parameter file! Then you have one change-controlled template. Environment and instance specific values are completely separate! IDIOT! FIX THIS!

<5 minutes later Bob returns>

Chef – Let’s see. Good parameter use, let’s look at the parameter file. DONKEY! Are you here to destroy the company??? WHYYYYY do you have the administrator password in the parameter file???

Bob – I needed it to join the machines to the domain via the domain join extension chef

Chef – And you felt the best way to do that was to place that password in a file that you then uploaded to a repository??? Your company’s most important password is now known to everyone; a group of teenagers has taken over your company, your wife has left you, and your kids pretend they are adopted they are so embarrassed. Good luck stocking vending machines after destroying your company. IDIOT! Where would be a better place, do you think? CAN YOU THINK?

Bob – Azure Key Vault chef

Chef – Can you do that? Are you capable? DO IT! And heaven help you if you forget to update the vault’s advanced access policy to allow use of the secret from ARM template deployments.

<5 minutes and Bob returns>

Chef – Let’s see how you can ruin my day now. This is acceptable. It will work well. Nice use of secrets. I see you even created a release pipeline. Now tell me, why didn’t you just use Azure SQL Database?

<A small tear rolls down Bob’s cheek and credits roll>

Using AD extensionAttributes in Azure AD

I had a value in one of my extensionAttributes in AD populated with data I needed to leverage in Azure AD dynamic groups. The specific attribute was extensionAttribute5. Without doing anything else this attribute is replicated to Azure AD and can be used as part of a dynamic group. For example I created a rule:

(user.extensionAttribute5 -contains "Chief Technical Architect")

However, I was unable to see this value when looking at users through the Azure AD PowerShell module. The values are visible through the Exchange Online PowerShell environment; however, I wanted to leverage Azure AD PowerShell. I therefore added the attributes as part of the Azure AD Connect replication (note I also add one of the msDS-cloudExtensionAttributes to show another available attribute):



Once replicated you are now able to view the values as shown:

PS Azure:> Get-AzureADUser -ObjectId johnsav@onemtc.net | Select-Object -ExpandProperty ExtensionProperty

Key                                                Value
---                                                -----
odata.metadata                                     https://graph.windows.net/32dc2feb-7fd6bf/$…
odata.type                                         Microsoft.DirectoryServices.User
createdDateTime                                    9/26/16 6:32:37 PM
userIdentities                                     []
extension_391c602828_msDS_cloudExtensionAttribute1 Chief Technical Architect
extension_391c602828_extensionAttribute5           Chief Technical Architect

If you need a specific value then reference it by its full name as shown above (note your name will be different), for example:

(Get-AzureADUserExtension -ObjectId johnsav@onemtc.net).get_item("extension_391c602828_extensionAttribute5")