Also known as “how to save money” or if you are Mr Krabs “Me Money!!!!!”. Note, the video version of this is over at https://youtu.be/RjuTQvGm1zQ.
Spending money wisely is always important for every company, but in challenging times it is even more critical to ensure we are spending as optimally as possible everywhere we can, including the cloud. In this blog I want to walk through ways you can think about optimizing your spend in Azure.
Azure uses a consumption-based billing model: you pay based on what you use. A key aspect of cloud use is its elasticity and, as will be explored through this document, you will always be looking at ways to dynamically scale your resource use based on the workload at that moment in time. There are many service options and configuration choices that will influence how consumption is tracked and the associated billing. Many organizations when migrating to Azure will initially perform an analysis of the current workload, looking for peaks and averages of resource utilization to best size the cloud counterparts. This step, while very important, is just that: a first step on a journey that never ends as requirements change and service options evolve.
A key aspect of the cloud is scale up vs scale out, i.e. vertical vs horizontal scaling, to achieve resource elasticity. Often to provide a service with more capacity we make an instance bigger: more CPU, more memory etc. While this increases scale, it does not increase resiliency, and often this type of vertical scaling requires restarts of the instance, causing downtime. The preferred approach is to scale out/horizontally by adding additional instances. This increases scale while also increasing resiliency by having multiple instances distributed over racks/datacenters. This type of scaling can also be done without downtime as instances can be added/removed from the set and, if used, added/removed from a load balancer backend set to handle incoming requests. When optimizing compute, the focus is often on horizontal scaling and automations to drive the scale actions via scheduling or certain metric thresholds such as processor use percentages or queue depths. Scaling vertically can be an option in certain situations. For example, you may have a single VM workload that is a DR instance just receiving application transactions via replication. In this case it could be very small. If a disaster occurs and it becomes active, it could be resized to provide greater capacity, and the few minutes of additional downtime may be worth the cost saving achieved for most of the time. Likewise, some data workloads can be resized, which may make sense if their workload is usually light but an end of month batch requires 10x the normal processing. In this case resize the instance ahead of the large batch process and then shrink it once complete.
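To make the metric-threshold idea concrete, here is a minimal sketch of a single horizontal scaling decision. It is not how Azure autoscale is implemented; the function name, the 70%/30% CPU thresholds and the instance limits are all assumptions chosen for illustration.

```python
def desired_instances(current, cpu_percent, minimum=2, maximum=10,
                      scale_out_above=70.0, scale_in_below=30.0):
    """Return the instance count after one scaling evaluation.

    All thresholds and limits are hypothetical defaults, not Azure values.
    """
    if cpu_percent > scale_out_above:
        target = current + 1      # busy: add an instance
    elif cpu_percent < scale_in_below:
        target = current - 1      # idle: remove an instance
    else:
        target = current          # within the comfortable band: no change
    # Never go below the minimum (resiliency) or above the maximum (cost cap).
    return max(minimum, min(maximum, target))
```

Calling `desired_instances(3, 85.0)` adds an instance, while a quiet period with `desired_instances(2, 10.0)` stays at the minimum rather than dropping to a single instance.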
When evaluating architecture take care to understand all the options available and use the Azure Pricing Calculator (http://aka.ms/azurepricingcalculator) to get a high-level idea of what the costs will look like to aid in initial planning. When calculating costs make sure to consider all factors including monitoring, backups and disaster recovery (DR). Note that the consumption-based nature of Azure provides a compelling DR story as most of the DR footprint would only be started and therefore billed during an actual failover (test or real).
How to view Azure Spend
The first step is to understand how you are spending money in Azure today. There are a number of ways to identify spending including access to billing APIs, exporting data and viewing in visualization and analysis tools (like PowerBI) however an initial starting point that gives great insight and investigation capabilities is the Azure Portal and more specifically the cost analysis which is a core part of Azure Cost Management. This will show current cost and a trend analysis of what will be spent based on current trend. Costs are broken down by service, location and resource group with the ability to add additional filters and grouping to best visualize for your needs.
Also within Azure Cost Management are budgets, which allow a dollar (and metric) threshold to be created for a scope such as a subscription, resource group or even management group. Alerts can be configured that are triggered at a certain threshold of that budget. For example, at 80% an alert is generated that could send an email, and at 90% an email is sent and a function is triggered to send the owner a report of their spending. Action groups are used for the threshold actions, which allows the full capability set of action groups to be utilized. This can be very useful to bring awareness. Note that stopping resources at 100% is typically avoided as this would impact the services provided, but it may be an option for test/dev and could be achieved by triggering an action group that calls an automation.
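The threshold behaviour can be sketched in a few lines. This is only an illustration of the logic, not the Azure Cost Management API; the function name, the action names and the threshold table are hypothetical.

```python
# Hypothetical mapping of budget percentage to the actions fired at that level.
THRESHOLDS = {
    80: ["email"],
    90: ["email", "function_report"],
}


def triggered_thresholds(spend, budget, thresholds=THRESHOLDS):
    """Return {percentage: actions} for every threshold the spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    # Compare spend * 100 against level * budget to avoid division rounding.
    return {
        level: actions
        for level, actions in sorted(thresholds.items())
        if spend * 100 >= level * budget
    }
```

With a budget of 1000, a spend of 850 returns only the 80% entry, while a spend of 950 returns both entries.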
Notice that Azure Cost Management can scope to all the key organizational structure components such as management groups, subscriptions and resource groups and may influence how you structure resources. Additionally tags are a key attribute that can be used in cost analysis which means your structure does not have to match your billing analysis requirements, just ensure you have a good tagging standard that does provide metadata to meet filtering and identification requirements.
Always Free Services
Azure AD forms the foundation of authentication and authorization in Azure and a free SKU exists for users. For organizations using the “security defaults” configuration on their Azure AD tenant even free users can leverage MFA in a pre-configured manner. For more granular control the premium SKUs would be required.
Resources live in Resource Groups that reside in Subscriptions which in turn can be arranged in a Management Group hierarchy. These constructs have no associated costs and can be used for role-based access control (RBAC), policy assignment and budget which also have no associated costs i.e. RBAC, Azure Policy and Azure Cost Management are all free for use with Azure. Note that while there are no costs associated with subscriptions how they are used will impact resource architecture and deployment which could impact the overall cost of solutions. As an example, a virtual network is bound by region and subscription. If an architecture was used that used many subscriptions resulting in many virtual networks, then if those virtual networks needed to communicate then network peering would be utilized that has ingress and egress charges which vary in amount depending on if those networks are in the same or different regions (https://azure.microsoft.com/en-us/pricing/details/virtual-network/). If a single subscription was used, then a single virtual network could be used within the region with no ingress/egress for communication within that virtual network.
There are other services that are free or have a certain amount of free usage perpetually (unlike others that have free amounts for a limited time such as 12 months). These always free services are outlined at https://azure.microsoft.com/en-us/free/ and include (but are not limited to) features such as:
- 5 GB network egress
- 1 million requests and 400,000 GB-seconds of resource consumption with Azure Functions
- 100,000 operations for event publishing and delivery with Event Grid
- 50,000 active B2C users
- 5 users of Azure DevOps
- 400 RUs and 5 GB storage with a free Cosmos DB account (1 per subscription)
- Free policy assessment and recommendations with Azure Security Center
- Recommendations and best practices via Azure Advisor
One of the first steps to optimizing costs happens before moving anything to Azure. While familiarizing yourself with the service options in Azure is critical, so too is having a good understanding of the workloads moving to Azure, both to ensure compatibility and to right-size. Very often workloads on-premises are provisioned well beyond actual requirements, as hypervisors allow oversubscription of resources such as CPU, memory, storage and networking, and tools may not exist to gain insight into true requirements.
Azure Migrate (http://aka.ms/azuremigrate) provides support for the entire migration workflow. This includes discovery of workloads, their dependencies, resource utilization and then tools to migrate OS instances and SQL databases.
Note that this initial rightsizing is the first step of the cost optimization. Once the resources are in Azure, either migrated, or new deployments,
Reserved Instances and Azure Hybrid Benefit
There are many ways to optimize the use of resources and therefore cost however there are two mechanisms directly related to pricing of services that should be understood and used where possible.
The first is the Azure Hybrid Benefit. For several Azure services that leverage Microsoft products such as VMs running Windows Server and/or SQL Server in addition to Azure SQL Database there are many organizations that already have licenses procured for on-premises. Applicable products that have software assurance enable the Azure Hybrid Benefit which reduces pricing on the corresponding Azure service. For example, a 16-core Windows Server Datacenter license enables two 8-core Windows Azure VMs to save up to 40%, essentially billing the same as Linux VM as the Windows Server license cost is removed. Note Datacenter licenses (unlike Standard) can still be used on-premises in addition to the Azure cost saving. Likewise, SQL Server licenses on software assurance enable reduced rates on vCore-based Azure SQL database offerings. Make sure if you have hybrid benefits you are using them with applicable Azure resources. For more information see https://azure.microsoft.com/en-us/pricing/hybrid-benefit/
The second vehicle for pricing reduction is Azure Reservations (fka Reserved Instances). The cloud is great for dynamic workloads where you pay only for what you use however most organizations will still have a certain amount of “steady state” resource that is always being used. Reserved Instances enables an organization to commit to using a certain quantity of a certain series/type of resource over a 1- or 3-year period. While initially this was for virtual machines it now includes many types of resource including compute, storage and database offerings as detailed in the below link. This leads to a pricing discount (based on the length of commitment) on running resources of the covered series up to the quantity committed to (additional resource use would not have the discount applied). There is no configuration required, it is purely a billing mechanism that runs on the hour and applies the discount automatically. Note that if you do NOT have the quantity running you are still billed for that reserved amount each hour which is why it is important to take time to identify the optimal commitment numbers however it is possible to convert reservations between series (where applicable) if required as needs evolve. You can track the utilization of reservations via the Reservations page in the Azure Portal. More information at https://docs.microsoft.com/en-us/azure/cost-management-billing/reservations/save-compute-costs-reservations
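Because the reserved quantity is billed every hour whether or not it is running, it helps to work through the arithmetic before committing. The sketch below is a simplified model with made-up numbers; the rate and the 25% discount are assumptions for illustration, not Azure prices.

```python
def hourly_cost(running, reserved, payg_rate, discount):
    """Cost of one hour for a given number of running instances.

    running   -- instances actually running this hour
    reserved  -- instances committed to via a reservation
    payg_rate -- pay-as-you-go price per instance hour (hypothetical)
    discount  -- reservation discount as a fraction, e.g. 0.25 (hypothetical)
    """
    # The reserved quantity is billed at the discounted rate even if unused.
    reserved_cost = reserved * payg_rate * (1 - discount)
    # Anything above the reserved quantity is billed at the normal rate.
    overflow = max(0, running - reserved)
    return reserved_cost + overflow * payg_rate
```

With a rate of 2.0 and a 25% discount, 10 running instances against 8 reserved cost 16.0 for the hour instead of 20.0. If only 3 are running, the same reservation still costs 12.0 while pay-as-you-go would have been 6.0. In this simplified model the reservation only pays off when average utilization of the reserved quantity stays above `1 - discount`.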
It is important to be able to associate resources with their owner, project, and cost center for accountability, tracking and potentially charge-back purposes. Azure has several mechanisms to enable this. While subscriptions can be used as a boundary, resource groups are, by design, the best container for resources with a commonality, for example they are parts of the same application and will be created together, run together and ultimately decommissioned together. To enable assignment of resources, metadata should be configured on resources, i.e. tags. Tags in Azure are key:value pairs that can be configured on all Azure (ARM) resources. These were previously mentioned to help in cost filtering but they can be used for much more. Very common uses of tags include storing metadata about:
- Application Name
- Cost Center
- Environment (for example dev, test, production)
- Resource state (e.g. malware definition version, backup agent etc.)
Full guidance can be found at https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/ready/azure-best-practices/naming-and-tagging#metadata-tags from the Azure cloud adoption framework. Using tags, optionally in combination with resource group/subscriptions, it is possible to track resource ownership for accountability purposes such as cost and adhering to regulatory/organizational requirements. To ensure required tags are populated Azure Policy should be used at parent management groups (to be inherited by the child resources) that mandate key tags to be populated along with legal values where appropriate. Azure Policy can also be used to configure resources to inherit tags from their resource group if required.
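The kind of check a tagging standard implies can be sketched as follows. This is plain Python for illustration, not an Azure Policy definition; the tag names and legal values are assumptions based on the common uses listed above.

```python
# Hypothetical tagging standard: None means any non-empty value is accepted.
REQUIRED_TAGS = {
    "Application": None,
    "CostCenter": None,
    "Environment": {"dev", "test", "production"},
}


def tag_violations(tags, required=REQUIRED_TAGS):
    """Return a list of problems found in a resource's tag dictionary."""
    problems = []
    for key, legal_values in required.items():
        value = tags.get(key)
        if not value:
            problems.append(f"missing tag: {key}")
        elif legal_values is not None and value not in legal_values:
            problems.append(f"illegal value for {key}: {value}")
    return problems
```

A resource tagged `{"Application": "shop", "Environment": "qa"}` would be reported as missing `CostCenter` and having an illegal `Environment` value.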
Optimizing Service Use
In this section a number of key Azure services will be explored and some key ways to optimize their use.
Note, where possible we think about moving to the right along the following progression:
VM -> VMSS -> Containers -> AKS with containers -> App Service Plans -> Serverless (Functions/LogicApps)
The further to the right we move the less responsibility for OS components, platform components and eventually, with serverless, even resource instances. Note that additional opportunities for resource optimization are exposed in addition to indirect cost savings as less responsibility will also reduce operational expenses.
Virtual Machines and Virtual Machine Scale Sets
There are many different series and sizes of virtual machines available (https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes). Each series has different characteristics: some offer an overall balanced CPU, memory, network and storage blend, some are compute focused, some memory focused, some storage focused, and some have special hardware such as GPUs. Take time to understand the combinations available that best suit your requirements. Additionally, if you have workloads that normally run at a fairly low steady state but need to burst for short periods of time to higher resource utilization, the B, or burstable, series may be a good fit. The B series provides a portion of processor resources to the instance and, if less than that portion is used, credit is accrued which can then be used in these burst situations. For example, a domain controller typically has low resource utilization but will have a burst of activity at the start of the day when users log in.
In addition to picking the correct size for workloads, ensure they are only running when required. If it is a test/dev workload, make sure it shuts down at night and over the weekends. This can be automated through various Azure automation capabilities or directly as part of the VM configuration through the auto-shutdown configuration (which leverages Azure DevTest Labs behind the scenes, which enables broader shutdown policies to be leveraged).
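A quick calculation shows why this matters. The sketch below assumes a hypothetical schedule of 12 hours per weekday; it covers compute time only, since managed disks continue to bill while a VM is deallocated.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def weekly_running_fraction(hours_per_weekday, weekdays=5):
    """Fraction of the week a VM runs when shut down outside working hours."""
    return hours_per_weekday * weekdays / HOURS_PER_WEEK


# Hypothetical schedule: running 12 hours a day, Monday to Friday.
fraction = weekly_running_fraction(12)
print(f"Running {fraction:.0%} of the week")
```

Under that schedule the VM runs 60 of 168 hours, so roughly two thirds of the compute hours are avoided compared with leaving it on all week.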
Remember that a virtual machine uses multiple resources. There is a network adapter, an OS managed disk, optionally several data managed disks, perhaps a public IP. Just deleting the VM will not delete these resources, which means you are still paying. Make sure all associated resources are deleted if you are sure they are no longer required. Well organized resource groups and naming will help you identify related resources and those that may now be orphaned, i.e. not used. I have a basic script, Get-AzureUnusedResources.ps1, available on my GitHub that will identify unused managed disks and public IPs and, once complete, provide commands to remove them. Note you should check to ensure the resource really is unused! Some services like ASR will have unmapped disks (but they should have ASR in the name). You may have a public IP you are keeping as it has a static IP you want to keep.
Below are two Azure Resource Graph queries that find unmapped managed disks and public IPs not mapped to a VM/load balancer (but still check they are not used by a NAT gateway).
Resources | where type =~ 'Microsoft.Compute/disks' | where managedBy=~ '' | project name, resourceGroup, subscriptionId, location, tags, sku.name, id
Resources | where type =~ 'Microsoft.Network/publicIPAddresses' | where properties.ipConfiguration=~ '' | project name, resourceGroup, subscriptionId, location, tags, id
When using scale-out and multiple instances of a service to provide scale and resiliency ensure you are performing scaling actions to add and remove instances based on need. Many types of service have peak hours, days and even times of the year. Ensure the number of instances match the demand on the service. While automations can be created to perform this on regular VMs by looking at host and/or guest resource utilization and other metrics (such as queue depths), the built-in solution in Azure is the VM Scale Set or VMSS.
VM Scale Sets are a configuration that includes:
- A base gold image, for example the base OS
- Configuration of the VM, for example series, size and any extensions such as custom script extension to perform initial configuration
- Instance and scale configuration which specifies minimum and maximum instance numbers in addition to triggers for scale activities to occur such as metric thresholds, like a CPU percentage, a schedule or other external trigger. More advanced scaling is possible through various engines which can include integration with technologies like cloud-init to customize instances during scale provisioning actions
- IP configuration for linked load balancer
- Optional advanced configurations such as model to use for scale-in actions, protection of certain instances and more
Understanding the criticality of different workloads is also important when optimizing cost. For example, there may be computation that needs to be completed but is not time critical and you would like to complete as cheaply as possible. Azure often has spare capacity and rather than have the capacity sitting idle it is offered at reduced rates. The exact rate depends on the amount of spare capacity of that VM series, size, region, time of day and other factors. This means the price is variable and constantly changing. This spare capacity is exposed through the Azure Spot instance option for VMs and VMSS. You can specify a maximum price you are willing to pay, and your spot instances will run until that price is exceeded or you can say run until the capacity is needed for regular workloads. Note you cannot mix spot instances and non-spot within a single VMSS. Instead you would have two VMSS instances that could be part of the same load balancer service. Azure Batch is also transitioning to enable the use of spot instances.
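The maximum price behaviour can be illustrated with a toy model. The prices below are invented, and the sketch only covers price-based eviction; eviction because Azure needs the capacity back is not modelled.

```python
def hours_until_eviction(max_price, hourly_prices):
    """Return the index of the first hour whose spot price exceeds max_price.

    Returns None if the price never exceeds the maximum in the given series.
    """
    for hour, price in enumerate(hourly_prices):
        if price > max_price:
            return hour
    return None


# Hypothetical hourly spot prices for one VM size in one region.
prices = [0.03, 0.04, 0.06, 0.02]
print(hours_until_eviction(0.05, prices))
```

With a maximum of 0.05 the instance would run for the first two hours and be evicted in the third, so work placed on spot capacity needs to tolerate being interrupted and resumed.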
For stateless VMSS deployments that require shared storage you have several choices including Azure Files, Azure NetApp Files, shared managed disks and even Azure Blob (using Blobfuse to mount on Linux). As always you should examine the options to understand what meets requirements at the optimal price point. Additionally, if using stateless workloads you may be able to utilize ephemeral OS disks that removes the cost of the OS disk and instead uses the local storage on the host (like the temporary disk) but it means the state of the OS is lost on deprovision. For more information see https://docs.microsoft.com/en-us/azure/virtual-machines/windows/ephemeral-os-disks.
Azure has several platforms for applications, i.e. Platform as a Service (PaaS). A key benefit of PaaS is your responsibility moves up from the operating system to the application. This means tasks (and associated costs) related to OS management are no longer your responsibility which can be a huge cost saving both in terms of effort and tooling required such as patching solutions.
Containers are often a common evolution from virtual machines for application hosting as the virtualization shifts from the hardware to the operating system. There are a number of container solutions available in Azure however the first component to look at is the orchestration which will provide the overall management of a container solution including the provisioning and high availability of container instances, integration with load balancers, health and remediation, resource control, monitoring, network and storage integration, balancing of workloads and more. The most commonly used orchestration solution for containers is Kubernetes. Kubernetes consists of a number of components that are broken into the control plane (including the etcd data store, apiserver, scheduler, controller) that could be self-deployed into virtual machines which would incur resource costs and support costs especially when deployed in a highly available configuration and then the actual (worker) nodes. Azure Kubernetes Service (AKS) provides a per-tenant, dedicated and managed control plane at no cost. As the customer you only pay for the worker nodes that run the container instances. This means straight away you save the money commonly associated with hosting and maintaining the Kubernetes control plane.
Scaling of the worker nodes can be automated using the cluster autoscaler which, based on the load, will dynamically scale the number of nodes and associated pods within a minimum and maximum number that is specified.
Additional capacity can also be achieved using Azure Container Instances which provide a “container as a service” offering and can be seen as an infinite scale node through a virtual kubelet that enables management via Kubernetes and therefore AKS.
App Service Plans were one of the first ever Azure services and have a number of different plans that control the features available and associated resources. Take time to review the plans available at https://azure.microsoft.com/en-us/pricing/details/app-service/plans/ and pick the plan that meets your requirements. Also look at the limits, as the maximum number of instances also varies based on the plan; these can be found at https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/azure-subscription-service-limits#app-service-limits. Note that auto-scale is included in the Standard and above plans, which will enable the instances to be decreased and increased based on actual load requirements. Additionally, within each plan there are different instance sizes. Take time to understand the various combinations of plan, scale parameters and instance size to find the optimal configuration for your application.
Serverless offerings will generally lend themselves to the most optimized solution as they bill based on what is used. Logic Apps and Functions are two key examples of Azure serverless compute services. For Azure Functions you basically pay for the CPU and memory used during executions (remember you get a certain amount free each month). For Azure Logic Apps you pay for executions over the connectors that enable Logic Apps to integrate with all the various services utilized.
The most fundamental data service in Azure is the Azure Storage account, which offers blob, queue, table and files services. Azure Storage accounts bill based on the amount of data stored in the service, which helps ensure you pay only for what you are using; however, there are additional options to further optimize cost. For blob storage there are access tiers of hot, cool and archive. The storage cost decreases as you move from hot to cool to archive, however transaction costs increase, and with archive the data is not available live and must be moved back to hot or cool to access. Lifecycle management enables the use of policies to automatically move data between tiers and even to delete based on attributes such as last accessed. This enables data to be kept as required while paying the lowest applicable price for the storage. For blob and files there are also premium performance tiers that offer higher performance at increased cost. Once again ensure deployments match requirements. Another aspect is the resiliency. At minimum, storage is locally redundant (LRS), which means there are 3 copies within the region; however, additional resiliency through copies stored across availability zones and even asynchronously replicated to a paired region (resulting in 6 total copies of the data) is available. As resiliency levels are increased so too does the cost. Ensure the resiliency maps to requirements. For example, if an application has instances in multiple regions and the application replicates the data between instances, which is then written to local storage, it is unlikely that storage needs to also be replicated across regions (since the application is already doing that), which means LRS (or ZRS, redundant across zones in a region) would meet requirements while optimizing cost. For organizations with large file shares on-premises, Azure File Sync can be used to replicate and even tier less used data to Azure Files as part of a complete Windows file server solution.
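The decision a lifecycle policy encodes can be sketched as a simple rule on the number of days since a blob was last accessed. This is an illustration of the logic only, not the lifecycle management policy format; the function name and the 30/90/365 day thresholds are assumptions.

```python
def target_tier(days_since_access, cool_after=30, archive_after=90,
                delete_after=365):
    """Pick the tier (or deletion) for a blob based on days since last access.

    All thresholds are hypothetical; pick values that match your retention
    and access requirements.
    """
    if days_since_access >= delete_after:
        return "delete"
    if days_since_access >= archive_after:
        return "archive"
    if days_since_access >= cool_after:
        return "cool"
    return "hot"
```

A blob last read 45 days ago would move to cool, while one untouched for 120 days would move to archive, where storage is cheapest but the data must be rehydrated before it can be read.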
More information on storage accounts can be found at https://docs.microsoft.com/en-us/azure/storage/common/storage-account-overview
Building on Azure Storage are Azure Managed Disks, which are used by VMs, VMSS, containers and more. An Azure Managed Disk abstracts the underlying storage account to remove the associated management and consideration of individual account limits. Managed Disks come in a number of types and sizes from Standard HDD through to Ultra SSD. Understand the associated performance of the different types and pick to meet requirements. Within a disk type the size is based around capacity and performance needs, which for all types except Ultra scale together, i.e. performance increases with size. If you need to use a larger disk because higher performance is required, ensure you understand the bursting option for Premium SSD disks. For disk sizes P20 and below it is possible to accrue performance credit when the regular provisioned performance is not used that, like the B-series VM, can then be used in a burst capacity for up to 30 minutes, far exceeding the normal provisioned performance. For example, a 32GB P4 disk normally has provisioned performance of 120 IOPS and 25 MB/sec but with credit accrued can burst to 3,500 IOPS and 170 MB/sec. If you had this type of short burst requirement you may be able to purchase smaller disks instead of a disk with provisioned performance sized to meet the burst peak. For pricing details see https://azure.microsoft.com/en-us/pricing/details/managed-disks/
When looking at database services it is possible to install databases to VMs where all the resource considerations of a regular VM would apply or leverage managed, PaaS offerings. The PaaS offerings will generally provide a cheaper solution at equal or higher resiliency and performance while also reducing the overall management cost. For example, instead of installing PostgreSQL into IaaS VMs which would require two instances for high availability, a deployment to the managed PostgreSQL offering provides a cheaper solution with a 99.99% SLA. For open source hyperscale offerings automated scale options are available.
For SQL Server based solutions there are numerous offerings based on requirements such as pools of resource multiple databases can share to optimize use, hyperscale for large deployments and even serverless offerings.
For Cosmos DB a common pain in the past was ascertaining the correct Request Unit (RU) numbers to provision. Cosmos DB now features an autopilot mode. With autopilot the RUs will automatically be scaled to meet demand up to a maximum you specify. This ensures you pay only for what you use. Note that care should still be taken with how data is partitioned and how operations against it are formed to ensure the RUs are used in the most efficient way. If operations must operate on multiple partitions because of poor design, the RU cost will be higher than if the operation can run against a smaller number of partitions. For more information see https://docs.microsoft.com/en-us/azure/cosmos-db/provision-throughput-autopilot
There are many ways to continue the optimization process but a major part of this should be using Azure Advisor. Azure Advisor brings best practice guidance around a number of key areas: performance, operational excellence, high availability, security and cost. It is the cost recommendations that will be the focus of this section and details can be found at https://docs.microsoft.com/en-us/azure/advisor/advisor-cost-recommendations.
Azure Advisor constantly evaluates the resources deployed and their use. After a period of time the evaluation model provides recommendations based on the observed usage to save money and includes the estimated annual saving. Some types of remediation may be as simple as deleting an unused resource while others may require some additional checking and process, but all should be explored. At minimum look at the advisor weekly to ensure you are doing everything possible to eliminate unnecessary spending.
Below is an example screen shot to get an idea of what may be exposed and note the quick fix option for resources that can easily be removed.
This was not exhaustive; there are other considerations like network costs, gateways and more that would be influenced by architecture, but the key items touched on can greatly help optimize your cost. Optimization is an ongoing effort and is something to be considered during initial planning and for the lifetime of a workload. Additionally, while consideration of all the types of optimization discussed is critical, easy wins are exposed by Azure Advisor, so ensure its review is a core part of any weekly process. Note that more advanced optimizations, such as a move from an IaaS based solution to a managed data offering, will not be recommended by Azure Advisor, which is why Azure Advisor should be considered just one piece of the overall ongoing optimization effort.