Azure Spring Clean - Managing your Non-Production Azure Environments

This post is part of the Azure Spring Clean series, an Azure community event focused on Azure management and best practices for keeping your Azure environments nice and tidy! Shout out to Joe Carlyle and Thomas Thornton for coordinating the event and, of course, for allowing me to make my own small contribution. You can find more details alongside other great community contributions at https://www.azurespringclean.com or over on Twitter.

Pfff, it’s not production who cares???

Most organizations are rightly very focused on their production environments. They are, after all, PRODUCTION! We all spend the time making sure that we apply best practices for everything, from how we provision those resources to how we secure and monitor them. Unfortunately for those non-production environments, it’s generally a different story. We use a different set of rules, or sometimes none at all! For many organizations, it’s simply the wild west where everyone does whatever they please!

To be fair, though, we do it for a reason. We generally need more flexibility when it comes to these environments. They are where our developers explore and try out new features or work through bug fixes. They may be very short-lived environments in the cases of labs or workshops where users are spending more hands-on time.

In almost all cases, some underlying requirements necessitate flexibility. This is something that most organizations will struggle with at some point. In some unfortunate cases, they learn the hard way what happens when you don’t respect those non-production environments. But where do you draw the line?

If you don’t draw the line and you’re lucky, it’s an unexpected bill. If you’re not, it’s a security or data breach through one of your “non-production” environments, and you have no idea… until it’s too late.

How can we reduce some risk and keep our flexibility?

At the end of the day, it’s simply not feasible to treat these non-production environments with the same level of governance and controls. There is, however, one very important attribute that significantly impacts everything from cost all the way down to security: how long the resources live.

In general, the blast radius for costs and security can be mitigated by reducing how long these environments stay around. How many times have you spun up a new App Service plan or an Azure SQL environment, only to come back days, weeks, or months (Gasp!) later and realize you forgot to clean up after yourself? Or maybe you see other resources used by developers and QA teams and wonder whether they are even being used?

The solution - Automatic Resource Cleanup

Implementing something as simple as automatic resource cleanup can significantly reduce the risk around these non-production environments. From the initial creation of these resources in Azure, the clock should be ticking! There are quite a few different ways we can address this, but in my experience, simplicity is key to any solution, and you can’t get much simpler than simply tagging resource groups for automatic cleanup.

Implementation Details

The complete source code for the provided examples is available at https://github.com/joshdcar/spring-clean-resource-cleanup

For our solution, we’re going to start with the premise that we tag the resource groups we would like cleaned up. Something as simple as a couple of tags such as:

  • expiration-tag: A tag value such as “Lab” identifying the type of environment, in case we want different rules for different environments.
  • expiration-date: The date on which the resource group should be automatically deleted.
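
Tag values are plain strings, so it helps to pick one date format and stick to it. Here is a minimal sketch of building the two tags, assuming an ISO 8601 UTC format for expiration-date (the post itself does not prescribe a format). The `ExpirationTags` class and its `Build` method are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class ExpirationTags
{
    // Tag names come from the post; the value format below is an assumption.
    public const string TypeTag = "expiration-tag";
    public const string DateTag = "expiration-date";

    // Builds the tag dictionary for a resource group that should expire at the given time.
    public static Dictionary<string, string> Build(string environmentType, DateTimeOffset expires)
    {
        return new Dictionary<string, string>
        {
            [TypeTag] = environmentType,
            // Normalize to UTC and write a culture-independent ISO 8601 string.
            [DateTag] = expires.UtcDateTime.ToString(
                "yyyy-MM-dd'T'HH:mm:ss'Z'", CultureInfo.InvariantCulture)
        };
    }
}
```

For example, `ExpirationTags.Build("Lab", new DateTimeOffset(2021, 3, 1, 17, 0, 0, TimeSpan.Zero))` produces an expiration-date value of `2021-03-01T17:00:00Z`, which parses the same way regardless of the server’s culture settings.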

Azure Tags

Azure Functions, specifically a timer trigger, is a natural fit for executing a task on a scheduled basis and doing so at an exceedingly low cost (bordering on free for our scenario). Let’s see what this might look like in a function.


       [FunctionName(nameof(ResourceCleanupTimerTrigger))]
       public async Task ResourceCleanupTimerTrigger([TimerTrigger("0 */5 * * * *")]TimerInfo myTimer, 
           ILogger log)
       {

           log.LogInformation($"Executing Resource Cleanup at : {DateTime.Now}");

           var subscriptionId = Environment.GetEnvironmentVariable("SubscriptionId");
           var expireTagKey = Environment.GetEnvironmentVariable("ExpireTagKey");

           var credentials = new DefaultAzureCredential();
           var resourceClient = new ResourcesManagementClient(subscriptionId, credentials);
           
           var resourceGroups = resourceClient.ResourceGroups;
           var resourceGroupPages = resourceGroups.ListAsync($"tagName eq 'expiration-tag' and tagValue eq '{expireTagKey}'").AsPages();

           await foreach (Azure.Page<ResourceGroup> groupPage in resourceGroupPages)
           {
               foreach (var group in groupPage.Values)
               {
                   log.LogInformation($"Resource Group Name: {group.Name}");

                   var expireDateTagExists = group.Tags.ContainsKey("expiration-date");

                   if(expireDateTagExists){

                       var expireDateTag = group.Tags["expiration-date"];

                       var expireDate = default(DateTime);
                       var validDate = DateTime.TryParse(expireDateTag, out expireDate);

                       if(validDate){

                           if(DateTime.Now > expireDate) {
                               
                               log.LogInformation($"{group.Name} resource group expired. Expired {expireDate} and today's date is {DateTime.Now}. Deleting resource group.");

                               // StartDeleteAsync only begins the long-running delete; it does not wait for completion
                               await resourceGroups.StartDeleteAsync(group.Name);

                               log.LogInformation($"{group.Name} resource group deletion started.");
                           }
                           else{
                               log.LogInformation($"{group.Name} not expired yet. Expires {expireDate}");
                           }
                           
                       }
                       else{
                            log.LogInformation($"{group.Name} resource group expiration-date value {expireDateTag} is not a valid date.");
                       }
                   }
                   else{
                       log.LogInformation($"{group.Name} resource group 'expiration-date' tag missing.");
                   }
               }

           }

       }

Let’s break down what this function is doing:

  1. The function executes on a scheduled timer. For demonstration purposes this runs every 5 minutes, but once a day is generally adequate.
  2. The function searches for resource groups with the configured tag. In our case, expiration-tag = Lab.
  3. The function then pulls the value of expiration-date and compares it to the current date. If it is past the expiration date, the resource group is deleted.
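
The decision in step 3 can be pulled out into a small pure function, which makes it easy to test without touching Azure. This is a sketch, not the code from the repository: the `ExpirationCheck` class and `ExpirationStatus` enum are hypothetical, and it swaps the culture-dependent `DateTime.TryParse` for an invariant-culture `DateTimeOffset.TryParse` that treats dates without an offset as UTC (an assumption about how the tags are written).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public enum ExpirationStatus { MissingDate, InvalidDate, NotExpired, Expired }

public static class ExpirationCheck
{
    // Decides what to do with a resource group based only on its tags and the current time.
    public static ExpirationStatus Evaluate(IDictionary<string, string> tags, DateTimeOffset now)
    {
        if (tags == null || !tags.TryGetValue("expiration-date", out var raw))
            return ExpirationStatus.MissingDate;

        // Invariant culture keeps parsing stable across machines;
        // AssumeUniversal treats values without an offset as UTC (assumption).
        if (!DateTimeOffset.TryParse(raw, CultureInfo.InvariantCulture,
                DateTimeStyles.AssumeUniversal, out var expires))
            return ExpirationStatus.InvalidDate;

        return now > expires ? ExpirationStatus.Expired : ExpirationStatus.NotExpired;
    }
}
```

The timer function would then only need to switch on the returned status: delete on `Expired`, and log the other three cases.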

It’s That Easy!

With just a few lines of code in an Azure Function and some due diligence tagging your resources, it’s really that easy to ensure your resources aren’t staying alive longer than they should be.

But Wait - I’m Not Done Yet!

There are some limits, however, to this overly simplistic solution. There are scenarios where you may not necessarily be done with your resources, although they are set to expire. Deleting those resources may have significant ramifications, especially if you’re mid-process on a feature or bug fix and your development environment has already been put to rest!

Let’s go ahead and build on this solution with something that offers more of the flexibility we need for these scenarios. In addition to our existing tags, we’re going to be adding an additional tag to the mix:

  • expiration-email: a contact email to confirm and extend the expiration BEFORE deleting the resources

If you think this sounds like a workflow, then you’d be absolutely correct! And what better platform to build workflows on, especially a workflow that has human interaction, than Durable Functions!

We will start with the same timer trigger as before but instead of deleting resources we will kick off a durable function orchestration.

 await client.StartNewAsync("ExtendExpirationOrchestrationTrigger",param);

The Workflow

This will then trigger our Orchestration for Extending the Expiration. Our orchestration is responsible for the coordination of a simple workflow to:

  1. Send an email notification with a link allowing a user to extend their expiration by a short period of time.
  2. If the user does not extend within a configured time, the resource group is deleted.
  3. If the expiration is extended, the cycle starts again when the new expiration is reached.

        [FunctionName(nameof(ExtendExpirationOrchestrationTrigger))]
        public static async Task ExtendExpirationOrchestrationTrigger(
            [OrchestrationTrigger] IDurableOrchestrationContext context)
        {
            var param = context.GetInput<ExtendModel>();
            param.InstanceId = context.InstanceId;

            // Set the response deadline before sending the email so the message can include it
            param.ResponseExpires = context.CurrentUtcDateTime.AddHours(param.ExtendHours);

            await context.CallActivityAsync("SendExtensionRequest", param);

            using (var timeoutCts = new CancellationTokenSource())
            {
                DateTime dueTime = param.ResponseExpires;
                Task durableTimeout = context.CreateTimer(dueTime, timeoutCts.Token);

                Task<bool> extendEvent = context.WaitForExternalEvent<bool>("ExtendExpiration");
                
                if (extendEvent == await Task.WhenAny(extendEvent, durableTimeout))
                {
                    timeoutCts.Cancel();

                    //extend the time of the expiration by configured amount
                    await context.CallActivityAsync("ExtendExpiration",param);
                }
                else
                {
                    //delete the resource
                    await context.CallActivityAsync("DeleteResource", param);
                }
            }
        }

In the case of the sample below, we’re using SendGrid and, specifically, the SendGrid Bindings for Azure Functions to manage the email communications side of the equation. The email contains a link to an Azure Function Http Trigger that will start the process for extending the expiration.


        [FunctionName(nameof(SendExtensionRequest))]
        public static async Task SendExtensionRequest([ActivityTrigger] ExtendModel settings,
                                                [SendGrid(ApiKey = "SendGridApiKey")] IAsyncCollector<SendGridMessage> messageCollector,
                                                ILogger log)
        {
            log.LogInformation($"Sending Extension Request email for {settings.ResourceGroupName} to {settings.ExpirationEmail}.");

            // WEBSITE_HOSTNAME contains only the host name (no scheme); https is assumed here
            var hostName = Environment.GetEnvironmentVariable("WEBSITE_HOSTNAME");
            var fromEmail = Environment.GetEnvironmentVariable("FromEmail");
            var extendUrl = $"https://{hostName}/api/extend/{settings.InstanceId}";
            var subject = $"Your Resource Group {settings.ResourceGroupName} is about to be deleted.";
            var messageBody = $@"FYI: Your Resource Group {settings.ResourceGroupName} is scheduled to be deleted. 
                                If you don't respond by {settings.ResponseExpires} (UTC) your resource group will be deleted.  
                                Visit {extendUrl} to extend your resource group for another {settings.ExtendHours} hours.
                                No action is necessary if you would like these resources deleted. ";

            var message = new SendGridMessage();
            message.AddTo(settings.ExpirationEmail);
            message.AddContent("text/html", messageBody);
            message.SetFrom(new EmailAddress(fromEmail));
            message.SetSubject(subject);

            await messageCollector.AddAsync(message);
            
        }

If the event is raised by the user selecting the email link, then our ExtendExpiration activity is executed, and we simply update the expiration-date tag with the new expiration date, which is based on configuration.


        [FunctionName(nameof(ExtendExpiration))]
        public static async Task ExtendExpiration([ActivityTrigger] ExtendModel settings, 
                                                ILogger log)
        {

            log.LogInformation($"Extending expiration for {settings.ResourceGroupName} (contact: {settings.ExpirationEmail}).");

            var subscriptionId = Environment.GetEnvironmentVariable("SubscriptionId");
            
            var credentials = new DefaultAzureCredential();
            var resourceClient = new ResourcesManagementClient(subscriptionId, credentials);
            
            var resourceGroupResult = await resourceClient.ResourceGroups.GetAsync(settings.ResourceGroupName);
            var resourceGroup = resourceGroupResult.Value;

            var expireDateTag = resourceGroup.Tags["expiration-date"];
            var expireDate = DateTime.Parse(expireDateTag);

            //Update the Resource Group with the new expiration date
            var newExpireDate = expireDate.AddHours(settings.ExtendHours);
            // "s" writes a sortable date and time; ToLongTimeString() would drop the date portion
            resourceGroup.Tags["expiration-date"] = newExpireDate.ToString("s");

            await resourceClient.ResourceGroups.CreateOrUpdateAsync(settings.ResourceGroupName,resourceGroup);

        }
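
The tag update in this activity boils down to three steps: parse the current value, add the configured hours, and write the result back as a string. Here is a minimal sketch of that step as a pure helper, assuming the tag holds an ISO 8601 value that is interpreted as UTC when no offset is present (an assumption, not something the post specifies). `ExpirationMath` and `Extend` are hypothetical names.

```csharp
using System;
using System.Globalization;

public static class ExpirationMath
{
    // Returns the new expiration-date tag value after extending by the given number of hours.
    public static string Extend(string currentTagValue, int extendHours)
    {
        // Throws FormatException if the tag value is not a valid date.
        var current = DateTimeOffset.Parse(
            currentTagValue, CultureInfo.InvariantCulture, DateTimeStyles.AssumeUniversal);

        // Write back the full date and time in UTC so the next cleanup run parses it unchanged.
        return current.AddHours(extendHours).UtcDateTime.ToString(
            "yyyy-MM-dd'T'HH:mm:ss'Z'", CultureInfo.InvariantCulture);
    }
}
```

With this helper, `ExpirationMath.Extend("2021-03-01T17:00:00Z", 24)` returns `2021-03-02T17:00:00Z`.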

If the expiration reminder is ignored, the resource group is deleted once the response window closes.


        [FunctionName(nameof(DeleteResource))]
        public static async Task DeleteResource([ActivityTrigger] ExtendModel settings, 
                                                ILogger log)
        {
            log.LogInformation($"Deleting {settings.ResourceGroupName} (contact: {settings.ExpirationEmail}).");

            var subscriptionId = Environment.GetEnvironmentVariable("SubscriptionId");
            
            var credentials = new DefaultAzureCredential();
            var resourceClient = new ResourcesManagementClient(subscriptionId, credentials);
            
            await resourceClient.ResourceGroups.StartDeleteAsync(settings.ResourceGroupName);
            
        }

We might consider a custom web page with some options for extending the expiration for a more polished user experience. In our case, we wanted this to be simple and inexpensive, so we’re simply using a “get” endpoint that can be triggered from the browser by a user, and we’re returning some simple HTML in the response.


        [FunctionName(nameof(ExtendExpirationHttpTrigger))]
        public static async Task<IActionResult> ExtendExpirationHttpTrigger(
            [HttpTrigger(AuthorizationLevel.Anonymous, "get", Route="extend/{instanceid}")] HttpRequestMessage req,
            [DurableClient] IDurableOrchestrationClient client,
            string instanceid,
            ILogger log)
        {

            int extensionHours = int.Parse(Environment.GetEnvironmentVariable("ExtendHours"));

            // Pass an explicit payload because the orchestrator waits for a bool event
            await client.RaiseEventAsync(instanceid, "ExtendExpiration", true);

            var message = $@"<html><head><title>Resource Group Expiration Extended</title></head>
                             <body><h2>Your resource group expiration has been increased by {extensionHours} hours. </h2> <p>You will receive another
                             notification prior to deletion.</p></body></html>";
            
            return new ContentResult { Content = message, ContentType = "text/html" };
   
        }

In Summary

Non-production environments should not be ignored. The risk of unexpected costs alongside security considerations is not inconsequential. The good thing is that with just a little due diligence, alongside simple tools and techniques such as automatically cleaning up your resources, you can significantly reduce that risk.

What’s Next?

There are additional options for getting a better handle on those non-production environments, especially powerful ones such as Azure Policy, which can prevent specific types of resources from being provisioned or require resources to be configured in specific ways to reduce both costs and risks. Using Policy and specific ARM templates can be especially useful in lab and workshop environments. I encourage you to check out Microsoft Learn and see how they approach creating a sandbox for their users. If you’re interested in learning more about these approaches and how I’ve implemented something similar, let me know!