Reduce Cloud Costs by Automating EC2 Images/AMIs Cleanup with AWS Lambda

Aleks S., Consultant, Cloud Services

Amazon Machine Images (AMIs) are among the cloud resources most likely to accumulate unnoticed. Left unmanaged, they drive up storage costs and create operational clutter across accounts and regions.

"With over 10 years of professional IT experience, half of them on public cloud projects, I have designed, built, supported and improved hybrid and multi-cloud environments across both AWS and Azure. I have seen repeatedly how, as teams iterate on infrastructure and deployments, hundreds of outdated AMIs and related Amazon Elastic Block Store (EBS) snapshots can accumulate and quietly inflate storage costs for cloud customers. And I have seen how Amazon Data Lifecycle Manager (ADLM) is never the complete solution."

The Challenge

Unmanaged AMIs accumulate fast, and without automation they become an invisible and expensive problem across AWS accounts. We needed a safe, automated way to identify, report on, and remove old EC2 images, regardless of origin, without risking active resources.

While ADLM is valuable, it only manages snapshots and AMIs created through its own policies. It does not touch images created manually, via CI/CD pipelines, or with custom automation, nor those shared or replicated in cross-account backup workflows. In every customer environment that Phi supports, the majority of EC2 images and snapshots fall into these categories, making it impossible to enforce lifecycle policies consistently with ADLM alone.

We needed a custom solution that could discover all AMIs and related snapshots regardless of origin, evaluate their creation dates and associations, and remove only those that were safe to delete.
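To make that requirement concrete, here is a minimal sketch of how such discovery and retention checks can work with boto3. This is an illustration, not Phi's actual code: the function names, the 90-day threshold, and the dry-run default are all hypothetical.

```python
"""Hypothetical sketch: discover all self-owned AMIs in a region, flag stale
ones, and (optionally) deregister them plus their EBS snapshots."""
from datetime import datetime, timedelta, timezone


def is_safe_to_delete(image, in_use_amis, max_age_days=90, now=None):
    """Return True if an AMI record (as returned by describe_images) is older
    than the retention window and not referenced by any instance."""
    now = now or datetime.now(timezone.utc)
    created = datetime.fromisoformat(image["CreationDate"].replace("Z", "+00:00"))
    too_old = now - created > timedelta(days=max_age_days)
    return too_old and image["ImageId"] not in in_use_amis


def discover_and_clean(region, dry_run=True):
    import boto3  # imported lazily so the pure helper above needs no AWS access

    ec2 = boto3.client("ec2", region_name=region)
    # Owners=['self'] captures every image owned by this account, regardless
    # of whether it was created manually, by CI/CD, or by an ADLM policy.
    images = ec2.describe_images(Owners=["self"])["Images"]
    # AMIs referenced by any instance (running or stopped) are never touched.
    in_use = {
        inst["ImageId"]
        for page in ec2.get_paginator("describe_instances").paginate()
        for res in page["Reservations"]
        for inst in res["Instances"]
    }
    for image in images:
        if not is_safe_to_delete(image, in_use):
            continue
        snap_ids = [
            bdm["Ebs"]["SnapshotId"]
            for bdm in image.get("BlockDeviceMappings", [])
            if "Ebs" in bdm
        ]
        action = "DRY-RUN: would deregister" if dry_run else "Deregistering"
        print(f"{action} {image['ImageId']} ({image.get('Name')}), snapshots: {snap_ids}")
        if not dry_run:
            ec2.deregister_image(ImageId=image["ImageId"])
            for sid in snap_ids:
                ec2.delete_snapshot(SnapshotId=sid)
```

The key design point, reflected in the solution described below, is separating the retention decision from the AWS calls so the logic can be tested and dry-run safely before any deletion happens.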
In this blog we will share how Phi addressed this challenge with a modular, Lambda-based solution that:

- Runs on a schedule using Amazon EventBridge
- Indexes every AMI and snapshot across regions
- Applies safe, configurable retention logic
- Removes unused resources automatically
- Generates reports for transparency and audit
- Deploys consistently across accounts using Terraform

By breaking the workflow into discrete steps, we could monitor, evaluate, and improve each component independently, while maintaining transparency for our customers. The following end-to-end architecture diagram demonstrates how all the components work together:

Implementation

Our solution consists of three AWS Lambda functions, each running a dedicated Python script that performs a distinct role in the lifecycle management process.

Inventory and Indexing

The Index Lambda creates an inventory (index) of all EC2 AMIs in each AWS region. For every image it captures:

- Name
- AMI ID
- Creation date
- Associated snapshot IDs
- Owning account and region

Once all of the data is gathered, it is uploaded to S3 as a structured JSON report.

Remediation

The Remediation Lambda uses the JSON report to identify stale, unused AMIs and deregister them automatically (or simulate this when dry-run is enabled). It selects only AMIs that:

- are owned by the account (not shared)
- match a configured name pattern
- are older than the allowed age
- are not in use

After cleaning is complete, it uploads a remediation log in CSV format to an S3 bucket for reporting.

Reporting

Once the Index and Remediation Lambdas have finished, the Report Lambda merges the AMI index with the remediation logs into a daily CSV report, as follows:

- Read the log file (produced by the Remediation script) for that region and date.
- Parse it into a dictionary of remediation summaries.
- Read the AMI index JSON created by the Index script.
- Merge each AMI with its remediation info.
- Produce a daily CSV report.

The reports are uploaded to S3 and/or emailed to administrators for further analysis.

Setup & Deployment

Building the solution was only the first part of the project. The ability to deploy it consistently and securely across multiple customer environments is what delivered the real impact. To automate deployment, we used Terraform to provision and configure every component: the Lambda functions, triggers, IAM roles, networking, and alerting.

Modular and Parameterized Design

At the top level, we defined an input object that captures all environment-specific details, including:

- Account alias and tags
- VPC subnets and security groups
- Notification settings (SMTP, alerts)
- Reporting bucket
- Supported regions

This let us deploy the same Lambda-based cleanup solution into multiple AWS accounts with minimal manual changes, making the deployment fully reusable and ideal for multi-account or managed-service environments.

Function Packaging and Deployment

Each Lambda (indexing, remediation, and reporting) is packaged using the Terraform archive_file data source. Each ZIP package is then uploaded and deployed as a Lambda function with common defaults:

- Runtime: Python 3.13
- Timeout: 900 seconds
- Memory: 1 GB
- VPC configuration: attached to customer-specific subnets and security groups

All Lambdas share a consistent set of environment variables (defined in local.defaults.env_vars) to keep the code portable and configuration-driven. These include SMTP settings (for push notifications to affected users), AWS region details, the reporting bucket, CMDB tags, and proxy configuration.

Scheduled Triggers with EventBridge

Each function is triggered automatically through Amazon EventBridge (formerly CloudWatch Events), with an EventBridge rule linked to the Lambda function via a target and permission block. To make better use of our Compute Savings Plan for cyclical workloads, we run non-time-sensitive compute operations outside business hours.
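As a hedged sketch of what such a staggered schedule looks like (in our deployment the rules are created by Terraform; the boto3 calls and rule names below are illustrative only):

```python
"""Illustrative sketch: staggered daily EventBridge schedules for the three
cleanup Lambdas. Rule names are hypothetical."""


def cron_at(hour_utc: int, minute: int) -> str:
    """Build an EventBridge cron expression for a daily run at a fixed UTC time."""
    return f"cron({minute} {hour_utc} * * ? *)"


# Staggered daily schedule (UTC), matching the index -> remediate -> report order.
SCHEDULE = {
    "ami-cleanup-index": cron_at(22, 0),
    "ami-cleanup-remediate": cron_at(22, 15),
    "ami-cleanup-report": cron_at(22, 30),
}


def create_rules(region: str) -> None:
    import boto3  # lazy import: the schedule table above needs no AWS access

    events = boto3.client("events", region_name=region)
    for rule_name, expression in SCHEDULE.items():
        events.put_rule(Name=rule_name, ScheduleExpression=expression, State="ENABLED")
```

The 15-minute gaps give each stage time to finish writing its S3 artifacts before the next stage reads them.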
Indexing begins at 22:00 UTC, Remediation at 22:15, and Reporting at 22:30. The result is a regular, predictable cleanup cycle with no manual intervention.

Monitoring and Alerts

For operational visibility, each Lambda has its own CloudWatch Log Group and an error alarm integrated with Amazon SNS. If any function fails or raises errors, an SNS alert is sent immediately to the configured endpoint (email, webhook, or monitoring system). This setup ensures there are no silent failures: every unexpected issue triggers an alert and can be investigated promptly.

Deployment Workflow

Here's how a typical deployment runs from start to finish:

1. Code preparation – Python scripts are stored in each path (for example src/functions/) and packaged automatically.
2. Terraform apply – Deploys all the infrastructure, including the Lambdas, EventBridge schedules, and monitoring resources.
3. Lambda execution – The Index function inventories AMIs and snapshots, the Remediation function deletes unused resources, and the Report function publishes a summary to S3.
4. Alerting and visibility – Reports land in S3, errors go to SNS, and logs are stored in CloudWatch.

This infrastructure-as-code approach gave us the reliability of AWS-native automation combined with the flexibility of custom logic.

Results

Only a couple of months after deploying the solution in different customer environments, the cost and usage reports showed significant savings.

AMI remediation report:

Snapshot remediation report:

The cost reduction was impressive. The solution was rolled out from July 2025.
Across the relevant environments we reduced AMI- and snapshot-related storage costs by up to 97%, eliminated hundreds of obsolete images, and improved operational hygiene across multiple regions.

Conclusion

By building our own Lambda-based cleanup automation, we achieved three clear outcomes:

- Cost control, by automatically pruning unused AMIs
- Scalability across accounts and regions, through parameterized infrastructure
- Operational visibility, with automated reporting and alerting

Amazon Data Lifecycle Manager remains valuable, but it does not cover the full lifecycle in real-world, multi-account environments. Our approach shows how combining native AWS services with lightweight Python automation can achieve precise, reliable lifecycle management without the limitations of pre-built tools.

In cloud operations, efficiency comes from taking ownership of your own automation. That is exactly what my team helps our clients achieve. By pairing native AWS services with lightweight, reusable automation, Phi helps organisations save money, gain visibility, and enforce operational discipline, keeping cloud estates lean.

If you're looking to strengthen your FinOps practices or automate cloud cost optimisation, we would be happy to help. Contact us for further information on relevant services: sales@phipartners.com