Ever pushed out a change that unexpectedly broke things? Don't panic! With a solid change restoration process, you can quickly roll things back to a known good state. It's like an "undo" button for your cloud environment.
Where did this come from?
This control comes from the CSA Cloud Controls Matrix v4.0.10 published on 2023-09-26. You can download the full matrix here. The matrix provides a handy reference of key security controls to consider when using cloud services.
Who should care?
This one is super relevant for:
- DevOps Engineers responsible for deploying changes to production
- IT Managers who need to ensure service stability
- Developers pushing out new code releases
- SREs who care about system reliability and recovery
What is the risk?
Without a way to quickly revert bad changes, you could be stuck with:
- Broken functionality impacting customers
- Security vulnerabilities introduced by faulty code
- Noncompliance with regulatory requirements
- Extended downtime and lost productivity
Rolling back is often the fastest path to recovery. The alternative is trying to debug and fix forward, which can take much longer.
What's the care factor?
For orgs with complex systems and rapid change velocity, this control is critical. An automated rollback process can make the difference between a brief glitch and an extended outage. Solid change restoration is table stakes for high-performing IT.
Smaller, more static environments may be able to get by with manual rollbacks. But implementing automated rollback is never a bad idea. It's like an insurance policy against bad deploys.
When is it relevant?
Change restoration should be on your radar anytime you're:
- Updating existing systems and applications
- Deploying new services and components
- Patching servers and updating packages
- Modifying network configurations
- Applying security policies and rulesets
It's less applicable for:
- Stateless, immutable infrastructure
- Serverless functions
- Purely additive changes that don't modify existing systems
What are the tradeoffs?
Implementing automated change restoration does require some upfront effort:
- Defining rollback procedures
- Ensuring all changes are reversible
- Maintaining pre-change state/config backups
- Testing and validating rollback processes
This can slow down velocity a bit as changes need to be more carefully planned and tested. But it pays off in faster recovery times.
There's also some risk that a rollback doesn't fully clean up or restore state. So monitoring post-rollback is important too.
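One way to catch an incomplete rollback is to compare what you have afterwards against the snapshot you took before the change. Here's a minimal sketch of that idea, assuming your state can be captured as a nested dictionary. The function name `diff_state` and the sample keys are made up for illustration, not taken from any particular tool.

```python
def diff_state(expected: dict, actual: dict, prefix: str = "") -> list[str]:
    """Compare a pre-change snapshot with the post-rollback state.

    Returns a list of human-readable differences; an empty list means
    the rollback restored everything that the snapshot recorded.
    """
    problems = []
    for key in sorted(set(expected) | set(actual)):
        path = f"{prefix}{key}"
        if key not in actual:
            problems.append(f"missing after rollback: {path}")
        elif key not in expected:
            problems.append(f"left over after rollback: {path}")
        elif isinstance(expected[key], dict) and isinstance(actual[key], dict):
            # Recurse into nested sections of the config
            problems.extend(diff_state(expected[key], actual[key], prefix=f"{path}."))
        elif expected[key] != actual[key]:
            problems.append(f"changed: {path} ({expected[key]!r} -> {actual[key]!r})")
    return problems


if __name__ == "__main__":
    # Hypothetical sample data: the rollback left a stray setting behind
    snapshot = {"replicas": 3, "env": {"LOG_LEVEL": "info"}}
    after_rollback = {"replicas": 3, "env": {"LOG_LEVEL": "info", "NEW_FLAG": "1"}}
    print(diff_state(snapshot, after_rollback))
    # prints ['left over after rollback: env.NEW_FLAG']
```

A check like this only sees what the snapshot captured, so it complements monitoring rather than replacing it.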
How to make it happen?
Some key practices:
- Use Infrastructure-as-Code tools like Terraform, CloudFormation, or ARM templates to model intended state
- Always keep the previously deployed version available to roll back to
- Back up key state and configuration before changes (AMIs, database snapshots, etc.)
- Automate deployment/rollback with CI/CD pipelines (Jenkins, GitHub Actions, etc.)
- Define rollback steps in the pipeline, triggered if post-deploy tests fail
- After rollback, run tests to validate that everything was restored as intended
- Log and monitor rollback events for audits and learning
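The deploy, test, roll back, re-test flow from the list above can be sketched in a few lines. This is an illustration under assumptions, not a real pipeline: `apply` and `post_deploy_checks` are hypothetical callables standing in for whatever your CI/CD tool actually runs, and `deploy_with_rollback` and `RollbackFailed` are names invented for this example.

```python
import logging
from typing import Callable

log = logging.getLogger("deploy")


class RollbackFailed(RuntimeError):
    """Raised when the previous version also fails its checks."""


def deploy_with_rollback(
    new_version: str,
    previous_version: str,
    apply: Callable[[str], None],            # hypothetical: deploys the given version
    post_deploy_checks: Callable[[], bool],  # hypothetical: smoke tests, True = healthy
) -> str:
    """Deploy new_version; revert to previous_version if checks fail.

    Returns the version that is live when the function finishes.
    """
    apply(new_version)
    if post_deploy_checks():
        log.info("deployed %s", new_version)
        return new_version

    # Checks failed: restore the last known good version
    log.warning("checks failed for %s, rolling back to %s", new_version, previous_version)
    apply(previous_version)

    # Validate that the rollback itself worked before declaring success
    if not post_deploy_checks():
        raise RollbackFailed(f"{previous_version} is unhealthy after rollback")

    log.info("rolled back to %s", previous_version)
    return previous_version
```

The log calls are where the audit trail for rollback events would come from. In a real pipeline you would likely ship those events to your monitoring system as well.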
What are some gotchas?
Watch out for:
- Database schema changes - rolling back code without reversing DB updates can leave the old code running against a schema it doesn't expect
- Stateful systems - restoring a previous state may lose interim data
- Cascading changes - a rollback in one component may break dependencies
- Outdated golden images - ensure VM/container images are kept up-to-date
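For the database schema gotcha, a common mitigation is to pair every forward migration with a matching reverse step, so the schema can be walked back along with the code. Below is a rough sketch using SQLite from the Python standard library. The `MIGRATIONS` table, the table names, and the helper functions are all hypothetical, and it assumes versions are numbered 1, 2, 3 and so on without gaps.

```python
import sqlite3

# Each entry: (version, SQL to apply it, SQL to reverse it). Sample data only.
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)", "DROP TABLE users"),
    (2, "CREATE TABLE audit_log (id INTEGER PRIMARY KEY, event TEXT)", "DROP TABLE audit_log"),
]


def schema_version(conn: sqlite3.Connection) -> int:
    """Read the schema version stored in SQLite's user_version pragma."""
    return conn.execute("PRAGMA user_version").fetchone()[0]


def migrate_to(conn: sqlite3.Connection, target: int) -> None:
    """Move the schema up or down to the target version, one step at a time."""
    current = schema_version(conn)
    if target > current:
        # Upgrade: run "up" steps in ascending order
        steps = [(v, up) for v, up, _ in MIGRATIONS if current < v <= target]
    else:
        # Rollback: run "down" steps in descending order
        steps = [(v - 1, down) for v, _, down in reversed(MIGRATIONS) if target < v <= current]
    for new_version, sql in steps:
        conn.execute(sql)
        # PRAGMA does not accept bound parameters, so format a checked integer
        conn.execute(f"PRAGMA user_version = {int(new_version)}")
        conn.commit()
```

Note that the reverse steps here drop tables, which throws away any data written since the upgrade. That's the stateful systems gotcha showing up in practice, so real down migrations usually need a plan for interim data.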
Be sure your deployment tooling has the permissions it needs to perform rollbacks.
What are the alternatives?
Some other common approaches:
- Blue-green deployments - cut over between separate new/old prod environments instead of updating in place
- Canary releases - slowly expose a change to a segment of traffic for early validation before full cutover
- Feature flags - toggle new code in/out for more granular rollout control
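Canary releases and feature flags often share one mechanism: a deterministic check that puts each user inside or outside the rollout. Here's a small sketch of a percentage-based flag, assuming users are identified by a string ID. The function name `in_rollout` and the flag name in the usage note are invented for illustration.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the rollout percentage.

    Hashing flag + user ID gives each user a stable bucket from 0 to 99,
    so the same user gets the same answer on every request.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

With this approach, calling `in_rollout("new-checkout", user_id, 5)` would expose roughly 5% of users, and dropping the percentage to 0 switches the change off for everyone without a redeploy.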
Explore further
- Check out the CIS Benchmarks for AWS, Azure, and GCP for more cloud security goodness
- Good change mgmt practices in ITIL 4: Change Control