Data retention and deletion is a critical aspect of data security and privacy that organizations need to manage carefully. This involves establishing clear policies and procedures for how long data should be kept, when it needs to be archived, and when and how it should be permanently deleted. Getting data retention and deletion right is key to complying with various laws and regulations while meeting business needs.
Where did this come from?
This article was inspired by Control DSP-16 from the CSA Cloud Controls Matrix v4.0.10 published on 2023-09-26. You can download the full Cloud Controls Matrix here. For more information on data retention and deletion in AWS specifically, check out the AWS Documentation on Data Retention.
Who should care?
- Compliance officers with responsibility for ensuring the organization meets its regulatory obligations
- Privacy managers with a need to protect customer data and respect data subject rights
- Data owners with accountability for governing access to the data they are responsible for
- IT managers with the job of implementing retention and deletion processes
- Legal counsel with a duty to advise the company on relevant laws and risks
What is the risk?
Failing to properly manage data retention and deletion can lead to:
- Violating record keeping regulations by deleting data too soon
- Violating privacy regulations by keeping personal data longer than needed
- Wasting money on storage costs for data no longer required
- Increasing exposure in the event of a data breach
- Being unable to respond to litigation/audit requests for historical data
The DSP-16 control helps mitigate these risks by ensuring retention policies are defined, documented, and consistently followed. However, the control is only as effective as the mapping behind it: every data type has to be matched accurately to the retention requirements that apply to it.
What's the care factor?
For most organizations, data retention and deletion should be a high priority because:
- Regulatory fines for non-compliance can be severe
- Reputational damage from mishandling customer data can be hard to recover from
- Keeping data longer than needed increases the "blast radius" of a breach
- Storage costs for stale data can really add up over time
That said, organizations with very limited personal data and subject to few regulations may be able to treat this as a lower priority. It's a sliding scale based on risk exposure.
When is it relevant?
Data retention and deletion is relevant for:
- Structured data stored in databases
- Unstructured data like documents, images and videos
- Backup copies and archived data
- Log files and audit records
- Customer data, employee data, and corporate data
It's less relevant for:
- Transient/temporary data like web session data
- Derived data that can be regenerated from other data
- Public data not subject to record keeping laws
What are the trade-offs?
Implementing rigorous retention and deletion has costs:
- Increased complexity of tracking retention periods
- Potential for accidental or premature deletion
- Reduced ability to leverage old data for analytics/ML
- Slower data retrieval if old data is archived to cheap storage
So retention policies need to be tuned to balance legal requirements, business value, and the cost of retention. Don't keep data longer than needed, but also don't delete it prematurely if it's required by law, needed by the business, or needed to support customer rights.
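One way to make this balance concrete is to treat each requirement as either a minimum (keep at least this long) or a maximum (keep no longer than this), then clamp the period the business would like between them. The sketch below is a minimal illustration only: the `Requirement` class, the `resolve_retention_days` function, and all of the day counts are hypothetical and not taken from any regulation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Requirement:
    source: str                      # law, contract, or policy (hypothetical labels)
    min_days: Optional[int] = None   # must keep at least this long
    max_days: Optional[int] = None   # must not keep longer than this


def resolve_retention_days(requirements: Sequence[Requirement], business_days: int) -> int:
    """Clamp the period the business wants between the legal minimum and maximum."""
    minimums = [r.min_days for r in requirements if r.min_days is not None]
    maximums = [r.max_days for r in requirements if r.max_days is not None]
    longest_min = max(minimums, default=0)
    shortest_max = min(maximums, default=None)

    # A minimum that exceeds a maximum is a genuine conflict. Surface it for
    # legal/compliance review instead of silently picking a side.
    if shortest_max is not None and longest_min > shortest_max:
        raise ValueError(
            f"conflict: must keep {longest_min} days but may keep at most {shortest_max}"
        )

    days = max(business_days, longest_min)
    if shortest_max is not None:
        days = min(days, shortest_max)
    return days


# Illustrative inputs only; the numbers are assumptions, not legal advice.
invoices = [
    Requirement("record-keeping rule", min_days=2555),
    Requirement("privacy policy", max_days=3650),
]
leads = [Requirement("privacy policy", max_days=730)]

print(resolve_retention_days(invoices, business_days=365))   # 2555
print(resolve_retention_days(leads, business_days=1095))     # 730
```

Raising an error on conflicting requirements is a deliberate choice here: a conflict between a record-keeping minimum and a privacy maximum usually needs a human decision rather than a default.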
How to make it happen?
- Catalog all the types of data held by the organization
- Map each data type to the relevant retention requirements (laws, regulations, contracts, policies)
- Determine the required retention period for each data type based on those requirements
- Tag each dataset with its retention period and data owner
- Put datasets on the appropriate storage tier based on retention period (e.g. long-term data on cheap object storage)
- Implement automated rules to archive data to cheaper storage once it is no longer actively used but its retention period has not yet expired
- Implement automated deletion rules to permanently delete data once the retention period fully expires
- For high-risk data, require approval workflow before automated deletion
- Maintain an auditable record of when datasets were archived and deleted
- Periodically audit a sample of datasets to ensure retention periods are accurate and deletions are happening on schedule
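For data held in S3, the tagging, archiving, and deletion steps above can be sketched as a tag-filtered lifecycle configuration. This is a minimal sketch under stated assumptions: the `retention-class` tag key, the catalog entries, the day counts, and the bucket name in the comment are all hypothetical. The dictionary layout follows the S3 lifecycle configuration format accepted by boto3's `put_bucket_lifecycle_configuration`.

```python
import json


def build_lifecycle_rule(retention_class: str, archive_after_days: int,
                         delete_after_days: int) -> dict:
    """Build one S3 lifecycle rule for objects tagged with a retention class."""
    if not 0 < archive_after_days < delete_after_days:
        raise ValueError("archiving must happen before deletion")
    return {
        "ID": f"retention-{retention_class}",
        "Status": "Enabled",
        # Only objects carrying this tag are affected ('retention-class' is an assumed key).
        "Filter": {"Tag": {"Key": "retention-class", "Value": retention_class}},
        # Days are counted from object creation.
        "Transitions": [{"Days": archive_after_days, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": delete_after_days},
    }


# Hypothetical catalog: retention class -> (archive after, delete after), in days.
catalog = {
    "invoices": (365, 2555),
    "support-tickets": (180, 1095),
}

config = {
    "Rules": [
        build_lifecycle_rule(name, archive, delete)
        for name, (archive, delete) in catalog.items()
    ]
}
print(json.dumps(config, indent=2))

# To apply it (requires boto3 and credentials; note that this call replaces
# any lifecycle configuration already on the bucket):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket", LifecycleConfiguration=config)
```

On versioned buckets, `Expiration` only adds a delete marker to the current version, so permanent deletion would also need a rule for noncurrent versions. The audit record and approval steps are not covered by this sketch.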
What are some gotchas?
- Applying retention periods requires the ability to tag data, and not every dataset or repository supports tagging
- Automated deletion rules require permissions to modify/delete data in each relevant repository. For S3 this means allowing actions like s3:DeleteObject (docs)
- Data in backups also needs to have retention periods applied, which can be tricky
- Regulations like GDPR give data subjects the right to request deletion before the normal retention period ends, so the process needs a path for early deletion
- Legal holds may require retention beyond the normal period for specific datasets involved in litigation
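The legal hold and early deletion gotchas boil down to an ordering question: which rule wins when several apply to the same dataset. The sketch below shows one possible ordering, in which a legal hold always blocks deletion, an erasure request can trigger early deletion unless a mandatory retention obligation applies, and high-risk data is routed to approval rather than deleted automatically. The `Dataset` fields, the `decide` function, and the ordering itself are assumptions for illustration, not a statement of what any regulation requires.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class Action(Enum):
    KEEP = "keep"
    DELETE = "delete"
    NEEDS_APPROVAL = "needs_approval"


@dataclass(frozen=True)
class Dataset:
    name: str
    created: date
    retention_days: int
    legal_hold: bool = False           # involved in litigation: never auto-delete
    erasure_requested: bool = False    # data subject asked for deletion
    mandatory_retention: bool = False  # a law requires keeping for the full period
    high_risk: bool = False            # deletion needs human sign-off


def decide(ds: Dataset, today: date) -> Action:
    # 1. A legal hold overrides everything else.
    if ds.legal_hold:
        return Action.KEEP

    expired = today >= ds.created + timedelta(days=ds.retention_days)
    # 2. An erasure request allows early deletion unless retention is mandatory.
    early_erasure = ds.erasure_requested and not ds.mandatory_retention
    if not (expired or early_erasure):
        return Action.KEEP

    # 3. High-risk data goes through an approval workflow instead of auto-delete.
    return Action.NEEDS_APPROVAL if ds.high_risk else Action.DELETE


today = date(2024, 1, 1)
expired_logs = Dataset("logs", created=date(2020, 1, 1), retention_days=365)
held_mail = Dataset("mail", created=date(2020, 1, 1), retention_days=365, legal_hold=True)
print(decide(expired_logs, today))   # Action.DELETE
print(decide(held_mail, today))      # Action.KEEP
```

Each decision returned by a function like this is also a natural event to write to the audit record of archivals and deletions.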
What are the alternatives?
Rather than building a custom retention and deletion system, consider:
- Using AWS Backup's built-in retention period features to automate retention for supported services
- Deploying a COTS data governance tool that provides policy-based retention and deletion workflows across multiple repositories
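As a rough illustration of the AWS Backup option, a backup plan rule carries a `Lifecycle` block that controls when recovery points move to cold storage and when they are deleted. The sketch below only builds and validates the plan document. The plan name, rule name, vault name, schedule, and day counts are hypothetical, and the `build_backup_rule` helper is not part of any AWS library.

```python
from typing import Optional


def build_backup_rule(rule_name: str, vault_name: str, schedule: str,
                      delete_after_days: int,
                      cold_after_days: Optional[int] = None) -> dict:
    """Build one AWS Backup plan rule with a retention lifecycle."""
    lifecycle = {"DeleteAfterDays": delete_after_days}
    if cold_after_days is not None:
        # AWS Backup keeps recovery points in cold storage for at least 90 days,
        # so deletion must come at least 90 days after the cold-storage transition.
        if delete_after_days < cold_after_days + 90:
            raise ValueError("DeleteAfterDays must be >= MoveToColdStorageAfterDays + 90")
        lifecycle["MoveToColdStorageAfterDays"] = cold_after_days
    return {
        "RuleName": rule_name,
        "TargetBackupVaultName": vault_name,
        "ScheduleExpression": schedule,
        "Lifecycle": lifecycle,
    }


plan = {
    "BackupPlanName": "example-retention-plan",   # hypothetical name
    "Rules": [
        build_backup_rule(
            rule_name="daily-one-year",
            vault_name="example-vault",
            schedule="cron(0 5 ? * * *)",         # daily at 05:00 UTC
            delete_after_days=365,
            cold_after_days=30,
        )
    ],
}
print(plan["Rules"][0]["Lifecycle"])

# To create the plan (requires boto3 and credentials):
#   import boto3
#   boto3.client("backup").create_backup_plan(BackupPlan=plan)
```

Not every supported resource type can transition to cold storage, so the cold-storage setting may be ignored for some services. Check the AWS Backup feature availability table before relying on it.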
Explore further
- ISO/IEC 27701 has relevant guidance on data minimization and retention (link)
- NIST SP 800-88 goes deep on data sanitization techniques (link)
- AWS Whitepaper: Data Lifecycle Management (link)
- CIS Controls v8 Safeguards 3.4 (Enforce Data Retention) and 3.5 (Securely Dispose of Data) cover retention and disposal (link)