Data retention policies and archiving strategies

These concepts aren’t just important for passing the exam; they are essential for anyone working with data on AWS or any other platform.

Understanding Data Retention

Data retention policies govern how long an organization keeps its data. Different types of data need to be kept for different lengths of time, depending on regulatory requirements, business needs, and other factors such as storage cost.

AWS offers several services that can help implement effective data retention policies:

Amazon S3 Lifecycle Policies: You can use these policies to automatically transition S3 objects to a different storage class once they reach a set age, or to delete them. For example, a lifecycle policy can move data older than 30 days to S3 Standard-IA (Infrequent Access) to save on storage costs and then delete it after 365 days.
AWS Glue Data Catalog: The Glue Data Catalog is a centralized repository that stores metadata about datasets held in different AWS data stores. With it, you can keep track of your datasets and where each one is in its lifecycle.
AWS Backup: AWS Backup offers a centralized service for backing up data across AWS services. You can set backup policies that define how frequently backups should occur and how long they should be retained.
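
The lifecycle example above (Standard-IA after 30 days, deletion after 365 days) can be sketched as a rule document. This is a minimal sketch rather than a definitive setup: the rule ID and the logs/ prefix are hypothetical, and the dictionary follows the shape that boto3's put_bucket_lifecycle_configuration call accepts as its LifecycleConfiguration argument.

```python
import json

# Hypothetical rule: the ID and prefix are illustrative assumptions.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "move-to-ia-then-expire",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            # Move objects to S3 Standard-IA 30 days after creation.
            "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
            # Delete objects 365 days after creation.
            "Expiration": {"Days": 365},
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

With boto3 installed and credentials configured, the dictionary could be applied with boto3.client("s3").put_bucket_lifecycle_configuration(Bucket="my-bucket", LifecycleConfiguration=lifecycle_configuration), where my-bucket is a placeholder name.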

Data Archiving Strategies

Data archiving consists of moving data that is no longer actively used to a system designed for long-term retention. AWS provides multiple services to help implement archiving, the most notable being Amazon S3 Glacier and Amazon S3 Glacier Deep Archive.

Amazon S3 Glacier is a storage class within Amazon S3 that provides secure and durable storage for data archiving and backup. With S3 Glacier, you can store data for roughly $0.004 per gigabyte per month; exact prices vary by Region.

Amazon S3 Glacier Deep Archive is another storage class within Amazon S3, designed to be the lowest-cost storage class in S3. It is intended for data that might only need to be retrieved once or twice a year, such as regulatory archives.

S3 Glacier	S3 Glacier Deep Archive
Ideal for data that is accessed infrequently, but requires rapid access when needed	Suited for data that is rarely accessed and for which retrieval times of several hours are acceptable
Retrieval times from a few minutes to several hours	Retrieval times within 12 hours (standard) or 48 hours (bulk)
Data stored for as low as $0.004 per gigabyte per month	Data stored for as low as $0.00099 per gigabyte per month
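
To make the price difference concrete, here is a small worked calculation using the per-gigabyte figures quoted in the comparison above. It is only a sketch: the 10 TB archive size is a hypothetical example, actual prices vary by Region, and retrieval and request fees are left out.

```python
# Per-GB monthly storage prices quoted in the comparison above (USD).
PRICE_PER_GB_MONTH = {
    "S3 Glacier": 0.004,
    "S3 Glacier Deep Archive": 0.00099,
}


def monthly_storage_cost(size_gb: float, storage_class: str) -> float:
    """Return the storage-only monthly cost in USD for the given class."""
    return size_gb * PRICE_PER_GB_MONTH[storage_class]


archive_size_gb = 10_000  # hypothetical 10 TB archive
for storage_class in PRICE_PER_GB_MONTH:
    cost = monthly_storage_cost(archive_size_gb, storage_class)
    print(f"{storage_class}: ${cost:.2f} per month")
```

For this hypothetical archive, the storage-only bill is about $40 per month in S3 Glacier and about $9.90 per month in S3 Glacier Deep Archive.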

In your archiving strategy, you should also consider using lifecycle policies to automate the transition of data to Glacier or Glacier Deep Archive.
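
One way to sketch such a transition rule is with a small helper that builds a two-step archive rule. The helper name, the reports/ prefix, and the day counts are all hypothetical; GLACIER and DEEP_ARCHIVE are the storage class values that S3 lifecycle transitions use for these two classes.

```python
def build_archive_rule(prefix: str, glacier_after_days: int,
                       deep_archive_after_days: int) -> dict:
    """Build one lifecycle rule that archives objects in two steps."""
    # The second transition has to happen later than the first one.
    if not 0 <= glacier_after_days < deep_archive_after_days:
        raise ValueError(
            "Deep Archive transition must come after the Glacier transition"
        )
    return {
        "ID": f"archive-{prefix.strip('/') or 'all'}",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
            {"Days": deep_archive_after_days, "StorageClass": "DEEP_ARCHIVE"},
        ],
    }


rule = build_archive_rule("reports/", glacier_after_days=90,
                          deep_archive_after_days=365)
print(rule["Transitions"])
```

The resulting rule would go inside the "Rules" list of a bucket's lifecycle configuration. Validating the day ordering up front is a design choice of this sketch, so that an inconsistent rule fails locally before it is ever sent to S3.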

Remember that data retention and archiving policies are not one-size-fits-all. The correct approach depends on the specifics of the data and the business needs of the organization. As a prospective AWS Certified Data Engineer, it’s critical that you understand these concepts and how they can be implemented on AWS.

Practice Test

True or False: AWS provides a service named Glacier which is used for long-term data archiving.

Answer: True

Explanation: Amazon S3 Glacier is a secure, durable, and extremely low-cost storage service for data archiving and long-term backup.

Which of the following is NOT a common reason companies set data retention policies?

a) To reduce storage costs
b) To comply with legal requirements
c) To improve system performance
d) To accumulate more data

Answer: d) To accumulate more data

Explanation: Instead of accumulating more data, data retention policies often include deleting data after a set period to reduce storage costs, comply with legal requirements, and improve system performance.

Which AWS service enables you to automate the lifecycle of your S3 objects?

a) AWS Lifecycle Manager
b) AWS S3 Lifecycle
c) AWS S3 Lifecycle Configuration
d) AWS Lifecycle Configuration

Answer: c) AWS S3 Lifecycle Configuration

Explanation: An S3 Lifecycle configuration automates moving objects between storage classes, for example archiving them to the S3 Glacier storage class, and can also expire objects at the end of their life.

True or False: In AWS, it is generally more cost-effective to store frequently accessed data in S3 Glacier.

Answer: False

Explanation: S3 Glacier is designed for data that is not frequently accessed, and retrieving data can be more costly and time-consuming. S3 Standard is better for data that needs to be accessed frequently.

Which AWS storage class is designed for long-term storage of rarely accessed data?

a) S3 Standard
b) S3 Intelligent-Tiering
c) S3 Glacier
d) S3 One Zone-IA

Answer: c) S3 Glacier

Explanation: Amazon S3 Glacier is designed for long-term storage of rarely accessed data and provides comprehensive security and compliance capabilities.

True or False: Data retention policies should be the same across all types of data in an organization.

Answer: False

Explanation: Data retention policies may vary depending on the type of data, its use, and regulatory requirements.

In the context of AWS S3, what does the archiving principle “Cohort” mean?

a) Group of users
b) Group of objects
c) Group of services
d) Group of databases

Answer: b) Group of objects

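Explanation: Here, a cohort is a group of objects that are managed together as a unit, for example all objects under a prefix or with a shared tag that one lifecycle rule targets. The term is informal usage rather than the name of an Amazon S3 feature.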

Which AWS service offers a policy-based system for the automatic migration of objects to other storage classes?

a) Amazon Athena
b) AWS Data Pipeline
c) S3 Lifecycle policies
d) AWS Elastic Beanstalk

Answer: c) S3 Lifecycle policies

Explanation: AWS S3 Lifecycle policies allow for the automatic migration of objects to other, more cost-effective storage classes based on defined rules.

True or False: The Amazon Glacier storage class provides expedited retrievals that can return data in 1-5 minutes.

Answer: True

Explanation: Although Amazon Glacier is designed for infrequently accessed data, it provides three retrieval options (expedited, standard, and bulk). Expedited retrievals typically return data in 1-5 minutes.

What is the primary purpose of a data archiving strategy in AWS?

a) To delete data after a certain duration.
b) To back up data for immediate retrieval.
c) To move data that is not needed on an immediate basis to low-cost storage layers.
d) To replicate data across multiple regions.

Answer: c) To move data that is not needed on an immediate basis to low-cost storage layers.

Explanation: The main aim of data archiving is to move infrequently accessed data to a low-cost storage class such as Amazon S3 Glacier. This reduces storage costs while still preserving the data for regulatory, compliance, or other business needs.

Interview Questions

Question 1: What is AWS’s data retention policy?

Answer 1: AWS does not impose a single retention policy. Instead, customers define how long their data is retained in each AWS service, and the mechanisms vary by service. For instance, Amazon S3 lifecycle policies can be set to automatically move data to cheaper storage classes or delete it at the end of its life cycle.

Question 2: What is the definition of data archiving in the AWS context?

Answer 2: Data archiving in AWS involves moving data that is infrequently accessed and unlikely to be modified to low-cost storage classes such as S3 Glacier or S3 Glacier Deep Archive for long-term storage.

Question 3: What are the benefits of using Amazon S3 Glacier for data archiving?

Answer 3: Amazon S3 Glacier provides a secure, durable, and extremely low-cost storage service for data archiving and long-term backup. It is designed to deliver 99.999999999% durability and is a cost-effective solution for storing data for months, years, or even decades.

Question 4: How does AWS help comply with data retention regulations?

Answer 4: AWS provides the tools needed to manage data retention in accordance with regulations, including lifecycle policies, versioning, and encryption, which help ensure data integrity, confidentiality, and availability.

Question 5: What is an Amazon S3 Lifecycle policy?

Answer 5: Amazon S3 lifecycle policies let organizations manage objects so that they are automatically transitioned to different storage classes or expired (deleted) at end-of-life.

Question 6: Can data be restored from AWS Glacier?

Answer 6: Yes, data archived to S3 Glacier can be restored. However, the speed and cost of restoration depend on the retrieval tier you choose: expedited, standard, or bulk.
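
As a rough illustration of choosing a retrieval tier, the sketch below builds the request body for restoring an archived object. The helper name and the 7-day value are hypothetical; the dictionary follows the shape that boto3's restore_object call accepts as its RestoreRequest argument.

```python
VALID_TIERS = {"Expedited", "Standard", "Bulk"}


def build_restore_request(days_available: int, tier: str = "Standard") -> dict:
    """Build a restore request for an object archived in S3 Glacier."""
    if tier not in VALID_TIERS:
        raise ValueError(f"tier must be one of {sorted(VALID_TIERS)}")
    if days_available < 1:
        raise ValueError("days_available must be at least 1")
    return {
        # How long the temporary restored copy stays available.
        "Days": days_available,
        # Faster tiers cost more per retrieval.
        "GlacierJobParameters": {"Tier": tier},
    }


request = build_restore_request(7, tier="Expedited")
print(request)
```

With boto3, the request could be passed as boto3.client("s3").restore_object(Bucket="my-bucket", Key="archive/report.csv", RestoreRequest=request), where the bucket and key are placeholders. Note that the expedited tier is not available for objects in S3 Glacier Deep Archive.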

Question 7: What is data immutability in Amazon S3?

Answer 7: Data immutability in Amazon S3 ensures that once the data is stored it can’t be changed or deleted during a defined retention period, which plays a key role in ensuring the integrity and reliability of the data.
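
In S3, this kind of retention period is typically set through the S3 Object Lock feature, which the answer above does not name explicitly. The sketch below builds a default retention rule under that assumption; the helper name and the 365-day value are hypothetical, and the dictionary follows the shape that boto3's put_object_lock_configuration call accepts as its ObjectLockConfiguration argument.

```python
VALID_MODES = {"GOVERNANCE", "COMPLIANCE"}


def build_object_lock_configuration(mode: str, retention_days: int) -> dict:
    """Build a default-retention rule for a bucket with Object Lock enabled."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    return {
        "ObjectLockEnabled": "Enabled",
        # New object versions are protected for this many days by default.
        "Rule": {"DefaultRetention": {"Mode": mode, "Days": retention_days}},
    }


lock_configuration = build_object_lock_configuration("COMPLIANCE", 365)
print(lock_configuration)
```

In compliance mode a protected object version cannot be overwritten or deleted by any user until the retention period ends, while governance mode lets users with special permissions override the lock.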

Question 8: How can data encryption be managed in AWS services?

Answer 8: AWS offers several encryption features and services to provide data security. For instance, data at rest in S3 can be encrypted on the server side using Amazon S3 managed keys (SSE-S3), AWS Key Management Service keys (SSE-KMS), or customer-provided keys (SSE-C). A bucket’s default encryption setting applies SSE-S3 or SSE-KMS automatically to new objects.
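
A default encryption setting for a bucket can be sketched as follows. The helper name and the KMS key alias are hypothetical; the dictionary follows the shape that boto3's put_bucket_encryption call accepts as its ServerSideEncryptionConfiguration argument.

```python
from typing import Optional


def build_default_encryption(kms_key_id: Optional[str] = None) -> dict:
    """Build a default-encryption rule: SSE-KMS if a key is given, else SSE-S3."""
    if kms_key_id is None:
        # "AES256" selects S3 managed keys (SSE-S3).
        default = {"SSEAlgorithm": "AES256"}
    else:
        # "aws:kms" selects AWS KMS keys (SSE-KMS).
        default = {"SSEAlgorithm": "aws:kms", "KMSMasterKeyID": kms_key_id}
    return {"Rules": [{"ApplyServerSideEncryptionByDefault": default}]}


sse_s3 = build_default_encryption()
sse_kms = build_default_encryption("alias/my-archive-key")  # hypothetical alias
print(sse_s3)
print(sse_kms)
```

With boto3, either dictionary could be applied with boto3.client("s3").put_bucket_encryption(Bucket="my-bucket", ServerSideEncryptionConfiguration=sse_kms), where my-bucket is a placeholder name.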

Question 9: What is AWS Backup?

Answer 9: AWS Backup is a fully managed backup service that makes it easy to centralize and automate backing up of data across AWS services according to policies defined by the organization.
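
A backup policy of this kind can be sketched as a backup plan document. The plan name, vault name, schedule, and 35-day retention below are hypothetical values; the dictionary follows the shape that boto3's create_backup_plan call accepts as its BackupPlan argument.

```python
def build_backup_plan(name: str, vault_name: str, retention_days: int,
                      schedule: str = "cron(0 5 ? * * *)") -> dict:
    """Build a single-rule backup plan with a schedule and a retention period."""
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    return {
        "BackupPlanName": name,
        "Rules": [
            {
                "RuleName": f"{name}-daily",
                "TargetBackupVaultName": vault_name,
                # How often backups run (default here: daily at 05:00 UTC).
                "ScheduleExpression": schedule,
                # How long each recovery point is kept before deletion.
                "Lifecycle": {"DeleteAfterDays": retention_days},
            }
        ],
    }


plan = build_backup_plan("orders-db", "primary-vault", retention_days=35)
print(plan)
```

With boto3, the plan could be created with boto3.client("backup").create_backup_plan(BackupPlan=plan). Resources are then assigned to the plan separately through a backup selection.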

Question 10: What role does AWS Glacier play in AWS data lifecycle management?

Answer 10: S3 Glacier plays a pivotal role in data lifecycle management because it offers a very low-cost, highly durable, and secure archival tier for long-term data retention, which supports efficient storage management and cost reduction.

Question 11: How can you ensure data recovery in case of accidental deletion in AWS?

Answer 11: AWS offers versioning in S3 buckets, which lets you store and retrieve every version of an object, including versions that were overwritten or deleted. Additionally, enabling MFA (multi-factor authentication) Delete adds an extra layer of protection against accidental or unauthorized deletion.
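
The recovery behavior can be illustrated with a toy in-memory model of one key in a versioned bucket. This is not the S3 API; the class and method names are hypothetical and only mimic how a plain delete adds a delete marker instead of erasing earlier versions.

```python
from typing import Optional


class VersionedObject:
    """Toy in-memory model of one key in a versioned bucket (not the S3 API)."""

    DELETE_MARKER = object()

    def __init__(self) -> None:
        self.versions: list = []  # oldest first

    def put(self, body: bytes) -> None:
        # Every write adds a new version; older versions are kept.
        self.versions.append(body)

    def delete(self) -> None:
        # A plain delete adds a marker instead of erasing earlier versions.
        self.versions.append(self.DELETE_MARKER)

    def get(self) -> Optional[bytes]:
        if not self.versions or self.versions[-1] is self.DELETE_MARKER:
            return None  # the object appears deleted
        return self.versions[-1]

    def undelete(self) -> None:
        # Removing the newest delete marker makes the prior version current.
        if self.versions and self.versions[-1] is self.DELETE_MARKER:
            self.versions.pop()


obj = VersionedObject()
obj.put(b"v1")
obj.put(b"v2")
obj.delete()
print(obj.get())  # None: the key looks deleted
obj.undelete()
print(obj.get())  # b'v2': the latest version is current again
```

On a real bucket, versioning is switched on through the bucket's versioning configuration, for example boto3's put_bucket_versioning with a VersioningConfiguration of {"Status": "Enabled"}.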

Question 12: Can you modify an AWS Glacier Vault Lock policy after it is locked?

Answer 12: No, you can’t modify an AWS Glacier Vault Lock policy after it’s locked. A Vault Lock policy enforces a WORM (write once, read many) model, which provides strong enforcement for regulatory compliance.

Question 13: When should you use AWS Storage Gateway service?

Answer 13: AWS Storage Gateway is used when you need to connect on-premises environments with AWS storage, such as when you want to shift tape backup infrastructure to the cloud, move volumes to the cloud, or give on-premises applications access to cloud storage.

Question 14: What is a snapshot in AWS?

Answer 14: A snapshot is a point-in-time copy of your data. AWS supports snapshots for EBS volumes, RDS DB instances, and Redshift clusters.

Question 15: What is cross-region replication in the context of AWS S3?

Answer 15: Cross-region replication (CRR) is a feature in Amazon S3 that automatically and asynchronously replicates objects to a bucket in a different AWS Region. This helps customers meet compliance requirements and reduce latency by keeping copies closer to the users who request them.