AWS offers a variety of storage services, including Amazon S3, Amazon EBS, Amazon EFS, Amazon FSx, and others. Each service has features and benefits tailored to particular workloads and use cases.
Amazon S3 (Simple Storage Service)
Amazon S3 is an object storage service that offers industry-leading scalability, data availability, and security. It is designed to store and retrieve any amount of data from anywhere on the web.
- Scalability: S3 stores a virtually unlimited amount of data and number of objects, and it can be reached from anywhere on the web.
- Durability: The S3 Standard storage class is designed for 99.999999999% (11 9’s) durability and 99.99% availability of objects over a given year.
- Security: S3 supports data encryption at rest and in transit, as well as various security policies and IAM roles.
- Performance: S3 performance can be optimized with features such as Transfer Acceleration and multipart upload.
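To make the durability and availability figures above concrete, the following sketch turns them into expected values. It is plain arithmetic on the percentages quoted in the list, not a statement about any particular bucket; the function names are made up for this illustration.

```python
from decimal import Decimal

# Design targets quoted in the list above.
DURABILITY = Decimal("0.99999999999")   # 11 nines, per object per year
AVAILABILITY = Decimal("0.9999")        # 99.99% over a given year


def expected_objects_lost_per_year(object_count: int) -> Decimal:
    """Average number of objects lost per year at the design durability."""
    return Decimal(object_count) * (Decimal(1) - DURABILITY)


def unavailable_minutes_per_year() -> Decimal:
    """Minutes in a 365-day year that fall outside the availability target."""
    minutes_per_year = Decimal(365 * 24 * 60)
    return minutes_per_year * (Decimal(1) - AVAILABILITY)


if __name__ == "__main__":
    print(expected_objects_lost_per_year(10_000_000))
    print(unavailable_minutes_per_year())
```

With 10 million stored objects, the expected loss works out to 0.0001 objects per year, or roughly one object every 10,000 years, while a 99.99% availability target leaves room for a little under an hour of unavailability per year.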
Amazon Elastic Block Store (EBS)
EBS provides block-level storage volumes for use with Amazon EC2 instances. It is suited for applications that require a file system or a database.
- Performance: EBS volumes come in several types, each optimized for different use cases: Provisioned IOPS SSD (io1 and io2) for high-performance workloads, General Purpose SSD (gp3) for everyday workloads, Throughput Optimized HDD (st1) for large sequential workloads, and Cold HDD (sc1) for infrequently accessed data.
- Durability: EBS volumes are replicated automatically within their Availability Zone. Snapshots provide incremental, point-in-time backups of a volume that are stored in Amazon S3 and can be copied to other regions for backup and disaster recovery.
Amazon Elastic File System (EFS)
EFS is a scalable file storage service for use with Amazon EC2 instances. It allows concurrent connections from multiple EC2 instances, which makes it a good fit for applications that need shared access to files.
- Performance: EFS offers two performance modes – General Purpose (default) and Max I/O. The former is recommended for most applications, while the latter is for scale-out applications that require a higher amount of I/O.
- Durability: In a Regional file system, all files and directories are stored redundantly across multiple Availability Zones (AZs) for resilience. One Zone file systems keep data within a single AZ.
Amazon FSx
Amazon FSx provides fully managed third-party file systems. FSx for Windows File Server lets you move Windows-based applications that require file storage to AWS. FSx for Lustre is designed for high-performance applications such as machine learning, high-performance computing (HPC), and video processing.
- Performance: FSx offers SSD-based storage that delivers high IOPS and throughput at low latency.
- Durability: FSx for Windows File Server replicates data automatically within an Availability Zone and offers a Multi-AZ deployment option that maintains a standby file server in a second AZ. FSx for Lustre file systems reside in a single AZ, so durable copies typically rely on a linked S3 bucket or on backups.
Understanding these services and their performance characteristics is fundamental for the AWS Certified Data Engineer – Associate exam, so hands-on experience configuring and managing them is strongly encouraged.
Performance-driven Configurations
Each AWS storage service can be adjusted and configured to meet specific performance requirements. Below are some performance-driven configurations for specific AWS storage services.
Amazon S3
To achieve high performance with Amazon S3:
- Use multipart upload for large files so that parts can be uploaded in parallel and retried independently.
- Use CloudFront to cache frequently accessed objects close to the viewers.
- Enable S3 Transfer Acceleration to speed up the transfer of files over long distances.
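As an illustration of the multipart upload advice above, here is a minimal sketch that splits an object into byte ranges that respect two S3 limits: every part except the last must be at least 5 MiB, and an upload may have at most 10,000 parts. The function name and the 64 MiB default part size are assumptions chosen for this example; the sketch only plans the ranges and does not call S3.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB   # S3 minimum for every part except the last
MAX_PARTS = 10_000        # S3 maximum number of parts per upload


def plan_parts(object_size: int, preferred_part_size: int = 64 * MIB):
    """Return a list of (offset, length) byte ranges, one per part."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    part_size = max(preferred_part_size, MIN_PART_SIZE)
    # Grow the part size if the object would otherwise need too many parts.
    smallest_allowed = -(-object_size // MAX_PARTS)  # integer ceiling division
    part_size = max(part_size, smallest_allowed)

    parts = []
    offset = 0
    while offset < object_size:
        length = min(part_size, object_size - offset)
        parts.append((offset, length))
        offset += length
    return parts


if __name__ == "__main__":
    for offset, length in plan_parts(200 * MIB):
        print(offset, length)
```

In practice, high-level SDK helpers such as boto3's upload_file handle this splitting automatically once a file exceeds a configurable size threshold, so a hand-written planner like this is mainly useful for understanding the limits.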
Amazon EBS
For higher performance with EBS:
- Choose the right volume type for your needs, for example Provisioned IOPS SSD (io1 or io2) for latency-sensitive transactional workloads.
- Increase the IOPS provisioned for a volume, or add more volumes and distribute the I/O loads.
- Use EBS-optimized instances to ensure dedicated capacity for EBS I/O.
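The volume-type choice described above can be sketched as a small lookup that produces the arguments for boto3's ec2.create_volume call. The workload labels and the helper name are hypothetical; the mapping simply restates the volume types listed in this article, and the sketch builds the argument dictionary without contacting AWS.

```python
# Workload labels are hypothetical names chosen for this sketch.
VOLUME_TYPE_BY_WORKLOAD = {
    "transactional": "io2",  # latency-sensitive, Provisioned IOPS SSD
    "general": "gp3",        # General Purpose SSD
    "throughput": "st1",     # Throughput Optimized HDD
    "cold": "sc1",           # Cold HDD
}

IOPS_CONFIGURABLE = {"io1", "io2", "gp3"}  # types that accept an Iops value
IOPS_REQUIRED = {"io1", "io2"}             # types that must be given one


def create_volume_params(workload, size_gib, availability_zone, iops=None):
    """Build keyword arguments for ec2.create_volume (the call is not made here)."""
    if workload not in VOLUME_TYPE_BY_WORKLOAD:
        raise ValueError(f"unknown workload: {workload!r}")
    volume_type = VOLUME_TYPE_BY_WORKLOAD[workload]

    params = {
        "AvailabilityZone": availability_zone,
        "Size": size_gib,
        "VolumeType": volume_type,
    }
    if iops is None:
        if volume_type in IOPS_REQUIRED:
            raise ValueError(f"{volume_type} volumes need an explicit iops value")
    else:
        if volume_type not in IOPS_CONFIGURABLE:
            raise ValueError(f"{volume_type} volumes do not take provisioned IOPS")
        params["Iops"] = iops
    return params


if __name__ == "__main__":
    print(create_volume_params("transactional", 500, "us-east-1a", iops=16000))
```

With boto3 installed and credentials configured, the result could be passed on as boto3.client("ec2").create_volume(**params).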
Amazon EFS
Optimize EFS performance by:
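- Choosing the right EFS performance mode (General Purpose or Max I/O) and throughput mode (Bursting, Provisioned, or the newer Elastic mode) based on your workload.
- Mounting the file system with the Amazon EFS mount helper (part of the amazon-efs-utils package), which applies the recommended mount options and can enable encryption in transit.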
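The two EFS recommendations above can be sketched as follows. The first helper builds arguments for boto3's efs.create_file_system call using the performance and throughput modes named in this article; the second assembles the mount helper command line. Both function names, the file system ID, and the mount point are placeholders invented for this example, and nothing here contacts AWS or mounts anything.

```python
PERFORMANCE_MODES = {"generalPurpose", "maxIO"}
THROUGHPUT_MODES = {"bursting", "provisioned"}


def efs_create_params(performance_mode="generalPurpose",
                      throughput_mode="bursting",
                      provisioned_mibps=None):
    """Build keyword arguments for efs.create_file_system (not called here)."""
    if performance_mode not in PERFORMANCE_MODES:
        raise ValueError(f"unknown performance mode: {performance_mode!r}")
    if throughput_mode not in THROUGHPUT_MODES:
        raise ValueError(f"unknown throughput mode: {throughput_mode!r}")

    params = {
        "PerformanceMode": performance_mode,
        "ThroughputMode": throughput_mode,
    }
    if throughput_mode == "provisioned":
        if provisioned_mibps is None:
            raise ValueError("provisioned mode needs provisioned_mibps")
        params["ProvisionedThroughputInMibps"] = float(provisioned_mibps)
    elif provisioned_mibps is not None:
        raise ValueError("provisioned_mibps only applies to provisioned mode")
    return params


def mount_helper_command(file_system_id, mount_point, tls=True):
    """Return the argv for mounting via the EFS mount helper (run it as root)."""
    command = ["mount", "-t", "efs"]
    if tls:
        command += ["-o", "tls"]  # encrypt traffic in transit
    command += [f"{file_system_id}:/", mount_point]
    return command


if __name__ == "__main__":
    print(efs_create_params("maxIO", "provisioned", 128))
    # fs-12345678 is a placeholder ID, not a real file system.
    print(" ".join(mount_helper_command("fs-12345678", "/mnt/efs")))
```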
Amazon FSx
To increase FSx performance:
- Choose either SSD or HDD storage, based on your workload.
- Specify the throughput capacity for the file system to ensure consistent performance.
Knowing how to configure AWS storage services for specific performance demands is a substantial advantage when preparing for the AWS Certified Data Engineer – Associate (DEA-C01) exam. Hands-on experience combined with theoretical knowledge provides a solid foundation to build on. Practicing with and exploring each service is vital to doing well in the exam and in your AWS journey.
Practice Test
True/False: Amazon S3 is a block storage service suitable for EC2 instances.
- False
Answer: False
Explanation: Amazon S3 is an object storage service. Amazon EBS is the block storage service designed for use with EC2 instances.
True/False: Amazon Redshift is optimized for online transaction processing (OLTP).
- False
Answer: False
Explanation: Amazon Redshift is optimized for online analytical processing (OLAP), not for online transaction processing (OLTP).
Single Select: Which of the following is a durable, block-level storage device?
- a) Amazon EC2
- b) AWS Lambda
- c) Amazon EBS
- d) Amazon S3
Answer: c) Amazon EBS
Explanation: Amazon EBS is a high-performance block storage service designed for use with Amazon EC2 for both throughput-intensive and transaction-intensive workloads.
Single Select: What is the main benefit of using Amazon Glacier for data storage?
- a) Real-time data access
- b) Low cost data archiving
- c) High-performance computing
- d) Object-level storage
Answer: b) Low cost data archiving
Explanation: Amazon Glacier (now offered as the Amazon S3 Glacier storage classes) is a secure, durable, and low-cost storage option for data archiving and long-term backup.
True/False: Throughput Optimized HDD (st1) EBS volumes are the best choice for boot volumes.
- False
Answer: False
Explanation: HDD-backed volumes (st1 and sc1) cannot be used as boot volumes. For boot volumes, General Purpose SSD (gp2 or gp3) or Provisioned IOPS SSD (io1 or io2) are the preferred EBS volume types.
Single Select: Which Amazon service is best suited for Big Data workloads and analytics?
- a) Amazon Redshift
- b) Amazon RDS
- c) Amazon DynamoDB
- d) Amazon S3
Answer: a) Amazon Redshift
Explanation: Amazon Redshift is a fast, scalable data warehouse that makes it simple and cost-effective to analyze all your data across your data warehouse and data lake.
Multiple Select: What are the benefits of using Amazon S3 for data storage?
- a) Scalability
- b) Durability
- c) Object-level storage
- d) Real-time data access
Answer: a) Scalability, b) Durability, c) Object-level storage
Explanation: Amazon S3 provides scalable, durable, object-level storage. It is accessed over an API rather than mounted as a disk, so it does not offer the low-latency, real-time access that block storage provides.
True/False: SSD storage is always faster than HDD storage on AWS.
- False
Answer: False
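Explanation: SSD storage does not win on every metric. For large, sequential I/O, a Throughput Optimized HDD (st1) volume can sustain up to 500 MiB/s, which is more than the 250 MiB/s maximum of a General Purpose SSD (gp2) volume. SSD volumes remain far ahead for small, random I/O, where IOPS and latency matter most.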
Single Select: When would you choose Amazon EFS for your application storage?
- a) When you need a file system that can be shared across multiple EC2 instances.
- b) When you need to store relational databases.
- c) When you need to run an operating system.
- d) When you require object-level storage.
Answer: a) When you need a file system that can be shared across multiple EC2 instances.
Explanation: Amazon Elastic File System (EFS) is a simple, scalable, fully managed elastic NFS file system for use with AWS Cloud services and on-premises resources.
Multiple Select: Amazon DynamoDB is suitable for which workloads?
- a) High Scale
- b) High Velocity
- c) Low latency
- d) Strategic Analysis
Answer: a) High Scale, b) High Velocity, c) Low latency
Explanation: Amazon DynamoDB is a key-value and document database that delivers single-digit millisecond performance. It suits high-scale, high-velocity, low-latency workloads. It is not designed for strategic analysis, which involves complex queries and usually calls for a data warehouse such as Redshift.
Interview Questions
What is schema evolution in the context of data engineering?
Schema evolution refers to the ability to modify a database schema in a manner that does not disrupt the existing data and its associated applications.
What is the main advantage of schema evolutions?
The main advantage of schema evolutions is that they allow developers to modify databases over time in response to changing requirements, without requiring significant downtime or disrupting applications that rely on the database.
Mention one common scenario where schema evolution is required.
A common scenario is extending a table with new columns, for example when an upstream source starts sending an additional field.
In AWS Glue, how is schema evolution handled?
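In AWS Glue, a crawler's schema change policy controls this. Choosing the ‘Update the table definition in the data catalog’ option lets the crawler apply changes it detects in the source, such as new or modified columns, to the catalog table on its next run. Glue ETL jobs can also be configured to update the catalog as they write data.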
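As a rough illustration of the Glue answer above, a crawler's behavior on schema changes is set through the SchemaChangePolicy argument of the Glue CreateCrawler and UpdateCrawler APIs. The helper below only builds that argument; its name and parameters are assumptions for this sketch, and mapping the console option to UPDATE_IN_DATABASE reflects my reading of the API rather than anything stated in this article.

```python
DELETE_BEHAVIORS = {"LOG", "DELETE_FROM_DATABASE", "DEPRECATE_IN_DATABASE"}


def schema_change_policy(update_catalog=True, on_delete="DEPRECATE_IN_DATABASE"):
    """Build the SchemaChangePolicy dict for a Glue crawler (no API call is made)."""
    if on_delete not in DELETE_BEHAVIORS:
        raise ValueError(f"unknown delete behavior: {on_delete!r}")
    return {
        # UPDATE_IN_DATABASE applies detected changes to the catalog table;
        # LOG only records them and leaves the table definition untouched.
        "UpdateBehavior": "UPDATE_IN_DATABASE" if update_catalog else "LOG",
        "DeleteBehavior": on_delete,
    }


if __name__ == "__main__":
    print(schema_change_policy())
```

The resulting dictionary would be passed as SchemaChangePolicy=... to boto3's glue.create_crawler or glue.update_crawler alongside the crawler's name, role, and targets.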
While working with DynamoDB on AWS, how is schema evolution achieved?
With DynamoDB, schema evolution is simple because only the key attributes of a table are defined up front. Every other attribute is schema-less, so you can add or remove attributes on individual items without altering the table definition.
What is the primary challenge in managing schema evolution?
The primary challenge in schema evolution is maintaining the validity and integrity of existing data when altering a database schema.
What is backward compatibility in schema evolution?
Backward compatibility in schema evolution means that a consumer using the new version of the schema can still read and validate data that was written with previous schema versions.
What is Avro’s approach to managing schema evolution?
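Avro, a popular data serialization system, deals with schema evolution by storing the schema used to write the data (the writer's schema) alongside the data itself. A reader can then use a different version of that schema, and Avro resolves the differences between the two, for example by filling in default values for fields the writer did not know about.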
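The idea behind Avro's approach can be sketched in a few lines of plain Python. This is a deliberately simplified illustration of resolving a record against a newer reader schema, not the Avro library and not its full resolution rules (type promotion and aliases are ignored); the record, field names, and function name are invented for the example.

```python
def resolve(record, reader_fields):
    """Project a record written with an older schema onto the reader's fields."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]       # field known to the writer
        elif "default" in field:
            resolved[name] = field["default"]   # new field: use reader default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Fields that only the writer knew about are dropped.
    return resolved


# Record written with a hypothetical version 1 schema.
v1_record = {"id": 7, "name": "Ada", "legacy_flag": True}

# Hypothetical version 2 reader schema: drops legacy_flag, adds email.
reader_v2 = [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": None},
]

if __name__ == "__main__":
    print(resolve(v1_record, reader_v2))
```

Because the new email field carries a default, old records remain readable. Adding a field without a default would make the lookup fail for old data, which is why defaults matter for backward compatibility.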
What is ‘schema on read’ and how does it aid in schema evolution?
‘Schema on read’ is a strategy that applies a schema only when the data is read. This allows more flexibility, because the data can be stored in raw form without defining the schema upfront, and the schema can change later without rewriting what is already stored.
What is ‘schema on write’ and how does it differ from ‘schema on read’?
‘Schema on write’ is a strategy where the schema is enforced when the data is written into the database. While this can ensure consistent data, it offers less flexibility for schema evolution compared to ‘schema on read’.
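The contrast between the two strategies can be sketched in plain Python. This is a toy illustration under stated assumptions: the schema, field names, and both helper functions are invented for the example, and a real system would rely on a database or a query engine instead.

```python
import json

# Hypothetical schema: field name -> Python type used to validate or cast.
SCHEMA = {"id": int, "amount": float}


def write_validated(table, row):
    """Schema on write: reject rows that do not match SCHEMA before storing."""
    if set(row) != set(SCHEMA) or not all(
        isinstance(row[name], kind) for name, kind in SCHEMA.items()
    ):
        raise ValueError(f"row does not match schema: {row!r}")
    table.append(row)


def read_with_schema(raw_lines, schema):
    """Schema on read: raw JSON lines stay untouched; the schema is applied now."""
    rows = []
    for line in raw_lines:
        doc = json.loads(line)
        rows.append({
            name: cast(doc[name]) if name in doc else None
            for name, cast in schema.items()
        })
    return rows


# Raw data stored as-is, including a field the first schema ignores.
raw = ['{"id": "1", "amount": "9.5", "note": "x"}', '{"id": 2}']

if __name__ == "__main__":
    print(read_with_schema(raw, SCHEMA))
    # Evolving the schema needs no rewrite of the stored data.
    print(read_with_schema(raw, {**SCHEMA, "note": str}))
```

The trade-off mirrors the answers above: the write-time check keeps stored data consistent but forces a migration when the schema changes, whereas the read-time approach accepts anything and pushes interpretation, along with any surprises, to query time.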