When working with data, understanding its structure is critical, particularly if you are preparing for the AWS Certified Data Engineer - Associate (DEA-C01) exam. One vital skill is the ability to model structured, semi-structured, and unstructured data. This post covers the methods and best practices for modeling these three types of data.
Structured Data
Structured data refers to information with a high degree of organization, readily searchable by simple, straightforward search engine algorithms or other search operations. It conforms to a certain predefined model or is organized in a defined manner. This format lends itself to being very easily entered, stored, queried, and analyzed.
Structured data mostly comes from relational databases (RDBMS) and spreadsheets, in tabular form with rows and columns. Examples include employee information, customer details, or transaction data.
Working with structured data in AWS typically involves storing it in services such as Amazon RDS or Amazon Redshift and querying it with SQL.
Semi-structured Data
Semi-structured data is a form of structured data that does not adhere to the formal structure of data models associated with relational databases or other forms of data tables but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
Semi-structured data commonly comes in formats such as JSON (JavaScript Object Notation), XML (eXtensible Markup Language), and HTML (HyperText Markup Language).
Working with this data in AWS involves using services like Amazon DynamoDB for NoSQL storage and Amazon Athena for querying data held in Amazon S3.
Unstructured Data
Unstructured data includes any data that doesn’t fit in a structured, tabular format, such as images, free text, social media posts, and so on. It’s often estimated that around 80% of the world’s data is unstructured. This data holds a lot of business value but is harder to analyze because it lacks structure.
In AWS, a key service for dealing with this form of data is Amazon S3 (Simple Storage Service), an object storage service that offers industry-leading scalability, data availability, security, and performance.
Now that we understand what structured, semi-structured, and unstructured data are, let’s delve into how you can model them.
Data Modeling for Structured, Semi-structured, and Unstructured Data
Structured data modeling: The tool most commonly used to model structured data is an ERD (Entity Relationship Diagram). Normalization techniques are often applied to eliminate redundant data. In AWS, SQL queries and AWS Glue can be used to manipulate and transform structured data.
For instance, if you have an employee table stored in Amazon RDS, you might describe the entity and its attributes like this:
EMPLOYEE (EMPLOYEE_ID, FIRST_NAME, LAST_NAME, DEPARTMENT_ID)
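As a rough illustration of how that entity might be normalized, the sketch below uses Python's built-in sqlite3 module as a local stand-in for a relational database on Amazon RDS. The DEPARTMENT table, its columns, and the sample rows are assumptions made for the example; the post itself only names EMPLOYEE and its four attributes.

```python
import sqlite3

# In-memory SQLite stands in for an Amazon RDS database in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# DEPARTMENT is a hypothetical second entity; the post only names EMPLOYEE.
conn.executescript("""
CREATE TABLE department (
    department_id   INTEGER PRIMARY KEY,
    department_name TEXT NOT NULL
);
CREATE TABLE employee (
    employee_id   INTEGER PRIMARY KEY,
    first_name    TEXT NOT NULL,
    last_name     TEXT NOT NULL,
    department_id INTEGER NOT NULL REFERENCES department(department_id)
);
""")

# Made-up sample rows, for illustration only.
conn.execute("INSERT INTO department VALUES (10, 'Engineering')")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(1, "Ada", "Lovelace", 10), (2, "Alan", "Turing", 10)],
)

# The department name is stored once and joined in at query time.
rows = conn.execute("""
    SELECT e.first_name, e.last_name, d.department_name
    FROM employee AS e
    JOIN department AS d ON d.department_id = e.department_id
    ORDER BY e.employee_id
""").fetchall()
print(rows)
```

Keeping the department name in its own table is the kind of redundancy removal that normalization aims for: renaming a department touches one row rather than every employee record.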
Semi-structured data modeling: Semi-structured data uses tags or other markers to organize what would otherwise be unstructured data. Modeling it doesn’t require a rigid schema as structured data does; instead it relies on structures such as trees and graphs, which help data analysts find patterns and insights in heterogeneous data. JSON and XML files are good examples.
In AWS, tools like the DynamoDB DocumentClient in the AWS SDK for JavaScript let you work with semi-structured items as native objects.
For example, a JSON document with customer reviews might look like this:
{
  "customerReviews": {
    "reviewId": "1",
    "productId": "123",
    "reviewBody": "Great product.",
    "reviewDate": "2021-09-01"
  }
}
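To make the tree idea concrete, here is a minimal Python sketch that parses the review document above and flattens its nested keys into dotted paths. The `flatten` helper is a hypothetical function written for this example, not part of any AWS SDK.

```python
import json

# The review document from the example above.
document = """
{
  "customerReviews": {
    "reviewId": "1",
    "productId": "123",
    "reviewBody": "Great product.",
    "reviewDate": "2021-09-01"
  }
}
"""

def flatten(node, prefix=""):
    """Walk nested objects and return {dotted.path: leaf value}."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into child objects, extending the path.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

row = flatten(json.loads(document))
for path, value in row.items():
    print(path, "=", value)
```

Flattening like this is one simple way to turn a hierarchical record into something row-shaped; records with different keys would simply produce different sets of paths.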
Unstructured data modeling: Modeling unstructured data focuses more on how the data is stored and retrieved than on the relationships within the data. Common techniques for extracting structure from it include text analytics, NLP (Natural Language Processing), and machine learning algorithms.
In AWS, you might store unstructured data in an S3 bucket, use Amazon Comprehend for text analytics, and query the extracted, structured results with Amazon Athena.
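Because storage and retrieval drive the design here, much of the modeling effort often goes into the object key layout. The sketch below shows one possible date-partitioned naming scheme in Python. The `raw/` prefix, the dataset name, and both helper functions are assumptions for illustration; Hive-style `key=value` prefixes are a common convention for data on S3 but are not required by it.

```python
from datetime import date

def object_key(dataset: str, created: date, object_id: str, extension: str) -> str:
    """Build a date-partitioned key so objects can be listed by prefix."""
    return (
        f"raw/{dataset}/"
        f"year={created:%Y}/month={created:%m}/day={created:%d}/"
        f"{object_id}.{extension}"
    )

def month_prefix(dataset: str, year: int, month: int) -> str:
    """Prefix that selects every object written for one dataset in one month."""
    return f"raw/{dataset}/year={year:04d}/month={month:02d}/"

# Hypothetical object: a free-text review saved as a .txt file.
key = object_key("reviews", date(2021, 9, 1), "review-1", "txt")
print(key)
print(key.startswith(month_prefix("reviews", 2021, 9)))
```

With a layout like this, retrieving a month of reviews becomes a prefix listing instead of a scan over every object in the bucket.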
Conclusion
Understanding how to model data is an important skill for any data engineer, and especially those preparing for the AWS Certified Data Engineer – Associate (DEA-C01) exam. Different types of data require different strategies and tools, and AWS provides a suite of services that can handle structured, semi-structured, and unstructured data. By understanding the differences between these types of data and how to approach them, you can take great strides in your path towards becoming a top-tier data engineer.
Practice Test
True/False: All types of data, whether structured, semi-structured or unstructured, can be stored in the same way in AWS.
- True
- False
Answer: False
Explanation: Each type of data requires a different approach for storage. Structured data is stored in RDBMS or Redshift, semi-structured in NoSQL databases like DynamoDB, and unstructured data in S3 buckets.
In AWS, how would you primarily store unstructured data?
- a. Amazon RDS
- b. Amazon Redshift
- c. Amazon S3
- d. Amazon DynamoDB
Answer: c. Amazon S3
Explanation: Amazon S3 (Simple Storage Service) is an object storage service that is primarily used to store unstructured data in the cloud.
What type of data schema is used in a semi-structured data model?
- a. Fixed schema
- b. Relational schema
- c. Dynamic schema
- d. No schema is needed
Answer: c. Dynamic schema
Explanation: Semi-structured data carries its structure in the data itself (for example, as tags or keys), so fields can vary from record to record. This is often described as a dynamic, or flexible, schema.
True/False: Amazon Redshift is best suited for handling unstructured data.
- True
- False
Answer: False
Explanation: Amazon Redshift is a data warehouse service built around relational, SQL-queryable tables, so it is best suited for structured data, not unstructured data.
Which AWS service is ideal for modeling semi-structured data?
- a. Amazon RDS
- b. AWS Glue
- c. Amazon DynamoDB
- d. Amazon S3
Answer: c. Amazon DynamoDB
Explanation: DynamoDB is a key-value and document database that delivers single-digit millisecond performance at any scale, ideal for handling semi-structured data.
True/False: Structured data requires a significant amount of preprocessing before it can be stored in AWS.
- True
- False
Answer: False
Explanation: Structured data follows a strict schema, so it often requires less preprocessing than semi-structured or unstructured data.
All of the following are examples of unstructured data, EXCEPT:
- a. Social media posts
- b. Text files
- c. Video files
- d. Database table
Answer: d. Database table
Explanation: A database table is an example of structured data as it has a defined schema.
In the context of AWS, what is the best match for structured data?
- a. Amazon Aurora
- b. AWS Lambda
- c. Amazon EMR
- d. AWS Elastic Beanstalk
Answer: a. Amazon Aurora
Explanation: Amazon Aurora is a relational database service that is perfect for structured data.
True/False: Semi-structured data combines aspects of both structured and unstructured data.
- True
- False
Answer: True
Explanation: Semi-structured data, such as XML or JSON, includes both raw data (like unstructured data) and a certain level of organizational formatting (like structured data).
Semi-structured data:
- a. Does not have a pre-defined schema
- b. Has a very rigid schema
- c. Can only be stored in a relational database
- d. Cannot be stored in AWS
Answer: a. Does not have a pre-defined schema
Explanation: Semi-structured data is flexible and doesn’t require a pre-defined schema to be saved or manipulated.
True/False: Unstructured data cannot be queried effectively.
- True
- False
Answer: False
Explanation: Although more challenging, unstructured data can be queried effectively once tools and services designed for it have extracted some structure. For example, Amazon Athena can then query the resulting files in S3.
AWS Glue is used primarily for:
- a. Storing structured data
- b. ETL (extract, transform, load) operations
- c. Querying unstructured data
- d. Real-time data streaming
Answer: b. ETL (extract, transform, load) operations
Explanation: AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data, whether structured or semi-structured, for analysis.
Can Amazon DynamoDB handle unstructured data?
- a. Yes
- b. No
Answer: a. Yes
Explanation: Although DynamoDB is primarily used for semi-structured data, it can also store some unstructured data, such as small text or binary values, as long as each item stays within its 400 KB item size limit.
True/False: A Data Engineer can use Amazon QuickSight to visualize unstructured data.
- True
- False
Answer: True
Explanation: Amazon QuickSight can create visualizations from various data sources. Unstructured data can be visualized once it has been processed and transformed into a structured format.
The process of ‘data modeling’ refers to:
- a. Only how to set up AWS services
- b. The act of organizing data into a database
- c. The process of turning data into information
- d. Configuring network and security settings
Answer: b. The act of organizing data into a database
Explanation: Data modeling is the process of creating a data model for the data to be stored in a database. This data model is a conceptual representation of data objects, the associations between different data objects, and the rules that govern them.
Interview Questions
What is structured data?
Structured data is highly organized and follows a predefined format, which makes it easy to search and query. Examples include data in relational databases and spreadsheets.
How is semi-structured data different from structured data?
Semi-structured data has some organization, but it does not conform to the formal structure of data models associated with relational databases or other forms of data tables. Instead, it contains tags or other markers to separate data elements and enforce hierarchies of records and fields within the data.
What is unstructured data?
Unstructured data is information that isn’t arranged according to a predefined model or schema, and therefore can’t be stored in a traditional relational database directly. Examples include text files, images, videos, etc.
How does AWS handle structured data?
AWS provides various services for handling structured data, including Amazon RDS (Relational Database Service) for MySQL, PostgreSQL, Oracle, and other relational databases, and Amazon Redshift for data warehousing. Amazon DynamoDB, a NoSQL key-value and document database, is the better fit when the data is semi-structured.
How can Amazon S3 be used to store semi-structured and unstructured data?
Amazon S3 is an object storage service that is ideal for storing semi-structured and unstructured data. You can use it to store large amounts of data in its native format without a defined schema.
Which AWS service can be used to analyze unstructured data?
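AWS Glue can be used to prepare unstructured data for analysis. It is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics. For extracting insights directly from text, images, or video, services such as Amazon Comprehend and Amazon Rekognition are a better fit.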
What is the importance of data modeling in AWS?
Data modeling is crucial as it helps to highlight the necessary information from the data and also ensures the data is accurate, consistent, and reliable. This aids in designing and managing applications effectively in AWS.
How is data modeling for structured data different from that for unstructured data?
Structured data modeling involves defining a schema for the data before storing it, while unstructured data does not need a predefined schema. Instead, the structure in unstructured data is often discovered at processing time.
What is AWS Lake Formation?
AWS Lake Formation is a service that makes it easy to set up, secure, and manage your data lake. It can catalog your data, clean it, enforce security policies, and transform your data into a format ready for analysis.
Can Amazon Athena query unstructured data stored in S3?
No, Amazon Athena is designed to query structured or semi-structured data in Amazon S3 using standard SQL. Unstructured data would first need to be processed into a structured or semi-structured format.
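As a sketch of that processing step, the Python below turns free-text lines into JSON Lines (one JSON object per line), a layout that SQL engines over S3 can generally read. The input lines, the line format, and the field names are hypothetical; a real pipeline would read and write objects in S3 rather than in-memory strings.

```python
import json
import re

# Hypothetical free-text lines; real input would be read from objects in S3.
raw_lines = [
    "2021-09-01 product=123 Great product.",
    "2021-09-02 product=456 Stopped working after a week.",
    "this line does not match the pattern",
]

# Assumed layout: ISO date, then product=<digits>, then the review text.
LINE_PATTERN = re.compile(
    r"^(?P<reviewDate>\d{4}-\d{2}-\d{2}) "
    r"product=(?P<productId>\d+) "
    r"(?P<reviewBody>.+)$"
)

def to_records(lines):
    """Yield one dict per line that matches; skip lines that do not."""
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match:
            yield match.groupdict()

# One JSON object per line, ready to be written out as a .json file.
json_lines = "\n".join(json.dumps(record) for record in to_records(raw_lines))
print(json_lines)
```

Lines that do not match are dropped silently here to keep the sketch short; in practice you would likely route them somewhere for inspection.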
How does AWS handle real-time streaming data?
AWS provides the Kinesis family of services for handling real-time streaming data. Amazon Kinesis Data Streams can capture, store, and process streaming data, whilst Amazon Data Firehose (formerly Kinesis Data Firehose) can prepare and load the data into AWS data stores for analysis.
Which AWS service is best suited for NoSQL databases and why?
Amazon DynamoDB is best suited for NoSQL databases as it provides fast and predictable performance with seamless scalability. It’s designed for applications that need consistent, single-digit millisecond latency at any scale.
What is Amazon Redshift, and how does it handle structured data?
Amazon Redshift is a fully managed, petabyte-scale data warehousing service. It uses columnar storage, data compression, and zone maps to reduce the amount of IO needed to perform queries, making it an efficient solution for analyzing structured data.
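To see why zone maps reduce IO, consider the toy Python model below. It is a conceptual sketch only and does not reflect Redshift's actual storage format or block size: each block of a column records its minimum and maximum value, and a scan skips any block whose range cannot contain the value being searched for. The `Block` class, the helper functions, and the sample column are all made up for the example.

```python
from dataclasses import dataclass

@dataclass
class Block:
    values: list      # column values stored in this block
    min_value: int    # zone map entry: smallest value in the block
    max_value: int    # zone map entry: largest value in the block

def build_blocks(column, block_size):
    """Split a column into fixed-size blocks and record each block's min/max."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks

def scan_equal(blocks, target):
    """Return (matches, blocks_read), skipping blocks whose range excludes target."""
    matches, blocks_read = [], 0
    for block in blocks:
        if target < block.min_value or target > block.max_value:
            continue  # the zone map proves this block holds no match
        blocks_read += 1
        matches.extend(v for v in block.values if v == target)
    return matches, blocks_read

# Hypothetical sorted column of order years.
order_years = [2019, 2019, 2020, 2020, 2021, 2021, 2022, 2022]
blocks = build_blocks(order_years, block_size=2)
matches, blocks_read = scan_equal(blocks, 2021)
print(f"read {blocks_read} of {len(blocks)} blocks, found {len(matches)} rows")
```

The pruning works well here because the column is sorted, so each block covers a narrow range. On unsorted data the ranges overlap and fewer blocks can be skipped.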
How does AWS Glue handle schema discovery?
AWS Glue can automatically infer the schema of your data as a crawler scans your data source. It then stores this metadata in the AWS Glue Data Catalog, making it available for ETL jobs and queries.
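The general idea of schema discovery can be illustrated with a few lines of Python. This is a simplified sketch of the concept, not how a Glue crawler is implemented: it reads JSON records, collects every field it encounters, and notes which value types appear for each. The sample records and the `rating` and `verified` fields are hypothetical.

```python
import json

# Hypothetical JSON Lines input; note that the two records have different fields.
records = [
    '{"reviewId": "1", "productId": "123", "rating": 5}',
    '{"reviewId": "2", "productId": "456", "verified": true}',
]

def infer_schema(json_lines):
    """Return {field: sorted list of Python type names seen for that field}."""
    seen = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            seen.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}

schema = infer_schema(records)
for field, types in schema.items():
    print(field, types)
```

The result is the union of fields across all records, which is why a discovered schema for semi-structured data can contain columns that are empty for many rows.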
How can you process and model unstructured data using AWS?
For unstructured data, you can use Amazon Comprehend to uncover insights from text, or Amazon Rekognition for image and video analysis. Once the data is processed, it can be stored in Amazon S3 for further analysis.