These file formats store and present data in ways that support various uses. For anyone preparing for the DP-900 Microsoft Azure Data Fundamentals exam, it is crucial to understand the fundamental features and use-cases for these common formats.
1. CSV (Comma-Separated Values)
CSV stands for Comma-Separated Values. It is a simple file format used to save tabular data, such as a spreadsheet or database. These files are often used for data imports or exports in various applications, including Microsoft Azure.
In a CSV file, each line is a data record that consists of one or more fields, separated by commas. Each line is a separate row, and each comma-separated value represents a column.
FirstName, LastName, Title
John, Doe, Manager
Jane, Smith, Engineer
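As a rough illustration, data like the sample above could be read with Python's standard csv module. This is a minimal sketch: the sample is kept inline as a string instead of assuming a file on disk, and the helper name read_employees is made up for this example.

```python
import csv
import io

# Sample data mirroring the example above, kept inline so the sketch is self-contained.
CSV_TEXT = """FirstName, LastName, Title
John, Doe, Manager
Jane, Smith, Engineer
"""

def read_employees(text):
    """Parse CSV text into a list of dicts keyed by the header row (hypothetical helper)."""
    # skipinitialspace=True drops the space that follows each comma in this sample;
    # without it, the space would become part of the field value.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return list(reader)

employees = read_employees(CSV_TEXT)
for row in employees:
    print(row["FirstName"], row["LastName"], "-", row["Title"])
```

Note that every value comes back as a string: CSV carries no type information, so numbers and dates have to be converted by the reader.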
2. JSON (JavaScript Object Notation)
JSON is a lightweight data-interchange format that is easy to read and write. It represents data as key-value pairs (objects) and ordered lists (arrays), which can be nested. It is widely used in web applications to transfer data. In Azure, JSON is often used when calling REST APIs or working with NoSQL databases such as Azure Cosmos DB.
A simple JSON file could look like this:
{
  "FirstName": "John",
  "LastName": "Doe",
  "Title": "Manager"
}
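A document like this can be loaded with Python's standard json module, as the short sketch below suggests. JSON requires straight double quotes around keys and string values. The Skills field is a hypothetical addition, included only to show that values can be nested.

```python
import json

# The same record as above, written with the straight double quotes JSON requires.
JSON_TEXT = '{"FirstName": "John", "LastName": "Doe", "Title": "Manager"}'

employee = json.loads(JSON_TEXT)         # JSON object -> Python dict
employee["Skills"] = ["Azure", "SQL"]    # hypothetical nested field (a JSON array)

# Serialize back to text; indent=2 only affects readability, not the data.
round_tripped = json.dumps(employee, indent=2)
print(round_tripped)
```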
3. XML (eXtensible Markup Language)
XML is designed to store and transport data. It is both human-readable and machine-readable. XML structures are built from custom tags (<>) that the author defines, which provides a versatile, self-descriptive way of representing complex hierarchical and nested data structures.
A simple XML example could be:
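For instance, the employee record from the CSV and JSON examples might be written as XML and read with Python's standard library. This is only a sketch: the element names (Employee, FirstName, LastName, Title) are assumptions chosen to mirror the earlier examples.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML version of the record used in the CSV and JSON examples.
XML_TEXT = """<Employee>
  <FirstName>John</FirstName>
  <LastName>Doe</LastName>
  <Title>Manager</Title>
</Employee>"""

root = ET.fromstring(XML_TEXT)

# Each child element becomes one field: the tag is the name, the text is the value.
record = {child.tag: child.text for child in root}
print(root.tag, record)
```

Every value is wrapped in an opening and a closing tag, which is what makes XML self-descriptive and also more verbose than the equivalent JSON.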
4. Parquet
Parquet is a column-oriented binary file format that originated in the Apache Hadoop ecosystem. Unlike CSV, JSON, and XML, which are text formats that store each record in full, Parquet stores values column by column. It is designed for efficient querying of large datasets, particularly in big data analytics with tools like Apache Spark and Apache Hive.
5. Avro
Avro is a row-based binary data format. Like Parquet, it is mainly used in the Hadoop ecosystem for big data processing. Avro's main strength is robust schema evolution: fields can be added, removed, or modified over time without rewriting existing data.
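The idea behind schema evolution can be sketched in plain Python. This is not the Avro library or its API. It is a simplified illustration of how a reader with a newer schema might interpret a record written with an older one: fields the reader does not know are ignored, and fields missing from the data fall back to a declared default. The resolve function and the Department field are hypothetical.

```python
def resolve(record, reader_schema):
    """Project an old record onto a newer reader schema (simplified illustration)."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]        # field exists in the old data
        elif "default" in field:
            resolved[name] = field["default"]    # new field: fall back to its default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved

# A record written before the schema changed.
old_record = {"FirstName": "John", "LastName": "Doe", "Title": "Manager"}

# A newer schema that drops Title and adds Department with a default value.
reader_schema = {
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "FirstName", "type": "string"},
        {"name": "LastName", "type": "string"},
        {"name": "Department", "type": "string", "default": "Unknown"},
    ],
}

new_view = resolve(old_record, reader_schema)
print(new_view)
```

The stored data is never modified; only the reader's view of it changes, which is why old files stay usable after the schema evolves.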
Both Parquet and Avro offer forms of data compression and schema evolution, but their row-based versus columnar-based approaches make them better suited to different types of data access needs.
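The difference between the two layouts can be illustrated with ordinary Python data structures. The sketch below does not read or write real Avro or Parquet files. It only shows why a column layout suits analytical queries that touch a single field. The Salary field and the to_columns helper are assumptions made for the example.

```python
# Row-oriented layout: each record is stored together (the Avro-style view).
rows = [
    {"FirstName": "John", "LastName": "Doe", "Salary": 90000},
    {"FirstName": "Jane", "LastName": "Smith", "Salary": 105000},
]

def to_columns(records):
    """Pivot a list of records into one list of values per field (hypothetical helper)."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns

# Column-oriented layout: all values of one field are stored together (the Parquet-style view).
columns = to_columns(rows)

# An aggregate over one field only needs that field's list, not every whole record.
average_salary = sum(columns["Salary"]) / len(columns["Salary"])
print(average_salary)
```

Appending one new record is simpler in the row layout, while scanning a single field across many records is cheaper in the column layout.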
Depending on the use-case, different data formats have their strengths and weaknesses. CSV and JSON are excellent for lightweight data interchange and human readability, XML provides great hierarchical structure and self-description, Avro is ideal for schema evolution, while Parquet is commonplace for highly performant big data operations. Understanding these characteristics is vital for choosing the right format for the right job in Microsoft Azure.
Practice Test
True or False: CSV is a file format that works by separating values with a delimiter, usually a comma.
- True
- False
Answer: True
Explanation: CSV stands for comma-separated values. This file format stores tabular data (numbers and text) in plain-text form.
Which of the following are common formats for storing data? (Select all that apply)
- a) CSV
- b) JSON
- c) JPG
- d) XML
Answer: a) CSV, b) JSON and d) XML
Explanation: CSV, JSON, and XML are all common formats for data files. JPG is an image file format.
Which format is defined as a self-describing, hierarchical data format?
- a) JSON
- b) CSV
- c) XML
- d) TXT
Answer: a) JSON
Explanation: JSON stands for JavaScript Object Notation. It is a self-describing, hierarchical data format built from nested objects and arrays. XML shares these traits, but JSON is the answer expected here.
True or False: JSON files typically end with the extension .jpg.
- True
- False
Answer: False
Explanation: JSON files typically end with the extension .json, not .jpg.
Which data format is also known as a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable?
- a) Parquet
- b) XML
- c) CSV
- d) JSON
Answer: b) XML
Explanation: XML stands for eXtensible Markup Language, a markup language that encodes documents in a format that is both human-readable and machine-readable.
Which format originated as a way to serialize programming-language data structures?
- a) JSON
- b) XML
- c) CSV
- d) None of the above
Answer: a) JSON
Explanation: JSON, or JavaScript Object Notation, is a format for serializing data structures from programming languages.
Which format is considered a simple file format that stores multiple records in a plaintext file with each field separated by a delimiter?
- a) Parquet
- b) XML
- c) CSV
- d) Avro
Answer: c) CSV
Explanation: CSV, or Comma Separated Values, is a simple file format that uses a comma or other delimiter to separate different values.
True or False: XML files typically end with the .xml extension.
- True
- False
Answer: True
Explanation: XML files do indeed typically end with the .xml extension.
Which of the following data formats are binary? (Select all that apply)
- a) CSV
- b) XML
- c) Parquet
- d) Avro
Answer: c) Parquet, d) Avro
Explanation: Parquet and Avro are binary formats that allow for efficient and fast serialization and deserialization.
Which data format allows for nesting data structures in fields?
- a) JSON
- b) XML
- c) CSV
- d) BSON
Answer: a) JSON
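Explanation: In JSON, or JavaScript Object Notation, a value can itself be an object or an array, so data structures can be nested within fields. XML and BSON can also represent nested data, while CSV cannot.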
True or False: All data formats are compatible with all databases.
- True
- False
Answer: False
Explanation: Not all data formats are compatible with all databases. The format must match the database’s capabilities.
Which of the following formats can be queried using SQL?
- a) XML
- b) CSV
- c) Parquet
- d) None of the above
Answer: c) Parquet
Explanation: Parquet is a columnar storage file format from the Apache Hadoop ecosystem, and it is designed to work well with SQL-like query engines such as Apache Hive and Spark SQL.
The Avro format is related to which of the following?
- a) Microsoft Azure
- b) Apache Hadoop
- c) Google Cloud Platform
- d) Amazon AWS
Answer: b) Apache Hadoop
Explanation: Avro is a row-based storage format for Hadoop that optimizes for complex, nested data structures.
What is a key advantage of using XML format?
- a) It is human-readable.
- b) It does not require a schema.
- c) It supports binary data.
- d) It is the most modern format.
Answer: a) It is human-readable.
Explanation: XML format is both human-readable and machine-readable, making it easier to interpret than many binary data formats.
True or False: Parquet and Avro formats are typically used in big data processing.
- True
- False
Answer: True
Explanation: Parquet and Avro are both popular choices for big data processing due to their efficiency in serialization and deserialization.
Interview Questions
What is a CSV file in terms of data files?
CSV, which stands for Comma-Separated Values, is a simple file format used to store tabular data, such as a spreadsheet or database. Each line of the file is a data record and each record consists of one or more fields, separated by commas.
How is JSON used in data files?
JSON, or JavaScript Object Notation, is used to store and exchange data. It organizes data into key-value pairs and arrays. It’s often used when data is sent from a server to a web page. It is easy for humans to read and write and easy for machines to parse and generate.
What does XML stand for and how is it used in data files?
XML, or Extensible Markup Language, is used to define documents with a standard format that can be read by any XML tool. It is both human-readable and machine-readable. It’s often used for storage and transportation of data.
What is a Parquet file in the realm of data formats?
Apache Parquet is a columnar storage file format available to any project in the Hadoop ecosystem. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
What’s the importance of the Avro data file format in Azure?
Avro is a row-based format that is highly useful for big data applications. It supports schema evolution: you can change the schema after data has been written. It also provides rich data structures, a compact binary encoding, fast processing, and bindings for many programming languages, making it a suitable choice for data serialization in Azure and similar platforms.
Can you explain the use of ORC files in data management?
ORC or Optimized Row Columnar is a columnar storage file format highly valuable for performing high-speed queries. It allows efficient storage and optimal read performance by allowing operations such as projection and predicate pushdown.
How is the PDF format used in data management?
PDF, or Portable Document Format, is used to present documents electronically in a way that is independent of the software, hardware, or operating system they are viewed on. However, it’s not a data storage or transfer format like CSV, JSON, or XML.
What is a flat file?
A flat file is a file whose records have no structured relationships to one another. Each line in a flat file is typically one record.
How is ASCII used in data formats?
ASCII (American Standard Code for Information Interchange) is a method of encoding characters based on the English alphabet. In data files, ASCII text can be used to ensure that the data can be easily shared between different software and systems.
What is the function of a .tar file format in Azure?
A .tar file is a Unix archive that combines multiple files into one file without compressing them. The term “tar” stands for tape archive. This file format is generally used on Linux and Unix systems. In Azure, tar files can be used for the storage and distribution of multiple files as a single archive.
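As a small sketch of what bundling means, Python's standard tarfile module can build and read an archive entirely in memory. The file names and contents below are invented for the example, and nothing here is specific to Azure.

```python
import io
import tarfile

# Hypothetical files to bundle, held in memory as bytes.
files = {
    "employees.csv": b"FirstName,LastName\nJohn,Doe\n",
    "notes.txt": b"hello\n",
}

# Write both files into a single uncompressed tar archive.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as archive:
    for name, data in files.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(data)                    # tar headers record each member's size
        archive.addfile(info, io.BytesIO(data))

# Read the archive back and list its members.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as archive:
    names = archive.getnames()
    restored = archive.extractfile("employees.csv").read()

print(names)
```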
How does Azure handle Excel proprietary formats for data files?
Azure allows you to import Excel files (.xlsx) to use in Power BI, Azure Machine Learning and other services. Proprietary Excel features like formulas or macros may not be fully supported.
When it comes to data files, what does TSV stand for and how is it used?
TSV stands for Tab-Separated Values. It’s a file format for storing tabular data that uses tabs to separate values. It’s commonly used for exporting and importing data in a spreadsheet format.
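Reading TSV takes the same approach as reading CSV with a different delimiter. The sketch below uses Python's standard csv module with delimiter="\t", and the sample rows are the ones from the CSV example earlier in this article.

```python
import csv
import io

# Tab-separated version of the sample records, kept inline for the sketch.
TSV_TEXT = "FirstName\tLastName\tTitle\nJohn\tDoe\tManager\nJane\tSmith\tEngineer\n"

# The csv module handles TSV when told to split on tabs instead of commas.
rows = list(csv.reader(io.StringIO(TSV_TEXT), delimiter="\t"))
header, records = rows[0], rows[1:]

print(header)
print(records)
```

Because values rarely contain tab characters, TSV often avoids the quoting that CSV needs for fields containing commas.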
What is the role of the YAML file format in data management?
YAML, which stands for “YAML Ain’t Markup Language”, is a human-readable data serialization standard. In Azure, YAML files are commonly used for creating and configuring Kubernetes resources.
How does the .gz file format function in data file management in Azure?
A .gz file is a file compressed with gzip (GNU zip), which uses the DEFLATE algorithm. The gzip header can store the original file name and timestamp, and the format allows for efficient storage and transfer of data in Azure and elsewhere.
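The effect of gzip compression can be seen with Python's standard gzip module. The sketch below compresses a deliberately repetitive payload in memory. The payload is made up for the example, and real-world compression ratios depend heavily on the data.

```python
import gzip

# A repetitive, made-up payload; repetitive text compresses especially well.
original = b"FirstName,LastName,Title\n" + b"John,Doe,Manager\n" * 1000

compressed = gzip.compress(original)     # bytes in gzip format
restored = gzip.decompress(compressed)   # lossless: identical to the input

print(len(original), "->", len(compressed), "bytes")
```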
How is BSON used in Azure?
BSON, or Binary JSON, is a binary representation of JSON-like documents. It’s used in Azure Cosmos DB, a globally distributed, multi-model database service, when using the MongoDB API. Azure Cosmos DB represents internally stored documents in BSON.