As part of preparing for the DP-100 Designing and Implementing a Data Science Solution on Azure examination, one key skill you need is the ability to access and wrangle data during interactive development. This involves acquiring data from a variety of sources, both inside and outside Azure, and then transforming that data into a shape suitable for your data science and machine learning models.
Accessing and wrangling data requires a clear understanding of the Azure data services, which is what we explore in this article.
I. Accessing Data in Azure
Azure provides several services for data storage from which you can access your data.
- Azure Blob Storage: This service stores unstructured data as blobs (binary large objects). It’s suitable for data that doesn’t need to fit a relational database model. You can access data in Blob Storage using either the Azure Blob Storage SDK for Python or the Azure Storage REST API.
- Azure Data Lake Storage: This is a petabyte-scale repository suitable for big data analytics. It can store relational data, semi-structured data (such as CSV and JSON), unstructured data (such as images), and binary data; a read example follows the Blob Storage snippet below.
- Azure SQL Database: This fully managed relational database offers compatibility with the SQL Server engine.
For instance, to pull data from Azure Blob Storage, you can use the following code:
from azure.storage.blob import BlobClient

# Point the client at a specific blob in a specific container
blob = BlobClient(account_url="https://your-storage-account.blob.core.windows.net",
                  container_name="your-container",
                  blob_name="your-blob",
                  credential="your-account-access-key")

# Download the blob contents as bytes
data = blob.download_blob().readall()
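Data in Azure Data Lake Storage Gen2 can be read in a similar way. The snippet below is a minimal sketch using the azure-storage-file-datalake package; the account, file system, and file names are placeholders you would replace with your own.
from azure.storage.filedatalake import DataLakeServiceClient

# Connect to the Data Lake Storage Gen2 account (placeholder names)
service_client = DataLakeServiceClient(
    account_url="https://your-storage-account.dfs.core.windows.net",
    credential="your-account-access-key")

# Navigate to a file system (container) and a file within it
file_system_client = service_client.get_file_system_client("your-filesystem")
file_client = file_system_client.get_file_client("folder/data.csv")

# Download the file contents as bytes
data = file_client.download_file().readall()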
II. Data Wrangling in Azure
Data wrangling in Azure primarily involves two steps: cleansing the data and transforming the data.
- Cleansing the data: Data procured from various sources is not always clean; it can contain inaccurate or incomplete records. Azure provides tools such as Azure Machine Learning Studio and Azure Data Catalog for defining, applying, and sharing data cleaning rules.
- Transforming the data: Once you have clean data, you might still need to transform it into the format required for your analytical process. Azure Data Factory is one such tool offered by Azure for ETL (Extract, Transform, Load) operations.
Azure also offers Azure Databricks, an Apache Spark-based analytics platform designed for big data processing and machine learning tasks. It provides clusters of virtual machines for distributed data processing, where you can use Spark’s rich DataFrame API to manipulate data in Azure.
An example of data wrangling using Azure Databricks is shown below:
import pyspark.sql.functions as f

# Load data from Blob Storage into a Spark DataFrame
df = spark.read.load('wasbs://container@account.blob.core.windows.net/data.csv',
                     format='csv', header=True, inferSchema=True)

# Drop any rows with missing data
df = df.dropna()

# Create a Year column from the Date column
df = df.withColumn('Year', f.year('Date'))

# Create a categorical 'Decade' column derived from Year
df = df.withColumn('Decade', (f.floor(f.col('Year') / 10) * 10).cast('int'))
Ultimately, the ability to access and wrangle data efficiently during interactive development is an indispensable skill for every data scientist. In preparation for the DP-100 Designing and Implementing a Data Science Solution on Azure examination, you should understand how to fetch data from the various Azure storage services and how to clean and transform that data for your models.
Practice Test
True / False: In Azure, it is not possible to access and wrangle data during interactive development.
- True
- False
Answer: False
Explanation: Azure provides various tools and services, like Azure Databricks and Data Factory, to handle data during interactive development.
Multiple Choice: Which tool allows you to write SQL, Python, R, Scala, and Markdown in a collaborative notebook?
- A) Azure Kinect
- B) Azure AI Gallery
- C) Azure Machine Learning Studio
- D) Azure Databricks
Answer: D) Azure Databricks
Explanation: Azure Databricks is an Apache Spark-based analytics platform that allows collaborative notebooks in various languages.
True / False: Data wrangling in Azure involves cleaning and transforming raw data into a more consumable format.
- True
- False
Answer: True
Explanation: Data wrangling, regardless of platform, is the process of cleaning, structuring and enriching raw data into a desired format for better decision making.
Multiple Choice: If you need to filter out null values in a dataset when wrangling data, which language can you use?
- A) Python
- B) SQL
- C) R
- D) All of the above
Answer: D) All of the above
Explanation: All the mentioned programming languages have functions to filter out null values in a dataset.
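For example, in a Spark-based environment such as Azure Databricks you can express the same null filter with either the Python DataFrame API or SQL; the sketch below is illustrative and assumes an existing DataFrame df with an 'Age' column.
import pyspark.sql.functions as f

# Python (DataFrame API): keep only rows where 'Age' is not null
filtered_df = df.filter(f.col('Age').isNotNull())

# SQL: the same filter expressed as a Spark SQL query
df.createOrReplaceTempView('people')
filtered_sql_df = spark.sql('SELECT * FROM people WHERE Age IS NOT NULL')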
True / False: You cannot access data stored in an Azure SQL database through Python.
- True
- False
Answer: False
Explanation: Azure provides SDKs and APIs for different languages, including Python, that can be used to interact with its services like Azure SQL Database.
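As a minimal sketch, assuming an ODBC driver for SQL Server is installed and using placeholder server, database, and credential values, you could query an Azure SQL Database from Python like this:
import pyodbc
import pandas as pd

# Connection string built from placeholder values
conn = pyodbc.connect(
    'DRIVER={ODBC Driver 18 for SQL Server};'
    'SERVER=your-server.database.windows.net;'
    'DATABASE=your-database;'
    'UID=your-username;PWD=your-password')

# Read the query results straight into a pandas DataFrame
df = pd.read_sql('SELECT TOP 10 * FROM your_table', conn)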
Multiple Selection: What are some of the tools available on Azure for data wrangling?
- A) Azure Machine Learning Studio
- B) Azure Kinect
- C) Azure Databricks
- D) Azure Stream Analytics
Answer: A) Azure Machine Learning Studio and C) Azure Databricks
Explanation: These tools provide an environment for data wrangling and interactive development.
True / False: Azure Databricks does not support Scala for data wrangling.
- True
- False
Answer: False
Explanation: Azure Databricks provides an interactive workspace where you can use Scala, along with Python, SQL, and R for data wrangling.
Single Select: Which is not a part of the data wrangling process?
- A) Data Cleaning
- B) Data Transformation
- C) Data Programming
- D) Data Enrichment
Answer: C) Data Programming
Explanation: While programming skills are needed for data wrangling, ‘Data Programming’ itself is not a recognized part of the data wrangling process.
Multiple Choice: Which among the following machine learning stages cannot be done on Azure?
- A) Data Wrangling
- B) Building Models
- C) Training models
- D) None of the above
Answer: D) None of the above
Explanation: All these stages can be performed on Azure using its different data science tools.
True / False: Azure Data Factory cannot connect to on-premises data sources.
- True
- False
Answer: False
Explanation: Azure Data Factory can connect to both cloud and on-premises data sources; on-premises connectivity is provided through a self-hosted integration runtime.
Interview Questions
What is Data Wrangling in Azure?
Data wrangling in Azure refers to the process of transforming raw data into a more useful format that can be easily analyzed. This involves cleaning, structuring and enriching raw data for improved decision making in data analytics.
Explain the function of Azure Data Factory in Data wrangling.
Azure Data Factory is a cloud-based data integration service that can orchestrate and automate the movement and transformation of data. It enables you to access, wrangle, and transform data from various sources into a usable state for analytics purposes.
What are the common techniques used in data wrangling?
The common techniques include: parsing, filtering, transforming, merging, cleaning, aggregating, and validating data.
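As an illustration (not tied to any particular Azure service), the pandas sketch below touches several of these techniques on made-up data: parsing dates, cleaning and filtering rows, merging tables, and aggregating.
import pandas as pd

# Made-up sales and region data for illustration
sales = pd.DataFrame({'date': ['2023-01-05', '2023-02-10', None],
                      'region_id': [1, 2, 1],
                      'amount': [100.0, 250.0, 75.0]})
regions = pd.DataFrame({'region_id': [1, 2], 'region': ['East', 'West']})

sales['date'] = pd.to_datetime(sales['date'])        # parsing
sales = sales.dropna(subset=['date'])                # cleaning / filtering
merged = sales.merge(regions, on='region_id')        # merging
summary = merged.groupby('region')['amount'].sum()   # aggregating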
What is Azure Data Lake in terms of data access and wrangling?
Azure Data Lake is a highly scalable public cloud service that allows developers, scientists, business professionals, and other Microsoft customers to gain insight from large, complex data sets. It allows access and wrangling of data on a massive scale with high processing speed.
How does Azure Machine Learning support interactive data wrangling?
Azure Machine Learning includes a data-preparation step in the pipeline where data can be wrangled interactively. It supports data profiling, data cleaning, and data transformation tasks and visualizes the data wrangling process step by step.
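For example, in an interactive notebook you can pull a registered tabular dataset into pandas with the Azure Machine Learning Python SDK; this sketch uses SDK v1 (azureml-core) and assumes a tabular dataset registered under the hypothetical name 'my-dataset'.
from azureml.core import Workspace, Dataset

# Connect to the workspace using the local config.json
ws = Workspace.from_config()

# Retrieve a registered tabular dataset and load it into pandas for wrangling
dataset = Dataset.get_by_name(ws, name='my-dataset')
df = dataset.to_pandas_dataframe()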
What role does Azure Databricks play in data wrangling?
Azure Databricks is an Apache Spark-based analytics platform that integrates well with Azure services, providing a platform for cleaning, transforming, and wrangling data. It provides collaborative notebooks, integration with several data sources and built-in data visualization tools.
Explain the purpose of Azure Synapse Analytics in data wrangling.
Azure Synapse Analytics is an analytics service that brings together enterprise data warehousing and Big Data analytics. It allows users to explore, clean, and transform data using the familiar SQL language, enabling easy data wrangling.
What is the role of Azure Purview in data access?
Azure Purview is a unified data governance service that helps organizations manage and govern their on-premises, multi-cloud, and software-as-a-service data. It assists in understanding data access privileges and managing access controls.
Can you schedule automatic data refresh in Azure?
Yes. Automatic or scheduled data refresh is a feature in Azure where data is automatically updated at regular intervals, ensuring that the data used in analytics stays up to date.
How can you secure data access in Azure?
Azure provides several measures to secure data access, which include Role-Based Access Control (RBAC), Azure Active Directory for identity management, Access Control Lists (ACLs), and firewall rules.
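In code, for example, you can avoid embedding account keys by authenticating with Azure Active Directory through the azure-identity package; the sketch below assumes the signed-in identity has an appropriate RBAC role (such as Storage Blob Data Reader) on a placeholder storage account.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Authenticate with Azure AD instead of an account access key
credential = DefaultAzureCredential()
service_client = BlobServiceClient(
    account_url="https://your-storage-account.blob.core.windows.net",
    credential=credential)

# List blobs in a container the identity has been granted access to
container_client = service_client.get_container_client("your-container")
for blob in container_client.list_blobs():
    print(blob.name)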
What is Interactive Query in Azure?
Interactive Query in Azure is an HDInsight cluster type (based on Apache Hive LLAP) that provides a schema-on-read query engine, allowing users to analyze data in Azure by executing simple SQL-based queries.
Can you use Power BI with Azure for data wrangling?
Yes, Power BI can connect to Azure services such as Azure SQL Database, Azure Data Lake, and Azure Databricks for data wrangling tasks. It provides a user-friendly interface and powerful tools for data transformation and cleaning.
What is Azure Data Explorer?
Azure Data Explorer is a fast, fully managed data analytics service for real-time analysis of large volumes of data. It is designed for quick data exploration and helps in interactive data wrangling.
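As an illustrative sketch (cluster, database, and table names are placeholders), a Kusto Query Language (KQL) query can be run from Python with the azure-kusto-data package:
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from azure.kusto.data.helpers import dataframe_from_result_table

# Authenticate interactively via device code (placeholder cluster URL)
kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(
    "https://your-cluster.kusto.windows.net")
client = KustoClient(kcsb)

# Run a KQL query and convert the results to a pandas DataFrame
response = client.execute("your-database", "YourTable | take 10")
df = dataframe_from_result_table(response.primary_results[0])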
What data types does Azure Machine Learning support for data wrangling?
Azure Machine Learning supports various data types including numerical, categorical, text, date/time and boolean for data wrangling.
How can you handle missing values in Azure Machine Learning?
Missing values in Azure Machine Learning can be handled by using either the “Clean Missing Data” module, which replaces missing values with certain values, or by excluding those rows or columns that have missing values.
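In code-first workflows the same two strategies apply; the pandas sketch below uses a small made-up DataFrame to show imputing missing values and dropping rows that contain them.
import pandas as pd

# Illustrative data with missing values
df = pd.DataFrame({'age': [25, None, 40], 'income': [50000, 62000, None]})

# Option 1: replace missing values, e.g. with the column mean
imputed = df.fillna(df.mean(numeric_only=True))

# Option 2: exclude rows that contain any missing values
dropped = df.dropna()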