Microsoft Purview is a data governance solution that supports efficient data discovery, sensitivity classification, and end-to-end data lineage. It gives organizations a holistic, up-to-date view of their data landscape, helping them extract more value from their data and make informed decisions.

Data lineage in Purview employs a directed acyclic graph to represent the data’s lifecycle across various stages, systems, and transformations. By visualizing these data flows, users can trace the origin of the data, understand the relationship between various datasets, and gauge the impact of data changes.
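To make the graph idea concrete, here is a minimal sketch in plain Python, independent of any Purview API. The dataset names and the `EDGES` list are hypothetical; the point is only that tracing origin is a walk against the edge direction, and impact analysis is a walk along it.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream dataset, downstream dataset).
EDGES = [
    ("raw_sales.csv", "cleaned_sales"),
    ("raw_customers.csv", "cleaned_customers"),
    ("cleaned_sales", "sales_by_customer"),
    ("cleaned_customers", "sales_by_customer"),
    ("sales_by_customer", "revenue_report"),
]


def build_graph(edges, reverse=False):
    """Build an adjacency map; reverse=True flips every edge."""
    graph = defaultdict(set)
    for src, dst in edges:
        if reverse:
            graph[dst].add(src)
        else:
            graph[src].add(dst)
    return graph


def reachable(graph, start):
    """Return every node reachable from start (breadth-first), excluding start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def upstream(edges, dataset):
    """Datasets that the given dataset was derived from (its origin)."""
    return reachable(build_graph(edges, reverse=True), dataset)


def downstream(edges, dataset):
    """Datasets affected if the given dataset changes (its impact)."""
    return reachable(build_graph(edges), dataset)


print(sorted(upstream(EDGES, "revenue_report")))
print(sorted(downstream(EDGES, "raw_sales.csv")))
```

Calling `downstream` on a raw file lists everything that would need rechecking after a change to it, which is the impact analysis described above.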

Pushing New or Updated Data Lineage to Purview

To push new or updated data lineage into Purview, data engineers first integrate their data sources with Purview. This is done by registering each source and configuring scanning and classification for it. Purview supports a variety of data sources, such as Azure Blob Storage, Azure Data Lake Storage, Azure SQL Database, and SQL Server on Azure VM.

Suppose lineage needs to be updated for data held in Azure Data Lake Storage. The high-level steps are as follows.

  • Register Data Sources with Purview: Before pushing new or updated lineage, register the data source with Purview by providing the necessary connection details and credentials.
  • Set up Scanning: After registration, define a scan on that data source. The scan can be scheduled daily, weekly, or on a custom cadence as required. For each scan, specify a scan rule set that determines which data is included in or excluded from the scan.
  • Apply Classifiers: Choose the classification rules the scan applies, so that Purview can identify what kinds of data (for example, sensitive types) the scanned files contain. Classifiers help in understanding the format and semantics of the data.
  • Run the Scan: After completing setup, run the scan. Once the scan completes, Purview creates or updates the catalog entries, and with them the assets that appear in data lineage.

Use Case: Updating Data Lineage for Azure Data Lake Storage

Let’s walk through an example of updating data lineage for Azure Data Lake Storage.

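import uuid

from azure.identity import DefaultAzureCredential
from azure.purview.scanning import PurviewScanningClient

# Establish a connection to the Purview account.
# The endpoint format is an assumption; copy the scan endpoint from your account's properties.
account_name = "Your-Account-Name"
purview_client = PurviewScanningClient(
    endpoint=f"https://{account_name}.scan.purview.azure.com",
    credential=DefaultAzureCredential(),
)

# Register the data source. Property names follow the Scanning REST API;
# verify them against the API version you use.
data_source_name = "SampleDataSource"
data_source = {
    "kind": "AdlsGen2",
    "properties": {
        "endpoint": "https://your-account-url.dfs.core.windows.net",
    },
}
purview_client.data_sources.create_or_update(data_source_name, body=data_source)

# Define a scan on the data source, here using the system scan rule set
# and the Purview managed identity. Limiting the scan to "your-container"
# requires a scope filter, which is not shown.
scan_name = "SampleScan"
scan = {
    "kind": "AdlsGen2Msi",
    "properties": {
        "scanRulesetName": "AdlsGen2",
        "scanRulesetType": "System",
    },
}
purview_client.scans.create_or_update(data_source_name, scan_name, body=scan)

# Run the scan; each run needs a unique run ID.
purview_client.scan_result.run_scan(data_source_name, scan_name, str(uuid.uuid4()))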

This script sketches a simple way to register an Azure Data Lake Storage source, scan it, and so refresh the catalog metadata that Purview's lineage view draws on.

With new or updated lineage pushed regularly, Microsoft Purview becomes a powerful tool for transparency and control over the data architecture. Understanding this feature is crucial for anyone studying for the DP-203 Data Engineering on Microsoft Azure exam.

To conclude, Purview’s data lineage management capabilities are a key component of the DP-203 certification material and are vital for effective data governance and data management. The ability to push new or updated lineage to the service adds to its value and gives organizations a more complete, up-to-date understanding of their data estate.

Practice Test

True or False: New or updated data lineage can be automatically pushed to Microsoft Purview.

  • True
  • False

Answer: True

Explanation: Microsoft Purview is designed to allow automatic updates of new or updated data lineage, making the process more efficient and minimizing manual input.

What is data lineage in the context of Microsoft Purview?

  • a) The process of creating data
  • b) The lifecycle of data including where it travels, changes, and ends up
  • c) The process of deleting data
  • d) The process of securing data.

Answer: b) The lifecycle of data including where it travels, changes, and ends up

Explanation: Data lineage refers to the lifecycle of data: its origins, movements, changes, and end point. Microsoft Purview allows tracking of this journey, providing valuable insights for analytics, troubleshooting, and compliance.

True or False: Data lineage can’t be manually pushed to Microsoft Purview.

  • True
  • False

Answer: False

Explanation: Although new or updated data lineage can be automatically pushed to Microsoft Purview, manual pushing of data lineage is also possible for more control.

Which of the following are benefits of updating data lineage in Microsoft Purview? (Multiple select)

  • a) Improved data governance
  • b) More control over data quality
  • c) Enhanced data security
  • d) Lower storage costs

Answer: a) Improved data governance, b) More control over data quality

Explanation: Pushing new or updated data lineage to Microsoft Purview enhances data governance and quality control by providing comprehensive visibility and control over data assets.

True or False: Microsoft Purview cannot register and scan Azure Blob Storage to update its lineage.

  • True
  • False

Answer: False

Explanation: Microsoft Purview can register and scan various sources including Azure Blob Storage to catalogue and update its lineage.

Can data lineage in Microsoft Purview help during data migration?

  • a) Yes
  • b) No

Answer: a) Yes

Explanation: Understanding data lineage can aid in migration planning by identifying dependencies and potential risks, making Microsoft Purview advantageous during data migrations.

True or False: We cannot track the data lifecycle in Microsoft Purview.

  • True
  • False

Answer: False

Explanation: Microsoft Purview enables tracking of data lifecycle through data lineage.

Microsoft Purview is used for:

  • a) Data Discovery
  • b) Data Cataloging
  • c) Data Lineage
  • d) All of the above

Answer: d) All of the above

Explanation: Microsoft Purview provides data discovery, cataloging and lineage capabilities, offering a cloud-based, unified data governance service to simplify management across on-premises, multi-cloud and software-as-a-service sources.

True or False: Data Lineage means movement of data from source to destination.

  • True
  • False

Answer: True

Explanation: Data lineage essentially reflects the journey of data from its origins, how it moves and where it ends up, helping to understand dependencies and transformations within the lifecycle.

Data lineage in Microsoft Purview supports:

  • a) Troubleshooting and analytics
  • b) Auditing and compliance
  • c) Managing data risks
  • d) All of the above

Answer: d) All of the above

Explanation: Data lineage in Microsoft Purview supports a multitude of tasks such as auditing, troubleshooting, data risk management, and analytics, making it robust for data governance.

Interview Questions

What is Microsoft Purview?

Microsoft Purview is a unified data governance service that facilitates the management and understanding of data. It helps organizations to map all their data effectively across various sources, whether inside Azure or outside.

What is data lineage in Microsoft Purview?

Data lineage in Microsoft Purview refers to the lifecycle of data, including where it comes from, how it moves across various systems, and how it changes during its journey. It creates a visual representation of data history, thus enhancing trust and simplifying governance.

What is the use of data lineage in Microsoft Purview?

Data lineage assists in understanding data origins, transformations, and dependencies. It also aids in detecting data irregularities and ensuring compliance with various policies and regulations.

What does it mean to push new or updated data lineage to Microsoft Purview?

Pushing new or updated data lineage to Microsoft Purview means updating the information about a dataset’s lifecycle or introducing a new one. This allows for more accurate data management and governance.

How can you push new or updated data lineage to Microsoft Purview?

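New or updated data lineage can be pushed by using Purview’s REST APIs or SDKs. Purview exposes Apache Atlas-compatible APIs, so each update must follow the Atlas entity schema: lineage is typically expressed as a process entity whose inputs and outputs reference assets that already exist in the catalog.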
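As a rough illustration of what such a push contains, the sketch below builds an Atlas-style request body in plain Python without calling any service. The helper names, the qualified names, and the `azure_datalake_gen2_path` type name are assumptions chosen for the example; check the type names and qualified-name formats in your own catalog before using anything like this.

```python
import json


def asset_ref(type_name, qualified_name):
    """Reference an existing catalog asset by its unique qualified name."""
    return {
        "typeName": type_name,
        "uniqueAttributes": {"qualifiedName": qualified_name},
    }


def build_process_entity(name, qualified_name, inputs, outputs):
    """Build a bulk-entity body holding one Process that links inputs to outputs.

    inputs and outputs are lists of (type_name, qualified_name) tuples.
    """
    return {
        "entities": [
            {
                "typeName": "Process",
                "attributes": {
                    "name": name,
                    "qualifiedName": qualified_name,
                    "inputs": [asset_ref(t, q) for t, q in inputs],
                    "outputs": [asset_ref(t, q) for t, q in outputs],
                },
            }
        ]
    }


# Placeholder values for illustration only.
BASE = "https://your-account-url.dfs.core.windows.net/your-container"
payload = build_process_entity(
    name="clean_sales_job",
    qualified_name="custom://jobs/clean_sales_job",
    inputs=[("azure_datalake_gen2_path", f"{BASE}/raw/sales.csv")],
    outputs=[("azure_datalake_gen2_path", f"{BASE}/clean/sales.parquet")],
)

print(json.dumps(payload, indent=2))
```

A body of this shape would be sent, with an authenticated request, to the account's Atlas bulk entity endpoint. Re-sending it with the same process `qualifiedName` updates the existing lineage rather than creating a duplicate.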

Can you import data lineage from an on-premises SQL Server to Microsoft Purview?

Yes. Microsoft Purview allows for seamless data lineage importation from various sources including on-premises SQL Server, Azure SQL Database, and others.

Why should a data engineer update data lineage in Microsoft Purview?

Updating data lineage ensures that all changes are accurately reflected. It enhances trust, improves data quality, and aids effective decision-making.

Is compliance with data regulations possible with Microsoft Purview?

Yes. Microsoft Purview’s data lineage and data classification features make it easier to meet the requirements of privacy regulations such as GDPR and HIPAA.

Can Microsoft Purview handle real-time data lineage?

Currently, Microsoft Purview doesn’t support real-time data lineage. Lineage is refreshed when an integrated service such as Azure Data Factory reports its activity, or when an update is pushed through Purview’s REST APIs or SDKs.

Which tools and technologies can be integrated with Microsoft Purview for data lineage management?

Microsoft Purview integrates with various tools and technologies including Azure Data Factory, Azure Synapse Analytics, Power BI, and more for data lineage management.

Is it possible to automate the process of pushing data lineage to Microsoft Purview?

Yes, it can be automated by developing and scheduling scripts to regularly pull updates and push them to Purview using its APIs and SDKs.
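One piece of such a script is deciding what actually changed since the last run, so that only the difference is pushed. The sketch below is a minimal, Purview-independent example of that step; the function name `plan_lineage_push` and the edge tuples are hypothetical, and the scheduling and the API calls themselves are left out.

```python
def plan_lineage_push(previous_edges, current_edges):
    """Compare two snapshots of (source, target) lineage edges.

    Returns the edges to add and the edges to remove so that the catalog
    matches the current snapshot.
    """
    previous, current = set(previous_edges), set(current_edges)
    return {
        "to_add": sorted(current - previous),
        "to_remove": sorted(previous - current),
    }


# Hypothetical snapshots from two consecutive scheduled runs.
last_run = [
    ("raw_sales.csv", "cleaned_sales"),
    ("cleaned_sales", "revenue_report"),
]
this_run = [
    ("raw_sales.csv", "cleaned_sales"),
    ("cleaned_sales", "sales_by_customer"),
    ("sales_by_customer", "revenue_report"),
]

plan = plan_lineage_push(last_run, this_run)
print(plan["to_add"])
print(plan["to_remove"])
```

A scheduled job could run this comparison on each cycle and skip the push entirely when both lists come back empty.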

Is data lineage in Microsoft Purview secure?

Yes, Microsoft Purview ensures the security of data lineage with its strict access controls, policy enforcement capabilities, and data classification features.

How does Microsoft Purview visualize data lineage?

Microsoft Purview provides a graphical representation of data lineage, enabling users to visualize the lifecycle of data from its source to its endpoint.

Can Microsoft Purview be used to catalog data lineage from open-source systems?

Yes, Microsoft Purview supports cataloging data lineage from a range of data sources, including open-source systems, with the use of the Purview Data Map.

In Microsoft Purview, which role is primarily responsible for pushing new or updated data lineage?

In Microsoft Purview, the data curator role is primarily responsible for pushing new or updated data lineage, as they work on organizing, defining, and labeling enterprise data.
