Data lineage in Purview insufficient - azure-synapse

Azure Purview currently shows data lineage from ADF only for Copy activities. Is this sufficient?
This article states: "By pushing metadata from Azure Data Factory into Azure Purview a reliable and transparent lineage tracking is enabled." Does this go beyond the Copy activity? If yes, how can we achieve it?
Is there any other way in Azure to view complete data lineage? Assume we are using ADF/Synapse/Azure Databricks.

Tools such as Data Factory, Data Share, Synapse, Azure Databricks, and so on belong to the category of data processing systems. The list of data processing systems currently integrated with Purview for lineage can be found in the Azure Purview Data Catalog lineage user guide.
Currently, the lineage scope supported by Azure Data Factory covers the Copy activity, the Data flow activity, and the Execute SSIS package activity.
And the integration between Data Factory and Purview supports only a subset of the data systems that Data Factory supports, as described here.
Azure Purview currently doesn't support query or stored procedure for lineage or scanning. Lineage is limited to table and view sources only.
Some additional ways of finding information in the lineage view include the following:
In the Lineage tab, hover over shapes to preview additional information about the asset in the tooltip.
Select a node or edge to see the asset type it belongs to, or to switch between assets.
Columns of a dataset are displayed in the left side of the Lineage tab. For more information about column-level lineage, see Dataset column lineage.
Custom lineage reporting is also supported via Atlas hooks and the REST API. Data integration and ETL tools can push lineage into Azure Purview at execution time.
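For example, here is a minimal, illustrative sketch (not an official sample) of pushing a custom lineage entity into Azure Purview through its Apache Atlas v2 REST endpoint. The account name, type names, and qualified names are placeholders, and the referenced assets must already exist in your catalog:

# Illustrative sketch: push a custom lineage "Process" entity into Azure Purview
# via the Atlas v2 REST API. Account name, type names, and qualified names below
# are placeholders; adjust them to assets that actually exist in your catalog.
import requests
from azure.identity import DefaultAzureCredential

account = "<your-purview-account>"  # placeholder
endpoint = f"https://{account}.purview.azure.com/catalog/api/atlas/v2/entity"

# The Purview data plane accepts Azure AD tokens issued for the purview.azure.net resource.
token = DefaultAzureCredential().get_token("https://purview.azure.net/.default").token

# An Atlas "Process" entity that links an input asset to an output asset is what
# shows up as lineage in the Purview UI.
payload = {
    "entity": {
        "typeName": "Process",
        "attributes": {
            "qualifiedName": "custom-etl://jobs/nightly_load",  # placeholder
            "name": "nightly_load",
            "inputs": [{
                "typeName": "azure_datalake_gen2_path",  # assumed type name
                "uniqueAttributes": {"qualifiedName": "https://<lake>.dfs.core.windows.net/raw/orders"},
            }],
            "outputs": [{
                "typeName": "azure_sql_table",  # assumed type name
                "uniqueAttributes": {"qualifiedName": "mssql://<server>.database.windows.net/<db>/dbo/orders"},
            }],
        },
    }
}

response = requests.post(endpoint, json=payload, headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()
print(response.json())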
Connecting an Azure Purview Account to a Synapse workspace allows you to discover Azure Purview assets and interact with them through Synapse capabilities.
Here is a list of the Azure Purview features that are available in Synapse:
Use the search box at the top to find Purview assets based on keywords
Understand the data based on metadata, lineage, annotations
Connect that data to your workspace with linked services or integration datasets
Analyze those datasets with Synapse Apache Spark, Synapse SQL, and Data Flow
Get an overview of the metadata, and view and edit the schema with classifications, glossary terms, data types, and descriptions
View lineage to understand dependencies and do impact analysis.
View and edit contacts to see who is an owner of or expert on a dataset
Use Related to understand the hierarchical dependencies of a specific dataset. This experience is helpful for browsing through the data hierarchy.

Synapse Lake database view not available in SQL Pool?

I'm currently exploring using Spark notebooks in Synapse for data transformation instead of data flows, but the lake database capabilities are a little confusing.
I created a lake db, an external table (catalog?) and a view using a notebook in Synapse Workspace. The view is visible in the Synapse UI and I can query it.
But the view is not available when connecting via the SQL pool, for example from Management Studio or Azure Data Studio. Is only table metadata shared? Is this intended, or am I missing something? I'm having trouble finding documentation regarding this.
The serverless SQL pool and the Spark pool share a metadata catalog, but the dedicated SQL pool has its own.
Spark views are session-scoped (temp views) or application-scoped (global temp views) and are not persisted in the shared catalog. That is why you don't see the views.
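As a concrete illustration, here is a minimal PySpark sketch for a Synapse Spark notebook (database, table, and storage paths are placeholders): the table saved into the lake database is synchronized to the shared metadata and becomes queryable from the serverless SQL pool, while the temp view exists only inside the Spark session.

# Minimal PySpark sketch for a Synapse Spark notebook; names and paths are placeholders.
# 'spark' is the SparkSession that Synapse notebooks provide automatically.

df = spark.read.parquet("abfss://data@<yourlake>.dfs.core.windows.net/raw/entity01/")

spark.sql("CREATE DATABASE IF NOT EXISTS lakedb")

# Persisted table: written to the shared metadata catalog, so it also shows up
# in the serverless SQL pool (e.g. from SSMS or Azure Data Studio).
df.write.mode("overwrite").saveAsTable("lakedb.entity01")

# Temp view: scoped to this Spark session only; it is never written to the
# catalog, which is why it does not appear when you connect via the SQL pool.
df.createOrReplaceTempView("entity01_view")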

Usage tracking in Azure Synapse Analytics

Can anyone share a Kusto query (KQL) that I can use in log analytics that would return some usage tracking stats?
I am trying to identify which "Views" and "Tables" are used the most. I am also trying to find out who the power users are and which commands/queries are run against the "Tables".
Any insights would be appreciated.
You can use the below functions to gather the usage statistics:
DiagnosticMetricsExpand()
DiagnosticLogsExpand()
ActivityLogRecordsExpand()
Then create target tables to store the output of these functions so that you can analyse the usage information.
Refer to the Azure documentation for complete details: https://learn.microsoft.com/en-us/azure/data-explorer/ingest-data-no-code?tabs=activity-logs
Tutorial: Ingest monitoring data in Azure Data Explorer without code
In this tutorial, you learn how to ingest monitoring data into Azure Data Explorer without writing a single line of code and how to query that data.
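If the Synapse diagnostic logs are instead routed to a Log Analytics workspace, a rough sketch of pulling usage counts with Python is shown below. The workspace ID is a placeholder, and the table and column names (SynapseSqlPoolExecRequests, Command, LoginName) are assumptions: verify which tables your diagnostic settings actually emit and adjust the KQL accordingly.

# Rough sketch: query Synapse diagnostic logs in a Log Analytics workspace to
# count requests per login. Table and column names in the KQL are assumptions;
# check the tables available in your workspace before relying on this.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Assumed KQL: top 20 logins by number of executed statements in the last 30 days.
kql = """
SynapseSqlPoolExecRequests
| where isnotempty(Command)
| summarize Queries = count() by LoginName
| top 20 by Queries
"""

response = client.query_workspace(
    workspace_id="<your-log-analytics-workspace-id>",  # placeholder
    query=kql,
    timespan=timedelta(days=30),
)

for table in response.tables:
    for row in table.rows:
        print(row)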

Using Dataverse Synapse Link as reference data for Azure Stream Analytics

We are trying to use our Dataverse data as reference data for Azure Stream Analytics. The idea is to couple customer activities with their CRM profile to create meaningful actions for them. We are currently moving from the Data Export Service (DES) to the Dataverse Synapse Link, and have created the data lake where the data gets dumped; we can see it in Synapse Studio. However, Stream Analytics does not take the CDM format out of the box. It seems it can only handle CSV (with headers) and JSON formats.
What is the best approach to get our Dataverse data in as reference data for Stream Analytics (and as close to real time as possible)? Should we create a custom deserializer, use ADF, or something else?

Can we join tables in an on-premises SQL Server database to Delta tables in an Azure data lake? What are my options?

I am archiving rows that are older than a year into ADLS Gen2 as Delta tables. When there is a need to report on that data, I need to join the archived data with some tables in the on-premises database. Is there a way we can do the join without re-hydrating data from, or hydrating data to, the cloud?
Yes, you can achieve this task by using Azure Data Factory.
Azure Data Factory (ADF) is a fully managed, serverless data integration service. Visually integrate data sources with more than 90 built-in, maintenance-free connectors at no added cost. Easily construct ETL and ELT processes code-free in an intuitive environment or write your own code.
Firstly, you need to install the Self-hosted Integration Runtime on a local machine so that ADF can access the on-premises SQL Server. To accomplish this, refer to Connect to On-premises Data in Azure Data Factory with the Self-hosted Integration Runtime.
As you have archived the data to ADLS, make sure the blobs are in an online access tier; if they were moved to the Archive tier, they must be rehydrated (for example to Hot or Cool) before ADF can read the data.
Next, create a Linked Service using the Self-hosted IR you created, and create a Dataset on that Linked Service to access the on-premises database.
Similarly, create a Linked Service using the default Azure IR, and create a Dataset on it to access the data in ADLS.
You also need a destination database where you will store the joined data. If you are storing it in the same on-premises database, you can reuse the existing Linked Service, but you need to create a new Dataset pointing at the destination table.
Once all this configuration is done, create a pipeline with a Data Flow activity in ADF.
Mapping data flows are visually designed data transformations in Azure Data Factory. Data flows allow data engineers to develop data transformation logic without writing code. The resulting data flows are executed as activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters.
Learn more about Mapping data flow here.
Finally, in the Data Flow activity, your sources will be the on-premises dataset and the ADLS dataset created above. Use the join transformation in the mapping data flow to combine the data from the two sources; the output stream will include all columns from both sources, matched based on the join condition.
The sink transformation will use your destination dataset, which is where the joined data will be stored.
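Once the pipeline and data flow are configured in the ADF Studio UI, you can trigger and monitor them programmatically if needed. A rough sketch using the azure-mgmt-datafactory SDK is below; the subscription, resource group, factory, and pipeline names are all placeholders.

# Rough sketch: trigger the ADF pipeline that contains the Data Flow activity and
# poll its status. All names and IDs below are placeholders.
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run = client.pipelines.create_run(
    resource_group_name="<resource-group>",
    factory_name="<data-factory-name>",
    pipeline_name="<join-archive-pipeline>",
)

# Poll until the pipeline run reaches a terminal state.
while True:
    status = client.pipeline_runs.get("<resource-group>", "<data-factory-name>", run.run_id).status
    print("Pipeline run status:", status)
    if status not in ("Queued", "InProgress"):
        break
    time.sleep(30)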

Dealing with multiple readers/writers in Azure Data Lake

I am new to Azure Data Lake and am currently using Data Factory v2 to move data from my transactional database to Azure Data Lake Storage.
Consider a scenario
Company has multiple datasources
Team A is responsible for Source A
Team B is responsible for Source B
Team C is responsible for Source C
Multiple Writers
Each Team is responsible for moving the data into the data lake.
Team A moves data under
/TeamA/entity01.csv
/TeamA/entity02.csv
..
Team B moves data under
/TeamB/entity03.csv
..
Multiple Readers
Team Analytics can read the data and perform calculations in a databricks environment
Team Power BI can fetch the data, transform it, and copy it into single-tenant folders
Tenant1/entity01.csv
Tenant2/entity02.csv
Question
How can the readers read without conflicting with the writers, so that while a reader is reading data the file is NOT simultaneously being written to by a Team X Data Factory update activity?
What I was thinking / What have I tried :
I was thinking of having a shared source of metadata (maybe in a table storage account accessible by all the readers), for example:
"teamA/entity1" : [
"TeamA/Entity1/01-02-2018/0000/data.csv",
"TeamA/Entity1/01-01-2018/0000/data.csv",
]
"teamA/entity2" : [
"TeamA/Entity2/01-01-2018/1200/data.csv"
"TeamA/Entity2/01-01-2018/0600/data.csv"
"TeamA/Entity2/01-01-2018/0000/data.csv"
]
"teamB/entity3" : [
"TeamA/Entity3/01-01-2018/0600/data.csv"
"TeamA/Entity3/01-01-2018/0000/data.csv"
]
The writers will have the added responsibility of maintaining a set of versions, so that existing data is never deleted or overwritten in place.
The readers will have the added responsibility of performing a lookup against this metadata and then reading the data, as sketched below.
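To make the idea concrete, here is a small, self-contained Python sketch of the lookup logic (the version registry is held in memory here; in practice it would live in Table Storage or another store every team can reach). Writers finish writing a new, uniquely named file first and only then register it, so readers never resolve a path that is still being written.

# Illustrative sketch of the shared-metadata idea. The registry is in memory here;
# in practice it would be Azure Table Storage (or similar) shared by all teams.
from collections import defaultdict
from datetime import datetime, timezone

versions = defaultdict(list)  # "teamA/entity1" -> registered file paths, oldest first

def register_version(entity_key: str, path: str) -> None:
    """Writer side: call this only AFTER the file has been completely written."""
    versions[entity_key].append(path)

def latest_path(entity_key: str) -> str:
    """Reader side: resolve the most recently registered (i.e. complete) path."""
    return versions[entity_key][-1]

# Writer (e.g. a Data Factory run) lands a new timestamped file, then registers it.
run_ts = datetime.now(timezone.utc).strftime("%d-%m-%Y/%H%M")
register_version("teamA/entity1", f"TeamA/Entity1/{run_ts}/data.csv")

# Reader (Databricks job, Power BI refresh) looks up the path just before reading.
print(latest_path("teamA/entity1"))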
Data Lake writes to temporary files in the background before subsequently writing to the actual file, which will likely mitigate this problem; however, I'm unsure whether this will 100% avoid clashes.
If you are willing to have the pipelines in one factory, you could use the built-in chaining of activities to let Data Factory manage the dependencies.
We typically write to "serving storage" such as SQL Server rather than letting Power BI have direct access to Data Lake Store, which may help separate things (and also benefits from DirectQuery, etc.).
However, I haven't seen Databricks support yet; I'd bet it is coming, similar to how HDInsight can be used.
Notably, as you are finding, Data Lake Store is not an OLTP data source, so this sort of thing isn't what it is meant for; this Stack Overflow post discusses it in more detail: Concurrent read/write to ADLA