How do we find out the dependencies between various ADF entities such as Pipelines, Datasets & Linked Services?
Example: I have one dataset DS_ASQL_DB. How do we check if this dataset is being used/referred to in any ADF pipelines?
In the ADF UI, we can click the entity and see the Related tab.
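If you prefer to check this programmatically instead of clicking through each entity, below is a minimal sketch (not an official ADF feature) that uses the azure-mgmt-datafactory Python SDK to list every pipeline in a factory and flag the ones whose definition mentions the dataset. The subscription, resource group, and factory names are placeholders, and the string search over the serialized definition is only a rough dependency check.

```python
# A rough dependency check: scan all pipelines in a factory and report which
# ones reference a given dataset name. Assumes the azure-identity and
# azure-mgmt-datafactory packages; subscription, resource group, and factory
# names below are placeholders.
import json

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY_NAME = "<factory-name>"
DATASET_NAME = "DS_ASQL_DB"

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

for pipeline in client.pipelines.list_by_factory(RESOURCE_GROUP, FACTORY_NAME):
    # Serialize the pipeline definition and look for the dataset reference name.
    definition = json.dumps(pipeline.as_dict())
    if DATASET_NAME in definition:
        print(f"Pipeline '{pipeline.name}' references dataset '{DATASET_NAME}'")
```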
I am moving from SSIS to Azure. We have hundreds of files and MSSQL tables that we want to push into a Gen2 data lake, using three zones followed by a SQL data lake. The zones are Raw, Staging & Presentation (change the names as you wish).
What is the best process to automate this as much as possible? For example: build a table listing the files / folders / tables to bring into the Raw zone, then have Synapse bring in these objects as either full or incremental loads, and then process them into the next two zones, I guess with more custom code as we progress.
Your requirement can be accomplished using multiple activities in Azure Data Factory.
To migrate SSIS packages, you need to use the Azure-SSIS Integration Runtime (IR). ADF supports SSIS integration, which can be configured by creating a new SSIS IR: click Configure SSIS Integration, provide the basic details, and create the runtime.
Refer to this third-party tutorial by SQLShack to move local SSIS packages to Azure Data Factory.
Next, copy the data to the different zones using the Copy activity. You can make as many copies of your data as you need with it. Refer to Copy data between Azure data stores using Azure Data Factory.
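If you would rather script those copies than build them in the UI, here is a hedged sketch using the azure-mgmt-datafactory Python SDK of a pipeline with a single Copy activity that lands a SQL table into the Raw zone as Parquet. The dataset and pipeline names are placeholders, and the referenced datasets are assumed to already exist in the factory.

```python
# A hedged sketch: a pipeline with one Copy activity that copies a SQL Server
# table into the Raw zone as Parquet. Dataset and pipeline names are
# placeholders and must already exist in the factory.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    CopyActivity,
    DatasetReference,
    ParquetSink,
    PipelineResource,
    SqlServerSource,
)

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY_NAME = "<factory-name>"

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

copy_to_raw = CopyActivity(
    name="CopySourceTableToRaw",
    inputs=[DatasetReference(reference_name="SourceSqlTableDS")],
    outputs=[DatasetReference(reference_name="RawZoneParquetDS")],
    source=SqlServerSource(),   # full load; add a query here for selective loads
    sink=ParquetSink(),
)

client.pipelines.create_or_update(
    RESOURCE_GROUP,
    FACTORY_NAME,
    "PL_Copy_To_Raw",
    PipelineResource(activities=[copy_to_raw]),
)
```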
ADF also supports incrementally loading data using Change Data Capture (CDC).
Note: Both Azure SQL MI and SQL Server support the Change Data Capture technology.
A tumbling window trigger and the CDC window parameters need to be configured to automate the incremental load. Check this official tutorial.
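As a hedged illustration of that setup, the sketch below creates a tumbling window trigger with the Python SDK and passes the window boundaries to an incremental pipeline. The pipeline name and its triggerStartTime/triggerEndTime parameters are assumptions and must match parameters you define on your own CDC pipeline.

```python
# A hedged sketch of a tumbling window trigger that passes the window start and
# end to a CDC-style incremental pipeline. The pipeline name and the
# triggerStartTime/triggerEndTime parameter names are placeholders.
from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineReference,
    TriggerPipelineReference,
    TriggerResource,
    TumblingWindowTrigger,
)

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY_NAME = "<factory-name>"

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

trigger = TumblingWindowTrigger(
    pipeline=TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="PL_Incremental_CDC_Copy"),
        parameters={
            # The pipeline reads these to query only the current CDC window.
            "triggerStartTime": "@trigger().outputs.windowStartTime",
            "triggerEndTime": "@trigger().outputs.windowEndTime",
        },
    ),
    frequency="Hour",   # one window per hour
    interval=1,
    start_time=datetime.utcnow() - timedelta(hours=1),
    max_concurrency=1,
)

client.triggers.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "TR_Hourly_CDC_Window",
    TriggerResource(properties=trigger),
)
```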
The last part:
then process them into the next 2 zones
You will need to manage this part programmatically, as ADF has no built-in feature that updates the other copies of the data based on CDC. You either need to set up separate CDC tracking for those zones or handle the propagation in your own logic.
I am archiving rows that are older than a year into ADLS Gen2 as Delta tables. When there is a need to report on that data, I need to join the archived data with some tables in an on-premises database. Is there a way we can do the join without re-hydrating data from, or hydrating data to, the cloud?
Yes, you can achieve this task by using Azure Data Factory.
Azure Data Factory (ADF) is a fully managed, serverless data integration service. Visually integrate data sources with more than 90 built-in, maintenance-free connectors at no added cost. Easily construct ETL and ELT processes code-free in an intuitive environment or write your own code.
Firstly, you need to install the Self-hosted Integration Runtime on your local machine to access the on-premises SQL Server from ADF. To accomplish this, refer to Connect to On-premises Data in Azure Data Factory with the Self-hosted Integration Runtime.
As you have archived the data in ADLS, check the access tier of those blobs: data in the Hot or Cool tier is online and can be read directly by ADF, but blobs in the Archive tier must be rehydrated to Hot or Cool before ADF can read them.
Next, create a Linked Service that uses the Self-hosted IR you created, and create a Dataset on top of it to access the on-premises database.
Similarly, create a Linked Service using the default Azure IR, and create a Dataset on top of it to access the data in ADLS.
You also require a destination where the joined data will be stored. If you are storing it in the same on-premises database, you can reuse the existing Linked Service, but you need to create a new Dataset pointing to the destination table.
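For reference, here is a hedged sketch of creating the two Linked Services with the azure-mgmt-datafactory Python SDK instead of the UI; the IR name, connection string, storage URL, and credentials are placeholders. Datasets can be created the same way with client.datasets.create_or_update.

```python
# A hedged sketch: a Linked Service over the self-hosted IR for the on-premises
# SQL Server, and one over the default Azure IR for ADLS Gen2. All names, the
# connection string, and the storage URL are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobFSLinkedService,
    IntegrationRuntimeReference,
    LinkedServiceResource,
    SqlServerLinkedService,
)

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY_NAME = "<factory-name>"

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# On-premises SQL Server, reached through the self-hosted IR installed earlier.
onprem_ls = SqlServerLinkedService(
    connection_string="Server=<onprem-server>;Database=<db>;Integrated Security=True;",
    connect_via=IntegrationRuntimeReference(reference_name="SelfHostedIR"),
)
client.linked_services.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "LS_OnPrem_SqlServer",
    LinkedServiceResource(properties=onprem_ls),
)

# ADLS Gen2 account holding the archived data (uses the default Azure IR).
adls_ls = AzureBlobFSLinkedService(
    url="https://<storage-account>.dfs.core.windows.net",
    account_key="<account-key>",
)
client.linked_services.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "LS_ADLS_Archive",
    LinkedServiceResource(properties=adls_ls),
)
```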
Once all this configuration is done, create a pipeline with a Data Flow activity in ADF.
Mapping data flows are visually designed data transformations in Azure Data Factory. Data flows allow data engineers to develop data transformation logic without writing code. The resulting data flows are executed as activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters.
Learn more about Mapping data flow here.
Finally, in the data flow, your sources will be the on-premises dataset and the ADLS dataset which you created above. Use the join transformation in the mapping data flow to combine the data from the two sources; the output stream will include all columns from both sources, matched based on a join condition.
The sink transformation will then write the joined output to your destination dataset.
We need to migrate only those Datasets from ADF to Synapse Analytics which are linked with Linked Services and Pipelines.
The GitHub solution (from a previous post: https://learn.microsoft.com/en-us/answers/questions/533505/import-bulk-pipelines-from-azure-data-factory-to-a.html) migrates all datasets, pipelines, and linked services from ADF to Synapse Analytics.
But we need to migrate only the Datasets, Linked Services, and Pipelines that are linked to each other, and we do not need to migrate the ones that are not linked.
Unfortunately, there is no direct way to exclude the unwanted objects from Azure Data Factory when migrating to another service (Synapse Analytics in your case).
As a workaround, you can make a copy of the existing factory, remove the objects you do not wish to migrate, and use that new factory as your source.
Please follow the steps below to copy the existing data factory objects to the new data factory.
Go to your existing ADF Workspace. Follow the path: Manage -> ARM templates -> Export ARM template.
Extract the downloaded file and open arm_template.json in Notepad++ or any other editor. In the factory name parameter (around line 8), set defaultValue to the name of the new data factory to which you will copy the objects.
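If you prefer to make that edit in code, here is a minimal sketch. The parameter name factoryName is the one typically found in ADF ARM template exports, so verify it against your own file, and the new factory name is a placeholder.

```python
# A minimal sketch: load the exported ARM template and point the factoryName
# parameter's defaultValue at the new data factory. Verify the parameter name
# against your own arm_template.json before running.
import json

TEMPLATE_PATH = "arm_template.json"
NEW_FACTORY_NAME = "<new-data-factory-name>"

with open(TEMPLATE_PATH, encoding="utf-8") as f:
    template = json.load(f)

template["parameters"]["factoryName"]["defaultValue"] = NEW_FACTORY_NAME

with open(TEMPLATE_PATH, "w", encoding="utf-8") as f:
    json.dump(template, f, indent=4)
```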
Create a new Azure Data Factory with the same name that you provided in the step above.
Go to the workspace of this newly created data factory. Follow the path: Manage -> ARM template -> Import ARM template. This opens a separate Custom deployment tab.
Select Build your own template in the editor option.
Delete the existing content in the editor, click the Load file option to upload the arm_template.json file that you downloaded and edited earlier, and click Save.
In the final step, provide the Subscription, Resource Group, Region, and Name of the newly created data factory where all your objects will be copied. Along with that, provide the connection strings of all the linked services that will be copied to the new factory. Once done, click Review + create, and this will copy all your objects to the new Data Factory.
Now, in your new factory, you can delete all the objects which you don't want to migrate. Once done, follow the same GitHub link mentioned in the Microsoft Q&A answer to migrate the remaining objects to Synapse Analytics.
Note: You can later delete the intermediate resources (the data factory used only for the migration).
Azure Purview at the moment shows the data lineage from ADF only for Copy activities. Is this sufficient?
In this article it says: "By pushing metadata from Azure Data Factory into Azure Purview a reliable and transparent lineage tracking is enabled." Is this above and beyond the Copy activity? If yes, how can we achieve this?
Is there any other way in Azure to view complete data lineage? Assume we are using ADF/Synapse/Azure Databricks.
Tools such as Data Factory, Data Share, Synapse, Azure Databricks, and so on belong to the category of data processing systems. The list of data processing systems currently integrated with Purview for lineage can be seen in the Azure Purview Data Catalog lineage user guide.
Currently, the supported scope for Azure Data Factory is: Copy activity, Data flow activity, and Execute SSIS package activity.
And the integration between Data Factory and Purview supports only a subset of the data systems that Data Factory supports, as described here.
Azure Purview currently doesn't support query or stored procedure for lineage or scanning. Lineage is limited to table and view sources only.
Some additional ways of finding information in the lineage view include the following:
In the Lineage tab, hover over shapes to preview additional information about the asset in a tooltip.
Select a node or edge to see the asset type it belongs to, or to switch assets.
Columns of a dataset are displayed in the left side of the Lineage tab. For more information about column-level lineage, see Dataset column lineage.
Custom lineage reporting is also supported via Atlas hooks and the REST API. Data integration and ETL tools can push lineage into Azure Purview at execution time.
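As a hedged illustration of that REST path, the sketch below posts one custom Process entity with an input and an output to the Purview account's Atlas v2 bulk-entity endpoint. The account name, type names, and qualified names are placeholders, and the referenced source and target assets are assumed to already exist in the catalog; the pyapacheatlas library is a higher-level alternative to raw REST calls.

```python
# A hedged sketch of pushing custom lineage into Azure Purview through its
# Atlas v2 REST endpoint. Account name, type names, and qualified names are
# placeholders; the referenced input and output assets must already exist in
# the Purview catalog.
import requests
from azure.identity import DefaultAzureCredential

PURVIEW_ACCOUNT = "<purview-account-name>"
ENDPOINT = f"https://{PURVIEW_ACCOUNT}.purview.azure.com/catalog/api/atlas/v2/entity/bulk"

token = DefaultAzureCredential().get_token("https://purview.azure.net/.default").token

# One Atlas "Process" entity describing a custom ETL step, linking an existing
# input asset to an existing output asset by their qualified names.
payload = {
    "entities": [
        {
            "typeName": "Process",
            "attributes": {
                "qualifiedName": "custom://etl/archive_join_job",
                "name": "archive_join_job",
                "inputs": [
                    {
                        "typeName": "azure_sql_table",
                        "uniqueAttributes": {"qualifiedName": "<input-asset-qualified-name>"},
                    }
                ],
                "outputs": [
                    {
                        "typeName": "azure_datalake_gen2_path",
                        "uniqueAttributes": {"qualifiedName": "<output-asset-qualified-name>"},
                    }
                ],
            },
        }
    ]
}

response = requests.post(
    ENDPOINT,
    json=payload,
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```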
Connecting an Azure Purview Account to a Synapse workspace allows you to discover Azure Purview assets and interact with them through Synapse capabilities.
Here is a list of the Azure Purview features that are available in Synapse:
Use the search box at the top to find Purview assets based on keywords
Understand the data based on metadata, lineage, annotations
Connect those data to your workspace with linked services or integration datasets
Analyze those datasets with Synapse Apache Spark, Synapse SQL, and Data Flow
Get an overview of the metadata, and view and edit the schema of the metadata with classifications, glossary terms, data types, and descriptions
View lineage to understand dependencies and do impact analysis.
View and edit Contacts to know who is an owner or expert over a dataset
Use Related to understand the hierarchical dependencies of a specific dataset. This experience is helpful for browsing through the data hierarchy.
This question might not be well researched, but I need to find out the proper way to implement this solution before starting the design.
The question is: can we consume an SSAS MDX query as a data source in an Azure Data Factory Linked Service?
Data Factory cannot query SSAS with MDX or DAX, but maybe you can query the source of the SSAS model; in a traditional BI architecture that would be a data warehouse or a SQL Server. This is because SSAS models are meant to be consumed by reporting tools (Power BI, Reporting Services, etc.) and not by data integration tools, which serve very different purposes.
Cheers!
The list of connectors supported by the Copy activity as of today is available here:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview#supported-data-stores-and-formats
It looks like SSAS MDX queries are not included at this point.
ADF v2 supports running SSIS packages within ADF pipelines, so it may be possible via that route (untested).