How to execute a Databricks notebook when multiple files are loaded to ADLS - azure-data-factory-2

I'm looking for a lightweight way of executing a Databricks notebook that depends on multiple files having been loaded to Azure Data Lake Storage.
Several different ADF packages load different files into ADLS, which are then processed by Databricks notebooks. Some of the notebooks depend on multiple files from different packages.
A single file is simple enough with an event trigger. Can this be generalised to more than one file without something like Airflow handling the dependencies?

This isn't exactly lightweight, since you'll have to provision an Azure SQL table, but here is what I would do:
I would create and store a JSON file in ADLS that details each notebook/pipeline and its file-name dependencies.
I'd then provision an Azure SQL table to store the metadata of each of these files. Essentially, this table would have the following columns:
General file name, matching the file-name dependencies in step 1 (e.g. FileName)
Real file name (e.g. FileName_20201007.csv)
Timestamp
Flag (boolean) indicating whether the file is present
Flag (boolean) indicating whether the file has been processed (i.e. its dependent Databricks notebook has run)
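A minimal sketch of what the dependency file and watermark table could look like; every name here (dependencies.json, dbo.FileWatermark, the connection string) is a hypothetical placeholder rather than a prescribed schema:

```python
# Hypothetical shape of the dependency file (step 1) and watermark table (step 2).
# dependencies.json, dbo.FileWatermark and the connection string are placeholders.
import json
import pyodbc

# Contents you might store as dependencies.json in ADLS: notebook -> required files.
dependencies = {
    "SalesNotebook": ["CustomerFile", "OrdersFile"],
    "FinanceNotebook": ["OrdersFile", "InvoiceFile"],
}
print(json.dumps(dependencies, indent=2))

# One possible definition of the watermark table holding the file metadata.
ddl = """
CREATE TABLE dbo.FileWatermark (
    GeneralFileName NVARCHAR(200) NOT NULL,  -- e.g. 'CustomerFile'
    RealFileName    NVARCHAR(400) NOT NULL,  -- e.g. 'CustomerFile_20201007.csv'
    LoadedTimestamp DATETIME2     NOT NULL,
    IsPresent       BIT NOT NULL DEFAULT 0,  -- file has arrived in ADLS
    IsProcessed     BIT NOT NULL DEFAULT 0   -- dependent notebook has run
)
"""

with pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=<server>;DATABASE=<db>;UID=<user>;PWD=<pwd>") as conn:
    conn.execute(ddl)
    conn.commit()
```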
To populate the table in step 2, I'd use an Azure Logic App that looks for when a blob meeting your criteria is created and then updates or inserts the corresponding entry in the Azure SQL table.
See:
https://learn.microsoft.com/en-us/azure/connectors/connectors-create-api-azureblobstorage &
https://learn.microsoft.com/en-us/azure/connectors/connectors-create-api-sqlazure
You'll need to ensure that at the end of the Azure pipeline/Databricks notebook run, you update the Azure SQL flags of the respective dependencies to indicate that these versions of the files have been processed. Your Azure SQL table will function as a 'watermark' table.
Before your pipeline triggers the Azure Databricks notebook, it should look up the JSON file in ADLS, identify the dependencies for each notebook, check that all the dependencies are available AND not yet processed by the Databricks notebook, and only continue to run the notebook once all of these criteria are met.
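A rough sketch of that pre-flight check, reusing the hypothetical dependencies.json and dbo.FileWatermark names from above; the account URL, credential, container, file path and connection string are all placeholders:

```python
# Check whether all files a notebook depends on are present and not yet processed.
# Account URL, credential, container, file path and connection string are placeholders.
import json
import pyodbc
from azure.storage.filedatalake import DataLakeServiceClient

def dependencies_ready(notebook_name: str) -> bool:
    # 1. Read the dependency definitions from ADLS.
    service = DataLakeServiceClient(
        account_url="https://<account>.dfs.core.windows.net", credential="<account-key>")
    file_client = service.get_file_system_client("config").get_file_client("dependencies.json")
    required = json.loads(file_client.download_file().readall())[notebook_name]

    # 2. Every dependency must be flagged as present and unprocessed in the watermark table.
    placeholders = ",".join("?" * len(required))
    query = (
        "SELECT COUNT(DISTINCT GeneralFileName) FROM dbo.FileWatermark "
        f"WHERE GeneralFileName IN ({placeholders}) AND IsPresent = 1 AND IsProcessed = 0")
    with pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=<server>;DATABASE=<db>") as conn:
        ready = conn.execute(query, *required).fetchone()[0]

    return ready == len(required)
```

The pipeline (or Logic App) would only run the Databricks notebook when this returns True, and flip IsProcessed to 1 for those rows once the notebook completes.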
In terms of triggering your pipeline, you could either use an Azure Logic App or leverage a tumbling window trigger in ADF.

Related

Load multiple files using Azure Data Factory or Synapse

I am moving from SSIS to Azure.
We have hundreds of files and MSSQL tables that we want to push into a Gen2 data lake,
using three zones and then a SQL data lake.
The zones are Raw, Staging & Presentation (change the names as you wish).
What is the best process to automate this as much as possible?
For example: build a table listing the files / folders / tables to bring into the Raw zone,
have Synapse bring these objects in as either a full or incremental load,
then process them into the next two zones; I guess more custom code as we progress.
Your requirement can be accomplished using multiple activities in Azure Data Factory.
To migrate SSIS packages, you need to use the SSIS Integration Runtime (IR). ADF supports SSIS integration, which can be configured by creating a new SSIS integration runtime. To create one, click Configure SSIS Integration, provide the basic details, and create the runtime.
Refer to this third-party SQLShack tutorial on moving local SSIS packages to Azure Data Factory.
Now, to copy the data to the different zones, use the Copy activity. You can make as many copies of your data as you require using the Copy activity. Refer to Copy data between Azure data stores using Azure Data Factory.
ADF also supports incrementally loading data using Change Data Capture (CDC).
Note: Both Azure SQL MI and SQL Server support the Change Data Capture technology.
A tumbling window trigger and the CDC window parameters need to be configured to automate the incremental load. Check this official tutorial; a rough sketch of the underlying window check is shown below.
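For orientation, the window check in that tutorial boils down to counting the CDC changes between the trigger's window start and end times. A hedged Python sketch of the same query follows; the capture instance name (dbo_customers) and connection string are assumptions for illustration, not values from the tutorial:

```python
# Sketch of the "did anything change in this tumbling window?" check that the
# ADF tutorial performs with a Lookup activity. The capture instance name
# (dbo_customers) and the connection string are assumptions for illustration.
from datetime import datetime
import pyodbc

def count_cdc_changes(window_start: datetime, window_end: datetime) -> int:
    sql = """
    SET NOCOUNT ON;
    DECLARE @from_lsn BINARY(10) = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', ?);
    DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_map_time_to_lsn('largest less than or equal', ?);
    SELECT COUNT(1) FROM cdc.fn_cdc_get_all_changes_dbo_customers(@from_lsn, @to_lsn, 'all');
    """
    with pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=<server>;DATABASE=<db>") as conn:
        return conn.execute(sql, window_start, window_end).fetchone()[0]

# In ADF the same query sits in a Lookup activity, with the tumbling window
# trigger's windowStart/windowEnd passed in as pipeline parameters; the copy
# activity runs only when the returned count is greater than zero.
```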
As for the last part:
then process them into the next 2 zones
This you need to manage programmatically, as ADF has no feature that can update the other copies of the data based on CDC. You either need to create a separate CDC process for those zones or handle it in your own logic.

Oracle Cloud to Azure Cloud storage

We have a requirement to move data from Oracle Cloud storage to Azure Cloud storage.
The requirement is basically to move data from an Oracle ADW database (hosted on Oracle Cloud) to a Snowflake database (hosted on Azure).
Since the data volume in the tables is huge (some with 60 million+ records), we do not wish to use any ETL tool and instead want to set up a pipeline as below:
Oracle ADW database -> Store data in Oracle storage -> Move data to Azure Cloud storage -> Load into Snowflake using Snowpipe or similar Snowflake utilities.
How should I go about this implementation?
Also, please share your views on whether we can use Oracle FastConnect and Azure ExpressRoute to pull data directly from Oracle Cloud into Snowflake (or into Azure storage).
I am looking for the same thing: the simplest method to get from Oracle (on-prem, but could be cloud) into Snowflake. It looks like the data must be exported or dropped to external tables, shifted to Azure Blob storage (analogous to AWS S3), then pushed into Snowflake using COPY INTO - basically copying on-disk external tables. This is what Snowpipe does:
"Snowpipe copies the files into a queue, from which they are loaded into the target table in a continuous, serverless fashion based on parameters defined in a specified pipe object. The following table indicates the cloud storage service support for automated Snowpipe from Snowflake accounts hosted on each cloud platform:"
It's been a while since I have worked with this. The other option is GoldenGate, which was not expensive the last time I looked into it:
https://www.snowflake.com/blog/continuous-data-replication-into-snowflake-with-oracle-goldengate/
Easy, simple, fast. Any better ideas would be appreciated.

Excel into Azure Data Factory into SQL

I read a few threads on this but noticed most are outdated, Excel having become an integration in 2020.
I have a few Excel files stored in Dropbox. I would like to automate the extraction of that data into Azure Data Factory, perform some ETL with data coming from other sources, and finally push the final, complete table to Azure SQL.
What is the most efficient way of doing this?
Would it be to automate a Logic App to extract the xlsx files into Azure Blob Storage, use Data Factory for ETL, join with other SQL tables, and finally push the final table to Azure SQL?
Appreciate it!
Before using a Logic App to extract the Excel files, review the known issues and limitations of the Excel connectors.
If you are importing large files using a Logic App, then depending on their size, also consider this thread: logic apps vs azure functions for large files.
To summarize the approach, the steps are:
Step 1: Use an Azure Logic App to upload the Excel files from Dropbox to Blob Storage.
Step 2: Create a Data Factory pipeline with a Copy data activity.
Step 3: Use the Blob Storage service as the source dataset.
Step 4: Create the SQL database with the required schema.
Step 5: Do the schema mapping.
Step 6: Finally, use the SQL database table as the sink.
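As a side note, if the workbooks are small, a lightweight alternative to the Copy activity (not part of the steps above, just an illustration) is to read them directly with pandas and push the rows to Azure SQL, for example from an Azure Function. All names below are hypothetical, and openpyxl plus an ODBC driver are assumed to be installed:

```python
# Illustrative alternative for small workbooks only: read the xlsx from Blob
# Storage with pandas and append it to an Azure SQL staging table.
# Container, blob, table names and connection strings are hypothetical.
import io
import pandas as pd
from azure.storage.blob import BlobClient
from sqlalchemy import create_engine

blob = BlobClient.from_connection_string(
    conn_str="<blob-connection-string>", container_name="excel-drop", blob_name="sales.xlsx")
df = pd.read_excel(io.BytesIO(blob.download_blob().readall()), sheet_name="Sheet1")

engine = create_engine(
    "mssql+pyodbc://<user>:<password>@<server>.database.windows.net/<db>"
    "?driver=ODBC+Driver+17+for+SQL+Server")
df.to_sql("staging_sales", engine, schema="dbo", if_exists="append", index=False)
```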

Options for ingesting and processing data in Azure sql

I need expert opinion on a project I am working on. We currently get data files that we load into our Azure SQL database using a local script that calls stored procedures. I am planning on replacing the script with SSIS jobs to load the data into Azure SQL, but I'm wondering if that's a good option given our needs. I am open to different suggestions too.
The process we go through is to load the data files into staging tables and validate them before making updates to the live tables. The validation and updates are done by calling stored procedures, so the SSIS package would just load the data and call those stored procedures.
I have looked at ADF IR and Databricks but they seem overkill, though I am open to hearing from people with experience using those as well. I am currently running the SSIS package locally as well. Any suggestions on a better architecture or tools for this scenario? Thanks!
I would definitely have a look at Azure Data Factory Data Flows. With these you can easily build your ETL pipelines in the Azure Data Factory GUI.
In the following example, two text files from Blob Storage are read, joined, a surrogate key is added, and finally the data is loaded into Azure Synapse Analytics (it would be the same for Azure SQL).
You finally put this Mapping Data Flow into a pipeline and can trigger it, e.g. when new data arrives.
You can just BULK INSERT data from Azure Blob Storage:
https://learn.microsoft.com/en-us/sql/relational-databases/import-export/examples-of-bulk-access-to-data-in-azure-blob-storage?view=sql-server-ver15#accessing-data-in-a-csv-file-referencing-an-azure-blob-storage-location
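Condensed, the pattern from that article looks roughly like the following; the SAS secret, storage URL, table and file names are placeholders. The credential and external data source are one-time setup, while the BULK INSERT is repeatable per file:

```python
# The BULK INSERT-from-blob pattern from the linked article, executed via pyodbc.
# SAS secret, storage URL, table and file names are placeholders.
import pyodbc

statements = [
    # One-time setup: credential + external data source pointing at the container.
    """CREATE DATABASE SCOPED CREDENTIAL BlobCredential
       WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
            SECRET = '<sas-token-without-leading-question-mark>'""",
    """CREATE EXTERNAL DATA SOURCE StagingBlob
       WITH (TYPE = BLOB_STORAGE,
             LOCATION = 'https://<storageaccount>.blob.core.windows.net/<container>',
             CREDENTIAL = BlobCredential)""",
    # The actual load, repeated per incoming file.
    """BULK INSERT dbo.StagingTable
       FROM 'incoming/data.csv'
       WITH (DATA_SOURCE = 'StagingBlob', FORMAT = 'CSV', FIRSTROW = 2)""",
]

with pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=<server>;DATABASE=<db>",
                    autocommit=True) as conn:
    for stmt in statements:
        conn.execute(stmt)
```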
Then you can use ADF (no IR) or Databricks or Azure Batch or Azure Elastic Jobs to schedule the execution.

Error in SSIS Packages loading data into azure data warehouse

We have some SSIS packages loading data into Azure SQL Data Warehouse from CSV files. All the data flow tasks inside the packages are configured for parallel processing.
Recently, the packages started failing with the following error:
Failed to copy to SQL Data Warehouse from blob storage. 110802;An internal DMS error occurred that caused this operation to fail. Details: Exception: System.NullReferenceException, Message: Object reference not set to an instance of an object.
When we run the package manually (running each DFT individually), it runs fine. When we run the package manually as it is (with parallel processing), the same error occurs.
Can anyone help find the root cause of this issue?
I believe this problem may occur if multiple jobs try to access the same file at exactly the same time.
You may need to check whether one CSV file is the source for multiple SSIS packages; if so, you may need to change your approach.
When one package is reading a CSV file, it locks that file so that other jobs can't modify it.
To get rid of this problem, run the DFTs that use the same CSV source sequentially, and keep the other DFTs in parallel as they are.
IMHO it's a mistake to use the SSIS Data Flow to insert data into Azure SQL Data Warehouse. There were problems with the drivers early on which made performance horrendously slow, and even though these may now have been fixed, the optimal method for importing data into Azure SQL Data Warehouse is PolyBase. Place your CSV files into blob store or Data Lake, then reference those files using PolyBase and external tables. Optionally, then import the data into internal tables using CTAS, e.g. pseudocode:
csv -> blob store -> polybase -> external table -> CTAS to internal table
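As a sketch of those PolyBase and CTAS steps against a dedicated SQL pool; the object names, location and file format are placeholders, and a database scoped credential would additionally be needed for secured storage:

```python
# polybase -> external table -> CTAS, run via pyodbc against Azure SQL DW /
# Synapse dedicated SQL pool. All object names and locations are placeholders.
import pyodbc

statements = [
    """CREATE EXTERNAL FILE FORMAT CsvFormat
       WITH (FORMAT_TYPE = DELIMITEDTEXT,
             FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2))""",
    """CREATE EXTERNAL DATA SOURCE BlobLanding
       WITH (TYPE = HADOOP,
             LOCATION = 'wasbs://<container>@<storageaccount>.blob.core.windows.net')""",
    # The external table is only metadata over the CSV files in blob storage.
    """CREATE EXTERNAL TABLE ext.Sales
       (SaleId INT, Amount DECIMAL(18,2), SaleDate DATE)
       WITH (LOCATION = '/sales/', DATA_SOURCE = BlobLanding, FILE_FORMAT = CsvFormat)""",
    # CTAS materialises the data into an internal, distributed table.
    """CREATE TABLE dbo.Sales
       WITH (DISTRIBUTION = HASH(SaleId))
       AS SELECT * FROM ext.Sales""",
]

with pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=<server>;DATABASE=<dw>",
                    autocommit=True) as conn:
    for stmt in statements:
        conn.execute(stmt)
```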
If you must use SSIS, consider using only the Execute SQL Task in more of an ELT-type approach, or use the Azure SQL DW Upload Task, which is part of the Azure Feature Pack for SSIS, available from here.
Work through this tutorial for a closer look at this approach:
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/design-elt-data-loading