How to copy IoT Hub stored blobs to an Azure SQL database using Data Factory

We are using the IoT Hub routing feature to store messages into an Azure Blob container. By default it stores the messages in a hierarchical manner - creating a folder structure for year, month, day and so on. Within the folder for each day, it creates multiple block blob binary files. Each file may contain multiple JSON objects, each representing a unique IoT telemetry message.
How can I use Azure Data Factory to copy each of these messages into an Azure SQL database?
[Screenshot from Azure Storage Explorer]
[Sample blob file containing multiple messages]

It seems that all the files have the same JSON schema, so you could follow these steps.
I created a folder named csv in my container, containing several .csv files that hold JSON data:
Source dataset: the data in the .csv files is JSON, so I chose the JSON format dataset.
Choose the container: test
Import the schema (.json).
Source settings: use a wildcard file path to pick up all the folders and files in the container.
Sink settings: the Azure SQL database table.
Mapping: map the imported JSON fields to the sink table columns.
Run the pipeline and check the result in the sink table.
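If you want to sanity-check the same flow outside Data Factory, here is a minimal Python sketch under the same assumption (each blob holds one JSON telemetry message per line). The container name, connection strings, and the dbo.Telemetry table with its columns are placeholders, not anything from the original post.

```python
import json
import pyodbc
from azure.storage.blob import ContainerClient

# Placeholder connection details -- replace with your own.
STORAGE_CONN_STR = "<storage-connection-string>"
SQL_CONN_STR = "Driver={ODBC Driver 18 for SQL Server};Server=<server>;Database=<db>;Uid=<user>;Pwd=<pwd>"

container = ContainerClient.from_connection_string(STORAGE_CONN_STR, container_name="test")

with pyodbc.connect(SQL_CONN_STR) as conn:
    cursor = conn.cursor()
    # Walk every blob that IoT Hub routing wrote (year/month/day/... hierarchy).
    for blob in container.list_blobs():
        data = container.download_blob(blob.name).readall().decode("utf-8")
        # Assume one JSON telemetry message per line inside each block blob.
        for line in filter(None, data.splitlines()):
            msg = json.loads(line)
            # Hypothetical target table and columns -- adjust to your schema.
            cursor.execute(
                "INSERT INTO dbo.Telemetry (DeviceId, Temperature, EventTime) VALUES (?, ?, ?)",
                msg.get("deviceId"), msg.get("temperature"), msg.get("eventTime"),
            )
    conn.commit()
```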

Related

How to write to Blob Storage from Azure SQL Server using T-SQL?

I'm creating a stored procedure that gets executed when a CSV is uploaded to Blob Storage. The file is then processed using T-SQL, and I want to write the result to a file.
I have been able to read the file and process it using DATA_SOURCE, a database scoped credential and an external data source. However, I'm stuck on writing the output back to a different blob container. How would I do this?
If it were me, I'd use Azure Data Factory: you can create a pipeline that is triggered when a file is added to a blob container, have it import that file, run the stored procedure, and export the results to a blob.
Alternatively, that could be an Azure Function that is triggered on changes to a blob container.
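As a rough sketch of what such a function or custom job could do (not an official pattern), here is one way in Python to run the stored procedure and push its result set to a different container. The procedure name dbo.ProcessCsv, the output container, and both connection strings are hypothetical.

```python
import csv
import io
import pyodbc
from azure.storage.blob import BlobClient

SQL_CONN_STR = "Driver={ODBC Driver 18 for SQL Server};Server=<server>;Database=<db>;Uid=<user>;Pwd=<pwd>"
STORAGE_CONN_STR = "<storage-connection-string>"

def export_results(input_blob_name: str) -> None:
    """Run the processing proc for the uploaded CSV and write its result set to another container."""
    with pyodbc.connect(SQL_CONN_STR) as conn:
        cursor = conn.cursor()
        # dbo.ProcessCsv is a hypothetical stand-in for your stored procedure.
        cursor.execute("EXEC dbo.ProcessCsv @BlobName = ?", input_blob_name)
        rows = cursor.fetchall()
        columns = [col[0] for col in cursor.description]

    # Serialise the result set as CSV in memory.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(columns)
    writer.writerows(rows)

    # Write to a different container ("output") than the one the trigger watches.
    blob = BlobClient.from_connection_string(
        STORAGE_CONN_STR, container_name="output", blob_name=f"processed-{input_blob_name}"
    )
    blob.upload_blob(buffer.getvalue(), overwrite=True)
```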

Excel into Azure Data Factory into SQL

I read a few threads on this but noticed most are outdated, since Excel only became a supported integration in 2020.
I have a few Excel files stored in Dropbox. I would like to automate extracting that data into Azure Data Factory, perform some ETL with data coming from other sources, and finally push the final, complete table to Azure SQL.
What is the most efficient way of doing so?
Would it be to automate a Logic App to extract the .xlsx files into Azure Blob Storage, use Data Factory for ETL, join with other SQL tables, and finally push the final table to Azure SQL?
Appreciate it!
Before using a Logic App to extract the Excel files, review the known issues and limitations of the Excel connectors.
If you are importing large files with a Logic App, depending on their size, also consider this thread: logic apps vs azure functions for large files.
To summarize the approach, these are the steps:
Step 1: Use an Azure Logic App to upload the Excel files from Dropbox to Blob Storage.
Step 2: Create a Data Factory pipeline with a Copy Data activity.
Step 3: Use the Blob Storage service as the source dataset.
Step 4: Create the SQL database with the required schema.
Step 5: Do the schema mapping.
Step 6: Finally, use the SQL database table as the sink.
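If you want to prototype steps 3 to 6 before wiring up the full pipeline, a small Python sketch of the blob-to-SQL leg could look like the following. The blob names, the column renames, and the Sales table are made-up placeholders, and pandas.read_excel needs the openpyxl package for .xlsx files.

```python
import io
import pandas as pd
from azure.storage.blob import BlobClient
from sqlalchemy import create_engine

STORAGE_CONN_STR = "<storage-connection-string>"
SQL_URL = "mssql+pyodbc://<user>:<pwd>@<server>/<db>?driver=ODBC+Driver+18+for+SQL+Server"

# Step 3 equivalent: read the workbook that the Logic App dropped into blob storage.
blob = BlobClient.from_connection_string(
    STORAGE_CONN_STR, container_name="excel-landing", blob_name="sales.xlsx"
)
workbook_bytes = blob.download_blob().readall()
df = pd.read_excel(io.BytesIO(workbook_bytes))  # requires openpyxl for .xlsx

# Steps 4-6 equivalent: rename columns to match the SQL schema and load the table.
df = df.rename(columns={"Order Date": "OrderDate", "Amount": "SaleAmount"})
engine = create_engine(SQL_URL)
df.to_sql("Sales", engine, schema="dbo", if_exists="append", index=False)
```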

How to execute a Databricks notebook when multiple files are loaded to ADLS

I'm looking for a lightweight way of executing a Databricks notebook that depends on multiple files having been loaded to Azure Data Lake Storage.
Multiple different ADF packages load different files into ADLS, which are then processed by Databricks notebooks. Some of the notebooks depend on multiple files from different packages.
A single file is simple enough with an event trigger. Can this be generalised to more than one file without something like Airflow handling the dependencies?
This isn't exactly light since you'll have to provision an Azure SQL table, but this is what I'd do:
I would create and store a JSON file in ADLS that details each notebook/pipeline and its file name dependencies.
I'd then provision an Azure SQL table to store the metadata of each of these files. Essentially, this table will have the following columns:
General file name, which matches the file name dependencies in step #1 (e.g. FileName)
Real file name (e.g. FileName_20201007.csv)
Timestamp
Flag (boolean) indicating whether the file is present
Flag (boolean) indicating whether the file has been processed (i.e. its dependent Databricks notebook has run)
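As one possible shape for that watermark table (the column names here are hypothetical, and you could just as well run the DDL directly in the database), a quick Python/pyodbc sketch:

```python
import pyodbc

SQL_CONN_STR = "Driver={ODBC Driver 18 for SQL Server};Server=<server>;Database=<db>;Uid=<user>;Pwd=<pwd>"

# Hypothetical watermark table mirroring the columns listed above.
DDL = """
IF OBJECT_ID('dbo.FileWatermark', 'U') IS NULL
CREATE TABLE dbo.FileWatermark (
    GeneralFileName NVARCHAR(200) NOT NULL,  -- matches the dependency names in the JSON file
    RealFileName    NVARCHAR(400) NOT NULL,  -- e.g. FileName_20201007.csv
    LoadedAt        DATETIME2     NOT NULL,
    IsPresent       BIT           NOT NULL DEFAULT 1,
    IsProcessed     BIT           NOT NULL DEFAULT 0
);
"""

with pyodbc.connect(SQL_CONN_STR, autocommit=True) as conn:
    conn.cursor().execute(DDL)
```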
To populate the table in step #2, I'd use an Azure Logic App that watches for a blob meeting your criteria being created and then updates or creates an entry in the Azure SQL table.
See:
https://learn.microsoft.com/en-us/azure/connectors/connectors-create-api-azureblobstorage &
https://learn.microsoft.com/en-us/azure/connectors/connectors-create-api-sqlazure
You'll need to ensure that at the end of the ADF pipeline/Databricks notebook run, you update the Azure SQL flags of the respective dependencies to indicate that these versions of the files have been processed. Your Azure SQL table will function as a 'watermark' table.
Before your pipeline triggers the Azure Databricks notebook, it will look up the JSON file in ADLS, identify the dependencies for each notebook, check whether all the dependencies are available AND not yet processed by the Databricks notebook, and only run the Databricks notebook once all these criteria are met.
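A minimal sketch of that dependency check, assuming a JSON config that maps each notebook to its GeneralFileName dependencies and the hypothetical dbo.FileWatermark table above:

```python
import json
import pyodbc

SQL_CONN_STR = "Driver={ODBC Driver 18 for SQL Server};Server=<server>;Database=<db>;Uid=<user>;Pwd=<pwd>"

def ready_to_run(notebook: str, config_path: str = "dependencies.json") -> bool:
    """Return True when every dependency of `notebook` is present and not yet processed."""
    with open(config_path) as f:
        config = json.load(f)  # e.g. {"notebook_a": ["FileName", "OtherFile"], ...}
    dependencies = config[notebook]

    with pyodbc.connect(SQL_CONN_STR) as conn:
        cursor = conn.cursor()
        placeholders = ",".join("?" * len(dependencies))
        cursor.execute(
            "SELECT COUNT(DISTINCT GeneralFileName) FROM dbo.FileWatermark "
            f"WHERE GeneralFileName IN ({placeholders}) AND IsPresent = 1 AND IsProcessed = 0",
            *dependencies,
        )
        available = cursor.fetchone()[0]
    return available == len(dependencies)

# The pipeline (or a Logic App step) would call ready_to_run("notebook_a")
# before triggering the Databricks notebook.
```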
In terms of triggering your pipeline, you could either use an Azure Logic App to do this or leverage a tumbling window trigger in ADF.

Calculate Hashes in Azure Data Factory

We have a requirement to copy files and folders from on-premises to Azure Blob Storage. Before copying the files, I want to calculate their hashes and put them in a file at the source location.
We want this to be done using Azure Data Factory. I am not finding any option in Azure Data Factory to calculate hashes for file-system objects; I am only able to get the hash for a blob once it has landed at the destination.
Can someone guide me on how this can be achieved?
You need to use data flows in Data Factory to transform the data.
In a mapping data flow you can add a column using a derived column transformation, with an expression that uses, for example, the md5() or sha2() function to produce a hash.
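Note that md5()/sha2() in a data flow hash column values within the stream rather than whole files, so if you really need a per-file hash written at the source before the copy runs, one option is a small script executed at the source. A sketch with Python's hashlib follows; the share path and manifest format are just examples.

```python
import csv
import hashlib
from pathlib import Path

SOURCE_DIR = Path(r"\\fileserver\share\to_upload")  # hypothetical on-premises path

def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through MD5 so large files don't need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Write a manifest of relative path + hash next to the files, before ADF copies the folder.
with (SOURCE_DIR / "hashes.csv").open("w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["relative_path", "md5"])
    for path in sorted(SOURCE_DIR.rglob("*")):
        if path.is_file() and path.name != "hashes.csv":
            writer.writerow([path.relative_to(SOURCE_DIR).as_posix(), file_md5(path)])
```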

SSIS sending source OLE DB data to S3 buckets in Parquet file format

My source is SQL Server and I am using SSIS to export data to S3 buckets, but now my requirement is to send the files in Parquet format.
Can you guys give some clues on how to achieve this?
Thanks,
Ven
For folks stumbling on this answer, Apache Parquet is a project that specifies a columnar file format employed by Hadoop and other Apache projects.
Unless you find a custom component or write some .NET code to do it, you're not going to be able to export data from SQL Server to a Parquet file. KingswaySoft's SSIS Big Data Components might offer one such custom component, but I have no familiarity with it.
If you were exporting to Azure, you'd have two options:
Use the Flexible File Destination component (part of the Azure feature pack), which exports to a Parquet file hosted in Azure Blob or Data Lake Gen2 storage.
Leverage PolyBase, a SQL Server feature. It lets you export to a Parquet file via the external table feature. However, that file has to be hosted in a location mentioned here. Unfortunately, S3 isn't an option.
If it were me, I'd move the data to S3 as a CSV file and then use Athena to convert the CSV to Parquet. There is a nifty article here that talks through the Athena piece:
https://www.cloudforecast.io/blog/Athena-to-transform-CSV-to-Parquet/
Net-net, you'll need to spend a little money, get creative, switch to Azure, or do the conversion in AWS.
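For what it's worth, the 'write some code' route doesn't have to be .NET. A rough Python sketch of the SQL Server to Parquet to S3 hop (the connection string, query, and bucket are placeholders) could look like this:

```python
import pandas as pd
import boto3

SQL_URL = "mssql+pyodbc://<user>:<pwd>@<server>/<db>?driver=ODBC+Driver+18+for+SQL+Server"

# Pull the source data from SQL Server (query and any chunking strategy are up to you).
df = pd.read_sql("SELECT * FROM dbo.SourceTable", SQL_URL)

# Write a local Parquet file; pandas uses pyarrow (or fastparquet) under the hood.
df.to_parquet("source_table.parquet", index=False)

# Upload the file to the target S3 bucket.
s3 = boto3.client("s3")
s3.upload_file("source_table.parquet", "<bucket-name>", "exports/source_table.parquet")
```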