Calculate Hashes in Azure Data Factory

We have a requirement to copy files and folders from on-premises to Azure Blob Storage. Before copying the files, I want to calculate their hashes and put them in a file at the source location.
We want this to be done using Azure Data Factory. I am not finding any option in Azure Data Factory to calculate hashes for file system objects. I am able to find the hash of a blob once it has landed at the destination.
Can someone guide me on how this can be achieved?

You need to use data flows in Data Factory to transform the data.
In a mapping data flow you can add a column with a Derived Column transformation, using an expression built on, for example, the md5() or sha2() function to produce a hash.
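The question also asks for the hashes to be written to a file at the on-premises source before the copy runs; writing a manifest back to the source file system is outside what a mapping data flow does, so one option is a small script scheduled on the source machine. Below is a minimal Python sketch of that source-side step (it runs outside ADF; the folder path and manifest name are hypothetical placeholders). The MD5/SHA-256 digests it writes are the same kind of values the md5() and sha2(256, ...) data flow functions produce.

```python
# Illustrative source-side hashing (runs outside ADF, e.g. as a scheduled
# script on the on-premises machine). Paths below are placeholders.
import hashlib
from pathlib import Path

SOURCE_ROOT = Path(r"\\fileserver\share\to-copy")   # hypothetical source folder
MANIFEST = SOURCE_ROOT / "hashes.txt"               # hypothetical manifest file


def file_digest(path: Path, algorithm: str = "md5") -> str:
    """Hash a file in chunks so large files don't have to fit in memory."""
    h = hashlib.new(algorithm)   # "md5" or "sha256", matching md5()/sha2(256, ...)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()


# Write one "relative path <TAB> digest" line per file under the source root.
with MANIFEST.open("w", encoding="utf-8") as out:
    for file_path in sorted(SOURCE_ROOT.rglob("*")):
        if file_path.is_file() and file_path != MANIFEST:
            out.write(f"{file_path.relative_to(SOURCE_ROOT)}\t{file_digest(file_path)}\n")
```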

Related

Azure Synapse copy from Azure SQL to Data Lake table

I want to copy data from an Azure SQL table to a Data Lake Storage account table using Synapse Analytics. In the Data Lake table I want to store the table name and max ID for the incremental load. Is this possible?
If your requirement is only to transfer the data from Azure SQL Database to a Data Lake Storage (ADLS) account and no big data analysis is required, you can simply use a Copy activity in either an Azure Data Factory (ADF) or Synapse pipeline.
ADF also allows you to perform the required transformations on your data before storing it in the destination, using a Data Flow activity.
Refer to this official tutorial to copy data from a SQL Server database to Azure Blob storage.
Now, coming to the incremental load: ADF and Synapse pipelines both provide complete built-in support for it. You need to select a column as the watermark column in your source table.
Watermark column in the source data store, which can be used to slice the new or updated records for every run. Normally, the data in this selected column (for example, last_modify_time or ID) keeps increasing when rows are created or updated. The maximum value in this column is used as a watermark.
Microsoft provides a complete step-by-step tutorial to incrementally load data from Azure SQL Database to Azure Blob storage using the Azure portal, which you can follow and adapt with appropriate changes for your use case.
Apart from the watermark technique, there are other methods you can choose to manage incremental loads. Check here.
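To make the watermark mechanics concrete, here is a minimal sketch of the queries involved, written as a Python/pyodbc script. In the actual tutorial these queries live in Lookup, Copy, and Stored Procedure activities rather than a script, and dbo.WatermarkTable, dbo.SourceTable, and LastModifyTime below are hypothetical placeholder names.

```python
# Minimal sketch of the watermark pattern. The connection string, table
# names, and column name are hypothetical placeholders; in ADF/Synapse
# this logic is implemented with pipeline activities, not a script.
import pyodbc

conn = pyodbc.connect("<azure-sql-connection-string>")  # placeholder
cursor = conn.cursor()

# 1. Old watermark: the high-water mark recorded by the previous run.
cursor.execute("SELECT WatermarkValue FROM dbo.WatermarkTable WHERE TableName = ?",
               "dbo.SourceTable")
old_watermark = cursor.fetchone()[0]

# 2. New watermark: the current maximum of the watermark column.
cursor.execute("SELECT MAX(LastModifyTime) FROM dbo.SourceTable")
new_watermark = cursor.fetchone()[0]

# 3. Delta: only rows created or updated since the last run get copied.
cursor.execute(
    "SELECT * FROM dbo.SourceTable WHERE LastModifyTime > ? AND LastModifyTime <= ?",
    old_watermark, new_watermark)
delta_rows = cursor.fetchall()   # these are the rows the Copy activity would move

# 4. Record the new watermark so the next run starts from here.
cursor.execute("UPDATE dbo.WatermarkTable SET WatermarkValue = ? WHERE TableName = ?",
               new_watermark, "dbo.SourceTable")
conn.commit()
```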

Excel into Azure Data Factory into SQL

I read a few threads on this but noticed most are outdated, with Excel becoming a supported integration in 2020.
I have a few Excel files stored in Dropbox. I would like to automate the extraction of that data into Azure Data Factory, perform some ETL with data coming from other sources, and finally push the final, complete table to Azure SQL.
I would like to ask what is the most efficient way of doing so?
Would it be best to automate a Logic App to extract the .xlsx files into Azure Blob Storage, use Data Factory for ETL, join with other SQL tables, and finally push the final table to Azure SQL?
Appreciate it!
Before using a Logic App to extract the Excel files, review the known issues and limitations of the Excel connectors.
If you are importing large files using a Logic App, then depending on the size of those files, consider this thread: logic apps vs azure functions for large files.
To summarize the approach, the steps are listed below:
Step 1: Use an Azure Logic App to upload the Excel files from Dropbox to Blob storage.
Step 2: Create a Data Factory pipeline with a Copy data activity.
Step 3: Use the Blob storage service as the source dataset.
Step 4: Create a SQL database with the required schema.
Step 5: Do the schema mapping.
Step 6: Finally, use the SQL database table as the sink.

How can I create a dynamic blob linked service?

I want to access multiple storage accounts: one SQL table should be uploaded to one blob and another SQL table to the other. For this, I have to use just one pipeline. How can I create a dynamic blob linked service?
Each storage account requires its own connection string.
One way to do this would be to create a service (for example an Azure Function) that your pipeline calls. In the function there is logic to place the correct table in the correct storage account.
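A minimal Python sketch of what that routing logic could look like inside such a function or small service, assuming it maps table names to storage-account connection strings. The table names, container name, and connection strings are hypothetical placeholders.

```python
# Sketch of routing logic: each table is exported to the storage account
# mapped to it. All names and connection strings are placeholders.
import csv
import io

import pyodbc
from azure.storage.blob import BlobServiceClient

# Hypothetical mapping: which table goes to which storage account.
ROUTING = {
    "dbo.Customers": "<connection-string-for-storage-account-1>",
    "dbo.Orders": "<connection-string-for-storage-account-2>",
}

SQL_CONN_STR = "<azure-sql-connection-string>"   # placeholder


def export_table_to_blob(table_name: str, container: str = "exports") -> None:
    """Read a table from Azure SQL and upload it as CSV to the storage
    account mapped to that table."""
    # Dump the table to an in-memory CSV.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    with pyodbc.connect(SQL_CONN_STR) as conn:
        cursor = conn.cursor()
        cursor.execute(f"SELECT * FROM {table_name}")  # table name comes from trusted config
        writer.writerow([col[0] for col in cursor.description])
        writer.writerows(cursor.fetchall())

    # Pick the storage account based on the routing table and upload the blob.
    blob_service = BlobServiceClient.from_connection_string(ROUTING[table_name])
    blob_client = blob_service.get_blob_client(container=container,
                                               blob=f"{table_name}.csv")
    blob_client.upload_blob(buffer.getvalue(), overwrite=True)


if __name__ == "__main__":
    for table in ROUTING:
        export_table_to_blob(table)
```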

Quickest way to import a large (50 GB) CSV file into an Azure database

I've just consolidated 100 CSV files into a single monster file with a total size of about 50 GB.
I now need to load this into my Azure database. Given that I have already created my table in the database, what would be the quickest method for me to get this single file into the table?
The methods I've read about include: Import Flat File, Blob storage/Data Factory, and BCP.
I'm looking for the quickest method someone can recommend, please.
Azure Data Factory should be a good fit for this scenario, as it is built to process and transform data without you worrying about the scale.
Assuming that the large CSV file is stored somewhere on disk and you do not want to move it to any external storage (to save time and cost), it would be better to simply create a self-hosted integration runtime pointing to the machine hosting your CSV file and create a linked service in ADF to read the file. Once that is done, simply ingest the file and point it to the sink, which is your Azure SQL database.
https://learn.microsoft.com/en-us/azure/data-factory/connector-file-system

Create single Azure Analysis Services table from many blobs in Data Lake Store

I'm new to Analysis Services and Data Lake, working on a POC. I've used Data Factory to pull in some TSV data from blob storage, which is logically organized as small "partition" blobs (thousands of blobs). I have a root folder that can be thought of as containing the whole table, containing subfolders that logically represent partitioning by, say, customer; these contain subfolders that logically represent partitioning the customer's data by, say, date. I want to model this whole folder/blob structure as one table in Analysis Services, but can't seem to figure out how. I have seen the blog posts and examples that create a single AAS table from a single ADLS file, but information on other data file layouts seems sparse. Is my approach to this wrong, or am I just missing something obvious?
This blog post provides instructions on appending multiple blobs into a single table.
Then the part 3 blog post describes creating some Analysis Services partitions to improve processing performance.
Finally this blog post describes connecting to Azure Data Lake Store (as opposed to Azure Blob Storage in the prior posts).
I would use those approaches to create, say, 20-200 partitions (not thousands) in Azure Analysis Services. Partitions should generally contain at least 8 million rows to get optimal compression and performance. I assume that will require appending several blobs together in order to achieve that size.