Add metadata to a data lake file using ADF

Azure Data Factory v2 has a Get Metadata activity which can read the metadata of files stored in ADLS, and it can preserve that metadata when it moves or copies the files.
But is there a way to add or modify metadata on the lake files using ADF?

Yes, there is a way.
You can make use of the Azure Blob Storage REST API, specifically the Set Blob Metadata operation.
Data Lake Storage is just an extension of the underlying Blob Storage engine.
So you can hook up a Web activity in your pipeline and call the REST API against your blob, and it will set the metadata for you.
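For illustration, here is a minimal sketch of the same operation done with the azure-storage-blob Python SDK instead of a raw Web activity call; the connection string, container and blob names are placeholders, not values from the question.

from azure.storage.blob import BlobClient

# Point at the target file (blob) in the lake; all names here are placeholders.
blob = BlobClient.from_connection_string(
    conn_str="<storage-connection-string>",
    container_name="my-filesystem",
    blob_name="raw/2020/01/sales.csv",
)

# Set Blob Metadata replaces all user-defined metadata, so merge with what is
# already there before writing it back.
metadata = blob.get_blob_properties().metadata or {}
metadata.update({"source": "adf-pipeline", "processed": "true"})
blob.set_blob_metadata(metadata=metadata)

From ADF itself, the equivalent is a Web activity issuing a PUT against the blob URL with ?comp=metadata and x-ms-meta-* headers, which is what the SDK call above does under the hood.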

The metadata is created by Data Lake (or Blob Storage) once the files are uploaded.
These properties cannot be changed unless you delete the files and re-upload them to Data Lake (or Storage). Others have asked the same question about how to change this metadata on Stack Overflow; you can easily find those threads by searching.
However, if you modify the content of a file in the Data Lake, such as adding or deleting columns, properties like size, columnCount and structure will change.
So for the question "is there a way to add or modify metadata on the lake files using ADF?", the answer is no, there isn't.
HTH.

Related

How can I export data from an Azure Table Storage table to a .csv file in .NET Core C#

Is there an Azure API to import/export an existing collection from Azure Table Storage as .csv?
The Table Storage REST API does not return CSV directly, so it is always necessary to transform the data yourself, as Azure Storage Explorer does, for example, using an older version of azcopy (v7.3).
I've built a small C# library that basically does the same thing. It currently caches all rows in memory to build the CSV headers, so that is something to be aware of.
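As a rough Python counterpart to that approach (a sketch, not the library's actual code; the connection string, table name and output file are placeholders), using the azure-data-tables package:

import csv
from azure.data.tables import TableClient

table = TableClient.from_connection_string(
    "<storage-connection-string>", table_name="customers"
)

# Like the library above, this caches every row in memory so it can build the
# full CSV header from the union of all property names.
rows = [dict(entity) for entity in table.list_entities()]
headers = sorted({key for row in rows for key in row})

with open("customers.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=headers)
    writer.writeheader()
    writer.writerows(rows)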

Calculate Hashes in Azure Data Factory

We have a requirement to copy files and folders from on-premises to Azure Blob Storage. Before copying the files, I want to calculate their hashes and put them in a file at the source location.
We want this to be done using Azure Data Factory, but I am not finding any option in Azure Data Factory to calculate hashes for file-system objects. I am able to find the hashes for a blob once it has landed at the destination.
Can someone guide me on how this can be achieved?
You need to use data flows in Data Factory to transform the data.
In a mapping data flow you can simply add a column using a Derived Column transformation with an expression that uses, for example, the md5() or sha2() function to produce a hash.
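If a whole-file hash at the source is what you actually need (a derived column in a data flow hashes values row by row), a small script outside ADF can produce it. A minimal Python sketch using hashlib, with a placeholder file name:

import hashlib

def file_hashes(path, chunk_size=1 << 20):
    # Read the file in chunks and return (md5, sha256) hex digests.
    md5, sha256 = hashlib.md5(), hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()

print(file_hashes("example.csv"))  # hypothetical file at the source location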

Quickest way to import a large (50 GB) csv file into an Azure database

I've just consolidated 100 csv files into a single monster file with a total size of about 50 GB.
I now need to load this into my Azure SQL database. Given that I have already created the table in the database, what would be the quickest method to get this single file into that table?
The methods I've read about include: Import Flat File, Blob Storage/Data Factory, and BCP.
Which is the quickest method that someone can recommend, please?
Azure Data Factory should be a good fit for this scenario, as it is built to process and transform data without you having to worry about scale.
Assuming the large csv file is stored somewhere on disk and you do not want to move it to any external storage first (to save time and cost), it would be better to simply create a self-hosted integration runtime pointing at the machine hosting your csv file and create a linked service in ADF to read the file. Once that is done, simply ingest the file and point it at the sink, which is your Azure SQL database.
https://learn.microsoft.com/en-us/azure/data-factory/connector-file-system

BigQuery - load a data source in Google BigQuery

I have a MySQL DB in AWS. Can I use that database as a data source in BigQuery?
At the moment I am going with a CSV upload to a Google Cloud Storage bucket and loading the data from there.
I would like to keep it synchronised by pointing at the data source itself, rather than loading it every time.
You can create a permanent external table in BigQuery that is connected to Cloud Storage. BigQuery is then just the query interface while the data resides in GCS. It can be connected to a single CSV file, and you are free to update or overwrite that file; I am not sure, though, whether you can link BigQuery to a directory full of CSV files or even a tree of directories.
Anyway, have a look here: https://cloud.google.com/bigquery/external-data-cloud-storage
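A minimal sketch of creating such a permanent external table with the google-cloud-bigquery Python client (project, dataset, bucket and paths are placeholders; a wildcard in the source URI covers multiple CSV files under one prefix):

from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/mysql-export/*.csv"]
external_config.autodetect = True              # or supply an explicit schema
external_config.options.skip_leading_rows = 1  # skip the CSV header row

table = bigquery.Table("my-project.my_dataset.mysql_data")
table.external_data_configuration = external_config
client.create_table(table)  # queries now read straight from the files in GCS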

Storing files in an Azure SQL database

I have a vb.net based application which references an Azure SQL Database, and I have set up a storage account to which I would like to store files from the application. I am not sure how to create the link between the DB and the storage account.
Going through the "SQL Server Data Files in Windows Azure Storage service" tutorial, I cannot create a URI for the storage blob. Using Azure Storage Explorer I select my container, go into security and generate a signature, which all works fine. When I test the URI with the "Test in Browser" button I get this error:
<Error>
<Code>AuthenticationFailed</Code>
<Message>
Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. RequestId:22ab2830-0001-001d-67a0-5460bb000000 Time:2014-10-17T14:06:11.9864269Z
</Message>
<AuthenticationErrorDetail>
Signature did not match. String to sign used was r 2014-10-17T06:00:00Z 2014-10-25T06:00:00Z /macrocitrus/$root 2014-02-14
</AuthenticationErrorDetail>
</Error>
I have no idea what this means. I am a completely new user of Windows Azure, so I am not even sure that I am on the right track.
Is there any documentation that actually explains the steps, or what one would require, to allow storage access from an SQL DB to an Azure Storage account?
I would not recommend saving the binary content in SQL Database. Instead, I would recommend that you save it in blob storage. Here are my reasons for doing so:
Blob storage is designed for that purpose.
Storing data in blob storage is much, much cheaper than storing it in SQL Database.
By storing binary data alongside your other data, you unnecessarily make your data access layer bulkier, as all of that data will be streamed through your database.
The general approach in these kinds of scenarios is to keep the binary data in blob storage as blobs (think of blobs as files in the cloud). Since each blob gets a unique URL, you can just store that URL in your SQL Database table. So with this approach, you first upload the blob to blob storage, get its URL, and then update the database.
If you search for uploading files to blob storage, I am sure you will find a lot of examples with source code (so I will not bother providing it here :); I hope that's all right).
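That said, a minimal sketch of the upload-then-store-the-URL flow (hypothetical names throughout, using the azure-storage-blob and pyodbc Python packages):

import pyodbc
from azure.storage.blob import BlobClient

# Upload the file as a blob; container and blob names are placeholders.
blob = BlobClient.from_connection_string(
    conn_str="<storage-connection-string>",
    container_name="attachments",
    blob_name="invoices/invoice-1001.pdf",
)
with open("invoice-1001.pdf", "rb") as data:
    blob.upload_blob(data, overwrite=True)

# Store only the blob's URL in the database, not the binary content.
conn = pyodbc.connect("<azure-sql-connection-string>")
conn.execute(
    "INSERT INTO dbo.Attachments (InvoiceId, BlobUrl) VALUES (?, ?)",
    1001, blob.url,
)
conn.commit()
conn.close()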
Now, coming to the error you're getting: the link you created using Azure Storage Explorer is known as a Shared Access Signature (SAS) URL, which grants time-limited, permission-bound access to your storage account. Azure Storage Explorer gave you a SAS URL for the container. There are two ways you can use that URL (assuming you granted the Read and List permissions when creating it):
To list blobs in that container, just append restype=container&comp=list to your URL and then paste it in the browser and you will see an XML listing of all blobs.
To download a blob, you would need to insert the name of the blob in the URL. So if your URL is like https://[youraccount].blob.core.windows.net/[yourcontainer]?[somestuffhere] and your blob name is myawesomepicture.png, your SAS URL for viewing the file in the browser would be https://[youraccount].blob.core.windows.net/[yourcontainer]/myawesomepicture.png?[somestuffhere]
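Both of these can be exercised with a few lines of Python (a rough illustration, not part of the original answer; the account, container and blob names are placeholders):

import requests

# Container-level SAS URL as produced by Azure Storage Explorer (placeholder).
sas_url = "https://myaccount.blob.core.windows.net/mycontainer?<sas-token>"
base, query = sas_url.split("?", 1)

# 1. List the blobs in the container (requires the List permission).
list_url = base + "?restype=container&comp=list&" + query
print(requests.get(list_url).text)   # XML listing of all blobs

# 2. Download a specific blob by inserting its name into the path (requires Read).
blob_url = base + "/myawesomepicture.png?" + query
print(requests.get(blob_url).status_code)  # 200 if the signature and permissions are valid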
I wrote a blog post on using Shared Access Signatures which you may find useful: http://gauravmantri.com/2013/02/13/revisiting-windows-azure-shared-access-signature/.