Azure Data Lake file properties and/or checksum - azure-data-lake

I'm trying to write a process that will skip execution of some processing jobs based upon whether data in the file has changed, and I would like to do this via checksum. Is there any way (currently or on the roadmap) to give visibility into a file's MD5 checksum or similar?
Alternatively, can I tag a file with a "property" such as a checksum of the file?
Thanks!

For Azure Data Lake Storage Gen1, unfortunately there is no such capability available.
Such capabilities will be available through ADLS Gen2, since it is built on Blob Storage - https://azure.microsoft.com/en-us/services/storage/data-lake-storage/ .
Thanks,
Sachin Sheth
Program Manager, ADLS.
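For readers on Gen2, where properties and user-defined metadata are surfaced through the Blob endpoint, a minimal sketch with the azure-storage-blob Python client might look like this (the connection string, container, and blob names are placeholders):

```python
# Minimal sketch: reading a blob's Content-MD5 and tagging it with custom metadata.
# Assumes the azure-storage-blob (v12) Python SDK and an account reachable through
# the Blob endpoint; connection string, container, and blob names are placeholders.
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<connection-string>",   # placeholder
    container_name="my-container",    # placeholder
    blob_name="data/input.csv",       # placeholder
)

props = blob.get_blob_properties()
stored_md5 = props.content_settings.content_md5  # may be None if not set at upload time
print("Content-MD5:", stored_md5)

# Attach a checksum (or any marker) as user-defined metadata for later comparison.
blob.set_blob_metadata({"checksum": "abc123"})   # hypothetical value
```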

Related

Can I make a BigQuery load job only read storage blob that match a metadata version?

We create files ("blobs") in Google Cloud Storage and instruct BiqQuery load jobs to load them into a table. The blobs are kept in a shared bucket and there are concurrent jobs loading into target tables. We would like to make sure that one job is on loading blobs that another job is loading.
Our idea is to use the metadata support of Google Cloud Storage to manage what blobs are meant to be loaded by which job. Meta data is easy to modify (easier than for example rename the blob) so it is good for state management.
In the cloud storage API there is support for metadata versioning, e.g. you can make storage operations conditional on a specific version of the blob. It is well described here https://cloud.google.com/storage/docs/generations-preconditions , see the if-generation-match precondition.
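For context, a minimal sketch of what that precondition looks like on the Cloud Storage side, assuming the google-cloud-storage Python client (bucket, object, and metadata values are placeholders):

```python
# Minimal sketch of an if-generation-match precondition on the Cloud Storage side.
# Assumes the google-cloud-storage Python client; bucket/object names and the
# metadata marker are placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("shared-load-bucket")       # placeholder
blob = bucket.get_blob("exports/part-0001.avro")   # placeholder

generation = blob.generation  # current version (generation) of the object

# Update metadata only if nobody has replaced the object in the meantime;
# if it changed, the patch fails with a 412 precondition error.
blob.metadata = {"claimed_by": "load-job-42"}      # hypothetical marker
blob.patch(if_generation_match=generation)
```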
I tried to find corresponding support in the BigQuery load job (https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad) but could not find it. Does anyone know whether this kind of metadata-version conditional load is supported in the BigQuery load API?

Can Azure Data Factory read data from Delta Lake format?

We were able to read the files by specifying the delta file source as a parquet dataset in ADF. Although this reads the delta file, it ends up reading all versions/snapshots of the data in the delta file instead of picking up only the most recent version of the data.
There is a similar question here - Is it possible to connect to databricks deltalake tables from adf
However, I am looking to read the delta file from an ADLS Gen2 location. Appreciate any guidance on this.
I don't think you can do it as easily as reading from Parquet files today, because Delta Lake files are basically transaction log files plus snapshots in Parquet format. Unless you VACUUM every time before you read from a Delta Lake directory, you are going to end up reading all the snapshot data, as you have observed.
Delta Lake files do not play very nicely OUTSIDE OF Databricks.
In our data pipeline, we usually have a Databricks notebook that exports data from Delta Lake format to regular Parquet format in a temporary location. We let ADF read the Parquet files and do the clean up once done. Depending on the size of your data and how you use it, this may or may not be an option for you.
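A minimal sketch of that export step, assuming a Databricks notebook where spark is the preconfigured SparkSession and the Delta Lake reader is available (both paths are placeholders):

```python
# Minimal sketch of the Delta -> Parquet export step described above.
# Assumes a Databricks notebook with `spark` preconfigured and Delta Lake
# available; both storage paths below are placeholders.
delta_path = "abfss://data@myaccount.dfs.core.windows.net/delta/events"      # placeholder
staging_path = "abfss://data@myaccount.dfs.core.windows.net/staging/events"  # placeholder

# Reading the delta format returns only the latest snapshot of the table.
df = spark.read.format("delta").load(delta_path)

# Write plain Parquet to a temporary location for ADF to pick up.
df.write.mode("overwrite").parquet(staging_path)
```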
Time has passed and now ADF Delta support for Data Flow is in preview... hopefully it makes it into ADF native soon.
https://learn.microsoft.com/en-us/azure/data-factory/format-delta

Automatically detect changes in GCS for BigQuery

I have a BigQuery table whose data source is a bucket in GCS (Google Cloud Storage).
The bucket changes constantly, with new files being added. Is there any available mechanism for BigQuery to automatically detect changes in GCS and sync with the latest data?
Thanks!
There is a very cool beta feature you can use to do that. Check out BigQuery Cloud Storage Transfer. You can schedule transfers, run backfills, and much more.
Read "limitations" to see if it can work for you.

Any storage service like Amazon S3 which allows upload/download at the same time on a large file

My requirement is to upload a large file (35 GB) and, while the upload is still in progress, start downloading the same file. Is there any storage service that supports this and can be used from a .NET application?
Amazon S3 will not allow simultaneous upload and download of the same object.
You could use Microsoft Azure Storage Page or Append Blobs to solve this:
1) Begin uploading the large data
2) Concurrently download small ranges of data (no greater than 4MB so the client library can read it in one chunk) that have already been written to.
Page Blobs need to be 512-byte aligned and can be read and written in a random-access pattern, whereas Append Blobs need to be written sequentially in an append-only pattern.
As long as you're reading sections that have already been written to, you should have no problems. Check out the Blob getting started doc: https://azure.microsoft.com/en-us/documentation/articles/storage-dotnet-how-to-use-blobs/ and some info about blob types: https://msdn.microsoft.com/library/azure/ee691964.aspx
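The question targets .NET, but to illustrate the ranged-read pattern, here is a minimal sketch with the azure-storage-blob Python client (the .NET SDK exposes equivalent operations); connection string, container, and blob names are placeholders:

```python
# Minimal sketch of reading committed ranges of an append blob while it is still
# being written, using the azure-storage-blob Python client for illustration;
# the .NET SDK exposes the same operations. Names below are placeholders.
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<connection-string>",  # placeholder
    container_name="uploads",        # placeholder
    blob_name="bigfile.bin",         # placeholder
)

# Writer side (elsewhere): append-only uploads, block by block.
# blob.create_append_blob()
# blob.append_block(chunk)

# Reader side: pull a 4 MB range that has already been committed.
committed_size = blob.get_blob_properties().size
stream = blob.download_blob(offset=0, length=min(4 * 1024 * 1024, committed_size))
data = stream.readall()
```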
And feel free to contact us with any follow up questions.

Moving azure storage containers from one blob to another

Hello, I have two blobs in my account:
Blob1
Blob2
Blob2 is empty. How can I take all the containers from Blob1 and move them to Blob2?
I am doing this because I would like to use a different subscription to help save some money. It doesn't seem like it's possible any other way.
This is all under the same windows live account.
Thank you!
I am glad to hear that Azure Support was able to reassign your subscription. In the future, if you would like to copy Azure Storage blobs from one account to another, you can use the Copy Blob REST API. If you are using the Azure Storage Client Library, the corresponding method is ICloudBlob.StartCopyFromBlob. The Blob service copies blobs on a best-effort basis, and you can use the value of the x-ms-copy-id header to check the status of a specific copy operation.
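A minimal sketch of such a server-side copy with the current azure-storage-blob Python client, the modern counterpart of StartCopyFromBlob (connection strings and names are placeholders, and the source must be readable by the target account, for example via a SAS URL):

```python
# Minimal sketch of a server-side blob copy between two storage accounts using
# the current azure-storage-blob Python client. Connection strings and names are
# placeholders; across accounts, the source URL must be readable by the target
# (e.g. a public blob or a SAS URL).
from azure.storage.blob import BlobClient

source = BlobClient.from_connection_string(
    conn_str="<source-connection-string>",  # placeholder
    container_name="container1",            # placeholder
    blob_name="file.dat",                   # placeholder
)
target = BlobClient.from_connection_string(
    conn_str="<target-connection-string>",  # placeholder
    container_name="container1",            # placeholder
    blob_name="file.dat",                   # placeholder
)

# Kick off the asynchronous server-side copy; the returned dict includes copy_id
# (the x-ms-copy-id value), and the status can be polled via blob properties.
copy = target.start_copy_from_url(source.url)
status = target.get_blob_properties().copy.status
print(copy["copy_id"], status)
```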