Can Azure Data Factory read data from Delta Lake format? - azure-data-factory-2

We were able to read the files by specifying the Delta file source as a Parquet dataset in ADF. Although this reads the Delta files, it ends up reading all versions/snapshots of the data instead of picking up only the most recent version.
There is a similar question here - Is it possible to connect to databricks deltalake tables from adf
However, I am looking to read the delta file from an ADLS Gen2 location. Appreciate any guidance on this.

I don't think you can do it as easily as reading plain Parquet files today, because a Delta Lake directory is basically transaction log files plus data snapshots in Parquet format. Unless you VACUUM every time before you read from a Delta Lake directory, you are going to end up reading data from every snapshot, as you have observed.
Delta Lake files do not play very nicely OUTSIDE OF Databricks.
In our data pipeline, we usually have a Databricks notebook that exports data from Delta Lake format to regular Parquet format in a temporary location. We let ADF read the Parquet files and do the cleanup once done. Depending on the size of your data and how you use it, this may or may not be an option for you.
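If it helps, here is a minimal sketch of that export step as a PySpark cell in the Databricks notebook. The paths, container and account names are placeholders, and "spark" is the session Databricks provides.

    # Sketch of the Delta -> Parquet staging step; paths are placeholders.
    delta_path = "abfss://container@account.dfs.core.windows.net/delta/my_table"
    staging_path = "abfss://container@account.dfs.core.windows.net/staging/my_table"

    # The Delta reader returns only the latest snapshot, unlike pointing a
    # Parquet reader at the whole directory.
    df = spark.read.format("delta").load(delta_path)

    # Write plain Parquet for an ADF Parquet dataset to pick up; the staging
    # folder can be deleted once the copy finishes.
    df.write.mode("overwrite").parquet(staging_path)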

Time has passed, and ADF Delta support for Data Flows is now in preview... hopefully it makes it into native ADF datasets soon.
https://learn.microsoft.com/en-us/azure/data-factory/format-delta

Related

HDFS partitioning using GCS - can I overcome a mistake we made in the past?

We're using GCS as a data lake for our raw data. However, someone in the past decided to store the data in the following format:
gs://bla-bla/2022/06/28/18/file.json
Lately, we've been thinking of using BigQuery to make the data more accessible.
I couldn't figure out a way to use the format above as an HDFS-style partition layout (queries always end up scanning the entire directory).
Is there a way to configure Hadoop or a metastore to use the format we're already using? Or do we have to rewrite all the data into the proper HDFS layout?
The layout I know HDFS is looking for:
gs://bla-bla/year=2022/month=06/day=28/hour=18/file.json
Thanks

Load batch CSV Files from Cloud Storage to BigQuery and append on same table

I am new to GCP and recently created a bucket on Google Cloud Storage. Raw files are dumped into the GCS bucket every hour in CSV format.
I would like to load all the CSV files from Cloud Storage to BigQuery, with a schedule that loads the most recent files from Cloud Storage and appends the data to the same table in BigQuery.
Please help me set this up.
There are many options, but I will present only two:
You can do nothing and use an external table in BigQuery: you leave the data in Cloud Storage and ask BigQuery to query it directly from there. You don't duplicate the data (and pay less for storage), but queries are slower (the data has to be fetched from less performant storage and the CSV parsed on the fly) and every query processes all the files. You also can't use advanced BigQuery features such as partitioning, clustering and others...
Perform a BigQuery load operation to load all the existing files into a BigQuery table (I recommend partitioning the table if you can). For new files, forget the old-school scheduled ingestion process. With the cloud, you can be event driven: catch the event that notifies you of a new file on Cloud Storage and load it directly into BigQuery. You have to write a small Cloud Function for that, but it's the most efficient and most recommended pattern (see the sketch below). You can find a code sample here
Just a warning on the latter solution: you can perform "only" 1,500 load jobs per day and per table (about one per minute)
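As a rough illustration of the second option (not the code sample from the link), here is a minimal Python Cloud Function sketch triggered when a new object lands in the bucket; the dataset and table names are assumptions.

    from google.cloud import bigquery

    def load_csv_to_bq(event, context):
        # Background Cloud Function wired to the storage "object finalize" event;
        # "event" carries the bucket and object name of the new file.
        name = event["name"]
        if not name.endswith(".csv"):
            return

        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,  # assumes a header row
            autodetect=True,      # or supply an explicit schema
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        )
        uri = "gs://{}/{}".format(event["bucket"], name)
        # Placeholder table id; mind the load-job-per-table daily quota above.
        load_job = bigquery.Client().load_table_from_uri(
            uri, "my_dataset.my_table", job_config=job_config
        )
        load_job.result()  # wait so failures show up in the function logs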

Azure Data Factory - load Application Insights logs to Data Lake Gen 2

I have Application Insights configured with a retention period for logs of three months and I want to load them using Data Factory pipelines, scheduled daily, to a Data Lake Gen 2 storage.
The purpose of doing this is to not lose data after the retention period passes and to have the data stored for future purposes - Machine Learning and Reporting, mainly.
I am trying to decide what format to use for storing these data, among the many formats available in Data Lake Gen 2, so if anyone has a similar design, any information or reference to documentation would be greatly appreciated.
In my experience, most of the log files are .log files. If you want to keep the file type and move them to Data Lake Gen 2 unchanged, use the Binary format.
Binary format lets you copy all the folders/sub-folders and all the files to another destination as-is.
HTH.

Easiest way to migrate data from Aurora to S3 in Apache ORC or Apache Parquet

Athena looks nice.
To use it, at our scale, we need to make it cheaper and more performant, which would mean saving our data in ORC or Parquet formats.
What is the absolute easiest way to migrate an entire Aurora database to S3, transforming it into one of those formats?
DMS and Data Pipeline seem to get you there minus the transformation step...
The transform step can be done with Python; here is a sample: https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
See this article: http://docs.aws.amazon.com/athena/latest/ug/partitions.html
I would try DMS to initially land the data in S3 and then use the Python above.
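For what it's worth, here is a minimal PySpark sketch of that conversion step (not the linked sample itself); the bucket paths are placeholders, and the CSV options depend on how DMS wrote the files.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dms-csv-to-parquet").getOrCreate()

    # DMS full-load output to S3 is CSV by default; adjust the options to match
    # your task settings. Use s3a:// instead of s3:// outside of EMR.
    df = spark.read.csv("s3://my-dms-bucket/myschema/mytable/", inferSchema=True)

    # Write columnar Parquet that Athena can scan cheaply; consider partitioning
    # the output as described in the Athena partitions article above.
    df.write.mode("overwrite").parquet("s3://my-lake-bucket/myschema/mytable/")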

Loading data from Bigquery to google storage bucket in CSV file format

I produce a dataset in BigQuery on a daily basis which I need to export to my Google Cloud Storage bucket. The dataset is greater than 10 MB, which means I'm unable to use Apps Script.
Essentially, I'd like to automate a data load using my BigQuery script that exports the dataset as a CSV file to Google Cloud Storage.
Can anyone point me in the right direction in terms of which program/method to use? Please also share your experiences.
Thanks
Here you can find some details on how to export data from BigQuery to Cloud Storage along with a sample written in Python.
https://cloud.google.com/bigquery/exporting-data-from-bigquery
You can implement a simple application running on App Engine with a cron job scheduled to run once a day that performs the steps described in the tutorial above.
https://cloud.google.com/appengine/docs/python/config/cron
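For reference, a minimal sketch of the export call using the google-cloud-bigquery client (project, dataset, table and bucket names are placeholders); the App Engine cron handler would run something like this once a day.

    from google.cloud import bigquery

    client = bigquery.Client()

    # A wildcard URI lets BigQuery shard the output if the result is large.
    destination_uri = "gs://my-bucket/exports/my_table-*.csv"

    extract_job = client.extract_table(
        "my-project.my_dataset.my_table",  # placeholder table id
        destination_uri,
        location="US",  # must match the dataset's location
    )
    extract_job.result()  # wait for the export to finish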