Azure Data Factory - load Application Insights logs to Data Lake Gen 2

I have Application Insights configured with a retention period for logs of three months and I want to load them using Data Factory pipelines, scheduled daily, to a Data Lake Gen 2 storage.
The purpose of doing this is to not lose data after the retention period passes and to have the data stored for future purposes - Machine Learning and Reporting, mainly.
I am trying to decide which format to use for storing this data, out of the many formats available in Data Lake Gen 2, so if anyone has a similar design, any information or reference to documentation would be greatly appreciated.

In my experience, most of the log files are .log files. If you want to keep the original file type when moving them to Data Lake Gen 2, use the Binary format.
The Binary format lets you copy entire folders/sub-folders and all of their files to the destination as-is.
HTH.
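If you ever need to script this outside of Data Factory, or just want a picture of what a daily export could look like, here is a minimal Python sketch that pulls rows from the Application Insights query REST API and lands them in Data Lake Gen 2 as JSON lines. This is not the Binary-copy approach described above; the app ID, API key, account names, KQL query and paths are all placeholders.

    # Hypothetical daily export: Application Insights query REST API -> ADLS Gen 2.
    # APP_ID, API_KEY, ACCOUNT_URL, ACCOUNT_KEY and the KQL query are placeholders.
    import json
    import datetime
    import requests
    from azure.storage.filedatalake import DataLakeServiceClient

    APP_ID = "<application-insights-app-id>"
    API_KEY = "<application-insights-api-key>"
    ACCOUNT_URL = "https://<storageaccount>.dfs.core.windows.net"
    ACCOUNT_KEY = "<storage-account-key>"

    # Query yesterday's traces from the Application Insights query API.
    query = "traces | where timestamp > ago(1d)"
    resp = requests.get(
        f"https://api.applicationinsights.io/v1/apps/{APP_ID}/query",
        headers={"x-api-key": API_KEY},
        params={"query": query},
    )
    resp.raise_for_status()
    table = resp.json()["tables"][0]
    columns = [c["name"] for c in table["columns"]]
    rows = [dict(zip(columns, r)) for r in table["rows"]]

    # Write the rows as JSON lines into a date-partitioned folder in Data Lake Gen 2.
    service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=ACCOUNT_KEY)
    fs = service.get_file_system_client("appinsights")
    path = f"traces/{datetime.date.today():%Y/%m/%d}/traces.jsonl"
    payload = "\n".join(json.dumps(r) for r in rows).encode("utf-8")
    fs.get_file_client(path).upload_data(payload, overwrite=True)

Whether you script it or use a pipeline, a date-partitioned folder layout like the one above keeps the daily exports easy to query later for reporting or ML.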

Related

Load batch CSV Files from Cloud Storage to BigQuery and append on same table

I am new to GCP and recently created a bucket on Google Cloud Storage. Raw files in CSV format are dumped into the GCS bucket every hour.
I would like to load all the CSV files from Cloud Storage into BigQuery, with a schedule that loads the recent files from Cloud Storage and appends the data to the same table in BigQuery.
Please help me set this up.
There are many options, but I will present only two:
You can do nothing and use an external table in BigQuery: you leave the data in Cloud Storage and ask BigQuery to read it directly from there. You don't duplicate the data (and pay less for storage), but queries are slower (the data has to be loaded from less performant storage and the CSV parsed on the fly) and every query processes all the files. You also can't use advanced BigQuery features such as partitioning, clustering and others...
Perform a BigQuery load operation to load all the existing files into a BigQuery table (I recommend partitioning the table if you can). For new files, forget the old-school scheduled ingestion process. With the cloud, you can be event driven: catch the event that notifies you of a new file on Cloud Storage and load it directly into BigQuery. You have to write a small Cloud Function for that, but it's the most efficient and most recommended pattern (a minimal sketch follows below). You can find code sample here
Just a warning on the latter solution: you can perform "only" 1,500 load jobs per day and per table (about 1 per minute)
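For the event-driven option, a minimal sketch of such a Cloud Function in Python, triggered when an object is finalized in the bucket, could look like the following. The project, dataset and table names are placeholders, and the CSV options will likely need adjusting to your files.

    # Hypothetical Cloud Function: triggered when a new object lands in the GCS bucket,
    # appends the CSV file to an existing BigQuery table.
    from google.cloud import bigquery

    TABLE_ID = "my-project.my_dataset.raw_events"  # placeholder

    def load_csv_to_bq(event, context):
        """Background function for a google.storage.object.finalize event."""
        if not event["name"].endswith(".csv"):
            return  # ignore non-CSV objects

        uri = f"gs://{event['bucket']}/{event['name']}"
        client = bigquery.Client()
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,          # assumes a header row
            autodetect=True,              # or supply an explicit schema
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        )
        load_job = client.load_table_from_uri(uri, TABLE_ID, job_config=job_config)
        load_job.result()  # wait for the load job to complete
        print(f"Loaded {uri} into {TABLE_ID}")

Because the function fires once per new object, each hourly CSV is appended as soon as it arrives, which is what keeps you comfortably under the 1,500 load jobs per day limit mentioned above.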

Can Azure Data Factory read data from Delta Lake format?

We were able to read the files by specifying the delta file source as a Parquet dataset in ADF. Although this reads the delta file, it ends up reading all versions/snapshots of the data in the delta file instead of specifically picking up the most recent version of the delta data.
There is a similar question here - Is it possible to connect to databricks deltalake tables from adf
However, I am looking to read the delta file from an ADLS Gen2 location. Appreciate any guidance on this.
I don't think you can do it as easily as reading from Parquet files today, because the Delta Lake files are basically transaction log files plus snapshots in Parquet format. Unless you VACUUM every time before you read from a Delta Lake directory, you are going to end up reading the snapshot data, like you have observed.
Delta Lake files do not play very nicely OUTSIDE OF Databricks.
In our data pipeline, we usually have a Databricks notebook that exports data from Delta Lake format to regular Parquet format in a temporary location. We let ADF read the Parquet files and do the clean up once done. Depending on the size of your data and how you use it, this may or may not be an option for you.
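As an illustration of that export step, a Databricks notebook cell along these lines would do it; the paths are placeholders and spark is the session Databricks provides.

    # Hypothetical Databricks notebook cell: export the current version of a Delta table
    # to plain Parquet in a temporary location that ADF can read from.
    delta_path = "abfss://lake@<storageaccount>.dfs.core.windows.net/delta/my_table"      # placeholder
    staging_path = "abfss://lake@<storageaccount>.dfs.core.windows.net/staging/my_table"  # placeholder

    (spark.read
          .format("delta")
          .load(delta_path)          # reads only the current snapshot of the Delta table
          .write
          .mode("overwrite")
          .parquet(staging_path))    # plain Parquet files for ADF to pick up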
Time has passed and now ADF Delta support for Data Flow is in preview... hopefully it makes it into ADF native soon.
https://learn.microsoft.com/en-us/azure/data-factory/format-delta

What to use to serve as an intermediary data source in ETL job?

I am creating an ETL pipeline that uses a variety of sources and sends the data to BigQuery. Talend cannot handle both relational and non-relational database components in one job for my use case, so here's how I am doing it currently:
JOB 1 -- Get data from a source (SQL Server, API etc.), transform it and store the transformed data in a delimited file (text or CSV)
JOB 2 -- Use the stored transformed data from the delimited file in JOB 1 as the source, then transform it according to BigQuery and send it.
I am using a delimited text file/CSV as intermediary data storage to achieve this. Since confidentiality of the data is important and the solution also needs to be scalable to handle millions of rows, what should I use as this intermediary source? Will a relational database help? Or are delimited files good enough? Or is there anything else I can use?
PS - I am deleting these files as soon as the job finishes, but I am worried about security while the job runs, although it will run on a safe cloud architecture.
Please share your views on this.
In Data Warehousing architecture, it's usually good practice to make the staging layer persistent. Among other things, this gives you the ability to trace the data lineage back to the source, lets you reload your final model from the staging point when business rules change, and gives a full picture of the transformation steps the data went through all the way from landing to reporting.
I'd also consider changing your design and keeping the staging layer persistent under its own dataset in BigQuery rather than just deleting the files after processing.
Since this is just an operational layer for ETL/ELT and not end-user reports, you will mostly be paying only for storage.
Now, going back to your question and considering your current design, you could create a bucket in Google Cloud Storage and keep your transformation files there. It offers all the security and encryption you need, and you have full control over permissions. BigQuery works seamlessly with Cloud Storage, and you can even load a table from a Storage file straight from the Cloud Console.
All things considered, whichever direction you choose, I recommend storing the files you're using to load the table rather than deleting them. Sooner or later there will be questions/failures in your final report, and you'll likely need to trace back to the source to investigate.
In a nutshell, the process would be:
|---Extract and Transform---|----Load----|
Source ---> Cloud Storage --> BigQuery
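If you keep the intermediate files, writing them to a Cloud Storage bucket from the job is only a few lines with the Python client; the bucket name, object path and local file below are placeholders, and GCS encrypts the objects at rest by default.

    # Hypothetical staging step: upload the transformed delimited file to a GCS bucket
    # instead of deleting it after the job.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-etl-staging-bucket")            # placeholder bucket name
    blob = bucket.blob("staging/2024-01-15/orders.csv")        # placeholder object path
    blob.upload_from_filename("/tmp/orders_transformed.csv")   # local file produced by JOB 1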
I would do ELT instead of ETL: load the source data as-is and transform it in BigQuery using SQL functions.
This potentially lets you reshape the data (convert to arrays), filter out columns/rows and perform the transformation in one single SQL statement.
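As a sketch of that ELT style, the transformation can be a single SQL statement run from the Python client against a raw/staging table; the project, dataset, table and column names below are placeholders.

    # Hypothetical ELT step: transform raw staged data into the final table with one SQL
    # statement executed inside BigQuery (no intermediate files involved).
    from google.cloud import bigquery

    client = bigquery.Client()
    sql = """
    CREATE OR REPLACE TABLE `my-project.analytics.orders_clean` AS
    SELECT
      order_id,
      ANY_VALUE(PARSE_DATE('%Y-%m-%d', order_date)) AS order_date,
      ARRAY_AGG(STRUCT(item_id, quantity)) AS items   -- reshape rows into arrays
    FROM `my-project.staging.orders_raw`
    WHERE status != 'CANCELLED'                       -- filter out rows
    GROUP BY order_id
    """
    client.query(sql).result()  # wait for the transformation job to finish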

BigQuery Datawarehouse design?

In a typical HDFS environment for a data warehouse, I have seen different stages in which the data is staged and transformed, like below. I am trying to design a system on Google Cloud Platform where I can perform all these transformations. Please help.
HDFS::
Landing Zone -> Stage 1 Zone -> Stage 2 Zone
Landing Zone - for having the raw data
Stage 1 Zone - the raw data from Landing zone is transformed, and then changed to a different data format and/or denormalized and stored in Stage 1
Stage 2 Zone - Data from Stage 1 is updated into a transactional table, say HBase. If it is just time-period data, it stays in an HDFS-based Hive table
Then, reporting happens from Stage 2 (there could also be multiple zones in between for transformation)
My thought process of implementation in Google Cloud::
Landing (Google Cloud Storage) -> Stage 1 (BigQuery - hosts all time-based data) -> Stage 2 (BigQuery for time-based data / maintain Bigtable for transactional data based on key)
My questions are below:
a) Does this implementation look realistic? I am planning to use Dataflow for the reads and loads between these zones. What would be a better design, if anyone has implemented one before to build a warehouse?
b) How effective is it to use Dataflow to read from BigQuery and then update Bigtable? I have seen some Dataflow connectors for Bigtable updates here
c) Can JSON data be used as the primary format, since BigQuery supports that?
There's a solution that may fit your scenario. I would load the data to Cloud Storage, read it and do the transformation with Dataflow, then either send it back to Cloud Storage to be loaded into BigQuery afterwards and/or write it directly to Bigtable with the Dataflow connector that you mentioned.
As I mentioned before, you could send your transformed data to both databases from Dataflow. Keep in mind that BigQuery and Bigtable are both good for analytics; however, Bigtable has low-latency read and write access, while BigQuery has higher latency since it runs query jobs to gather the data.
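A skeletal Dataflow pipeline (Apache Beam, Python) for the Cloud Storage to BigQuery leg could look like the following; the bucket, table, schema and parsing logic are placeholders, and the Bigtable write is omitted here.

    # Hypothetical Dataflow pipeline: read JSON lines from Cloud Storage, transform them,
    # and write the results to a BigQuery table.
    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def to_row(line):
        """Parse one JSON line and keep only the fields the target table expects."""
        record = json.loads(line)
        return {"user_id": record["user_id"], "event": record["event"], "ts": record["ts"]}

    options = PipelineOptions()  # pass --project, --region, --runner=DataflowRunner, etc.
    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromText("gs://my-landing-bucket/events/*.json")  # placeholder
         | "Parse" >> beam.Map(to_row)
         | "Write" >> beam.io.WriteToBigQuery(
               "my-project:stage1.events",                                          # placeholder table
               schema="user_id:STRING,event:STRING,ts:TIMESTAMP",
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))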
Yes, it's a good idea, as you can load your JSON data from Cloud Storage into BigQuery directly.
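For reference, loading newline-delimited JSON from Cloud Storage into BigQuery without Dataflow is just a load job with a different source format; the bucket and table names are placeholders.

    # Hypothetical load job: newline-delimited JSON files from Cloud Storage into BigQuery.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # or provide an explicit schema
    )
    client.load_table_from_uri(
        "gs://my-landing-bucket/events/*.json",   # placeholder
        "my-project.stage1.events",               # placeholder
        job_config=job_config,
    ).result()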

Hortonworks: Hbase, Hive, etc used for which type of data

I would like to ask if anyone could tell me, or refer me to a web page that describes, all the possibilities for storing data in an Apache Hadoop cluster.
What I would like to know is: Which type of data should be stored in which "system". Under type of data I mean for example:
Live data (realtime)
Historical data
Data which is regularly accessed from an application
...
The question is not limited to HBase or Hive ("systems") but covers everything that is available under HDP.
I hope someone can point me in a direction where I can find my answer. Thanks!
I can give you an overview, but the rest you will have to read up on your own.
Let's begin with the types of data you want to store in HDFS:
Data in motion (which you denoted as real-time data).
So, how can you fetch truly real-time data? Is it even possible? The answer is no: there will always be some delay. However, we can reduce the downtime and processing time of the data, and for that we have HDF (Hortonworks Data Flow), which works with data in motion. There are many services providing real-time data streaming; take Kafka, NiFi and Storm as examples, among many more. These tools are used to process the data. You also need to store the data in such a way that you can fetch it in almost no time (~2 sec); for that we use HBase. HBase stores the data in a columnar structure.
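As a tiny illustration of that data-in-motion path (stream in, store in HBase for low-latency reads), a consumer along these lines is one way to wire it up; the topic, table, column family and host names are placeholders, and kafka-python plus happybase are just one choice among many clients.

    # Hypothetical data-in-motion sketch: consume events from Kafka and write them to HBase
    # so they can be fetched with low latency. Topic, table and host names are placeholders.
    import json
    from kafka import KafkaConsumer      # kafka-python
    import happybase                     # HBase client over the Thrift gateway

    consumer = KafkaConsumer("clickstream", bootstrap_servers=["broker1:9092"])
    connection = happybase.Connection("hbase-thrift-host")   # requires the HBase Thrift server
    table = connection.table("clicks")

    for message in consumer:
        event = json.loads(message.value)
        row_key = f"{event['user_id']}#{event['timestamp']}".encode()
        table.put(row_key, {
            b"cf:page": event["page"].encode(),
            b"cf:referrer": event.get("referrer", "").encode(),
        })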
Data at rest (Historic/Data stored for future use)
So, for storing data at rest, there are no such issues. HDP (Hortonworks Data Platform) provides the services to ingest, store and process the data. We can even integrate HDF services into HDP (prior to version 2.6), which makes it easier to process data in motion as well. Here we need databases to store large amounts of data. We are provided with HDFS (Hadoop Distributed File System), which can store any kind of data. But we don't ONLY want to store our data, we also want to fetch it in no time when it is required. How do we do that? By storing our data in a structured form, for which we are provided Hive and HBase. To store and process data on the order of terabytes, we need to run heavy processes; that is where MapReduce, YARN, Spark and Kubernetes come into the picture.
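For the data-at-rest side, one common pattern is to land raw files in HDFS and register them as a Hive table via Spark so they can be queried later; a rough sketch, with placeholder paths and table names:

    # Hypothetical data-at-rest sketch: read raw CSV files from HDFS with Spark and persist
    # them as a Hive table for later querying.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("land-to-hive")
             .enableHiveSupport()          # requires a configured Hive metastore
             .getOrCreate())

    (spark.read
          .option("header", True)
          .csv("hdfs:///landing/sales/*.csv")        # placeholder landing-zone path
          .write
          .mode("append")
          .saveAsTable("warehouse.sales_raw"))       # placeholder Hive database.table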
This is the basic idea of storing and processing data in Hadoop.
The rest you can always read up on the internet.