Azure Data Lake incremental copy task from on-prem to Data Lake Storage - azure-data-lake

I have 3 folders on an on-prem server, and each folder has several files.
My aim is to load the files from the on-prem server to Data Lake incrementally, so that once a file has been copied to Data Lake, only new files need to be moved the next time.
Thanks in advance,
Vipin Jha

Have you looked at Azure Data Factory for the data movement?
Otherwise, you will have to implement an upload process that keeps a "high watermark" recording what you have already uploaded, and then only upload files after that watermark. For example, if you upload daily, write the last uploaded day into a file and read it to determine where to start the next run. Also make sure that you organize the data in a way that makes this easy.
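A minimal sketch of that watermark approach in Python, assuming new files are detected by modification time and that upload_to_datalake() is a hypothetical placeholder for whatever upload mechanism you end up using (AzCopy, the Data Lake SDK, a Data Factory trigger, etc.):

```python
from pathlib import Path

WATERMARK_FILE = Path("last_upload.txt")   # stores the high watermark (epoch seconds)
SOURCE_FOLDERS = [Path("folder1"), Path("folder2"), Path("folder3")]  # your 3 on-prem folders

def upload_to_datalake(path: Path) -> None:
    # Placeholder: replace with your actual upload step (AzCopy call, SDK upload, etc.)
    print(f"uploading {path}")

def load_watermark() -> float:
    return float(WATERMARK_FILE.read_text()) if WATERMARK_FILE.exists() else 0.0

def save_watermark(value: float) -> None:
    WATERMARK_FILE.write_text(str(value))

def incremental_upload() -> None:
    watermark = load_watermark()
    newest = watermark
    for folder in SOURCE_FOLDERS:
        for path in folder.rglob("*"):
            if path.is_file() and path.stat().st_mtime > watermark:
                upload_to_datalake(path)
                newest = max(newest, path.stat().st_mtime)
    # Only advance the watermark after the run completes, so a failed run retries from the same point
    save_watermark(newest)

if __name__ == "__main__":
    incremental_upload()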

Related

BigQuery - Data transfer "Detected that no changes will be made to the destination table"

I use a script to generate files from an API and store them on Google Cloud Storage. Following this documentation, https://cloud.google.com/bigquery/docs/cloud-storage-transfer?hl=en_US#limitations, I've created a BigQuery table with the corresponding schema in advance and then created a Data Transfer with the following configuration:
When I run the Data Transfer the following error shows up in the logs:
Detected that no changes will be made to the destination table
I've updated some of the files, added files, deleted files, etc., and every time I get the same message. I also have other Data Transfers that work just fine with the same BigQuery instance and Cloud Storage bucket.
The only issue I found on SO, Not able to update Big query table with Transfer from a Storage file, says you need to wait 1 hour, but even after a day I get the same error.
Any idea as to what triggers BigQuery to determine that changes have (or have not) been made?
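Given the linked answer about waiting an hour, one thing worth ruling out is whether the objects' update times are still within the transfer's minimum file age. A quick check with the google-cloud-storage client (the bucket name and prefix below are placeholders):

```python
from datetime import datetime, timezone, timedelta
from google.cloud import storage  # pip install google-cloud-storage

BUCKET = "my-source-bucket"   # placeholder
PREFIX = "exports/"           # placeholder

client = storage.Client()
cutoff = datetime.now(timezone.utc) - timedelta(hours=1)

for blob in client.list_blobs(BUCKET, prefix=PREFIX):
    # Objects updated within the last hour may not be picked up by the current transfer run
    if blob.updated > cutoff:
        print(f"too new for this run: {blob.name} (updated {blob.updated})")
```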

Download a large number of files (400k) from an S3 bucket into Azure Data Lake Gen2 using Azure Data Factory

I need to download a large number of files (around 400k) from an S3 bucket. I have the paths stored in a CSV file. Some of the paths may not exist.
The two options I see are:
Use the foreach activity and somehow pass the contents of the file there. But I think this would flood my monitor pane with a huge number of runs, and it feels like it is meant for smaller pipelines.
Use the listOfFiles option, which is supported in the S3 source. The problem with this approach is that the list must be in the S3 bucket and cannot be loaded from Azure Data Lake Gen2 (if anybody knows why, please let me know as well).
I have tried the listOfFiles way, but the pipeline fails once it finds the first missing file. The fault tolerance options contain a "skip missing file" option, but it is defined as "Skip the files if it is being deleted from source store during the data movement", so it is of no use to me.
I don't want to download more files than needed, so copying the bucket as-is is not an option. How can I approach this issue with ADF? I'm looking for a solution that uses the predefined transformations; ideally I would like to not involve Azure Batch or Azure Functions for such a simple task.
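One way to work around the missing-file failure without putting Azure Batch or Functions in the pipeline itself is to pre-filter the CSV outside ADF and write a cleaned list back to the bucket for the listOfFiles setting. A rough sketch with boto3 (the bucket, keys, and CSV layout are assumptions):

```python
import csv
import boto3
from botocore.exceptions import ClientError

BUCKET = "my-source-bucket"          # assumption
PATHS_CSV = "paths.csv"              # local copy of the CSV, one object key per row (assumed layout)
LIST_KEY = "filelists/existing.txt"  # where the cleaned listOfFiles will live in the bucket

s3 = boto3.client("s3")

def key_exists(key: str) -> bool:
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] in ("404", "NoSuchKey"):
            return False
        raise

with open(PATHS_CSV, newline="") as f:
    keys = [row[0].strip() for row in csv.reader(f) if row]

existing = [k for k in keys if key_exists(k)]

# Upload the cleaned list so the Copy activity's listOfFiles setting can point at it
s3.put_object(Bucket=BUCKET, Key=LIST_KEY, Body="\n".join(existing).encode("utf-8"))
print(f"{len(existing)} of {len(keys)} paths exist")
```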

What is the best way or tool to open and read raw clickstream blob storage data in Azure

I have clickstream data in blob storage with an average file size of about 800 MB, and when I open a file it defaults to a text file. How do I open and read the data, ideally in JSON or columnar format? I would also like to understand whether I can build an API to consume that data. I recently built an Azure Function app with an HTTP trigger, but the file is too large to open and the function times out. Any suggestions on those two points would be appreciated.
Thank you
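As a sketch of one way to read such a large blob without loading it all at once, here is a chunked read using the azure-storage-blob SDK, assuming the clickstream data is newline-delimited JSON (the connection string, container, and blob names are placeholders):

```python
import json
from azure.storage.blob import BlobClient  # pip install azure-storage-blob

blob = BlobClient.from_connection_string(
    conn_str="<connection-string>",         # placeholder
    container_name="clickstream",           # placeholder
    blob_name="events/part-0000.json",      # placeholder
)

buffer = b""
for chunk in blob.download_blob().chunks():  # streams the blob instead of loading 800 MB at once
    buffer += chunk
    *lines, buffer = buffer.split(b"\n")     # keep the trailing partial line for the next chunk
    for line in lines:
        if line.strip():
            event = json.loads(line)
            # process one clickstream event at a time
            print(event)
```

The same streaming idea applies inside a Function: processing the blob chunk by chunk keeps memory flat, though a very large file may still need a longer-running host than an HTTP-triggered consumption-plan function allows.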

Approach for large data set for reporting

I have 220 million raw files in AWS S3 which I am considering merging into a single file, estimated at around 10 terabytes. The merged file would serve as a fact table, but in file format, for audit reporting purposes.
The raw files are source data from an application. If there are any new data changes in the application, the contents of the files will change.
I would like to ask whether anybody has come across an end-to-end process for this use case?
S3 --> ETL (file merging) --> S3 --> reporting (Tableau)
I haven't personally tried it, but this is kind of what Athena is made for... skipping your ETL process and querying directly from the files. Is there a reason you are dumping this all into a single file instead of keeping it dispersed? Rewriting a 10 TB file over and over again is very expensive and time-consuming... I'd personally at least investigate keeping the files 1-1 with the source files.
Create an S3 trigger that fires when a file is rewritten on S3
Create a Lambda that creates your "audit ready" report files on S3
Use AWS Athena to query those report files
Use the Tableau connector to Athena for your reports
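As a rough illustration of the Athena step, assuming a table has already been defined over the report files in S3 (the database, table, and output location below are made up), a query can be started with boto3:

```python
import boto3

athena = boto3.client("athena")

# Database, table, and output location are assumptions for illustration
response = athena.start_query_execution(
    QueryString="SELECT event_date, COUNT(*) AS events FROM audit_facts GROUP BY event_date",
    QueryExecutionContext={"Database": "audit_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print("query execution id:", response["QueryExecutionId"])
```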

Query regarding cloud file storage services - can I append data to an existing file?

I am working to create an application where some files will be stored in Amazon S3/Rackspace Cloud Files/other similar cloud file storage providers.
There are a couple of scenarios where it would be easier for me if I could append data to an existing file... Is this possible? Or do I have to download the file from Amazon S3, then append data to it, and finally upload the modified file back to Amazon S3?
There is no way to append anything to existing files in S3.
You will have to download the file and upload it again after modifying it.
If you wish, though, you can always upload the new data with a tag (a timestamp or a counter), e.g. file_201201011344. So when reading files, you get all files matching your pattern and append them on the client side.
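A small sketch of that pattern with boto3 (the bucket and key prefix are assumptions): each "append" becomes a new timestamped object, and a read lists the prefix and concatenates the pieces client-side.

```python
from datetime import datetime, timezone
import boto3

BUCKET = "my-app-bucket"   # assumption
PREFIX = "logs/file_"      # objects end up as e.g. logs/file_20120101134400

s3 = boto3.client("s3")

def append(data: bytes) -> None:
    # Each "append" is just a new object with a sortable timestamp suffix
    suffix = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    s3.put_object(Bucket=BUCKET, Key=f"{PREFIX}{suffix}", Body=data)

def read_all() -> bytes:
    # List every chunk under the prefix and stitch them together client-side
    parts = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            parts.append(s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read())
    return b"".join(parts)
```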