Download big number of files (400k) from S3 bucket into Azure Datalake Gen2 using Azure Data Factory - amazon-s3

I need to download a big number of files (around 400k) files from an S3 bucket. I have the paths stored in a csv file. Some of the paths may not exist.
The two options i see are:
Use the foreach activity and somehow pass the contents of the file there. But i think that this would flood my monitor pane with a huge number of runs, and it feels like it is meant to be for smaller pipelines.
Use the listOfFiles option which is supported in the S3 source. The problem with this approach is that the list must be in the S3 bucket and cannot be loaded from Azure Datalake Gen2 (anybody knows why, please let me know as well).
I have tried using the listOfFiles way, but the pipeline fails once it finds the first missing file. The fault tolerance options contain a "skip missing file" option but it is defined as "Skip the files if it is being deleted from source store during the data movement", so it is of no use to me.
I don't want to download more files than needed, so copying the bucket as-is is not an option. How can i approach this issue with ADF? I'm looking for a solution that uses the predefined transformations, ideally i would like to not involve Azure Batch or Azure Functions for such a simple task.

Related

Unzip files from S3 before putting them into Snowflake

I have data available in an S3 bucket we don't own, with a zipped folder containing files for each date.
We are using Snowflake as our data warehouse. Snowflake accepts gzip'd files, but does not ingest zip'd folders.
Is there a way to directly ingest the files into Snowflake that will be more efficient than copying them all into our own S3 bucket and unzipping them there, then pointing e.g. Snowpipe to that bucket? The data is on the order of 10GB per day, so copying is very doable, but would introduce (potentially) unnecessary latency and cost. We also don't have access to their IAM policies, so can't do something like S3 Sync.
I would be happy to write something myself, or use a product/platform like Meltano or Airbyte, but I can't find a suitable solution.
How about using SnowSQL to load the data into Snowflake, and using Snowflake stage table/user/named stage to hold files at stages.
https://docs.snowflake.com/en/user-guide/data-load-local-file-system-create-stage.html
I had a similar use case. I use an event based trigger that runs a Lambda function everytime there is a new zipped file in my S3 folder. The Lambda functions opens the zipped files, gzips each individual file and re-uploads them to a different S3 folder. Here's the full working code: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9

how to store auto generated files in a different AWS S3 folder while running Tableau using Athena connector?

I am using Athena to connect a single csv file stored in AWS S3 folder with Tableau Desktop and have been successful in connecting the S3 data using Athena.
However, when I perform any activity in Tableau like drag and drop, slice and dice, for each activity, an auto generated csv and a metadata gets saved in the same folder as my input file.
Due to this additional files getting auto-generated in the same input file folder, the visuals in Tableau also get affected (due to additional records).
How do i ensure that, for any activity i perform in Tableau, the auto-generated files get stored in a different folder (rather than the same folder from where the input file is being called) ?
This will solve my problem as the visuals and the analysis will show correct numbers.
Currently, the work-around that I am using is after every activity I perform in Tableau (slice,filter, etc..), i go back to the S3 folder, delete the additional files that got auto-generated, then continue with activity in Tableau, then back to S3 folder for deletion, etc... (Definitely not the ideal way).
While executing Athena query, I am storing the query results in a different folder, because there is a provision for doing the same.
Please suggest if there is a similar provision for storing the auto-generated files (while working on Tableau) in a different folder ?
P.S. If there is an option of preventing these files from getting generated, that will also be helpful.
Anand
How do I ensure that the auto-generated files get stored in a different folder?
In order to store results of you queries in a different location, you need to specify different path for S3 Staging Directory. In order to do that, you need to Edit Connection to AWS Athena.
Here we did everything within Tableau itself, but the same result can be accomplished within AWS Athena settings for query result locaion
If there is an option of preventing these files from getting generated, that will also be helpful.
On the left side of the toolbar, there is an option Pause/Resume Auto Updates. When paused, Tableau doesn't send new query to AWS Athena.

Approach for large data set for reporting

I am having 220 millions of raw files in AWS s3 which I considering to merge all into a single file which estimate around 10 terabyte. The merge file will be serve as a fact table but in file format for reporting purposes for the audit.
The raw files are source data from an application. If there is any new data changes to the application, the contain of the file will be change.
I would like to ask is anybody come across this end to end process for this user case?
s3--> ETL (file merging)--> s3 --> reporting (tableau)
I haven't personally tried it, but this is kind of what Athena is made for... Skipping your ETL process, and querying directly from the files. Is there a reason you are dumping this all into a single file instead of keeping it dispersed? Rewriting a 10TB file over and over again is very expensive and time consuming... I'd personally at least investigate keeping the files 1-1 with the source files.
Create a s3 trigger that fires when a file is rewritten on s3
Create a Lambda that creates your "audit ready" report files on s3
Use AWS Athena to query those report files
Tableau connector to Athena for your reports

Simple way to load new files only into Redshift from S3?

The documentation for the Redshift COPY command specifies two ways to choose files to load from S3, you either provide a base path and it loads all the files under that path, or you specify a manifest file with specific files to load.
However in our case, which I imagine is pretty common, the S3 bucket periodically receives new files with more recent data. We'd like to be able to load only the files that haven't already been loaded.
Given that there is a table stl_file_scan that logs all the files that have been loaded from S3, it would be nice to somehow exclude those that have successfully been loaded. This seems like a fairly obvious feature, but I can't find anything in the docs or online about how to do this.
Even the Redshift S3 loading template in AWS Data Pipeline appears to manage this scenario by loading all the data -- new and old -- to a staging table, and then comparing/upserting to the target table. This seems like an insane amount of overhead when we can tell up front from the filenames that a file has already been loaded.
I know we could probably move the files that have already been loaded out of the bucket, however we can't do that, this bucket is the final storage place for another process which is not our own.
The only alternative I can think of is to have some other process running that tracks files that have been successfully loaded to redshift, and then periodically compares that to the s3 bucket to determine the differences, and then writes the manifest file somewhere before triggering the copy process. But what a pain! We'd need a separate ec2 instance to run the process which would have it's own management and operational overhead.
There must be a better way!
This is how I solved the problem,
S3 -- (Lambda Trigger on newly created Logs) -- Lambda -- Firehose -- Redshift
It works at any scale. With more load, more calls to Lambda, more data to firehose and everything taken care automatically.
If there are issues with the format of the file, you can configure dead letter queues, events will be sent there and you can reprocess once you fix lambda.
Here I would like to mention some steps that includes process that how to load data in redshift.
Export local RDBMS data to flat files (Make sure you remove invalid
characters, apply escape sequence during export).
Split files into 10-15 MB each to get optimal performance during
upload and final Data load.
Compress files to *.gz format so you don’t end up with $1000
surprise bill :) .. In my case Text files were compressed 10-20
times.
List all file names to manifest file so when you issue COPY command
to Redshift its treated as one unit of load.
Upload manifest file to Amazon S3 bucket.
Upload local *.gz files to Amazon S3 bucket.
Issue Redshift COPY command with different options.
Schedule file archiving from on-premises and S3 Staging area on AWS.
Capturing Errors, setting up restart ability if something fails
Doing it easy way you can follow this link.
In general compare of loaded files to existing on S3 files is a bad but possible practice. The common "industrial" practice is to use message queue between data producer and data consumer that actually loads the data. Take a look on RabbitMQ vs Amazon SQS and etc..

Moving files >5 gig to AWS S3 using a Data Pipeline

We are experiencing problems with files produced by Java code which are written locally and then copied by the Data Pipeline to S3. The error mentions file size.
I would have thought that if multipart uploads is required, then the Pipeline would figure that out. I wonder if there is a way of configuring the Pipeline so that it indeed uses multipart uploading. Because otherwise the current Java code which is agnostic about S3 has to write directly to S3 or has to do what it used to and then use multipart uploading -- in fact, I would think the code would just directly write to S3 and not worry about uploading.
Can anyone tell me if Pipelines can use multipart uploading and if not, can you suggest whether the correct approach is to have the program write directly to S3 or to continue to write to local storage and then perhaps have a separate program be invoked within the same Pipeline which will do the multipart uploading?
The answer, based on AWS support, is that indeed 5 gig files can't be uploaded directly to S3. And there is no way currently for a Data Pipeline to say, "You are trying to upload a large file, so I will do something special to handle this." It simply fails.
This may change in the future.
Data Pipeline CopyActivity does not support files larger than 4GB. http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html
This is below the 5GB limit imposed by S3 for each file-part put.
You need to write your own script wrapping AWS CLI or S3cmd (older). This script may be executed as a shell activity.
Writing directly to S3 may be an issue as S3 does not support append operations - unless you can somehow write multiple smaller objects in a folder.