Event-driven Fetch S3 file in NiFi - amazon-s3

I am currently fetching an S3 file using the FetchS3Object processor. But it is a time-driven process, and sometimes the S3 files are dumped later than expected, so the processor is unable to fetch the file.
Is there a way to make the processor event-driven, or to make it run in a loop until it fetches the file?

You have at least these options:
Define an AWS Lambda function that sends an SQS message every time a new object is added to the S3 bucket, then consume that event in NiFi with the GetSQS processor (see the sketch after this list).
Use the ListS3 processor, which will detect every new object added to the S3 bucket.
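A minimal sketch of the Lambda option, assuming the AWS SDK for Java v1 and the aws-lambda-java-events library; the queue URL is a placeholder for whatever queue your GetSQS processor polls. The handler simply forwards the bucket and key of each new object to SQS:

    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;
    import com.amazonaws.services.lambda.runtime.events.S3Event;
    import com.amazonaws.services.s3.event.S3EventNotification.S3EventNotificationRecord;
    import com.amazonaws.services.sqs.AmazonSQS;
    import com.amazonaws.services.sqs.AmazonSQSClientBuilder;

    public class S3ToSqsForwarder implements RequestHandler<S3Event, String> {

        // Placeholder queue URL; replace with the queue your GetSQS processor polls.
        private static final String QUEUE_URL =
                "https://sqs.us-east-1.amazonaws.com/123456789012/nifi-s3-events";

        private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

        @Override
        public String handleRequest(S3Event event, Context context) {
            // One Lambda invocation can carry several S3 records; forward each one.
            for (S3EventNotificationRecord record : event.getRecords()) {
                String bucket = record.getS3().getBucket().getName();
                String key = record.getS3().getObject().getKey();
                // GetSQS will emit this message body as a flowfile in NiFi.
                sqs.sendMessage(QUEUE_URL, bucket + "/" + key);
            }
            return "ok";
        }
    }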

Related

Apache Flink with S3 as source and S3 as sink

Is it possible to read events as they land in an S3 source bucket via Apache Flink, process them, and sink them back to some other S3 bucket? Is there a special connector for that, or do I have to use the available read/save examples mentioned in Apache Flink?
How does checkpointing happen in such a case? Does Flink keep track of what it has read from the S3 source bucket automatically, or does it need custom code to be built? Does Flink also guarantee exactly-once processing in the S3 source case?
In Flink 1.11 the FileSystem SQL Connector is much improved; that will be an excellent solution for this use case.
With the DataStream API you can use FileProcessingMode.PROCESS_CONTINUOUSLY with readFile to monitor a bucket and ingest new files as they are atomically moved into it. Flink keeps track of the last-modified timestamp of the bucket, and ingests any children modified since that timestamp -- doing so in an exactly-once way (the read offsets into those files are included in checkpoints).
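A minimal DataStream sketch of that approach, assuming one of the flink-s3-fs filesystem plugins is available; the bucket path and the 60-second scan interval are placeholders:

    import org.apache.flink.api.java.io.TextInputFormat;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

    public class S3BucketMonitorJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Checkpointing makes the read offsets part of Flink's exactly-once state.
            env.enableCheckpointing(60_000);

            // Placeholder bucket/prefix to monitor.
            String sourcePath = "s3://my-source-bucket/incoming/";
            TextInputFormat format = new TextInputFormat(new Path(sourcePath));

            DataStream<String> lines = env.readFile(
                    format,
                    sourcePath,
                    FileProcessingMode.PROCESS_CONTINUOUSLY,
                    60_000L);   // re-scan the bucket every 60 seconds

            // ... transform `lines` and add a sink writing to another bucket ...
            lines.print();

            env.execute("Monitor S3 bucket");
        }
    }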

How to copy an S3 bucket onto Kubernetes nodes

I want to copy an S3 bucket onto Kubernetes nodes as a DaemonSet, so that a new node also gets a copy of the S3 bucket as soon as it is launched.
I prefer copying to the Kubernetes node because copying from S3 directly into the pod via the AWS API would mean multiple calls, since multiple pods require it, and it would take time to copy the content each time a pod launches.
Assuming that your S3 content is static and doesn't change often, I believe it makes more sense to use a one-time Job than a DaemonSet to copy the whole S3 bucket to the local disk. It's not clear how you would signal the kube-scheduler that your node is not ready until the S3 bucket is fully copied, but perhaps you can taint your node before the job finishes and remove the taint after it does.
Note also that S3 is inherently slow and meant to be used for processing (reading/writing) single files at a time, so if your bucket has a large amount of data it would take a long time to copy to the node disk.
If your S3 content is dynamic (constantly changing), it would be more challenging, since you would have to keep the files in sync. Your apps would probably need a caching architecture where you go to the local disk to find files and, if they are not there, make a request to S3.
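As a rough sketch of what the one-time Job's container could run, here is the copy step using the AWS SDK for Java TransferManager; the bucket name and the hostPath-mounted destination directory are assumptions:

    import java.io.File;

    import com.amazonaws.services.s3.transfer.MultipleFileDownload;
    import com.amazonaws.services.s3.transfer.TransferManager;
    import com.amazonaws.services.s3.transfer.TransferManagerBuilder;

    public class BucketToNodeDisk {
        public static void main(String[] args) throws InterruptedException {
            // Placeholders: the bucket to mirror and the hostPath volume mounted into the Job pod.
            String bucket = "my-static-content-bucket";
            File destination = new File("/mnt/s3-cache");

            TransferManager tm = TransferManagerBuilder.defaultTransferManager();
            try {
                // Downloads every object under the (empty) key prefix into the destination directory.
                MultipleFileDownload download = tm.downloadDirectory(bucket, "", destination);
                download.waitForCompletion();
            } finally {
                tm.shutdownNow();
            }
        }
    }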

Can I trust aws-cli to re-upload my data without corrupting when the transfer fails?

I extensively use S3 to store encrypted and compressed backups of my workstations. I use the aws cli to sync them to S3. Sometimes, the transfer might fail when in progress. I usually just retry it and let it finish.
My question is: Does S3 have some kind of check to make sure that the previously failed transfer didn't leave corrupted files? Does anyone know if syncing again is enough to fix the previously failed transfer?
Thanks!
Individual files uploaded to S3 are never partially uploaded. Either the entire file is completed and S3 stores it as an S3 object, or the upload is aborted and the S3 object is never stored.
Even in the multipart upload case, multiple parts can be uploaded, but they never form a complete S3 object unless all of the pieces are uploaded and the "Complete Multipart Upload" operation is performed. So there is no need to worry about corruption via partial uploads.
Syncing will certainly be enough to fix the previously failed transfer.
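To make the multipart point concrete, here is a hedged sketch with the AWS SDK for Java (bucket, key, and file path are placeholders): no matter how many parts have been uploaded, the key only comes into existence when the final CompleteMultipartUpload call succeeds.

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
    import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
    import com.amazonaws.services.s3.model.PartETag;
    import com.amazonaws.services.s3.model.UploadPartRequest;

    public class MultipartAtomicityDemo {
        public static void main(String[] args) {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            String bucket = "my-backup-bucket";            // placeholder
            String key = "backups/workstation.tar.gz.gpg"; // placeholder
            File file = new File("/tmp/workstation.tar.gz.gpg");

            String uploadId = s3.initiateMultipartUpload(
                    new InitiateMultipartUploadRequest(bucket, key)).getUploadId();

            long partSize = 5L * 1024 * 1024; // 5 MiB minimum part size
            List<PartETag> partETags = new ArrayList<>();
            long offset = 0;
            for (int partNumber = 1; offset < file.length(); partNumber++) {
                long size = Math.min(partSize, file.length() - offset);
                // If the process dies anywhere in this loop, the key is simply never created.
                partETags.add(s3.uploadPart(new UploadPartRequest()
                        .withBucketName(bucket)
                        .withKey(key)
                        .withUploadId(uploadId)
                        .withPartNumber(partNumber)
                        .withFile(file)
                        .withFileOffset(offset)
                        .withPartSize(size)).getPartETag());
                offset += size;
            }

            // Only this call makes the object visible; before it, GETs on the key return 404.
            s3.completeMultipartUpload(
                    new CompleteMultipartUploadRequest(bucket, key, uploadId, partETags));
        }
    }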
Yes, it looks like the AWS CLI does validate what it uploads and takes care of corruption scenarios by employing an MD5 checksum.
From https://docs.aws.amazon.com/cli/latest/topic/s3-faq.html
The AWS CLI will perform checksum validation for uploading and downloading files in specific scenarios.
The AWS CLI will calculate and auto-populate the Content-MD5 header for both standard and multipart uploads. If the checksum that S3 calculates does not match the Content-MD5 provided, S3 will not store the object and instead will return an error message back to the AWS CLI.
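If you want an extra check of your own after a sync, one option (sketched below with the AWS SDK for Java; bucket, key, and local path are placeholders) is to compare the local MD5 with the object's ETag. Note this only applies to single-part uploads without SSE-KMS, where the ETag happens to be the MD5 of the object, so treat it as a sanity check rather than a guarantee.

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.security.MessageDigest;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;

    public class EtagCheck {
        public static void main(String[] args) throws Exception {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            String bucket = "my-backup-bucket";            // placeholder
            String key = "backups/workstation.tar.gz.gpg"; // placeholder

            // MD5 of the local file, rendered as lowercase hex.
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(Files.readAllBytes(Paths.get("/tmp/workstation.tar.gz.gpg")));
            StringBuilder localMd5 = new StringBuilder();
            for (byte b : digest) {
                localMd5.append(String.format("%02x", b));
            }

            // For single-part, non-KMS uploads the ETag is the object's MD5.
            String etag = s3.getObjectMetadata(bucket, key).getETag();
            System.out.println("local=" + localMd5 + " etag=" + etag
                    + " match=" + etag.equalsIgnoreCase(localMd5.toString()));
        }
    }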

In NiFi is it possible to read selectively through FetchS3Object processor?

In Apache NiFi, using FetchS3Object to read from an S3 bucket, I see it can read all the objects in the bucket, as well as those added later. Is it possible:
To configure the processor to read only objects added from now onwards, not the ones already present?
To make it read a particular folder in the bucket?
NiFi seems great; it's just missing examples in its documentation for at least the popular processors.
A combination of ListS3 and FetchS3Object processors will do this:
ListS3 - to enumerate your S3 bucket and generate flowfiles referencing each object. You can configure the Prefix property to specify a particular folder in the bucket to enumerate only a subset. ListS3 keeps track of what it has read using NiFi's state feature, so it will generate new flowfiles as new objects are added to the bucket.
FetchS3Object - to read S3 objects into flowfile content. You can use the output of ListS3 by configuring the FetchS3Object's Bucket property to ${s3.bucket} and Object Key property to ${filename}.
Another approach would be to configure your S3 bucket to send SNS notifications and subscribe an SQS queue to them. NiFi would read from the SQS queue to receive the notifications, filter the objects of interest, and process them.
See Monitoring An S3 Bucket in Apache NiFi for more on this approach.
Use the GetSQS and FetchS3Object processors, and configure your GetSQS processor to listen for notifications of newly added files. It's an event-driven approach: whenever a new file arrives, the SQS queue sends a notification to NiFi.
See the link below for full details:
AWS-NIFI integration

Using data present in S3 inside EMR mappers

I need to access some data during the map stage. It is a static file, from which I need to read some data.
I have uploaded the data file to S3.
How can I access that data while running my job in EMR?
If I just specify the file path as:
s3n://<bucket-name>/path
in the code, will that work?
Thanks
The s3n:// URL is for Hadoop to read S3 files. If you want to read the S3 file in your map program, you either need to use a library that handles the s3:// URL format - such as jets3t (https://jets3t.s3.amazonaws.com/toolkit/toolkit.html) - or access the S3 objects via HTTP.
A quick search for an example program brought up this link.
https://gist.github.com/lucastex/917988
You can also access the S3 object through HTTP or HTTPS. This may require making the object public or configuring additional security. Then you can access it using the HTTP URL classes supported natively by Java.
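For example, a minimal sketch along those lines; the object URL is a placeholder and assumes the object is publicly readable or the URL is pre-signed:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class ReadS3OverHttp {
        public static void main(String[] args) throws Exception {
            // Placeholder: virtual-hosted-style URL of a public (or pre-signed) object.
            URL url = new URL("https://my-bucket.s3.amazonaws.com/path/static-data.txt");
            try (BufferedReader reader =
                         new BufferedReader(new InputStreamReader(url.openStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Use each line of the static file inside your mapper setup.
                    System.out.println(line);
                }
            }
        }
    }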
Another good option is to use S3DistCp as a bootstrap step to copy the S3 file to HDFS before your map step starts. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
What I ended up doing:
1) Wrote a small script that copies my file from S3 to the cluster:
hadoop fs -copyToLocal s3n://$SOURCE_S3_BUCKET/path/file.txt $DESTINATION_DIR_ON_HOST
2) Created a bootstrap step for my EMR job that runs the script in 1).
This approach doesn't require making the S3 data public.