Fetching an S3 object in NiFi - amazon-s3

I want to fetch one particular file from S3, only once, so I used the ListS3 and FetchS3Object processors. But whenever I start the ListS3 processor and stop it after one second, it has already generated thousands of flowfiles by then, and when these flowfiles are passed to the FetchS3Object processor, the same file is fetched thousands of times.
I even changed the run schedule on the ListS3 processor, but the same thing still happens.
Could someone tell me where I am going wrong with the configuration?

Related

What is the best way to know the latest file in an S3 bucket?

I have a process that uploads files to S3. The rate at which these files are pumped to S3 is not constant. Another process needs to look at the latest files uploaded to this bucket and update, say, a watermark. We need a best-effort strategy to make this "latest file" information available as soon as possible.
S3 has event notification integration with SNS/SQS. Since I don't need a fan-out, I thought I could simply do an S3 -> SQS integration. But on digging deeper into SQS, I see that although there is no limit on the number of SQS queues you can have per account (I would need quite a lot of queues if I were to assign one per partition in S3), there is a limit on the maximum number of messages you can receive per call: 10.
Though I can set up one SQS queue per partition, i.e. Q1 for root/child1, Q2 for root/child2, etc., the number of files pumped into each of these child folders could itself be massive. In that case, instead of trying to drain everything in the queue just to get the latest file in the child directory, is there any other mechanism I could apply?
Note: I am not 100% done with my POC and I certainly don't have the metrics yet. But given the trade-offs of long polling (the longer you wait, the later you get the latest-file information, so short polling is probably what I should use; but then a short poll may not query all SQS servers, so I would need multiple calls to get the latest event out of SQS, and I need to find a balance there), the 10-messages-per-call limit, and so on, I doubt I am using the right tool for this problem. Am I missing something, or am I terribly wrong about SQS?
I have yet to experiment with SNS. Does it do any rate limiting of events, something like "if there are 10,000 events per minute I will only send you the latest one"?
Please let me know the best way to get the latest file uploaded to S3 when the upload rate is high.
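For reference, a minimal sketch of a single long-poll receive against an S3-notification queue, assuming the AWS SDK for Java v1 and a hypothetical queue URL; it shows where the 10-messages-per-call cap and the wait-time trade-off discussed above come in:

    import com.amazonaws.services.sqs.AmazonSQS;
    import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
    import com.amazonaws.services.sqs.model.Message;
    import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

    public class LatestS3EventPoller {
        public static void main(String[] args) {
            // Hypothetical queue URL; replace with the queue wired to the S3 event notification.
            String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-upload-events";

            AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

            ReceiveMessageRequest request = new ReceiveMessageRequest(queueUrl)
                    .withMaxNumberOfMessages(10)   // hard cap: SQS returns at most 10 messages per call
                    .withWaitTimeSeconds(20);      // long poll; lower toward 0 for short polling

            // Each message body is the S3 event notification JSON; the object key and
            // event time inside it are what you would compare to track the "latest" file.
            for (Message m : sqs.receiveMessage(request).getMessages()) {
                System.out.println(m.getBody());
                sqs.deleteMessage(queueUrl, m.getReceiptHandle());
            }
        }
    }

Setting WaitTimeSeconds to 0 turns this into a short poll, which is where the "may not query all servers" caveat above applies.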

Copy multiple objects into one object in Amazon S3

I am stuck with the following problem: I need to upload objects in small parts (512 KB), so I cannot use multipart upload (because of its 5 MB minimum part size). On the grounds of that, I have to put my parts in a "partitions" bucket and run a cron task to download the partitions and upload a single concatenated object into a "completed" bucket.
I would like to confirm, however, whether there really is no more elegant way to do this than direct download and concatenation. The AWS CLI suggests objects can be copied as a whole, but I see no way to copy and concatenate several objects into one. Is there a way to do this using S3 itself?
UPD: The chunk size is not guaranteed to be 512 KB (in fact it ranges from 512 KB to 16 MB), but it is usually 512 KB, and this limit comes from the vendor of my IP cameras, so I cannot really change it. I do know the resulting size beforehand; the camera tells me "I am going to upload 33 MB" in a separate call to my backend, but I have no control over the number of chunks or their size beyond the guaranteed bounds above.
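For what it's worth, a minimal sketch of the download-and-concatenate cron step, assuming the AWS SDK for Java v1 and hypothetical bucket/prefix names. (S3's server-side multipart copy, UploadPartCopy, does not help here either, because every part except the last must still be at least 5 MB.)

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.S3ObjectSummary;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.nio.file.Files;

    public class ConcatenatePartitions {
        public static void main(String[] args) throws Exception {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

            // Hypothetical names: chunks live under "partitions/<videoId>/",
            // the stitched object goes to the "completed" bucket.
            String srcBucket = "partitions", dstBucket = "completed", prefix = "video-0001/";

            File merged = File.createTempFile("merged", ".bin");
            try (FileOutputStream out = new FileOutputStream(merged)) {
                // Keys sort lexicographically, so zero-padded part numbers come back in order.
                // Note: only the first page (up to 1000 keys) is read here; paginate for more parts.
                for (S3ObjectSummary part : s3.listObjectsV2(srcBucket, prefix).getObjectSummaries()) {
                    try (InputStream in = s3.getObject(srcBucket, part.getKey()).getObjectContent()) {
                        in.transferTo(out);
                    }
                }
            }

            // One PUT of the concatenated object into the "completed" bucket.
            s3.putObject(dstBucket, "video-0001.bin", merged);
            Files.delete(merged.toPath());
        }
    }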

Apache NiFi S3 PutObject stuck

Sorry if this is a dumb question; I am very new to NiFi.
I have set up a process group to dump SQL query results to CSV and then upload them to S3. This worked fine with small queries, but it appears to be stuck with larger files.
The input queue to the PutS3Object processor has a limit of 1 GB, but the file it is trying to put is almost 2 GB. I have set the multipart parameters on the S3 processor to 100 MB, but it is still stuck.
So my theory is that PutS3Object needs a complete file before it starts uploading. Is this correct? Is there no way to get it to upload in a "streaming" manner? Or do I just have to increase the input queue size?
Or am I on the wrong track, and something else is holding this up?
The screenshot suggests that the large file is in PutS3Object's input queue and that PutS3Object is actively working on it (judging by the "1 thread" indicator in the top-right corner of the processor box).
As it turned out, there were no errors, just a delay from processing a large file.
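If it helps to see what a multipart threshold and part size do outside NiFi (this is not the processor's internal code, just an illustration), here is a minimal sketch using the AWS SDK for Java v1 TransferManager with hypothetical bucket and file names. Parts of the configured size are uploaded as the file is read, so a ~2 GB file simply takes time, which matches the "no errors, just a delay" outcome:

    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.transfer.TransferManager;
    import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
    import com.amazonaws.services.s3.transfer.Upload;
    import java.io.File;

    public class MultipartUploadSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical bucket and local file.
            String bucket = "my-export-bucket";
            File bigCsv = new File("/tmp/query-dump.csv");

            TransferManager tm = TransferManagerBuilder.standard()
                    .withS3Client(AmazonS3ClientBuilder.defaultClient())
                    .withMinimumUploadPartSize(100L * 1024 * 1024)     // 100 MB parts
                    .withMultipartUploadThreshold(100L * 1024 * 1024)  // use multipart above 100 MB
                    .build();

            // Parts are uploaded in parallel as the file is read; the call below
            // blocks until the whole multipart upload has completed.
            Upload upload = tm.upload(bucket, "query-dump.csv", bigCsv);
            upload.waitForCompletion();
            tm.shutdownNow();
        }
    }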

What are the guarantees for Apache Flume HDFS sink file writes?

Could somebody shed some light on what happens if the Flume agent gets killed in the middle of an HDFS file write (say, using the Avro format)? Will the file get corrupted and all events in it be lost?
I understand that there are transactions between the different elements of the Flume data chain (source -> channel -> sink). But I believe the HDFS files may stay open (as .tmp) between consecutive channel -> sink transactions. So if one transaction of, say, 100 events succeeds (the events are stored in a file and the transaction is committed) and the next one fails in the middle of an HDFS write, could the original 100 events from the first transaction become unreadable (because of file corruption, for instance)? How can Flume ensure that the original 100 events from the first transaction are not affected by this type of failure? Or is there simply no guarantee?
If the Flume agent is killed in the middle of an HDFS file write, the file won't get corrupted and there will be no data loss.
If Flume is writing to a file, say FlumeData123456789.tmp, when the agent is killed, then all the records written to that file up to that point will remain intact, and the file will be saved as FlumeData123456789.
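To make the mechanism concrete, here is a minimal sketch of the write-to-.tmp-then-rename pattern using the Hadoop FileSystem API (paths are hypothetical); the HDFS sink works along these lines, which is why an in-progress file carries the .tmp suffix until it is rolled:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TmpThenRename {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Hypothetical paths mirroring the sink's naming scheme.
            Path tmp = new Path("/flume/events/FlumeData123456789.tmp");
            Path fin = new Path("/flume/events/FlumeData123456789");

            // Events are appended to the open .tmp file; flushing after each
            // committed batch makes those bytes durable even if the writer later dies.
            try (FSDataOutputStream out = fs.create(tmp)) {
                out.writeBytes("event batch 1\n");
                out.hflush();                      // batch committed and readable from here on
                out.writeBytes("event batch 2\n");
                out.hflush();
            }

            // On a normal roll the .tmp file is closed and renamed to its final name.
            fs.rename(tmp, fin);
        }
    }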

Node-local MapReduce job

I am currently attempting to write a MapReduce job where the input data is not in HDFS and cannot be loaded into HDFS, basically because the programs that use the data cannot read from HDFS and there is too much of it to copy in: at least 1 TB per node.
So I have 4 directories on each of the 4 nodes in my cluster. Ideally I would like my mappers to just receive the paths of these local directories and read them using something like file:///var/mydata/..., with one mapper working on each directory, i.e. 16 mappers in total.
However, to be able to do this I need to ensure that I get exactly 4 mappers per node, and exactly the 4 mappers that have been assigned the paths local to that machine. These paths are static and so can be hard-coded into my FileInputFormat and RecordReader, but how do I guarantee that a given split ends up on a given node with a known hostname? If the data were in HDFS I could use a variant of FileInputFormat with isSplitable set to false and Hadoop would take care of it, but since all the data is local this causes issues.
Basically, all I want is to crawl the local directory structure on every node in my cluster exactly once, process a collection of SSTables in these directories and emit rows (in the mapper), and reduce the results (in the reduce step) into HDFS for further bulk processing.
I noticed that InputSplit provides a getLocations method, but I believe this does not guarantee locality of execution, it only optimizes for it, and clearly, if I try to use file:///some_path in each mapper, I need exact locality; otherwise I may end up reading some directories repeatedly and others not at all.
Any help would be greatly appreciated.
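(For context on the getLocations point above, a minimal sketch of an input format that builds one unsplittable split per directory and reports the owning host as its location hint; the hostnames and paths are hypothetical, and, as noted, the scheduler treats these locations only as a preference, not a guarantee.)

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public abstract class LocalDirInputFormat extends FileInputFormat<NullWritable, Text> {

        // Hypothetical hard-coded mapping of hosts to the local data directories on each.
        private static final String[] HOSTS = {"node1", "node2", "node3", "node4"};
        private static final String[] DIRS  = {"/var/mydata/a", "/var/mydata/b",
                                               "/var/mydata/c", "/var/mydata/d"};

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;   // one mapper per directory, never split further
        }

        @Override
        public List<InputSplit> getSplits(JobContext context) {
            List<InputSplit> splits = new ArrayList<>();
            for (String host : HOSTS) {
                for (String dir : DIRS) {
                    // The hosts array is only a locality *hint* to the scheduler;
                    // the split may still be run on another node. Length is a dummy value.
                    splits.add(new FileSplit(new Path("file://" + dir), 0, 1,
                                             new String[] {host}));
                }
            }
            return splits;   // 4 hosts x 4 dirs = 16 splits / mappers
        }
        // createRecordReader(...) would open the directory named in the split
        // and iterate its SSTables; omitted here.
    }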
I see three ways you can do this.
1.) Simply load the data into HDFS, which you do not want to do. But it is worth trying, as it will be useful for future processing.
2.) Make use of NLineInputFormat. Create four different files, each containing the URLs of the input files on one of your nodes, for example:
file://192.168.2.3/usr/rags/data/DFile1.xyz
.......
Load these files into HDFS and write your program to access the data via these URLs. If you use NLineInputFormat with one line per split, you will run 16 mappers, each processing an exclusive file. The only issue is that there is a high possibility that the data on one node gets processed by a mapper on another node; however, there will not be any duplicate processing. (A sketch of this setup follows the list.)
3.) You can further optimize the above method by loading the four URL files separately. While loading each of these files you can take the other three nodes offline, to ensure the file goes exactly to the node where the data files are locally present, and choose a replication factor of 1 so the blocks are not replicated. This greatly increases the probability that the launched map tasks process local files.
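Here is a rough sketch of what option 2 could look like, assuming Hadoop's new MapReduce API; the class names and HDFS paths are hypothetical. The driver points NLineInputFormat at the directory of URL files with one line per split, and each mapper opens the single file:// URL it receives:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class UrlPerMapperJob {

        // Each map() call receives one line of a URL file, e.g.
        // file://192.168.2.3/usr/rags/data/DFile1.xyz, and reads that file directly.
        public static class UrlMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text url, Context ctx)
                    throws IOException, InterruptedException {
                String localPath = URI.create(url.toString().trim()).getPath();
                try (BufferedReader in = new BufferedReader(new FileReader(localPath))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        ctx.write(new Text(localPath), new Text(line));
                    }
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "url-per-mapper");
            job.setJarByClass(UrlPerMapperJob.class);
            job.setMapperClass(UrlMapper.class);

            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.setNumLinesPerSplit(job, 1);                      // one URL -> one mapper
            NLineInputFormat.addInputPath(job, new Path("/user/rags/url-lists"));  // the 4 URL files in HDFS

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileOutputFormat.setOutputPath(job, new Path("/user/rags/output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With four URL files of four lines each, this yields the 16 mappers mentioned above.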
Cheers
Rags