Apache NiFi S3 file movement using Wait and Notify processors

I have a NiFi flow that fetches files from an S3 bucket. The files are processed using "ExecuteScript" and "UpdateAttribute" processors and then written to a local drive after modification using the "PutFile" processor.
I need to move the original file in the S3 bucket to another "archive" bucket. This needs to happen only after "PutFile" has run successfully. If any of the processors fail, the file needs to be transferred to another S3 "failed" bucket instead.
I tried using the Wait and Notify processors to achieve this, but the Wait processor seems to have the modified files in its queue, not the originals.
In the Wait/Notify, I am using ${uuid} as the Release Signal Identifier.
Flow Image here: https://i.stack.imgur.com/k8Kqk.png
Is there a better way to achieve this?
Any tips, suggestions or help appreciated! Thanks in advance.
P.S. I'm entirely new to NiFi, so I might have missed something obvious, but Google did not help me much with this.

Related

Read S3 file based on the path that arrives in Kafka - Apache Flink

I have a pipeline that listens to a Kafka topic which receives the S3 file name & path. The pipeline has to read the file from S3 and do some transformation & aggregation.
I see that Flink supports reading an S3 file directly as a source connector, but in this use case the file has to be read as part of the transformation stage.
I don't believe this is currently possible.
An alternative might be to keep a Flink session cluster running, and dynamically create and submit a new Flink SQL job running in batch mode to handle the ingestion of each file.
Another approach you might be tempted by would be to implement a RichFlatMapFunction that accepts the path as input, reads the file, and emits its records one by one. But this is unlikely to work well unless the files are rather small, because Flink really doesn't like user functions that run for long periods of time (among other things, they hold up checkpointing).
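As a rough illustration of that second approach (not from the original answer), here is a minimal Scala sketch, assuming newline-delimited records and Flink's own FileSystem abstraction for reading the S3 object; the class name and paths are placeholders:

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.core.fs.Path
import org.apache.flink.util.Collector

import scala.io.Source

// Takes an S3 path (e.g. "s3://bucket/key.csv") arriving from Kafka and emits
// the file's lines one by one. Only reasonable for small files: the whole read
// happens inside a single flatMap call, which holds up checkpoint barriers.
class ReadS3File extends RichFlatMapFunction[String, String] {
  override def flatMap(s3Path: String, out: Collector[String]): Unit = {
    val path   = new Path(s3Path)
    val fs     = path.getFileSystem   // resolves to the S3 filesystem if one is configured
    val stream = fs.open(path)
    try {
      Source.fromInputStream(stream, "UTF-8").getLines().foreach(out.collect)
    } finally {
      stream.close()
    }
  }
}

You would apply it as kafkaPathStream.flatMap(new ReadS3File) after the Kafka source, with the caveats above about long-running user functions.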

Spark write parquet job completed but have a long delay to start new job

I am running Spark 2.4.4 on AWS EMR and experienced a long delay after Spark wrote Parquet files to S3. The S3 write itself completes in a few seconds (the data files and the _SUCCESS file are visible in S3), but there is still a delay of around 5 minutes before the following jobs start.
I have seen this referred to as the "Parquet Tax". I have tried the proposed fixes from those articles but still cannot resolve the issue. Can anyone give me a hand? Thanks so much.
You can start by setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version to 2.
You can set this config by using any of the following methods:
When you launch your cluster, you can put spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2 in the Spark config.
At runtime, you can set it on the Spark session: spark.conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
When you write data using the Dataset API, you can set it as an option, i.e. dataset.write.option("mapreduce.fileoutputcommitter.algorithm.version", "2").
That delay is the overhead of the commit-by-rename committer having to fake renames by copying and deleting files.
Switch to a higher-performance committer, e.g. ASF Spark's "zero-rename committer" or the EMR clone, the "fast Spark committer".
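For illustration, on S3A-based deployments the switch is made through configuration along these lines; this sketch is not from the original answer, so treat the exact keys and classes as something to verify against your Hadoop/Spark version:

import org.apache.spark.sql.SparkSession

// Sketch: selecting the S3A "magic" committer instead of the rename-based default.
// Assumes Hadoop 3.1+ with the S3A committers and spark-hadoop-cloud on the classpath.
val spark = SparkSession.builder()
  .appName("s3a-committer-example")
  .config("spark.hadoop.fs.s3a.committer.name", "magic")
  .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
  .config("spark.sql.sources.commitProtocolClass",
          "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
          "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()

On EMR itself, the EMRFS S3-optimized committer plays the equivalent role and is enabled through EMR's own configuration rather than the S3A keys above.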

What's the recommended way to write Serilog logs to Amazon S3?

I'm looking to use Serilog to write structured log data to an Amazon S3 bucket, then analyze using Databricks. I assumed there would be an S3 sink for Serilog but I found I was wrong. I think perhaps using the File sink along with something else might be the ticket, but I'm unsure what that might look like. I suppose I could mount the S3 bucket on my EC2 instance and write to it, but I'm told that's problematic. Could one of you fine folks point me in the right direction?
I would now recommend using Serilog.Sinks.AmazonS3, which was created for exactly the scenario described.
Disclaimer: I'm the maintainer of the project :)
As of this writing, there are no sinks that write to Amazon S3, so you'd have to write your own.
I'd start by taking a look at the Serilog.Sinks.AzureBlobStorage sink, as it probably can serve as a base for you to write a sink for Amazon S3.
Links to the source code for several other sinks are available in the wiki and can give you some more ideas too: https://github.com/serilog/serilog/wiki/Provided-Sinks

apache nifi S3 PutObject stuck

Sorry if this is a dumb question, I am very new to NiFi.
I have set up a process group that dumps SQL query results to CSV and then uploads them to S3. It worked fine with small queries, but appears to be stuck with larger files.
The input queue to the PutS3Object processor has a limit of 1 GB, but the file it is trying to put is almost 2 GB. I have set the multipart parameters in the S3 processor to 100 MB, but it is still stuck.
So my theory is that PutS3Object needs the complete file before it starts uploading. Is this correct? Is there no way to get it to upload in a "streaming" manner? Or do I just have to increase the input queue size?
Or am I on the wrong track and there is something else holding this all up?
The screenshot suggests that the large file is in PutS3Object's input queue and that PutS3Object is actively working on it (shown by the "1" thread indicator in the top-right corner of the processor box).
As it turned out, there were no errors, just a delay from processing a large file.

Where is my AWS EMR reducer output for my completed job (should be on S3, but nothing there)?

I'm having an issue where my Hadoop job on AWS's EMR is not being saved to S3. When I run the job on a smaller sample, the job stores the output just fine. When I run the same command on my full dataset, the job completes again, but there is nothing on S3 where I specified my output to go.
Apparently there was a bug with AWS EMR in 2009, but it was "fixed".
Anyone else ever have this problem? I still have my cluster online, hoping that the data is buried on the servers somewhere. If anyone has an idea where I can find this data, please let me know!
Update: When I look at the logs from one of the reducers, everything looks fine:
2012-06-23 11:09:04,437 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Creating new file 's3://myS3Bucket/output/myOutputDirFinal/part-00000' in S3
2012-06-23 11:09:04,439 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Outputstream for key 'output/myOutputDirFinal/part-00000' writing to tempfile '/mnt1/var/lib/hadoop/s3/output-3834156726628058755.tmp'
2012-06-23 11:50:26,706 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Outputstream for key 'output/myOutputDirFinal/part-00000' is being closed, beginning upload.
2012-06-23 11:50:26,958 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Outputstream for key 'output/myOutputDirFinal/part-00000' upload complete
2012-06-23 11:50:27,328 INFO org.apache.hadoop.mapred.Task (main): Task:attempt_201206230638_0001_r_000000_0 is done. And is in the process of commiting
2012-06-23 11:50:29,927 INFO org.apache.hadoop.mapred.Task (main): Task 'attempt_201206230638_0001_r_000000_0' done.
When I connect to this task's node, the temp directory mentioned is empty.
Update 2: After reading Difference between Amazon S3 and S3n in Hadoop, I'm wondering if my problem is using "s3://" instead of "s3n://" as my output path. In both my small sample (which stores fine) and my full job, I used "s3://". Any thoughts on whether this could be my problem?
Update 3: I see now that on AWS's EMR, s3:// and s3n:// both map to the S3 native file system (AWS EMR documentation).
Update 4: I re-ran this job two more times, each time increasing the number of servers and reducers. The first of these two runs finished with 89 of 90 reducer outputs copied to S3; the 90th said it copied successfully according to the logs, but AWS Support says the file is not there. They've escalated this problem to their engineering team. My second run, with even more reducers and servers, actually finished with all data copied to S3 (thankfully!). One oddity, though, is that some reducers take FOREVER to copy their data to S3 -- in both of these new runs there was a reducer whose output took 1 or 2 hours to copy to S3, whereas the other reducers only took 10 minutes max (the files are 3 GB or so). I think this relates to something wrong with the S3NativeFileSystem used by EMR (e.g. the long hanging -- which I'm getting billed for, of course -- and the alleged successful uploads that don't actually get uploaded). I'd upload to local HDFS first and then to S3, but I was having issues on that front as well (pending the AWS engineering team's review).
TL;DR: Using AWS EMR to store directly on S3 seems buggy; their engineering team is looking into it.
This turned out to be a bug on AWS's part, and they've fixed it in the latest AMI version 2.2.1, briefly described in these release notes.
The long explanation I got from AWS is that when the reducer files are larger than the block limit for S3 (i.e. 5 GB?), multipart upload is used, but there was no proper error-checking going on, which is why it would sometimes work and other times not.
In case this continues for anyone else, refer to my case number, 62849531.