apache nifi S3 PutObject stuck - amazon-s3

Sorry if this is a dumb question, very new to nifi.
I have set up a process group to dump SQL query results to CSV and then upload them to S3. It worked fine with small queries, but it appears to be stuck with larger files.
The input queue to the PutS3Object processor has a limit of 1 GB, but the file it is trying to put is almost 2 GB. I have set the multipart parameters in the S3 processor to 100 MB, but it is still stuck.
So my theory is that PutS3Object needs a complete file before it starts uploading. Is this correct? Is there no way to get it uploading in a "streaming" manner? Or do I just have to increase the input queue size?
Or am I on the wrong track and there is something else holding this all up?

The screenshot suggests that the large file is in PutS3Object's input queue, and that PutS3Object is actively working on it (indicated by the "1" active-thread count in the top-right of the processor box).
As it turns out, there were no errors, just a delay from processing a large file.
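
For reference, a multipart upload at the S3 API level looks roughly like the sketch below (a hedged boto3 illustration, not NiFi code; the bucket, key, and file names are made up). The file is split into parts, each part is uploaded as its own request, and S3 assembles the object when the upload is completed. PutS3Object's multipart settings control a similar part-by-part upload via the AWS SDK, so a ~2 GB file sent in 100 MB parts can simply take a while even when nothing is wrong.

    import boto3

    s3 = boto3.client("s3")
    bucket = "my-bucket"             # illustrative bucket name
    key = "exports/query.csv"        # illustrative object key
    part_size = 100 * 1024 * 1024    # 100 MB parts, like the processor setting

    # Start a multipart upload; S3 returns an UploadId that ties the parts together.
    mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []

    with open("query.csv", "rb") as f:
        part_number = 1
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            # Each part goes up as its own request; S3 assembles them at the end.
            resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
                                  PartNumber=part_number, Body=chunk)
            parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
            part_number += 1

    s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
                                 MultipartUpload={"Parts": parts})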

Related

Mule 4 Batch Process on large input file

I have a huge file that can have anywhere from a few hundred thousand to 5 million records. It is a tab-delimited file. I need to read the file from an FTP location, transform it, and finally write it to an FTP location.
I was going to use the FTP connector to get a repeatable stream and feed it into a Mule batch job. Inside the batch job, the idea was to use a batch step to transform the records and, finally, in the batch aggregator, write the file to the destination FTP in append mode, 100 records at a time.
Q1. Is this a good approach or is there some better approach?
Q2. How does the Mule batch load and dispatch phase work (https://docs.mulesoft.com/mule-runtime/4.3/batch-processing-concept#load-and-dispatch)? Does it wait for the entire stream of millions of records to be read into memory before dispatching a batch job instance?
Q3. While doing the FTP write in the batch aggregator, there is a chance that parallel threads will start appending content to the FTP file at the same time, thereby corrupting the records. Is that avoidable? I read about file locks (https://docs.mulesoft.com/ftp-connector/1.5/ftp-write#locks). My assumption is that it will simply raise a file lock error and not wait to write to FTP in append mode.
Q1. Is this a good approach or is there some better approach?
See the answer to Q3; this approach might not work for you. You could instead use a foreach and process the file sequentially, though that will increase the processing time significantly (a rough sketch of that sequential approach follows at the end of this answer).
Q2. How does the Mule batch load and dispatch phase work (https://docs.mulesoft.com/mule-runtime/4.3/batch-processing-concept#load-and-dispatch)? Does it wait for the entire stream of millions of records to be read into memory before dispatching a batch job instance?
Batch doesn't load large numbers of records into memory; it uses file-based queues. And yes, it loads all records into the queue before starting to process them.
Q3. While doing the FTP write in the batch aggregator, there is a chance that parallel threads will start appending content to the FTP file at the same time, thereby corrupting the records. Is that avoidable? I read about file locks (https://docs.mulesoft.com/ftp-connector/1.5/ftp-write#locks). My assumption is that it will simply raise a file lock error and not wait to write to FTP in append mode.
The file write operation will throw a FILE:FILE_LOCK error if the file is already locked. Note that Mule 4 doesn't manage errors through exceptions, it uses Mule errors.
If you are using the DataWeave flat file format to parse the input file, note that it will load the file into memory and use significantly more memory than the file itself to process it, so you will probably get an out-of-memory error anyway.
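
Not Mule configuration, but a minimal Python sketch of the single-writer, sequential pattern mentioned for Q1/Q3: read the tab-delimited file record by record, transform each record, and append to the destination over FTP in batches of 100 from a single thread. The host, credentials, file names, and the transform are placeholders.

    import csv
    from ftplib import FTP
    from io import BytesIO

    BATCH = 100  # append 100 records at a time, as in the question

    def transform(row):
        # Placeholder for the real per-record transformation.
        return row

    # Hypothetical connection details for illustration only.
    ftp = FTP("ftp.example.com")
    ftp.login("user", "password")

    def append_batch(lines):
        payload = ("\n".join(lines) + "\n").encode()
        # APPE appends to the remote file; a single sequential writer means
        # no two threads can interleave their appends (the Q3 concern).
        ftp.storbinary("APPE output.tsv", BytesIO(payload))

    with open("input.tsv", newline="") as src:
        reader = csv.reader(src, delimiter="\t")
        buffer = []
        for row in reader:
            buffer.append("\t".join(transform(row)))
            if len(buffer) == BATCH:
                append_batch(buffer)
                buffer = []
        if buffer:
            append_batch(buffer)

    ftp.quit()

Because only one connection ever issues the append, the concurrent-append corruption described in Q3 cannot occur, at the cost of losing the parallelism that batch would give you.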

Apache NiFi s3 File movement using Wait and Notify Processor

I have a NiFi flow that is fetching files from an s3 bucket. The files are processed using "ExecuteScript" and "UpdateAttribute" processors and then put into a local drive after modification using the "PutFile" processor.
I need to move the original file in the s3 bucket to another "archive" bucket. This needs to happen after the "PutFile" has run successfully. If any of the processors fail, the file needs to be transferred to another s3 "failed" bucket.
I tried using the Wait and Notify processor to achieve this, but the Wait processor seems to have the modified files in the queue and not the original one.
In the Wait/Notify, I am using ${uuid} as the Release Signal Identifier.
Flow Image here: https://i.stack.imgur.com/k8Kqk.png
Is there any other better way to achieve this?
Any tips, suggestions or help appreciated! Thanks in advance.
P.S. I'm new to NiFi entirely, so I might have missed something obvious, but Google did not help me much with this.

How to resolve this error in Google Data Fusion: "Stage x contains a task of very large size (2803 KB). The maximum recommended task size is 100 KB."

I need to move data from a parameterized S3 bucket into Google Cloud Storage. It is a basic data dump. I don't own the S3 bucket. It has the following syntax:
s3://data-partner-bucket/mykey/folder/date=2020-10-01/hour=0
I was able to transfer data at the hourly granularity using the Amazon S3 Client provided by Data Fusion. I wanted to bring over a day's worth of data, so I reset the path in the client to:
s3://data-partner-bucket/mykey/folder/date=2020-10-01
It seemed like it was working until it stopped. The status is "Stopped." When I review the logs just before it stopped I see a warning, "Stage 0 contains a task of very large size (2803 KB). The maximum recommended task size is 100 KB."
I examined the data in the S3 bucket. Each folder contains a series of log files. None of them are "big". The largest folder contains a total of 3MB of data.
I saw a similar question for this error, but the answer involved Spark coding that I don't have access to in Data Fusion.
Screenshot of Advanced Settings in Amazon S3 Client
These are the settings I see in the client. Maybe there is another setting somewhere I need to set? What do I need to do so that Data Fusion can import these files from S3 to GCS?
When you deploy the pipeline you are redirected to a new page with a ribbon at the top. One of the tools in the ribbon is Configure.
In the Resources section of the Configure modal you can specify the memory resources. I fiddled around with the numbers; 1000 MB worked, 6 MB was not enough (for me).
I processed 756K records in about 46 minutes.

Is boto slow for large files?

I'm doing a few performance tests for uploading large files, on the order of 100MB+. I've read postings about breaking things up and uploading pieces in parallel, but I'm just trying to figure out how fast a large file can go.
When I do my upload and watch the performance with collectl, second-by-second, I'm never getting over 5MB/sec. On the other hand if I reduce the filesize to just 50MB I can do uploads at 20MB/sec.
Is there some magic going on that's based on file size? Is there a way to make my single 100MB file upload faster? What would happen if it were 500MB or even 5GB?
Hmm, I tried it a number of times and consistently got 5MB/sec, and now when I tried it again I got over 15MB/sec. Is this because I'm sharing bandwidth?
-mark
There is definitely no magic going on in boto that would account for the variability you are observing. There are so many variables in this equation, e.g. your own connection to the internet, your provider's connection to the backbone, overall network traffic, the load on S3, etc., that it is extremely difficult to get a definitive answer.
In general, I have found that I can achieve the best performance by using multipart upload and some sort of concurrency. The s3put command line utility in boto provides an example of one way to do this. Also, if your S3 bucket is located in a specific region you might see better performance if you connect to that particular endpoint rather than the generic S3 endpoint.
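
The question is about the original boto library, but for comparison this is roughly how multipart upload with concurrency looks in the newer boto3 library; the bucket, key, file name, region, and part sizes are placeholders.

    import boto3
    from boto3.s3.transfer import TransferConfig

    # Connecting to the bucket's own region can avoid extra redirects.
    s3 = boto3.client("s3", region_name="us-west-2")   # region is an example

    # Use multipart above 100 MB and upload up to 8 parts concurrently.
    config = TransferConfig(multipart_threshold=100 * 1024 * 1024,
                            multipart_chunksize=50 * 1024 * 1024,
                            max_concurrency=8)

    s3.upload_file("bigfile.bin", "my-bucket", "uploads/bigfile.bin", Config=config)

upload_file switches to multipart automatically above the threshold and sends parts on several threads, which is essentially the multipart-plus-concurrency approach described above.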

Moving 1 million image files to Amazon S3

I run an image sharing website that has over 1 million images (~150GB). I'm currently storing these on a hard drive in my dedicated server, but I'm quickly running out of space, so I'd like to move them to Amazon S3.
I've tried doing an rsync, and it took over a day just to scan and create the list of image files. After another day of transferring, it was only 7% complete and had slowed my server down to a crawl, so I had to cancel.
Is there a better way to do this, such as gzipping them onto another local hard drive and then transferring / unzipping that single file?
I'm also wondering whether it makes sense to store these files in multiple subdirectories, or is it fine to have all million-plus files in the same directory?
One option might be to perform the migration in a lazy fashion.
All new images go to Amazon S3.
Any requests for images not yet on Amazon trigger a migration of that one image to Amazon S3 (queue it up).
This should fairly quickly get all recent or commonly fetched images moved over to Amazon and will thus reduce the load on your server. You can then add another task that migrates the others over slowly whenever the server is least busy.
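
A minimal sketch of that check-then-queue logic, assuming boto3 and some background worker that drains the queue; the bucket name and URL patterns are made up.

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    BUCKET = "my-image-bucket"   # hypothetical bucket name

    def image_url(filename, migration_queue):
        """Serve from S3 if the image is already migrated; otherwise serve the
        local copy and queue the file for background upload."""
        try:
            s3.head_object(Bucket=BUCKET, Key=filename)
            return "https://%s.s3.amazonaws.com/%s" % (BUCKET, filename)
        except ClientError as err:
            if err.response["Error"]["Code"] == "404":
                migration_queue.put(filename)        # migrate it in the background
                return "/local-images/" + filename   # keep serving from disk for now
            raise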
Given that the files do not exist (yet) on S3, sending them as an archive file should be quicker than using a synchronization protocol.
However, compressing the archive won't help much (if at all) for image files, assuming that the image files are already stored in a compressed format such as JPEG.
Transmitting ~150 Gbytes of data is going to consume a lot of network bandwidth for a long time. This will be the same if you try to use HTTP or FTP instead of RSYNC to do the transfer. An offline transfer would be better if possible; e.g. sending a hard disc, or a set of tapes or DVDs.
Putting a million files into one flat directory is a bad idea from a performance perspective. While some file systems cope with this fairly well, with O(log N) filename lookup times, others do not and have O(N) filename lookups; multiply that by N to access all files in a directory. An additional problem is that utilities that need to access files in order of file name may slow down significantly if they need to sort a million file names. (This may partly explain why rsync took a day to do the indexing.)
Putting all of your image files in one directory is a bad idea from a management perspective; e.g. for doing backups, archiving stuff, moving stuff around, expanding to multiple discs or file systems, etc.
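
On the directory layout point, a common way to avoid one flat directory (and the same idea maps onto S3 key prefixes) is to shard files into subdirectories derived from a hash of the file name. A small illustrative sketch:

    import hashlib
    from pathlib import Path

    def sharded_path(root, filename, levels=2):
        """Spread files across nested subdirectories taken from a hash of the
        name, e.g. root/xx/yy/cat.jpg where xx and yy are hash digits, so no
        single directory ends up holding a million entries."""
        digest = hashlib.md5(filename.encode()).hexdigest()
        shards = [digest[i * 2:(i + 1) * 2] for i in range(levels)]
        return Path(root).joinpath(*shards, filename)

    print(sharded_path("/images", "cat.jpg"))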
One option you could use instead of transferring the files over the network is to put them on a hard drive and ship it to Amazon's Import/Export service. You don't have to worry about saturating your server's network connection, etc.