Mule 4 Batch Process on large input file

I have a huge file that can have anywhere from a few hundred thousand to 5 million records. It's a tab-delimited file. I need to read the file from an FTP location, transform it, and finally write it to an FTP location.
I was going to use the FTP connector to get a repeatable stream and feed it into a Mule batch job. Inside the batch process, the idea was to use a batch step to transform the records and finally, in a batch aggregator, use FTP write to append the records to the destination file 100 records at a time.
Q1. Is this a good approach or is there some better approach?
Q2. How does the Mule batch load and dispatch phase work (https://docs.mulesoft.com/mule-runtime/4.3/batch-processing-concept#load-and-dispatch)? Does it wait for the entire stream of millions of records to be read into memory before dispatching a Mule batch instance?
Q3. While doing the FTP write in the batch aggregator, there is a chance that parallel threads will start appending content to the FTP file at the same time, thereby corrupting the records. Is that avoidable? I read about file locks (https://docs.mulesoft.com/ftp-connector/1.5/ftp-write#locks). My assumption is that it will simply raise a file lock error and not necessarily wait to write to FTP in append mode.

Q1. Is this a good approach or is there some better approach?
See the answer to Q3; this might not work for you. You could instead use a foreach and process the file sequentially, though that will increase the processing time significantly.
Q2. How does the Mule batch load and dispatch phase work (https://docs.mulesoft.com/mule-runtime/4.3/batch-processing-concept#load-and-dispatch)? Does it wait for the entire stream of millions of records to be read into memory before dispatching a Mule batch instance?
Batch doesn't load large numbers of records into memory; it uses file-based queues. And yes, it loads all records into the queue before starting to process them.
Q3. While doing the FTP write in the batch aggregator, there is a chance that parallel threads will start appending content to the FTP file at the same time, thereby corrupting the records. Is that avoidable? I read about file locks (https://docs.mulesoft.com/ftp-connector/1.5/ftp-write#locks). My assumption is that it will simply raise a file lock error and not wait to write to FTP in append mode.
The FTP write operation will raise an FTP:FILE_LOCK error if the file is already locked. Note that Mule 4 doesn't manage errors through exceptions; it uses Mule errors.
If you are using DataWeave's flat file format to parse the input file, note that it will load the file into memory and use significantly more memory than the file itself to process it, so you are probably going to get an out-of-memory error anyway.

Related

Apache NiFi S3 PutObject stuck

Sorry if this is a dumb question, I'm very new to NiFi.
I have set up a process group to dump SQL queries to CSV and then upload them to S3. It worked fine with small queries, but appears to be stuck with larger files.
The input queue to the PutS3Object processor has a limit of 1 GB, but the file it is trying to put is almost 2 GB. I have set the multipart parameters in the S3 processor to 100 MB, but it is still stuck.
So my theory is that PutS3Object needs the complete file before it starts uploading. Is this correct? Is there no way to get it uploading in a "streaming" manner? Or do I just have to increase the input queue size?
Or am I on the wrong track and there is something else holding this all up?
The screenshot suggests that the large file is in PutS3Object's input queue, and PutS3Object is actively working on it (from the 1 thread indicator in the top-right of the processor box).
As it turns out, there were no errors, just a delay from processing a large file.

Retrieving and using partial results from Pool

I have three functions that read, process, and write respectively. Each function was optimized (to the best of my knowledge) to work independently. Now, I am trying to pass the result of each function to the next one in the chain as soon as it is available, instead of waiting for the entire list. I am not really sure how I can connect them. Here's what I have so far.
from multiprocessing import cpu_count
from multiprocessing.pool import ThreadPool

def main(files_to_load):
    loaded_files = load(files_to_load)  # everything is loaded into memory first
    with ThreadPool(processes=cpu_count()) as pool:
        processed_files = pool.map_async(processing_function_with_Pool, iterable=loaded_files).get()
    write(processed_files)  # written only after all processing has finished
As you can see, my main() function waits for all the files to load (about 500 MB), stores them in memory and sends them to processing_function_with_Pool(), which divides the files into chunks to be processed. After all the processing is done, the files start to be written to disk. I feel like there's a lot of unnecessary waiting between these three steps. How can I connect everything?
Right now your logic is reading all the files sequentially (I guess) and storing them in memory at once.
I'd recommend sending processing_function_with_Pool just a list of the file names to be processed.
processing_function_with_Pool will then take care of reading the file, processing it and writing the results back.
This way you'll also take advantage of doing the I/O concurrently.
If processing_function_with_Pool is doing CPU-bound work, I'd suggest switching to a Pool of processes.
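For illustration, here is a minimal Python sketch of that shape; read_file, transform and write_result are placeholders standing in for your actual load, processing and write logic:

import sys
from multiprocessing import Pool, cpu_count

def read_file(path):
    # placeholder for loading a single file
    with open(path, 'rb') as f:
        return f.read()

def transform(data):
    # placeholder for the CPU-bound processing of one file
    return data

def write_result(path, result):
    # placeholder for writing one file's output
    with open(path + '.out', 'wb') as f:
        f.write(result)

def process_one(path):
    # each worker does read -> process -> write for one file,
    # so I/O and CPU work overlap across the pool
    write_result(path, transform(read_file(path)))
    return path

def main(files_to_load):
    # a pool of processes, since the processing is CPU-bound
    with Pool(processes=cpu_count()) as pool:
        for finished in pool.imap_unordered(process_one, files_to_load):
            print('done:', finished)

if __name__ == '__main__':
    main(sys.argv[1:])  # pass the file names on the command line

Passing only the file names means the parent process never has to hold the full 500 MB in memory, and imap_unordered hands back each result as soon as its worker finishes instead of waiting for the whole batch.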

What are the guarantees for Apache Flume HDFS sink file writes?

Could somebody shed some light on what happens if the Flume agent gets killed in the middle of an HDFS file write (say, using Avro format)? Will the file get corrupted and all the events in it lost?
I understand that there are transactions between the different elements of the Flume data chain (source->channel->sink). But I believe that the HDFS files may stay open between consecutive channel->sink transactions (as .tmp). So if one transaction of, say, 100 events is successful (the events are stored in a file and the transaction is committed) and the next one fails in the middle of the HDFS write, could it be that the original 100 events from the first transaction are not readable (because of file corruption, for instance)? How can Flume assure that the original 100 events from the first transaction are not affected by this type of failure? Or maybe there is no guarantee there?
If the Flume agent is killed in the middle of an HDFS file write, the file won't get corrupted and there will be no data loss.
If Flume is writing to a file, say FlumeData123456789.tmp, when the agent is killed, then all the records written into that file up to that point will remain intact and the file will be saved as FlumeData123456789.

Synchronous processing works with Batch Processing?

I have a bunch of XML files, say hundreds, in my source directory. I have made my flow's processing strategy synchronous so that it executes only one XML file at a time, as performance is not much of a priority for me. But I do have batch processing in my flow. What I understand is that the flow thread creates a child thread to execute my batch processing and then control moves forward. My whole transformation code lies in the batch process, which takes 30 seconds to execute one XML. There is not much logic in my main flow except a file inbound endpoint and a batch execute component (to trigger the batch job). So the file inbound endpoint keeps polling for files, and the whole bunch of XMLs gets picked up in very little time, which runs my Mule instance out of memory and causes unexpected behavior.
I came to know about the fork-join pattern very late, and it may or may not fit my requirement.
So is there any configuration to make my batch process run to completion before picking up the next files? Help me out. I already made the processing strategy synchronous!
Shouldn't you in this case just adjust the polling frequency at the file inbound endpoint?
https://docs.mulesoft.com/mule-user-guide/v/3.7/file-connector
Polling Frequency
(Applies to inbound File endpoints only.)
Specify how often the endpoint should check for incoming messages. The default value is 1000 ms.
Set maxThreadsActive and maxBufferSize
https://docs.mulesoft.com/mule-user-guide/v/3.6/tuning-performance#calculating-threads

import text file to SQL Server 2008 Database real time

I have a machine which creates a new log file at the beginning of the day (12 am) and updates the log file whenever there are changes, until the end of the day.
How do I import the data in near real time (every 30 seconds, every minute, or whenever there is a change) into my SQL Server database?
Will SQL Server 2008 be able to access the active log file? If not, would it be easier if I let my machine create a new log file whenever there is an update? But if so, how do I import so many log files with different names in real time? (I must be able to scale the solution up to multiple machines.)
Thanks a lot.
You can log each new line with a reversed time stamp.
Since you need to log only when the file changes, you can implement an in-memory queue which holds the data read from the file.
Then implement a producer-consumer model wherein one thread reads the file and loads the data into the queue, and the consumer takes entries off the queue and logs them to the database.
A Windows service can then keep reading from the queue and logging to SQL Server.
(Since it's a producer-consumer setup, there will not be any busy waiting when the queue is empty.)
Somehow you will also have to notify the producer thread whenever a log entry is written. This can be done through sockets or some other means, provided you have access to the code which is doing the logging.
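Something like this rough Python sketch of the producer-consumer shape (the Windows-service plumbing is omitted, insert_into_db is a stand-in for the real SQL Server insert, and polling the file stands in for the socket notification mentioned above):

import queue
import threading
import time

def insert_into_db(line):
    # stand-in for the real SQL Server insert
    print('would insert:', line)

def producer(log_path, q):
    # reads new lines from the log file and puts them on the in-memory queue
    with open(log_path, 'r') as f:
        f.seek(0, 2)                  # start at the current end of the file
        while True:
            line = f.readline()
            if line:
                q.put(line.rstrip('\n'))
            else:
                time.sleep(0.5)       # no new data yet

def consumer(q):
    # q.get() blocks, so there is no busy waiting when the queue is empty
    while True:
        insert_into_db(q.get())
        q.task_done()

def main(log_path):
    q = queue.Queue()
    threading.Thread(target=consumer, args=(q,), daemon=True).start()
    producer(log_path, q)             # run the producer in the foreground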
If you have no control over the application producing the file then you have little option but to poll the file. Write an application that regularly polls the file and writes the deltas to the database. The application will need to record a high-water mark marking how far it has read.
Another wrinkle is that if the application does not close the file between writes then the last-accessed time stamp might not be updated, so checking the age of the file may not be reliable. In this case you need to implement something like this process:
Open the log file
Seek to your last recorded EOF position
Try reading
If successful, process the new data until you get to the new EOF.
Update your persistent EOF position
Close the file
You will need to make sure that the number of bytes read aligns with your file seek position. If the log file is Unicode then it may not have a 1:1 mapping between bytes and characters. You may need to read chunks of the file in binary mode and do the translation to characters from the buffer.
Once you have the log file entries parsed, you can just insert the data, or use SqlBulkCopy for larger data volumes.
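A minimal Python sketch of that loop, assuming a plain-text log; the file names and the 30-second interval are illustrative, and the actual parsing and database insert are left as a comment:

import os
import time

STATE_FILE = 'log_offset.txt'   # illustrative: where the high-water mark is persisted
LOG_FILE = 'machine.log'        # illustrative: the machine's active log file

def load_offset():
    try:
        with open(STATE_FILE) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return 0

def save_offset(pos):
    with open(STATE_FILE, 'w') as f:
        f.write(str(pos))

def poll_once():
    offset = load_offset()
    # binary mode keeps the seek position and the byte count aligned;
    # decode to characters only after reading the chunk
    with open(LOG_FILE, 'rb') as f:
        f.seek(offset)
        chunk = f.read()
    if chunk:
        for line in chunk.decode('utf-8', errors='replace').splitlines():
            pass                # parse the line and insert/bulk-load it into the database here
        save_offset(offset + len(chunk))

if __name__ == '__main__':
    while True:
        if os.path.exists(LOG_FILE):
            poll_once()
        time.sleep(30)          # poll interval; matches the ~30-second latency in the question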
If you can relax your latency constraints and the log file is small enough then you could possibly just implement a process that copies the log file to a staging area and reloads the whole thing periodically.
How about an SSIS package being called by an SQL Server Scheduled Job?