Retrieving and using partial results from Pool

I have three functions that read, process and write respectively. Each function was optimized (to the best of my knowledge) to work independently. Now, I am trying to pass each result of each function to the next one in the chain as soon as it is available, instead of waiting for the entire list. I am not really sure how to connect them. Here's what I have so far:
from multiprocessing import cpu_count
from multiprocessing.pool import ThreadPool

def main(files_to_load):
    loaded_files = load(files_to_load)
    with ThreadPool(processes=cpu_count()) as pool:
        processed_files = pool.map_async(processing_function_with_Pool, iterable=loaded_files).get()
    write(processed_files)
As you can see, my main() function waits for all the files to load (about 500 MB), stores them in memory, and sends them to processing_function_with_Pool(), which divides the files into chunks to be processed. After all the processing is done, the files start to be written to disk. I feel like there's a lot of unnecessary waiting between these three steps. How can I connect everything?

Right now your logic reads all the files sequentially (I guess) and stores them in memory all at once.
I'd recommend sending processing_function_with_Pool just a list of the file names to be processed.
processing_function_with_Pool will then take care of reading, processing, and writing each file's results back.
This way you take advantage of handling the I/O concurrently.
If processing_function_with_Pool is doing CPU-bound work, I'd suggest switching to a pool of processes.
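A minimal sketch of that idea (the file handling and the .upper() step below are placeholders for whatever your real load/process/write logic does):

    from multiprocessing import Pool, cpu_count

    def process_one(path):
        # Each worker owns the full read -> process -> write cycle for
        # one file, so I/O and CPU work overlap across workers.
        with open(path, 'rb') as f:
            data = f.read()                   # read just this file
        result = data.upper()                 # placeholder for the CPU-bound step
        with open(path + '.out', 'wb') as f:
            f.write(result)                   # written as soon as it is ready
        return path

    def main(files_to_load):
        # A process pool, since the processing step is CPU-bound;
        # keep ThreadPool if the work is mostly I/O instead.
        with Pool(processes=cpu_count()) as pool:
            # imap_unordered hands out file names lazily, so no file waits
            # for the whole list to be loaded, processed, or written.
            for done in pool.imap_unordered(process_one, files_to_load):
                print('finished', done)

    if __name__ == '__main__':
        main(['a.txt', 'b.txt'])   # hypothetical file names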

Related

Release all document locks at a specified time in Marklogic

We are planning to implement a locking mechanism for our documents using the xdmp:lock-acquire API in MarkLogic with no timeout option. The document would be locked until the user edits and saves the document. As part of this, we need to release all the locks at a specified time, say 12:00 AM every day.
For this, we could use the xdmp:lock-release API, but if there are many documents it would take some time to complete.
Can someone suggest a better way to achieve this in MarkLogic?
If you have a potentially large set of locks that you need to process, and are concerned about timeouts or other issues from doing all of the work in a single transaction, then you can break up the work into smaller chunks or individual transactions.
There are a variety of batch processing tools and frameworks to do that. CoRB is one option that makes it easy to plug in custom selectors and processing scripts, and to execute against giant sets.
If you are looking to initiate the work from a MarkLogic scheduled task and perform all of the work within MarkLogic, then you could spawn multiple tasks to process subsets.
A simple example demonstrating how to set a "chunk size" for each transaction and to keep spawning more work:
declare function local:release-locks($locks, $chunk-size) {
  if (exists($locks))
  then (
    (: release this chunk of locks (you might apply some sort of filter to restrict to a subset,
       and maybe a try/catch in case a lock gets released before this runs) :)
    subsequence($locks, 1, $chunk-size) ! xdmp:node-uri(.) ! xdmp:lock-release(.),
    (: now spawn the next set to be released in a separate transaction :)
    xdmp:spawn-function(
      function() {
        local:release-locks(subsequence($locks, $chunk-size + 1), $chunk-size)
      },
      <options xmlns="xdmp:eval">
        <update>true</update>
        <commit>auto</commit>
      </options>
    )
  )
  else () (: nothing left to do, stop spawning work :)
};

let $locks := xdmp:document-locks()
let $chunk-size := 1000
return local:release-locks($locks, $chunk-size)
If you are looking to go down this route, there are some libraries available:
https://github.com/bradmann/marklogic-spawnlib
https://github.com/mblakele/taskbot
The risk of spawning multiple items onto the task server is that, if there is a restart or interruption, some tasks may not execute and not all locks may be released. But if you are just looking to release all of the locks, you could simply re-run the script to kick off another round.

Mule batch processing vs foreach vs splitter-aggregator

In Mule, I have quite a lot of records to process, where processing includes some calculations, going back and forth to the database, etc. We can process collections of records with these options:
Batch processing
ForEach
Splitter-Aggregator
So what are the main differences between them? When should we prefer one to others?
The Mule batch processing option does not seem to have batch-job-scope variable definitions, for example. Or, what if I want to use multithreading to speed up the overall task? Or, which is better if I want to modify the payload during processing?
When you write "quite many" I assume it's too much for main memory, this rules out spliter/aggregator because it has to collect all records to return them as a list.
I assume you have your records in a stream or iterator, otherwise you probably have a memory problem...
So when to use for-each and when to use batch?
For Each
The simplest solution, but it has some drawbacks:
It is single-threaded (so it may be too slow for your use case)
It is "fire and forget": you can't collect anything within the loop, e.g. a record count
There is no support for handling "broken" records
Within the loop, you can have several steps (message processors) to process your records (e.g. for the mentioned database lookup).
May be a drawback, may be an advantage: the loop is synchronous. (If you want to process asynchronously, wrap it in an async scope.)
Batch
A little more stuff to do / to understand, but more features:
When called from a flow, always asynchronous (this may be a drawback).
Can be standalone (e.g. with a poll inside for starting)
When the data generated in the loading phase is too big, it is automatically offloaded to disk.
Multithreading for free (number of threads configurable)
Handling for "broken records": Batch steps may be executed for good/broken records only.
You get statistics at the end (number of records, number of successful records, etc.)
So it looks like you better use batch.
With Splitter and Aggregator, you are responsible for writing the splitting logic and for joining the results back together at the end of processing. It is useful when you want to process records asynchronously on different servers. It is less reliable than the other options, but parallel processing is possible here.
Foreach is more reliable, but it processes records iteratively using a single thread (synchronously), so parallel processing is not possible. Each record creates a single message by default.
Batch processing is designed to process millions of records in a very fast and reliable way. By default, 16 threads will process your records, and it is reliable as well.
Please go through the links below for more details.
https://docs.mulesoft.com/mule-user-guide/v/3.8/splitter-flow-control-reference
https://docs.mulesoft.com/mule-user-guide/v/3.8/foreach
I have been using the approach of passing the records as an array to a stored procedure.
You can call the stored procedure inside the for-each loop, setting the batch size of the loop accordingly to avoid round trips. I have used this approach and the performance is good. You may have to create another table to log results, and put that logic in the stored procedure as well.
The link below has all the details:
https://dzone.com/articles/passing-java-arrays-in-oracle-stored-procedure-fro

Processing data while it is loading

We have a tool which loads data from some optical media, and once it's all copied to the hard drive runs it through a third-party tool for processing. I would like to optimise this process so that each file is processed as it is read in. The trouble is, the third-party tool (which naturally I cannot change) has a 12-second startup overhead. What is the best way to deal with this, in terms of finishing the entire process as soon as possible? I can pass any number of files to the processing tool in each run, so I need to be able to determine exactly when to run the tool to get the fastest result overall. The data being copied could be anything from one large file (which can't be processed until it's fully copied) to hundreds of small files.
The simplest approach would be to create and run two threads: one that runs the tool and one that loads data. Start a 12-second timer and trigger both threads. Upon each file-load completion, check the elapsed time. If 12 seconds have passed, hand the loaded data to the thread running the tool and continue loading more data in parallel with the processing of the previous bulk. Once the previous bulk finishes processing, restart the 12-second timer and keep checking it upon every file-load completion. Repeat until no data remains.
For better results a more complex solution might be required. You can do some benchmarking to get an estimate of the average data-loading time. Since it may differ for small and large files, several estimates may be needed for different categories of file (grouped by size). Optimal resource utilization is achieved when the data is processed at the same rate at which new data arrives, where processing time includes the 12-second startup. The benchmarking should give you a ratio of the number of processing threads to the number of reading threads (you can also decrease/increase the number of active reading threads according to the incoming file sizes). Essentially, it's a variation of the producer-consumer problem with multiple producers and consumers.
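A rough sketch of the simple two-thread variant in Python (copy_from_media and run_tool below are stand-ins for your real copy step and the 12-second third-party tool):

    import queue
    import threading
    import time

    STARTUP_COST = 12.0          # the tool's fixed startup overhead, in seconds
    file_queue = queue.Queue()   # files land here as soon as they are copied

    def copy_from_media(path):
        time.sleep(0.1)          # stand-in for the real copy from optical media

    def run_tool(batch):
        print('processing %d files' % len(batch))  # stand-in for the tool run

    def loader(paths):
        for path in paths:
            copy_from_media(path)
            file_queue.put(path)   # hand the file over the moment it is copied
        file_queue.put(None)       # sentinel: loading finished

    def processor():
        done = False
        while not done:
            batch = []
            deadline = time.monotonic() + STARTUP_COST
            # Accumulate files for roughly one "startup cost" worth of time,
            # so the 12 s overhead is amortised over a whole batch.
            while time.monotonic() < deadline:
                try:
                    item = file_queue.get(timeout=max(0.0, deadline - time.monotonic()))
                except queue.Empty:
                    break
                if item is None:
                    done = True
                    break
                batch.append(item)
            if batch:
                run_tool(batch)    # loading keeps going while this runs

    if __name__ == '__main__':
        loader_thread = threading.Thread(target=loader, args=(['file%d' % i for i in range(50)],))
        loader_thread.start()
        processor()
        loader_thread.join()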

Does writeToFile:atomically: blocks asynchronous reading?

A few times while my application is in use, I process some large data in the background (to have it ready when the user needs it; something like indexing). When this background process finishes, it needs to save the data to a cache file, but since the data is really large, this takes some seconds.
But at the same time, the user may open some dialog which displays images and text loaded from disk. If this happens while the background process's data is being saved, the user interface has to wait until the saving process is completed. (This is not wanted, since the user then has to wait 3-4 seconds until the images and texts are loaded from disk!)
So I am looking for a way to throttle the writing to disk. I thought of splitting the data up into chunks and inserting a short delay between saving the different chunks. During this delay, the user interface would be able to load the needed texts and images, so the user would not notice a delay.
At the moment I am using [[array componentsJoinedByString:@"\n"] writeToFile:@"some name.dic" atomically:YES]. This is a very high-level solution which doesn't allow any customization. How can I write this large data into one file without saving all the data in one shot?
Does writeToFile:atomically: blocks asynchronous reading?
No. It behaves like writing to a temporary file and, once that completes successfully, renaming the temporary file to the destination (replacing the pre-existing file at the destination, if it exists).
You should consider how you can break your data up, so it is not so slow. If it's all divided by strings/lines and it takes seconds, an easy approach to dividing the database would be by first character. Of course, a better solution could likely be imagined, based on how you access, search, and update the index/database.
…inserting a short delay between saving the different chunks. In this delay, the user interface will be able to load the needed texts and images, so the user will not recognize a delay.
Don't. Just implement the move/replace of the atomic write yourself (writing to a temporary file during indexing and writing). Then your app can serialize read and write commands explicitly for fast, consistent, and correct access to these shared resources.
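Sketched in Python for brevity (the question is Objective-C, but the pattern is language-agnostic): write the new data in chunks to a temporary file in the same directory, then atomically swap it into place, so readers always see either the old or the new complete file:

    import os
    import tempfile

    def atomic_write_lines(lines, destination):
        # Temp file in the same directory => same filesystem, so the final
        # rename is atomic and readers never see a half-written file.
        directory = os.path.dirname(os.path.abspath(destination))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, 'w') as tmp:
                for line in lines:          # written in small chunks; readers of
                    tmp.write(line + '\n')  # the old file are never blocked
            os.replace(tmp_path, destination)  # atomic on POSIX and Windows
        except BaseException:
            os.unlink(tmp_path)
            raise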
Have a look at the NSFileHandle class.
Using a combination of seekToEndOfFile and writeData: you can do the work you wish.

Will doing fork multiple times affect performance?

I need to read log files (.CSV) using fastercsv and save their contents in a database (each cell value is a record). The thing is, there are around 20-25 log files which have to be read daily, and those log files are really large (each CSV file is more than 7 MB). I forked the reading process so that the user does not have to wait a long time, but reading 20-25 files of that size still takes time (more than 2 hours). Now I want to fork the reading of each file, i.e. there will be around 20-25 child processes created. My question is: can I do that? If yes, will it affect performance, and is fastercsv able to handle this?
ex:
for report in @reports
  pid = fork {
    .
    .
    .
  }
  Process.detach(pid)
end
PS: I'm using Rails 3.0.7, and this is going to happen on a server running on Amazon's large instance (7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of local instance storage, 64-bit platform).
If the storage is all local (and I'm not sure you can really say that if you're in the cloud), then forking isn't likely to provide a speedup, because the slowest part of the operation is going to be disc I/O (unless you're doing serious computation on your data). Hitting the disc via several processes at once isn't going to speed that up, though I suppose if the disc had a big cache it might help a bit.
Also, 7 MB of CSV data isn't really that much - you might get a better speedup if you found a quicker way to insert the data. Some databases provide a bulk load function, where you can load formatted data in directly, or you could turn each row into an INSERT and feed that straight into the database. I don't know how you're doing it at the moment, so these are just guesses.
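For illustration, a sketch in Python with sqlite3 (the table layout is invented here, and the same idea applies from Ruby via a bulk-insert gem or your database's bulk loader): batch all the parsed cells into one multi-row round trip rather than one INSERT per cell:

    import csv
    import sqlite3

    def bulk_load(csv_path, db_path):
        conn = sqlite3.connect(db_path)
        conn.execute('CREATE TABLE IF NOT EXISTS cells (row INTEGER, col INTEGER, value TEXT)')
        with open(csv_path, newline='') as f:
            # One record per cell, as in the question.
            rows = [(r, c, value)
                    for r, record in enumerate(csv.reader(f))
                    for c, value in enumerate(record)]
        # A single bulk statement instead of thousands of tiny INSERTs.
        conn.executemany('INSERT INTO cells VALUES (?, ?, ?)', rows)
        conn.commit()
        conn.close()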
Of course, having said all that, the only way to be sure is to try it!