Processing data while it is loading - optimization

We have a tool which loads data from some optical media, and once it's all copied to the hard drive runs it through a third-party tool for processing. I would like to optimise this process so each file is processed as it is read in. Trouble is, the third-party tool (which naturally I cannot change) has a 12 second startup overhead. What is the best way I can deal with this, in terms of finishing the entire process as soon as possible? I can pass any number of files to the processing tool in each run, so I need to be able to determine exactly when to run the tool to get the fastest result overall. The data being copied could be anything from one large file (which can't be processed until it's fully copied) to hundreds of small files.

The simplest approach would be to create and run two threads: one that runs the tool and one that loads the data. Start a 12-second timer and trigger both threads. Upon each completed file load, check the elapsed time; once 12 seconds have passed, hand the accumulated files to the thread running the tool. Continue loading data in parallel while the previous batch is being processed. Once that batch finishes, restart the 12-second timer and keep checking it on every completed file load. Repeat until no data remains.
For better results a more complex solution might be required. You can do some benchmarking to get an estimate of the average data loading time. Since it will likely differ for small and large files, you may need several estimates for different file-size categories. Optimal resource utilization is reached when data is processed at the same rate at which new data arrives, where processing time includes the 12-second startup. The benchmarking should give you a ratio of processing threads to reading threads (you can also increase or decrease the number of active reading threads according to the incoming file sizes). In effect, this is a variation of the producer-consumer problem with multiple producers and consumers.
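Here is a minimal sketch of the simple two-thread approach in Python, assuming the third-party tool can be invoked as an external command (called process_tool below) and that copy_file is a placeholder for your copy routine:

import queue
import subprocess
import threading
import time

TOOL_STARTUP_OVERHEAD = 12  # seconds of startup cost per tool invocation

ready_files = queue.Queue()  # paths copied to disk but not yet processed

def loader(files, copy_file):
    # Copy files one by one and hand each completed copy to the processor.
    for path in files:
        copy_file(path)           # placeholder: copy from optical media to disk
        ready_files.put(path)
    ready_files.put(None)         # sentinel: nothing more to load

def processor():
    # Accumulate copied files for up to 12 s, then run the tool on the batch
    # while loading continues in the other thread.
    finished = False
    while not finished:
        batch = []
        deadline = time.monotonic() + TOOL_STARTUP_OVERHEAD
        while True:
            timeout = max(deadline - time.monotonic(), 0)
            try:
                path = ready_files.get(timeout=timeout)
            except queue.Empty:
                if batch:
                    break         # timer expired and we have work: run the tool
                deadline = time.monotonic() + TOOL_STARTUP_OVERHEAD
                continue          # nothing loaded yet: restart the timer
            if path is None:
                finished = True
                break
            batch.append(path)
        if batch:
            subprocess.run(["process_tool", *batch], check=True)

def run_pipeline(files, copy_file):
    t = threading.Thread(target=loader, args=(files, copy_file))
    t.start()
    processor()
    t.join()

The loader keeps copying while subprocess.run blocks, so the tool's 12-second startup overlaps with the next batch being read in.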

Related

DynamoDB large transaction writes are timing out

I have a service that receives events that vary in size from roughly 5 to 10k items. We split these events into chunks, and the chunks need to be written in transactions because we do some post-processing that depends on a successful write of all the items in the chunk. Ordering of the events is important, so we can't dead-letter them to process at a later time. We're running into an issue where we receive very large (10k) events and they clog up the event processor, causing a timeout (currently set to 15 s). I'm trying to find a way to increase the processing speed of these large events to eliminate the timeouts.
I'm open to ideas, but curious whether there are any pitfalls to running transaction writes concurrently, e.g. splitting the event into chunks of 100 and having X threads write them to DynamoDB concurrently.
There is no problem with multi-threading writes to DynamoDB, so long as you have the capacity to handle the extra throughput.
I would also advise trying smaller batches: with 100 items in a batch, if one item fails for any reason, they all fail. I typically suggest aiming for batch sizes of approximately 10, but of course this depends on your use case.
Also ensure that no two threads target the same item at the same time, as this results in conflicting writes and large numbers of failed batches.
In summary: keep batches as small as possible, ensure your table has adequate capacity, and ensure you don't hit the same items concurrently.
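As a rough sketch of what that could look like with boto3 (the table name, chunk size and thread count below are illustrative, and items are assumed to already be in the low-level attribute-value format):

from concurrent.futures import ThreadPoolExecutor
import boto3

dynamodb = boto3.client("dynamodb")

def chunks(items, size=10):
    # Yield small transactional batches; keep them small so one bad item
    # fails as little work as possible.
    for i in range(0, len(items), size):
        yield items[i:i + size]

def write_chunk(chunk):
    # All items in the chunk succeed or fail together.
    dynamodb.transact_write_items(
        TransactItems=[
            {"Put": {"TableName": "events", "Item": item}} for item in chunk
        ]
    )

def write_event(items, workers=8):
    # Items should already be de-duplicated by key so that no two concurrent
    # transactions touch the same item (see the conflict note above).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(write_chunk, c) for c in chunks(items)]
        for f in futures:
            f.result()  # surface any failed transaction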

Optimizing Neptune Bulk Load Jobs?

Currently we have an automation engine running to queue up billions of nodes/edges for our Neptune historical load.
The data is pulled off Kafka and written as bulk CSVs into S3 to initiate the load. Currently I'm uploading files after each batch pulls a couple of million records off the queue.
I'm using the oversubscribe parameter and have looked at the high-level docs for bulk-load optimizations. I'm seeing about 36M records an hour, but I'm looking to go faster. Should the output files be larger? I can only run one job at a time, and my queue is constantly filled up to the 65-job cap.
In general, larger files should give better performance than smaller ones as the worker threads running the load will divide the file up amongst themselves. Larger instances also help the loads go faster. If possible, a db.r5.12xlarge is a good choice when you have a lot of data to load. You can scale it back down again once the volume of writes you need to achieve slows down and a smaller instance will suffice.
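For illustration, a single load request over a whole S3 prefix (rather than one request per file) with oversubscribed parallelism might look like the sketch below; the endpoint, bucket and IAM role are placeholders:

import requests

LOADER_ENDPOINT = "https://<your-neptune-endpoint>:8182/loader"

response = requests.post(
    LOADER_ENDPOINT,
    json={
        "source": "s3://my-bucket/neptune/historical-load/",  # whole prefix, not one file
        "format": "csv",
        "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
        "region": "us-east-1",
        "parallelism": "OVERSUBSCRIBE",
        "queueRequest": "TRUE",
    },
)
print(response.json())  # contains the loadId to poll for status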

SSAS Process Default behavior

I'm trying to make sense of Process Default behavior on SSAS 2017 Enterprise Edition.
My cube is processed daily in this standard sequence:
1. Loop through 30 dimensions, performing a Process Add or Process Update as required.
2. Process approximately 80 partitions for the previous day.
3. Execute a Process Default as the final step.
Everything works just fine, and for the amount of data involved, performs really well. However I have observed that after the process default completes, if I re-run the process default step manually (with no other activity having occurred whatsoever), it will take exactly the same time as the first run.
My understanding was that this step basically scans the cube looking for unprocessed objects and will process any objects found to be unprocessed. Given the flow of dimension processing, and subsequent partition processing, I'd certainly expect some objects to be unprocessed on the first run - particularly aggregations and indexes.
The end to end processing time is around 65 mins, but 10 mins of this is the final process default step.
One explanation would be that the Process Default isn't actually finding anything to do, and the elapsed time is just the cost of scanning the metadata. But firstly that seems an excessive amount of time, and secondly, if I don't run the step the cube doesn't come online, which suggests it is definitely doing something.
I've had a trawl through Profiler to try to find events to capture what process default is doing, but I'm not able to find anything that would capture the event specifically. I've also monitored the server performance during the step, and nothing is under any real load.
Any suggestions or clarifications?

Retrieving and using partial results from Pool

I have three functions that read, process and write, respectively. Each function has been optimized (to the best of my knowledge) to work independently. Now I am trying to pass each result from one function to the next in the chain as soon as it is available, instead of waiting for the entire list, but I am not sure how to connect them. Here's what I have so far.
from multiprocessing import cpu_count
from multiprocessing.pool import ThreadPool

def main(files_to_load):
    loaded_files = load(files_to_load)   # read everything into memory first
    with ThreadPool(processes=cpu_count()) as pool:
        processed_files = pool.map_async(processing_function_with_Pool, iterable=loaded_files).get()
    write(processed_files)               # write everything only at the end
As you can see, my main() function waits for all the files to load (about 500 MB), stores them in memory, and sends them to processing_function_with_Pool(), which divides the files into chunks to be processed. Only after all the processing is done do the files start being written to disk. I feel like there's a lot of unnecessary waiting between these three steps. How can I connect everything?
Right now your logic reads all the files sequentially (I assume) and stores them all in memory at once.
I'd recommend sending processing_function_with_Pool just a list of the file names to be processed.
processing_function_with_Pool would then take care of reading the file, processing it, and writing the results back.
That way you also take advantage of doing the I/O concurrently.
If processing_function_with_Pool is doing CPU-bound work, I'd suggest switching to a pool of processes.
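A sketch of that restructuring, where load_one, process_one and write_one stand in for per-file versions of your existing functions:

from multiprocessing import Pool, cpu_count

def handle_one_file(path):
    data = load_one(path)        # read just this file
    result = process_one(data)   # CPU-bound work
    write_one(path, result)      # write the result straight back to disk

def main(files_to_load):
    # A process pool, since the per-file work is CPU-bound; use ThreadPool
    # instead if the work is dominated by I/O.
    with Pool(processes=cpu_count()) as pool:
        pool.map(handle_one_file, files_to_load)

Each file now flows through all three steps as soon as it is available, and writing overlaps with reading and processing of the other files.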

Loading from Google cloud storage to Big Query seems slow

I'm running a test using Big Query. Basically I have 50,000 files, each of which are 27MB in size, on average. Some larger, some smaller.
Timing each file upload reveals:
real 0m49.868s
user 0m0.297s
sys 0m0.173s
Using something similar to:
time bq load --encoding="UTF-8" --field_delimiter="~" dataset gs://project/b_20130630_0003_1/20130630_0003_4565900000.tsv schema.json
Running command: "bq ls -j" and subsequently running "bq show -j " reveals that I have the following errors:
Job Type State Start Time Duration Bytes Processed
load FAILURE 01 Jul 22:21:18 0:00:00
Errors encountered during job execution. Exceeded quota: too many imports per table for this table
After checking the database, the rows seem to have loaded fine, which is puzzling since, given the error, I would have expected nothing to have been loaded. The problem is that I really don't understand how I reached my quota limit, since I've only just started uploading files recently and thought the limit was 200,000 requests.
All the data is currently on Google Cloud Storage so I would expect the data loading to happen fairly quickly since the interaction is between cloud storage and Big Query both of which are in the cloud.
By my calculations the entire load is going to take: (50,000 * 49 seconds) 28 days.
Kinda hoping these numbers are wrong.
Thanks.
The quota limit per table is 1000 loads per day. This is to encourage people to batch their loads, since we can generate a more efficient representation of the table if we can see more of the data at once.
BigQuery can perform load jobs in parallel. Depending on the size of your load, a number of workers will be assigned to your job. If your files are large, they will be split among workers; alternatively, if you pass multiple files, each worker may process a different file. So the time it takes for one file is not indicative of the time it takes to run a load job with multiple files.
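For illustration, here is one way to batch many files into a single load job using the google-cloud-bigquery Python client with a wildcard URI (the dataset, table and bucket paths are placeholders); a single job like this counts once against the per-table load quota:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="~",
    encoding="UTF-8",
    schema=client.schema_from_json("schema.json"),  # same bq-style schema file
)

load_job = client.load_table_from_uri(
    "gs://project/b_20130630_0003_1/*.tsv",  # all matching files in one job
    "my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the (parallel) load to finish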