DynamoDB large transaction writes are timing out - kotlin

I have a service that receives events that vary in size from ~5 to 10k items. We split these events up into chunks, and these chunks need to be written in transactions because we do some post-processing that depends on a successful write of all the items in the chunk. Ordering of the events is important, so we can't dead-letter them to process at a later time. We're running into an issue where we receive very large (10k) events and they clog up the event processor, causing a timeout (currently set to 15s). I'm trying to find a way to increase the processing speed of these large events to eliminate timeouts.
I'm open to ideas, but curious whether there are any pitfalls of running transaction writes concurrently? E.g. splitting the event into chunks of 100 and having X threads run through them to write to Dynamo concurrently.

There is no concern with multi-threading writes to DynamoDB, so long as you have the capacity to handle the extra throughput.
I would also advise trying smaller batches: with 100 items in a transaction, if one happens to fail for any reason, they all fail. Typically I suggest aiming for batch sizes of approximately 10, but of course this depends on your use case.
Also ensure that no threads are targeting the same item at the same time, as this would result in conflicting writes and large numbers of failed batches.
In summary: keep batches as small as possible, ensure your table has adequate capacity, and ensure you don't hit the same items concurrently.
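A minimal sketch of the concurrent approach from the question, using the AWS SDK for Java v2 from Kotlin coroutines. The `EventItem` shape, attribute names, and the `chunkSize`/`parallelism` defaults are assumptions to tune against your table's capacity:

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope
import software.amazon.awssdk.services.dynamodb.DynamoDbClient
import software.amazon.awssdk.services.dynamodb.model.AttributeValue
import software.amazon.awssdk.services.dynamodb.model.Put
import software.amazon.awssdk.services.dynamodb.model.TransactWriteItem
import software.amazon.awssdk.services.dynamodb.model.TransactWriteItemsRequest

// Hypothetical event item; the attribute names are placeholders.
data class EventItem(val pk: String, val payload: String)

suspend fun writeEventConcurrently(
    client: DynamoDbClient,
    tableName: String,
    items: List<EventItem>,
    chunkSize: Int = 10,      // small transactions: one bad item fails only its chunk
    parallelism: Int = 8,     // cap on in-flight transactions; tune to capacity
) = coroutineScope {
    items.chunked(chunkSize)          // one transaction per chunk
        .chunked(parallelism)         // limit how many run at once
        .forEach { wave ->
            wave.map { chunk ->
                async(Dispatchers.IO) {
                    val writes = chunk.map { item ->
                        TransactWriteItem.builder()
                            .put(
                                Put.builder()
                                    .tableName(tableName)
                                    .item(
                                        mapOf(
                                            "pk" to AttributeValue.builder().s(item.pk).build(),
                                            "payload" to AttributeValue.builder().s(item.payload).build(),
                                        )
                                    )
                                    .build()
                            )
                            .build()
                    }
                    client.transactWriteItems(
                        TransactWriteItemsRequest.builder().transactItems(writes).build()
                    )
                }
            }.awaitAll()              // a failed transaction surfaces here as an exception
        }
}
```

Note that each distinct item should appear in at most one chunk, so concurrent transactions never touch the same key.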

Related

Optimizing Neptune Bulk Load Jobs?

Currently we have an automation engine running to queue up billions of nodes/edges for our Neptune historical load.
The data is pulled off Kafka and written as bulk CSVs into S3 to initiate the load. Currently I'm uploading files after each batch pulls a couple million records off the queue.
I'm using the oversubscribe param and have looked at the high-level docs for bulk optimizations. I'm seeing I can get about 36M records an hour, but I'm looking to go faster. Do I want the output files to be larger? I can only run one job at a time, and my queue is constantly filled up to the 65 cap limit.
In general, larger files should give better performance than smaller ones, as the worker threads running the load will divide the file up amongst themselves. Larger instances also help the loads go faster. If possible, a db.r5.12xlarge is a good choice when you have a lot of data to load. You can scale back down again once the write volume slows and a smaller instance will suffice.
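For reference, a minimal sketch of starting a load against the Neptune loader endpoint, using the plain JDK HTTP client from Kotlin. The endpoint host, S3 prefix, and IAM role ARN are placeholders; `"parallelism": "OVERSUBSCRIBE"` is the loader setting mentioned in the question:

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Kick off a bulk load via Neptune's HTTP loader endpoint.
fun startBulkLoad(): String {
    val body = """
        {
          "source": "s3://my-bucket/neptune-load/",
          "format": "csv",
          "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadRole",
          "region": "us-east-1",
          "parallelism": "OVERSUBSCRIBE",
          "queueRequest": "TRUE"
        }
    """.trimIndent()
    val request = HttpRequest.newBuilder()
        .uri(URI.create("https://my-neptune-endpoint:8182/loader"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()
    val response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString())
    return response.body()  // contains the loadId used to poll load status
}
```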

Mule batch processing vs foreach vs splitter-aggregator

In Mule, I have quite a lot of records to process, where processing includes some calculations, going back and forth to the database, etc. We can process collections of records with these options:
Batch processing
ForEach
Splitter-Aggregator
So what are the main differences between them? When should we prefer one over the others?
The Mule batch processing option does not seem to have batch-job-scope variable definition, for example. Or what if I want to use multithreading to speed up the overall task? Or which is better if I want to modify the payload during processing?
When you write "quite a lot", I assume it's too much for main memory, which rules out splitter/aggregator because it has to collect all records to return them as a list.
I assume you have your records in a stream or iterator; otherwise you probably have a memory problem...
So when to use for-each and when to use batch?
For Each
The simplest solution, but it has some drawbacks:
It is single threaded (so it may be too slow for your use case)
It is "fire and forget": you can't collect anything within the loop, e.g. a record count
There is no support for handling "broken" records
Within the loop, you can have several steps (message processors) to process your records (e.g. for the mentioned database lookup).
May be a drawback, may be an advantage: The loop is synchronous. (If you want to process asynchronous, wrap it in an async-scope.)
Batch
A little more stuff to do / to understand, but more features:
When called from a flow, always asynchronous (this may be a drawback).
Can be standalone (e.g. with a poll inside for starting)
When the data generated in the loading phase is too big, it is automatically offloaded to disk.
Multithreading for free (number of threads configurable)
Handling for "broken records": Batch steps may be executed for good/broken records only.
You get statistics at the end (number of records, number of successful records, etc.)
So it looks like you'd better use batch.
For Splitter and Aggregator, you are responsible for writing the splitting logic and then joining the records back together at the end of processing. It is useful when you want to process records asynchronously on a different server. It is less reliable compared to the other options, but parallel processing is possible.
Foreach is more reliable, but it processes records iteratively using a single thread (synchronously), hence parallel processing is not possible. Each record creates a single message by default.
Batch processing is designed to process millions of records in a very fast and reliable way. By default, 16 threads will process your records, and it is reliable as well.
Please go through the links below for more details.
https://docs.mulesoft.com/mule-user-guide/v/3.8/splitter-flow-control-reference
https://docs.mulesoft.com/mule-user-guide/v/3.8/foreach
I have been using the approach of passing records as an array to a stored procedure.
You can call the stored procedure inside a for loop, setting the batch size of the for loop accordingly to avoid round trips. I have used this approach and the performance is good. You may have to create another table to log results and have that logic in the stored procedure as well.
Below is a link with all the details:
https://dzone.com/articles/passing-java-arrays-in-oracle-stored-procedure-fro
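As a rough illustration of the same idea from Kotlin over JDBC: the `ID_ARRAY` collection type, the `PROCESS_RECORDS` procedure, and the connection details are all hypothetical, and `createOracleArray` is the Oracle-driver-specific way to build a SQL array:

```kotlin
import java.sql.DriverManager
import oracle.jdbc.OracleConnection

// Pass a whole batch of IDs to a stored procedure in one call,
// avoiding a network round trip per record.
fun processRecords(ids: List<Long>) {
    DriverManager.getConnection(
        "jdbc:oracle:thin:@//host:1521/service", "user", "pass"
    ).use { conn ->
        val oracleConn = conn.unwrap(OracleConnection::class.java)
        // ID_ARRAY is a hypothetical SQL collection type defined in the schema.
        val idArray = oracleConn.createOracleArray("ID_ARRAY", ids.toTypedArray())
        conn.prepareCall("{ call PROCESS_RECORDS(?) }").use { stmt ->
            stmt.setArray(1, idArray)
            stmt.execute()
        }
    }
}
```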

How to handle very long running processes in NServiceBus

I'm using NServiceBus to handle some asynchronous tasks. Occasionally I have a task where I need to process 10,000 records, so this takes a few hours.
My problem is that when I handle these records all together, I cannot use NServiceBus default transaction handling.
Also - if I split these records up into 10,000 smaller messages, they will clog up MSMQ for a few hours, and users who are expecting functions to take a few minutes, will be waiting hours.
Is there a way in NServiceBus to prioritise different messages?
I'd consider breaking it down into smaller batches (not necessarily one message per record) and having a separate endpoint service specifically for this process so that other stuff is not held up. If you break it into batches and you care about when they all complete, then I'd recommend using a saga to track that state.
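NServiceBus itself is .NET, so purely as a conceptual sketch (kept in Kotlin like the other examples here) of the split-and-track shape that answer describes; `BatchMessage` and the tracker are hypothetical stand-ins for the messages and the saga state:

```kotlin
// Each message carries enough context for the saga to know when the job is done.
data class BatchMessage(
    val jobId: String,
    val batchIndex: Int,
    val totalBatches: Int,
    val records: List<String>,
)

// The state a completion-tracking saga would hold per job.
class JobCompletionTracker(private val totalBatches: Int) {
    private var completed = 0

    // Called each time a batch message is processed successfully.
    @Synchronized
    fun markBatchDone(): Boolean {
        completed++
        return completed == totalBatches  // true => publish a "job finished" event
    }
}

fun splitIntoBatches(jobId: String, records: List<String>, batchSize: Int = 100): List<BatchMessage> {
    val chunks = records.chunked(batchSize)
    return chunks.mapIndexed { i, chunk -> BatchMessage(jobId, i, chunks.size, chunk) }
}
```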

When should I prefer batch analysis over interactive analysis?

The incentive to use batch queries instead of interactive mode queries was pricing, but with newer price changes there is no cost difference anymore - so is there any other incentive (quota, performance, other...) to use batch queries?
With the price change, there are two primary reasons to use batch priority:
It lets you queue up your jobs.
It lets you run low-priority queries in a way that doesn't impact high-priority ones.
There are a number of rate limits that affect interactive (i.e. non-batch) queries -- you can have at most 20 running concurrently, and there are concurrent byte limits and 'large query' limits. If those limits are hit, the query will fail immediately. This is because BigQuery assumes that an interactive query is something you need to run immediately.
When you use batch, if you ever hit a rate limit, the query will be queued and retried later. There are still similar rate limits, but they operate separately from interactive rate limits, so your batch queries won't affect your interactive ones.
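For reference, a minimal sketch of submitting at batch priority via the BigQuery client library for Java, called from Kotlin here; the SQL string is a placeholder:

```kotlin
import com.google.cloud.bigquery.BigQueryOptions
import com.google.cloud.bigquery.JobInfo
import com.google.cloud.bigquery.QueryJobConfiguration

// Submit a query at BATCH priority so it queues instead of failing
// when concurrent-query limits are hit.
fun runBatchQuery() {
    val bigquery = BigQueryOptions.getDefaultInstance().service
    val config = QueryJobConfiguration.newBuilder("SELECT ...")
        .setPriority(QueryJobConfiguration.Priority.BATCH)
        .build()
    val job = bigquery.create(JobInfo.of(config))
    job.waitFor()  // blocks until the queued job completes
}
```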
One example might be that you have periodic queries that you run daily or hourly to build dashboards. Maybe you have 100 queries that you want to run. If you try to run them all at once as interactive, some will fail because of concurrent rate limits. Additionally, you don't necessarily want these queries to interfere with other queries you are running manually from the BigQuery Web UI. So you can run the dashboard queries at batch priority and the other queries will run normally as interactive.
One other point to note is that the scheduling for Batch queries has changed so the average wait times should come down considerably. Instead of waiting a half hour or so, batch queries should start within a minute or two (subject to queueing, etc).

Processing data while it is loading

We have a tool which loads data from some optical media, and once it's all copied to the hard drive runs it through a third-party tool for processing. I would like to optimise this process so each file is processed as it is read in. Trouble is, the third-party tool (which naturally I cannot change) has a 12 second startup overhead. What is the best way I can deal with this, in terms of finishing the entire process as soon as possible? I can pass any number of files to the processing tool in each run, so I need to be able to determine exactly when to run the tool to get the fastest result overall. The data being copied could be anything from one large file (which can't be processed until it's fully copied) to hundreds of small files.
The simplest would be to create and run two threads, one that runs the tool and one that loads data. Start a 12-second timer and trigger both threads. Upon each file-load completion, check the elapsed time. If 12 seconds have passed, hand the loaded data to the thread running the tool. Restart loading data in parallel with processing of the previous bulk. Once the previous bulk's processing completes, restart the 12-second timer and continue checking it upon every file-load completion. Repeat until no more data remains.
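A minimal sketch of that timer-based batching in Kotlin; `copyNextFile` and `runTool` are hypothetical stand-ins for the copy step and the third-party tool with its ~12 s startup cost:

```kotlin
import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.TimeUnit
import kotlin.concurrent.thread

const val STARTUP_OVERHEAD_MS = 12_000L  // the tool's fixed startup cost

fun pipeline(
    files: List<String>,
    copyNextFile: (String) -> String,   // copies one file, returns its local path
    runTool: (List<String>) -> Unit,    // runs the third-party tool on a batch
) {
    val ready = LinkedBlockingQueue<String>()
    // Loader thread keeps copying while the main thread processes batches.
    val loader = thread {
        files.forEach { ready.put(copyNextFile(it)) }
    }
    var batchStart = System.currentTimeMillis()
    val batch = mutableListOf<String>()
    while (loader.isAlive || ready.isNotEmpty()) {
        ready.poll(100, TimeUnit.MILLISECONDS)?.let { batch.add(it) }
        // Once we've accumulated files for at least the tool's startup time,
        // hand the batch off so processing overlaps with further copying.
        if (batch.isNotEmpty() && System.currentTimeMillis() - batchStart >= STARTUP_OVERHEAD_MS) {
            runTool(batch.toList())
            batch.clear()
            batchStart = System.currentTimeMillis()
        }
    }
    if (batch.isNotEmpty()) runTool(batch)  // flush whatever remains
}
```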
For better results, a more complex solution might be required. You can do some benchmarking to get an estimate of the average data-loading time. Since it might differ for small and large files, several estimates may be needed for different categories of files (by size). Optimal resource utilization would process the data at the same rate that new data arrives, where processing time includes the 12-second startup. The benchmarking should give you a ratio of processing-thread count to reading-thread count (you can also decrease/increase the number of active reading threads according to the incoming file sizes). Actually, it's a variation of the producer-consumer problem with multiple producers and consumers.