How to improve the performance of my NServiceBus Saga under load

I have a very simple Saga built with NSB7 using SQL Transport and NHibernate persistence.
The Saga listens on a queue and, for each message received, runs through 4 handlers. These are called in sequential order, except that 2 handlers run in parallel and the last handler runs only once both parallel handlers are complete. The last handler writes a record to the DB.
Let's say for a single message, each handler takes 1 second. When a new message is received, which starts the Saga, the expected result is that 3-4 seconds later the record is written to the DB.
If the queue backs up with, say, 1000 messages, then once processing resumes it takes almost 2000 seconds before the last handler creates a new record. Instead of each message running through the expected 4 seconds of processing, the messages effectively bunch up in the initial handlers until the queue is emptied, then do the same for the next handler, and so on.
Any ideas on how I could improve the performance of this system under load, so that a constant stream of processed messages comes out the end instead of messages bunching up with a long delay before a single new record appears?
Thanks
Will

There is documentation for saga concurrency issues: https://docs.particular.net/nservicebus/sagas/concurrency#high-load-scenarios
I still don't fully understand the issue though. Every message that instantiates a saga should create a record in the database after that message is processed, not after 1000 messages. How else is NServiceBus going to guarantee consistency?
Besides that, you probably should not have a single message processed by 4 handlers. If it really needs to work like this, use publish/subscribe and create different endpoints. The saga should be done with processing as soon as possible, especially in high-load scenarios.

Related

How to make a Saga handler Reentrant

I have a task that can be started by the user, that could take hours to run, and where there's a reasonable chance that the user will start the task multiple times during a run.
I've broken the processing of the task up into smaller batches, but the way the data looks it's very difficult to tell what's still to be processed. I batch it using messages that each process a bite sized chunk of the data.
I have thought of using a Saga to control access to starting this process, with a Saga property called Processing that I set at the start of the handler and then unset at the end of the handler. The handler does some work and sends the messages to process the data. I check the value at the start of the handler, and if it's set, then just return.
I'm using Azure storage for Saga storage, if it makes a difference for the next bit. I'm also using NSB 6.
I have a few questions though:
Is this the correct approach to re-entrancy with NSB?
When is a change to Saga data persisted? (and is it different depending on the transport?)
Following on from the above, if I set a Saga value in a handler, wait a while and then reset it to its original value will it change the persistent storage at all?
Seems to be cross-posted in the Particular Software Google group:
https://groups.google.com/forum/#!topic/particularsoftware/p-qD5merxZQ
Sagas are very often used for such patterns. The saga instance tracks progress and guards against the (sub)tasks being invoked multiple times, but it can also take action if the expected tasks did not complete or ran over time.
The saga instance data is stored after processing the message and not when updating any of the saga data properties. The logic you described would not work.
The correct way would be having a saga that orchestrates your process and having regular handlers that do the actual work.
In the saga handle method that creates the saga, check whether the saga was already created or already has the 'busy' status. If it does not have this status, send a message to do some work. This guards that the task is initiated only once, and after that the saga is stored.
The handler can now do the actual task. When it completes, it can 'Reply' back to the saga.
When the saga receives the reply, it can start any follow-up task or raise an event, and it can also 'complete'.
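The steps above can be sketched in a language-neutral way. This is a minimal in-memory model in Python, not NServiceBus code: `TaskSaga`, `Bus`, the `DoWork` message name and the `store` dictionary are all hypothetical stand-ins, and the sketch assumes saga data is written once, after the handler has finished.

```python
from dataclasses import dataclass, field


@dataclass
class SagaData:
    task_id: str
    busy: bool = False


@dataclass
class Bus:
    """Hypothetical stand-in for the message bus: it only records what was sent."""
    sent: list = field(default_factory=list)

    def send(self, message_type: str, task_id: str) -> None:
        self.sent.append((message_type, task_id))


class TaskSaga:
    """Orchestrates one long-running task; the real work happens in a separate handler."""

    def __init__(self, store: dict, bus: Bus):
        self.store = store  # stands in for saga persistence, keyed by task id
        self.bus = bus

    def handle_start(self, task_id: str) -> None:
        data = self.store.get(task_id) or SagaData(task_id)
        if data.busy:
            return  # already running: ignore the duplicate start request
        data.busy = True
        self.bus.send("DoWork", task_id)
        # Assumption: saga data is stored once, after the handler has finished.
        self.store[task_id] = data

    def handle_work_completed(self, task_id: str) -> None:
        # The worker replied, so the saga completes and its data is removed.
        self.store.pop(task_id, None)


store, bus = {}, Bus()
saga = TaskSaga(store, bus)
saga.handle_start("import-42")
saga.handle_start("import-42")            # duplicate request while the task runs
saga.handle_work_completed("import-42")
saga.handle_start("import-42")            # a new run is allowed after completion
print(bus.sent)
```

Only the first and the last start request result in a `DoWork` message. The model handles messages one at a time, so it says nothing about two start messages being processed at exactly the same moment.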
Optimistic concurrency control and batched sends
If two messages are received that create/update the same saga instance, only the first writer wins. The other will fail because of optimistic concurrency control.
However, if these messages are processed sequentially rather than in parallel, both fail unless the saga checks whether the saga instance is already initialized.
The following sample demonstrates this: https://github.com/ramonsmits/docs.particular.net/tree/azure-storage-saga-optimistic-concurrency-control/samples/azure/storage-persistence/ASP_1
The client sends two identical message bodies. The saga is launched and only 1 message succeeds due to optimistic concurrency control.
Due to retries, the second copy will eventually be processed too, but the saga checks the saga data for a field that it knows would normally be initialized by a message that 'starts' the saga. If that field is already initialized, it assumes the message has already been processed and just returns.
It also demonstrates batched sends. Messages are not sent until all handlers/sagas have completed.
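The first-writer-wins behaviour and the "already initialized" check can be illustrated with a small model. This is a sketch in Python under stated assumptions, not the linked sample and not the Azure Storage persister: `VersionedStore`, `start_saga` and the `started` field are hypothetical names, and the version counter stands in for whatever concurrency token the real storage uses.

```python
class ConcurrencyError(Exception):
    """Raised when a write is based on a stale read."""


class VersionedStore:
    """Hypothetical in-memory store: a write succeeds only if the version it read is still current."""

    def __init__(self):
        self._rows = {}  # saga id -> (version, data)

    def load(self, saga_id):
        return self._rows.get(saga_id, (0, None))

    def save(self, saga_id, expected_version, data):
        current_version, _ = self._rows.get(saga_id, (0, None))
        if current_version != expected_version:
            raise ConcurrencyError(saga_id)
        self._rows[saga_id] = (current_version + 1, data)


def start_saga(store, saga_id, snapshot=None):
    """Handle a 'start' message. Returns True if this message initialized the saga."""
    version, data = snapshot if snapshot is not None else store.load(saga_id)
    if data is not None and data.get("started"):
        return False  # already initialized by an earlier copy: just return
    store.save(saga_id, version, {"started": True})
    return True


store = VersionedStore()
# Two identical messages are picked up in parallel: both read the same empty state.
snapshot_a = store.load("order-1")
snapshot_b = store.load("order-1")

first = start_saga(store, "order-1", snapshot_a)   # first writer wins
try:
    start_saga(store, "order-1", snapshot_b)       # stale version: rejected
    second_failed = False
except ConcurrencyError:
    second_failed = True

retried = start_saga(store, "order-1")             # the retry reloads, sees 'started', returns
```

The rejected message is retried with a fresh read, and the `started` check is what stops it from initializing the saga a second time.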
Saga design
The following video might help you with designing your sagas and understand the various patterns:
Integration Patterns with NServiceBus: https://www.youtube.com/watch?v=BK8JPp8prXc
Keep in mind that Azure Storage isn't transactional and does not provide locking, it is only atomic. Any work you do within a handler or saga can potentially be invoked more than once and if you use non-transactional resources then make sure that logic is idempotent.
So after a lot of testing, I don't believe that this is the right approach.
As Archer says, you can manipulate the saga data properties as much as you like, they are only saved at the end of the handler.
So if the saga receives two simultaneous messages the check for Processing will pass both times and I'll have two processes running (and in my case processing the same data twice).
The saga within a saga faces a similar problem too.
What I believe will work (and has done during my PoC testing) is using a database unique index to help out. I'm using entity framework and azure sql, so database access is not contained within the handler's transaction (this is the important difference between the database and the saga data). The database will also operate across all instances of the endpoint and generally seems like a good solution.
The table that I'm using has each of the columns that make up the saga 'id', and there is a unique index on them.
At the beginning of the handler I retrieve a row from the database. If there is a row, the handler returns (in my case this is okay, in others you could throw an exception to get the handler to run again). The first thing that the handler does (before any work, although I'm not 100% sure that it matters) is to write a row to the table. If the write fails (probably because of the unique constraint being violated) the exception puts the message back on the queue. It doesn't really matter why the database write fails, as NSB will handle it.
Then the handler does the work.
Then remove the row.
Of course there is a chance that something happens during processing of the work, so I'm also using a timestamp and another process to reset it if it's busy for too long. (still need to define 'too long' though :) )
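The unique-index approach described above could look roughly like this. It is a minimal sketch using Python's built-in `sqlite3` instead of Entity Framework and Azure SQL; the table `task_lock`, its columns and all function names are hypothetical, and where the post lets the failed write throw so the message goes back on the queue, this sketch simply returns `False`.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_lock (tenant TEXT NOT NULL, task TEXT NOT NULL, started_at REAL NOT NULL)"
)
# The unique index is what makes a second concurrent insert fail.
conn.execute("CREATE UNIQUE INDEX ux_task_lock ON task_lock (tenant, task)")


def try_acquire(tenant, task, now=None):
    """Write the marker row; returns False if another run already holds it."""
    now = time.time() if now is None else now
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO task_lock VALUES (?, ?, ?)", (tenant, task, now))
        return True
    except sqlite3.IntegrityError:
        return False


def release(tenant, task):
    """Remove the marker row once the work is done."""
    with conn:
        conn.execute("DELETE FROM task_lock WHERE tenant = ? AND task = ?", (tenant, task))


def reset_stale(max_age_seconds, now=None):
    """Separate clean-up step: drop rows that have been busy for too long."""
    now = time.time() if now is None else now
    with conn:
        cur = conn.execute(
            "DELETE FROM task_lock WHERE started_at < ?", (now - max_age_seconds,)
        )
    return cur.rowcount


def handle(tenant, task, work):
    """Run `work` only if no other run of the same task is in progress."""
    if not try_acquire(tenant, task):
        return False
    work()  # if this raises, the row stays until reset_stale removes it
    release(tenant, task)
    return True
```

A call such as `handle("tenant-1", "import", do_import)` returns `False` while another run holds the row. The value passed to `reset_stale` is the 'too long' threshold that still has to be chosen.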
Maybe this can help someone with a similar problem.

NServiceBus Timeoutsdispatcher queue is being flooded with messages during stress tests

I'm doing some stress tests on a saga that uses 2 timeouts. During the test about 21K sagas get created. So that would mean 42K timeouts, but I notice that the timeoutsdispatcher queue of the saga is getting flooded with hundreds of thousands of messages until it crashes because the MSMQ storage limit is hit.
I'm seeing this behavior since I switched the persistence mechanism from RavenDB to SQL Server.
Does anyone have an idea what could be wrong?
Transport: MSMQ
Persistence: NHibernate
Packages used:
NHibernate version 4.0.4.4000
NServiceBus version 5.2.14
NServiceBus.Host version 6.0.0
NServiceBus.Log4Net version 1.0.0
NServiceBus.NHibernate version 6.2.7
Test setup:
* endpoint 1 is sending 22000 messages to endpoint 2.
* endpoint 2 hosts a saga that is started by that message.
* each saga publishes an event and then requests 2 timeouts: 1 at 4 minutes, 1 at 10 minutes.
Observed behavior:
* endpoint 1 sends the 22K messages in under a minute.
* endpoint 2 (the saga) processes 5 to 10 messages per second.
* after 4 minutes the first timeouts are fired, while endpoint 2 is still processing messages from its queue and thus is still creating new saga instances.
* from that moment on, the timeoutsdispatcher queue of the saga endpoint is getting filled with messages.
* after 10 minutes or so, the timeoutsdispatcher queue already contains over 170K messages and is still filling up.
* That continues until endpoint 2 crashes because the MSMQ storage limit is hit, or all messages from the input queue are processed. If the latter occurs first, the timeoutsdispatcher queue message count starts to decrease until it eventually reaches 0.
Did you perform the same stress test with RavenDB? And is SQL Server on a machine that's more-or-less equally powerful, with fast drives?
Update
Some checks for your saga
Is the [Unique] attribute used, and is it used properly? In other words, do you use unique ids for every incoming message, so that every incoming message that spawns 2 timeouts creates a unique saga instance? If every incoming message accesses the same saga, throughput will be severely limited. Assume the saga instance was already created, otherwise the explanation becomes too complex. Message1 comes in, tries to find the row in the database, finds it and locks it. The second message comes in at the same time and finds the row, but it's locked, so it goes into retry. Message3 up to Message100 come in (if concurrency is set to 100) and all try to do the same thing, immediately failing. You can see this will limit throughput for a while :)
Are the correct indexes on your Saga table(s) and Timeout tables?
What is your maximum concurrency level set to?
Based on the number of messages: you say you send 22k messages, resulting in 44k timeout messages. Imagine all of these timeouts are in MSMQ, and that the messages are really, really small, like 1 KB. Header information added by NServiceBus might take up 2 KB. 44,000 times 3 KB is roughly 135 megabytes, so there's no way that can fill up a default MSMQ installation, which has a quota of 1 GB.
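The estimate above can be checked with a few lines of Python. The 1 KB body and 2 KB header sizes are the assumed figures from the paragraph, not measurements, and the variable names are only illustrative.

```python
timeouts = 22_000 * 2                      # two timeouts per incoming message
bytes_per_message = (1 + 2) * 1024         # assumed 1 KB body + 2 KB headers
total_bytes = timeouts * bytes_per_message
quota_bytes = 1024 ** 3                    # 1 GB default quota

print(f"{total_bytes / 1_000_000:.0f} MB used, "
      f"{total_bytes / quota_bytes:.0%} of the quota")
```

Under these assumptions the expected timeouts take up about an eighth of the quota, so they cannot account for the storage limit on their own.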
This probably means your dead-letter queue is filled up completely. Find more information on MSMQ connection strings and set the appropriate connection string. For example:
<connectionStrings>
<add name="NServiceBus/Transport"
connectionString="deadLetter=false;journal=false;"/>
</connectionStrings>
Messages with the TimeToBeReceived attribute set end up in the dead-letter queue. Purging queues also makes all messages go to the dead-letter queue, unless you set the proper connection string.

NServicebus saga performance

I have trouble with NSB saga performance. We have one single saga that orchestrates a long-running session. The saga sends a lot of messages to different processors and then gets their replies.
I see that the saga's queue contains tons of incoming messages. Each message is processed very quickly, but there is a delay before the next message is handled. Here is part of the log file:
16:26:42 [14][DEBUG] Finished handling message.
16:26:46 [15][DEBUG] ChildContainerBehavior
16:26:46 [15][DEBUG] MessageHandlingLoggingBehavior
16:26:46 [15][DEBUG] Received message with ID 28b285ce-3b77-4a69-a13a-a3bf009717fd from sender xxxHost#PROCESSOR01
We see a 4-second delay. That is very slow. Please help: what is wrong with my saga?
Thanks!
Since you have a monolithic saga, you will have some contention on the state record that backs the saga in storage. You will want to consider breaking up your endpoint or redesigning how you gather the information. Check out this Routing Slip implementation.

MSMQ + WCF - Retry with Growing Delay

I am using MSMQ 4 with WCF. We have a Microsoft Dynamics plugin putting a message on a queue. A service picks up the message and makes an HTTP request to another web server. The web server responds by putting another message on a different queue. A second service picks up the messages and sends the response back to Dynamics...
We have our retry queue set up to retry 3 times and then wait for 5 minutes before retrying again. The Dynamics system sometimes takes so long (due to other plugins) that we can round-trip before the database transaction commits. The users aren't seeing the update come through for another 5 minutes.
I am curious if there is a way to configure the retry mechanism to retry incrementally. So, the first time it fails, it only waits a few seconds. If it fails a second time, it waits twice that. And the time between retries just keeps growing.
The problem with just reducing the time between retries is that a bad message could easily fill up a log file.
It turns out there is no built-in way of doing this. One slightly involved option is to create multiple queues, each with its own retry/poison sub-queues, each with a growing retry delay. You can reuse the same handler for each queue - the only thing that changes is the configuration. You also need a handler that can read the poison sub-queues (service) and move the message to the next queue in the chain (client).
So, you set receiveErrorHandling to Move. The maxRetryCycles and receiveRetryCount are just 1. Each queue will use a growing retryCycleDelay. Each queue you create will have a poison sub-queue created for it automatically. You simply read from each poison sub-queue and use a client to move it to the next queue.
I am sure someone could write some code that would automatically create N queues with a growing retryCycleDelay and hook it up all programmatically. Since it is the same handler/client for every queue, it wouldn't be a big deal.
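As a rough illustration of that idea, the chain of queues and their growing delays could be generated like this. It is only a sketch in Python: the queue naming scheme, the `RetryQueue` type and the doubling factor are assumptions, and the code produces the configuration values (one `retryCycleDelay` per queue) rather than creating MSMQ queues or WCF bindings.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass(frozen=True)
class RetryQueue:
    name: str                       # hypothetical queue name
    retry_cycle_delay: timedelta    # value for this queue's retryCycleDelay
    next_queue: Optional[str]       # where its poison messages are moved; None for the last queue


def build_retry_chain(base_name: str, first_delay: timedelta,
                      count: int, factor: int = 2) -> List[RetryQueue]:
    """Describe `count` queues whose retry delay grows by `factor` at each step."""
    chain = []
    delay = first_delay
    for i in range(count):
        next_name = f"{base_name}-retry{i + 2}" if i + 1 < count else None
        chain.append(RetryQueue(f"{base_name}-retry{i + 1}", delay, next_name))
        delay *= factor
    return chain


chain = build_retry_chain("dynamics", timedelta(seconds=5), count=4)
for queue in chain:
    print(queue.name, queue.retry_cycle_delay, "->", queue.next_queue)
```

Messages that reach the poison sub-queue of the last queue have nowhere further to go, so they would still need a final destination such as manual inspection.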

MassTransit Saga saves late with NHibernateSagaRepository (Starbucks example)

I have converted the Starbucks example to use RabbitMQ and NHibernate. However, there is a bug/challenge/issue with the DrinkPreparationSaga, namely when it actually gets saved to the database vs. when the PaymentCompleteMessage gets submitted.
How the code works (out of the box, this isn't anything I changed)... The new instance of the saga isn't saved to the database until AFTER the Initial state completes and it transitions to its next state.
The problem is that in the sample Starbucks application the DrinkPreparationSaga starts off with a very slow method that prints out coffee-making sounds once a second, 10 times.
So there are 10 seconds between when the Saga is actually created and when it's saved to the database. The bigger problem is that any other messages destined for that instance of the saga (by CorrelationId) are thrown in the error queue because the Saga doesn't exist.
Shouldn't the NHibernateSagaRepository immediately save the new Saga Instance, then run the workflow, then update the saga post workflow? I can't seem to think of another way to make the example work, but that would require a decent bit of reorganizing in the NHibernateSagaRepository.
Thanks in advance.
The reason sagas are not saved before processing the message is that some members of the saga may not be nullable (or allow nulls) and they are not set before the initial saga message is processed.
And you're correct. Take a look at the Riktig sample (http://github.com/phatboyg/Riktig) to see how the Automatonymous sagas are used and how the correlation of another service (in this case, the image retrieval service) is used. Sagas should not actually perform work, but coordinate the state of a transaction. The earlier Starbucks example was a naive implementation that we built one morning in Austin. It is long overdue for an update (for more reasons, including that it still uses Magnum state machines, which will soon be deprecated in favor of Automatonymous).