How to handle very long-running processes in NServiceBus

I'm using NServiceBus to handle some asynchronous tasks. Occasionally I have a task where I need to process 10,000 records, so this takes a few hours.
My problem is that when I handle these records all together, I cannot use NServiceBus default transaction handling.
Also - if I split these records up into 10,000 smaller messages, they will clog up MSMQ for a few hours, and users who are expecting functions to take a few minutes, will be waiting hours.
Is there a way in NServiceBus to prioritise different messages?

I'd consider breaking it down into smaller batches (not necessarily one message per record) and having a separate endpoint service specifically for this process so that other stuff is not held up. If you break it into batches and you care about when they all complete, then I'd recommend using a saga to track that state.
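The batching idea can be sketched in a few lines of Python. This is not NServiceBus code: the tracker below only stands in for the state a saga would hold, and the names make_batches and BatchTracker, as well as the batch size of 500, are assumptions made for illustration.

```python
from typing import Iterator, List, Sequence, Set


def make_batches(records: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of `records`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(records), batch_size):
        yield list(records[start:start + batch_size])


class BatchTracker:
    """Stand-in for saga state: remembers which batches are still outstanding."""

    def __init__(self, batch_count: int) -> None:
        self.pending: Set[int] = set(range(batch_count))

    def mark_done(self, batch_index: int) -> bool:
        """Record one completed batch; return True once every batch is done."""
        # discard() makes a duplicate completion message harmless.
        self.pending.discard(batch_index)
        return not self.pending


records = list(range(10_000))
batches = list(make_batches(records, batch_size=500))
tracker = BatchTracker(len(batches))

all_done = False
for index, batch in enumerate(batches):
    # In a real system each batch would be one message sent to the
    # dedicated endpoint; here every batch is simply marked complete.
    all_done = tracker.mark_done(index)
```

With 500 records per batch, the 10,000 records become 20 messages instead of 10,000, and each one is small enough to be handled in its own transaction.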

Related

RabbitMQ - allow only one process per user

To keep it short, here is a simplified situation:
I need to implement a queue for background processing of imported data files. I want to dedicate a number of consumers to this specific task (let's say 10) so that multiple users can be processed in parallel. At the same time, to avoid problems with concurrent data writes, I need to make sure that no one user is processed by multiple consumers at the same time; basically, all files of a single user should be processed sequentially.
Current solution (but it does not feel right):
Have 1 queue where all import tasks are published (file_queue_main)
Have 10 queues for file processing (file_processing_n)
Have 1 result queue (file_results_queue)
Have a manager process (in this case in node.js) which consumes messages from file_queue_main one by one and decides which file_processing queue to send each message to. Basically, it keeps track of which file_processing queue each user is currently being processed in.
Here is a little animation of my current solution and expected behaviour:
Is RabbitMQ even the tool for the job? For some reason, it feels like some sort of an anti-pattern. Appreciate any help!
The part about this that doesn't "feel right" to me is the manager process. It has to know the current state of each consumer, and it also has to stop and wait if all processors are working on other users. Ideally, you'd prefer to keep each process ignorant of the others. You're also getting very little benefit out of your processing queues, which are only used when a processor is already working on a message from the same user.
Ultimately, the best solution here is going to depend on exactly what your expected usage is and how likely it is that the next message is from a user that is already being processed. If you're expecting most of your messages coming in at any one time to be from 10 users or fewer, what you have might be fine. If you're expecting to be processing messages from many different users with only the occasional duplicate, your processing queues are going to be empty much of the time and you've created a lot of unnecessary complexity.
Other things you could do here:
Have all consumers pull from the same queue and use some sort of distributed locking to prevent collisions. If a consumer gets a message from a user that's already being worked on, requeue it and move on.
Set up your queue routing so that messages from the same user will always go to the same consumer. The downside is that if you don't spread the traffic out evenly, you could have some consumers backed up while others sit idle.
Also, if you're getting a lot of messages in from the same user at once that must be processed sequentially, I would question if they should be separate messages at all. Why not send a single message with a list of things to be processed? Much of the benefit of event queues comes from being able to treat each event as a discrete item that can be processed individually.
If the user or the file being worked on has a unique ID, hash the ID to pick the processing queue. That way the same user / file task will always be queued on the same processing queue.
I am not sure how this will affect queue length for the processing queues.
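A minimal sketch of the hashing suggestion, in Python. The function name queue_for, the zero-based queue names, and the made-up user IDs are assumptions; the count of 10 comes from the question. hashlib is used rather than the built-in hash() because Python randomizes string hashes per process, which would send the same user to different queues after a restart.

```python
import hashlib
from collections import Counter

QUEUE_COUNT = 10  # matches the ten file_processing_n queues in the question


def queue_for(user_id: str, queue_count: int = QUEUE_COUNT) -> str:
    """Map a user ID to a queue name using a hash that is stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % queue_count
    return f"file_processing_{index}"


# Rough look at how 1,000 made-up user IDs spread over the queues.
spread = Counter(queue_for(f"user-{n}") for n in range(1000))
```

Counting the assignments, as spread does here, is one way to check whether the queue lengths stay reasonably balanced for a given set of IDs.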

How can I get SQL Service Broker to actually use all available Queue Readers?

I've built a data collection framework around service broker. There are several procs that fill the queue with various jobs. Then a listener (activated procedure) that takes the jobs, decides what needs to be done with that item, and hands it off to the correct collection proc.
The activation queue has a MAX_QUEUE_READERS of 10, but almost never reaches that limit. Instead it will take far longer to process with just 1 or 2 activated tasks as seen from dm_broker_activated_tasks.
How can I incentivize or even force the higher number of workers?
EDIT: This MS doc says it only checks for activation every 5 seconds.
Does that mean that if my tasks take less than 5 seconds, I have no way to parallelize them through Service Broker?
Service Broker has a specific concept for parallelism, namely the conversation group. Only messages from different groups can be processed in parallel. How this manifests is that a RECEIVE will lock the conversation group for the dequeued message and no other RECEIVE can dequeue messages from the same conversation group.
So even if you do have more messages in your queue, if they belong to the same conversation group then SQL Server cannot activate more parallel readers.
Even if you don't manage conversation groups explicitly (almost nobody does), they are managed implicitly by the fact that a conversation handle is also a group. Basically, every time you issue a single BEGIN DIALOG followed by several SENDs on the same handle, those messages cannot be processed in parallel. If you issue a separate BEGIN DIALOG for each SEND, they can be processed in parallel, but you lose the order guarantee.

Bigquery streaming inserts taking time

During load testing of our module we found that BigQuery insert calls are taking a long time (3-4 s). I am not sure if this is OK. We are using the Java BigQuery client library, and on average we push 500 records per API call. We are expecting traffic of a million records per second to our module, so BigQuery inserts are a bottleneck in handling this traffic. Currently it is taking hours to push data.
Let me know if we need more info regarding code or scenario or anything.
Thanks
Pankaj
Since streaming has a limited payload size (see Quota policy), it's easier to talk about times, as the payload is limited in the same way for both of us, but I will mention other side effects too.
We measure between 1200-2500 ms for each streaming request, and this was consistent over the last month as you can see in the chart.
We have seen several side effects, though:
the request randomly fails with type 'Backend error'
the request randomly fails with type 'Connection error'
the request randomly fails with type 'timeout' (watch out here, as only some rows are failing and not the whole payload)
some other error messages are non descriptive, and they are so vague that they don't help you, just retry.
we see hundreds of such failures each day, so they are pretty much constant, and not related to Cloud health.
For all of these we opened cases with paid Google Enterprise Support, but unfortunately they didn't resolve them. It seems the recommended option is exponential backoff with retry; even support told us to do so, which personally doesn't make me happy.
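For reference, exponential backoff with retry can be sketched as below. This is a generic Python illustration, not the BigQuery client's own retry mechanism: retry_with_backoff, its default values, and the simulated flaky_insert are all assumptions. Real code would catch only errors known to be retryable, and would also need to inspect per-row failures, since a streaming request can fail for only some rows.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `max_attempts` is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:  # assumption: every error is treated as retryable
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Double the cap each time, limit it to max_delay, and pick a
            # random delay below the cap so clients do not retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")


calls = {"count": 0}


def flaky_insert() -> str:
    """Simulated insert that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated backend error")
    return "ok"


delays: List[float] = []
# Passing delays.append instead of time.sleep records the waits without sleeping.
result = retry_with_backoff(flaky_insert, sleep=delays.append)
```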
If the approach you've chosen takes hours, it does not scale and won't scale. You need to rethink the approach with async processes. To finish sooner, you need to run multiple workers in parallel; the streaming performance of each request will be the same. Just having 10 workers in parallel means the total time will be roughly 10 times less.
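The effect of parallel workers can be illustrated with a thread pool. In this Python sketch insert_batch is a placeholder that only sleeps to mimic I/O latency; it is not a real BigQuery call, and the batch sizes and worker count are assumptions. Whether the speed-up holds in practice depends on quotas and on the backend.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def insert_batch(batch: List[int]) -> int:
    """Placeholder for one streaming insert call; returns the rows 'inserted'."""
    time.sleep(0.05)  # stands in for network latency of a single request
    return len(batch)


batches = [list(range(500)) for _ in range(20)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    # Each worker handles one batch at a time; up to 10 requests run concurrently.
    inserted = sum(pool.map(insert_batch, batches))
elapsed = time.perf_counter() - start
```

Because the work is I/O bound, threads are enough here: the 20 simulated requests finish in roughly two rounds of latency instead of twenty.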
Processing in background IO bound or cpu bound tasks is now a common practice in most web applications. There's plenty of software to help build background jobs, some based on a messaging system like Beanstalkd.
Basically, you need to distribute insert jobs across a closed network, prioritize them, and consume (run) them. Well, that's exactly what Beanstalkd provides.
Beanstalkd gives the possibility to organize jobs in tubes, each tube corresponding to a job type.
You need an API/producer which can put jobs on a tube, let's say a json representation of the row. This was a killer feature for our use case. So we have an API which gets the rows, and places them on tube, this takes just a few milliseconds, so you could achieve fast response time.
On the other side, you now have a bunch of jobs in some tubes. You need an agent. An agent/consumer can reserve a job.
It helps you also with job management and retries: When a job is successfully processed, a consumer can delete the job from the tube. In the case of failure, the consumer can bury the job. This job will not be pushed back to the tube, but will be available for further inspection.
A consumer can release a job, Beanstalkd will push this job back in the tube, and make it available for another client.
Beanstalkd clients can be found in most common languages, and a web interface can be useful for debugging.

What's the recommended way to queue "delayed execution" messages via ServiceStack/Redis MQ?

I would like to queue up messages to be processed, only after a given duration of time elapses (i.e., a minimum date/time for execution is met), and/or at processing time of a message, defer its execution to a later point in time (say some prerequisite checks are not met).
For example, an event happens which defines a process that needs to run no sooner than 1 hour from the time of the initial event.
Is there any built in/suggested model to orchestrate this using https://github.com/ServiceStack/ServiceStack/wiki/Messaging-and-Redis?
I would probably build this in a two step approach.
Queue the Task into your Queueing system, which will process it into a persistence store: SQL Server, MongoDB, RavenDB.
Have a service polling your "Queued" tasks for when they should be reinserted back into the Queue.
Probably the safest way, since you don't want to lose these jobs presumably.
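The two-step approach above can be sketched as follows. This Python example uses an in-memory heap as a stand-in for the persistence store, so it is not durable and is not ServiceStack code; DelayedTask, DelayedTaskStore, and the task names are hypothetical. A polling service would call pop_due periodically and republish whatever it returns to the queue.

```python
import heapq
from dataclasses import dataclass, field
from typing import Any, List


@dataclass(order=True)
class DelayedTask:
    run_at: float                       # earliest time the task may run
    payload: Any = field(compare=False)  # excluded from ordering


class DelayedTaskStore:
    """In-memory stand-in for the persistence store (SQL Server, MongoDB, ...)."""

    def __init__(self) -> None:
        self._heap: List[DelayedTask] = []

    def schedule(self, payload: Any, run_at: float) -> None:
        """Step 1: park the task until its minimum execution time."""
        heapq.heappush(self._heap, DelayedTask(run_at, payload))

    def pop_due(self, now: float) -> List[Any]:
        """Step 2: remove and return every payload whose time has come."""
        due = []
        while self._heap and self._heap[0].run_at <= now:
            due.append(heapq.heappop(self._heap).payload)
        return due


store = DelayedTaskStore()
store.schedule("run-process", run_at=3600.0)  # no sooner than one hour in
store.schedule("cleanup", run_at=60.0)

early = store.pop_due(now=120.0)
late = store.pop_due(now=4000.0)
```

Times are plain numbers here to keep the example deterministic; a real poller would compare against the current clock.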
If you use RabbitMQ instead of Redis you could use Dead Letter Queues to get the same behavior. Dead letter queues essentially are catchers for expired messages.
So you push your messages into a queue with no intention of processing them, and they have a specific expiration in minutes. When they expire they pop over into the queue that you will process out of. Pretty slick way to queue things for later.
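As a rough illustration, the holding queue in this pattern is declared with a message TTL and a dead letter target. The Python helper below only builds the argument dictionary; the three x- keys are RabbitMQ's queue arguments, while the function name and the exchange and routing key values are hypothetical. The result would be passed as the arguments of a queue declaration in whichever client library is used.

```python
from typing import Dict, Union


def delay_queue_arguments(
    delay_seconds: int,
    target_exchange: str,
    target_routing_key: str,
) -> Dict[str, Union[int, str]]:
    """Build queue arguments for a holding queue that nobody consumes from."""
    return {
        # How long a message waits before it expires; RabbitMQ expects milliseconds.
        "x-message-ttl": delay_seconds * 1000,
        # Where expired messages are republished.
        "x-dead-letter-exchange": target_exchange,
        "x-dead-letter-routing-key": target_routing_key,
    }


# Hypothetical names: a one-hour delay before messages reach the work queue.
arguments = delay_queue_arguments(3600, "work_exchange", "work_queue")
```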
You could always use https://github.com/dominionenterprises/mongo-queue-csharp or https://github.com/dominionenterprises/mongo-queue-php or https://github.com/gaillard/mongo-queue-java which provides delayed messages and other uncommon features.

What is the point of the immediate multiple retries in messaging systems?

I've recently been reading up on messaging systems and have specifically looked at both RabbitMQ and NServiceBus. As I understand it, if a message fails for some reason, it is tried again immediately a number of times. Both systems then offer the possibility to try again later, for example in 5 seconds. When the five seconds have passed, the message is sent again a number of times.
I quote Vaughn Vernon in Implementing Domain-Driven Design (p.502):
The other way to handle this is to simply retry the send until it succeeds, perhaps using a Capped Exponential Back-off. In the case of RabbitMQ, retries could fail for quite a while. Thus, using a combination of message NAKs and retries could be the best approach. Still, if our process retries three times every five minutes, it could be all we need.
For NServiceBus, this is called second level retries, and when the retry happens, it happens multiple times.
Why does it need to happen multiple times? Why does it not retry once every five minutes? What is the chance that the first retry after five minutes fails and the second retry, probably just milliseconds later, should succeed?
And in case it does not need to due to some configuration (does it?), why do all the examples I have found have multiple retries?
My background is NServiceBus so my answer may be couched in those terms.
First level retries are great for very transient errors. Deadlocks are a perfect example of this. You try to change the database, and your transaction is chosen as the deadlock victim. In these cases, a first level retry is perfect. Most of the time, one first level retry is all you need. If there is a lot of contention in the database, maybe 2 or 3 retries will be good enough.
Second level retries are for your less transient errors. Think about things like a web service being down for 10 seconds, or a SQL Server database in a failover cluster switching over, which can take 30-60 seconds. If you retry a few milliseconds later, it's not going to do you any good, but 10, 20, 30 seconds later you might have a good shot.
However, the crux of the question is after 5 first level retries and then a delay, why try again 5 times before an additional delay?
First, on your first second-level retry, it's still possible that you could get a deadlock or other very transient error. After all, the goal is usually not to make as slow a system as possible so it would be preferable to not have to wait an additional delay before retrying if the problem is truly transient. Of course there's no way for the infrastructure to know just how transient the problem is.
The second reason is that it's just easier to configure if they're all the same. X levels of retry and Y tries per level = X*Y total tries and only 2 numbers in the configuration file. In NServiceBus, it's these 2 values plus the back-off time span, so the config looks like this:
<SecondLevelRetriesConfig Enabled="true" TimeIncrease="00:00:10" NumberOfRetries="3" />
<TransportConfig MaxRetries="3" />
That's fairly simple. Try 3 times. Wait 10 seconds. Try 3 times. Wait 20 seconds. Try 3 times. Wait 30 seconds. Try 3 times. Then you're done and you move on to an error queue.
Configuring different values for each level would require a much more complex config story.
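The sequence described above ("Try 3 times. Wait 10 seconds. ...") can be reproduced with a small Python function. It models only the schedule as this answer describes it, not NServiceBus internals; retry_schedule and its parameter names are assumptions that mirror MaxRetries, NumberOfRetries, and TimeIncrease.

```python
from typing import List, Tuple


def retry_schedule(
    tries_per_round: int,
    second_level_retries: int,
    time_increase_seconds: int,
) -> List[Tuple[str, int]]:
    """List the steps as ("try", attempt_number) and ("wait", seconds) tuples."""
    steps: List[Tuple[str, int]] = []
    attempt = 0
    # One initial round plus one round per second-level retry.
    for round_index in range(second_level_retries + 1):
        if round_index > 0:
            # The wait grows by the same increment before each later round.
            steps.append(("wait", round_index * time_increase_seconds))
        for _ in range(tries_per_round):
            attempt += 1
            steps.append(("try", attempt))
    return steps


schedule = retry_schedule(tries_per_round=3, second_level_retries=3, time_increase_seconds=10)
waits = [value for kind, value in schedule if kind == "wait"]
tries = [value for kind, value in schedule if kind == "try"]
```

For the configuration shown, this gives waits of 10, 20, and 30 seconds and 12 tries in total before the message moves to the error queue.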
First Level Retries exist to compensate for quick issues like networking and database locks. This is configurable in NSB, so if you don't want them, you can turn them off. Second Level Retries are to compensate for longer outages. For example we use SLRs to compensate for a database that recycles every night at the same time.
The OOTB functionality increases the duration between SLRs because it assumes that if it didn't work the previous time, you will need more time to fix it. There exists a Retry Policy that is overridable, so you can change how the SLRs work.
In NSB, the FLRs always come first, and SLRs don't come into play unless the transaction is still failing after the FLRs. In addition, you can disable SLRs altogether and build your own custom Fault Manager which has additional functionality. We have a process where a Fault Manager sends issues to a staffed help desk, as that is the only way to solve a particular subset of issues.