What is the point of the immediate multiple retries in messaging systems? - rabbitmq

I've recently been reading up on messaging systems and have specifically looked at both RabbitMQ and NServiceBus. As I have understood it, if a message fails for some reason it is tried again immidiately a number of times. Both systems then offers the possibility to try again later, for example in 5 seconds. When the five seconds have passed the message is sent again a number of times.
I quote Vaughn Vernon in Implementing Domain-Driven Design (p.502):
The other way to handle this is to simply retry the send until it succeeds, perhaps using a Capped Exponential Back-off. In the case of RabbitMQ, retries could fail for quite a while. Thus, using a combination of message NAKs and retries could be the best approach. Still, if our process retries three times every five minutes, it could be all we need.
For NServiceBus, this is called second level retries, and when the retry happens, it happens multiple times.
Why does it need to happen multiple times? Why does it not retry once every five minutes? What is the chance that the first retry after five minutes fails and the second retry, probably just milliseconds later, should succeed?
And in case it does not need to due to some configuration (does it?), why do all the examples I have found have multiple retries?

My background is NServiceBus so my answer may be couched in those terms.
First level retries are great for very transient errors. Deadlocks are a perfect example of this. You try to change the database, and your transaction is chosen as the deadlock victim. In these cases, a first level retry is perfect. Most of the time, one first level retry is all you need. If there is a lot of contention in the database, maybe 2 or 3 retries will be good enough.
Second level retries are for your less transient errors. Think about things like a web service being down for 10 seconds, or a SQL Server database in a failover cluster switching over, which can take 30-60 seconds. If you retry a few milliseconds later, it's not going to do you any good, but 10, 20, 30 seconds later you might have a good shot.
However, the crux of the question is after 5 first level retries and then a delay, why try again 5 times before an additional delay?
First, on your first second-level retry, it's still possible that you could get a deadlock or other very transient error. After all, the goal is usually not to make as slow a system as possible so it would be preferable to not have to wait an additional delay before retrying if the problem is truly transient. Of course there's no way for the infrastructure to know just how transient the problem is.
The second reason is that it's just easier to configure if they're all the same. X levels of retry and Y tries per level = X*Y total tries and only 2 numbers in the configuration file. In NServiceBus, it's these 2 values plus the back-off time span, so the config looks like this:
<SecondLevelRetriesConfigEnabled="true" TimeIncrease ="00:00:10" NumberOfRetries="3" />
<TransportConfig MaxRetries="3" />
That's fairly simple. Try 3 times. Wait 10 seconds. Try 3 times. Wait 20 seconds. Try 3 times. Wait 30 seconds. Try 3 times. Then you're done and you move on to an error queue.
Configuring different values for each level would require a much more complex config story.

First Level Retries exist to compensate for quick issues like networking and database locks. This is configurable in NSB, so if you don't want them, you can turn them off. Second Level Retries are to compensate for longer outages. For example we use SLRs to compensate for a database that recycles every night at the same time.
The OOTB functionality increases the duration between SLRs because it assumes that if it didn't work the previous time, you will need more time to fix it. There exists a Retry Policy that is overridable, so you can change how the SLRs work.
In NSB, the FLRs always come first and SLRs don't come into play unless the transaction is still failing after FLRs. In addition, you can disable SLRs altogether and build your own custom Fault Manager which have additionally functionality. We have a process where we have a Fault Manager that sends issues to a staffed help desk, as that is the only way to solve a particular subset of issues.

Related

Are delayed messages in Redis reliable?

I have an architecture solution that relies on the delayed messages.
In short:
There are many clients (mostly mobile devices running android or ios) that can process a given job.
I am creating a job delegation (in RDBMS) for a given client expecting it to be picked up within a certain period of time and the "chosen" client receives a push notification that there is something for it to process. IMO the details about the algorithm of choosing single client out of many is irrelevant to the problem so skipping this part.
When the client pulls a job delegation then the status of it is changed from pending to processing.
As mentioned clients are mobile devices and are often carried by people in move and thus can, due to many reasons, be unable to pull the job delegation from the server and process it.
That's why during the creation of the job delegation, there is also a delayed message dispatched in Redis which is supposed to check in now() + 40 seconds if the job was pulled or not (so if the status is pending or not).
If the delegation hasn't been pulled by the client (status = pending) server times it out and creates a new job delegation with status = pending for a different client. As so on as so for.
It works pretty well except the fact that I've noticed the "check if should timeout" jobs do not ALWAYS run at the time I would expect them to be run. The average is 7 seconds later and the max is 29 seconds later for the analyzed sample of few thousands of jobs. Redis is used as a queue but also as a key-value cache store and in general heavily utilized by the system. May it become that much impacted by the load? I've sort of "reproduced" the issue also on my local environment with a containerized setup with much less load so I doubt it's entirely due to the Redis being busy.
The delay in execution (vs expected) is quite a problem here because it may happen that, especially in case of trying few clients from the list, the total time since creation of the job till it's successfully processed can increase a lot.
So back to the original question. Is the delayed messaging functionality in Redis reliable?
Are there any good recommended docs about it?
Are there any more reliable solutions designed to solve that issue?
Expecting that messages set to be executed in a given timestamp is executed no later than 2-3 seconds from that timestamp.

How to handle very long running processes in NServiceBus

I'm using NServiceBus to handle some asynchronous tasks. Occasionally I have a task where I need to process 10,000 records, so this takes a few hours.
My problem is that when I handle these records all together, I cannot use NServiceBus default transaction handling.
Also - if I split these records up into 10,000 smaller messages, they will clog up MSMQ for a few hours, and users who are expecting functions to take a few minutes, will be waiting hours.
Is there a way in NServiceBus to prioritise different messages?
I'd consider breaking it down into smaller batches (not necessarily one message per record) and having a separate endpoint service specifically for this process so that other stuff is not held up. If breaking it into batches and you care about the when they all complete then I'd recommend using a saga to track that state.

Bigquery streaming inserts taking time

During load testing of our module we found that bigquery insert calls are taking time (3-4 s). I am not sure if this is ok. We are using java biguqery client libarary and on an average we push 500 records per api call. We are expecting a million records per second traffic to our module so bigquery inserts are bottleneck to handle this traffic. Currently it is taking hours to push data.
Let me know if we need more info regarding code or scenario or anything.
Thanks
Pankaj
Since streaming has a limited payload size, see Quota policy it's easier to talk about times, as the payload is limited in the same way to both of us, but I will mention other side effects too.
We measure between 1200-2500 ms for each streaming request, and this was consistent over the last month as you can see in the chart.
We seen several side effects although:
the request randomly fails with type 'Backend error'
the request randomly fails with type 'Connection error'
the request randomly fails with type 'timeout' (watch out here, as only some rows are failing and not the whole payload)
some other error messages are non descriptive, and they are so vague that they don't help you, just retry.
we see hundreds of such failures each day, so they are pretty much constant, and not related to Cloud health.
For all these we opened cases in paid Google Enterprise Support, but unfortunately they didn't resolved it. It seams the recommended option to take for these is an exponential-backoff with retry, even the support told to do so. Which personally doesn't make me happy.
The approach you've chosen if takes hours that means it does not scale, and won't scale. You need to rethink the approach with async processes. In order to finish sooner, you need to run in parallel multiple workers, the streaming performance will be the same. Just having 10 workers in parallel it means time will be 10 times less.
Processing in background IO bound or cpu bound tasks is now a common practice in most web applications. There's plenty of software to help build background jobs, some based on a messaging system like Beanstalkd.
Basically, you needed to distribute insert jobs across a closed network, to prioritize them, and consume(run) them. Well, that's exactly what Beanstalkd provides.
Beanstalkd gives the possibility to organize jobs in tubes, each tube corresponding to a job type.
You need an API/producer which can put jobs on a tube, let's say a json representation of the row. This was a killer feature for our use case. So we have an API which gets the rows, and places them on tube, this takes just a few milliseconds, so you could achieve fast response time.
On the other part, you have now a bunch of jobs on some tubes. You need an agent. An agent/consumer can reserve a job.
It helps you also with job management and retries: When a job is successfully processed, a consumer can delete the job from the tube. In the case of failure, the consumer can bury the job. This job will not be pushed back to the tube, but will be available for further inspection.
A consumer can release a job, Beanstalkd will push this job back in the tube, and make it available for another client.
Beanstalkd clients can be found in most common languages, a web interface can be useful for debugging.

Google BigQuery: Slow streaming inserts performance

We are using BigQuery as event logging platform.
The problem we faced was very slow insertAll post requests (https://cloud.google.com/bigquery/docs/reference/v2/tabledata/insertAll).
It does not matter where they are fired - from server or client side.
Minimum is 900ms, average is 1500s, where nearly 1000ms is connection time.
Even if there is 1 request per second (so no throttling here).
We use Google Analytics measurement protocol and timings from the same machines are 50-150ms.
The solution described in BigQuery streaming 'insertAll' performance with PHP suugested to use queues, but it seems to be overkill because we send no more than 10 requests per second.
The question is if 1500ms is normal for streaming inserts and if not, how to make them faster.
Addtional information:
If we send malformed JSON, response arrives in 50-100ms.
Since streaming has a limited payload size, see Quota policy it's easier to talk about times, as the payload is limited in the same way to both of us, but I will mention other side effects too.
We measure between 1200-2500 ms for each streaming request, and this was consistent over the last month as you can see in the chart.
We seen several side effects although:
the request randomly fails with type 'Backend error'
the request randomly fails with type 'Connection error'
the request randomly fails with type 'timeout' (watch out here, as only some rows are failing and not the whole payload)
some other error messages are non descriptive, and they are so vague that they don't help you, just retry.
we see hundreds of such failures each day, so they are pretty much constant, and not related to Cloud health.
For all these we opened cases in paid Google Enterprise Support, but unfortunately they didn't resolved it. It seams the recommended option to take for these is an exponential-backoff with retry, even the support told to do so. Which personally doesn't make me happy.
Also the failure rate fits the 99.9% uptime we have in the SLA, so there is no reason for objection.
There's something to keep in mind in regards to the SLA, it's a very strictly defined structure, the details are here. The 99.9% is uptime not directly translated into fail rate. What this means is that if BQ has a 30 minute downtime one month, and then you do 10,000 inserts within that period but didn't do any inserts in other times of the month, it will cause the numbers to be skewered. This is why we suggest a exponential backoff algorithm. The SLA is explicitly based on uptime and not error rate, but logically the two correlates closely if you do streaming inserts throughout the month at different times with backoff-retry setup. Technically, you should experience on average about 1/1000 failed insert if you are doing inserts through out the month if you have setup the proper retry mechanism.
You can check out this chart about your project health:
https://console.developers.google.com/project/YOUR-APP-ID/apiui/apiview/bigquery?tabId=usage&duration=P1D
It happens that my response is on the linked other article, and I proposed the queues, because it made our exponential-backoff with retry very easy, and working with queues is very easy. We use Beanstalkd.
To my experience any request to bigquery will take long. We've tried using it as a database for performance data but eventually are moving out due to slow response times. As far as I can see. BQ is built for handling big requests within a 1 - 10 second response time. These are the requests BQ categorizes as interactive. BQ doesn't get faster by doing less. We stream quite some records to BQ but always make sure we batch them up (per table). And run all requests asynchronously (or if you have to in another theat).
PS. I can confirm what Pentium10 sais about faillures in BQ. Make sure you retry the stuff that fails and if it fails again log it to file for retrying it another time.

Fault tolerant / high availability producer

I'm working on a distributed producer/consumer system using a messaging queue. The part that I'm interested in parallelising is the consumer side of it, and I'm happy with what I have for that.
However, I'm not sure what to do about the producer. I only need one producer running at a time since the load of the producing part of my system is not too high, but I want a reliable way of managing it, as in starting, stopping, restarting, and mainly, monitor it so that if the producer host fails another one can pick up.
If it helps, I'm happy with my consumer algorithm, the one that queues jobs, since it's fault tolerant to be down for a period of time and pick up the stuff that happened during the time it was down.
I'm sure there are tools or at least known patterns to do this and not reinvent the wheel.
I'm using rabbitmq but can use activemq, or even refactor into storm or something like that if needed, my code is not complex so far.
After a couple of weeks thinking about it, the simplest solution came to mind, and I'm actually very pleased with it, so I'll share it in case you find it useful, or point out if you think of any downsides, it seems to be working fine so far.
I have created the simplest table in my DB, called heartbeat, with a single timestamp field called ts, and is meant to have a single row all the time.
I start all of my potential producers every 5 minutes (quartz), and they do an update of the table if the ts fields is older than now() - 5 minutes. Because the update call is blocking, I'll have no db threading issues. Now, if the update returns > 0 it means that it actually modified the value of ts, and then I execute the actual producing code (queue jobs). If the update returns 0, it did not modify the table, because someone else did less than 5 minutes ago, and therefore this producer won't do anything until it checks again in 5 minutes.
Obviously the 5 minutes value is configurable, and this allows for a very neat upgrade with small changes to be able to execute several producers at the same time, if I ever had that need.