I'm working on a distributed producer/consumer system using a messaging queue. The part that I'm interested in parallelising is the consumer side of it, and I'm happy with what I have for that.
However, I'm not sure what to do about the producer. I only need one producer running at a time since the load of the producing part of my system is not too high, but I want a reliable way of managing it, as in starting, stopping, restarting, and mainly, monitor it so that if the producer host fails another one can pick up.
If it helps, I'm happy with my consumer algorithm, the one that queues jobs, since it's fault tolerant to be down for a period of time and pick up the stuff that happened during the time it was down.
I'm sure there are tools or at least known patterns to do this and not reinvent the wheel.
I'm using rabbitmq but can use activemq, or even refactor into storm or something like that if needed, my code is not complex so far.
After a couple of weeks thinking about it, the simplest solution came to mind, and I'm actually very pleased with it, so I'll share it in case you find it useful, or point out if you think of any downsides, it seems to be working fine so far.
I have created the simplest table in my DB, called heartbeat, with a single timestamp field called ts, and is meant to have a single row all the time.
I start all of my potential producers every 5 minutes (quartz), and they do an update of the table if the ts fields is older than now() - 5 minutes. Because the update call is blocking, I'll have no db threading issues. Now, if the update returns > 0 it means that it actually modified the value of ts, and then I execute the actual producing code (queue jobs). If the update returns 0, it did not modify the table, because someone else did less than 5 minutes ago, and therefore this producer won't do anything until it checks again in 5 minutes.
Obviously the 5 minutes value is configurable, and this allows for a very neat upgrade with small changes to be able to execute several producers at the same time, if I ever had that need.
Related
I have an architecture solution that relies on the delayed messages.
In short:
There are many clients (mostly mobile devices running android or ios) that can process a given job.
I am creating a job delegation (in RDBMS) for a given client expecting it to be picked up within a certain period of time and the "chosen" client receives a push notification that there is something for it to process. IMO the details about the algorithm of choosing single client out of many is irrelevant to the problem so skipping this part.
When the client pulls a job delegation then the status of it is changed from pending to processing.
As mentioned clients are mobile devices and are often carried by people in move and thus can, due to many reasons, be unable to pull the job delegation from the server and process it.
That's why during the creation of the job delegation, there is also a delayed message dispatched in Redis which is supposed to check in now() + 40 seconds if the job was pulled or not (so if the status is pending or not).
If the delegation hasn't been pulled by the client (status = pending) server times it out and creates a new job delegation with status = pending for a different client. As so on as so for.
It works pretty well except the fact that I've noticed the "check if should timeout" jobs do not ALWAYS run at the time I would expect them to be run. The average is 7 seconds later and the max is 29 seconds later for the analyzed sample of few thousands of jobs. Redis is used as a queue but also as a key-value cache store and in general heavily utilized by the system. May it become that much impacted by the load? I've sort of "reproduced" the issue also on my local environment with a containerized setup with much less load so I doubt it's entirely due to the Redis being busy.
The delay in execution (vs expected) is quite a problem here because it may happen that, especially in case of trying few clients from the list, the total time since creation of the job till it's successfully processed can increase a lot.
So back to the original question. Is the delayed messaging functionality in Redis reliable?
Are there any good recommended docs about it?
Are there any more reliable solutions designed to solve that issue?
Expecting that messages set to be executed in a given timestamp is executed no later than 2-3 seconds from that timestamp.
I am trying to get my head around an issue I have recently encountered and I hope someone will be able to point me in the most reasonable direction of solving it.
I am using Riak KV store and working on CRDT data, where I have some sort of counter inside each CRDT item stored in database.
I have a rabbitmq queue, where each message is a request to increase or decrease a certain amount of aforementioned counters.
Finally, I have a group of service-workers, that listens on the queue, and for each request try to change the amount of counters accordingly.
The issue I have is as follows: While a single worker is processing a request, it may get stuck for a while on a write operation to database – let’s say on a second change of counters out of three. It’s connection with rabbitmq gets lost (timeout) so the message-request gets back on to the queue (I cannot afford to miss one). Then it is picked up by second worker, which begins all processing anew. However, the first worker finishes its work, and as a results I have processed a single message twice.
I can split those increments into single actions, but this still leaves me with dilemma – can still change value of counter twice, if some worker gets stuck on a write operation for a long period.
I do not have possibility of making Riak KV CRDT writes work faster, nor can I accept missing out a message-request. I need to implement some means of checking whether a request was already processed before.
My initial thoughts were to use some alternative, quick KV store for storing rabbitMQ message ID if they are being processed. That way other workers could tell whether they are not starting to process a message that is already parsed elsewhere.
I could use any help and pointers to materials I can read.
You can't have "exactly one delivery" semantic. You can reduce double-sent messages or missed deliveries, so it's up to you to decide which misbehavior is the least inconvenient.
First of all are you sure it's the CRDTs that are too slow ? Are you using simple counters or counters inside maps ? In my experience they are quite fast, although slower than kv. You could try:
- having simple CRDTs (no maps), and more CRDTs objects, to lower their stress( can you split the counters in two ?)
- not using CRDTs but using good old sibling resolution on client side on simple key/values.
- accumulate the count updates orders and apply them in batch, but then you're accepting an increase in latency so it's equivalent to increasing the timeout.
Can you provide some metrics? Like how long the updates take, what numbers you'd expect, if it's as slow when you have few updates or many updates, etc
During load testing of our module we found that bigquery insert calls are taking time (3-4 s). I am not sure if this is ok. We are using java biguqery client libarary and on an average we push 500 records per api call. We are expecting a million records per second traffic to our module so bigquery inserts are bottleneck to handle this traffic. Currently it is taking hours to push data.
Let me know if we need more info regarding code or scenario or anything.
Thanks
Pankaj
Since streaming has a limited payload size, see Quota policy it's easier to talk about times, as the payload is limited in the same way to both of us, but I will mention other side effects too.
We measure between 1200-2500 ms for each streaming request, and this was consistent over the last month as you can see in the chart.
We seen several side effects although:
the request randomly fails with type 'Backend error'
the request randomly fails with type 'Connection error'
the request randomly fails with type 'timeout' (watch out here, as only some rows are failing and not the whole payload)
some other error messages are non descriptive, and they are so vague that they don't help you, just retry.
we see hundreds of such failures each day, so they are pretty much constant, and not related to Cloud health.
For all these we opened cases in paid Google Enterprise Support, but unfortunately they didn't resolved it. It seams the recommended option to take for these is an exponential-backoff with retry, even the support told to do so. Which personally doesn't make me happy.
The approach you've chosen if takes hours that means it does not scale, and won't scale. You need to rethink the approach with async processes. In order to finish sooner, you need to run in parallel multiple workers, the streaming performance will be the same. Just having 10 workers in parallel it means time will be 10 times less.
Processing in background IO bound or cpu bound tasks is now a common practice in most web applications. There's plenty of software to help build background jobs, some based on a messaging system like Beanstalkd.
Basically, you needed to distribute insert jobs across a closed network, to prioritize them, and consume(run) them. Well, that's exactly what Beanstalkd provides.
Beanstalkd gives the possibility to organize jobs in tubes, each tube corresponding to a job type.
You need an API/producer which can put jobs on a tube, let's say a json representation of the row. This was a killer feature for our use case. So we have an API which gets the rows, and places them on tube, this takes just a few milliseconds, so you could achieve fast response time.
On the other part, you have now a bunch of jobs on some tubes. You need an agent. An agent/consumer can reserve a job.
It helps you also with job management and retries: When a job is successfully processed, a consumer can delete the job from the tube. In the case of failure, the consumer can bury the job. This job will not be pushed back to the tube, but will be available for further inspection.
A consumer can release a job, Beanstalkd will push this job back in the tube, and make it available for another client.
Beanstalkd clients can be found in most common languages, a web interface can be useful for debugging.
I've recently been reading up on messaging systems and have specifically looked at both RabbitMQ and NServiceBus. As I have understood it, if a message fails for some reason it is tried again immidiately a number of times. Both systems then offers the possibility to try again later, for example in 5 seconds. When the five seconds have passed the message is sent again a number of times.
I quote Vaughn Vernon in Implementing Domain-Driven Design (p.502):
The other way to handle this is to simply retry the send until it succeeds, perhaps using a Capped Exponential Back-off. In the case of RabbitMQ, retries could fail for quite a while. Thus, using a combination of message NAKs and retries could be the best approach. Still, if our process retries three times every five minutes, it could be all we need.
For NServiceBus, this is called second level retries, and when the retry happens, it happens multiple times.
Why does it need to happen multiple times? Why does it not retry once every five minutes? What is the chance that the first retry after five minutes fails and the second retry, probably just milliseconds later, should succeed?
And in case it does not need to due to some configuration (does it?), why do all the examples I have found have multiple retries?
My background is NServiceBus so my answer may be couched in those terms.
First level retries are great for very transient errors. Deadlocks are a perfect example of this. You try to change the database, and your transaction is chosen as the deadlock victim. In these cases, a first level retry is perfect. Most of the time, one first level retry is all you need. If there is a lot of contention in the database, maybe 2 or 3 retries will be good enough.
Second level retries are for your less transient errors. Think about things like a web service being down for 10 seconds, or a SQL Server database in a failover cluster switching over, which can take 30-60 seconds. If you retry a few milliseconds later, it's not going to do you any good, but 10, 20, 30 seconds later you might have a good shot.
However, the crux of the question is after 5 first level retries and then a delay, why try again 5 times before an additional delay?
First, on your first second-level retry, it's still possible that you could get a deadlock or other very transient error. After all, the goal is usually not to make as slow a system as possible so it would be preferable to not have to wait an additional delay before retrying if the problem is truly transient. Of course there's no way for the infrastructure to know just how transient the problem is.
The second reason is that it's just easier to configure if they're all the same. X levels of retry and Y tries per level = X*Y total tries and only 2 numbers in the configuration file. In NServiceBus, it's these 2 values plus the back-off time span, so the config looks like this:
<SecondLevelRetriesConfigEnabled="true" TimeIncrease ="00:00:10" NumberOfRetries="3" />
<TransportConfig MaxRetries="3" />
That's fairly simple. Try 3 times. Wait 10 seconds. Try 3 times. Wait 20 seconds. Try 3 times. Wait 30 seconds. Try 3 times. Then you're done and you move on to an error queue.
Configuring different values for each level would require a much more complex config story.
First Level Retries exist to compensate for quick issues like networking and database locks. This is configurable in NSB, so if you don't want them, you can turn them off. Second Level Retries are to compensate for longer outages. For example we use SLRs to compensate for a database that recycles every night at the same time.
The OOTB functionality increases the duration between SLRs because it assumes that if it didn't work the previous time, you will need more time to fix it. There exists a Retry Policy that is overridable, so you can change how the SLRs work.
In NSB, the FLRs always come first and SLRs don't come into play unless the transaction is still failing after FLRs. In addition, you can disable SLRs altogether and build your own custom Fault Manager which have additionally functionality. We have a process where we have a Fault Manager that sends issues to a staffed help desk, as that is the only way to solve a particular subset of issues.
We are working with a .NET 3.5 app which is fast approaching legacy status. We have an existing SOAP service which reads records from our database and saves them to a third party MS SQL database, sending all the data rows in a single batch.
This has always worked fine, but recently we've taken on a much larger client than any we've had before, and they are transmitting much larger batches, so much so that they have begun to fail. We've upped the time out and max memory sizes in IIS, and maxed out the maxRequestLength in the web.config, but we are still bumping up against size problems.
So, I understand that long term, we should consider moving away from SOAP and into WCF, and plans for that are in the works. But in the mean time, we need a short term fix for this new client. And of course, to make the business and sales people happy, we need it kinda quickly.
I'm wondering what the best-practice approach might be. Initially I'm thinking something like this, but I could be thinking inside the box too much:
Establish a bench mark of # of records over which we don’t want to attempt to sync all at once.
Before attempting to save the data, check the number of records against that bench mark
If it's above it, then break the transmission down into segments which are each below that benchmark. SELECT TOP 10000 * FROM table WHERE sent = false, etc., if the benchmark is 10000. Then update sent to true for those records once submitted. Repeat.
Obviously, this will slow the process down, so to handle the user experience, we may want to toss in a status bar so they can see the progress.
Am I on the right track?
In addition to the comments from John, you should consider if you are solving the problem in the most optimal way.
It looks like you are triggering a one way sync between 2 database by calling a web service. This approach leads to the time out and memory problems that you are experiencing.
If your goal is to do the one way sync, you could use a free framework such as Microsofts sync framework: http://msdn.microsoft.com/en-US/sync