Are delayed messages in Redis reliable? - redis

I have an architecture solution that relies on the delayed messages.
In short:
There are many clients (mostly mobile devices running android or ios) that can process a given job.
I create a job delegation (in an RDBMS) for a given client, expecting it to be picked up within a certain period of time, and the "chosen" client receives a push notification that there is something for it to process. IMO the details of the algorithm for choosing a single client out of many are irrelevant to the problem, so I'm skipping this part.
When the client pulls a job delegation then the status of it is changed from pending to processing.
As mentioned, clients are mobile devices, often carried by people on the move, and thus can, for many reasons, be unable to pull the job delegation from the server and process it.
That's why during the creation of the job delegation, there is also a delayed message dispatched in Redis which is supposed to check in now() + 40 seconds if the job was pulled or not (so if the status is pending or not).
If the delegation hasn't been pulled by the client (status = pending), the server times it out and creates a new job delegation with status = pending for a different client. And so on and so forth.
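A minimal sketch of the timeout check described above, assuming an in-memory dict in place of the RDBMS table. The function names, the timed_out status string and the remaining_clients argument are hypothetical and only illustrate the flow; the push notification and the delayed dispatch itself are left out.

```python
import itertools

PENDING, PROCESSING, TIMED_OUT = "pending", "processing", "timed_out"

# Stand-in for the RDBMS table (assumption): id -> {"job", "client", "status"}
delegations = {}
_ids = itertools.count(1)


def create_delegation(job_id, client_id):
    """Create a pending delegation. The real system would also send the push
    notification and dispatch the delayed timeout check here."""
    delegation_id = next(_ids)
    delegations[delegation_id] = {"job": job_id, "client": client_id, "status": PENDING}
    return delegation_id


def check_timeout(delegation_id, remaining_clients):
    """Body of the delayed check: re-delegate only if nobody pulled the job."""
    delegation = delegations[delegation_id]
    if delegation["status"] != PENDING:
        return None  # the client pulled it in time, nothing to do
    delegation["status"] = TIMED_OUT
    if not remaining_clients:
        return None  # no one left to try
    return create_delegation(delegation["job"], remaining_clients[0])
```

Because the check reads the current status instead of trusting the schedule, a late run does not corrupt anything; it only postpones the hand-off to the next client, which is exactly the delay being measured.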
It works pretty well, except that I've noticed the "check if should timeout" jobs do not ALWAYS run at the time I would expect them to. On average they run 7 seconds late, and the maximum is 29 seconds late, for the analyzed sample of a few thousand jobs. Redis is used as a queue but also as a key-value cache store, and in general it is heavily utilized by the system. Could it be impacted that much by the load? I've also sort of "reproduced" the issue in my local environment with a containerized setup under much less load, so I doubt it's entirely due to Redis being busy.
The delay in execution (vs. expected) is quite a problem here because, especially when a few clients from the list have to be tried, the total time from creation of the job until it's successfully processed can increase a lot.
So back to the original question. Is the delayed messaging functionality in Redis reliable?
Are there any good recommended docs about it?
Are there any more reliable solutions designed to solve that issue?
I expect a message scheduled for a given timestamp to be executed no later than 2-3 seconds after that timestamp.

Related

RabbitMQ - allow only one process per user

To keep it short, here is a simplified situation:
I need to implement a queue for background processing of imported data files. I want to dedicate a number of consumers to this specific task (let's say 10) so that multiple users can be processed in parallel. At the same time, to avoid problems with concurrent data writes, I need to make sure that no single user is processed by multiple consumers at the same time; basically, all files of a single user should be processed sequentially.
Current solution (but it does not feel right):
Have 1 queue where all import tasks are published (file_queue_main)
Have 10 queues for file processing (file_processing_n)
Have 1 result queue (file_results_queue)
Have a manager process (in this case in node.js) which consumes messages from file_queue_main one by one and decides which file_processing queue to send each message to. Basically, it keeps track of which file_processing queue each user is currently being processed in.
Here is a little animation of my current solution and expected behaviour:
Is RabbitMQ even the tool for the job? For some reason, it feels like some sort of an anti-pattern. Appreciate any help!
The part about this that doesn't "feel right" to me is the manager process. It has to know the current state of each consumer, and it also has to stop and wait if all processors are working on other users. Ideally, you'd prefer to keep each process ignorant of the others. You're also getting very little benefit out of your processing queues, which are only used when a processor is already working on a message from the same user.
Ultimately, the best solution here is going to depend on exactly what your expected usage is and how likely it is that the next message is from a user that is already being processed. If you're expecting most of your messages coming in at any one time to be from 10 users or fewer, what you have might be fine. If you're expecting to be processing messages from many different users with only the occasional duplicate, your processing queues are going to be empty much of the time and you've created a lot of unnecessary complexity.
Other things you could do here:
Have all consumers pull from the same queue and use some sort of distributed locking to prevent collisions. If a consumer gets a message from a user that's already being worked on, requeue it and move on.
Set up your queue routing so that messages from the same user will always go to the same consumer. The downside is that if you don't spread the traffic out evenly, you could have some consumers backed up while others sit idle.
Also, if you're getting a lot of messages in from the same user at once that must be processed sequentially, I would question if they should be separate messages at all. Why not send a single message with a list of things to be processed? Much of the benefit of event queues comes from being able to treat each event as a discrete item that can be processed individually.
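The first option above (one shared queue plus a lock per user, requeueing on collision) can be sketched as follows. This is only an illustration under stated assumptions: a deque stands in for the broker queue, a plain set stands in for the distributed lock, and the function name drain is hypothetical. A real deployment would need a lock shared across processes.

```python
from collections import deque


def drain(queue, locked_users, process):
    """One pass of a consumer over the shared queue.

    `locked_users` stands in for a distributed lock (assumption: a set).
    Returns the messages that were requeued because their user was busy.
    """
    requeued = []
    for _ in range(len(queue)):
        user_id, payload = queue.popleft()
        if user_id in locked_users:
            # Another consumer is working on this user: put it back, move on.
            queue.append((user_id, payload))
            requeued.append((user_id, payload))
            continue
        locked_users.add(user_id)          # acquire the per-user lock
        try:
            process(user_id, payload)
        finally:
            locked_users.discard(user_id)  # always release, even on failure
    return requeued
```

Note that requeueing moves a message behind everything else in the queue, so if strict per-user ordering matters, that has to be checked against how the chosen broker handles requeued messages.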
If the user has a unique ID, or the file being worked on has a unique ID then hash the ID to get the processing queue to enter. That way you will always have the same user / file task queued on the same processing queue.
I am not sure how this will affect queue length for the processing queues.
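A small sketch of the hashing idea, assuming ten queues named file_processing_0 through file_processing_9 (the question only says file_processing_n, so the numbering is an assumption) and a hypothetical helper queue_for. It uses hashlib rather than Python's built-in hash(), because hash() of a string is randomized per process and would not give the same queue on every producer.

```python
import hashlib

NUM_QUEUES = 10  # matches the ten processing queues in the question


def queue_for(user_id, num_queues=NUM_QUEUES):
    """Map a user ID to a processing queue name, the same one every time."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % num_queues
    return f"file_processing_{index}"
```

Every message for a given user then lands on the same queue, so a single consumer per queue processes that user's files sequentially. How evenly the queues fill depends on how the traffic is distributed across users.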

When to use - Delayed Job vs RabbitMQ

Can someone clarify the advantages of using RabbitMQ (a message queue) instead of Delayed Job (background processing)?
Basically, I want to know when to use background processing and when to use a message queue.
My web application has 3 components one main server which will handle all user requests and 2 app servers where all the background jobs(like es reindex, es record update, sending emails, crons) should be run.
I saw articles which say that using a database as a queue (Delayed Job) is very bad, as the consumers will be polling the database for new jobs and updating the statuses of jobs, which will lock the tables. How do RabbitMQ and other message queues store jobs to avoid this problem?
There are other alternatives to Delayed Job, like Sidekiq, which runs on Redis instead of MySQL. Is it better to use Sidekiq instead of RabbitMQ?
And are there any advantages of using sidekiq over delayed job?
You have 2 workers and 1 web server: I guess your web app dispatches some delayed jobs to your workers. So you need a way to store the data related to those background jobs.
For that, you can use either a database (like Redis, which is what Sidekiq does) or a message queue (like RabbitMQ). A message queue is a specialized system that is very efficient for this use case (allowing a much higher throughput). A database gives you better introspection (as you can query the jobs table to see what your current situation is), while the queuing system is more efficient but is also more of a black box and will require new skills.
If you do not have performance issues, the simpler the better, even a simple mysql database should be enough. If you want a more powerful system or need a lot of monitoring you can also consider using a specialized hosted service such as zenaton (I'm founder) that will do all the heavy lifting for you, including scheduling or more sophisticated orchestration of your background jobs.
Both perform the same task, i.e executing jobs in the background, but go about it differently.
With Delayed Job, one uses some sort of database for storage, then queries for the jobs and processes them. It's simple to set up, but the performance and scalability aren't great.
RabbitMQ and alternatives such as Redis are harder to set up, but their performance, flexibility and scalability are great: we are talking upwards of 5000 jobs per second. Besides, you tend to write less code.
Another option is to use a task orchestration system like Cadence Workflow. It supports both delayed execution and queueing, but provides a higher-level programming model and tons of features that neither queues nor delayed execution frameworks offer.
Cadence offers a lot of advantages over using queues for task processing.
Built-in exponential retries with unlimited expiration interval
Failure handling. For example, it allows executing a task that notifies another service if both updates couldn't succeed during a configured interval.
Support for long running heartbeating operations
Ability to implement complex task dependencies. For example, to implement chaining of calls or compensation logic in case of unrecoverable failures (SAGA)
Gives complete visibility into the current state of the update. For example, when using queues all you know is whether there are some messages in a queue, and you need an additional DB to track the overall progress. With Cadence every event is recorded.
Ability to cancel an update in flight.
Built in distributed CRON
See the presentation that goes over Cadence programming model.

Bigquery streaming inserts taking time

During load testing of our module we found that BigQuery insert calls are taking a long time (3-4 s). I am not sure if this is OK. We are using the Java BigQuery client library and on average we push 500 records per API call. We are expecting traffic of a million records per second to our module, so BigQuery inserts are the bottleneck in handling this traffic. Currently it is taking hours to push data.
Let me know if we need more info regarding code or scenario or anything.
Thanks
Pankaj
Since streaming has a limited payload size (see Quota policy), it's easier to talk about times, as the payload is limited in the same way for both of us, but I will mention other side effects too.
We measure between 1200-2500 ms for each streaming request, and this was consistent over the last month as you can see in the chart.
We have seen several side effects, though:
the request randomly fails with type 'Backend error'
the request randomly fails with type 'Connection error'
the request randomly fails with type 'timeout' (watch out here, as only some rows are failing and not the whole payload)
some other error messages are non descriptive, and they are so vague that they don't help you, just retry.
we see hundreds of such failures each day, so they are pretty much constant, and not related to Cloud health.
For all of these we opened cases with paid Google Enterprise Support, but unfortunately they didn't resolve them. It seems the recommended option for these is exponential backoff with retry; even support told us to do so. Personally, that doesn't make me happy.
If the approach you've chosen takes hours, it does not scale and won't scale. You need to rethink the approach with async processes. To finish sooner, you need to run multiple workers in parallel; the streaming performance of each request will be the same. Just having 10 workers in parallel means the total time will be roughly 10 times less.
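A hedged sketch of the two ideas above, parallel workers plus exponential backoff with retry, written in Python for brevity rather than with the Java client. The insert callable is a placeholder for whatever actually sends a batch to BigQuery, and all names and default values (500 rows per batch, 10 workers, 5 attempts) are assumptions for illustration.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def chunks(rows, size):
    """Split rows into lists of at most `size` items."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def insert_with_backoff(insert, batch, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call insert(batch), retrying with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return insert(batch)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up; the caller can log or park the batch
            # Wait 0.5s, 1s, 2s, ... plus a random jitter before retrying.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))


def stream_all(insert, rows, batch_size=500, workers=10):
    """Send all batches through a pool of workers; returns per-batch results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(insert_with_backoff, insert, batch)
            for batch in chunks(rows, batch_size)
        ]
        return [future.result() for future in futures]
```

This retries the whole batch on any error. Since a timeout may fail only some rows of a payload, a production version would need to inspect the response and resend only the rows that failed.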
Processing in background IO bound or cpu bound tasks is now a common practice in most web applications. There's plenty of software to help build background jobs, some based on a messaging system like Beanstalkd.
Basically, you need to distribute insert jobs across a closed network, prioritize them, and consume (run) them. Well, that's exactly what Beanstalkd provides.
Beanstalkd gives the possibility to organize jobs in tubes, each tube corresponding to a job type.
You need an API/producer which can put jobs on a tube, let's say a JSON representation of the row. This was a killer feature for our use case. So we have an API which gets the rows and places them on a tube; this takes just a few milliseconds, so you can achieve a fast response time.
On the other side, you now have a bunch of jobs on some tubes. You need an agent. An agent/consumer can reserve a job.
It helps you also with job management and retries: When a job is successfully processed, a consumer can delete the job from the tube. In the case of failure, the consumer can bury the job. This job will not be pushed back to the tube, but will be available for further inspection.
A consumer can release a job, Beanstalkd will push this job back in the tube, and make it available for another client.
Beanstalkd clients can be found in most common languages, a web interface can be useful for debugging.
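To make the job lifecycle above concrete (put, reserve, delete, bury, release), here is a toy in-memory model of a single tube. It is not a Beanstalkd client and does not talk to a server; the class Tube and the function work_one are hypothetical names used only to show how the states relate.

```python
import json
from collections import deque


class Tube:
    """Toy in-memory stand-in for one Beanstalkd tube (not a real client)."""

    def __init__(self):
        self.ready = deque()   # jobs waiting for a consumer
        self.reserved = {}     # job id -> body, currently held by a consumer
        self.buried = {}       # job id -> body, parked for inspection
        self._next_id = 1

    def put(self, body):
        job_id = self._next_id
        self._next_id += 1
        self.ready.append((job_id, body))
        return job_id

    def reserve(self):
        if not self.ready:
            return None
        job_id, body = self.ready.popleft()
        self.reserved[job_id] = body
        return job_id, body

    def delete(self, job_id):
        del self.reserved[job_id]

    def release(self, job_id):
        # Give the job back so another consumer can reserve it.
        self.ready.append((job_id, self.reserved.pop(job_id)))

    def bury(self, job_id):
        self.buried[job_id] = self.reserved.pop(job_id)


def work_one(tube, handler):
    """Reserve one job; delete it on success, bury it on failure."""
    job = tube.reserve()
    if job is None:
        return False
    job_id, body = job
    try:
        handler(json.loads(body))
    except Exception:
        tube.bury(job_id)
    else:
        tube.delete(job_id)
    return True
```

In this model the producer's put returns immediately, which mirrors the fast API response described above, while the slow insert happens later in work_one.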

Google BigQuery: Slow streaming inserts performance

We are using BigQuery as event logging platform.
The problem we faced was very slow insertAll post requests (https://cloud.google.com/bigquery/docs/reference/v2/tabledata/insertAll).
It does not matter where they are fired - from server or client side.
Minimum is 900ms, average is 1500ms, of which nearly 1000ms is connection time.
Even if there is 1 request per second (so no throttling here).
We use Google Analytics measurement protocol and timings from the same machines are 50-150ms.
The solution described in BigQuery streaming 'insertAll' performance with PHP suggested using queues, but that seems to be overkill because we send no more than 10 requests per second.
The question is if 1500ms is normal for streaming inserts and if not, how to make them faster.
Additional information:
If we send malformed JSON, response arrives in 50-100ms.
Since streaming has a limited payload size (see Quota policy), it's easier to talk about times, as the payload is limited in the same way for both of us, but I will mention other side effects too.
We measure between 1200-2500 ms for each streaming request, and this was consistent over the last month as you can see in the chart.
We have seen several side effects, though:
the request randomly fails with type 'Backend error'
the request randomly fails with type 'Connection error'
the request randomly fails with type 'timeout' (watch out here, as only some rows are failing and not the whole payload)
some other error messages are non descriptive, and they are so vague that they don't help you, just retry.
we see hundreds of such failures each day, so they are pretty much constant, and not related to Cloud health.
For all of these we opened cases with paid Google Enterprise Support, but unfortunately they didn't resolve them. It seems the recommended option for these is exponential backoff with retry; even support told us to do so. Personally, that doesn't make me happy.
Also the failure rate fits the 99.9% uptime we have in the SLA, so there is no reason for objection.
There's something to keep in mind regarding the SLA: it's a very strictly defined structure, and the details are here. The 99.9% is uptime, which does not translate directly into a failure rate. This means that if BQ has 30 minutes of downtime in one month, and you do 10,000 inserts within that period but no inserts at other times of the month, the numbers will be skewed. This is why we suggest an exponential backoff algorithm. The SLA is explicitly based on uptime and not error rate, but logically the two correlate closely if you do streaming inserts throughout the month at different times with a backoff-retry setup. Technically, you should experience on average about 1 failed insert in 1,000 if you are doing inserts throughout the month and have set up a proper retry mechanism.
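The arithmetic behind this can be worked through in a few lines. The sketch assumes a 30-day month and treats uptime as a plain fraction of minutes, which is a simplification; the actual SLA defines its terms more strictly.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumes a 30-day month: 43,200 minutes


def uptime_fraction(downtime_minutes, total_minutes=MINUTES_PER_MONTH):
    """Share of the month during which the service was up."""
    return 1 - downtime_minutes / total_minutes


# Downtime that still fits 99.9% uptime: about 43.2 minutes per month.
allowed_downtime = MINUTES_PER_MONTH * (1 - 0.999)

# Scenario from the answer: 30 minutes of downtime, and all 10,000 inserts
# of the month happen to fall inside that window.
uptime = uptime_fraction(30)      # about 0.9993, still within 99.9%
burst_rate = 10_000 / 10_000      # every insert failed

# Same downtime, but inserts spread evenly over the month: the share that
# lands in the outage equals the downtime share.
spread_rate = 30 / MINUTES_PER_MONTH   # about 0.0007, under 1 in 1,000
```

So the same 30-minute outage looks like a 100% failure rate to a client that only writes during the outage, and like roughly 0.07% to a client that writes evenly all month, while the uptime figure is identical in both cases.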
You can check out this chart about your project health:
https://console.developers.google.com/project/YOUR-APP-ID/apiui/apiview/bigquery?tabId=usage&duration=P1D
It happens that my response is on the linked other article, and I proposed the queues, because it made our exponential-backoff with retry very easy, and working with queues is very easy. We use Beanstalkd.
In my experience, any request to BigQuery will take a long time. We've tried using it as a database for performance data but are eventually moving away from it due to slow response times. As far as I can see, BQ is built for handling big requests within a 1-10 second response time. These are the requests BQ categorizes as interactive. BQ doesn't get faster by doing less. We stream quite a few records to BQ but always make sure we batch them up (per table), and we run all requests asynchronously (or, if you have to, in another thread).
PS. I can confirm what Pentium10 says about failures in BQ. Make sure you retry whatever fails, and if it fails again, log it to a file so it can be retried another time.
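A rough sketch of the two habits described here: batching rows per table, and logging a batch to a file when it still fails after a retry. The send callable is a placeholder for the real streaming call, and the function names, the one-line-of-JSON log format and the single retry are all assumptions.

```python
import json
from collections import defaultdict


def batch_by_table(records, batch_size):
    """Group (table, row) pairs per table, then cut each group into batches."""
    per_table = defaultdict(list)
    for table, row in records:
        per_table[table].append(row)
    batches = []
    for table, rows in per_table.items():
        for start in range(0, len(rows), batch_size):
            batches.append((table, rows[start:start + batch_size]))
    return batches


def send_or_log(send, table, rows, log_file, retries=1):
    """Try a batch, retry it, and append it to a file if it still fails."""
    for _ in range(retries + 1):
        try:
            send(table, rows)
            return True
        except Exception:
            continue  # try again until the attempts are used up
    # Still failing: keep the batch as one JSON line for a later replay.
    log_file.write(json.dumps({"table": table, "rows": rows}) + "\n")
    return False
```

Writing one JSON object per line keeps the log easy to replay later: each line can be read back and passed to the same send call.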

Fault tolerant / high availability producer

I'm working on a distributed producer/consumer system using a messaging queue. The part that I'm interested in parallelising is the consumer side of it, and I'm happy with what I have for that.
However, I'm not sure what to do about the producer. I only need one producer running at a time since the load of the producing part of my system is not too high, but I want a reliable way of managing it, as in starting, stopping, restarting, and mainly, monitor it so that if the producer host fails another one can pick up.
If it helps, I'm happy with my producer algorithm, the one that queues jobs, since it tolerates being down for a period of time and then picks up whatever happened while it was down.
I'm sure there are tools or at least known patterns to do this and not reinvent the wheel.
I'm using rabbitmq but can use activemq, or even refactor into storm or something like that if needed, my code is not complex so far.
After a couple of weeks thinking about it, the simplest solution came to mind, and I'm actually very pleased with it, so I'll share it in case you find it useful, or point out if you think of any downsides, it seems to be working fine so far.
I have created the simplest table in my DB, called heartbeat, with a single timestamp field called ts, which is meant to hold a single row at all times.
I start all of my potential producers every 5 minutes (quartz), and they do an update of the table if the ts field is older than now() - 5 minutes. Because the update call is blocking, I'll have no DB threading issues. Now, if the update returns > 0, it means that it actually modified the value of ts, and then I execute the actual producing code (queue jobs). If the update returns 0, it did not modify the table, because someone else did so less than 5 minutes ago, and therefore this producer won't do anything until it checks again in 5 minutes.
Obviously the 5 minutes value is configurable, and this allows for a very neat upgrade with small changes to be able to execute several producers at the same time, if I ever had that need.