Redis, XREADGROUP stream with block for a year, stupid? - redis

Are there any downsides to telling XREADGROUP to block until there is a message rather than the client having to poll?
From:
https://redis.io/commands/xreadgroup
It is not clear what this means:
"On the other side when XREADGROUP blocks, XADD will pay the O(N) time in order to serve the N clients blocked on the stream getting new data."
Can someone shed some light on the blocking mechanisms of streams in Redis?

"On the other side when XREADGROUP blocks, XADD will pay the O(N) time in order to serve the N clients blocked on the stream getting new data."
Say the stream is empty and N clients call XREADGROUP with different group names. Since the stream is empty, these clients will block until there is a new message.
When you call XADD to add a message to the stream, Redis needs to send replies to these N blocked clients. That's why XADD pays O(N) time.
Are there any downsides to telling XREADGROUP to block until there is a message rather than the client having to poll?
If N is very large, i.e. too many clients are blocked on the stream, the XADD command might block Redis for a while, since Redis executes commands on a single thread. If N is small, there won't be a noticeable performance impact.
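A minimal sketch of a blocking consumer, assuming a redis-py-style client is passed in by the caller. The function name `consume`, the `handle` callback, and the `stop` event are illustrative and not part of the question. It uses a finite BLOCK timeout so the loop can notice a shutdown request; a timeout of 0 would block indefinitely.

```python
import threading


def consume(client, stream, group, consumer, handle, stop, block_ms=5000):
    """Read new messages for one consumer until `stop` is set.

    `client` is assumed to expose redis-py's xreadgroup/xack methods.
    """
    processed = 0
    while not stop.is_set():
        # ">" asks for messages never delivered to this group before.
        # block_ms is the BLOCK timeout in milliseconds; the call returns
        # an empty result when it expires (0 would mean block forever).
        reply = client.xreadgroup(group, consumer, {stream: ">"},
                                  count=10, block=block_ms)
        for _stream_name, messages in reply or []:
            for message_id, fields in messages:
                handle(message_id, fields)
                # Acknowledge only after the handler returned without raising.
                client.xack(stream, group, message_id)
                processed += 1
    return processed
```

With a real server the caller would construct the client and create the group (XGROUP CREATE) beforehand; both steps are left out of this sketch.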

Related

Is it a good practice to create a channel for each user in redis message bus

We are using redis message bus and handling messages using a channel. But if our application is deployed in multiple instances then the request and response is passed to all the instances. To avoid this scenario which of the below approach is better?
Create a channel for each instance of the application
Create a channel for each user
Any suggestions will be highly appreciated
The limiting factor here is the number of subscribers to the same channel; the number of channels itself can be large. So you can choose the granularity accordingly. Read more here:
https://groups.google.com/forum/#!topic/redis-db/R09u__3Jzfk
All the complexity on the end is on the PUBLISH command, that performs
an amount of work that is proportional to:
a) The number of clients receiving the message.
b) The number of clients subscribed to a pattern, even if they'll not
match the message.
This means that if you have N clients subscribed to 100000 different
channels, everything will be super fast.
If you have instead 10000 clients subscribed to the same channel,
PUBLISH commands against this channel will be slow, and take maybe a
few milliseconds (not sure about the actual time taken). Since we have
to send the same message to everybody.
A similar question was asked before: How does Redis PubSub subscribe mechanism works?
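The cost described in the quoted post can be illustrated with a toy model. This is a pure-Python simulation, not Redis code: `publish_work` is a hypothetical helper, and `fnmatchcase` only approximates Redis glob patterns. It counts the deliveries and the pattern checks one PUBLISH would cause.

```python
from fnmatch import fnmatchcase


def publish_work(channel, channel_subs, pattern_subs):
    """Return (deliveries, pattern_checks) for one simulated PUBLISH.

    channel_subs: dict mapping channel name -> set of client ids
    pattern_subs: list of (glob_pattern, client_id) pairs
    """
    deliveries = len(channel_subs.get(channel, ()))  # direct subscribers
    pattern_checks = 0
    for pattern, _client in pattern_subs:
        pattern_checks += 1  # every pattern is tested, match or not
        if fnmatchcase(channel, pattern):
            deliveries += 1
    return deliveries, pattern_checks
```

In this model, a thousand channels with one subscriber each cost one delivery per publish, while a single channel with 10000 subscribers costs 10000 deliveries per publish.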

Why Redis Publish operation's time complexity is O(M+N)

As per Redis website, Time complexity of a publish is O(N+M).
Time complexity: O(N+M) where N is the number of clients subscribed to the receiving channel and M is the total number of subscribed patterns (by any client).
I might be wrong, but why can't it send the message to all the clients asynchronously using some async API (as writing a message to a network stream is an I/O-bound operation)?

message queue for processing streams of data

I have streams of input data that I process. Each stream is sent in chunks of data. I can only process the N+1st chunk of data of a stream i after I finished processing the Nth chunk of data of the same stream i. Therefore, parallelization can happen by processing multiple streams at once, but I can never split one stream on multiple workers.
Chunks of one stream are added to the queue in order (although chunks from several streams can be added at the same time).
Most message queues, like RabbitMQ, guarantee ordered delivery when multiple workers operate on one queue. However, to achieve the behaviour I would like, I'd need to restrict the number of workers to 1 for each queue, so that the next chunk is always only processed when the previous chunk was finished. To parallelize, I could create a queue for each stream, or a queue for each worker, and have another process that redirects the streams to the worker queues. In fact, the one-queue-per-worker approach is what I do right now, using RabbitMQ's consistent-hashing and shovels. Of course, in terms of load balancing and dynamic scaling of the number of workers, that is far from ideal.
I've read a lot about Kafka, and how it is designed for time-series data (like logs). Yet, I couldn't figure out how I could apply Kafka - or any other message queue out there - to solve my problem.
I would greatly appreciate some hints on how to best use a message queue for my problem.
You could use Kafka, but you'd have to use some stream identification to hash messages on the Producer side, so that messages from one stream always go to the same partition.
Then, on the Consumer side, you'd have to use the low-level consumer to spawn as many consumer threads as you have partitions, where each thread would consume from a single partition.
That would mean that you always process messages in order within each of your streams.
I haven't yet checked out how Kafka 0.9 Producer works, but there were some changes, so you should probably look into those if you want to use the latest version.
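The routing idea can be sketched without Kafka. This is a plain-Python illustration: `partition_for` and `route` are hypothetical helpers, and CRC32 merely stands in for whatever hash the producer's partitioner applies to the message key.

```python
import zlib
from collections import defaultdict


def partition_for(stream_id, num_partitions):
    """Map a stream id to a partition with a stable hash (CRC32 here)."""
    return zlib.crc32(stream_id.encode("utf-8")) % num_partitions


def route(chunks, num_partitions):
    """Group (stream_id, chunk) pairs by partition, preserving arrival order."""
    partitions = defaultdict(list)
    for stream_id, chunk in chunks:
        partitions[partition_for(stream_id, num_partitions)].append((stream_id, chunk))
    return partitions
```

Because every chunk of a stream lands in the same partition, and each partition is read by a single thread, chunks of one stream are processed in the order they were produced.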
Why don't you push the next chunk only after receiving the worker's delivery acknowledgement for the previous chunk? Alternatively, use a flag that the worker sets to true once the previous chunk has been processed, and push the next chunk only then.
If you need to parallelize the work, create several queues with unique routing keys and push the chunks to the respective queues based on those routing keys. Keep a separate flag for every routing key.

AMQP basic.get concurrent consumers pulling from Queue

When using RabbitMQ as Message Broker, I have a scenario where multiple concurrent consumers pull messages from a Queue using the basic.get AMQP method and use explicit acknowledgement for deleting the message from the Queue. Assuming the following setup
Q has messages M1, M2, M3 and has consumers C1, C2 and C3 (each having its own connection and channel) connected to it.
How is concurrency handled in the basic.get method? Is the call to basic.get method synchronized to handle concurrent consumers each using its own connection and channel? C1, C2 and C3 issue a basic.get call to receive a message at the same time (assume the server receives all 3 requests simultaneously).
C1 requests a message using basic.get and gets M1. When C2 requests a message, since it's using a different connection, does it get M1 again?
How can consumers pull messages in batches of a predefined size?
Your questions really hit at the heart of queuing and process theory, so I will answer from that standpoint (RabbitMQ is really a generic message broker as far as my answers are concerned, as this applies to any message broker).
How is concurrency handled in the basic.get method? Is the call to
basic.get method synchronized to handle concurrent consumers each
using its own connection and channel? C1, C2 and C3 issue a basic.get
call to receive a message at the same time (assume the server receives
all 3 requests simultaneously).
Answer 1: RabbitMQ is designed to be a reliable message broker. It contains internal processes and controls to ensure that the same message does not get passed out multiple times to different consumers. Now, due to the impracticality of testing the scenario that you describe, does it work perfectly? Who knows. That is why properly-designed applications using message-based architecture will use idempotent transactions, such that if the same transaction is processed multiple times, the result will be the same as if the transaction was processed once.
Takeaway: Design your application so that the answer to this question is unimportant.
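A minimal sketch of idempotent handling, assuming every message carries a unique id. The class and field names are illustrative; in a real system the set of seen ids would live in durable storage and be updated together with the state change.

```python
class IdempotentProcessor:
    """Apply each message at most once, keyed by a unique message id."""

    def __init__(self):
        self.seen = set()   # ids already applied (in memory for this sketch)
        self.balance = 0    # example state changed by the messages

    def process(self, message_id, amount):
        if message_id in self.seen:
            return False    # duplicate delivery: ignore it
        self.balance += amount
        self.seen.add(message_id)
        return True
```

Delivering the same message twice then leaves the state exactly as if it had been delivered once.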
C1 requests a message using basic.get and gets M1. When C2 requests
for a message, since it's using a different connection, does it get M1
again?
Answer 2: No. Subject to the assumptions of my previous answer, the RabbitMQ broker will not serve the same message back once it has been delivered. Depending on the settings of the channel and queue, the message may be automatically acknowledged upon delivery and will never be redelivered. Other settings will have the message requeue automatically upon the "death" of the processing thread/channel or a negative acknowledgment from your processing thread. This is important functionality, since a "poison" message could repeatedly wreak havoc in your application if it could be served up to multiple consumers. Takeaway: you may safely rely on this assumption in designing your application.
How can consumers pull messages in batches of a predefined size?
Answer 3: They can't, nor would it make sense for them to. In any queuing system, the fundamental assumption is that items are removed from the queue in single file. Attempts to violate this assumption result in unpredictable behavior; furthermore, single-piece flow is commonly the most efficient method of processing. However, in the real world, there are cases where batch sizes > 1 are necessary. In such cases, it makes sense to load the batch into its own single message. This may require a separate processing thread that pulls messages from the queue and batches them together, or the producer may put them in batches initially. Keep in mind that once you have multiple consumers, there is no possible way to guarantee single messages will be processed in order. Takeaway: Batching should be avoided wherever possible, but where it is not practical to avoid, you may not assume that batches will contain individual messages in any particular order.
You might want to read the RabbitMQ API guide and the introduction to AMQP.
First of all, avoid consuming messages using basicGet in your consumers. Rather, use the Consumer interface with basicConsume. This allows RabbitMQ to push messages to you as they arrive on the queue. Everything else is a waste of resources here, as it boils down to busy polling.
When using basicConsume, RabbitMQ will even push more messages to you in the background, up to a certain prefetch count. This allows you to process multiple messages concurrently and minimizes the time you need to wait for your next message (if one is available).
Concurrency is not an issue at all, that's what you're using a queue for!
When having multiple consumers on one queue, a message will always only be delivered to one consumer (as long as the message is ACKed). Otherwise you need private queues for each consumer and route your messages accordingly.
Btw, if you're able to share the connection among your consumers, you should do so.
Just make sure to use one channel per thread.
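The answer refers to the Java client's basicConsume; the same idea can be sketched in Python, assuming a channel object that behaves like a pika `BlockingChannel`. The function name `start_consumer` and the `handle` callback are illustrative.

```python
def start_consumer(channel, queue, handle, prefetch=10):
    """Register a push-style consumer on an already-open channel.

    `channel` is assumed to behave like a pika BlockingChannel.
    """
    # Let the broker push at most `prefetch` unacknowledged messages.
    channel.basic_qos(prefetch_count=prefetch)

    def on_message(ch, method, properties, body):
        handle(body)
        # Explicit ack after successful handling; unacked messages are
        # requeued by the broker if this channel closes.
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue=queue, on_message_callback=on_message,
                          auto_ack=False)
    channel.start_consuming()  # blocks and dispatches deliveries
```

Opening the connection and channel is left to the caller, which also makes it easy to share one connection and use one channel per thread as suggested above.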
There is no special configuration required for that scenario. Each client will atomically fetch and receive one message from the queue, just as you would like to happen.

Redis Pub/Sub with Reliability

I've been looking at using Redis Pub/Sub as a replacement to RabbitMQ.
From my understanding Redis's pub/sub holds a persistent connection to each of the subscribers, and if the connection is terminated, all future messages will be lost and dropped on the floor.
One possible solution is to use a list (and blocking wait) to store all the message and pub/sub as just a notification mechanism. I think this gets me most of the way there, but I still have some concerns about the failure cases.
What happens when a subscriber dies and comes back online? How should it process all its pending messages?
When a malformed message comes through the system, how do you handle those exceptions? A dead-letter queue?
Is there a standard practice for implementing a retry policy?
When a subscriber (consumer) dies, your list will continue to grow until the client returns. Your producer could trim the list (from either side) once it reaches a specific limit, but that is something you would need to handle at the application level. If you include a timestamp within each message, your consumer can then act on the age of a message, assuming you have application logic you want to enforce on message age.
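A minimal sketch of the capped list with timestamps, assuming a redis-py-style client. The helper names `enqueue` and `dequeue_fresh` and the JSON envelope are assumptions for illustration. Note that LPUSH and LTRIM are issued as two separate commands here, so the cap is not enforced atomically.

```python
import json
import time


def enqueue(client, key, payload, max_len=1000):
    """Push a timestamped message and cap the list at max_len entries."""
    message = json.dumps({"ts": time.time(), "payload": payload})
    client.lpush(key, message)
    # Keep only the newest max_len entries (index 0 is the newest).
    client.ltrim(key, 0, max_len - 1)


def dequeue_fresh(client, key, max_age_s, timeout_s=5):
    """Block for the oldest message; return None on timeout or if too old."""
    item = client.brpop(key, timeout=timeout_s)
    if item is None:
        return None
    _key, raw = item
    message = json.loads(raw)
    if time.time() - message["ts"] > max_age_s:
        return None  # stale: the caller may log or dead-letter it
    return message["payload"]
```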
I'm not sure how a malformed message would enter the system, as the connection to Redis is usually TCP with its integrity assurances. But if this happens, perhaps due to a bug in message encoding at the producer layer, you could provide a general mechanism for handling errors by keeping a queue per producer that receives the consumers' exception messages.
Retry policies will depend greatly on your application needs. If you need 100% assurance that a message has been received and processed, then you should consider using Redis transactions (MULTI/EXEC) to wrap the work done by a consumer, so you can ensure that a client doesn't remove a message unless it has completed its work. If you need explicit acknowledgement, then you could use an explicit ACK message on a queue dedicated to the producer process(es).
Without knowing more about your application needs, it's hard to know how to choose wisely. Generally, if your messages require full ACID protection, then you probably also need to use Redis transactions. If your messages are only meaningful when they are timely, then transactions may not be needed. It sounds as though you can't tolerate dropped messages, so your approach of using a list is good. If you need to implement a priority queue for your messages, you can use a sorted set (the Z-commands) to store your messages, using their priority as the score value, along with a polling consumer.
If you want a pub/sub system where subscribers won't lose messages when they die, consider using Redis Streams instead of Redis Pub/Sub.
Redis Streams have their own architecture and their own pros and cons compared with Redis Pub/Sub. With Redis Streams, a subscriber can issue the command:
the last message I received was X, now give me the next message;
if there is no new message, then wait for one to arrive.
Antirez's article linked above is a good intro to Redis streams with more info.
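The "give me the next message after X, or wait" request corresponds to XREAD with a BLOCK timeout. A minimal sketch, assuming a redis-py-style client; `read_after` is a hypothetical helper, and the caller is expected to persist the returned id between runs.

```python
def read_after(client, stream, last_id, block_ms=5000):
    """Return (entries, new_last_id) for entries newer than last_id.

    Waits up to block_ms milliseconds when nothing newer exists yet.
    """
    reply = client.xread({stream: last_id}, count=10, block=block_ms)
    entries = []
    for _stream_name, messages in reply or []:
        for message_id, fields in messages:
            entries.append((message_id, fields))
            last_id = message_id  # resume from here on the next call
    return entries, last_id
```

A subscriber that restarts passes the last id it stored and picks up where it left off, which is what plain Pub/Sub cannot offer.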
What I did is use a sorted set using the timestamp as the score and the key to the data as the member value. I use the score from the last item to retrieve the next few ones and then get the keys. Once the work is done I wrap both the zrem and the del in a MULTI/EXEC transaction.
Essentially what Edward said, but with the twist of storing the keys in the sorted set, as my messages can be pretty big.
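A rough sketch of that approach, assuming a redis-py-style client. The helper names `take_batch` and `finish` and the key layout are illustrative, not the exact code described above.

```python
def take_batch(client, index_key, after_score, limit=10):
    """Return [(member, score)] for entries with score > after_score."""
    # "(" makes the lower bound exclusive in ZRANGEBYSCORE.
    return client.zrangebyscore(index_key, f"({after_score}", "+inf",
                                start=0, num=limit, withscores=True)


def finish(client, index_key, data_key):
    """Remove the index entry and the payload in one MULTI/EXEC."""
    pipe = client.pipeline(transaction=True)
    pipe.zrem(index_key, data_key)
    pipe.delete(data_key)
    return pipe.execute()
```

The transaction keeps the index and the payload consistent with each other; it does not by itself stop two consumers from picking up the same batch.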
Hope this helps!