I was reading this tutorial about RabbitMQ. In its description of what a queue is in RabbitMQ, it says the following:
A queue is only bound by the host's memory & disk limits, it's
essentially a large message buffer.
In this context, what is a message buffer? Is it a common data structure?
A buffer in Computer Science typically refers to a data structure or memory region which holds data temporarily while it is moved from one location to another.
You will find buffers widespread at many levels of abstraction throughout the hardware/software stack. They are especially common around interaction points with hardware devices (reading/writing data to/from software and peripherals, for example) and in networking code which writes data to/from network sockets. They are particularly useful where it is necessary to decouple a producer & consumer (different processes may read/write the buffered data, for example, or do so at different speeds) or in cases where users of a resource must queue prior to being serviced.
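As a toy illustration of that decoupling (plain Python standard library, nothing RabbitMQ-specific), a bounded in-memory buffer lets a producer and a slower consumer run independently:

```python
import queue
import threading
import time

# A bounded buffer decouples the two sides: the producer blocks when
# the buffer is full, the consumer blocks when it is empty.
buffer = queue.Queue(maxsize=10)

def producer():
    for i in range(20):
        buffer.put(f"message-{i}")   # blocks while the buffer is full

def consumer():
    for _ in range(20):
        msg = buffer.get()           # blocks while the buffer is empty
        time.sleep(0.1)              # simulate slower processing
        print("processed", msg)

threading.Thread(target=producer).start()
threading.Thread(target=consumer).start()
```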
In the RabbitMQ context, "message buffer" refers to Rabbit's message queue data structure. A queue is a region of memory, backed by a persistent copy of messages on disk, in which RabbitMQ stores messages submitted by producers[1] while it awaits a consumer to read the queue and process the message. The RabbitMQ broker acts as an intermediary to decouple the producer and consumer processes from each other.
[1] Of course, RabbitMQ offers its users advanced routing logic for submitted messages. Messages submitted by users may be committed directly to a queue (buffer) for delivery, or they may traverse a more complex set of routes which dynamically delivers the message to zero or more queues for delivery to multiple consumer processes.
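As a minimal sketch of the simple case (using the Python pika client; the queue name hello is just an example), publishing through the default exchange with the queue name as the routing key commits a message directly to that queue (buffer), and a consumer later drains it:

```python
import pika  # assumes a RabbitMQ broker is running on localhost

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Declare the queue -- the "message buffer" held by the broker.
channel.queue_declare(queue="hello")

# Publishing to the default exchange with the queue name as the
# routing key commits the message directly to that queue.
channel.basic_publish(exchange="", routing_key="hello", body=b"Hello, world!")

# A consumer reads from the same buffer, fully decoupled from the producer.
def on_message(ch, method, properties, body):
    print("received:", body)

channel.basic_consume(queue="hello", on_message_callback=on_message, auto_ack=True)
channel.start_consuming()
```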
We are building Spark-based jobs. Processing each message delivered by the queue takes time. There is a need to re-prioritize a message that has already been sent to the queue.
I am aware there is a priority queue implementation available, but I am not sure how to re-prioritize an existing message in the queue.
One crude workaround is to push the message again with a higher priority so that it is handled sooner, then drop the duplicate with the same content, which had low or no priority, when its turn comes.
Is there a natural way to handle this situation, or another queue that supports this scenario better?
Unfortunately there isn't. Queues should be considered lists of messages in flight; it is not possible to delete or update individual messages.
Your approach of submitting a higher-priority message is the only feasible solution.
RabbitMQ is a messaging system (like the postal service), not a database or a storage service. Storage in the form of queues is a necessary feature, just as the postal service needs storage for postcards in transit. It is optimized for that purpose and does not allow easy access to individual messages.
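For what it's worth, here is a minimal sketch of the republish-and-deduplicate workaround using the Python pika client. The x-max-priority argument is the standard priority-queue declaration, but the dedup-id header and the in-memory set are illustrative choices of mine, not anything RabbitMQ mandates:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# A priority queue: messages may carry a priority from 0 up to 10.
channel.queue_declare(queue="tasks", arguments={"x-max-priority": 10})

def reprioritize(msg_id, body):
    # "Re-prioritize" by publishing a duplicate at a higher priority.
    # The duplicate carries the same dedup-id header as the original.
    channel.basic_publish(
        exchange="",
        routing_key="tasks",
        body=body,
        properties=pika.BasicProperties(priority=9, headers={"dedup-id": msg_id}),
    )

processed_ids = set()  # in a real system this should be shared/persistent

def on_message(ch, method, properties, body):
    msg_id = (properties.headers or {}).get("dedup-id")
    if msg_id in processed_ids:
        # The stale low-priority copy arriving later: ack it and drop it.
        ch.basic_ack(delivery_tag=method.delivery_tag)
        return
    # ... process the message ...
    processed_ids.add(msg_id)
    ch.basic_ack(delivery_tag=method.delivery_tag)
```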
I've been reading about the principles of AMQP message confirms (https://www.rabbitmq.com/confirms.html). It is a really helpful and well-written article, but one particular thing about consumer acknowledgements is really confusing. Here is the quote:
Another thing that's important to consider when using automatic acknowledgement mode is that of consumer overload.
Consumer overload? The message queue is processed and kept in RAM by the broker (if I understand it correctly). What overload is this about? Does the consumer have some kind of second queue?
Another part of that article is even more confusing:
Consumers therefore can be overwhelmed by the rate of deliveries, potentially accumulating a backlog in memory and running out of heap or getting their process terminated by the OS.
What backlog? How does this all work together? What part of the job is done by the consumer (besides consuming the message and processing it, of course)? I thought the broker keeps the queues alive and forwards messages, but now I am reading about mysterious backlogs and consumer overloads. This is really confusing; can someone explain it a bit, or at least point me to a good source?
I believe the documentation you're referring to deals with what, in my opinion, is sort of a design flaw in either AMQP 0-9-1 or RabbitMQ's implementation of it.
Consider the following scenario:
A queue has thousands of messages sitting in it
A single consumer subscribes to the queue with AutoAck=true and no pre-fetch count set
What is going to happen?
RabbitMQ's implementation is to deliver an arbitrary number of messages to a client that has no prefetch count. Further, with auto-ack, the prefetch count is irrelevant, because messages are acknowledged upon delivery to the consumer.
In-memory buffers:
The default client API implementations of the consumer have an in-memory buffer (in .NET it is some type of blocking collection, if I remember correctly). So, before the message is processed, but after the message is received from the broker, it goes into this in-memory holding area. Now, the design flaw is this holding area: a consumer has no choice but to accept the message coming from the broker, as it is published to the client asynchronously. This is a flaw in the AMQP protocol specification (see page 53).
Thus, every message in the queue at that point will be delivered to the consumer immediately and the consumer will be inundated with messages. Assuming each message is small, but takes 5 minutes to process, it is entirely possible that this one consumer will be able to drain the entire queue before any other consumers can attach to it. And since AutoAck is turned on, the broker will forget about these messages immediately after delivery.
Obviously this is not a good scenario if you'd like to get those messages processed, because they've left the relative safety of the broker and are now sitting in RAM at the consuming endpoint. Let's say an exception is encountered that crashes the consuming endpoint - poof, all the messages are gone.
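For reference, the risky configuration described above looks something like this in the Python pika client (a deliberately slow handler stands in for real work):

```python
import pika
import time

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="task_queue", durable=True)

def slow_handler(ch, method, properties, body):
    time.sleep(300)  # each message takes ~5 minutes to process

# RISKY: auto-ack with no prefetch limit. The broker treats every
# message as acknowledged on delivery, so anything buffered client-side
# is lost for good if this process crashes.
channel.basic_consume(queue="task_queue", on_message_callback=slow_handler, auto_ack=True)
channel.start_consuming()
```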
How to work around this?
You must turn auto-ack off, and it is generally also a good idea to set a reasonable prefetch count (usually 2-3 is sufficient).
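A minimal sketch of that fix with the Python pika client (queue name assumed for illustration):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="task_queue", durable=True)

# Cap unacknowledged deliveries per consumer: the broker will not push
# another message until an outstanding one has been acknowledged.
channel.basic_qos(prefetch_count=2)

def handler(ch, method, properties, body):
    # ... process the message ...
    # Manual ack: only now may the broker forget the message.
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="task_queue", on_message_callback=handler, auto_ack=False)
channel.start_consuming()
```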
Being able to signal back pressure is a basic problem in distributed systems. Without explicit acknowledgements, the consumer has no way to say "slow down" to the broker. With auto-ack on, as soon as the TCP acknowledgement is received by the broker, it deletes the message from its memory/disk.
However, that does not mean the consuming application has processed the message, or that it has enough memory to store incoming messages. The backlog in the article is simply a data structure used to store unprocessed messages in the consumer application.
I want to know how RabbitMQ stores messages physically in its RAM and on disk.
I know that RabbitMQ tries to keep messages in memory (but I don't know how the messages are put in RAM). Messages can also be spilled to disk when they are published in persistent mode or when the broker is under memory pressure (but I don't know how the messages are stored on disk).
I'd like to know the internals of this. Unfortunately, the official documentation on its homepage does not expose the internal details.
Which document should I read for this?
RabbitMQ uses a custom database to store the messages; the database is usually located here:
/var/lib/rabbitmq/mnesia/rabbit#hostname/queues
Starting from version 3.5.5, RabbitMQ introduced the new credit flow settings:
https://www.rabbitmq.com/blog/2015/10/06/new-credit-flow-settings-on-rabbitmq-3-5-5/
Let’s take a look at how RabbitMQ queues store messages. When a message enters the queue, the queue needs to determine if the message should be persisted or not. If the message has to be persisted, then RabbitMQ will do so right away[3]. Now even if a message was persisted to disk, this doesn’t mean the message got removed from RAM, since RabbitMQ keeps a cache of messages in RAM for fast access when delivering messages to consumers. Whenever we are talking about paging messages out to disk, we are talking about what RabbitMQ does when it has to send messages from this cache to the file system.
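As a concrete (hedged) example of the persistence side using the Python pika client: a message is marked persistent through its delivery mode, and it only survives a broker restart if the queue itself is durable as well. Queue and message names here are invented for illustration:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# The queue must be durable for its messages to survive a broker restart.
channel.queue_declare(queue="orders", durable=True)

# delivery_mode=2 marks the message persistent: RabbitMQ writes it to
# the on-disk message store right away, while typically also keeping a
# copy in its RAM cache for fast delivery.
channel.basic_publish(
    exchange="",
    routing_key="orders",
    body=b"order-42",
    properties=pika.BasicProperties(delivery_mode=2),
)
```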
This blog post is quite detailed.
I also suggest reading about lazy queues:
https://www.rabbitmq.com/lazy-queues.html
and
https://www.rabbitmq.com/blog/2015/12/28/whats-new-in-rabbitmq-3-6-0/
Lazy Queues: This new type of queues work by sending every message that is delivered to them straight to the file system, and only loading messages in RAM when consumers arrive to the queues. To optimize disk reads, messages are loaded in batches.
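Declaring a lazy queue is just an extra queue argument; a minimal sketch with the Python pika client (queue name is an example):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# A lazy queue writes every message straight to disk and only loads
# messages into RAM (in batches) when consumers ask for them.
channel.queue_declare(
    queue="bulk_jobs",
    durable=True,
    arguments={"x-queue-mode": "lazy"},
)
```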
I have been looking at message queues (currently deciding between Kafka and RabbitMQ) for one of my projects, where the following are the biggest must-have features.
Must have features
Messages in queues should be persistent (but only until they are processed successfully by consumers).
Messages in queues should be removed only when downstream consumers are able to process them successfully. Basically, a consumer should ACK that it processed a message successfully.
Good to have features
To increase throughput, consumers should be able to pull a batch of messages from the queue.
If you go with Kafka, it will only retain messages for a configurable duration of time, after which the messages are discarded to free up space, whether they were consumed or not.
It is simply the responsibility of the Kafka consumers to keep track of what has been consumed.
IMHO, if you need to keep messages persisted forever, consider using a different storage medium (a database, maybe).
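To illustrate that consumer-side tracking, here is a sketch using the confluent-kafka Python client (topic and group names are invented): auto-commit is disabled, and the offset is committed only after processing succeeds, which plays the role of an ACK without ever deleting the message from the log:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "my-processing-group",
    "enable.auto.commit": False,      # we decide when an offset is "done"
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    # ... process msg.value() ...
    # Committing the offset is Kafka's analogue of an ACK: it records
    # this consumer group's progress, but the message itself stays in
    # the log until retention expires, consumed or not.
    consumer.commit(message=msg)
```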
Provided that both the subscribed client and the publishing server retain the connection, is Redis guaranteed to always deliver the published message to the subscribed client eventually, even when the client and/or server are under heavy stress? Or should I plan for the possibility that Redis might occasionally drop messages as things get "hot"?
Redis absolutely does not provide guaranteed delivery for publish-and-subscribe traffic. This mechanism is based only on sockets and event loops; there is no queue involved (even in memory). If a subscriber is not listening while a publication occurs, the event is lost for that subscriber.
It is possible to implement some guaranteed delivery mechanisms on top of Redis, but not with the publish-and-subscribe API. The list data type in Redis can be used as a queue, and as the foundation of more advanced queuing systems, but it does not provide multicast capabilities (so no publish-and-subscribe).
AFAIK, there is no obvious way to implement both publish-and-subscribe and guaranteed delivery at the same time with Redis.
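A short sketch with the redis-py client makes the fire-and-forget behaviour visible (the channel name is arbitrary): anything published before the subscriber is listening is simply lost:

```python
import redis

r = redis.Redis()

# Published before anyone subscribes: delivered to zero clients, gone.
r.publish("alerts", "nobody will ever see this")

p = r.pubsub()
p.subscribe("alerts")

# Only messages published from now on reach this subscriber, and only
# for as long as its connection stays up.
r.publish("alerts", "this one arrives")
for message in p.listen():
    if message["type"] == "message":
        print(message["data"])
        break
```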
Redis does not provide guaranteed delivery using its Pub/Sub mechanism. Moreover, if a subscriber is not actively listening on a channel, it will not receive messages that would have been published.
I previously wrote a detailed article that describes how one can use Redis lists in combination with BLPOP to implement reliable multicast pub/sub delivery:
http://blog.radiant3.ca/2013/01/03/reliable-delivery-message-queues-with-redis/
For the record, here's the high-level strategy:
When each consumer starts up and gets ready to consume messages, it registers by adding itself to a Set representing all consumers registered on a queue.
When a producer publishes a message on a queue, it:
Saves the content of the message in a Redis key
Iterates over the set of consumers registered on the queue, and pushes the message ID onto a List for each of the registered consumers
Each consumer continuously looks out for a new entry in its consumer-specific list and when one comes in, removes the entry (using a BLPOP operation), handles the message and moves on to the next message.
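Here is a condensed sketch of that strategy using the redis-py client (the key naming scheme is mine, not necessarily the article's):

```python
import redis

r = redis.Redis()

def register_consumer(queue, consumer_id):
    # 1. Each consumer registers itself in a Set for the queue.
    r.sadd(f"{queue}:consumers", consumer_id)

def publish(queue, payload):
    # 2a. Save the message body under its own key...
    msg_id = r.incr(f"{queue}:next-id")
    r.set(f"{queue}:messages:{msg_id}", payload)
    # 2b. ...then push the message ID onto every registered consumer's list.
    for consumer in r.smembers(f"{queue}:consumers"):
        r.rpush(f"{queue}:{consumer.decode()}:ids", msg_id)

def consume_one(queue, consumer_id):
    # 3. Block until an ID shows up on this consumer's list (BLPOP),
    #    then fetch the message body and handle it.
    _, msg_id = r.blpop(f"{queue}:{consumer_id}:ids")
    return r.get(f"{queue}:messages:{msg_id.decode()}")
```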
I have also made a Java implementation of these principles available open-source:
https://github.com/davidmarquis/redisq
These principles have been used to process about 1,000 messages per second from a single Redis instance and two instances of the consumer application, each instance consuming messages with 5 threads.