rabbitmq drop message if certain field value is not unique? - rabbitmq

I am using an ampq queue with my webcrawler - each crawler instance will get an url to crawl from a message in the queue then add the urls it has found to the queue.
As there will be multiple crawler instances each may find a the same url and add it to the queue.
Is there a built in way to tell rabbitmq to drop the message if the url is know, or to check the queue if a message with the url exists already?

No. There are no way to check message uniqueness with RabbitMQ mechanism.
AMQP queues and especially RabbitMQ queues are pure FIFO queues.
Probably, you have to implement uniqueness check on application side.
P.S.:
There are nifty workaround to declare queues with the same name as unique field (or it hash) with x-max-length set to 1 so duplicates will be lost if there are unprocessed message in queue. But this requires a lot of queues with urls (unique field - url hash) and thus is not the best solution , especially when it comes to consume all that messages from thousands of queues with non-obvious names.

Related

Spring AMQP. Deal Letter Exchange. Send message to the original queue

In my system, I use Topic Exchanges with lots of consumer queues. Each queue has it's own non-unique routing key (f.e. 'add.#' for all new entities or just '#' to consume all events).
I want to add support for retrying failed messages with some delay. The biggest issue that I see with Dead Letter Exchange approach is to send a message directly to the queue in which it failed. Routing keys for Queues are not unique, and even if I resubmit a message to the Exchange with the original routing key, it will be consumed by other queues.
One solution is having a "retry" exchange and every application will be subscribed to it with unique routing key (f.e. original queue name). But it sounds too complicated and I want to hide this infrastructure complexity from developers.
I came up with the idea to have a filter that will check the 'x-death' header, get the first queue (the queue where the error occurred in a first place), and process a message only for the appropriate queue. Otherwise - acknowledge the message.
Is it possible to implement this behavior with Spring AMQP? I'm looking into MessagePostProcessor, but how to Acknowledge a message from it?
If you really worry about only the target queue, so you need to consider a variant with republishing in the default exchange which has these abilities:
The default exchange is implicitly bound to every queue, with a routing key equal to the queue name. It is not possible to explicitly bind to, or unbind from the default exchange. It also cannot be deleted.
Pay attention to the routing key equal to the queue name part. I would consider to deal with a AmqpHeaders.CONSUMER_QUEUE and use its value as a routing key for republishing to the default exchange ("") during retry process.

To be sure about concurrency, same group of works in multiple queues (FIFO)

I have a question about multi consumer concurrency.
I want to send works to rabbitmq that comes from web request to distributed queues.
I just want to be sure about order of works in multiple queues (FIFO).
Because this request comes from different users eech user requests/works must be ordered.
I have found this feature with different names on Azure ServiceBus and ActiveMQ message grouping.
Is there any way to do this in pretty RabbitMQ ?
I want to quaranty that customer's requests must be ordered each other.
Each customer may have multiple requests but those requests for that customer must be processed in order.
I desire to process quickly incoming requests with using multiple consumer on different nodes.
For example different customers 1 to 1000 send requests over 1 millions.
If I put this huge request in only one queue it takes a lot of time to consume. So I want to share this process load between n (5) node. For customer X 's requests must be in same sequence for processing
When working with event-based systems, and especially when using multiple producers and/or consumers, it is important to come to terms with the fact that there usually is no such thing as a guaranteed order of events. And to get a robust system, it is also wise to design the system so the message handlers are idempotent; they should tolerate to get the same message twice (or more).
There are way to many things that may (and actually should be allowed to) interfere with the order;
The producers may deliver the messages in a slightly different pace
One producer might miss an ack (due to a missed package) and will resend the message
One consumer may get and process a message, but the ack is lost on the way back, so the message is delivered twice (to another consumer).
Some other service that your handlers depend on might be down, so that you have to reject the message.
That being said, there is one pattern that servicebus-systems like NServicebus use to enforce the order messages are consumed. There are some requirements:
You will need a centralized storage (like a sql-server or document store) that allows for conditional updates; for instance you want to be able to store the sequence number of the last processed message (or how far you have come in the process), but only if the already stored sequence/progress is the right/expected one. Storing the user-id and the progress even for millions of customers should be a very easy operation for most databases.
You make sure the queue is configured with a dead-letter-queue/exchange for retries, and then set your original queue as a dead-letter-queue for that one again.
You set a TTL (for instance 30 seconds) on the retry/dead-letter-queue. This way the messages that appear on the dead-letter-queue will automatically be pushed back to your original queue after some timeout.
When processing your messages you check your storage/database if you are in the right state to handle the message (i.e. the needed previous steps are already done).
If you are ok to handle it you do and update the storage (conditionally!).
If not - you nack the message, so that it is thrown on the dead-letter queue. Basically you are saying "nah - I can't handle this message, there are probably some other message in the queue that should be handled first".
This way the happy-path is to process a great number of messages in the right order.
But if something happens and a you get a message out of band, you will throw it on the retry-queue (the dead-letter-queue) and Rabbit will make sure it will get back in the queue to be retried at a later stage. But only after a delay.
The beauty of this is that you are able to handle most of the situations that may interfere with processing the message (out of order messages, dependent services being down, your handler being shut down in the middle of handling the message) in exact the same way; by rejecting the message and letting your infrastructure (Rabbit) take care of it being retried after a while.
(Assuming the OP is asking about things like ActiveMQs "message grouping:)
This isn't currently built in to RabbitMQ AFAIK (it wasn't as of 2013 as per this answer) and I'm not aware of it now (though I haven't kept up lately).
However, RabbitMQ's model of exchanges and queues is very flexible - exchanges and queues can be easily created dynamically (this can be done in other messaging systems but, for example, if you read ActiveMQ documentation or Red Hat AMQ documentation you'll find all of the examples in the user guides are using pre-declared queues in configuration files loaded at system startup - except for RPC-like request/response communication).
Also it is very easy in RabbitMQ for a consumer (i.e., message consuming thread) to consume from multiple queues.
So you could build, on top of RabbitMQ, a system where you got your desired grouping semantics.
One way would be to create dynamic queues: The first time a customer order was seen or a new group of customer orders a queue would be created with a unique name for all messages for that group - that queue name would be communicated (via another queue) to a consumer who's sole purpose was to load-balance among other consumers that were responsible for handling customer order groups. I.e., the load-balancer would pull off of its queue a message saying "new group with queue name XYZ" and it would find in a pool of order group consumer a consumer which could take this load and pass it a message saying "start listening to XYZ".
Another way to do it is with pub/sub and topic routing - each customer order group would get a unique topic - and proceed as above.
RabbitMQ Consistent Hash Exchange Type
We are using RabbitMQ and we have found a plugin. It use Consistent Hashing algorithm to distribute messages in order to consistent keys.
For more information about Consistent Hashing ;
https://en.wikipedia.org/wiki/Consistent_hashing
https://www.youtube.com/watch?v=viaNG1zyx1g
You can find this plugin from rabbitmq web page
plugin : rabbitmq_consistent_hash_exchange
https://www.rabbitmq.com/plugins.html

rabbitmq: can consumer persist message change before nack?

Before a consumer nacks a message, is there any way the consumer can modify the message's state so that when the consumer consumes it upon redelivery, it sees that changed state. I'd rather not reject + reenqueue new message, but please let me know if that's the only way to accomplish this.
My goal is to determine how many times specific messages are being redelivered. I see two ways of doing this:
(1) On the message itself as described above. The message would be a container of basic stats and the application payload message.
(2) In some external storage. We would uniquely identify the message by the message id that we set.
I know 2 is possible, but my question is if 1 is possible.
There is no way to do (1) like you want. You would need to change the message, thus the message would become another message. If you want to do something like that (and it's possible that you meant this with I'd rather not reject + reenqueue new message) - you should ACK the message, increment one field in it and publish it again (again, maybe this is what you meant when you said reenqueue it). So your message payload would have some ID, counter, and again (obviously different) payload that is the content.
Definitvly much better way is (2) for multiple reasons:
it does not interfere with business logic, that is this diagnostic part is isolated
you are leaving re-queueing to rabbitmq (as you are supposed to do), meaning that you are not worrying about losing messages and handling some message meta info which has no use for you business logic
it's actually supposed to be used - the ACKing and NACKing, that's why it's in the AMQP specification
since you do need the number of how many times specific messages have been redelivered, you have it somewhere externally, meaning that it's independent of (rabbitmq's) message persistence, lifetime, potentially queue durability mirroring etc
Even if this question was marked as solved some time ago, I want to mention that there is a way at least for the redelivery. It might be integrated after the original answer. There is a different type of queues in RabbitMQ called Quorum queues.
Quorum queues offer the option to set redelivery limit:
Quorum queues support poison message handling via a redelivery limit. This feature is currently unique to Quorum queues.
In order to archive this, RabbitMQ is counting the numbers of deliveries in the header. The header attribute is called: x-delivery-count

Does rabbitmq support to push the same data to multi consumers?

I have a rabbitmq cluster used as a working queue. There are 5 kinds of consumers who want to consume exactly the same data.
What I know for now is using fanout exchange to "copy" the data to 5 DIFFERENT queues. And the 5 consumers can consume different queue. This is kind of wasting resources because the data is the same in file queues.
My question is, does rabbitmq support to push the same data to multi consumers? Just like a message need to be acked for a specified times to be deleted.
I got the following answer from rabbitmq email group. In short, the answer is no... and what I did above is the correct way.
http://rabbitmq.1065348.n5.nabble.com/Does-rabbitmq-support-to-push-the-same-data-to-multi-consumers-td36169.html#a36170
... fanout exchange to "copy" the data to 5 DIFFERENT queues. And the 5 consumers can consume different queue. This is kind of wasting resources because the data is the same in file queues.
You can consume with 5 consumers from one queue if you do not want to duplicate messages.
does rabbitmq support to push the same data to multiple consumers
In AMQP protocol terms you publish message to exchange and then broker (RabbitMQ) decide what to do with messages - assume it figured out the queue message intended for (one or more) and then put that message on top of that queue (queues in RabbitMQ are classic FIFO queues which is somehow break AMQP implementation in RabbitMQ). Only after that message may be delivered to consumer (or die due to queue length limit or per-queue or per-message ttl, if any).
message need to be acked for a specified times to be deleted
There are no way to change message body or attributes after message being published (actually, Dead Letter Exchanges extension and some other may change routing key, for example and add,remove and change some headers, but this is very specific case). So if you want to track ack's number you have to re-publish consumed message with changed body or header (depends on where do you plan to store ack's counter, but headers fits pretty nice for this.
Also note, that there are redeliverd message attribute which denotes whether message was already was consumed, but then redelivered. This flag doesn't count redelivers number so it usage is quite limited.

Rabbitmq queue sharding

I have to implement this scenario:
An external application publish message to rabbitmq.
This message has a client_id property. We can place this id to routing key or message header or some other property.
I have to implement sharding in a exchange routng logic - the message should be delivered to specific queue based on the client_id range.
Is it possible to implement in a standard exchanges?
If not what exchange should I take as the base?
How to dynamicly change client_id ranges?
Take a look at the rabbitmq plugin. It's included in the RabbitMQ distribution from v3.6.0 onwards.
Just have your producer put enough info into the routing key that causes the message to go into the right queue on the other side of the Exchange.
So for example, create two queues called 1 and 2 and bind them with routing keys matching the names. Then have your producer decide which routing key to use when producing the event message. Customers with names starting with letters a-m go to 1, n-z go to 2, you get the idea. It pushes the sharding to the producer but that might be OK for your application.
AMQP doesn't have any explicit implementation of sharding, but its architecture should help you to do that.
Spreading messages to several queues is just a rabbitmq challenge (and part of amqp specification), and with routing, way you can attach hetereogeneous consumers to handle specific messages routed via the same exchange. Therefore, producer should push a specific key to be consumed by specific queue/consumer...
You can decide to make a static sharding, perhaps you have 10 queues with one consumer per queue. You could implement a consistent hashing function such that key is CLIENT_ID % 10.
Another ways and none static solutions could be propoused, and you can try to over this architecture.