Handling Stale Data in Prometheus Gauges - rabbitmq

Context: I want to build my own exporter for RabbitMQ. For that I've set an HTTP server that queries the management API, parses the response and builds the appropriate response with Prometheus format
I'm measuring number of messages in a queue to get alerted when a queue has too many messages in it. For that, i've set up the following gauge:
rabbitmq_queue_messages{queue_name="Q1"}
My question is: what happens if a queue gets deleted? for example:
at T1 the exporter returns rabbitmq_queue_messages{queue_name="Q1"} 5
at T2 the queue is being deleted for some reason
at T3 my exporter is being asked for the metrics again.
As I understand, at T3, even though the queue doesn't exist anymore, it will return the same rabbitmq_queue_messages{queue_name="Q1"} 5 response since this is how gauges work on Prometheus. For me it seems odd because at T3, Q1 doesn't exist anymore so I'd expect to stop receiving data points for this queue, instead of receiving stale data.
The workaround I found for this is to build a new prometheus registry on each request to the exporter to start with a clean sheet, but it seems a bit hacky and I really don't feel comfortable working this way.
So, how can I avoid having stale gauge data in a more Prometheus idiomatic way?

If this is a Java exporter, written using client_java, you can simply clear your Gauge (in myGauge.clear()) instead of building a whole new Registry.
Or, if that is too heavy-weight and you have a way to get notified when a queue is deleted, just call Gauge.remove(queueName) when you get the notification.
Edit: Never actually seen any Ruby code before, but it would seem that Registry.unregister("rabbitmq_queue_messages") might be the less heavy-handed way of clearing just the one metric (with all its label combinations, i.e. in your case for all queues). I don't see anything similar to the Java client's Gauge.remove() that would allow removing just one sample/label combination, but I might be missing something.

Related

ActiveMQ: How do I limit the number of messages being dispatched?

Let's say I have one ActiveMQ Broker and an undefined numbers of consumers.
Problem:
To process a message, consumers need an external service which is either "DATA1" or "DATA2" (specified in the message)
Each server, "DATA1" and "DATA2", can only handle 20 connections
So at most 20 "DATA1" and 20 "DATA2" messages must be dispatched at any time
Because of priorization, the messages must be enqueued in the same queue
Even if message A has a higher prio than message B, if A can't be processed because the external service has no free slots, message B needs to be processed instead
How can this be solved? As long as I was using message pulling (prefetch of 0), I was able to do this by using a BrokerPlugin that, on messagePull, achieved this by using semaphores and selectors. If the limits were reached, the pull returned null.
However, due to performance issues I had to set prefetch to 1 and use push instead. Therefore, my messagePull hack no longer works (it's never called).
So far I'm considering implementing a custom Cursor but I was wondering if someone knows a better solution.
Update the custom cursor worked but broke features like message removal. I tried with a custom Queue and QueueDispatchSelector (which is a pain to configure since there isn't a proper API to do so) and it mostly works but I still have synchronisation issues.
Also, a very suitable API seems to be DispatchPolicy, however, while it is referenced by Queue, it's never used.
Queues give you buffering for system processing time for free. Messages are delivered on demand. With prefetch=0 or prefetch=1, should effectively get you there. Messages will only be delivered to a consumer when the consumer is ready (ie.. during the consumer.receive() method).
consumer.receive() is a blocking call, so you should not need any custom plugin or other to delay delivery until the consumer process (and its required downstream services) are ready to handle it.
The behavior should work out-of-the-box, or there are some details to your use case that are not provided to shed more light on the scenario.

How to deal with application crashes with RabbitMQ

Recently, I have implemented RabbitMQ for a couple of use cases. Sending mails is one of them (which is quite common in practice)
My Problem Statement:
A web service(say service A) needs to publish 1000 messages in the queue (which will be picked by some mail sending engine). But unfortunately, after publishing 500 messages to the queue, my app crashes.
Now, if I hit the same service again then the 500 messages that were already pushed in the first go will be pushed again. Though the mails duplication isn't a big deal for now, but is definitely not desired. How to deal with this one. Any thoughts ?
Solutions that I came up with:
Using the batch feature - but it is not supported by AsyncRabbitTemplate so I'm restrained from using that.
Using the database. But that's definitely cumbersome. I won't use this one as well.
If you can identify the duplicates, you can use the Idempotent Receiver enterprise integration pattern on the consumer side.
Spring Integration has an implementation.
However, it's not clear why you are using the async template since that is for send and receive operations. This application sounds like it only needs to send the requests, not wait for a reply.
It's also not clear how batching can help since the crash could occur on the consumer side after it has processed half of the batch.
In either case, you need to track where you got to before the crash.

Dealing with dead letters in RabbitMQ

TL;DR: I need to "replay" dead letter messages back into their original queues once I've fixed the consumer code that was originally causing the messages to be rejected.
I have configured the Dead Letter Exchange (DLX) for RabbitMQ and am successfully routing rejected messages to a dead letter queue. But now I want to look at the messages in the dead letter queue and try to decide what to do with each of them. Some (many?) of these messages should be replayed (requeued) to their original queues (available in the "x-death" headers) once the offending consumer code has been fixed. But how do I actually go about doing this? Should I write a one-off program that reads messages from the dead letter queue and allows me to specify a target queue to send them to? And what about searching the dead letter queue? What if I know that a message (let's say which is encoded in JSON) has a certain attribute that I want to search for and replay? For example, I fix a defect which I know will allow message with PacketId: 1234 to successfully process now. I could also write a one-off program for this I suppose.
I certainly can't be the first one to encounter these problems and I'm wondering if anyone else has already solved them. It seems like there should be some sort of Swiss Army Knife for this sort of thing. I did a pretty extensive search on Google and Stack Overflow but didn't really come up with much. The closest thing I could find were shovels but that doesn't really seem like the right tool for the job.
Should I write a one-off program that reads messages from the dead letter queue and allows me to specify a target queue to send them to?
generally speaking, yes.
you could set up a delayed re-try to resend the message back to the original queue, using a combination of the delay message exchange plugin.
but this would only automate the retries on an interval, and you may not have fixed the problem before the retries happen.
in some circumstances this is ok - like when the error is caused by an external resource being temporarily unavailable.
in your case, though, i believe your thoughts on creating an app to handle the dead letters is the best way to go, for several reasons:
you need to search through the messages, which isn't possible RMQ
this means you'll need a database to store the messages from the DLX/queue
because you're pulling the messages out of the DLX/queue, you'll need to ensure you get all the header info from the message so that you can re-publish to the correct queue when the time comes.
I certainly can't be the first one to encounter these problems and I'm wondering if anyone else has already solved them.
and you're not!
there are many solutions to this problem that all come down to the solution you've suggested.
some larger "service bus" implementations have this type of feature built in to them. i believe NServiceBus (or the SaaS version of it) has this built in, for example - though I'm not 100% sure of it.
if you want to look into this further, do some search for the term "poison message" - this is generally the term used for this situation. I've found a few things on google with a quick search, that may help you down the path:
http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2013-January/025019.html
https://web.archive.org/web/20170809194056/http://tafakari.co.ke/2014/07/rabbitmq-poison-messages/
https://web.archive.org/web/20170809170555/http://kjnilsson.github.io/blog/2014/01/30/spread-the-poison/
hope that helps!

RabbitMQ: throttling fast producer against large queues with slow consumer

We're currently using RabbitMQ, where a continuously super-fast producer is paired with a consumer limited by a limited resource (e.g. slow-ish MySQL inserts).
We don't like declaring a queue with x-max-length, since all messages will be dropped or dead-lettered once the limit is reached, and we don't want to loose messages.
Adding more consumers is easy, but they'll all be limited by the one shared resource, so that won't work. The problem still remains: How to slow down the producer?
Sure, we could put a flow control flag in Redis, memcached, MySQL or something else that the producer reads as pointed out in an answer to a similar question, or perhaps better, the producer could periodically test for queue length and throttle itself, but these seem like hacks to me.
I'm mostly questioning whether I have a fundamental misunderstanding. I had expected this to be a common scenario, and so I'm wondering:
What is best practice for throttling producers? How is this done with RabbitMQ? Or do you do this in a completely different way?
Background
Assume the producer actually knows how to slow himself down with the right input. E.g. a hardware sensor or hardware random number generator, that can generate as many events as needed.
In our particular real case, we have an API that users can use to add messages. Instead of devouring and discarding messages, we'd like to apply back-pressure by having our API return an error if the queue is "full", so the caller/user knows to back-off, or have the API block until the consumer catches up. We don't control our user, so regardless of how fast the consumer is, I can create a producer that is faster.
I was hoping for something like the API for a TCP socket, where a write() can block and where a select() can be used to determine if a handle is writable. So either having the RabbitMQ API block or have it return an error if the queue is full.
For the x-max-length property, you said you don't want messages to be dropped or dead-lettered. I see there was an update in adding some more capabilities for this. As I see it is specified in the documentation:
"Use the overflow setting to configure queue overflow behaviour. If overflow is set to reject-publish, the most recently published messages will be discarded. In addition, if publisher confirms are enabled, the publisher will be informed of the reject via a basic.nack message"
So as I understand it, you can use queue limit to reject the new messages from publishers thus pushing some backpressure to the upstream.
I don't think that this is in any way rabbitmq specific. Basically you have a scenario, where there are two systems of different processing capabilities, and this mismatch will either pose a risk of overflowing the queue (whatever it would be), or even in case of a constant mismatch between producer and consumer, simply create more and more time-distance between event creation and its handling.
I used to deal with this kind of scenarios, and unfortunately there is no magic bullet. You either have to speed up even handling (better hardware, more suited software?) or throttle the event creation (which has nothing to do with MQ really).
Now, I would ask you what's the goal and how the events are produced. Are the events are produced constantly, with either unlimitted or just very high rate (for example readings from sensors - the more, the better), or are they created in batches/spikes (for example: user requests in specific time periods, batch loads from CRM system). I assume that the goal is to process everything cause you mention you don't want to loose any queued message.
If the output is constant, then some limiter (either internal counter, if the producer is the only producer, or external queue length checks if queue can be filled with some other system) is definitely in place.
IF eventsInTimePeriod/timePeriod > estimatedConsumerBandwidth
THEN LowerRate()
ELSE RiseRate()
In real world scenarios we used to simply limit the output manually to the estimated values and there were some alerts set for queue length, time from queue entry to queue leaving etc. Where such limiters were omitted (by mistake mostly) we used to find later some tasks that were supposed to be handled in few hours, that were waiting for three months for their turn.
I'm afraid it's hard to answer to "How to slow down the producer?" if we know nothing about it, but some ideas are: aforementioned rate check or maybe a blocking AddMessage method:
AddMessage(message)
WHILE(getQueueLength() > maxAllowedQueueLength)
spin(1000); // or sleep or whatever
mqAdapter.AddMessage(message)
I'd say it all depends on specific of the producer application and in general your architecture.

MassTransit Saga saves late with NHibernateSagaRepository (Starbucks example)

I have converted the Starbucks example to use RabbitMQ and NHibernate.. However, there is a bug/challenge/issue with the DrinkPreparationSaga and when it actually gets saved to the database vs. when the PaymentCompleteMessage gets submitted.
How the code works (out of the box, this isn't anything I changed)... The new instance of the saga isn't saved to the database until AFTER the Initial state completes and it transitions to its next state.
The problem is that in the sample Starbucks application the DrinkPreparationSaga starts off with a very slow method that prints out coffee making sounds once every 1 a second 10 times..
So there is 10 seconds between when the Saga is actually created and when its saved to the database.. The bigger problem with that is, that any other messages that are destined to that instances of the saga (by CorrolationId) are thrown in the error queue because the Saga doesn't exist.
Shouldn't the NHibernateSagaRepository immediately save the new Saga Instance, then run the workflow, then update the saga post workflow? I can't seem to think of another way to make the example work, but that would require a decent bit of reorganizing in the NHibernateSagaRepository.
Thanks in advance.
The reason sagas are not saved before processing the message is that some members of the saga may not be nullable (or allow nulls) and they are not set before the initial saga message is processed.
And you're correct. Take a look at the Riktig sample (http://github.com/phatboyg/Riktig) to see how the Automatonymous sagas are used and how the correlation of another service (in this case, the image retrieval service) is used. Sagas should not actually perform work, but coordinate the state of a transaction. The earlier Starbucks example was a naive implementation that we built one morning in Austin. It is long overdue for an update (for more reasons, including that it still uses Magnum state machines, which are soon deprecated in favor of Automatonymous).