I need to log, with the highest possible precision, the rate at which messages enter and leave a particular queue in RabbitMQ. I know the API already provides publishing and delivering rates, but I am interested in capturing raw incoming and outgoing values over a known period of time, so that I can estimate rates with higher precision and over time periods of my choice.
Ideally, I would check on demand (i.e. on a schedule of my choice) values such as the current cumulative count of messages that have entered the queue so far ("published" messages) and the current cumulative count of messages consumed ("delivered" messages).
With these types of cumulative counts, I could:
Compute my own deltas of messages entering or exiting the queue, e.g. doing Δ_count = cumulative_count(t) - cumulative_count(t-1)
Compute throughput rates doing throughput = Δ_count / Δ_time
Potentially infer how long messages stay on the queue throughout the day.
The last two would ideally rely on the precise timestamps when those cumulative counts were calculated.
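To make the delta and throughput formulas above concrete, here is a minimal sketch. It assumes the cumulative counts have already been read from somewhere and paired with the timestamp of each read; the `Sample` type, the `deltas_and_rates` helper, and the sample values are hypothetical and not part of any RabbitMQ API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    timestamp: float  # seconds, taken when the counter was read
    count: int        # cumulative count at that moment


def deltas_and_rates(samples):
    """Return (delta_count, delta_time, rate) for each consecutive pair of samples."""
    results = []
    for prev, curr in zip(samples, samples[1:]):
        delta_count = curr.count - prev.count
        delta_time = curr.timestamp - prev.timestamp
        if delta_time <= 0:
            raise ValueError("timestamps must be strictly increasing")
        results.append((delta_count, delta_time, delta_count / delta_time))
    return results


# Made-up readings of a "published" counter, taken 5 seconds apart.
published = [Sample(0.0, 1000), Sample(5.0, 1244), Sample(10.0, 1300)]
for delta_count, delta_time, rate in deltas_and_rates(published):
    print(delta_count, delta_time, rate)
```

Keeping the timestamp next to each count is what lets the rate be computed over whatever interval actually elapsed, rather than over the nominal polling period.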
I am trying to solve this problem directly using RabbitMQ’s API, but I’m encountering a problem when doing so. When I calculate the message cumulative count in the queue, I get a number that I don’t expect.
For example consider the screenshot below.
The Δ_message_count between entries 90 and 91 is 1810-1633 = 177. So, as I stated, I suppose that the difference between published and delivered messages should be 177 as well (in particular, 177 more messages published than delivered).
However, when I calculate these differences, I see that the difference is not 177:
Δ of published (incoming) messages: 13417517652009 - 13417517651765 = 244
Δ of delivered (outgoing) messages: 1341751765667 - 1341751765450 = 217
So we get 244 - 217 = 27 messages. This suggests that there are 177 - 27 = 150 messages "unaccounted" for.
Why?
I tried taking into account the redelivered messages given by the API, but they were constant when I ran my tests, suggesting that there were no redelivered messages, so I wouldn't expect that to play a role.
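The reconciliation described in this question can be restated as a small check. This is only a sketch of the arithmetic using the numbers quoted above; the function name `unaccounted` is hypothetical, and the assumption that the change in queue depth should equal published minus delivered is the question's own premise, not something confirmed here.

```python
def unaccounted(depth_before, depth_after,
                published_before, published_after,
                delivered_before, delivered_after):
    """Difference between the observed depth change and (published - delivered)."""
    depth_delta = depth_after - depth_before
    published_delta = published_after - published_before
    delivered_delta = delivered_after - delivered_before
    return depth_delta - (published_delta - delivered_delta)


# Values quoted in the question for entries 90 and 91.
gap = unaccounted(
    depth_before=1633, depth_after=1810,
    published_before=13417517651765, published_after=13417517652009,
    delivered_before=1341751765450, delivered_after=1341751765667,
)
print(gap)
```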
I would like to monitor the maximum number of active connections that my ApplicationELB is managing over a 5-minute period.
The ApplicationELB publishes a metric called ActiveConnectionCount. The documentation describes this in part as:
The total number of concurrent TCP connections active from clients to the load balancer and from the load balancer to targets.
And further states:
The most useful statistic is Sum.
I believe that Sum would total all the active connections reported within the time frame. E.g. Let's say the ELB is maintaining 10 connections and it reports this number every second, then the Sum would be 3000 over a 5-minute period. This is not what I want. Furthermore, when I use SUM over a 5-minute period I'm getting 20k or so -- far more than the number of real concurrent connections which are at most a few hundred.
If I aggregate using Maximum then the number reported by AWS is zero (!?).
If I aggregate using Average then the number appears to be reasonable (ranging from 80 to 200), but it also looks wildly inaccurate: it almost inversely correlates with new connections and response time. That is, during times of the day when response time is low and new connections are low, average active connections is higher.
So, I guess, here are my questions:
(1) How can I achieve seeing maximum number of concurrent connections between ELB and clients/app server? (Ideally, I could separate these two, but it doesn't look like the ELB does that).
Less important, but I'm curious:
(2) Why does MAXIMUM yield zero, while AVERAGE yields 80-200?
(3) Why does the documentation say that SUM should be used?
Thanks for any help / insight!
How can I achieve seeing maximum number of concurrent connections
between ELB and clients/app server? (Ideally, I could separate these
two, but it doesn't look like the ELB does that).
Why does MAXIMUM yield zero, while AVERAGE yields 80-200?
As you said, the ELB does not do that. From the metrics you can also see something called "SampleCount" which is the number of samples taken during a period of time, by default 1 minute. If we could somehow access the counts in these samples, we could get a min and max sample. For whatever reason, it's currently broken or not implemented and min/max show as 0. Therefore, the most useful metric, in my opinion at least, is the average which takes the sum (of counts) and divides it by the SampleCount.
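As a rough illustration of how Sum, SampleCount and Average relate, the sketch below reuses the hypothetical numbers from the question (10 connections, one sample per second, a 5-minute period). The one-sample-per-second cadence is the question's assumption, not a documented reporting interval, and `average_from_sum` is a made-up helper.

```python
def average_from_sum(total, sample_count):
    """Average = Sum / SampleCount, as described in the answer."""
    if sample_count <= 0:
        raise ValueError("sample_count must be positive")
    return total / sample_count


connections = 10                 # steady number of concurrent connections
period_seconds = 5 * 60          # 5-minute aggregation period
samples = [connections] * period_seconds  # assumed: one sample per second

total = sum(samples)
average = average_from_sum(total, len(samples))
print(total, len(samples), average)
```

Under these assumptions the Sum grows with the number of samples in the period, while the Average stays at the per-sample connection count.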
Why does the documentation say that SUM should be used?
Good question, because if you think about it, it doesn't make much sense and doesn't give you much information, since it's just a sum of the counts in all samples.
I have repetitive tasks that I want to process with a number of workers (i.e., competing consumers pattern). The probability of failure during the task is fairly low so in case of such rare events, I would like to try again after a short period of time, say 1 second.
A sequence of consecutive failures is even less probable but still possible, so for a few initial retries, I would like to stick to a 1-second delay.
However, if the sequence of failures reaches some point, then most likely there is some external reason causing these failures. So from that point, I would like to start extending the delay.
Let's say that the desired distribution of delays looks like this:
first appearance in the queue - no delay
retry 1 - 1 second
retry 2 - 1 second
retry 3 - 1 second
retry 4 - 5 seconds
retry 5 - 10 seconds
retry 6 - 20 seconds
retry 7 - 40 seconds
retry 8 - 80 seconds
retry 9 - 160 seconds
retry 10 - 320 seconds
another retry - drop the message
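The schedule above can be written down as a simple lookup. This is just a sketch of the desired policy, independent of how RabbitMQ would enforce it; `RETRY_DELAYS_SECONDS` and `delay_for_retry` are hypothetical names.

```python
# Delays in seconds for retries 1..10, copied from the list above.
RETRY_DELAYS_SECONDS = [1, 1, 1, 5, 10, 20, 40, 80, 160, 320]


def delay_for_retry(retry_number):
    """Return the delay for a 1-based retry number, or None to drop the message."""
    if retry_number < 1:
        raise ValueError("retry_number starts at 1")
    if retry_number > len(RETRY_DELAYS_SECONDS):
        return None  # "another retry - drop the message"
    return RETRY_DELAYS_SECONDS[retry_number - 1]


for n in range(1, 12):
    print(n, delay_for_retry(n))
```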
I have found a lot of information about DLXes (Dead Letter Exchanges) that can partially solve the problem. It appears to be easy to achieve an infinite number of retries with the same delay. At the same time, I haven't found a way to increase the delay or to stop after a certain number of retries.
I'm looking for the purest RabbitMQ solution possible. However, I'm interested in anything that works.
There is a plugin available for this. I think you can use it to achieve what you need.
I've used it for something in a similar fashion for handling custom retries with dynamic delays.
RabbitMQ Delayed Message Plugin
Using a combination of DLXes and expire/TTL times, you can accomplish this except for the case when you want to change the redelivery time, for instance, implementing an exponential backoff.
The only way I could make it work using a pure RabbitMQ approach is to set the expire time to the smallest time needed, then use the x-death array to figure out how many times the message has been killed, and then reject (i.e. DLX it again) or ack the message accordingly.
Let's say you set expire time to 1 minute and you need to backoff 1 minute first time, then 5 minutes and then 30 minutes. This translates to x-death.count = 1, followed by 5 and then 30. Any other time you just reject the message.
Note that this can create lots of churn if you have many retry-messages. But if retries are rare, go for it.
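A minimal sketch of the consumer-side decision described above, assuming the message headers have already been decoded into a plain dict by the client library and that each x-death entry carries `queue`, `reason` and `count` fields. The queue name `retry.wait`, the helper names, and the "drop after the last threshold" branch are assumptions for illustration; the thresholds 1, 5 and 30 are taken literally from the example above.

```python
def expired_count(headers, wait_queue):
    """Sum the x-death counts recorded for expirations in the wait queue."""
    total = 0
    for entry in (headers or {}).get("x-death", []):
        if entry.get("queue") == wait_queue and entry.get("reason") == "expired":
            total += int(entry.get("count", 0))
    return total


def decide(headers, wait_queue="retry.wait", process_at=(1, 5, 30)):
    """Return "process", "reject" (dead-letter it again) or "drop"."""
    count = expired_count(headers, wait_queue)
    if count == 0 or count in process_at:
        return "process"   # first delivery, or a scheduled retry point
    if count > max(process_at):
        return "drop"      # assumed policy: give up after the last retry
    return "reject"        # not time yet, send it through the DLX again


# Hypothetical headers for a message that has expired twice in the wait queue.
headers = {"x-death": [{"queue": "retry.wait", "reason": "expired", "count": 2}]}
print(decide(headers))
```

Every "reject" outcome costs one more trip through the wait queue, which is the churn the note above warns about.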
We are using Redis as a queue that handles about 3k requests per second on average. But when we check instantaneous_ops_per_sec, the value is consistently higher than expected, by about 20%; in this case it reports ~4k ops per sec.
To verify this, I have taken a dump of MONITOR for about 10 seconds and checked the number of incoming commands.
grep "1489722862." monitor_output | wc -l
Where 1489722862 is the timestamp. Even this count matches with what is being produced in the queue and what is being consumed from the queue.
This is a master-slave redis cluster setup.
Does instantaneous_ops_per_sec also account for the slave reads? If not, what is the other reason for which this count is significantly higher?
The instantaneous_ops_per_sec metric is calculated as the mean of the recent samples that the server took. The number of recent samples is hardcoded as 16 by STATS_METRIC_SAMPLES in server.h.
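To illustrate what "mean of the recent samples" implies, here is a simplified sketch of a fixed-size sample window. It is not the Redis implementation: the class name, the way samples are fed in, and the sample values are all hypothetical; only the window size of 16 comes from the answer above.

```python
from collections import deque

STATS_METRIC_SAMPLES = 16  # window size named in the answer above


class InstantaneousMetric:
    """Keeps the most recent samples and reports their mean."""

    def __init__(self, size=STATS_METRIC_SAMPLES):
        self._samples = deque(maxlen=size)  # oldest sample is evicted when full

    def track(self, ops_per_sec):
        self._samples.append(ops_per_sec)

    def value(self):
        if not self._samples:
            return 0.0
        return sum(self._samples) / len(self._samples)


# Made-up data: 15 samples at a steady 3000 ops/sec and one short burst.
metric = InstantaneousMetric()
for _ in range(15):
    metric.track(3000)
metric.track(19000)
print(metric.value())
```

With a window this short, a single burst sample moves the reported value noticeably, so the figure can differ from a count taken over a longer interval.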
The USB specification (Table 5-4) states that, for an isochronous endpoint with a maxPacketSize of 128 bytes, as many as 10 transactions can be done per frame. This gives 128 * 10 * 1000 = 1.28 MB/s of theoretical bandwidth.
At the same time it states
The host must not issue more than 1 transaction in a single frame for a specific isochronous endpoint.
Isn't this contradictory with the aforementioned table?
I've done some tests and found that only 1 transaction is done per frame on my device. Also, I found on several web sites that just 1 transaction can be done per frame (1 ms). Of course I suppose the spec is the correct reference, so my question is: what could be the cause of receiving only 1 packet per frame? Am I misunderstanding the spec, and are what I think of as transactions actually something else?
The host must not issue more than 1 transaction in a single frame for a specific isochronous endpoint.
Assuming USB Full Speed you could still have 10 isochronous 128 byte transactions per frame by using 10 different endpoints.
The Table 5-4 seems to miss calculations for chapter 5.6.4 "Isochronous Transfer Bus Access Constraints". The 90% rule reduces the max number of 128 byte isochr. transactions to nine.
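The numbers in this thread can be reproduced with a short calculation. This is a sketch under stated assumptions: a full-speed frame is taken as 1500 bytes (12 Mb/s for 1 ms) and the per-transaction isochronous protocol overhead as 9 bytes; both figures are assumptions that should be checked against Table 5-4, and the helper name is hypothetical.

```python
FRAME_BYTES = 1500       # assumed: 12 Mb/s * 1 ms / 8 bits per byte
ISO_OVERHEAD_BYTES = 9   # assumed per-transaction protocol overhead
FRAMES_PER_SECOND = 1000


def max_iso_transactions(payload_bytes, share_percent=100):
    """How many payload+overhead transactions fit in the given share of a frame."""
    budget = FRAME_BYTES * share_percent // 100
    return budget // (payload_bytes + ISO_OVERHEAD_BYTES)


whole_frame = max_iso_transactions(128)                    # table-style figure
with_90_rule = max_iso_transactions(128, share_percent=90) # periodic-transfer limit
bandwidth = 128 * whole_frame * FRAMES_PER_SECOND          # bytes per second
print(whole_frame, with_90_rule, bandwidth)
```

Note that these counts are per frame across the bus; with one transaction per frame per endpoint, a single 128-byte endpoint would carry 128 * 1000 bytes per second.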
In http://code.google.com/intl/es-ES/appengine/docs/quotas.html#Channel you can read that with billing enabled the maximum channel creation rate is 60 creations/minute. Does that mean we can create only 86,400 channels/day? That is a very low rate, isn't it? And what can I do if I have estimated that I could have peaks of, for example, 4,000 creations/minute? 60 creations/minute is very few if the channels are one-to-one. Is this correct?
My interpretation of that section is that you will NOT be able to create 4k connections per minute. Here is how I would think about it: over ANY 1-minute period, no more than 60 channels can be created. For example, you can create 60 channels at time T. Then, for the next 60 seconds you won't be able to create any. Or, you can create 30 at time T. Then, every 2 seconds, create a channel.
I believe another way to think about this is in terms of the token bucket algorithm.
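For readers unfamiliar with the token bucket idea, here is a minimal sketch with a capacity of 60 and a refill of one token per second. It is only a mental model: whether App Engine enforces the quota this way is not stated in the docs, and the class and parameter names are hypothetical. A token bucket also lets creations trickle through at one per second right after a burst, which is looser than a strict "no more than 60 in any 1-minute period" reading.

```python
class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # start with a full bucket
        self.last = now

    def try_take(self, now):
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(capacity=60, refill_per_second=1.0)
burst = sum(bucket.try_take(now=0.0) for _ in range(61))  # 61 attempts at time T
after_one_second = bucket.try_take(now=1.0)
print(burst, after_one_second)
```

The clock is passed in explicitly so the behaviour can be checked without waiting in real time.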
Anyway, I believe you can fill out this form to request a higher limit. There is a link to that form from the docs that you linked to in your question.