AWS Cloudwatch ELB monitoring active connections - amazon-cloudwatch

I would like to monitor the maximum number of active connections that my ApplicationELB is managing over a 5-minute period.
The ApplicationELB publishes a metric called ActiveConnectionCount. The documentation describes this in part as:
The total number of concurrent TCP connections active from clients to the load balancer and from the load balancer to targets.
And further states:
The most useful statistic is Sum.
I believe that Sum would total all the active connections reported within the time frame. E.g. Let's say the ELB is maintaining 10 connections and it reports this number every second, then the Sum would be 3000 over a 5-minute period. This is not what I want. Furthermore, when I use SUM over a 5-minute period I'm getting 20k or so -- far more than the number of real concurrent connections which are at most a few hundred.
If I aggregate using Maximum then the number reported by AWS is zero (!?).
If I aggregate using Average then the number appears to be reasonable (ranging from 80 - 200), but also wildly inaccurate. That is, it is almost inversely correlates with new connections and response time. That is, during time so of the day when response time is low and new connections is low, average active connections is higher.
So, I guess, here are my questions:
(1) How can I achieve seeing maximum number of concurrent connections between ELB and clients/app server? (Ideally, I could separate these two, but it doesn't look like the ELB does that).
Less important, but I'm curious:
(2) Why does MAXIMUM yield zero, while AVERAGE yields 80-200?
(3) Why does the documentation say that SUM should be used?
Thanks for any help / insight!

How can I achieve seeing maximum number of concurrent connections
between ELB and clients/app server? (Ideally, I could separate these
two, but it doesn't look like the ELB does that).
Why does MAXIMUM yield zero, while AVERAGE yields 80-200?
As you said, the ELB does not do that. From the metrics you can also see something called "SampleCount" which is the number of samples taken during a period of time, by default 1 minute. If we could somehow access the counts in these samples, we could get a min and max sample. For whatever reason, it's currently broken or not implemented and min/max show as 0. Therefore, the most useful metric, in my opinion at least, is the average which takes the sum (of counts) and divides it by the SampleCount.
Why does the documentation say that SUM should be used?
Good question because if you think about it it doesn't make much sense and doesn't give you much information since it's just a sum of the count in all samples.

Related

Can someone explain this PromQL query to me?

I'm new to promQL and I am using it to create grafana dashboard to visualize various API metrics like throughput, latency etc.
For measuring latency I came across these queries being used together. Can someone explain how are they working
histogram_quantile(0.99, sum(irate(http_request_duration_seconds_bucket{path="<API Endpoint>"}[2m])*30) by (path,le))
histogram_quantile(0.95, sum(irate(http_request_duration_seconds_bucket{path="<API Endpoint>"}[2m])*30) by (path,le))
Also I want to write a query which will show me number of API calls with latency greater than 4sec. Can someone please help me there as well?
The provided queries are designed to return 99th and 95th percentiles for the http_request_duration_seconds{path="..."} metric of histogram type over requests received during the last 2 minutes (see 2m in square brackets).
Unfortunately the provided queries have some issues:
They use irate() function for calculating the per-second increase rate of every bucket defined in http_request_duration_seconds histogram. This function isn't recommended to use in general case, because it tends to return jumpy results on repeated queries - see this article for details. So it is better to use rate or increase instead when calculating histogram_quantile.
They multiply the calculated irate() by 30. This has no any effect on query results, since histogram_quantile() normalizes the provided per-bucket values.
So it is recommended to use the following query instead:
histogram_quantile(0.99,
sum(
increase(http_request_duration_seconds_bucket{path="..."}[2m])
) by (le)
)
This query works in the following way:
Prometheus selects all the time series matching the http_request_duration_seconds_bucket{path="..."} time series selector on the selected time range on the graph. These time series represent histogram buckets for the http_request_duration_seconds histogram. Each such bucket contains a counter, which counts the number of requests with duration not exceeding the value specified in the le label.
Prometheus calculates the increase over the last 2 minutes per each selected time series, e.g. how many requests hit every bucket during the last 2 minutes.
Prometheus calculates per-le sums over bucket values calculated at step 2 - see sum() function docs for details.
Prometheus calculates the estimated 99th percentile for the bucket results returned at step 3 by executing histogram_quantile function. The error of the estimation depends on the number of buckets and the le values. More buckets with better le distribution usually give lower error for the estimated percentile.

What is the meaning of totalSlotMs for a BigQuery job?

What is the meaning of the statistics.query.totalSlotMs value returned for a completed BigQuery job? Except for giving an indication of relative cost of one job vs the other, it's not clear how else one should interpret the number. For example, how does the slot-milliseconds number relate to the stack driver reported total slot usage for a given project (which needs to stay below 2000 for on demand BigQuery usage)?
The docs are a bit terse ('[Output-only] Slot-milliseconds for the job.')
The idea is to have a 'slots' metric in the same units at which slots of reservation are sold to customers.
For example, imagine that you have a 20-second query that is continuously consuming 4 slots. In that case, your query is using 80,000 totalSlotMs (4 * 20,000).
This way you can determine the average number of slots even if the peak number of slots differs as, in practice, the number of workers will fluctuate over the runtime of a query.

Waiting time of SUMO

I am using sumo for traffic signal control, and want to optimize the phase to reduce some objectives. During the process, I use the traci module as an output of states in traffic junction. The confusing part is traci.lane.getWaitingTime.
I don't know how the waiting time is calculated and also after I use two detectors as an output to observe, I think it is too large.
Can someone explain how the waiting time is calculated in SUMO?
The waiting time essentially counts the number of seconds a vehicle has a speed of less than 0.1 m/s. In the case of traci.lane this means it is the number of (nearly) standing vehicles multiplied with the time step length (since traci.lane returns the values for the last step).

Calculate % Processor Utilization in Redis

Using INFO CPU command on Redis, I get the following values back (among other values):
used_cpu_sys:688.80
used_cpu_user:622.75
Based on my understanding, the value indicates the CPU time (expressed in seconds) accumulated since the launch of the Redis instance, as reported by the getrusage() call (source).
What I need to do is calculate the % CPU utilization based on these values. I looked extensively for an approach to do so but unfortunately couldn't find a way.
So my questions are:
Can we actually calculate the % CPU utilization based on these 2 values? If the answer is yes, then I would appreciate some pointers in that direction.
Do we need some extra data points for this calculation? If the answer is yes, I would appreciate if someone can tell me what those data points would be.
P.S. If this question should belong to Server Fault, please let me know and I will post it there (I wasn't 100% sure if it belongs here or there).
You need to read the value twice, calculate the delta, and divide by the time elapsed between the two reads. That should give you the cpu usage in % for that duration.

How can I measure instantaneuous bandwidth using ns-3?

I have looked into FlowMon which counts packets over the entire simulation interval. Can I specify shorter intervals and many such intervals? Are there other options to measure instantaneous bandwidth (or at least, bandwidth averaged over short intervals like 1ms) in ns-3?
You could still use FlowMon for this. FlowMon accumulates the number of packets as you progress through the simulation so at different time intervals you can get values such as txBytes and do some calculations with that.
Take a look at NS3 scheduling, and try using that
Simulator::Schedule(Seconds(1), &StatsCalculationCallback);
Where the StatsCalculationsCallback is the function where you would calculate the difference between the value now and value before to get for that short interval. Then in the StatsCalculationsCallback function you would then also need to reschedule again for the next interval.