Kubernetes logs spamming Splunk - splunk-query

I'm having issues with Kubernetes containers that sometimes spam Splunk with hundreds of gigabytes of logs. I would like to put together a search that tracks containers with a sudden log spike and generates an alert.
More specifically:
look at the average rate of events
find the peak
decide on a percentage of that peak
and then trigger an alert when a container breaches that threshold.
The closest I have come up with is the search below, which computes the hourly event count per container plus the average and standard deviation of that count:
index="apps" sourcetype="kube"
| bucket _time span=1h
| stats count as CountByHour by _time, kubernetes.container_name
| eventstats avg(CountByHour) as AvgByKCN stdev(CountByHour) as StDevByKCN by kubernetes.container_name
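Something like the following is roughly what I'm imagining, though I haven't verified it against real data (the 0.8 fraction of the peak is just a placeholder I would tune):
index="apps" sourcetype="kube"
| bucket _time span=1h
| stats count as CountByHour by _time, kubernetes.container_name
| eventstats avg(CountByHour) as AvgByKCN stdev(CountByHour) as StDevByKCN max(CountByHour) as PeakByKCN by kubernetes.container_name
| eval Threshold=0.8*PeakByKCN
| where CountByHour > Threshold
The idea would be to schedule this as an alert and flag any container whose hourly count exceeds the chosen fraction of its own peak; a deviation-based trigger such as CountByHour > AvgByKCN + 3*StDevByKCN might work as an alternative.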

Related

How to get stats from combined aggregated bin data in AWS Cloudwatch Logs Insights

I have some AWS CloudWatch logs which output values every 5 seconds. I'd like to get the max over a rolling 10 minute interval and then get the average value per day based on that. Using the CloudWatch Logs Insights query syntax, I cannot seem to get the result of the first bin aggregation to use in the subsequent bin. I tried:
fields @timestamp, @message
| filter @logStream like /mylog/
| parse @message '*' as threadCount
| stats max(threadCount) by bin(600s) as maxThreadCount
| stats avg(maxThreadCount) by bin(24h) as avgThreadCount
But the query syntax is invalid for multiple stats functions. Combining the last two lines into one like:
| stats avg(max(threadCount) by bin(600s)) by bin(24h) as threadCountAvg
is also invalid. I can't seem to find much in the AWS documentation. Am I out of luck? Does anyone know a trick?
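If the query language itself can't chain two stats, my fallback idea would be to run just the bin(600s) aggregation and compute the daily average client-side. A minimal Python sketch of that second step (the sample rows below are made up, only to show the shape of (timestamp, maxThreadCount) data):
from collections import defaultdict
from datetime import datetime

# Hypothetical rows as returned by: stats max(threadCount) as maxThreadCount by bin(600s)
rows = [
    ("2023-01-01 00:00:00.000", 12.0),
    ("2023-01-01 00:10:00.000", 15.0),
    ("2023-01-02 00:00:00.000", 9.0),
]

daily = defaultdict(list)
for ts, max_threads in rows:
    day = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f").date()
    daily[day].append(max_threads)

# average of the 10-minute maxima, per day
avg_per_day = {day: sum(vals) / len(vals) for day, vals in daily.items()}
print(avg_per_day)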

How to interpret "evicted_keys" from Redis Info

We are using ElastiCache for Redis, and are confused by its Evictions metric.
I'm curious what the unit is on the evicted_keys metric from Redis INFO. The ElastiCache docs say it is a count: https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/CacheMetrics.Redis.html, but for our application we have observed that the "Evictions" metric (which is derived from evicted_keys) fluctuates up and down, indicating it's not a cumulative count. I would expect a count never to decrease, since we cannot "un-evict" a key. I'm wondering if evicted_keys is actually a rate (e.g., evictions/sec), which would explain why it can fluctuate.
Thank you in advance for any responses!
From the INFO command documentation:
evicted_keys: Number of evicted keys due to maxmemory limit
To learn more about evictions see Using Redis as an LRU cache - Eviction policies
This counter is zero when the server starts, and it is only reset if you issue the CONFIG RESETSTAT command. However, on ElastiCache, this command is not available.
That said, ElastiCache derives its metric from this value by calculating the difference between consecutive data points.
Redis evicted_keys (cumulative):     0   5   12   18   22   ...
CloudWatch Evictions (per period):   0   5    7    6    4   ...
This is the usual pattern in CloudWatch metrics. This allows you to use SUM if you want the cumulative value, but also to detect rate changes or spikes easily.
Suppose, for example, you want to alarm when evictions exceed 10,000 over a one-minute period. If ElastiCache stored the cumulative value from Redis directly as a metric, this would be hard to accomplish.
Also, by committing the metric only as the keys evicted during the period, you are protected from the distortion caused by a server reset or a counter overflow. While the Redis INFO value would drop back to zero, on ElastiCache you still get the value for the period and can still compute a running sum over any interval.
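As a rough illustration of that difference calculation (a sketch only, not ElastiCache's actual implementation; the values are the ones from the table above):
# Cumulative evicted_keys samples as read from Redis INFO (values from the table above)
redis_evicted_keys = [0, 5, 12, 18, 22]

# Per-period Evictions datapoints: the difference between consecutive cumulative readings
cloudwatch_evictions = [redis_evicted_keys[0]] + [
    cur - prev for prev, cur in zip(redis_evicted_keys, redis_evicted_keys[1:])
]
print(cloudwatch_evictions)  # [0, 5, 7, 6, 4]

# Summing the per-period values recovers the cumulative count
print(sum(cloudwatch_evictions))  # 22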

High precision queue statistics from RabbitMQ

I need to log, with the highest possible precision, the rate at which messages enter and leave a particular queue in RabbitMQ. I know the API already provides publishing and delivering rates, but I am interested in capturing the raw incoming and outgoing values over a known period of time, so that I can estimate rates with higher precision and over time periods of my choice.
Ideally, I would check on-demand (i.e. on a schedule of my choice) e.g. the current cumulative count of messages that have entered the queue so far ("published" messages), and the current cumulative count of messages consumed ("delivered" messages).
With these types of cumulative counts, I could:
Compute my own deltas of messages entering or exiting the queue, e.g. doing Δ_count = cumulative_count(t) - cumulative_count(t-1)
Compute throughput rates doing throughput = Δ_count / Δ_time
Potentially infer how long messages stay on the queue throughout the day.
The last two would ideally rely on the precise timestamps when those cumulative counts were calculated.
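As a rough sketch of the kind of sampling I have in mind, using the management HTTP API (the host, credentials, queue name, and 60-second period below are just placeholders; message_stats.publish and message_stats.deliver_get are the cumulative counters):
import time
import requests

# Placeholder connection details for the RabbitMQ management API (default vhost "/")
URL = "http://localhost:15672/api/queues/%2F/my_queue"
AUTH = ("guest", "guest")

def sample():
    """Read the cumulative published/delivered counters plus a local timestamp."""
    q = requests.get(URL, auth=AUTH).json()
    stats = q.get("message_stats", {})
    return time.time(), stats.get("publish", 0), stats.get("deliver_get", 0)

t1, pub1, dlv1 = sample()
time.sleep(60)  # sampling period of my choice
t2, pub2, dlv2 = sample()

dt = t2 - t1
print("publish rate:  %.2f msg/s" % ((pub2 - pub1) / dt))
print("delivery rate: %.2f msg/s" % ((dlv2 - dlv1) / dt))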
I am trying to solve this directly using RabbitMQ's API, but I'm running into an inconsistency when doing so. When I calculate the cumulative message count in the queue, I get a number that I don't expect.
For example consider the screenshot below.
The Δ_message_count between entries 90 and 91 is 1810-1633 = 177. So, as I stated, I suppose that the difference between published and delivered messages should be 177 as well (in particular, 177 more messages published than delivered).
However, when I calculate these differences, I see that the difference is not 177:
Δ of published (incoming) messages: 13417517652009 - 13417517651765 = 244
Δ of delivered (outgoing) messages: 1341751765667 - 1341751765450 = 217
So we get 244 - 217 = 27 messages, which suggests that there are 177 - 27 = 150 messages "unaccounted" for.
Why?
I tried taking into account the redelivered messages reported by the API, but they were constant when I ran my tests, suggesting that there were no redeliveries, so I wouldn't expect that to play a role.

Format Splunk query by renaming search elements

I could use a little help with a Splunk query I'm trying to use.
This query works fine for gathering the info I need:
index=prd_aws_billing (source="/*2017-12.csv") LinkedAccountId="1234567810" OR LinkedAccountId="123456789" ProductName="Amazon Elastic Compute Cloud" | stats sum(UnBlendedCost) AS Cost by ResourceId,UsageType,user_Name,user_Engagement
However, I'd like to refine that a bit: I'd like to present user_Engagement as just "Engagement" and user_Name as "Resource Name".
I tried using AS to change the output, like I did to change UnBlendedCost to just "Cost", but when I do that it kills my query and nothing is returned. For instance, if I do either:
index=prd_aws_billing (source="/*2017-12.csv") LinkedAccountId="123456789" OR LinkedAccountId="1234567810" ProductName="Amazon Elastic Compute Cloud" | stats sum(UnBlendedCost) AS Cost by ResourceId AS "Resource Name",UsageType,user_Name,user_Engagement AS "Engagement"
Or
index=prd_aws_billing (source="/*2017-12.csv") LinkedAccountId="123456789" OR LinkedAccountId="1234567819" ProductName="Amazon Elastic Compute Cloud" ResourceID AS "Resource Name" user_Engagement AS "Engagement" | stats sum(UnBlendedCost) AS Cost by ResourceId AS "Resource Name",UsageType,user_Name,user_Engagement AS "Engagement"
The query dies, and no info is returned. How can I reformat the search elements listed after the 'by' clause?
Use the |rename command. You can only use AS to rename the fields being aggregated in a |stats; the fields listed after the by clause have to be renamed afterwards with |rename.
index=prd_aws_billing (source="/*2017-12.csv") LinkedAccountId="1234567810" OR LinkedAccountId="123456789" ProductName="Amazon Elastic Compute Cloud"
| stats sum(UnBlendedCost) AS Cost by ResourceId,UsageType,user_Name,user_Engagement
| rename user_Name as "Resource Name" user_Engagement as Engagement

Redis instantaneous_ops_per_sec higher than actual throughput

We are using Redis as a queue which handles on average about ~3k requests per second. But when we check instantaneous_ops_per_sec, the value is consistently about 20% higher than expected; in this case it reports ~4k ops per sec.
To verify this, I have taken a dump of MONITOR for about 10 seconds and checked the number of incoming commands.
grep "1489722862." monitor_output | wc -l
where 1489722862 is the timestamp. Even this count matches what is being produced to the queue and what is being consumed from it.
This is a master-slave Redis cluster setup.
Does instantaneous_ops_per_sec also account for the slave reads? If not, what is the other reason for which this count is significantly higher?
The instantaneous_ops_per_sec metric is calculated as the mean of the recent samples that the server took. The number of recent samples is hardcoded as 16 by STATS_METRIC_SAMPLES in server.h.
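Roughly, that calculation looks like this (a simplified Python sketch of the logic, not the actual Redis C code; the periodic sampling interval, about 100 ms by default, is driven by serverCron):
from collections import deque

STATS_METRIC_SAMPLES = 16  # the same constant referenced above

class OpsTracker:
    """Simplified sketch of how instantaneous_ops_per_sec is derived."""

    def __init__(self):
        self.samples = deque(maxlen=STATS_METRIC_SAMPLES)
        self.last_count = 0
        self.last_time_ms = 0

    def track(self, total_commands, now_ms):
        """Called periodically with the cumulative command count."""
        if self.last_time_ms:
            elapsed = now_ms - self.last_time_ms
            ops = total_commands - self.last_count
            # scale the per-interval count to an ops/sec figure
            self.samples.append(ops * 1000 / elapsed)
        self.last_count = total_commands
        self.last_time_ms = now_ms

    def instantaneous_ops_per_sec(self):
        """Mean of the most recent (up to 16) per-interval samples."""
        return sum(self.samples) / len(self.samples) if self.samples else 0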