Elastalert rule for CPU usage in percentage

I am facing an issue with an ElastAlert rule for CPU usage (not load average): I am not getting any hits or matches. Below is my .yaml file for the CPU rule:
name: CPU usage
type: metric_aggregation
index: metricbeat-*
buffer_time:
  minutes: 10
metric_agg_key: system.cpu.total.pct
metric_agg_type: avg
query_key: beat.hostname
doc_type: doc
bucket_interval:
  minutes: 5
sync_bucket_interval: true
max_threshold: 60.0
filter:
- term:
    metricset.name: cpu
alert:
- "email"
email:
- "xyz@xy.com"
Can you please help me with what changes I need to make in my rule?
Any assistance will be appreciated.
Thanks.

Metricbeat reports CPU values in the range 0 to 1, so a threshold of 60 will never be matched.
Try it with max_threshold: 0.6 and it will probably work.
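To double-check the scale of the values before changing the rule, a quick aggregation shows what Metricbeat actually stored. A minimal sketch using the Python Elasticsearch client; the endpoint is an assumption, and the index pattern and field match the rule above:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed ES endpoint

# Average and maximum of the metric the rule aggregates on; with Metricbeat
# defaults both should come back in the 0..1 range (e.g. 0.37, not 37).
resp = es.search(index="metricbeat-*", body={
    "size": 0,
    "query": {"term": {"metricset.name": "cpu"}},
    "aggs": {
        "cpu_avg": {"avg": {"field": "system.cpu.total.pct"}},
        "cpu_max": {"max": {"field": "system.cpu.total.pct"}},
    },
})
print(resp["aggregations"])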

Try reducing buffer_time and bucket_interval for testing.

The best way to debug an ElastAlert issue is to use the command-line option --es_debug_trace, like this: --es_debug_trace /tmp/output.txt. It shows the exact curl API call to Elasticsearch that ElastAlert makes in the background. The query can then be copied into Kibana's Dev Tools for easy analysis and fiddling.
Most likely, the doc_type: doc setting caused the ES endpoint to look like this: metricbeat-*/doc/_search.
You might not have that doc document type, hence no match. Please remove doc_type and try again.
Also note that the pct value is less than 1, hence for your case: max_threshold: 0.6
For me, the following works, for your reference:
name: CPU usage
type: metric_aggregation
use_strftime_index: true
index: metricbeat-system.cpu-%Y.%m.%d
buffer_time:
  hours: 1
metric_agg_key: system.cpu.total.pct
metric_agg_type: avg
query_key: beat.hostname
min_doc_count: 1
bucket_interval:
  minutes: 5
max_threshold: 0.6
filter:
- term:
    metricset.name: cpu
realert:
  hours: 2
...
Sample match output:
{
    '@timestamp': '2021-08-19T15:06:22Z',
    'beat.hostname': 'MY_BUSY_SERVER',
    'metric_system.cpu.total.pct_avg': 0.6155,
    'num_hits': 50,
    'num_matches': 10
}

Related

Elastalert - Text fields are not optimised for operations that require per-document field data - Please use a keyword field instead

I have set up elastalert on a server and managed to run a rule to monitor disk usage. It was running OK until I had to rebuild the server. Now when I run the rule I get the error below. I can't find a solution on the Internet. Any ideas? Thank you in advance.
ERROR:root:Error running query: RequestError(400, 'search_phase_execution_exception', 'Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [host.name] in order to load field data by uninverting the inverted index. Note that this can use significant memory.')
alert rule:
name: "warning:High Disk Usage - Disk use is over 85% of capacity:warning"
type: metric_aggregation
index: metricbeat-*
metric_agg_key: system.filesystem.used.pct
metric_agg_type: avg
alert_subject: "High Disk Usage"
max_threshold: 0.15
filter:
- term:
    metricset.name: filesystem
- term:
    system.filesystem.mount_point: "/"
query_key: host.name
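The error message itself suggests the fix: aggregate on a keyword field instead of the analyzed text field. If the rebuilt server's index template maps host.name as text with a host.name.keyword subfield (an assumption worth verifying against your mapping), then setting query_key: host.name.keyword should avoid the fielddata error. A minimal sketch for checking the mapping with the Python Elasticsearch client (endpoint assumed):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed ES endpoint

# Show how host.name is mapped in the metricbeat indices; look for a
# keyword-typed subfield such as host.name.keyword to use as query_key.
mapping = es.indices.get_field_mapping(index="metricbeat-*", fields="host.name*")
print(mapping)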

Setting Cloudwatch Event Cron Job

I'm a little confused by the cron job documentation for CloudWatch Events. My goal is to create a cron job that runs every day at 9am, 5pm, and 11pm EST. Does this look correct, or did I do it wrong? It seems like CloudWatch uses UTC 24-hour time, so I tried to convert from EST.
I thought I was right, but I got the following error when trying to deploy the CloudFormation template via sam deploy:
Parameter ScheduleExpression is not valid. (Service: AmazonCloudWatchEvents; Status Code: 400; Error Code: ValidationException)
What is wrong with my cron job? I appreciate any help!
(SUN-SAT, 4,0,6)
UPDATED:
The following gets the same error, Parameter ScheduleExpression is not valid:
Events:
  CloudWatchEvent:
    Type: Schedule
    Properties:
      Schedule: cron(0 0 9,17,23 ? * * *)
MemorySize: 128
Timeout: 100
You have to specify a value for all six required cron fields (minutes, hours, day-of-month, month, day-of-week, year); your expression has seven.
This should satisfy all your requirements:
cron(0 4,14,22 ? * * *)
Generated using:
https://www.freeformatter.com/cron-expression-generator-quartz.html
There are a lot of other cron generators you can find online. Note that Quartz-style generators emit a leading seconds field, which AWS does not accept, so drop it and keep six fields.
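The field count alone explains the ValidationException: the rejected expression above has seven fields, while AWS cron() expressions take exactly six. A trivial sanity check, illustrative only (this is not an AWS API):

def check_aws_cron_fields(expr: str) -> None:
    """AWS cron() expressions must contain exactly six fields:
    minutes hours day-of-month month day-of-week year."""
    fields = expr.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {expr!r}")

check_aws_cron_fields("0 4,14,22 ? * * *")    # ok: 9am/5pm/11pm EST in UTC
check_aws_cron_fields("0 0 9,17,23 ? * * *")  # raises: seven fields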

How to delete/remove unfetched URLs from NUTCH Database (CrawlDB)

I want to crawl a new URL list using Nutch, but there are some unfetched URLs in the database:
bin/nutch readdb -stats
WebTable statistics start
Statistics for WebTable:
retry 0: 3403
retry 1: 25
retry 2: 2
status 4 (status_redir_temp): 5
status 5 (status_redir_perm): 26
retry 3: 1
status 2 (status_fetched): 704
jobs: {db_stats-job_local_0001={jobName=db_stats, jobID=job_local_0001, counters={Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=227, REDUCE_INPUT_RECORDS=13, SPILLED_RECORDS=26, VIRTUAL_MEMORY_BYTES=0, MAP_INPUT_RECORDS=3431, SPLIT_RAW_BYTES=1059, MAP_OUTPUT_BYTES=181843, REDUCE_SHUFFLE_BYTES=0, PHYSICAL_MEMORY_BYTES=0, REDUCE_INPUT_GROUPS=13, COMBINE_OUTPUT_RECORDS=13, REDUCE_OUTPUT_RECORDS=13, MAP_OUTPUT_RECORDS=13724, COMBINE_INPUT_RECORDS=13724, CPU_MILLISECONDS=0, COMMITTED_HEAP_BYTES=718675968}, File Input Format Counters ={BYTES_READ=0}, File Output Format Counters ={BYTES_WRITTEN=397}, FileSystemCounters={FILE_BYTES_WRITTEN=1034761, FILE_BYTES_READ=912539}}}}
max score: 1.0
status 1 (status_unfetched): 2679
min score: 0.0
status 3 (status_gone): 17
TOTAL urls: 3431
avg score: 0.0043631596
WebTable statistics: done
So, how can I remove them from the Nutch database? Thanks.
You could use CrawlDbMerger, but you would only be able to filter by URL and not by status; the Generator job already has support for JEXL expressions, but as far as I remember we don't have that feature built into the crawl DB yet.
One way would be to list all the URLs with status_unfetched (using readdb), write some regular expressions to block them (using the normal URL filters), and then run CrawlDbMerger with this filter enabled; the unfetched URLs should disappear, as sketched below.
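A rough sketch of that second approach: turn a dump of the unfetched URLs into exclusion rules for regex-urlfilter.txt, then run CrawlDbMerger with URL filtering enabled. The dump file name and its one-URL-per-line format are assumptions; adjust to whatever your readdb dump actually produces:

import re

# Assumed input: one unfetched URL per line, extracted from a readdb dump.
with open("unfetched_urls.txt") as src, open("exclude-unfetched.txt", "w") as dst:
    dst.write("# auto-generated: drop previously unfetched URLs\n")
    for line in src:
        url = line.strip()
        if url:
            # '-' rules in regex-urlfilter.txt exclude; escape regex metacharacters.
            dst.write("-^" + re.escape(url) + "$\n")
    dst.write("+.\n")  # keep every other URL

Order matters in regex-urlfilter.txt: the generated exclusions must come before the final catch-all +. rule, since the first matching pattern wins.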

jmeter and apachetop - why do I see different values?

Probably the explanation is simple, but I couldn't find an answer to my question:
I am running a JMeter test from one VM (the worker) against another (the target). On the worker I have JMeter with 100 threads (100 users). On the target I have an API that runs on Apache. When I run "apachetop -f access_log" on the target, I see only about 7 req/s.
Can someone explain why I don't see 100 req/s on the target?
In the JMeter test results I always see 200 OK, so all requests are hitting the target, and the target always responds; I am not dropping any requests here. Network bandwidth between the machines is 1G. What am I missing here?
Thanks,
Daddy
100 users doesn't necessarily mean 100 requests per second; in fact, it is highly unlikely.
According to the JMeter glossary:
Elapsed time. JMeter measures the elapsed time from just before sending the request to just after the last response has been received. JMeter does not include the time needed to render the response, nor does JMeter process any client code, for example Javascript.
Roughly, if JMeter is able to get a response from the server in 1 second, you will get 100 requests/second. If the response time is 2 seconds, throughput will be 50 requests/second; at 4 seconds, 25 requests/second; and so on.
JMeter configuration also matters. If you don't provide enough loops, you may run into a situation where some threads have already finished and some have not even started. See the JMeter Test Results: Why the Actual Users Number is Lower than Expected article for a more detailed explanation.
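A quick back-of-the-envelope sketch of that relationship, throughput being roughly threads divided by average response time (illustrative numbers only):

threads = 100
for response_time_s in (1, 2, 4, 14.2):
    print(f"{response_time_s:>5} s avg response -> "
          f"{threads / response_time_s:.1f} req/s")
# An average of ~14.2 s per request is exactly what would make
# 100 threads show up as ~7 req/s in apachetop.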
Your target load = 100 threads (you are assuming this should generate 100 req/sec, as per your plan).
Your actual load = 7 req/sec = 7 * 3600 = 25,200 requests/hour.
Per-thread throughput = 25,200 / 100 threads = 252 iterations/thread/hour.
Per-transaction time = 3600 / 252 = 14.2 secs.
This means JMeter is actually sending one request per thread every ~14.2 secs, i.e., 100 requests every 14.2 secs.
Now, analyze the transaction timers in your JMeter summary report to find out where the remaining ~13.2 secs are being spent.
Possible issues are:
1. High DNS resolution time (DNS issues)
2. High connection setup time (indicates load balancer issues)
3. High request send time (indicates network or firewall throttling)
4. High response receive time (same as #3)
Now, the time that you see in the Apache logs is mostly visible to JMeter as time to first byte. I am not sure about the machine you are running your tests from; if your worker supports curl, use curl to break a single request down into its timing components:
echo 'request payload for POST' \
| curl -X POST -H 'User-Agent: myBrowser' -H 'Content-Type: application/json' -d @- -s -w '\nDNS time:\t%{time_namelookup}\nTCP Connect time:\t%{time_connect}\nAppCon Protocol time:\t%{time_appconnect}\nRedirect time:\t%{time_redirect}\nPreXfer time:\t%{time_pretransfer}\nStartXfer time:\t%{time_starttransfer}\n\nTotal time:\t%{time_total}\n' http://mytest.test.com
If the above output indicates no such issues, then the time must be being spent inside JMeter itself, and you should tune your JMeter implementation using the various options available (Beanshell, JSR223, etc.).

How to avoid hitting the 10 requests/sec limit per user

We run multiple short queries in parallel and hit the 10-requests-per-second limit.
According to the docs, throttling might occur if we hit a limit of 10 API requests per second per user per project.
We send a "start query job", and then we call getQueryResults() with a timeoutMs of 60,000. However, we get a response after ~1 sec; we look for jobComplete in the JSON response, and since it is not there, we have to send getQueryResults() again many times and hit the threshold, which causes an error, not a slowdown. Sample code is below.
Our questions are:
1. What is a "user"? Is it an App Engine user, or a user id that we can put in the connection string or in the query itself?
2. Is it really per BigQuery API project?
3. What is the behavior? We got an error, "Exceeded rate limits: too many user/method api request limit for this user_method", and not the throttling behavior the docs describe, and all of our processing fails.
4. As seen in the code below, why do we get the response after 1 sec and not according to our timeout? Are we doing something wrong?
Thanks a lot.
Here is a sample of the code:
while res is None or 'jobComplete' not in res or not res['jobComplete']:
    try:
        res = self.service.jobs().getQueryResults(projectId=self.project_id,
                                                  jobId=jobId, timeoutMs=60000,
                                                  maxResults=maxResults).execute()
    except HTTPException:
        if independent:
            raise
Are you saying that even though you specify timeoutMs=60000, it is returning within 1 second but the job is not yet complete? If so, this is a bug.
The quota limits for getQueryResults are actually currently much higher than 10 requests per second. The reason the docs say only 10 is that we want to have the ability to throttle it down to that amount if someone is hitting us too hard. If you're currently seeing an error on this API, it is likely that you're calling it at a very high rate.
I'll try to reproduce the problem where we don't wait for the timeout ... if that is really what is happening, it may be the root of your problems.
UPDATED: here is the polling loop with the time of each getQueryResults call logged:
def query_results_long(self, jobId, maxResults, res=None):
    start_time = query_time = None
    while res is None or 'jobComplete' not in res or not res['jobComplete']:
        if start_time:
            logging.info('requested for query results ended after %s', query_time)
            time.sleep(2)
        start_time = datetime.now()
        res = self.service.jobs().getQueryResults(projectId=self.project_id,
                                                  jobId=jobId, timeoutMs=60000,
                                                  maxResults=maxResults).execute()
        query_time = datetime.now() - start_time
    return res
Then in the App Engine log I had this:
requested for query results ended after 0:00:04.959110
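Until the early-return behavior is explained, a common workaround for the rate-limit error is to back off between getQueryResults polls. A sketch along the lines of the code in the question; the attempt count and delays are arbitrary choices:

import random
import time

def get_query_results_with_backoff(service, project_id, job_id,
                                   max_results, max_attempts=8):
    """Poll getQueryResults, sleeping exponentially longer after each
    incomplete response so the per-user request rate stays low."""
    res = None
    for attempt in range(max_attempts):
        res = service.jobs().getQueryResults(
            projectId=project_id, jobId=job_id,
            timeoutMs=60000, maxResults=max_results).execute()
        if res.get('jobComplete'):
            return res
        # Job not finished: wait 1s, 2s, 4s, ... plus jitter before re-polling.
        time.sleep(2 ** attempt + random.random())
    return res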