How to build an ongoing alert that catches sudden spikes for a certain HTTP error code in Splunk?

I could really use an ongoing alert that catches a sudden rise (spike) in a certain error code (such as 404 or 502, etc.).
I gave this some thought on how to achieve it, and... well... I could really use your help with the search :-)
From my understanding, the search should "know" or "sense" the normal traffic (not sure over how long a window, maybe 1 or 2 hours) and alert when the error code spikes compared to 1-2 hours ago.
I think the spike should only trigger when the error code exceeds 5% of total traffic and lasts for longer than 90 seconds.
Here is the Splunk query I use today; I'd appreciate your help tuning it to what I described above:
tag=NginxLogs host=www1 OR host=www2
| stats count by status
| eventstats sum(count) as total
| eval perc=round((count/total)*100,2)
| where status="404" AND perc>5

The top command automatically provides the count and percent.
http://docs.splunk.com/Documentation/Splunk/7.1.2/SearchReference/Top
tag=NginxLogs host=www1 OR host=www2
| top status
| search percent > 5 AND status > 399
If you have the URL, HTTP request method, and user in your Splunk logs, you can add them to this alert. Example:
tag=NginxLogs host=www1 OR host=www2
| eventstats distinct_count(userid) as NoOfUsersAffected by requestUri,status,httpmethod
| top status,httpmethod,NoOfUsersAffected by requestUri
| search NoOfUsersAffected > 2 AND ((status>499 AND percent > 5) OR (status=400 AND percent > 95))
You can use the following alert message:
$result.percent$ % ($result.count$ calls) has StatusCode $result.status$ for
$result.requestUri$ - $result.httpmethod$.
$result.NoOfUsersAffected$ users were affected
You will get an alert like:
21.19 % (850 calls) has StatusCode 500 for https://app.test.com/hello - GET.
90 users are affected
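The top-based searches above handle the percentage threshold but not the comparison against the previous 1-2 hours of traffic that the question asks for. A rough sketch of that comparison (the 5-minute span, the one-hour baseline window, and the "twice the baseline" condition are assumptions to tune, not tested values):
tag=NginxLogs host=www1 OR host=www2 earliest=-2h@m
| timechart span=5m count AS total, count(eval(status="404")) AS errors
| eval perc=round(errors/total*100,2)
| streamstats current=f window=12 avg(perc) AS baseline_perc
| tail 1
| where perc > 5 AND perc > 2 * baseline_perc
Scheduled every few minutes with an alert condition of "number of results > 0", this fires when the latest bucket's 404 share of traffic is above 5% and well above the trailing one-hour average. The 90-second persistence requirement would need a smaller span plus a check that several consecutive buckets match.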

Related

Is it possible to create log source health alerts in Azure Sentinel?

I am attempting to create an alert that lets me know if a data source stops providing logs to Sentinel. While I know it displays anomalies in log data on the dashboard, I am hoping to receive alerts if a source stops providing logs for an extended period of time.
Something like creating a rule with the following query (CEF in this case):
CommonSecurityLog
// look back over a longer window (7d here is an arbitrary choice) and flag sources whose newest log is older than 24h
| where TimeGenerated > ago(7d)
| summarize LastLogReceived = max(TimeGenerated) by DeviceVendor, DeviceProduct, DeviceName, DeviceExternalID
| where LastLogReceived < ago(24h)

Podio Create Item rate limit after 25 calls

I have to create items in Podio using the API. When I let my program go at full speed, I noticed that after 5-6 items I get an error response from Podio saying:
{
  "error_propagate": false,
  "error": "rate_limit",
  "error_description": "You have hit the rate limit. Please wait 300 seconds before trying again",
  "request": {
    "url": "http://api.podio.com/oauth/token",
    "query_string": "",
    "method": "POST"
  }
}
I thought the rate limit was 5000 calls/hour, yet I get this error after 25 calls...
I added a thread.sleep in my code and it seems better now, but even when I let the thread sleep for 10 seconds I still got this error. I have now set the thread.sleep to 20 seconds and it seems to work.
Is there a hidden rate limit on the number of calls per second?
I think you are using username/password authentication here. The token request endpoint has a lower limit, in my experience. So the best way to solve this is to store and reuse the access tokens instead of re-authenticating every time your program runs.
Podio API client libraries provide convenience methods to do this. See these links:
http://podio.github.io/podio-dotnet/sessions/
http://podio.github.io/podio-php/sessions
The rate limit is 1000 calls/hour, so you can set your sleep interval accordingly.
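To illustrate the token-reuse idea above, here is a rough sketch of the pattern in Python (not the Podio client libraries linked above, which do the equivalent for you; the cache file name, expiry handling, and password-grant parameters are assumptions):
import json
import os
import time
import requests

TOKEN_CACHE = "podio_token.json"  # hypothetical local cache file
TOKEN_URL = "https://api.podio.com/oauth/token"  # endpoint from the error above

def get_access_token(client_id, client_secret, username, password):
    # Reuse a cached token while it is still valid instead of hitting
    # /oauth/token (which has the stricter limit) on every run.
    if os.path.exists(TOKEN_CACHE):
        with open(TOKEN_CACHE) as f:
            cached = json.load(f)
        if cached.get("expires_at", 0) > time.time() + 60:
            return cached["access_token"]

    # Only re-authenticate when no valid cached token exists.
    resp = requests.post(TOKEN_URL, data={
        "grant_type": "password",
        "client_id": client_id,
        "client_secret": client_secret,
        "username": username,
        "password": password,
    })
    resp.raise_for_status()
    token = resp.json()
    token["expires_at"] = time.time() + token.get("expires_in", 3600)
    with open(TOKEN_CACHE, "w") as f:
        json.dump(token, f)
    return token["access_token"]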

IBM Watson Concept Insights get related concepts (corpus) using cURL timing out

I am getting this error -> {"code": 500, "message": "Forwarding error"} every time I try to get related concepts from my private account and corpus. The error seems to be a timeout error since it always dies at 2:30.
I've replaced the sample provided by IBM to point to my account and corpus. Does anybody know why this is occurring?
curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/concept-insights/api/v2/corpora/accountid/corpus/related_concepts?limit=3&level=0"
cURL result
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 44 0 44 0 0 0 0 --:--:-- 0:02:30 --:--:-- 10
Corpus status
{"id":"/corpora/accountid/corpus","documents":10,"last_updated":"0001-01-01T00:00:00Z","build_status":{"ready":10,"error":0,"processing":0}}
NOTE: I do not get this error if I use the public example provided by IBM on the API. I have also masked my account id, corpus, username, and password for this public posting.
Unfortunately, since the error is corpus specific (you mentioned you can get the API to work on the public corpus), we would need more information about your corpus (such as account id and corpus id) in order to help you out.
One way to provide this information privately is to open a ticket with the Bluemix support system (there are two options described here):
https://developer.ibm.com/bluemix/support/#support
If you list the "Watson Concept Insights" service in the ticket, we will get your information.

Mono error when load testing

During load testing (using LoadUI) of a new .NET Web API running on Mono, hosted on a medium-sized Amazon server, I'm receiving the following results (in chronological order over the course of about ten minutes):
5 connections per second for 60 seconds
No errors
50 connections per second for 60 seconds
No errors
100 connections per second for 60 seconds
Received 3 errors, appearing later during the run
2014-02-07 00:12:10Z Error HttpResponseExtensions Error occured while Processing Request: [IOException] Write failure Write failure|The socket has been shut down
2014-02-07 00:12:10Z Info HttpResponseExtensions Failed to write error to response: {0} Cannot be changed after headers are sent.
5 connections per second for 60 seconds
No errors
100 connections per second for 30 seconds
No errors
100 connections per second for 60 seconds
Received 1 error same as above, appearing later during the run
100 connections per second for 45 seconds
No errors
From some research, this error seems to be a standard one received when a client closes the connection. As it only occurs during the heavier load tests, I am wondering if I am simply reaching the upper limits of what the server instance can support. If not, any suggestions on hunting down the source of the errors?

How to avoid Hitting the 10 sec limit per user

We run multiple short queries in parallel, and hit the 10 sec limit.
According to the docs, throttling might occur if we hit a limit of 10 API requests per user per project.
We send a "start query job", and then we call the "getGueryResutls()" with timeoutMs of 60,000, however, we get a response after ~ 1 sec, we look for JOB Complete in the JSON response, and since it is not there, we need to send the GetQueryResults() again many times and hit the threshold, that is causing an error, not a slowdown. the sample code is below.
our questions are as such:
1. What is a "user" is it an appengine user, is it a user-id that we can put in the connection string or in the query itslef?
2. Is it really per API project of BigQuery?
3. What is the behavior?we got an error: "Exceeded rate limits: too many user/method api request limit for this user_method", and not a throttling behavior as the doc say and all of our process fails.
4. As seen below in the code, why we get the response after 1 sec & not according to our timeout? are we doing something wrong?
Thanks a lot
Here is a sample of the code:
while res is None or 'jobComplete' not in res or not res['jobComplete']:
    try:
        res = self.service.jobs().getQueryResults(projectId=self.project_id,
                                                  jobId=jobId, timeoutMs=60000,
                                                  maxResults=maxResults).execute()
    except HTTPException:
        if independent:
            raise
Are you saying that even though you specify timeoutMs=60000, it is returning within 1 second but the job is not yet complete? If so, this is a bug.
The quota limits for getQueryResults are actually currently much higher than 10 requests per second. The reason the docs say only 10 is because we want to have the ability to throttle it down to that amount if someone is hitting us too hard. If you're currently seeing an error on this API, it is likely that you're calling it at a very high rate.
I'll try to reproduce the problem where we don't wait for the timeout ... if that is really what is happening it may be the root of your problems.
def query_results_long(self, jobId, maxResults, res=None):
    start_time = query_time = None
    while res is None or 'jobComplete' not in res or not res['jobComplete']:
        if start_time:
            logging.info('requested for query results ended after %s', query_time)
        time.sleep(2)
        start_time = datetime.now()
        res = self.service.jobs().getQueryResults(projectId=self.project_id,
                                                  jobId=jobId, timeoutMs=60000,
                                                  maxResults=maxResults).execute()
        query_time = datetime.now() - start_time
    return res
Then in the App Engine log I had this:
requested for query results ended after 0:00:04.959110