Splunk rex query does not return desired result

I am looking to search for error types in my Splunk logs. A typical error log looks like this:
ERROR 2016/03/16 22:13:55 Program exited with error Calling service: Post http://hostname/v1.21/resource/create?name=/60b80cf9-ebc4-11e5-a9cb-3c4a92db9491-2: read unix #->/var/run/program.sock: use of closed network connection (Client.Timeout exceeded while awaiting headers)
Note that the common part is "Program exited with error". I am looking to capture the part that follows this common portion of the error message. I tried a couple of rex expressions; both returned different results, and importantly, neither captured the error type shown above. I am giving the one that worked better here.
* | rex "Program exited with error\s+(?<reason>.+)" | top reason
An example of a log line it matched:
Unable to get program status, Get http://192.168.0.2:2774/program/v1/status: net/http: timeout awaiting response headers
However, it did not match log lines of the form:
initial ZK connection failed, stat /var/program/f47aae5c-ea42-11e5-8975-fc15b40f4cc4/srcheck/started: no such file or directory
Calling service: Post http://hostname/v1.21/resource/create?name=/60b80cf9-ebc4-11e5-a9cb-3c4a92db9491-2: read unix #->/var/run/program.sock: use of closed network connection (Client.Timeout exceeded while awaiting headers)
Could someone help me understand what's wrong with my rex expression and what the right one would be so I get all possible error types?

This recipe:
"ERROR.*Program exited with error.*:.*:.*:\s+(?<reason>.+)"
will yield:
use of closed network connection (Client.Timeout exceeded while awaiting headers)
I don't have enough sample data to know whether this will hold up or not. For example, I'm counting on exactly three colons to get me to the interesting part. I also don't know if you care about other things like the hostname, the fact that it's a Post, etc. But based on your sample of one, this answer should do the trick.
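If the number of colons varies between error types, a variant that keys off the last colon in the line may hold up better. This is only a sketch against the samples shown here, not something tested against real data:
* | rex "Program exited with error.*:\s*(?<reason>[^:]+)$" | top reason
The greedy .* pushes the match to the last colon, so the capture is whatever follows it, however many colons the URLs and paths contribute.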

Related

Read Timed Out: synchronous query via BigQuery Java API

We are using the BigQuery Java API to retrieve results for our analytics reporting frontend. We are trying to retrieve the results synchronously. A lot of the time we get a Read timed out error, even before the query timeout specified in the parameters is reached. Here's the stack trace for a sample failure:
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at com.sun.net.ssl.internal.ssl.InputRecord.readFully(InputRecord.java:293)
at com.sun.net.ssl.internal.ssl.InputRecord.read(InputRecord.java:331)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:830)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:787)
at com.sun.net.ssl.internal.ssl.AppInputStream.read(AppInputStream.java:75)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:318)
at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:36)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:965)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:410)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:343)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:460)
I am not able to retrieve the job id of the resulting job, as the error occurs before I can retrieve a JobReference object. The timeout specified in this case was 300 seconds; the query failed well before that. The query contains three JOINs and several GROUP EACH BY clauses. Can you suggest a possible way to debug this?
Adding the code snippet:
QueryRequest queryInfo = new QueryRequest().setQuery(sql)
        .setTimeoutMs(timeOutInSec * 1000);
// get project id
BQGameConnectionDetails details = Config.getBQConnectionDetails(gameId);
String projectId = details.getProjectId();
Bigquery.Jobs.Query queryRequest = getInstance(gameId).jobs()
        .query(projectId, queryInfo);
QueryResponse response = queryRequest.execute();
There are two timeouts involved. The first is the timeout on the HTTP request you've sent to BigQuery. The second is the BigQuery request timeout. It sounds like you've set the latter to a large value, but the former is likely the timeout you're hitting. If the HTTP request times out before the BigQuery timeout, the connection will be closed and BigQuery won't have a chance to respond.
There are two options: the first is to increase the HTTP request timeout (how depends on the libraries you're using, but this page here may be helpful). The second is to decrease the BigQuery timeout. This means you'll have to use jobs.getQueryResults() to read the actual results, but this is a more robust method because it doesn't matter how long the query takes; you can just call getQueryResults() in a loop. I would post a link to a good Java sample that does this, but I don't know of one, unfortunately.
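For what it's worth, here is a rough sketch of both options using the google-api-client classes. The credential, transport, jsonFactory, sql and projectId names are assumptions standing in for whatever you already build your client from, and the timeout values are only examples, so treat this as a starting point rather than a verified sample:
import com.google.api.client.http.HttpRequest;
import com.google.api.client.http.HttpRequestInitializer;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.GetQueryResultsResponse;
import com.google.api.services.bigquery.model.JobReference;
import com.google.api.services.bigquery.model.QueryRequest;
import com.google.api.services.bigquery.model.QueryResponse;
import java.io.IOException;

// Option 1: raise the HTTP read timeout on every request the client sends.
HttpRequestInitializer initializer = new HttpRequestInitializer() {
    @Override
    public void initialize(HttpRequest request) throws IOException {
        credential.initialize(request);          // keep the existing auth behaviour
        request.setConnectTimeout(60 * 1000);    // 60 seconds to establish the connection
        request.setReadTimeout(6 * 60 * 1000);   // 6 minutes waiting for response headers
    }
};
Bigquery bigquery = new Bigquery.Builder(transport, jsonFactory, initializer)
        .setApplicationName("analytics-frontend")
        .build();

// Option 2: keep the server-side timeout short and poll getQueryResults() instead.
QueryRequest queryInfo = new QueryRequest().setQuery(sql).setTimeoutMs(10000L);
QueryResponse start = bigquery.jobs().query(projectId, queryInfo).execute();
JobReference jobRef = start.getJobReference();
GetQueryResultsResponse results;
do {
    results = bigquery.jobs()
            .getQueryResults(projectId, jobRef.getJobId())
            .setTimeoutMs(10000L)                // each call long-polls for up to 10 seconds
            .execute();
} while (!Boolean.TRUE.equals(results.getJobComplete()));
With the second option, the initial query() call just returns quickly with jobComplete=false, and the loop then waits on getQueryResults() for however long the query actually takes.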

File: 0: Unexpected from Google BigQuery load job

I have a compressed JSON file (900 MB, newline delimited) and load it into a new table via the bq command, and I get this load failure:
e.g.
bq load --project_id=XXX --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values mtdataset.mytable gs://xxx/data.gz schema.json
Waiting on bqjob_r3ec270ec14181ca7_000001461d860737_1 ... (1049s) Current status: DONE
BigQuery error in load operation: Error processing job 'XXX:bqjob_r3ec270ec14181ca7_000001461d860737_1': Too many errors encountered. Limit is: 0.
Failure details:
- File: 0: Unexpected. Please try again.
Why the error?
I tried again with --max_bad_records, but still got no useful error message:
bq load --project_id=XXX --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values --max_bad_records 2 XXX.test23 gs://XXX/20140521/file1.gz schema.json
Waiting on bqjob_r518616022f1db99d_000001461f023f58_1 ... (319s) Current status: DONE
BigQuery error in load operation: Error processing job 'XXX:bqjob_r518616022f1db99d_000001461f023f58_1': Unexpected. Please try again.
I also cannot find any useful message in the console.
To the BigQuery team: can you have a look using the job ID?
As far as I know there are two error sections on a job. There is one error result, and that's what you see now. There is also a second one, which should be a stream of errors. This second one is important, as it can contain errors even though the actual job succeeds.
You can also set --max_bad_records=3 on the bq tool. Check here for more parameters: https://developers.google.com/bigquery/bq-command-line-tool
You probably have an error that occurs on every line, so you should try a sample set from this big file first.
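For example, one way to build such a sample set from the compressed file (paths, table name and line count are just placeholders) might be:
zcat data.gz | head -n 1000 | gzip > sample.gz    # first 1,000 JSON lines, recompressed
gsutil cp sample.gz gs://xxx/sample.gz
bq load --project_id=XXX --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values mtdataset.mytable_sample gs://xxx/sample.gz schema.json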
There is also an open feature request to improve the error message; you can star (vote for) this ticket: https://code.google.com/p/google-bigquery-tools/issues/detail?id=13
This answer will be picked up by the BQ team, so for them I am sharing this: we need an endpoint where we can query by job ID for the state or the stream of errors. Getting the full list of errors would help a lot when debugging BQ jobs, and this could be easy to implement.
I looked up this job in the BigQuery logs, and unfortunately, there isn't any more information than "failed to read" somewhere after about 930 MB have been read.
I've filed a bug that we're dropping important error information in one code path and submitted a fix. However, this fix won't be live until next week, and all that will do is give us more diagnostic information.
Since this is repeatable, it isn't likely a transient error reading from GCS. That means one of two problems: we have trouble decoding the .gz file, or there is something wrong with that particular GCS object.
For the first issue, you could try decompressing the file and re-uploading it as uncompressed. While it may sound like a pain to send gigabytes of data over the network, the good news is that the import will be faster since it can be done in parallel (we can't import a compressed file in parallel since it can only be read sequentially).
For the second issue (which is somewhat less likely) you could try downloading the file yourself to make sure you don't get errors, or try re-uploading the same file and seeing if that works.
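If you try the uncompressed route, something along these lines should work; the object names are illustrative:
gsutil cp gs://xxx/data.gz .
gunzip data.gz                                    # leaves a plain newline-delimited JSON file named "data"
gsutil cp data gs://xxx/data.json
bq load --project_id=XXX --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values mtdataset.mytable gs://xxx/data.json schema.json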

Determine actual errors from a load job

Using the Java SDK I am creating a load job for just a single record with a fairly complicated schema. When monitoring the status of the load job, it takes a surprisingly long time (perhaps this is due to working out the schema), and then says:
11:21:06.975 [main] INFO xxx.GoogleBigQuery - Job status (21694ms) create_scans_1384744805079_172221126: DONE
11:24:50.618 [main] ERROR xxx.GoogleBigQuery - Job create_scans_1384744805079_172221126 caused error (invalid) with message
Too many errors encountered. Limit is: 0.
11:24:50.810 [main] ERROR xxx.GoogleBigQuery - {
"message" : "Too many errors encountered. Limit is: 0.",
"reason" : "invalid"
}
BTW - how do I tell the job that it can have more than zero errors using Java?
This load job does not appear in the list of recent jobs in the console, and as far as I can see, none of the Java objects contains any more details about the actual errors encountered. So how can I programmatically find out what is going wrong? All I can find is:
if (err != null) {
    log.error("Job {} caused error ({}) with message\n{}", jobID, err.getReason(), err.getMessage());
    try {
        log.error(err.toPrettyString());
    }
    ...
In general I am having a difficult time finding good documentation for some of these things, and am working it out by trial and error and short snippets of code found on here and in older groups. If there is a better source of information than the getting-started guides, I would appreciate any pointers to it. The Javadoc does not really help, and I cannot find any complete examples of loading, querying, testing for errors, cataloging errors and so on.
This job is submitted as a NEWLINE_DELIMITED_JSON record, supplied to the job via:
InputStream dummy = getClass().getResourceAsStream("/googlebigquery/xxx.record");
final InputStreamContent jsonIn = new InputStreamContent("application/octet-stream", dummy);
createTableJob = bigQuery.jobs().insert(projectId, loadJob, jsonIn).execute();
My authentication and so on seems to work correctly, as separate Java code that lists the projects and the datasets in the project works fine. So I just need help working out what the actual error is - does it not like the schema (I have records nested within records, for instance), or does it think that there is an error in the data I am submitting?
Thanks in advance for any help. The job number cited above is an actual failed load job if that helps any Google staffers who might read this.
It sounds like you have a couple of questions, so I'll try to address them all.
First, the way to get the status of the job that failed is to call jobs().get(jobId), which returns a job object that has an errorResult object containing the error that caused the job to fail (e.g. "too many errors"). The errorStream list is a list of all of the errors on the job, which should tell you which lines hit errors.
Note that if you have the job id, it may be easier to use bq to look up the job -- you can run bq show -j <job_id> to get the job error information. If you add --format=prettyjson, it will print out all of the information in the job.
Another hint you might want to consider is to supply your own job id when you create the job -- then even if there is an error starting the job (i.e. the insert() call fails, perhaps due to a network error), you can look up the job to see what actually happened.
To tell BigQuery that some errors are allowed during import, you can use the maxBadRecords setting in the load job. See https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/JobConfigurationLoad.html#getMaxBadRecords().
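Putting those pieces together, a rough sketch with the Java client classes might look like this; bigQuery, projectId, jobID, loadJob and log stand in for the objects you already have, and in the Java model the error stream is exposed via getErrors() on the job status:
import com.google.api.services.bigquery.model.ErrorProto;
import com.google.api.services.bigquery.model.Job;
import com.google.api.services.bigquery.model.JobReference;
import java.util.List;

// Look up the failed job and walk its error stream.
Job failed = bigQuery.jobs().get(projectId, jobID).execute();
ErrorProto errorResult = failed.getStatus().getErrorResult();   // why the job failed overall
List<ErrorProto> allErrors = failed.getStatus().getErrors();    // one entry per offending line/field
if (allErrors != null) {
    for (ErrorProto e : allErrors) {
        log.error("{} at {}: {}", e.getReason(), e.getLocation(), e.getMessage());
    }
}

// Allow some bad records, and supply your own job id so the job can always be looked up.
loadJob.getConfiguration().getLoad().setMaxBadRecords(10);
loadJob.setJobReference(new JobReference()
        .setProjectId(projectId)
        .setJobId("create_scans_" + System.currentTimeMillis()));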

How to make Xlib print error message, but not exit?

"The action of the default handlers is to print an explanatory message and exit." (link)
Example of such message:
X Error of failed request: BadWindow (invalid Window parameter)
Major opcode of failed request: 12 (X_ConfigureWindow)
Resource id in failed request: 0xc0007a
Serial number of failed request: 140
Current serial number in output stream: 141
If I set my own "ignore everything" error handler (via XSetErrorHandler), the "explanatory messages" disappear.
How to make Xlib ignore errors, but still print error messages?
If you actually want those error messages you pretty much have two options:
Pull _XPrintDefaultError out of XlibInt.c along with some private headers (with all the caveats of using library-private definitions).
Redefine exit() not to actually exit when _XDefaultError calls it.
Neither is especially pretty and both may break and reduce your portability, but they do work.
You have to format your own message. The contents of the message are the fields of the XErrorEvent struct:
http://tronche.com/gui/x/xlib/event-handling/protocol-errors/default-handlers.html
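For example, a minimal handler along these lines (only a sketch; the exact wording of the default message will differ) prints roughly the same fields as the default handler but returns instead of exiting:
#include <stdio.h>
#include <X11/Xlib.h>

/* Print the interesting XErrorEvent fields, then return so the program keeps running. */
static int print_but_continue(Display *dpy, XErrorEvent *ev)
{
    char text[256];
    XGetErrorText(dpy, ev->error_code, text, sizeof(text));
    fprintf(stderr,
            "X Error of failed request:  %s\n"
            "  Major opcode of failed request:  %d\n"
            "  Resource id in failed request:  0x%lx\n"
            "  Serial number of failed request:  %lu\n",
            text, ev->request_code, ev->resourceid, ev->serial);
    return 0;   /* the return value is ignored; not calling exit() is what matters */
}

/* Install it once after XOpenDisplay(): XSetErrorHandler(print_but_continue); */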

Splunk Error Log Dashboard

I have to create a dashboard in Splunk which will show error reporting from the log file:
[2011-09-12 14:13:00:605 GMT][com.abc.rest.Security][http-8080-Processor15] ERROR Unable to decrypt token [abc.com=3502639832.36895.0000; path=/] due to error: Input length must be multiple of 16 when decrypting with padded cipher
[2011-09-12 14:13:00:608 GMT][com.abc.filters.AuthenticationFilter][http-8080-Processor15] DEBUG ValidAuthToken: false
[2011-09-13 16:43:40:134 GMT][com.abc.PerfManager][http-8080-Processor13] ERROR Operation Failed: GET_ACCOUNT_ORDER [Status Code: 0150 Message: ACCESS_DENIED]
[2011-09-13 16:43:40:137 GMT][com.abc.rest.ResolvePackage][http-8080-Processor13] WARN MCE error occurred [StatusCode: 0150].
The above errors occur at different times, and I want to count them all and show a pie chart of these errors with their counts. Basically, these errors could be anything that starts with ERROR.
I should also get the top 10 warnings in the logs with their counts.
I couldn't find a good way to implement this. Can anyone help me out with how to implement it in Splunk?
Thanks!
... | rex "ERROR (?<error_type>[^\[]+)" | stats count by error_type
Something like that should work. Check out http://splunk-base.splunk.com/answers/ if it doesn't.
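For the top 10 warnings the same idea applies, assuming the WARN lines follow the same layout as the samples above:
... | rex "WARN (?<warning_type>[^\[]+)" | top limit=10 warning_type
For the pie chart, save the ERROR search as a dashboard panel and pick the pie visualization; stats count by error_type gives it the series it needs.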