Trying to import a 2.8GB of file via GS to Big query, and the job failed with:
Unexpected. Please try again.
here are some other outputs
Job ID: aerobic-forge-504:job_a6H1vqkuNFf-cJfAn544yy0MfxA
Start Time: 5:12pm, 3 Jul 2014
End Time: 7:12pm, 3 Jul 2014
Destination Table: aerobic-forge-504:wr_dev.phone_numbers
Source URI: gs://fls_csv_files/2014-6-11_Global_0A43E3B1-2E4A-4CA9-BD2A-012B4D0E4C69.txt
Source Format: CSV
Allow Quoted Newlines: true
Allow Jagged Rows: true
Ignore Unknown Values: true
Schema:
area: INTEGER
number: INTEGER
The job failed due to timeout; there is a maximum of 2 hours allowed for processing; after that the import job is killed. I'm not sure why the import was so slow; from what I can tell we only processed at about 100KB/sec, which is far slower than expected. It is quite possible that the error is transient.
In the future, you can speed up the import by setting allow_quoted_newlines to false, which will allow bigquery to process the import in parallel. Alternately, you can partition the file yourself and send multiple file paths in the job.
Can you try again and let us know whether it works?
Related
I am running into a strange issue with PyHive running a Hive query in async mode. Internally, PyHive uses Thrift client to execute the query and to fetch logs (along with execution status). I am unable to fetch the logs of Hive query (map/reduce tasks, etc). cursor.fetch_logs() returns an empty data structure
Here is the code snippet
rom pyhive import hive # or import hive or import trino
from TCLIService.ttypes import TOperationState
def run():
cursor = hive.connect(host="10.x.y.z", port='10003', username='xyz', password='xyz', auth='LDAP').cursor()
cursor.execute("select count(*) from schema1.table1 where date = '2021-03-13' ", async_=True)
status = cursor.poll(True).operationState
print(status)
while status in (TOperationState.INITIALIZED_STATE, TOperationState.RUNNING_STATE):
logs = cursor.fetch_logs()
for message in logs:
print("running ")
print(message)
# If needed, an asynchronous query can be cancelled at any time with:
# cursor.cancel()
print("running ")
status = cursor.poll().operationState
print
cursor.fetchall()
The cursor is able to get operationState correctly but its unable to fetch the logs. Is there anything on HiveServer2 side that needs to be configured?
Thanks in advance
Closing the loop here in case someone else has same or similar issue with hive.
In my case the problem was the hiveserver configuration. Hive Server won't stream the logs if logging operation is not enabled. Following is the list I configured
hive.server2.logging.operation.enabled - true
hive.server2.logging.operation.level EXECUTION (basic logging - There are other values that increases the logging level)
hive.async.log.enabled false
hive.server2.logging.operation.log.location
I've a compressed json file (900MB, newline delimited) and load into a new table via bq command and get the load failure:
e.g.
bq load --project_id=XXX --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values mtdataset.mytable gs://xxx/data.gz schema.json
Waiting on bqjob_r3ec270ec14181ca7_000001461d860737_1 ... (1049s) Current status: DONE
BigQuery error in load operation: Error processing job 'XXX:bqjob_r3ec270ec14181ca7_000001461d860737_1': Too many errors encountered. Limit is: 0.
Failure details:
- File: 0: Unexpected. Please try again.
Why the error?
I tried again with the --max_bad_records, still not useful error message
bq load --project_id=XXX --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values --max_bad_records 2 XXX.test23 gs://XXX/20140521/file1.gz schema.json
Waiting on bqjob_r518616022f1db99d_000001461f023f58_1 ... (319s) Current status: DONE
BigQuery error in load operation: Error processing job 'XXX:bqjob_r518616022f1db99d_000001461f023f58_1': Unexpected. Please try again.
And also cannot find any useful message in the console.
To BigQuery team, can you have a look using the job ID?
As far I know there are two error sections on a job. There is one error result, and that's what you see now. And there is a second, which should be a stream of errors. This second is important as you could have errors in it, but the actual job might succeed.
Also you can set the --max_bad_records=3 on the BQ tool. Check here for more params https://developers.google.com/bigquery/bq-command-line-tool
You probably have an error that is for each line, so you should try a sample set from this big file first.
Also there is an open feature request to improve the error message, you can star (vote) this ticket https://code.google.com/p/google-bigquery-tools/issues/detail?id=13
This answer will be picked up by the BQ team, so for them I am sharing that: We need an endpoint where we can query based on a jobid, the state, or the stream of errors. It would help a lot to get a full list of errors, it would help debugging the BQ jobs. This could be easy to implement.
I looked up this job in the BigQuery logs, and unfortunately, there isn't any more information than "failed to read" somewhere after about 930 MB have been read.
I've filed a bug that we're dropping important error information in one code path and submitted a fix. However, this fix won't be live until next week, and all that will do is give us more diagnostic information.
Since this is repeatable, it isn't likely a transient error reading from GCS. That means one of two problems: we have trouble decoding the .gz file, or there is something wrong with that particular GCS object.
For the first issue, you could try decompressing the file and re-uploading it as uncompressed. While it may sound like a pain to send gigabytes of data over the network, the good news is that the import will be faster since it can be done in parallel (we can't import a compressed file in parallel since it can only be read sequentially).
For the second issue (which is somewhat less likely) you could try downloading the file yourself to make sure you don't get errors, or try re-uploading the same file and seeing if that works.
Using the Java SDK I am creating a load job for just a single record with a fairly complicated schema. When monitoring the status of the load job, it takes a surprisingly long time (but perhaps this is due to working out the schema), but then says:
11:21:06.975 [main] INFO xxx.GoogleBigQuery - Job status (21694ms) create_scans_1384744805079_172221126: DONE
11:24:50.618 [main] ERROR xxx.GoogleBigQuery - Job create_scans_1384744805079_172221126 caused error (invalid) with message
Too many errors encountered. Limit is: 0.
11:24:50.810 [main] ERROR xxx.GoogleBigQuery - {
"message" : "Too many errors encountered. Limit is: 0.",
"reason" : "invalid"
?}
BTW - how do I tell the job that it can have more than zero errors using Java?
This load job does not appear in the list of recent jobs in the console, and as far as I can see, none of the Java objects contains any more details about the actual errors encountered. So how can I pro-grammatically find out what is going wrong? All I can find is:
if (err != null) {
log.error("Job {} caused error ({}) with message\n{}", jobID, err.getReason(), err.getMessage());
try {
log.error(err.toPrettyString());
}
...
In general I am having a difficult time finding good documentation for some of these things and am working it out by trial and error and short snippets of code found on here and older groups. If there is a better source of information than the getting started guides, then I would appreciate any pointers to that information. The Javadoc does not really help and I cannot find any complete examples of loading, querying, testing for errors, cataloging errors and so on.
This job is submitted via a NEWLINE_DELIMITIED_JSON record, supplied to the job via:
InputStream dummy = getClass().getResourceAsStream("/googlebigquery/xxx.record");
final InputStreamContent jsonIn = new InputStreamContent("application/octet-stream", dummy);
createTableJob = bigQuery.jobs().insert(projectId, loadJob, jsonIn).execute();
My authentication and so on seems to work correctly as separate Java code to list the projects, and the datasets in the project all works correctly. So I just need help in working what the actual error is - does it not like the schema (I have records nested within records for instance), or does it think that there is an error in the data I am submitting.
Thanks in advance for any help. The job number cited above is an actual failed load job if that helps any Google staffers who might read this.
It sounds like you have a couple of questions, so I'll try to address them all.
First, the way to get the status of the job that failed is to call jobs().get(jobId), which returns a job object that has an errorResult object that has the error that caused the job to fail (e.g. "too many errors"). The errorStream list is a lost of all of the errors on the job, which should tell you which lines hit errors.
Note if you have the job id, it may be easier to use bq to lookup the job -- you can run bq show <job_id> to get the job error information. If you add the --format=prettyjson it will print out all of the information in the job.
A hint you also might want to consider is to supply your own job id when you create the job -- then even if there is an error starting the job (i.e. the insert() call fails, perhaps due to a network error) you can look up the job to see what actually happened.
To tell BigQuery that some errors are allowed during import, you can use the maxBadResults setting in the load job. See https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/JobConfigurationLoad.html#getMaxBadRecords().
I am able to import data from my google storage. However, having troubling exporting data to Google Cloud Storage CSV files through the web console. Data set is small, and I am not getting any specific reasons that cause the issue.
Extract9:30am
gl-analytics:glcqa.Device togs://glccsv/device.csv
Errors:
Unexpected. Please try again.
Job ID: job_f8b50cc4b4144e14a22f3526a2b76b75
Start Time: 9:30am, 24 Jan 2013
End Time: 9:30am, 24 Jan 2013
Source Table: gl-analytics:glcqa.Device
Destination URI: gs://glccsv/device.csv
It looks like you have a nested schema, which cannot be output to csv. Try setting the output format to JSON.
Note this bug has now been fixed internally, so after our next release you'll get a better error when this happens.
I got the "backend error. Job aborted" , job ID below.
I know this question was asked, but I still need some help to try & resolve this.
what happen if this happen in production,we want to have a 5min periodic loads?
thanks in advance
Errors:
Backend error. Job aborted.
Job ID: job_744a2b54b1a343e1974acdae889a7e5c
Start Time: 4:32pm, 30 Aug 2012
End Time: 5:02pm, 30 Aug 2012
Destination Table: XXXXXXXXXX
Source URI: gs://XXXXX/XXXXXX.csv.Z
Delimiter: ,
Max Bad Records: 99999999
This job hit an internal error. Since you ran this job, BigQuery has been updated to a new version, and a number of internal errors have been fixed. Can you retry your job?