Getting 'Backend Error' from the bq tool when uploading and importing data into BigQuery

I'm getting the error "BigQuery error in load operation: Backend Error" when I try to upload and import data into BQ. I already reduced the file size and increased the time between imports, but nothing helps. The strange thing is that if I wait for a while and retry, it just works.
In the BigQuery Browser Tool it shows up as an error in some line/field, but I checked and there is none. And this is obviously a misleading message, because if I wait and retry the upload/import of the same file, it works.
Thanks

I looked up your failing jobs in the BigQuery backend, and I couldn't find any jobs that terminated with 'backend error'. I found several that failed because there were ASCII null characters in the data (it can be helpful to look at the errors in the error stream, not just the error result). It is possible that the data got garbled on the way to BigQuery... are you certain the data did not change between the failing import and the successful one of the same data?
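For example, a minimal sketch of reading both with the google-cloud-bigquery Python client (the project and job ID here are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project="my-project")            # placeholder project
job = client.get_job("bqjob_r1234_0001", location="US")   # placeholder job ID

# error_result holds only the single error that terminated the job...
print("error_result:", job.error_result)

# ...while errors holds the full error stream, e.g. per-row parse errors.
for err in job.errors or []:
    print(err.get("reason"), err.get("location"), err.get("message"))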

I've found that exporting from a BigQuery table to CSV in Cloud Storage hits the same error when certain characters are present in one of the columns (in this case, a column storing the raw results from a prediction analysis). Removing that column from the export resolved the issue.
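If you need the rest of the table, one option is to materialize everything except that column and export the result instead; a rough sketch with the google-cloud-bigquery Python client (the table, column, and bucket names are made up):

from google.cloud import bigquery

client = bigquery.Client()

# Materialize everything except the problematic column into a staging table.
staging = "mydataset.predictions_for_export"
job_config = bigquery.QueryJobConfig(destination=staging,
                                     write_disposition="WRITE_TRUNCATE")
client.query(
    "SELECT * EXCEPT(raw_prediction_output) FROM `mydataset.predictions`",
    job_config=job_config,
).result()

# Export the staging table to CSV in Cloud Storage.
client.extract_table(staging, "gs://my-bucket/predictions-*.csv").result()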

Related

BigQuery autodetect doesn't work with inconsistent json?

I'm trying to upload JSON to BigQuery, with --autodetect so I don't have to manually discover and write out the whole schema. The rows of JSON don't all have the same form, and so fields are introduced in later rows that aren't in earlier rows.
Unfortunately I get the following failure:
Upload complete.
Waiting on bqjob_r1aa6e3302cfc399a_000001712c8ea62b_1 ... (1s) Current status: DONE
BigQuery error in load operation: Error processing job '[...]:bqjob_r1aa6e3302cfc399a_000001712c8ea62b_1': Error while reading data, error message: JSON table encountered too many errors, giving up.
Rows: 1209; errors: 1. Please look into the errors[] collection for more details.
Failure details:
- Error while reading data, error message: JSON processing
encountered too many errors, giving up. Rows: 1209; errors: 1; max
bad: 0; error percent: 0
- Error while reading data, error message: JSON parsing error in row
starting at position 829980: No such field:
mc.marketDefinition.settledTime.
Here's the data I'm uploading: https://gist.github.com/max-sixty/c717e700a2774ba92547c7585b2b21e3
Maybe autodetect uses the first n rows, and then fails if rows after n are different? If that's the case, is there any way of resolving this?
Is there any tool I could use to pull out the schema from the whole file and then pass to BigQuery explicitly?
I found two tools that can help:
bigquery-schema-generator 0.5.1, which uses all the data to derive the schema instead of the 100 sample rows that BigQuery uses (see the sketch after this list).
Spark SQL, for which you would need to set up your dev environment, or at least install Spark and invoke the spark-shell tool.
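For the first option, here is a rough sketch of deriving the schema from the whole file, assuming the SchemaGenerator interface described in the bigquery-schema-generator README (file names are placeholders):

import json
from bigquery_schema_generator.generate_schema import SchemaGenerator

# Scan every row of the newline-delimited JSON file, not just a 100-row sample.
generator = SchemaGenerator(input_format="json")
with open("data.json") as f:
    schema_map, errors = generator.deduce_schema(f)

for err in errors:
    print("line {line}: {msg}".format(**err))

# Write a schema file that can be passed to `bq load --schema=schema.json ...`
with open("schema.json", "w") as out:
    json.dump(generator.flatten_schema(schema_map), out, indent=2)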
However, I noticed that the file is intended to fail; see this text in the link you shared: "Sample for BigQuery autodetect failure". So I'm not sure that such tools can work on a JSON file designed to fail.
Last but not least, I got the JSON imported after I manually removed the problematic field: "settledTime":"2020-03-01T02:55:47.000Z".
Hope this info helps.
Yes, see documentation here:
https://cloud.google.com/bigquery/docs/schema-detect
When auto-detection is enabled, BigQuery starts the inference process by selecting a random file in the data source and scanning up to 100 rows of data to use as a representative sample. BigQuery then examines each field and attempts to assign a data type to that field based on the values in the sample.
So if the data in the rest of the rows does not conform to the initial rows, you should not use autodetect and need to provide an explicit schema.
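For example, a minimal sketch of loading with an explicit schema instead of --autodetect, using the google-cloud-bigquery Python client (the schema file and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Read a full schema (e.g. one generated from the whole file) instead of
# letting BigQuery infer it from a 100-row sample.
schema = client.schema_from_json("schema.json")

job_config = bigquery.LoadJobConfig(
    schema=schema,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)

with open("data.json", "rb") as f:
    load_job = client.load_table_from_file(f, "mydataset.mytable",
                                           job_config=job_config)

load_job.result()  # raises if the load fails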
Autodetect may not work well, since it looks only at the first 100 rows to detect the schema, and using schema detection for JSON could be a costly endeavor.
How about using BqTail with the AllowFieldAddition option, which allows you to expand the schema cost-effectively?
You could simply use the following ingestion workflow with the CLI or a serverless deployment:
bqtail -r=rule.yaml -s=sourceURL
#rule.yaml
When:
  Prefix: /data/somefolder
  Suffix: .json
Async: false
Dest:
  Table: mydataset.mytable
  AllowFieldAddition: true
  Transient:
    Template: mydataset.myTableTempl
    Dataset: temp
Batch:
  MultiPath: true
  Window:
    DurationInSec: 15
OnSuccess:
  - Action: delete
See JSON with allow field addition e2e test case

BigQuery Error 6034920 when using UPDATE FROM

We are trying to perform an ELT in BigQuery. When using UPDATE FROM, it fails on some tables with the following error:
"An internal error occurred and the request could not be completed.
Error: 6034920"
Moreover, both the source and destination tables consist of data from a single partition.
We are unable to find any details for error code 6034920. Any insight or solutions would be really appreciated.
It is a transient, internal error in BigQuery. This behavior is related to a BigQuery shuffling component (in the BQ service backend), and engineers are working to solve it. At the moment there is no ETA for a fix.
In the meantime, as a workaround, you should retry the query whenever you hit this behavior. You can continue tracking the logs related to this issue in Stackdriver by using the following filter:
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobStatus.additionalErrors.message="An internal error occurred and the request could not be completed. Error: 6034920"
What you can also try is to stop putting values into the partitioning column; that will hopefully fix the job failures. I hope you find the above pieces of information useful.
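As a rough sketch, the retry workaround could look like this with the google-cloud-bigquery Python client (the statement, table names, and retry limits are only illustrative):

import time
from google.api_core.exceptions import InternalServerError
from google.cloud import bigquery

client = bigquery.Client()

# Illustrative UPDATE ... FROM statement; table and column names are made up.
UPDATE_SQL = """
UPDATE `mydataset.dest` d
SET value = s.value
FROM `mydataset.source` s
WHERE d.id = s.id
"""

for attempt in range(5):
    try:
        client.query(UPDATE_SQL).result()
        break
    except InternalServerError as exc:   # e.g. "Error: 6034920"
        print(f"attempt {attempt + 1} failed: {exc}; retrying")
        time.sleep(2 ** attempt)         # simple exponential backoff
else:
    raise RuntimeError("UPDATE kept failing with internal errors")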

Inconsistency in BigQuery Data Ingestion on Streaming Error

Hi,
While streaming data to BigQuery, we are facing some inconsistency in the ingested data when making https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/insertAll requests using the BigQuery Java library.
Some of the batches fail with error code: backendError, while some requests time-out with exception stacktrace: https://gist.github.com/anonymous/18aea1c72f8d22d2ea1792bb2ffd6139
For batches which have failed, we have observed 3 different kinds of behaviours related to ingested data:
All records in that batch fail to be ingested into BigQuery
Only some of the records fail to be ingested into BigQuery
All records successfully get ingested into BigQuery in spite of the thrown error
Our questions are:
How can we distinguish between these 3 cases?
For case 2, how can we handle partially ingested data, i.e., which records from that batch should be retried?
For case 3, if all records were successfully ingested, why is error thrown at all?
Thanks in advance...
For partial success, the error response will indicate which rows got inserted and which ones failed, especially for parsing errors. There are cases where the response fails to reach your client, resulting in timeout errors even though the insert succeeded.
In general, you can retry the entire batch and it will be deduplicated if you use the approach outlined in the data consistency documentation.
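A sketch of that pattern with the google-cloud-bigquery Python client (the same idea applies to the Java library's per-row insert IDs; the table name and rows are placeholders):

import uuid
from google.cloud import bigquery

client = bigquery.Client()
table = "mydataset.events"                                   # placeholder table
rows = [{"user": "a", "value": 1}, {"user": "b", "value": 2}]

# A stable insertId per row lets BigQuery de-duplicate retries (best effort).
row_ids = [str(uuid.uuid4()) for _ in rows]

errors = client.insert_rows_json(table, rows, row_ids=row_ids)

if errors:
    # Partial failure: each entry names the failing row's index and reasons,
    # so only those rows need to be fixed and retried (case 2).
    failed = [e["index"] for e in errors]
    print("failed rows:", failed, errors)

# On a timeout (case 3 territory), retry the whole batch with the SAME
# row_ids; BigQuery drops the duplicates on a best-effort basis.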

Import Flat File to MS SQL

I am trying to import a CSV into MSSQL 2008 by using the flat file import method, but I am getting an overflow error. Any ideas on how to get around it?
I have used the tool before for files containing 10K-15K records, but this file has 75K records in it...
These are the error messages
Messages
Error 0xc020209c: Data Flow Task 1: The column data for column "ClientBrandID" overflowed the disk I/O buffer.
(SQL Server Import and Export Wizard)
Error 0xc0202091: Data Flow Task 1: An error occurred while skipping data rows.
(SQL Server Import and Export Wizard)
Error 0xc0047038: Data Flow Task 1: SSIS Error Code DTS_E_PRIMEOUTPUTFAILED. The PrimeOutput method on component "Source - Shows_csv" (1) returned error code 0xC0202091. The component returned a failure code when the pipeline engine called PrimeOutput(). The meaning of the failure code is defined by the component, but the error is fatal and the pipeline stopped executing. There may be error messages posted before this with more information about the failure.
(SQL Server Import and Export Wizard)
This could be a format problem with the CSV file, e.g. the delimiter. Check whether the delimiters are consistent within the file.
It could also be a problem with blank lines. I had a similar problem a while ago and solved it by removing all blank lines from the CSV file. Worth a try anyway.
You may have one or more bad data elements. Try loading a small subset of your data to determine if it's a small number of bad records or a large one. This will also tell you if your loading scheme is working and your datatypes match.
Sometimes you can quickly spot data issues if you open the CSV file in Excel.
Another possible reason for this error is that the input file has the wrong encoding, so when you manually check the data it seems fine. For example, in my case the correct files were in 8-bit ANSI and the wrong files in UTF-16; you can tell the difference by looking at the file size, as the wrong files were twice as big as the correct ones.
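A quick way to check is to look at the first bytes of the file for a byte-order mark; a small Python sketch (the file name is assumed from the error messages above):

# UTF-16 files typically start with a byte-order mark (FF FE or FE FF),
# while plain 8-bit ANSI files do not.
def sniff_encoding(path):
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(b"\xff\xfe") or head.startswith(b"\xfe\xff"):
        return "utf-16"
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8 with BOM"
    return "no BOM (likely ANSI or plain UTF-8)"

print(sniff_encoding("Shows.csv"))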

Backend error on import from Cloud Storage to BigQuery

Recently, we have begun to see a number of errors such as this when importing from Cloud Storage to BigQuery:
Waiting on job_72ae7db68bb14e93b7a6990ed628aedd ... (153s) Current status: RUNNING
BigQuery error in load operation: Backend Error
Waiting on job_894172da125943dbb2cd8891958d2d10 ... (364s) Current status: RUNNING
BigQuery error in load operation: Backend Error
This process runs hourly, and had previously been stable for a long time. Nothing has changed in the import script or the types of data being loaded. Please let me know if you need any more information.
I looked up these jobs in the BigQuery logs -- both of them appear to have succeeded. It is possible that the error you got occurred while reading the job state. I've filed an internal bug that we should distinguish between errors in the job and errors getting the state of the job in the bq tool.
After the job runs, you can use bq show -j <job_id> to see what the actual state of the job is. If it is still running, you can run bq wait <job_id>.
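For example, a minimal sketch of checking the real job state with the google-cloud-bigquery Python client, using one of the job IDs from the output above:

from google.cloud import bigquery

client = bigquery.Client()

# Fetch the job that the bq tool reported as failed and inspect its real state.
job = client.get_job("job_72ae7db68bb14e93b7a6990ed628aedd")

print(job.state)          # may be DONE even though the CLI printed Backend Error
print(job.error_result)   # None if the load actually succeeded

if job.state != "DONE":
    job.result()          # blocks until completion, like `bq wait <job_id>`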
I also took a look at the front-end logs; all of the status requests for those job ids returned HTTP 200 (success) codes.
Can you add the --apilog=file.txt parameter to your bq command line (you'll need to add it to the beginning of the command line, as in bq --apilog=file.txt load ...) and send the output of a case where you get another failure? If you're worried about sensitive data, feel free to send it directly to me (tigani at google).
Thanks,
Jordan Tigani
Google BigQuery Engineer