Pub/Sub to BigQuery template - google-bigquery

I'm currently evaluating stream processing from Pub/Sub to BigQuery using the template provided by Google and keep getting this error:
The schema of the message matches the BigQuery table, but somehow the job is not able to write the messages to BigQuery. Any ideas what the issue could be?

It looks like a data issue, because the failure happens while inserting the data into BigQuery.
You need to validate these things:
- Some fields are missing.
- The type of the data you are inserting does not match the table schema.
- A mandatory field has no data.
- A date does not have the correct format.
You can run this command in the terminal to get more information about the error:
bq --format=prettyjson show -j <JobID>
It will return a JSON with more details. You can see this example:
"message": "Error while reading data, error message: Could not parse '16.66666666666667' as int for field Course_Percentage (position 46) starting at location 1717164"
You can see this document about troubleshooting streaming inserts.
Another option: go to the “Logging” section, then “Logs Explorer” in the console, to see more details about the errors.
You can filter the errors by product name, in this case “Dataflow” or “BigQuery”, and add another filter by time.
You’ll see all the errors for that product, and if you click each one, you’ll see more details about that specific error.
You can see this documentation about logging.
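If you prefer the command line, something like this should list the most recent error-level Dataflow log entries (a sketch; the filter values are illustrative and may need adjusting for your job):
# Show error-level Dataflow logs from the last hour
gcloud logging read 'resource.type="dataflow_step" AND severity>=ERROR' --freshness=1h --limit=20 --format=json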

Related

BigQuery autodetect doesn't work with inconsistent json?

I'm trying to upload JSON to BigQuery, with --autodetect so I don't have to manually discover and write out the whole schema. The rows of JSON don't all have the same form, and so fields are introduced in later rows that aren't in earlier rows.
Unfortunately I get the following failure:
Upload complete.
Waiting on bqjob_r1aa6e3302cfc399a_000001712c8ea62b_1 ... (1s) Current status: DONE
BigQuery error in load operation: Error processing job '[...]:bqjob_r1aa6e3302cfc399a_000001712c8ea62b_1': Error while reading data, error message: JSON table encountered too many errors, giving up.
Rows: 1209; errors: 1. Please look into the errors[] collection for more details.
Failure details:
- Error while reading data, error message: JSON processing
encountered too many errors, giving up. Rows: 1209; errors: 1; max
bad: 0; error percent: 0
- Error while reading data, error message: JSON parsing error in row
starting at position 829980: No such field:
mc.marketDefinition.settledTime.
Here's the data I'm uploading: https://gist.github.com/max-sixty/c717e700a2774ba92547c7585b2b21e3
Maybe autodetect uses the first n rows, and then fails if rows after n are different? If that's the case, is there any way of resolving this?
Is there any tool I could use to pull out the schema from the whole file and then pass to BigQuery explicitly?
I found two tools that can help:
- bigquery-schema-generator 0.5.1, which uses all the data to derive the schema instead of the 100 sample rows that BigQuery uses (see the sketch after this list).
- Spark SQL; you would need to set up your dev environment, or at least install Spark and invoke the spark-shell tool.
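For the first option, a minimal usage sketch, assuming the data is newline-delimited JSON in data.json and that the file and table names are placeholders (the exact flags may differ by version):
# Derive a schema from the whole file, then load with it explicitly
pip install bigquery-schema-generator
generate-schema < data.json > schema.json
bq load --source_format=NEWLINE_DELIMITED_JSON mydataset.mytable data.json schema.json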
However, I noticed that the file is intended to fail; see this text in the link you shared: "Sample for BigQuery autodetect failure". So I'm not sure such tools can work for a JSON file that is intended to fail.
Last but not least, I got the JSON imported after I manually removed the problematic field: "settledTime":"2020-03-01T02:55:47.000Z".
Hope this info helps.
Yes, see documentation here:
https://cloud.google.com/bigquery/docs/schema-detect
When auto-detection is enabled, BigQuery starts the inference process by selecting a random file in the data source and scanning up to 100 rows of data to use as a representative sample. BigQuery then examines each field and attempts to assign a data type to that field based on the values in the sample.
So if the data in the rest of the rows does not comply with the initial rows, you should not use autodetect and need to provide an explicit schema.
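For example, with an inline schema definition the load could look like this (a sketch; the table name and field list are illustrative). For nested or repeated fields you would pass a JSON schema file instead of the inline field list:
# Provide the schema explicitly instead of --autodetect
bq load --source_format=NEWLINE_DELIMITED_JSON mydataset.mytable gs://mybucket/data.json name:STRING,settledTime:TIMESTAMP,total:FLOAT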
Autodetect may not work well since it looks only at the first 100 rows to detect the schema. Using schema detection for JSON could be a costly endeavor.
How about using BqTail with the AllowFieldAddition option, which allows you to expand the schema cost-effectively?
You could simply use the following ingestion workflow with the CLI or serverless:
bqtail -r=rule.yaml -s=sourceURL
#rule.yaml
When:
  Prefix: /data/somefolder
  Suffix: .json
Async: false
Dest:
  Table: mydataset.mytable
  AllowFieldAddition: true
  Transient:
    Template: mydataset.myTableTempl
    Dataset: temp
Batch:
  MultiPath: true
  Window:
    DurationInSec: 15
OnSuccess:
  - Action: delete
See the "JSON with allow field addition" e2e test case.

Cannot delete phantom table using `bq` or BigQuery UI

I'm trying to remove a table from a dataset using bq without success:
BigQuery error in rm operation: Not found: Table carbon-web-...:AS_....Orders_01Jun2014_31May2015_3704438_01
The table is listed whenever I run bq ls AS_....
I'm seeing similar behavior when I try to access the table from the BigQuery UI. When I click on the link to the table, I receive an error message:
Unable to find table: carbon-web-...:AS_....Orders_01May2017_31May2017
Is there a way to force a refresh on the metadata for this dataset?
These are tables in a transient state that shouldn't have been exposed. We found a bug in a feature we were rolling out for listing tables where, in some rare scenarios, tables in a transient state would show up in the list. We have reverted that now.

BigQuery "Backend Error, Job aborted" when exporting data

The export job for one of my tables fails in BigQuery with no error message. I checked the job ID hoping to get more info, but it just says "Backend Error, Job aborted". I used the command-line tool with this command:
bq extract --project_id=my-proj-id --destination_format=NEWLINE_DELIMITED_JSON 'test.table_1' gs://mybucket/export
I checked this question, but I know it is not a problem with my destination bucket in GCS, because exporting other tables to the same bucket works fine.
The only difference here is that this table has a repeated record field and each JSON row can get pretty large, but I did not find any limit for this in the BigQuery docs.
Any ideas on what the problem could be?
Job Id from one of my tries: bqjob_r51435e780aefb826_0000015691dda235_1

Loading Avro-file to BigQuery fails with an internal error

On March 23, 2016, Google BigQuery announced "Added support for Avro source format for load operations and as a federated data source in the BigQuery API or command-line tool". It says here: "This is a Beta release of Avro format support. This feature is not covered by any SLA or deprecation policy and may be subject to backward-incompatible changes.". However, I'd expect the feature to work.
I couldn't find code examples anywhere on how to use the Avro format for loading, nor did I find examples of how to use the bq tool for it.
Here's my practical issue: I haven't been able to load data into BigQuery in Avro format.
The following happens using the bq tool. The dataset, table name, and bucket name have been obfuscated:
$ bq extract --destination_format=AVRO dataset.events_avro_test gs://BUCKET/events_bq_tool.avro
Waiting on bqjob_r62088699049ce969_0000015432b7627a_1 ... (36s) Current status: DONE
$ bq load --source_format=AVRO dataset.events_avro_test gs://BUCKET/events_bq_tool.avro
Waiting on bqjob_r6cefe75ece6073a1_0000015432b83516_1 ... (2s) Current status: DONE
BigQuery error in load operation: Error processing job 'dataset:bqjob_r6cefe75ece6073a1_0000015432b83516_1': An internal error occurred and the request could not be completed.
Basically, I am extracting from a table and inserting into the same table, which causes an internal error.
Additionally, I have a Java program that does the same (extract from table X and load to table X) with the same result (an internal error). But I think the above illustrates the problem as clearly as possible, and because of that I'm not sharing the code here. In Java, if I extract from an empty table and insert that, the insert job doesn't fail.
My questions are:
- I think the BigQuery API should never fail with an internal error. Why is that happening with my test?
- Is the extracted Avro file compatible with an insert job?
- There seems to be no specification of what the Avro schema in an insert job should look like, at least I couldn't find any. Could the documentation be created?
UPDATED 2016-04-25:
So far I've managed to get an Avro load job not to give an internal error based on the hint of not using REQUIRED fields. However, I haven't managed to load non-null values.
Consider this Avro-schema:
{
  "type": "record",
  "name": "root",
  "fields": [
    {
      "name": "x",
      "type": "string"
    }
  ]
}
The BigQuery table has one column, x, which is NULLABLE.
If I insert N rows (I've tried with one and two, x being e.g. "1"), I get N rows in BigQuery, but x always has the value null.
If I change the table so that x is REQUIRED, I get an internal error.
There is no exact match from a BQ schema to an Avro schema, and vice versa, so when you export a BQ table to an Avro file and then import it back, the schema will be different. I see the destination table of your load already exists; in this case we throw an error when the schema of the destination table doesn't match the schema we converted from the Avro schema. This should be an external error though; we're investigating why it's an internal error.
We're in the middle of upgrading the export pipeline, and the new import pipeline has a bug that doesn't work with the Avro file exported by the current pipeline. The fix should be deployed in a couple weeks. After that, if you import the exported file to a non-existent destination table, or a destination table with compatible schema, it should work. Meanwhile, importing your own Avro files should work. You can also query it directly on GCS without importing it.
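"Query it directly on GCS" refers to a federated (external) query. A sketch of how that might look with the bq tool, reusing the placeholder bucket and file names from above (the exact flag syntax may vary by bq version):
# Query the exported Avro file in place through a temporary external table
bq query --external_table_definition=events::AVRO=gs://BUCKET/events_bq_tool.avro 'SELECT COUNT(*) FROM events'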
There's a problem with the error mapping for the AVRO reader here. The error should have been along the lines of: "The reference schema differs from the existing data: The required field 'api_key' is missing"
Looking at your load job configuration, it includes REQUIRED fields. It sounds like some of the data you are trying to load doesn't specify these required fields, so the operation fails.
I suggest avoiding required fields.
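In Avro terms, a non-required (NULLABLE) column corresponds to a field whose type is a union with "null". A sketch of how the field x from the question could be declared that way (illustrative only):
{"name": "x", "type": ["null", "string"], "default": null}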
So there's a bug in BigQuery: an insert job using the Avro format does not work if the destination table exists; it gives an internal error instead. The workaround is to use createDisposition CREATE_IF_NEEDED and not have the pre-existing table there. I verified that this works.
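With the bq tool this simply means loading into a table that does not exist yet, since the load job then creates it (a sketch reusing the placeholder names from above; the new table name is made up):
# Load the exported Avro into a brand-new table instead of the existing one
bq load --source_format=AVRO dataset.events_avro_copy gs://BUCKET/events_bq_tool.avro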
Hua Zung's comment says the bug will be fixed ("the fix should be deployed in a couple weeks"). Needless to say, existing major bugs in the live system should be documented somewhere.
While updating the system, I really recommend improving the Avro documentation. Currently there's no mention of what the Avro schema should look like (type record, name root, and a fields array holding the columns?), nor of the fact that each record in the Avro file maps to a row in the destination table (obvious, but it should be mentioned). Also, what happens on a schema mismatch is not documented.
Thanks for the help, I'll now be switching to the Avro format. It's so much better than CSV.

Getting error from bq tool when uploading and importing data on BigQuery - 'Backend Error'

I'm getting the error BigQuery error in load operation: Backend Error when I try to upload and import data into BQ. I already reduced the size and increased the time between imports, but nothing helps. The strange thing is that if I wait a while and retry, it just works.
In the BigQuery Browser Tool it shows up as an error in some line/field, but I checked and there is none. And obviously that is a spurious message, because if I wait and retry uploading/importing the same file, it works.
Thanks
I looked up our failing jobs in the BigQuery backend, and I couldn't find any jobs that terminated with 'backend error'. I found several that failed because there were ASCII nulls in the data. (It can be helpful to look at the error stream errors, not just the error result.) It is possible that the data got garbled on the way to BigQuery... are you certain the data did not change between the failing import and the successful one on the same data?
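To see the full error stream rather than only the final error result, you can dump the job and inspect status.errors (a sketch; substitute your own job ID):
# status.errorResult is the single fatal error; status.errors lists every error encountered
bq --format=prettyjson show -j <JobID>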
I've found that exporting from a BigQuery table to CSV in Cloud Storage hits the same error when certain characters are present in one of the columns (in this case a column storing the raw results from a prediction analysis). Removing that column from the export resolved the issue.
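One way to do that without touching the source table is to materialize a copy that excludes the problematic column and export the copy instead. A sketch; the table, column, and bucket names are placeholders:
# Write everything except the problematic column to a temporary table, then export it
bq query --destination_table=mydataset.export_tmp --nouse_legacy_sql 'SELECT * EXCEPT(raw_prediction) FROM mydataset.mytable'
bq extract --destination_format=CSV mydataset.export_tmp 'gs://mybucket/export-*.csv'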