BigQuery autodetect doesn't work with inconsistent json? - google-bigquery

I'm trying to upload JSON to BigQuery, with --autodetect so I don't have to manually discover and write out the whole schema. The rows of JSON don't all have the same form, and so fields are introduced in later rows that aren't in earlier rows.
Unfortunately I get the following failure:
Upload complete.
Waiting on bqjob_r1aa6e3302cfc399a_000001712c8ea62b_1 ... (1s) Current status: DONE
BigQuery error in load operation: Error processing job '[...]:bqjob_r1aa6e3302cfc399a_000001712c8ea62b_1': Error while reading data, error message: JSON table encountered too many errors, giving up.
Rows: 1209; errors: 1. Please look into the errors[] collection for more details.
Failure details:
- Error while reading data, error message: JSON processing
encountered too many errors, giving up. Rows: 1209; errors: 1; max
bad: 0; error percent: 0
- Error while reading data, error message: JSON parsing error in row
starting at position 829980: No such field:
mc.marketDefinition.settledTime.
Here's the data I'm uploading: https://gist.github.com/max-sixty/c717e700a2774ba92547c7585b2b21e3
Maybe autodetect uses the first n rows, and then fails if rows after n are different? If that's the case, is there any way of resolving this?
Is there any tool I could use to pull out the schema from the whole file and then pass to BigQuery explicitly?

I found two tools that can help:
bigquery-schema-generator 0.5.1 that uses all the data to get the schema instead of 100 sample rows like BigQuery.
Spark SQL, you should to setup your dev env, or at least install Spark and invoke the spark-shell tool.
However, I noticed that the file is intended to fail, see this text in the link you shared: "Sample for BigQuery autodetect failure". So, I'm not pretty sure that such tools can work for a json file intended to fail.
The last but not least, I got the json imported after I removed manually the problematic field: "settledTime":"2020-03-01T02:55:47.000Z".
Hope this info helps.

Yes, see documentation here:
https://cloud.google.com/bigquery/docs/schema-detect
When auto-detection is enabled, BigQuery starts the inference process by selecting a random file in the data source and scanning up to 100 rows of data to use as a representative sample. BigQuery then examines each field and attempts to assign a data type to that field based on the values in the sample.
So if the data in the rest of the rows does not comply with initial rows, you should not use autodetect and need to provide explicit schema.

Autodetect may not work well since it looks only into the first 100 rows to detect schema. Using schema detection for JSON could be a costly endeavor.
How about using BqTail with AllowFieldAddition option allowing cost-effectively expand schema.
You could simply use the following ingestion workflow with CLI or serverless
bqtail -r=rule.yaml -s=sourceURL
#rule.yaml
When:
Prefix: /data/somefolder
Suffix: .json
Async: false
Dest:
Table: mydataset.mytable
AllowFieldAddition: true
Transient:
Template: mydataset.myTableTempl
Dataset: temp
Batch:
MultiPath: true
Window:
DurationInSec: 15
OnSuccess:
- Action: delete
See JSON with allow field addition e2e test case

Related

Pub/Sub to BigQuery template

Currently evaluating stream process from pub/sub to bq by using template that provided by Google and keep getting this error:
Schema of the message in the BQ is matched. But somehow job is not able to write message to BQ. Any ideas what could be the issue on this?
It looks like an information error because you get an error while inserting the data into BigQuery.
You need to validate these things:
Some fields are missing.
The type of the data you are inserting mismatches with the table
schema.
A mandatory field has no data.
A date does not have the correct format.
You can execute this code in the terminal to get more information about this error.
bq --format=prettyjson show -j <JobID>
It will return a JSON with more details. You can see this example:
"message": "Error while reading data, error message: Could not parse '16.66666666666667' as int for field Course_Percentage (position 46) starting at location 1717164"
You can see this document about troubleshooting streaming inserts.
You can use this option: go to the “Logging” section, then “Logs Explorer” in the console, and see more details about errors.
You can filter the error by product name, in this case “Dataflow” or “Bigquery”.
Use another filter by hour.
You’ll see all the errors by product, and if you click each one, you’ll see more details about this specific error.
You can see this documentation about logging.

Saving output from parsing json file and passing it to Bigqueryinsertjoboperator

I need some advise on solving this requirement for auditing purpose . I am using airflow composer - dataflow java operator job which spits out json file after job completion with status and error message details (into airflow data folder ) . I want to extract the status and error message from json file via some operator and then pass the variable to next pipeline job Bigqueryinsertjoboperator which calls the stored proc and passes status and error message as input parameter and finally gets written into BQ dataset table.
Thanks
You need to do XCom and JINJA templating. When you return meta-data from the operator, the data is stored in XCom and you can retrieve it using JINJA templating or Python code in Python operator (or Python code in your custom operator).
Those are two very good articles from Marc Lamberti (who also has really nice courses on Airlfow) describing how templating and jinja can be leveraged in Airflow https://marclamberti.com/blog/templates-macros-apache-airflow/ and this one describes XCom: https://marclamberti.com/blog/airflow-xcom/
By combining the two you can get what you want.

loading avro files with different schemas into one bigquery table

I have a set of avro files with slightly varying schemas which I'd like to load into one bq table.
Is there a way to do that with one line? Every automatic way to handle schema difference would be fine for me.
Here is what I tried so far.
0) If I try to do it in a straightforward way, bq fails with error:
bq load --source_format=AVRO myproject:mydataset.logs gs://mybucket/logs/*
Waiting on bqjob_r4e484dc546c68744_0000015bcaa30f59_1 ... (4s) Current status: DONE
BigQuery error in load operation: Error processing job 'iow-rnd:bqjob_r4e484dc546c68744_0000015bcaa30f59_1': The Apache Avro library failed to read data with the follwing error: EOF reached
1) Quick googling shows that there is --schema_update_option=ALLOW_FIELD_ADDITION option which, added to bq load job, changes nothing. ALLOW_FIELD_RELAXATION does not change anything either.
2) Actually, schema id is mentioned in the file name, so files look like:
gs://mybucket/logs/*_schemaA_*
gs://mybucket/logs/*_schemaB_*
Unfortunately, bq load does not allow more that on asterisk (as is written in bq manual too):
bq load --source_format=AVRO myproject:mydataset.logs gs://mybucket/logs/*_schemaA_*
BigQuery error in load operation: Error processing job 'iow-rnd:bqjob_r5e14bb6f3c7b6ec3_0000015bcaa641f3_1': Not found: Uris gs://otishutin-eu/imp/2016-06-27/*_schemaA_*
3) When I try to list the files explicitly, the list happens to be too long, so bq load does not work either:
bq load --source_format=AVRO myproject:mydataset.logs $(gsutil ls gs://mybucket/logs/*_schemaA_* | xargs | tr ' ' ',')
Too many positional args, still have ['gs://mybucket/logs/log_schemaA_2658.avro,gs://mybucket/logs/log_schemaA_2659.avro,gs://mybucket/logs/log_schemaA_2660.avro,...
4) When I try to use files as external table and list the files explicitly in external table definition, I also get "too many files" error:
BigQuery error in query operation: Table definition may not have more than 500 source_uris
I understand that I could first copy files to different folders and then process them folder-by-folder, and this is what I'm doing now as last resort, but this is only a small part of data processing pipeline, and copying is not acceptable as production solution.

Loading Avro-file to BigQuery fails with an internal error

Google BigQuery has on March 23, 2016 announced "Added support for Avro source format for load operations and as a federated data source in the BigQuery API or command-line tool". It says here "This is a Beta release of Avro format support. This feature is not covered by any SLA or deprecation policy and may be subject to backward-incompatible changes.". However, I'd expect the feature to work.
I didn't find anywhere code examples on how to use Avro format for loading. Neither I did find examples on how to use bq-tool for loading.
Here's my practical issue. I haven't been able to load data into BigQuery in Avro-format.
The following happens using bq-tool. The dataset, table name and bucket name have been obfuscated:
$ bq extract --destination_format=AVRO dataset.events_avro_test gs://BUCKET/events_bq_tool.avro
Waiting on bqjob_r62088699049ce969_0000015432b7627a_1 ... (36s) Current status: DONE
$ bq load --source_format=AVRO dataset.events_avro_test gs://BUCKET/events_bq_tool.avro
Waiting on bqjob_r6cefe75ece6073a1_0000015432b83516_1 ... (2s) Current status: DONE
BigQuery error in load operation: Error processing job 'dataset:bqjob_r6cefe75ece6073a1_0000015432b83516_1': An internal error occurred and the request could not be completed.
Basically, I am extracting from a table and inserting to the same table causing an internal error.
Additionally, I have Java program that does the same (extract from table X and load to table X) with the same result (internal error). But I think the above illustrates the problem as clearly as possible, and because of that I'm not sharing the code here. In Java, If I extract from an empty table and insert that, the insert job doesn't fail.
My questions are
I think BigQuery API should never fail with internal error. Why is that happening with my test?
Is the extracted Avro file compatible with an insert job?
There seems to be no specification what the Avro schema in an insert job is like, at least I couldn't find any. Could the documentation be created?
UPDATED 2016-04-25:
So far I've managed to get an Avro load job not to give an internal error based on the hint of not using REQUIRED fields. However, I haven't managed to load non-null values.
Consider this Avro-schema:
{
"type": "record",
"name": "root",
"fields": [
{
"name": "x",
"type": "string"
}
]
}
The BigQuery table has one column, x that is NULLABLE.
If I insert N (I've tried with one and two) rows (x being e.g. 1), I got N rows in BigQuery but x always having value null.
If I change the table so that X is REQUIRED I get an internal error.
There is no exact match from a BQ schema to Avro schema, and vice versa, so when you export a BQ table to Avro file and then import back, the schema will be different. I see the destination table of your load already exists, in this case we throw an error when the schema of the destination table doesn't match the schema we converted from the Avro schema. This should be an external error though, we're investigating why it's an internal error.
We're in the middle of upgrading the export pipeline, and the new import pipeline has a bug that doesn't work with the Avro file exported by the current pipeline. The fix should be deployed in a couple weeks. After that, if you import the exported file to a non-existent destination table, or a destination table with compatible schema, it should work. Meanwhile, importing your own Avro files should work. You can also query it directly on GCS without importing it.
There's a problem with the error mapping for the AVRO reader here. The error should have been along the lines of: "The reference schema differs from the existing data: The required field 'api_key' is missing"
Looking at your load job configuration, it includes REQUIRED fields. It sounds like some of the data you are trying to load doesn't specify these required fields, so the operation fails.
I suggest avoiding required fields.
So, there's a bug in BigQuery: an insert job using Avro format does not work if the destination table exists, but gives an internal error. The workaround is to use createDisposition CREATE_IF_NEEDED and not to have the pre-existing table there. I verified that this works.
Hua Zung's comment says that the bug will be fixed in "the fix should be deployed in a couple weeks". Needless to say that existing major bugs in the live system should be documented somewhere.
While updating the system, I really recommend improving the Avro documentation. Now there's no mention on what the Avro schema should be like (type record, name root and fields array having the columns(?)) and not even the fact that each record in the Avro file maps to a row in the destination table (obvious, but should be mentioned). Also what happens with schema mismatch is not documented.
Thanks for the help, I'll be now switching to Avro-format. It's so much better than CSV.

Getting error from bq tool when uploading and importing data on BigQuery - 'Backend Error'

I'm getting the error: BigQuery error in load operation: Backend Error when I try to upload and import data on BQ. I already reduced size, increased time between imports, but nothing helps. The strange thing is that if I wait for a time and retry it just works.
In the BigQuery Browser tool it appears like an error in some line/field, but I checked and there is none. And obviously this is a fake message, because if I wait and retry to upload/import the same file, it works.
Tnks
I looked up our failing jobs in the bigquery backend, and I couldn't find any jobs that terminated with 'backend error'. I found several that failed because there were ascii nulls found in the data. (it can be helpful to look at the error stream errors, not just the error result). It is possible that the data got garbled on the way to bigquery... are you certain the data did not change between the failing import and the successful one on the same data?
I've found exporting from a big query table to csv in cloud storage hits the same error when certain characters are present in one of the columns (in this case a column storing the raw results from a prediction analysis). By removing that column from the export it resolved the issue.