Loading Avro files with different schemas into one BigQuery table

I have a set of Avro files with slightly varying schemas which I'd like to load into one BigQuery table.
Is there a way to do that with one line? Any automatic way of handling the schema differences would be fine for me.
Here is what I have tried so far.
0) If I try to do it in a straightforward way, bq fails with an error:
bq load --source_format=AVRO myproject:mydataset.logs gs://mybucket/logs/*
Waiting on bqjob_r4e484dc546c68744_0000015bcaa30f59_1 ... (4s) Current status: DONE
BigQuery error in load operation: Error processing job 'iow-rnd:bqjob_r4e484dc546c68744_0000015bcaa30f59_1': The Apache Avro library failed to read data with the follwing error: EOF reached
1) A quick search shows that there is a --schema_update_option=ALLOW_FIELD_ADDITION option, but adding it to the bq load job changes nothing. ALLOW_FIELD_RELAXATION does not change anything either.
2) Actually, the schema id is part of the file name, so the files look like:
gs://mybucket/logs/*_schemaA_*
gs://mybucket/logs/*_schemaB_*
Unfortunately, bq load does not allow more than one asterisk (as the bq manual also states):
bq load --source_format=AVRO myproject:mydataset.logs gs://mybucket/logs/*_schemaA_*
BigQuery error in load operation: Error processing job 'iow-rnd:bqjob_r5e14bb6f3c7b6ec3_0000015bcaa641f3_1': Not found: Uris gs://otishutin-eu/imp/2016-06-27/*_schemaA_*
3) When I try to list the files explicitly, the list turns out to be too long, so bq load does not work either:
bq load --source_format=AVRO myproject:mydataset.logs $(gsutil ls gs://mybucket/logs/*_schemaA_* | xargs | tr ' ' ',')
Too many positional args, still have ['gs://mybucket/logs/log_schemaA_2658.avro,gs://mybucket/logs/log_schemaA_2659.avro,gs://mybucket/logs/log_schemaA_2660.avro,...
4) When I try to use the files as an external table and list them explicitly in the external table definition, I also get a "too many files" error:
BigQuery error in query operation: Table definition may not have more than 500 source_uris
I understand that I could first copy the files to different folders and then process them folder by folder, and this is what I'm doing now as a last resort. But this is only a small part of a data processing pipeline, and copying is not acceptable as a production solution.
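For anyone hitting the same limits on explicit file lists, here is a minimal sketch of grouping the URIs by the schema id in the file name and splitting each group into smaller comma-separated batches, one per load job. It only prepares the arguments and does not call bq or gsutil. The regular expression assumes the `_schemaA_` / `_schemaB_` naming from the question, and the function names and the default `batch_size` are my own assumptions (500 is borrowed from the external table error above, not from a documented load job limit).

```python
import re
from collections import defaultdict

# Assumed naming convention from the question: ..._schemaA_..., ..._schemaB_...
SCHEMA_ID = re.compile(r"_(schema[A-Za-z0-9]+)_")


def group_by_schema(uris):
    """Map schema id -> list of URIs, keeping the input order within a group."""
    groups = defaultdict(list)
    for uri in uris:
        match = SCHEMA_ID.search(uri)
        key = match.group(1) if match else "unknown"
        groups[key].append(uri)
    return dict(groups)


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def load_arguments(uris, batch_size=500):
    """Return (schema_id, comma-separated URI string) pairs, one per load job."""
    jobs = []
    for schema_id, group in sorted(group_by_schema(uris).items()):
        for chunk in batches(group, batch_size):
            jobs.append((schema_id, ",".join(chunk)))
    return jobs
```

Each returned string could then be passed as the single source argument of one bq load call, so that no individual job mixes schemas.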

BigQuery autodetect doesn't work with inconsistent json?

I'm trying to upload JSON to BigQuery, with --autodetect so I don't have to manually discover and write out the whole schema. The rows of JSON don't all have the same form, and so fields are introduced in later rows that aren't in earlier rows.
Unfortunately I get the following failure:
Upload complete.
Waiting on bqjob_r1aa6e3302cfc399a_000001712c8ea62b_1 ... (1s) Current status: DONE
BigQuery error in load operation: Error processing job '[...]:bqjob_r1aa6e3302cfc399a_000001712c8ea62b_1': Error while reading data, error message: JSON table encountered too many errors, giving up.
Rows: 1209; errors: 1. Please look into the errors[] collection for more details.
Failure details:
- Error while reading data, error message: JSON processing
encountered too many errors, giving up. Rows: 1209; errors: 1; max
bad: 0; error percent: 0
- Error while reading data, error message: JSON parsing error in row
starting at position 829980: No such field:
mc.marketDefinition.settledTime.
Here's the data I'm uploading: https://gist.github.com/max-sixty/c717e700a2774ba92547c7585b2b21e3
Maybe autodetect uses the first n rows, and then fails if rows after n are different? If that's the case, is there any way of resolving this?
Is there any tool I could use to pull out the schema from the whole file and then pass to BigQuery explicitly?
I found two tools that can help:
bigquery-schema-generator 0.5.1, which uses all the data to derive the schema instead of the 100 sample rows that BigQuery uses.
Spark SQL; you need to set up your dev environment, or at least install Spark and invoke the spark-shell tool.
However, I noticed that the file is intended to fail; see this text in the link you shared: "Sample for BigQuery autodetect failure". So I'm not sure that such tools can work for a JSON file that is intended to fail.
Last but not least, I got the JSON imported after I manually removed the problematic field: "settledTime":"2020-03-01T02:55:47.000Z".
Hope this info helps.
Yes, see documentation here:
https://cloud.google.com/bigquery/docs/schema-detect
When auto-detection is enabled, BigQuery starts the inference process by selecting a random file in the data source and scanning up to 100 rows of data to use as a representative sample. BigQuery then examines each field and attempts to assign a data type to that field based on the values in the sample.
So if the data in the rest of the rows does not match the initial rows, you should not use autodetect and instead need to provide an explicit schema.
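To see why a sample can miss a field, here is a rough sketch that collects every dotted field path from all rows of a newline-delimited JSON file and reports the paths that do not appear in the first rows. This is only an illustration of the idea of scanning the whole file; it is not how BigQuery or bigquery-schema-generator work internally. Using the first `sample_size` rows as the sample is a simplification, and all function names are made up for the example.

```python
import json


def merge_fields(schema, record, prefix=""):
    """Record every dotted field path in `record` as path -> set of Python type names."""
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            merge_fields(schema, value, prefix=path + ".")
        elif isinstance(value, list):
            # Repeated fields: descend into nested objects, note scalar item types.
            for item in value:
                if isinstance(item, dict):
                    merge_fields(schema, item, prefix=path + ".")
                else:
                    schema.setdefault(path, set()).add(type(item).__name__)
        else:
            schema.setdefault(path, set()).add(type(value).__name__)
    return schema


def scan_rows(lines):
    """Build the path -> types mapping from an iterable of JSON lines."""
    schema = {}
    for line in lines:
        line = line.strip()
        if line:
            merge_fields(schema, json.loads(line))
    return schema


def fields_missing_from_sample(lines, sample_size=100):
    """Return field paths that occur in the data but not in the first rows."""
    lines = list(lines)
    sampled = scan_rows(lines[:sample_size])
    full = scan_rows(lines)
    return sorted(set(full) - set(sampled))
```

Running `fields_missing_from_sample(open("data.json"))` on a local copy of the data (the file name is hypothetical) lists the fields that an explicit schema would have to include.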
Autodetect may not work well since it looks only at the first 100 rows to detect the schema. Using schema detection for JSON could be a costly endeavor.
How about using BqTail with the AllowFieldAddition option, which lets you expand the schema cost-effectively?
You could simply use the following ingestion workflow with the CLI or serverless:
bqtail -r=rule.yaml -s=sourceURL
#rule.yaml
When:
  Prefix: /data/somefolder
  Suffix: .json
Async: false
Dest:
  Table: mydataset.mytable
  AllowFieldAddition: true
  Transient:
    Template: mydataset.myTableTempl
    Dataset: temp
Batch:
  MultiPath: true
  Window:
    DurationInSec: 15
OnSuccess:
  - Action: delete
See JSON with allow field addition e2e test case

Bigquery Error: 8822097

When trying to load a JSON file into BigQuery, I get the following error: "An internal error occurred and the request could not be completed. Error: 8822097". Is this error related to hitting the BigQuery daily load limit? It would be great if someone could point me to a glossary of errors.
{Location: ""; Message: "An internal error occurred and the request could not be completed. Error: 8822097"; Reason: "internalError"
Thanks!
Are you trying to load different types of file in a single command?
It may happen when you try to load from a Google Storage path with both compressed and uncompressed files:
$ gsutil ls gs://bucket/path/
gs://bucket/path/a.txt
gs://bucket/path/b.txt.gz
$ bq load --autodetect --noreplace --source_format=NEWLINE_DELIMITED_JSON "project-id:dataset_name.table_name" gs://bucket/path/*
Waiting on bqjob_id_1 ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'project-id:bqjob_id_1': An internal error occurred and the request could not be completed. Error: 8822097
This error can occur when the table reaches the BigQuery limit of 10,000 columns per table.
To verify this, you can check the number of distinct columns in the table in question:
bq --format=json show project:dataset.table | jq . | grep "type" | grep -v "RECORD" | wc -l
Reducing the number of columns would probably be the best and quickest way to work around this issue.
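If jq is not available, the same kind of count can be sketched in Python by walking the schema in the JSON that `bq --format=json show` prints. This is a minimal sketch that counts non-RECORD fields recursively, assuming the output has been saved to a string; the function names are my own.

```python
import json


def count_leaf_columns(fields):
    """Count fields that are not RECORD/STRUCT containers, descending into nested ones."""
    total = 0
    for field in fields:
        if field.get("type") in ("RECORD", "STRUCT"):
            total += count_leaf_columns(field.get("fields", []))
        else:
            total += 1
    return total


def count_from_table_json(text):
    """Count leaf columns in the JSON table description printed by bq show."""
    table = json.loads(text)
    return count_leaf_columns(table.get("schema", {}).get("fields", []))
```

Unlike the grep pipeline, this only looks at entries under the schema, so other "type" keys in the table description are not counted.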
We got the same error "An internal error occurred and the request could not be completed. Error: 8822097" when running a standard SQL query. Running the corresponding legacy SQL query gave us an error message that was actually actionable:
Error while reading table: ABC, error message: The reference schema
differs from the existing data: The required field 'XYZ' is
missing.
Fixing the underlying error, exposed by the legacy SQL query, also fixed the error for the standard SQL query.
In our case we have avro files. The table was created from the avro files. Newer avro files didn't contain a certain field but the table still contained that field. Rebuilding the table from the new avro files solved the issue. We also have views on top of the table which may or may not change the resulting error message.

Error while loading AVRO files to BigQuery

I have successfully loaded a large number of AVRO files (all with the same schema, into the same table) stored on Google Cloud Storage, using the bq CLI utility.
However, for some of the AVRO files I get a very cryptic error while loading into BigQuery. The error says:
The Apache Avro library failed to read data with the follwing error: EOF
reached (error code: invalid)
I validated with avro-tools that the AVRO file is not corrupted. The report output:
java -jar avro-tools-1.8.1.jar repair -o report 2017-05-15-07-15-01_48a99.avro
Recovering file: 2017-05-15-07-15-01_48a99.avro
File Summary:
Number of blocks: 51 Number of corrupt blocks: 0
Number of records: 58598 Number of corrupt records: 0
I tried creating a brand new table with one of the failing files in case it was due to a schema mismatch, but that didn't help, as the error was exactly the same.
I need help figuring out what could be causing the error here.
No way to pinpoint the issue without more information, but I ran into this error message and filed a ticket here.
In my case, a number of files in a single load job were missing columns, which was causing the error.
Explanation from the ticket.
BigQuery uses the alphabetically last file from the directory as the avro schema to read the other Avro files. I suspect the issue is with schema incompatibility between the last file and the "problematic" file. Do you know if all the files have the exact same schema or differ? One thing you could try to help verify this is to copy the alphabetically last file of the directory and the "problematic" file to a different folder and try to load those two files in one BigQuery load job and see if the error reproduces.
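A quick way to look for this kind of mismatch before loading is to compare the field names of each file's schema against the alphabetically last file. The sketch below assumes the Avro schemas have already been extracted as parsed JSON dicts keyed by file name (for example with avro-tools getschema, which is my assumption about how you would obtain them); it only compares top-level field names, and the function names are hypothetical.

```python
def field_names(avro_schema):
    """Top-level field names of an Avro record schema given as a parsed JSON dict."""
    return {field["name"] for field in avro_schema.get("fields", [])}


def missing_fields(reference_schema, other_schema):
    """Fields present in the reference schema but absent from the other schema."""
    return sorted(field_names(reference_schema) - field_names(other_schema))


def report_mismatches(schemas_by_filename):
    """Compare every file against the alphabetically last one.

    Returns (reference file name, {file name: [missing field names]}).
    """
    reference = max(schemas_by_filename)  # lexicographically last file name
    reference_schema = schemas_by_filename[reference]
    report = {}
    for name, schema in schemas_by_filename.items():
        if name == reference:
            continue
        missing = missing_fields(reference_schema, schema)
        if missing:
            report[name] = missing
    return reference, report
```

Any file that shows up in the report is a candidate for the "problematic" file described in the ticket.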

Unable to load avro file to BigQuery because of schema mismatch

I am new to BigQuery and was trying to load an Avro file into a BigQuery table. The first two times I was able to load the Avro file into the table. From the third time onwards it fails, and the error message is:
Waiting on bqjob_r77fb1a791c9ab204_0000015c88ab3ad8_1 ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'xxx-yz-df:bqjob_r77fb1a791c9ab204_0000015c88ab3ad8_1': Provided Schema does not match Table xxx-yz-df:adityadb.avro_poc3_part_stage$20120611.
I tried many times. How can the schema mismatch for the same file when it is loaded more than twice? The load command I was using is:
bq load --source_format=AVRO 'adityadb.avro_poc3_part_stage$20120611' gs://reair_ddh/apps/hive/warehouse/adityadb1.db/avro_poc3_part_txt/ingestion_time=20120611/000000_0
I don't know why this is happening. Any help would be appreciated. Thank you.

Loading multiple files

The following is working as expected.
./bq --nosync load -F '^' --max_bad_record=30000 myvserv.xa one.txt ip:string,cb:string,country:string,telco_name:string, ...
1) But how do I send two CSV files, one.txt and two.txt, in the same command?
2) Can I not cat the files and pipe (|) the output to the bq command?
3) What does nosync mean?
1) Unfortunately, you can't (yet) upload two files with the same command; you'll have to run bq twice. (If you're loading data from Google Cloud Storage, though, you can specify multiple gs:// URLs separated by commas.)
2) Nope, bq doesn't (yet) support reading upload data from stdin, though that's a great idea for a future version.
3) If you just run "bq load", bq will create a load job on the server and then poll for completion. If you specify the --nosync flag, it will just create the load job and then exit without polling. (If desired, you can poll for completion separately using "bq wait".)
For 1), as Jeremy mentioned, you can't import two local files at once in the same command. However, you can start two parallel loads to the same table -- loads are atomic, and append by default, so this should do what you want and may be faster than importing both in a single job since the uploads will happen in parallel.