Unable to load avro file to BigQuery because of schema mismatch

I am new to BigQuery and was trying to load an Avro file into a BigQuery table. The first two times the load succeeded. From the third attempt onwards it starts failing, and the error message is:
Waiting on bqjob_r77fb1a791c9ab204_0000015c88ab3ad8_1 ... (0s) Current
status: DONE BigQuery error in load operation: Error processing job 'xxx-yz-
df:bqjob_r77fb1a791c9ab204_0000015c88ab3ad8_1': Provided Schema does not
match Table xxx-yz-df:adityadb.avro_poc3_part_stage$20120611.
I tried many times. How can the schema mismatch for the same file only after the first two attempts? The load command I was using is:
bq load --source_format=AVRO adityadb.avro_poc3_part_stage$20120611 gs://reair_ddh/apps/hive/warehouse/adityadb1.db/avro_poc3_part_txt/ingestion_time=20120611/000000_0
I don't know why this is happening. Any help would be appreciated. Thank you.
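One way to narrow this down (a debugging sketch, not a confirmed diagnosis; it assumes jq and avro-tools are available locally) is to compare the schema BigQuery already has for the table with the schema embedded in the Avro file, reusing the paths from the question:
# Dump the destination table's current schema
bq show --format=prettyjson adityadb.avro_poc3_part_stage | jq '.schema'
# Copy the Avro file locally and print the schema it carries
gsutil cp gs://reair_ddh/apps/hive/warehouse/adityadb1.db/avro_poc3_part_txt/ingestion_time=20120611/000000_0 ./000000_0.avro
java -jar avro-tools-1.8.1.jar getschema ./000000_0.avro
If the two schemas differ (for example, a field that is REQUIRED in the table but missing or nullable in the file), that would explain the "Provided Schema does not match" error.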

Related

Bigquery Error: 8822097

On trying to load a JSON file into BigQuery, I get the following error: "An internal error occurred and the request could not be completed. Error: 8822097". Is this an error related to hitting the BigQuery daily load limit? It would be great if someone could point me to a glossary of errors.
{Location: ""; Message: "An internal error occurred and the request could not be completed. Error: 8822097"; Reason: "internalError"
Thanks!
Are you trying to load different types of files in a single command?
It may happen when you try to load from a Google Cloud Storage path containing both compressed and uncompressed files:
$ gsutil ls gs://bucket/path/
gs://bucket/path/a.txt
gs://bucket/path/b.txt.gz
$ bq load --autodetect --noreplace --source_format=NEWLINE_DELIMITED_JSON "project-id:dataset_name.table_name" gs://bucket/path/*
Waiting on bqjob_id_1 ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'project-id:bqjob_id_1': An internal error occurred and the request could not be completed. Error: 8822097
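If that is the cause, a possible workaround (a sketch reusing the same placeholder bucket and table) is to load the compressed and the uncompressed files in separate jobs, so that each job sees a consistent input:
$ bq load --autodetect --noreplace --source_format=NEWLINE_DELIMITED_JSON "project-id:dataset_name.table_name" "gs://bucket/path/*.txt"
$ bq load --autodetect --noreplace --source_format=NEWLINE_DELIMITED_JSON "project-id:dataset_name.table_name" "gs://bucket/path/*.txt.gz"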
This error can also occur when you hit BigQuery's limit of 10,000 columns per table.
To verify this, you can check the number of distinct columns in the table in question:
bq --format=json show project:dataset.table | jq . | grep "type" | grep -v "RECORD" | wc -l
Reducing the number of columns would probably be the best and quickest way to work around this issue.
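If jq is available, an alternative count that walks the schema structure directly (a sketch; it counts every non-RECORD field, including nested ones):
bq show --format=prettyjson project:dataset.table | jq '[.schema.fields[] | recurse(.fields[]?) | select(.type != "RECORD")] | length'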
We got the same error "An internal error occurred and the request could not be completed. Error: 8822097" when running a standard SQL query. Running the corresponding legacy SQL query gave us an error message that was actually actionable:
Error while reading table: ABC, error message: The reference schema
differs from the existing data: The required field 'XYZ' is
missing.
Fixing the underlying error, exposed by the legacy SQL query, also fixed the error for the standard SQL query.
In our case we have Avro files, and the table was created from them. Newer Avro files didn't contain a certain field, but the table still contained that field. Rebuilding the table from the new Avro files solved the issue. We also have views on top of the table, which may or may not change the resulting error message.
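For reference, a minimal sketch of rebuilding a table from the newer Avro files (table and bucket names are placeholders; --replace overwrites the existing table, and the schema is taken from the Avro files being loaded):
bq load --replace --source_format=AVRO mydataset.mytable "gs://mybucket/new-avro/*.avro"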

Loading Avro Data into BigQuery via command-line?

I have created an Avro Hive table and loaded data into it from another table using a Hive INSERT OVERWRITE command. I can see the data in the Avro Hive table, but when I try to load it into a BigQuery table, it gives an error.
Table schema:
CREATE TABLE `adityadb1.gold_hcth_prfl_datatype_acceptence`(
`prfl_id` bigint,
`crd_dtl` array< struct < cust_crd_id:bigint,crd_nbr:string,crd_typ_cde:string,crd_typ_cde_desc:string,crdhldr_nm:string,crd_exprn_dte:string,acct_nbr:string,cre_sys_cde:string,cre_sys_cde_desc:string,last_upd_sys_cde:string,last_upd_sys_cde_desc:string,cre_tmst:string,last_upd_tmst:string,str_nbr:int,lng_crd_nbr:string>>)
STORED AS AVRO;
Error that I am getting:
Error encountered during job execution:
Error while reading data, error message: The Apache Avro library failed to read data with the follwing error: Cannot resolve:
I am using the following command to load the data into BigQuery:
bq load --source_format=AVRO dataset.tableName avro-filePath
Make sure that there is data available at the GCS path you are pointing to and that the data contains the schema (it should if you created it from Hive). Here is an example of how to load the data:
bq --location=US load --source_format=AVRO --noreplace my_dataset.my_avro_table gs://myfolder/mytablefolder/part-m-00001.avro
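To check the first point, listing the prefix is usually enough (a sketch with the same placeholder path):
gsutil ls -l gs://myfolder/mytablefolder/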

Error while loading AVRO files to BigQuery

I have successfully loaded a large number of Avro files (of the same schema type, into the same table), stored on Google Cloud Storage, using the bq CLI utility.
However, for some of the Avro files I am getting a very cryptic error while loading into BigQuery. The error says:
The Apache Avro library failed to read data with the follwing error: EOF
reached (error code: invalid)
Using avro-tools, I validated that the Avro file is not corrupted; the repair report output:
java -jar avro-tools-1.8.1.jar repair -o report 2017-05-15-07-15-01_48a99.avro
Recovering file: 2017-05-15-07-15-01_48a99.avro
File Summary:
Number of blocks: 51
Number of corrupt blocks: 0
Number of records: 58598
Number of corrupt records: 0
I tried creating a brand new table with one of the failing files in case it was due to a schema mismatch, but that didn't help, as the error was exactly the same.
I need help figuring out what could be causing the error here.
No way to pinpoint the issue without more information, but I ran into this error message and filed a ticket here.
In my case, a number of files in a single load job were missing columns, which was causing the error.
Explanation from the ticket:
BigQuery uses the alphabetically last file from the directory as the avro schema to read the other Avro files. I suspect the issue is with schema incompatibility between the last file and the "problematic" file. Do you know if all the files have the exact same schema or differ? One thing you could try to help verify this is to copy the alphabetically last file of the directory and the "problematic" file to a different folder and try to load those two files in one BigQuery load job and see if the error reproduces.
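A sketch of that two-file repro, with hypothetical bucket and file names (the problematic file name is taken from the question):
# Copy the alphabetically last file and the problematic file into a scratch prefix
gsutil cp gs://mybucket/avro/zz-last-file.avro gs://mybucket/avro-debug/
gsutil cp gs://mybucket/avro/2017-05-15-07-15-01_48a99.avro gs://mybucket/avro-debug/
# Load just those two files into a throwaway table and see if the EOF error reproduces
bq load --source_format=AVRO mydataset.avro_debug "gs://mybucket/avro-debug/*.avro"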

loading avro files with different schemas into one bigquery table

I have a set of avro files with slightly varying schemas which I'd like to load into one bq table.
Is there a way to do that with one line? Every automatic way to handle schema difference would be fine for me.
Here is what I tried so far.
0) If I try to do it in the straightforward way, bq fails with an error:
bq load --source_format=AVRO myproject:mydataset.logs gs://mybucket/logs/*
Waiting on bqjob_r4e484dc546c68744_0000015bcaa30f59_1 ... (4s) Current status: DONE
BigQuery error in load operation: Error processing job 'iow-rnd:bqjob_r4e484dc546c68744_0000015bcaa30f59_1': The Apache Avro library failed to read data with the follwing error: EOF reached
1) Quick googling shows that there is a --schema_update_option=ALLOW_FIELD_ADDITION option which, when added to the bq load job, changes nothing. ALLOW_FIELD_RELAXATION does not change anything either.
2) Actually, schema id is mentioned in the file name, so files look like:
gs://mybucket/logs/*_schemaA_*
gs://mybucket/logs/*_schemaB_*
Unfortunately, bq load does not allow more than one asterisk (as the bq manual also states):
bq load --source_format=AVRO myproject:mydataset.logs gs://mybucket/logs/*_schemaA_*
BigQuery error in load operation: Error processing job 'iow-rnd:bqjob_r5e14bb6f3c7b6ec3_0000015bcaa641f3_1': Not found: Uris gs://otishutin-eu/imp/2016-06-27/*_schemaA_*
3) When I try to list the files explicitly, the list happens to be too long, so bq load does not work either:
bq load --source_format=AVRO myproject:mydataset.logs $(gsutil ls gs://mybucket/logs/*_schemaA_* | xargs | tr ' ' ',')
Too many positional args, still have ['gs://mybucket/logs/log_schemaA_2658.avro,gs://mybucket/logs/log_schemaA_2659.avro,gs://mybucket/logs/log_schemaA_2660.avro,...
4) When I try to use the files as an external table and list them explicitly in the external table definition, I also get a "too many files" error:
BigQuery error in query operation: Table definition may not have more than 500 source_uris
I understand that I could first copy the files to different folders and then process them folder by folder, and this is what I'm doing now as a last resort, but this is only a small part of the data processing pipeline, and copying is not acceptable as a production solution.
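For what it is worth, one way to avoid the copy step is to batch the per-schema file list into chunks and submit one load job per chunk (a sketch; the chunk size and paths are arbitrary, and it assumes appending to the existing table with field additions allowed):
gsutil ls 'gs://mybucket/logs/*_schemaA_*' | xargs -n 500 | tr ' ' ',' |
while read -r uris; do
  bq load --source_format=AVRO --noreplace \
    --schema_update_option=ALLOW_FIELD_ADDITION \
    myproject:mydataset.logs "$uris"
done
Each chunk contains files of a single schema, and keeping each comma-joined list on its own line avoids the "Too many positional args" error from attempt 3, which can occur when xargs has to split a very long list across several lines.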

BigQuery "Backend Error, Job aborted" when exporting data

The export job for one of my tables fails in BigQuery with no error message. I checked the job ID hoping to get more info, but it just says "Backend Error, Job aborted". I used the command-line tool with this command:
bq extract --project_id=my-proj-id --destination_format=NEWLINE_DELIMITED_JSON 'test.table_1' gs://mybucket/export
I checked this question, but I know that it is not a problem with my destination bucket in GCS, because exporting other tables to the same bucket succeeds.
The only difference here is that this table has a repeated record field and each JSON row can get pretty large, but I did not find any limit for this in the BigQuery docs.
Any ideas on what the problem could be?
Job Id from one of my tries: bqjob_r51435e780aefb826_0000015691dda235_1
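Not a definitive fix, but one thing that may be worth ruling out given the large rows: BigQuery requires exports larger than 1 GB to be split across multiple files, so try a wildcard destination URI (optionally with compression). A sketch reusing the command from the question:
bq extract --project_id=my-proj-id --compression=GZIP --destination_format=NEWLINE_DELIMITED_JSON 'test.table_1' 'gs://mybucket/export/part-*.json.gz'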