"Invalid schema update" error when loading data using autodetect - google-bigquery

Say I have a table with a single field named "version", whose type is STRING. When I try to load data into the table using autodetect, with values like "1.1" or "1", the autodetect feature infers those values as FLOAT or INTEGER respectively.
data1.json example:
{ "version": "1.11.0" }
bq load output:
$ bq load --autodetect --schema_update_option=ALLOW_FIELD_ADDITION --source_format=NEWLINE_DELIMITED_JSON temp_test.temp_table ./data1.json
Upload complete.
Waiting on bqjob_ZZZ ... (1s) Current status: DONE
data2.json example:
{ "version": "1.11" }
bq load output:
$ bq load --autodetect --schema_update_option=ALLOW_FIELD_ADDITION --source_format=NEWLINE_DELIMITED_JSON temp_test.temp_table ./data2.json
Upload complete.
Waiting on bqjob_ZZZ ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'YYY:bqjob_ZZZ': Invalid schema update. Field version has changed type from STRING to FLOAT
data3.json example:
{ "version": "1" }
bq load output:
$ bq load --autodetect --schema_update_option=ALLOW_FIELD_ADDITION --source_format=NEWLINE_DELIMITED_JSON temp_test.temp_table ./data3.json
Upload complete.
Waiting on bqjob_ZZZ ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'YYY:bqjob_ZZZ': Invalid schema update. Field version has changed type from STRING to INTEGER
The problem does not occur when the same file also contains another JSON record whose value can only be interpreted as a string (as seen in the "BigQuery autoconverting fields in data" question):
{ "version": "1.12" }
{ "version": "1.12.0" }
In the question listed above, there's an answer stating that a fix was pushed to production, but it looks like the bug is back again. Is there a way/workaround to prevent this?

The confusing part here is whether "1.12" should be detected as a string or a float; BigQuery chooses float. Before autodetect was introduced, BigQuery already allowed users to load float values written as strings, which is very common in CSV/JSON, so autodetect kept that behavior. Autodetect scans up to 100 rows to infer the type. If all 100 rows look like "1.12", the field is very likely a float. If one of the rows has a value like "1.12.0", BigQuery detects the type as string, as you have observed.
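As a workaround, if the column must stay STRING, skip autodetect and pass an explicit schema with the load job. Below is a minimal sketch using the BigQuery Python client; the project/dataset/table ID is a hypothetical placeholder and data2.json is the file from above:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.temp_test.temp_table"  # hypothetical fully-qualified table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=[bigquery.SchemaField("version", "STRING")],  # explicit type, no autodetect
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

with open("data2.json", "rb") as source_file:
    load_job = client.load_table_from_file(source_file, table_id, job_config=job_config)
load_job.result()  # waits for completion and raises on error

The equivalent on the command line is to pass --schema=version:STRING instead of --autodetect.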

Related

Handling RuntimeException errors in a BigQuery pipeline

When we use a BigQueryIO transform to insert rows, we have an option called:
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
which instructs the pipeline not to attempt to create the table if it doesn't already exist. In my scenario, I want to trap all errors. I attempted to use the following:
var write = mypipline.apply("Write table", BigQueryIO
.<Employee>write()
.to(targetTableName_notpresent)
.withExtendedErrorInfo()
.withFormatFunction(new EmployeeToTableRow())
.withSchema(schema)
.withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
.withTableDescription("My Test Table")
.withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND));
which tries to insert rows into a non-existent table. What I got was a RuntimeException. Where I am stuck is that I don't know how to handle it: I don't believe there is anything here I can surround with a try/catch.
This question is similar to this one:
Is it possible to catch a missing dataset java.lang.RuntimeException in a Google Cloud Dataflow pipeline that writes from Pub/Sub to BigQuery?
but I don't think that question got a working answer, and it was focused on a missing dataset rather than a missing table.
My exception from the fragment above is:
org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.RuntimeException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
POST https://bigquery.googleapis.com/bigquery/v2/projects/XXXX/datasets/jupyter/tables/not_here/insertAll?prettyPrint=false
{
  "code" : 404,
  "errors" : [ {
    "domain" : "global",
    "message" : "Not found: Table XXXX:jupyter.not_here",
    "reason" : "notFound"
  } ],
  "message" : "Not found: Table XXXX:jupyter.not_here",
  "status" : "NOT_FOUND"
}
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:373)
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:341)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:218)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:67)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:323)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:309)
at .(#126:1)
You can't add a try/catch directly around BigQueryIO in the Beam job to handle a missing destination table.
I think it's better to delegate this responsibility outside of Beam, or to launch the job only if your table exists.
Usually a tool like Terraform is responsible for creating the infrastructure before resources are deployed and Beam jobs are run.
If you must check for the table's existence yourself, you can create:
A shell script using the bq and gcloud CLIs that checks for the table before launching the job
A Python script that checks for the table before launching the job
Python script:
For Python there is the BigQuery Python client:
from google.cloud import bigquery
from google.cloud.exceptions import NotFound

client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to determine existence.
# table_id = "your-project.your_dataset.your_table"

try:
    client.get_table(table_id)  # Make an API request.
    print("Table {} already exists.".format(table_id))
except NotFound:
    print("Table {} is not found.".format(table_id))
bq shell script:
bq show <project_id>:<dataset_id>.<table_id>
If the table doesn't exist, catch the error and do not start the Dataflow job.
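To tie the check to the launch, a minimal sketch follows; launch_beam_job() and the table ID are hypothetical placeholders for however you submit the pipeline:

from google.cloud import bigquery
from google.cloud.exceptions import NotFound

def table_exists(table_id: str) -> bool:
    # True only if the destination table is already present.
    try:
        bigquery.Client().get_table(table_id)
        return True
    except NotFound:
        return False

if table_exists("your-project.jupyter.not_here"):
    launch_beam_job()  # hypothetical placeholder: submit the Beam/Dataflow job here
else:
    raise SystemExit("Destination table is missing; not starting the Beam job.")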

Snowflake COPY INTO from JSON - ON_ERROR = CONTINUE - Weird Issue

I am trying to load a JSON file from the staging area (S3) into a stage table using the COPY INTO command.
Table:
create or replace TABLE stage_tableA (
    RAW_JSON VARIANT NOT NULL
);
Copy Command:
copy into stage_tableA from @stgS3/filename_45.gz file_format = (format_name = 'file_json')
I got the error below when executing the above (sample provided):
SQL Error [100069] [22P02]: Error parsing JSON: document is too large, max size 16777216 bytes.
If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.
When I set "ON_ERROR=CONTINUE", records were partially loaded, i.e. up to the record exceeding the max size, but no records after the error record were loaded.
Wasn't "ON_ERROR=CONTINUE" supposed to skip only the oversized record and load the records before and after it?
Yes, the ON_ERROR=CONTINUE skips the offending line and continues to load the rest of the file.
To help us provide more insight, can you answer the following:
How many records are in your file?
How many got loaded?
At what line was the error first encountered?
You can find this information using the COPY_HISTORY() table function (see the sketch at the end of this answer).
Try setting the option strip_outer_array = true for file format and attempt the loading again.
The considerations for loading large size semi-structured data are documented in the below article:
https://docs.snowflake.com/en/user-guide/semistructured-considerations.html
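For example, here is a sketch of querying COPY_HISTORY() with the Snowflake Python connector; the connection parameters are placeholders, and the table name matches the example above:

import snowflake.connector

# Hypothetical connection parameters.
conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="your_password",
    warehouse="your_wh", database="your_db", schema="your_schema",
)

history_sql = """
    select file_name, status, row_count, row_parsed,
           first_error_message, first_error_line_number
    from table(information_schema.copy_history(
        table_name => 'STAGE_TABLEA',
        start_time => dateadd(hours, -24, current_timestamp())))
"""
for row in conn.cursor().execute(history_sql):
    print(row)  # one row per loaded file, including error details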
I partially agree with Chris. The ON_ERROR=CONTINUE option only helps if there are in fact multiple JSON objects in the file. If it's one massive object, then with ON_ERROR=CONTINUE you would simply get neither an error nor a loaded record.
If you know each JSON record is smaller than 16 MB, then definitely try strip_outer_array = true. Also, if your JSON has a lot of null values, use STRIP_NULL_VALUES = TRUE, as this will slim your payload as well. Hope that helps.
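Reusing the conn object from the sketch above, the adjusted COPY might look like the following; note it swaps the named 'file_json' format for inline JSON options, which is an assumption about how that file format is defined:

copy_sql = """
    copy into stage_tableA
    from @stgS3/filename_45.gz
    file_format = (type = 'JSON' strip_outer_array = true strip_null_values = true)
    on_error = 'CONTINUE'
"""
for row in conn.cursor().execute(copy_sql):
    print(row)  # COPY INTO returns one status row per file
conn.close()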

Dataflow insert into BigQuery fails with large number of files for asia-northeast1 location

I am using the Cloud Storage Text to BigQuery template on Cloud Composer.
The template is kicked off from the Python Google API client.
The same program:
works fine in the US location (for both Dataflow and BigQuery);
fails in the asia-northeast1 location;
works fine with fewer (less than 10,000) input files in the asia-northeast1 location.
Does anybody have an idea about this?
I want to execute it in the asia-northeast1 location for business reasons.
More details about the failure:
The program worked until "ReifyRenameInput", and then the Dataflow job failed with the error message below:
java.io.IOException: Unable to insert job: beam_load_textiotobigquerydataflow0releaser0806214711ca282fc3_8fca2422ccd74649b984a625f246295c_2a18c21953c26c4d4da2f8f0850da0d2_00000-0, aborting after 9 .
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.startJob(BigQueryServicesImpl.java:231)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.startJob(BigQueryServicesImpl.java:202)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.startCopyJob(BigQueryServicesImpl.java:196)
at org.apache.beam.sdk.io.gcp.bigquery.WriteRename.copy(WriteRename.java:144)
at org.apache.beam.sdk.io.gcp.bigquery.WriteRename.writeRename(WriteRename.java:107)
at org.apache.beam.sdk.io.gcp.bigquery.WriteRename.processElement(WriteRename.java:80)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException:
404 Not Found { "code" : 404, "errors" : [ { "domain" : "global", "message" : "Not found: Dataset pj:datasetname", "reason" : "notFound" } ], "message" : "Not found: Dataset pj:datasetname" }
(pj and datasetname are not the real names; they are the project name and dataset name for the outputTable parameter)
Although the error message says the dataset is not found, the dataset definitely exists.
Moreover, some new tables, which seem to be temporary tables, were created in the dataset after the program ran.
This is a known issue related to your Beam SDK version according to this public issue tracker. The Beam 2.5.0 SDK version doesn't have this issue.

Big JSON record to BigQuery is not showing up

I wanted to try uploading big JSON records to BigQuery.
I am talking about JSON records of 1.5 MB each, with a complex nested schema up to the 7th degree.
For simplicity, I started by loading a file with a single record on one line.
At first I tried to have BigQuery autodetect my schema, but that resulted in a table that is not responsive and that I cannot query, although it says it has at least one record.
Then, assuming that my schema could be too hard for the loader to reverse-engineer, I wrote the schema myself and tried to load my file with the single record.
At first I got a simple error with just "invalid".
bq load --source_format=NEWLINE_DELIMITED_JSON invq_data.test_table my_single_json_record_file
Upload complete.
Waiting on bqjob_r5a4ce64904bbba9d_0000015e14aba735_1 ... (3s) Current status: DONE
BigQuery error in load operation: Error processing job 'invq-test:bqjob_r5a4ce64904bbba9d_0000015e14aba735_1': JSON table encountered too many errors, giving up. Rows: 1; errors: 1.
Checking the job error details just gave me the following:
"status": {
"errorResult": {
"location": "file-00000000",
"message": "JSON table encountered too many errors, giving up. Rows: 1; errors: 1.",
"reason": "invalid"
},
"errors": [
{
"location": "file-00000000",
"message": "JSON table encountered too many errors, giving up. Rows: 1; errors: 1.",
"reason": "invalid"
}
],
"state": "DONE"
},
Then, after a couple more attempts at creating new tables, it actually started to succeed on the command line, without reporting errors:
bq load --max_bad_records=1 --source_format=NEWLINE_DELIMITED_JSON invq_data.test_table_4 my_single_json_record_file
Upload complete.
Waiting on bqjob_r368f1dff98600a4b_0000015e14b43dd5_1 ... (16s) Current status: DONE
with no error on the status checker...
"statistics": {
"creationTime": "1503585955356",
"endTime": "1503585973623",
"load": {
"badRecords": "0",
"inputFileBytes": "1494390",
"inputFiles": "1",
"outputBytes": "0",
"outputRows": "0"
},
"startTime": "1503585955723"
},
"status": {
"state": "DONE"
},
But no actual records are added to my tables.
I tried to perform the same load from the web UI, but the result is the same: the job completes green, but no actual record is added.
Is there anything else I can do to check where the data is going? Maybe some more logs?
I can imagine that maybe I am on the edge of the 2 MB JSON row size limit but, if so, shouldn't this be reported as an error?
Thanks in advance for the help!!
EDIT:
It turned out the complexity of my schema was the devil here.
My JSON files were valid, but my complex schema had several errors.
I had to simplify the schema anyway, because I got a new batch of data in which single JSON instances were more than 30 MB, and I had to restructure the data in a more relational way, making smaller rows to insert into the database.
Funnily enough, once the schema was split across multiple entities (i.e. simplified), the actual errors and inconsistencies in the schema started to show up in the returned errors and were easier to fix. (Mostly it was new, undocumented nested data which I was not aware of anyway... but still my bad.)
The lesson here is that when a table schema is too long (I didn't experiment with precisely how long is too long), BigQuery just hides behind reporting that there are too many errors to show.
But that is the point where you should consider simplifying the schema (and structure) of your data.

How to get useful BigQuery errors

I have a job that I run with jobs().insert()
Currently I have the job failing with:
2014-11-11 11:19:15,937 - ERROR - bigquery - invalid: Could not convert value to string
Considering I have 500+ columns, I find this error message useless and pretty pathetic. What can I do to get proper, more detailed errors from BigQuery?
The structured error return dictionary contains three elements: a "reason", a "location", and a "message". From the log line you included, it looks like only the message is logged.
Here's an example error return from a CSV import with data that doesn't match the target table schema:
"errors": [
{
"reason": "invalid",
"location": "File: 0 / Line:2 / Field:1",
"message": "Value cannot be converted to expected type."
},
...
Similar errors are returned from JSON imports with data that doesn't match the target table schema.
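If you poll the job afterwards, the full list of structured errors is available on the job resource. Here is a sketch with the google-cloud-bigquery client (the job ID and location are placeholders); the older jobs().get() API exposes the same status.errors field:

from google.cloud import bigquery

client = bigquery.Client()
job = client.get_job("your-job-id", location="US")  # hypothetical job ID and location

if job.error_result:
    print("Job failed:", job.error_result.get("message"))
for err in job.errors or []:
    # Each entry carries "reason", "location" and "message".
    print(err.get("reason"), err.get("location"), err.get("message"))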
I hope this helps!