Bigquery: invalid: Illegal Schema update - google-bigquery

I tried to append data from a query to a BigQuery table.
Job ID job_i9DOuqwZw4ZR2d509kOMaEUVm1Y
Error: Job failed while writing to Bigquery. invalid: Illegal Schema update. Cannot add fields (field: debug_data) at null
I copied and pasted the query executed in the above job, ran it in the web console, and chose the same destination table to append to; it works.

The job you listed is trying to append query results to a table. That query has a field named 'debug_data'. The table you're appending to does not have that field. This behavior is by design, in order to prevent people from accidentally modifying the schema of their tables.
You can run a tables.update() or tables.patch() operation to modify the table schema to add this column (see an example using bq here: Bigquery add columns to table schema), and then you'll be able to run this query successfully.
Alternatively, you could use truncate instead of append as the write disposition in your query job; this would overwrite the table and, in doing so, allow schema changes.
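If you prefer to script the schema change, here is a minimal sketch using the google-cloud-bigquery Python client; the table ID and the STRING type for debug_data are assumptions, since neither is shown in the question:

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my_project.my_dataset.my_table")  # hypothetical table ID

# Append the missing column to the existing schema and patch the table.
new_schema = list(table.schema)
new_schema.append(bigquery.SchemaField("debug_data", "STRING", mode="NULLABLE"))
table.schema = new_schema
client.update_table(table, ["schema"])  # sends a tables.patch request

After this, the failing append job should run without the "Illegal Schema update" error.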

See this post for how to have BigQuery automatically add new fields to a schema while doing an append.
The code in Python is:
job_config.schema_update_options = ['ALLOW_FIELD_ADDITION']
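For context, here is a fuller sketch of how that option is typically wired into a query job with the google-cloud-bigquery client; the project, dataset, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string("my_project.my_dataset.my_table"),
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

# Any new nullable fields produced by the query are added to the destination schema.
client.query("SELECT * FROM `my_project.my_dataset.staging_table`", job_config=job_config).result()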

Related

Cannot add a column after deleting another column in BigQuery

I cannot imagine there is such an issue in BigQuery:
Let's say I drop a column using the command below in the BQ console for the User table:
Alter table User drop column name -> successful
I am aware this column is preserved for 7 days (for time travel duration purposes).
But I cannot add any column anymore by running the command below in the BQ console:
ALTER TABLE User add column first_name STRING
It gives the error below, even though the two columns have totally different names:
Column name was recently deleted in the table User. Deleted column name is reserved for up to the time travel duration, use a different column name instead.
The above error is the same as the one I get when I try to drop the same column again, even with IF EXISTS:
Alter table User drop IF EXISTS column name
My question:
Why does this issue happen? After 7 days, can I add new columns as usual?
I have recreated your issue wherein I dropped a column named employee_like_2 and then tried to add a new column named new_column.
There is already a bug filed for this issue. You may click on +1 to bring more attention to the issue and STAR it so that you are notified of updates.
In the meantime, a possible workaround is to manually add columns through the BigQuery UI.
Apart from the UI solution suggested by Scott B, we can also do it using the bq command:
Basically bq query --use_legacy_sql=false 'ALTER TABLE User add column first_name STRING' will fail to add a column. But I found a workaround:
I can run the bq update command instead, like below:
bq show --schema --format=prettyjson DATASET.User > user_schema.json
Add the new column I want into the file user_schema.json
bq update DATASET.User user_schema.json
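For reference, the edited user_schema.json might look something like this (the id and email columns are hypothetical; the last entry is the newly added column):

[
  {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "email", "type": "STRING", "mode": "NULLABLE"},
  {"name": "first_name", "type": "STRING", "mode": "NULLABLE"}
]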
So this basically confirms it is a bug in the BigQuery SQL command.

How to drop columns from a partitioned table in BigQuery

We cannot use a CREATE OR REPLACE TABLE statement for partitioned tables in BigQuery. I can export the table to GCS, but BigQuery then generates multiple JSON files that cannot be imported into a table at once. Is there a safe way to drop a column from a partitioned table? I use BigQuery's web interface.
Renaming a column is not supported by the Cloud Console, the classic BigQuery web UI, the bq command-line tool, or the API. If you attempt to update a table schema using a renamed column, the following error is returned: BigQuery error in update operation: Provided Schema does not match Table project_id:dataset.table.
There are two ways to manually rename a column:
Using a SQL query: choose this option if you are more concerned about simplicity and ease of use, and you are less concerned about costs.
Recreating the table: choose this option if you are more concerned about costs, and you are less concerned about simplicity and ease of use.
If you want to drop a column you can either:
Use a SELECT * EXCEPT query that excludes the column (or columns) you want to remove, and use the query result to overwrite the table or to create a new destination table.
You can also remove a column by exporting your table data to Cloud Storage, deleting the data corresponding to the column (or columns) you want to remove, and then loading the data into a new table with a schema definition that does not include the removed column(s). You can also use the load job to overwrite the existing table.
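For the first approach, here is a minimal sketch using the google-cloud-bigquery Python client; the table and column names are placeholders, and the result is written to a new destination table so the original stays untouched:

from google.cloud import bigquery

client = bigquery.Client()

source = "my_project.my_dataset.partitioned_table"          # hypothetical source table
destination = "my_project.my_dataset.table_without_column"  # hypothetical new table

job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string(destination),
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# column_to_drop is excluded from the result, so the new table's schema omits it.
client.query(f"SELECT * EXCEPT (column_to_drop) FROM `{source}`", job_config=job_config).result()

Note that the new table produced this way is not partitioned; if it should be, time_partitioning can also be set on the job config.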
There is a guide published for Manually Changing Table Schemas.
Edit:
In order to change a partitioned table to a non-partitioned table, you can use the Console to query your data and overwrite your current table or copy to a new one. As an example, I have a table in BigQuery partitioned by _PARTITIONTIME. I used the following query to create a non-partitioned table:
SELECT *, _PARTITIONTIME as pt FROM `project.dataset.table`
With the above code, you will query the data across all the table's partitions and create an extra column showing which partition each row came from. Then, before executing it, there are two options: save the result in a new non-partitioned table, or overwrite the current table.
Creating a new table: go to More (under the query editor) > Query Settings > check the box "Set a destination table for query results" > choose your project and dataset and write your new table's name > under Destination table write preference, check "Write if empty".
Overwriting the current table: go to More (under the query editor) > Query Settings > check the box "Set a destination table for query results" > choose the same project and dataset as your current table > write the same table's name as the one you want to overwrite > under Destination table write preference, check "Overwrite table".

BigQuery: Is it possible to modify a table schema by adding a field within a record

The BigQuery manual states that it is only possible to add a new field, not to modify an existing one. My question is whether it is possible to add a new field to an existing RECORD field.
Say the original schema is:
{"type":"RECORD","name":"record","mode":"REPEATED"
"fields":[
{"type":"STRING","name":"f1","mode":"NULLABLE"}
]
}
And I would like to add f2 so the schema would be:
{"type":"RECORD","name":"record","mode":"REPEATED"
"fields":[
{"type":"STRING","name":"f1","mode":"NULLABLE"},
{"type":"STRING","name":"f2","mode":"NULLABLE"}
]
}
Is it possible?
Adding a new field to a STRUCT is not supported in the console, but you can add it using the BigQuery CLI, as you can see here.
In the Console mode:
Adding a new nested field to an existing RECORD column is not currently
supported by the classic BigQuery web UI.
Using the BigQuery CLI:
In this option, you create a new schema file and use bq update project_id:dataset.table schema_file to update the table.
As you can find in the link:
First, issue the bq show command with the --schema flag and write the existing table schema to a file. If the table you're updating is
in a project other than your default project, add the project ID to
the dataset name in the following format: project_id:dataset.table.
[...]
bq show \
--schema \
--format=prettyjson \
project_id:dataset.table > schema_file
Open the schema file in a text editor. The schema should look like the following. In this example, column3 is a nested repeated column.
The nested columns are nested1 and nested2. The fields array lists the
fields nested within column3. [...]
Add the new nested column to the end of the fields array. In this example, nested3 is the new nested column. [...]
After updating your schema file, issue the following command to update the table's schema. If the table you're updating is in a
project other than your default project, add the project ID to the
dataset name in the following format: project_id:dataset. [...]
bq update mydataset.mytable /tmp/myschema.json
Hope it helps
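If the CLI is not an option, the same nested addition can also be scripted. Here is a rough sketch with the google-cloud-bigquery Python client, assuming the schema from the question; the table ID is a placeholder:

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my_project.my_dataset.my_table")  # hypothetical table ID

new_schema = []
for field in table.schema:
    if field.name == "record":
        # Rebuild the RECORD field with the extra nested column f2 appended.
        nested = list(field.fields) + [bigquery.SchemaField("f2", "STRING", mode="NULLABLE")]
        field = bigquery.SchemaField("record", "RECORD", mode="REPEATED", fields=nested)
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])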
A workaround when using the console would be to query the original table, unnesting the RECORD type and putting it back into a STRUCT, adding the column you want with a placeholder value that matches the type you want.
SELECT STRUCT(a.foo as foo, a.bar as bar, 'hello' as baz) as words, time, id FROM dataset.table, UNNEST(words) as a;
This query result could be saved as a table, then you can go in and do update queries to change the 'hello' to actual text you want stored.
The only way is to cast the column in the SELECT query using CAST(), save the result as a BigQuery table, delete the old table, and then save the query result again under the original table name.
SELECT CAST(section_number AS INT64) AS section_number, section_name FROM Table
As you can see, the cast here parses the number from STRING to INT64, and the resulting column type changes accordingly.

List Impala tables that need invalidate/refresh

How can I programmatically find all Impala tables that need an INVALIDATE METADATA statement (because they were created in Hive, but are not yet known to Impala) or a REFRESH (because a column was added, a datafile was added, etc.)?
Invalidate Metadata:
As a workaround, create a shell script to do the below steps.
Using beeline, connect to a particular database, run a SHOW TABLES statement, and save the output to a file.
Using impala-shell, connect to the same database, run a SHOW TABLES statement, and save the output to another file.
Now compare both files to remove the duplicates; the unique tables remaining from the first file are the tables that exist only in Hive but not in Impala.
Note:
a. Instead of handling one particular database at a time in steps 1 and 2, you can loop over all databases and save the output to a file. Inside the loop, you can redirect and append the output to a final output file in a format like database.table or database_table, to get all tables from all databases into a single file. Finally, follow step 3.
b. The unique tables from the second output file, after removing duplicates, are tables that were deleted in Hive; INVALIDATE METADATA needs to be run in Impala to remove them from Impala's list.
c. Renaming a table in Impala is recognized by Hive, but not vice versa; INVALIDATE METADATA should be run for both the old and new table names to remove and add them respectively in Impala. This applies to most operations, not just renaming a table.
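A rough sketch of such a script in Python is below; the host names, ports, and database are placeholders, and the beeline and impala-shell flags may need adjusting for your environment:

import subprocess

HIVE_URL = "jdbc:hive2://hiveserver2-host:10000/mydb"  # hypothetical HiveServer2 URL
IMPALAD = "impalad-host:21000"                         # hypothetical impalad host
DATABASE = "mydb"

def run(cmd):
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return {line.strip() for line in out.splitlines() if line.strip()}

# Steps 1 and 2: tables as Hive sees them, and as Impala sees them.
hive_tables = run(["beeline", "-u", HIVE_URL, "--outputformat=tsv2",
                   "--showHeader=false", "-e", "SHOW TABLES"])
impala_tables = run(["impala-shell", "-i", IMPALAD, "-d", DATABASE,
                     "-B", "--quiet", "-q", "SHOW TABLES"])

# Step 3: the differences between the two lists.
print("Only in Hive (need INVALIDATE METADATA):", sorted(hive_tables - impala_tables))
print("Only in Impala (dropped in Hive):", sorted(impala_tables - hive_tables))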
Refresh:
Consider a text-format table with 2 columns and 1 row of data.
Now suppose a third column is added to that table in beeline.
select * from table; ---gives 3 columns in beeline and 2 columns in impala since refresh is not run on impala for this table.
If we run COMPUTE STATS in Impala before running REFRESH in this case, then the newly added column from beeline will be removed from the table schema in Hive as well.
select * from table; ---gives 2 columns in beeline and 2 columns in impala, since COMPUTE STATS from Impala deleted the extra column metadata of the table although the data resides in HDFS for that column. This might cause parsing issues in Impala if the column is added somewhere in the middle or front instead of at the end.
So it is advised to run REFRESH <table name> in Impala right after adding a new column or making any other modifications in beeline to an existing table, so as not to lose the table schema as explained in the above scenario.
refresh table; ---Right after modification in hive run refresh in impala.
select * from table; ---gives 3 columns in beeline and 3 columns in impala since refresh is run before compute stats in impala.

Appending data to a table created from an Avro file in BigQuery

Every morning, an automatic job creates a new table from an Avro file. In the afternoon, I would need to append some data to this table from a Query.
When trying to do so, I get the following error:
Error: Invalid schema update. Field chn has changed mode from REQUIRED to NULLABLE
I noticed that I can change the property of the field chn from REQUIRED to NULLABLE in the BigQuery Web UI and then it works fine, but I would have to do it manually everyday which is not what I am looking for.
Is there a way to "cast" the field as REQUIRED during the append query?
Or, during the first import from the Avro file, force the field to be NULLABLE and not REQUIRED?
Thanks!
The feature that allows relaxing a field as part of a query or a load job will be available in production shortly. I will update this answer when it goes live (likely within a week).
Update: 08/25/2016
You can supply schemaUpdateOptions in load or query job configuration.
Multiple options can be provided.
It allows the schema of the destination table to be updated as a side effect of the load or query job. Schema update options are supported in two cases:
When writeDisposition is WRITE_APPEND
When writeDisposition is WRITE_TRUNCATE and the destination table is a partition of a table, specified by partition decorators
For non-partitioned tables, WRITE_TRUNCATE will always overwrite the schema.
The following values are supported:
ALLOW_FIELD_ADDITION: allow adding a nullable field to the schema
ALLOW_FIELD_RELAXATION: allow relaxing a required field in the original schema to nullable
NOTE: This doesn't currently work with schema auto-detection. We plan to support that soon.
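As an illustration, with the google-cloud-bigquery Python client the afternoon append query could pass ALLOW_FIELD_RELAXATION so the REQUIRED chn field is relaxed instead of the job failing; the table name and source query are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string("my_project.my_dataset.avro_table"),
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION],
)

# chn is REQUIRED in the morning table; relaxing it to NULLABLE happens as a
# side effect of this append instead of raising an invalid schema update error.
client.query("SELECT * FROM `my_project.my_dataset.afternoon_source`", job_config=job_config).result()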