Cannot add a column after deleting another column in BigQuery - google-bigquery

I cannot imagine there being such an issue in BigQuery:
Let's say I drop a column using the command below in the BQ console for the User table:
ALTER TABLE User DROP COLUMN name -> successful
I am aware this column is preserved for 7 days (for the time travel duration).
But now I cannot add any column at all by running the command below in the BQ console:
ALTER TABLE User ADD COLUMN first_name STRING
It gives an error like the one below, even though the two columns have completely different names:
Column name was recently deleted in the table User. Deleted column name is reserved for up to the time travel duration, use a different column name instead.
The same error appears when I try to drop the same column again, even with IF EXISTS:
ALTER TABLE User DROP COLUMN IF EXISTS name
My question:
Why does this happen? After 7 days, can I add new columns as usual?

I have recreated your issue: I dropped a column named employee_like_2 and then tried to add a new column named new_column, and got the same error.
There is already a bug filed for this issue. You may click +1 to bring more attention to it and STAR the issue so that you are notified about updates.
In the meantime, a possible workaround is to add columns manually through the BigQuery UI.

Apart from the UI solution suggested by Scott B, we can also do it using the bq command-line tool:
bq query --use_legacy_sql=false 'ALTER TABLE User ADD COLUMN first_name STRING' fails to add a column, but there is a workaround.
I can run the bq update command instead, like below:
bq show --schema --format=prettyjson DATASET.User > user_schema.json
Add the new column I want to the file user_schema.json
bq update DATASET.User user_schema.json
So this basically means it is a bug in the BigQuery ALTER TABLE command itself, since the same change works through the schema-update path (bq update).
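If you prefer to script this outside the bq CLI, here is a minimal sketch of the same schema-append workaround using the google-cloud-bigquery Python client; the project and dataset IDs are placeholders based on the example above:

from google.cloud import bigquery

# Minimal sketch of the bq update workaround: fetch the current schema,
# append the new field, and patch the table. The table ID is a placeholder.
client = bigquery.Client()
table = client.get_table("my-project.DATASET.User")

new_schema = list(table.schema)
new_schema.append(bigquery.SchemaField("first_name", "STRING", mode="NULLABLE"))

table.schema = new_schema
client.update_table(table, ["schema"])  # same effect as bq update with an edited schema file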

Related

How to drop columns from a partitioned table in BigQuery

We cannot use a CREATE OR REPLACE TABLE statement for partitioned tables in BigQuery. I can export the table to GCS, but BigQuery then generates multiple JSON files that cannot be imported into a table at once. Is there a safe way to drop a column from a partitioned table? I use BigQuery's web interface.
Renaming a column is not supported by the Cloud Console, the classic BigQuery web UI, the bq command-line tool, or the API. If you attempt to update a table schema using a renamed column, the following error is returned: BigQuery error in update operation: Provided Schema does not match Table project_id:dataset.table.
There are two ways to manually rename a column:
Using a SQL query: choose this option if you are more concerned about simplicity and ease of use, and you are less concerned about costs.
Recreating the table: choose this option if you are more concerned about costs, and you are less concerned about simplicity and ease of use.
If you want to drop a column you can either:
Use a SELECT * EXCEPT query that excludes the column (or columns) you want to remove and use the query result to overwrite the table or to create a new destination table
Export your table data to Cloud Storage, delete the data corresponding to the column (or columns) you want to remove, and then load the data into a new table with a schema definition that does not include the removed column(s). You can also use the load job to overwrite the existing table.
There is a guide published for Manually Changing Table Schemas.
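For the first option (SELECT * EXCEPT), a minimal sketch with the google-cloud-bigquery Python client could look like the following; the table IDs and the column name obsolete_col are placeholders, and writing to a brand-new destination table leaves the original untouched until you are happy with the result:

from google.cloud import bigquery

# Minimal sketch: write the table minus the unwanted column to a destination table.
# Table IDs and the column name are placeholders.
client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.my_table_without_col",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

sql = """
SELECT * EXCEPT (obsolete_col)
FROM `my-project.my_dataset.my_table`
"""

client.query(sql, job_config=job_config).result()  # wait for the job to finish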
edit
In order to change a partitioned table into a non-partitioned table, you can use the Console to query your data and either overwrite your current table or copy the result to a new one. As an example, I have a table in BigQuery partitioned by _PARTITIONTIME. I used the following query to create a non-partitioned table:
SELECT *, _PARTITIONTIME as pt FROM `project.dataset.table`
With the above query, you read the data from all of the table's partitions and add an extra column showing which partition each row came from. Then, before executing it, there are two options: save the result in a new non-partitioned table, or overwrite the current table:
Creating a new table: go to More (under the query editor) > Query Settings > check the box "Set a destination table for query results" > choose your project and dataset and enter your new table's name > under "Destination table write preference" select "Write if empty".
Overwriting the current table: More (under the query editor) > Query Settings > check the box "Set a destination table for query results" > choose the same project and dataset as your current table > enter the same table name as the one you want to overwrite > under "Destination table write preference" select "Overwrite table".
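The same conversion can also be scripted rather than clicked through. This mirrors the earlier sketch but writes the _PARTITIONTIME query result to a new, non-partitioned destination table; the table IDs are placeholders, and WRITE_EMPTY corresponds to the "Write if empty" preference:

from google.cloud import bigquery

# Minimal sketch: materialize every partition of a time-partitioned table
# into a new non-partitioned table. Table IDs are placeholders.
client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.my_table_nonpartitioned",
    write_disposition=bigquery.WriteDisposition.WRITE_EMPTY,  # "Write if empty"
)

sql = "SELECT *, _PARTITIONTIME AS pt FROM `my-project.my_dataset.my_partitioned_table`"

client.query(sql, job_config=job_config).result()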
credit

BigQuery: Is it possible to modify a table schema by adding a field within a record

The BigQuery manual states that it is only possible to add a new field, not modify an existing one. My question is whether it is possible to add a field to an existing RECORD field.
Say the original schema is:
{"type":"RECORD","name":"record","mode":"REPEATED"
"fields":[
{"type":"STRING","name":"f1","mode":"NULLABLE"}
]
}
And I would like to add f2 so the schema would be:
{"type":"RECORD","name":"record","mode":"REPEATED"
"fields":[
{"type":"STRING","name":"f1","mode":"NULLABLE"},
{"type":"STRING","name":"f2","mode":"NULLABLE"}
]
}
Is it possible?
Adding a new field to a STRUCT is not supported in the console, but you can add it using the BigQuery CLI, as you can see here:
In the Console mode:
Adding a new nested field to an exising RECORD column is not currently
supported by the classic BigQuery web UI.
Using the BigQuery CLI:
In this option, you create a new schema file and use bq update project_id:dataset.table schema_file to update the table.
As you can find in the link:
First, issue the bq show command with the --schema flag and write the existing table schema to a file. If the table you're updating is
in a project other than your default project, add the project ID to
the dataset name in the following format: project_id:dataset.table.
[...]
bq show \
--schema \
--format=prettyjson \
project_id:dataset.table > schema_file
Open the schema file in a text editor. The schema should look like the following. In this example, column3 is a nested repeated column.
The nested columns are nested1 and nested2. The fields array lists the
fields nested within column3. [...]
Add the new nested column to the end of the fields array. In this example, nested3 is the new nested column. [...]
After updating your schema file, issue the following command to update the table's schema. If the table you're updating is in a
project other than your default project, add the project ID to the
dataset name in the following format: project_id:dataset. [...]
bq update mydataset.mytable /tmp/myschema.json
Hope it helps
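If you would rather not edit a JSON schema file by hand, here is a minimal sketch of the same idea using the google-cloud-bigquery Python client; the table ID is a placeholder and the field names follow the example schema in the question:

from google.cloud import bigquery

# Minimal sketch: append a nested field f2 to the existing REPEATED RECORD
# column "record". The table ID is a placeholder.
client = bigquery.Client()
table = client.get_table("my-project.my_dataset.my_table")

new_schema = []
for field in table.schema:
    if field.name == "record":
        # Rebuild the RECORD field with the extra nested field appended.
        nested = list(field.fields) + [bigquery.SchemaField("f2", "STRING", mode="NULLABLE")]
        field = bigquery.SchemaField("record", "RECORD", mode="REPEATED", fields=nested)
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])  # equivalent to bq update with the edited schema file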
A workaround when using the console is to query the original table, unnest the RECORD type, put it back into a STRUCT, and add the column you want with a placeholder value that matches the type you want.
SELECT STRUCT(a.foo AS foo, a.bar AS bar, 'hello' AS baz) AS words, time, id FROM dataset.table, UNNEST(words) AS a;
The query result can be saved as a table, and then you can run UPDATE queries to change the 'hello' placeholder to the actual text you want stored.
The only way is to CAST the column in a SELECT query, save the result as a BigQuery table, delete the old table, and then save the query result again under the original table name:
SELECT CAST(section_number AS INT64) AS section_number, section_name FROM Table
Here CAST parses the number from STRING to INT64, and the column type in the resulting table changes accordingly.

List Impala tables that need invalidate/refresh

How can I programmatically find all Impala tables that need an INVALIDATE METADATA statement (because they were created in Hive but are not yet known to Impala) or a REFRESH (because a column was added, a data file was added, etc.)?
Invalidate Metadata:
As a workaround, create a shell script to do the below steps.
Using beeline, connect to a particular database, run a SHOW TABLES statement, and save the output to a file.
Using impala-shell, connect to the same database, run a SHOW TABLES statement, and save the output to another file.
Now compare both files: the tables that appear only in the first file exist in Hive but are not yet known to Impala, so they need INVALIDATE METADATA (see the sketch after the notes below).
Note:
a. Instead of handling one database at a time in steps 1 and 2, you can loop over all databases and save each output to a file. Inside the loop, redirect and append the output to a single final file, using a format like database.table or database_table, so that all tables from all databases end up in one file. Then follow step 3.
b. The tables that appear only in the second output file are tables that have been dropped in Hive; INVALIDATE METADATA needs to be run in Impala to remove them from Impala's list.
c. A table renamed in Impala is recognized by Hive, but not vice versa, and INVALIDATE METADATA should be run for both the old and the new table names to remove and add them respectively in Impala. This applies to most DDL operations, not just table renames.
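As a rough illustration of step 3, here is a minimal Python sketch assuming the two SHOW TABLES outputs have already been saved, one table name per line, to hive_tables.txt and impala_tables.txt (the file names are placeholders):

# Minimal sketch of the comparison step. Assumes SHOW TABLES output from
# beeline and impala-shell has been saved one name per line to these files
# (file names are placeholders).
def load_tables(path):
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

hive_tables = load_tables("hive_tables.txt")
impala_tables = load_tables("impala_tables.txt")

# Tables known to Hive but not to Impala -> need INVALIDATE METADATA to appear.
missing_in_impala = sorted(hive_tables - impala_tables)

# Tables known only to Impala -> dropped/renamed in Hive, also need INVALIDATE METADATA.
stale_in_impala = sorted(impala_tables - hive_tables)

for t in missing_in_impala + stale_in_impala:
    print(f"INVALIDATE METADATA {t};")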
Refresh:
Consider a text-format table with 2 columns and 1 row of data.
Now suppose a third column is added to that table in beeline.
select * from table; ---gives 3 columns in beeline and 2 columns in impala since refresh is not run on impala for this table.
If we run COMPUTE STATS in Impala before running REFRESH in this case, then the newly added column from beeline will be removed from the table schema in Hive as well.
select * from table; ---gives 2 columns in beeline and 2 columns in impala, since COMPUTE STATS from Impala deleted the extra column's metadata, although the data for that column still resides in HDFS. This might cause parsing issues in Impala if the column is added somewhere in the middle or at the front instead of at the end.
So it is advised to run REFRESH table_name in Impala right after adding a new column or making any other modification in beeline for an existing table, so that the table schema is not lost as explained in the scenario above.
refresh table; ---Right after modification in hive run refresh in impala.
select * from table; ---gives 3 columns in beeline and 3 columns in impala since refresh is run before compute stats in impala.

Crate db cannot query data in a shard

I have an instance of Crate 1.0.2 and I dropped a table from it. Then I re-created a table with the same name and a slightly modified schema, and imported data using the COPY FROM command. The file passed to COPY FROM contains 10,000 records, and the command runs fine. When I check the table tab in the Crate web console, it shows many partitions, each with a few records. If I add up the number-of-records column on this tab, the total comes close to 10k, but when I run "select count(*) from mytable", it returns only around 8,000 records. On further investigation I found that there are certain partitions whose data cannot be queried at all. Has anyone seen this problem? Does it have anything to do with dropping and re-creating a table with the same name? I also observed that when a table is dropped, not all files related to that table are deleted from path.data. Are these leftover directories the reason those partitions become non-queryable? While importing, I saw a "Document already exists" exception, although I know my data does not have any duplicate values for the primary key column.
Some questions to clarify the issue:
Have you run refresh table mytable after your copy command has finished?
Are you sure that with the new schema of the table, there are no duplicate records?
Since 1.x versions are not supported anymore, could you try with CrateDB 2.1.6 which is the current stable version to see if the problem persists?

Bigquery: invalid: Illegal Schema update

I tried to append data from a query to a bigquery table.
Job ID job_i9DOuqwZw4ZR2d509kOMaEUVm1Y
Error: Job failed while writing to Bigquery. invalid: Illegal Schema update. Cannot add fields (field: debug_data) at null
I copied and pasted the query executed in the above job, ran it in the web console, and chose the same destination table to append to, and it worked.
The job you listed is trying to append query results to a table. That query has a field named 'debug_data'. The table you're appending to does not have that field. This behavior is by design, in order to prevent people from accidentally modifying the schema of their tables.
You can run a tables.update() or tables.patch() operation to modify the table schema to add this column (see an example using bq here: Bigquery add columns to table schema), and then you'll be able to run this query successfully.
Alternatively, you could use truncate instead of append as the write disposition in your query job; this would overwrite the table and, in doing so, allow schema changes.
See this post for how to have BigQuery automatically add new fields to a schema while doing an append.
The code in Python is:
job_config.schema_update_options = ['ALLOW_FIELD_ADDITION']
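Put together, a minimal sketch of such an append job with the google-cloud-bigquery client might look like this; the table ID and query are placeholders:

from google.cloud import bigquery

# Minimal sketch: append query results to an existing table while allowing
# the query to introduce new fields (e.g. debug_data). IDs are placeholders.
client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.my_table",
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

client.query("SELECT * FROM `my-project.my_dataset.source_table`", job_config=job_config).result()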