How to drop columns from a partitioned table in BigQuery - sql

We can not use create or replace table statement for partitioned tables in BigQuery. I can export the table to GCS but BigQuery generates then multiple JSON files that can not be imported into a table in once. Is there a safe way to drop a column from a partitioned table? I use BigQuery's web interface.

Renaming a column is not supported by the Cloud Console, the classic BigQuery web UI, the bq command-line tool, or the API. If you attempt to update a table schema using a renamed column, the following error is returned: BigQuery error in update operation: Provided Schema does not match Table project_id:dataset.table.
There are two ways to manually rename a column:
Using a SQL query: choose this option if you are more concerned about simplicity and ease of use, and you are less concerned about costs.
Recreating the table: choose this option if you are more concerned about costs, and you are less concerned about simplicity and ease of use.
If you want to drop a column you can either:
Use a SELECT * EXCEPT query that excludes the column (or columns) you want to remove and use the query result to overwrite the table or to create a new destination table
You can also remove a column by exporting your table data to Cloud Storage, deleting the data corresponding to the column (or columns) you want to remove, and then loading the data into a new table with a schema definition that does not include the removed column(s). You can also use the load job to overwrite the existing table
There is a guide published for Manually Changing Table Schemas.
edit
In order to change a Partitioned table to a Non-partitioned table, you can use the Console to query your data and overwrite your current table or copy to a new one. As an example, I have a table in BigQuery partitioned by _PARTITIONTIME. I used the following query to create a non-partitioned table,
SELECT *, _PARTITIONTIME as pt FROM `project.dataset.table`
With the above code, you will query the data among all table's partitions and create an extra column to show which partition it came from. Then, before executing it, there are two options, save the view in a new non-partitioned table or overwrite the current table:
Creating a new table go to: More(under the query editor) > Query Settings > Check the box "Set a destination table for query results" > Choose your project, dataset and write your new table's name > Under Destination table write preference check Write if empty.
Overwriting the current table: More(under the query editor) > Query Settings > Check the box "Set a destination table for query results" > Choose the same project and dataset for your current table > Write the same table's name as the one you want to overwrite > Under Destination table write preference check Overwrite table.
credit

Related

How to update table in Athena

I am creating a table 'A' from another table 'B' in Athena using create sql query. However, Table 'B' is updated with new rows every hour. I want to know how can I update the table A data without dropping table A and creating it again.
I tried dropping table and creating it again, but that seems to create performance issue as every time a new table is getting created. I want to insert only new rows in table A whichever are added in Table B
Amazon Athena is a query engine, not a database.
When a query runs on a table, Athena uses the location of a table to determine where the data is stored in an Amazon S3 bucket. It then reads all files in that location (including sub-directories) and runs the query on that data.
Therefore, the easiest way to add data to Amazon Athena tables is to create additional files in that location in Amazon S3. The next time Athena runs a query, those files will be included as part of the referenced table. Even running the INSERT INTO command creates new files in that location. ("Each INSERT operation creates a new file, rather than appending to an existing file.")
If you wish to copy data from Table-B to Table-A, and you know a way to identify which rows to add (eg there is a column with a timestamp), you could use something like:
INSERT INTO table_a
SELECT * FROM table_b
WHERE timestamp_field > (SELECT MAX(timestamp_field FROM table_a))

Add new partition-scheme to existing table in athena with SQL code

Is it even possible to add a partition to an existing table in Athena that currently is without partitions? If so, please also write syntax for doing so in the answer.
For example:
ALTER TABLE table1 ADD PARTITION (ourDateStringCol = '2021-01-01')
The above command will give the following error:
FAILED: SemanticException table is not partitioned but partition spec exists
Note: I have done a web-search, and variants exist for SQL server, or adding a partition to an already partitioned table. However, I personally could not find a case where one could successfully add a partition to an existing non-partitioned table.
This is extremely similar to:
SemanticException adding partiton Hive table
However, the answer given there requires re-creating the table.
I want to do so without re-creating the table.
Partitions in Athena are based on folder structure in S3. Unlike standard RDBMS that are loading the data into their disks or memory, Athena is based on scanning data in S3. This is how you enjoy the scale and low cost of the service.
What it means is that you have to have your data in different folders in a meaningful structure such as year=2019, year=2020, and make sure that the data for each year is all and only in that folder.
The simple solution is to run a CREATE TABLE AS SELECT (CTAS) query that will copy the data and create a new table that can be optimized for your analytical queries. You can choose the table format (Parquet, for example), the compression (SNAPPY, for example), and also the partition schema (per year, for example).

List Impala tables that need invalidate/refresh

How can I programatically find all Impala tables that need INVALIDATE METADATA statement (because they were created in Hive, but not yet known to Impala) or REFRESH (because column added, datafile added, etc.)?
Invalidate Metadata:
As a workaround, create a shell script to do the below steps.
Using beeline, connect to a particular database and run show tables statement and save output data to a file.
Using impala-shell, connect to the same particular database and run show tables statement and save output data to another file.
Now compare both the file to remove the duplicates and get the unique tables list from the first file which is a list of tables which are only in hive but not in impala.
Note:
a. Instead of a particular database each at a time in 1 and 2 steps, you can loop over all databases and save the output to a file. Inside the loop itself, you can redirect and append the output files to another final output file with data in some format like database.table or database_table to get all tables from all databases into a single file. Finally, follow step 3.
b. The unique tables from the second output file after removing duplicates will be tables that are deleted in hive and invalidate metadata needs to be run in impala to remove them from the impala list.
c. Rename of a table in impala can be recognized by hive but vice-versa is not possible and invalidate metadata should be run for both old and new table names to remove and add respectively in impala. This applies to most operations not just rename of table.
Refresh:
Consider a text format table with 2 columns and 1 row data.
Now suppose, a third column is added to that table in the beeline.
select * from table; ---gives 3 columns in beeline and 2 columns in impala since refresh is not run on impala for this table.
If we run compute stats in impala before running refresh in this case, then that newly added column from the beeline will be removed from the table schema in hive as well.
select * from table; ---gives 2 columns in beeline and 2 columns in impala since compute stats from impala deleted the extra column metadata of table although data resides in hdfs for that column. This might cause parsing issues in impala if the column is added somewhere in the middle or front instead of ending.
So it is advised to run REFRESH table name in impala right after adding a new column or doing any modifications in beeline for an existing table to not lose table schema as explained in the above scenario.
refresh table; ---Right after modification in hive run refresh in impala.
select * from table; ---gives 3 columns in beeline and 3 columns in impala since refresh is run before compute stats in impala.

How to add a column in the middle of a ORC partitioned hive table and still be able to query old partitioned files with new structure

Currently I have a Partitioned ORC "Managed" (Wrongly created as Internal first) Hive table in Prod with atleast 100 days worth of data partitioned by year,month,day(~16GB of data).
This table has roughly 160 columns.Now my requirement is to Add a column in the middle of this table and still be able to query the older data(partitioned files).Its is fine if the newly added column shows null for the old data.
What I did so far ?
1)First convert the table to External using below to preserve data files before dropping
alter table <table_name> SET TBLPROPERTIES('EXTERNAL'='TRUE');
2)Drop and Recreate the table with new column in the middle and then Altered the table to add the partition file
However I am unable to read the table after Recreation .I get this Error message
[Simba][HiveJDBCDriver](500312) Error in fetching data rows: *org.apache.hive.service.cli.HiveSQLException:java.io.IOException: java.io.IOException: ORC does not support type conversion from file type array<string> (87) to reader type int (87):33:32;
Any other way to accomplish this ?
No need to drop and recreate the table. Simply use the following statement.
ALTER TABLE default.test_table ADD columns (column1 string,column2 string) CASCADE;
ALTER TABLE CHANGE COLUMN with CASCADE command changes the columns of a table's metadata, and cascades the same change to all the partition metadata.
PS - This will add new columns to the end of the existing columns but before the partition columns. Unfortunately, ORC does not support adding columns in the middle as of now.
Hope that helps!

loading data to hive dynamic partitioned tables

I have created a hive table with dynamic partitioning on a column. Is there a way to directly load the data from files using "LOAD DATA" statement? Or do we have to only depend on creating a non-partitioned intermediate table and load file data to it and then inserting data from this intermediate table to partitioned table as mentioned in Hive loading in partitioned table?
No, the LOAD DATA command ONLY copies the files to the destination directory. It doesn't read the records of the input file, so it CANNOT do partitioning based on record values.
If your input data is already split into multiple files based on partitions, you could directly copy the files to table location in HDFS under their partition directory manually created by you (OR just point to their current location in case of EXTERNAL table) and use the following ALTER command to ADD the partition. This way you could skip the LOAD DATA statement altogether.
ALTER TABLE <table-name>
ADD PARTITION (<...>)
No other go, if we need to insert directly, we'll need to specify partitions manually.
For dynamic partitioning, we need staging table and then insert from there.