Change a partitioned table schema in BigQuery - google-bigquery

I have a partitioned table in BigQuery and I want to change its schema.
I have changed a table's schema before by running the following SQL from the web UI:
SELECT * REPLACE ((SELECT AS STRUCT whatever.* EXCEPT (columnName)) AS whatever) FROM `a.b.c`
However, this loses all the previous partitions. When I list the partitions of the newly created table with the following query, it returns only today's date:
SELECT _PARTITIONTIME as pt, FORMAT_TIMESTAMP("%Y%m%d", _PARTITIONTIME) as partition_id
FROM `a.b.c`
GROUP BY _PARTITIONTIME
ORDER BY _PARTITIONTIME
Is it possible to change the schema of a table and also keep its partitions in BigQuery?

The documentation (link 1, link 2) lists the schema modifications that are currently supported, and changing a partitioned table to a non-partitioned table is not among them. However, there is a workaround.
To change a partitioned table to a non-partitioned table, you can use the Console to query your data and either overwrite your current table or write to a new one. As an example, I have a table in BigQuery partitioned by _PARTITIONTIME. I used the following query to create a non-partitioned table:
SELECT *, _PARTITIONTIME as pt FROM `project.dataset.table`
This query reads the data from all of the table's partitions and adds an extra column showing which partition each row came from. Before executing it, choose one of two options: save the results to a new non-partitioned table, or overwrite the current table.
Creating a new table: More (under the query editor) > Query Settings > check the box "Set a destination table for query results" > choose your project and dataset and enter the new table's name > under "Destination table write preference", select "Write if empty".
Overwriting the current table: More (under the query editor) > Query Settings > check the box "Set a destination table for query results" > choose the same project and dataset as your current table > enter the name of the table you want to overwrite > under "Destination table write preference", select "Overwrite table".
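For the original question (dropping a column while keeping the partitions), one possible approach is to rewrite the data one partition at a time, with each query's destination set to a partition decorator of the form table$YYYYMMDD. The sketch below is not taken from the answer above; it only builds the destination names and query strings. It assumes daily ingestion-time partitioning, a destination table that already exists as a partitioned table, and a top-level column to drop. The function name and all table names are hypothetical.

```python
from datetime import date, timedelta


def partition_rewrite_jobs(source_table, dest_table, dropped_column, start, end):
    """Build one (destination, sql) pair per daily partition from start to end, inclusive.

    Each pair is meant to be run as a query job whose destination table is
    the partition decorator (dest_table$YYYYMMDD) with the overwrite
    write preference. Nothing is executed here.
    """
    jobs = []
    day = start
    while day <= end:
        partition_id = day.strftime("%Y%m%d")
        destination = f"{dest_table}${partition_id}"
        sql = (
            f"SELECT * EXCEPT ({dropped_column}) FROM `{source_table}` "
            f"WHERE _PARTITIONTIME = TIMESTAMP('{day.isoformat()}')"
        )
        jobs.append((destination, sql))
        day += timedelta(days=1)
    return jobs


if __name__ == "__main__":
    # Hypothetical table names and column name, for illustration only.
    for destination, sql in partition_rewrite_jobs(
        "project.dataset.table", "project.dataset.table_new",
        "columnName", date(2019, 1, 22), date(2019, 1, 23),
    ):
        print(destination, "<-", sql)
```

Because every partition is written separately, the rows stay in the partition they came from instead of all landing in today's partition.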

Related

BigQuery dynamic table name for sharded tables

I'm trying to create a table with a name that uses information from a source table. This is specifically for creating a sharded table (by day) that I'm adding data to regularly.
For example, I'd like to create a table named with the structure tablename_tablesuffix (where tablesuffix is a date), so that I can extract data from a table in such a way that BigQuery recognises the result as a sharded table.
In this example, tablesuffix would need to be dynamic and based on a date field in the source table.
CREATE TABLE `project.database.tablename_tablesuffix`
AS
SELECT * FROM `project.database.tablename_20230126`
Anyone know if this is possible!?
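One way to get a dynamic suffix is to build the table name outside the SQL and substitute it into the statement before submitting it. The sketch below assumes the date is already known on the client side and that the source table has a DATE column to filter on; the column name event_date, the source table name, and both function names are hypothetical.

```python
from datetime import date


def shard_table_name(base, day):
    """Return the sharded table name: base plus a _YYYYMMDD suffix."""
    return f"{base}_{day.strftime('%Y%m%d')}"


def create_shard_statement(base, source_table, day, date_column="event_date"):
    """Build a CREATE TABLE ... AS SELECT statement for one daily shard.

    date_column is a hypothetical DATE column in the source table.
    """
    target = shard_table_name(base, day)
    return (
        f"CREATE TABLE `{target}` AS\n"
        f"SELECT * FROM `{source_table}`\n"
        f"WHERE {date_column} = DATE '{day.isoformat()}'"
    )


if __name__ == "__main__":
    print(create_shard_statement(
        "project.database.tablename", "project.database.source", date(2023, 1, 26)
    ))
```

The generated name for 26 January 2023 is project.database.tablename_20230126, which matches the naming pattern used in the question.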

Is it possible to change partition metadata in HIVE?

This is an extension of a previous question I asked: How to compare two columns with different data type groups
We are exploring the idea of changing the metadata on the table as opposed to performing a CAST operation on the data in SELECT statements. Changing the metadata in the MySQL metastore is easy enough. But, is it possible to have that metadata change applied to partitions (they are daily)? Otherwise, we might be stuck with current and future data being of type BIGINT while the historical is STRING.
Question: Is it possible to change partition meta data in HIVE? If yes, how?
You can change a partition column's type using this statement:
alter table {table_name} partition column ({column_name} {column_type});
You can also re-create the table definition and change all column types using these steps:
Make your table external, so that it can be dropped without dropping the data:
ALTER TABLE abc SET TBLPROPERTIES('EXTERNAL'='TRUE');
Drop the table (only the metadata will be removed).
Create an EXTERNAL table using the updated DDL, with the types changed and the same LOCATION.
Recover the partitions:
MSCK REPAIR TABLE tablename;
The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is:
ALTER TABLE tablename RECOVER PARTITIONS;
This will add the Hive partition metadata. See the manual here: RECOVER PARTITIONS
Finally, you can make your table MANAGED again if necessary:
ALTER TABLE tablename SET TBLPROPERTIES('EXTERNAL'='FALSE');
Note: All the commands above should be run in Hue, not MySQL.
You cannot change the partition column in Hive; in fact, Hive does not support altering partitioning columns.
Refer to: altering partition column type in Hive
You can think of it this way:
- Hive stores the data by creating a folder in HDFS for each partition column value.
- Altering the partition column would therefore mean changing the whole directory structure and data of the Hive table, which is not possible.
For example, if you have partitioned on year, this is how the directory structure looks:
tab1/clientdata/2009/file2
tab1/clientdata/2010/file3
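To make the directory argument concrete, here is a small sketch of how partition values map to folder paths. It assumes Hive's default layout, in which each partition directory is named column=value (so the 2009 folder above would normally appear as year=2009); the function name is hypothetical.

```python
def partition_dir(table_location, partition_spec):
    """Return the directory for one partition under Hive's default layout.

    partition_spec maps partition column names to values, in the order
    the columns are declared in PARTITIONED BY.
    """
    parts = [f"{column}={value}" for column, value in partition_spec.items()]
    return "/".join([table_location.rstrip("/")] + parts)


if __name__ == "__main__":
    print(partition_dir("tab1/clientdata", {"year": 2009}))
    print(partition_dir("tab1/clientdata", {"year": 2010}))
```

Since the partition column is part of every path, switching to a different partition column means every file has to move to a new directory, which is why the data has to be reloaded.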
If you want to change the partition column, you can perform the steps below.
Create another Hive table with the required changes to the partition column:
Create table new_table ( A int, B String.....)
Load the data from the previous table:
Insert into new_table partition ( B ) select A,B from Prev_table

Need to change the partition column to another column and reload the data into new partitions

I am trying to change the already existing partition column to another column.
The current workflow I'm using:
Backup the existing data
Create a new table with new partition column
Reload the data into new partitions
My problem:
Since our existing partitioned tables hold a huge amount of data, this approach will be costly.
Is there a way to use ALTER TABLE to change the partition column to another column?
You cannot avoid the one-time cost of scanning the table, as you can see from the error message generated by this CREATE OR REPLACE DDL statement:
#standardSQL
CREATE OR REPLACE TABLE `project.dataset.table`
PARTITION BY DATE(ts)
AS
SELECT * FROM `project.dataset.table`
Cannot replace a table with a different partitioning spec. Instead, DROP the table, and then recreate it. New partitioning spec is interval(type:day,field:ts) and existing spec is none
What you can do to save cost is use a WHERE clause to limit the number of partitions you move from the existing table to the new table:
CREATE TABLE project.mydataset.newPartitionTable
PARTITION BY date
OPTIONS (
partition_expiration_days=365,
description="Table with a new partition"
) AS
SELECT * from `project.dataset.table` WHERE
_PARTITIONTIME >= '2019-01-23 00:00:00'
AND _PARTITIONTIME <= '2019-01-23 00:00:00'
For example, you could choose not to move your long-term storage, which is data in tables or partitions that have not been modified for the last 90 consecutive days (see this link for more details).
If you want to keep your original table name, you can drop it after the copy, re-create it with the new partition field, and then use the copy option from the web UI, which is free of charge.
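The partition filter in the query above can be generated for any date range, which helps when copying the data over in several smaller batches. This is a minimal sketch that only builds the WHERE condition; it assumes the source table is partitioned by ingestion time (_PARTITIONTIME), and the function name is hypothetical.

```python
from datetime import date


def partition_time_filter(first_day, last_day):
    """Build a _PARTITIONTIME condition covering first_day to last_day, inclusive."""
    if first_day > last_day:
        raise ValueError("first_day must not be after last_day")
    return (
        f"_PARTITIONTIME >= TIMESTAMP('{first_day.isoformat()}') "
        f"AND _PARTITIONTIME <= TIMESTAMP('{last_day.isoformat()}')"
    )


if __name__ == "__main__":
    condition = partition_time_filter(date(2019, 1, 1), date(2019, 1, 23))
    print(f"SELECT * FROM `project.dataset.table` WHERE {condition}")
```

Only the partitions matched by the filter are scanned, so narrowing the range directly reduces the amount of data read from the existing table.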

Multiple Parquet files while writing to Hive Table(Incremental)

I have a partitioned Hive table:
CREATE EXTERNAL TABLE IF NOT EXISTS CUSTOMER_PART (
NAME string ,
AGE int ,
YEAR INT)
PARTITIONED BY (CUSTOMER_ID decimal(15,0))
STORED AS PARQUET LOCATION 'HDFS LOCATION'
The first load is done from Oracle to Hive via PySpark using:
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID) SELECT NAME, AGE, YEAR, CUSTOMER_ID FROM CUSTOMER;
This works fine and creates the partitions dynamically during the run. However, loading data incrementally every day creates an individual file for each single record under the partition.
INSERT INTO TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID = 3) SELECT NAME, AGE, YEAR FROM CUSTOMER WHERE CUSTOMER_ID = 3; --Assume this gives me the latest record in the database
Is it possible to append the value to the existing Parquet file under the partition until it reaches its block size, without creating a smaller file for each insert?
Rewriting the whole partition is one option, but I would prefer not to do this:
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID = 3) SELECT NAME, AGE, YEAR FROM CUSTOMER WHERE CUSTOMER_ID = 3;
The following properties are set for Hive:
set hive.execution.engine=tez; -- TEZ execution engine
set hive.merge.tezfiles=true; -- Notifying that merge step is required
set hive.merge.smallfiles.avgsize=128000000; --128MB
set hive.merge.size.per.task=128000000; -- 128MB
These still don't help with the daily inserts. Any alternative approach would be really helpful.
As far as I know, we can't keep a single file for daily partition data, since each day's load is written as separate part files.
Since you mention that you are importing the data from an Oracle DB, you could import the entire data set each time and overwrite it in HDFS. That way you can maintain a single part file.
Also, HDFS is not recommended for small amounts of data.
I could think of the following approaches for this case:
Approach 1:
Recreate the Hive table, i.e. after loading the incremental data into the CUSTOMER_PART table:
Create a temp_CUSTOMER_PART table with an entire snapshot of the CUSTOMER_PART table data.
Overwrite the final CUSTOMER_PART table by selecting from the temp_CUSTOMER_PART table.
In this case the final table will not contain small files.
NOTE: you need to make sure that no new data is inserted into the CUSTOMER_PART table after the temp table has been created.
Approach 2:
Use the input_file_name() function:
Check how many distinct file names there are in each partition, then select only the partitions that have more than a threshold number of files (10, for example).
Create a temporary table with these partitions and overwrite only the selected partitions in the final table.
NOTE: you need to make sure that no new data is inserted into the CUSTOMER_PART table after the temp table has been created, because we are going to overwrite the final table.
Approach 3:
Hive (not Spark) supports overwriting a table while selecting from the same table, i.e.
insert overwrite table default.t1 partition(partition_column)
select * from default.t1; -- overwrite and select from the same t1 table
If you follow this approach, a Hive job needs to be triggered once your Spark job finishes.
Hive acquires a lock while running the overwrite/select on the same table, so any job writing to the table will wait.
In addition: the ORC format offers CONCATENATE, which merges small ORC files into a new, larger file.
alter table <db_name>.<orc_table_name> [partition (partition_column="val")] concatenate;
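The partition selection step in Approach 2 can be sketched in plain Python. The sketch assumes you have already collected (partition value, file name) pairs, for example by selecting the partition column together with input_file_name() in Spark; the function name and the sample file names are hypothetical.

```python
from collections import defaultdict


def partitions_to_compact(file_rows, min_files=10):
    """Return the partitions that contain more than min_files distinct files.

    file_rows is an iterable of (partition_value, file_name) pairs and may
    contain duplicates, since one file usually holds many rows.
    """
    files_by_partition = defaultdict(set)
    for partition_value, file_name in file_rows:
        files_by_partition[partition_value].add(file_name)
    return sorted(
        partition_value
        for partition_value, names in files_by_partition.items()
        if len(names) > min_files
    )


if __name__ == "__main__":
    # Hypothetical sample: partition 3 has three small files, partition 7 has one.
    rows = [
        (3, "part-0001.parquet"), (3, "part-0002.parquet"),
        (3, "part-0003.parquet"), (3, "part-0001.parquet"),
        (7, "part-0001.parquet"),
    ]
    print(partitions_to_compact(rows, min_files=2))
```

Only the partitions returned here would then be copied to the temporary table and overwritten, which avoids rewriting partitions that are already compact.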

how to convert a non-partitioned table into a partitioned one

How do I rename a table in BigQuery using Standard SQL or Legacy SQL?
I'm trying with Standard SQL, but it gives the following error:
RENAME TABLE dataset.old_table_name TO dataset.new_table_name;
Statement not supported: RenameStatement at [1:1]
Does this mean there is no method (SQL query) that can rename a table?
I just want to change a non-partitioned table into a partitioned table.
You can achieve this in a two-step process:
Step 1 - Export your table to Google Cloud Storage
Step 2 - Load the file from GCS back into BigQuery as a new table with a partitioning column
Both are free of charge
Still, keep in mind some limitations of partitioned tables, such as the number of partitions, which is 4000 per table as of today - https://cloud.google.com/bigquery/quotas#partitioned_tables
Currently it is not possible to rename a table in BigQuery, as explained in this document. You will have to create another table by following the steps given by Mikhail. Note that there is still some charge for table storage, but it is minimal. See this doc for detailed information.
You can use the query below; it creates a new table with the distinct records from the old table, partitioned on the given column.
create or replace table `dataset.new_table` PARTITION BY DATE(date_time_column) as select distinct * from `dataset.old_table`