Keep history data in a partitioned table
Team,
I have a scenario here: I have two tables, one non-partitioned and the other partitioned on a date field.
I loaded the data from the non-partitioned table into the partitioned table using the write options below.
df.write.partitionBy("date") \
    .format("orc") \
    .mode("overwrite") \
    .saveAsTable("schema.table1")
Now both table counts match, with 3 years of data, which is as expected.
Then I refreshed only the latest one year of data and tried to load it into the partitioned table, but the table ended up with only that 1 year of data, whereas I need all 3 years of data in the partitioned table.
What am I missing here? I have to refresh only 1 year of data, load it into the partitioned table, and keep building history.
Kindly suggest. Thanks.
I need to keep history, refreshing the latest data on a daily basis.
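One likely explanation, not stated in the thread: `mode("overwrite")` with `saveAsTable` replaces the whole table, not just the partitions present in the incoming DataFrame. Keeping history generally means either `mode("append")`, or enabling dynamic partition overwrite (`spark.sql.sources.partitionOverwriteMode=dynamic`) and writing with `insertInto`, so that only the refreshed partitions are replaced. A toy model in plain Python (not Spark) of the three behaviors:

```python
# Toy model of how a partitioned table reacts to different write modes.
# "table" is a dict mapping partition key -> rows. This is not Spark;
# it only illustrates the semantics of each mode.

def write(table, incoming, mode):
    """Apply `incoming` (dict of partition -> rows) to `table` under `mode`."""
    if mode == "overwrite":            # static overwrite: the whole table is replaced
        return dict(incoming)
    if mode == "append":               # existing partitions kept, new rows added
        out = {k: list(v) for k, v in table.items()}
        for k, rows in incoming.items():
            out.setdefault(k, []).extend(rows)
        return out
    if mode == "dynamic_overwrite":    # only partitions present in `incoming` are replaced
        out = {k: list(v) for k, v in table.items()}
        out.update(incoming)
        return out
    raise ValueError(mode)

# Three years of history, then a refresh carrying only the latest year:
history = {"2021": ["r1"], "2022": ["r2"], "2023": ["r3"]}
refresh = {"2023": ["r3_new"]}

print(write(history, refresh, "overwrite"))          # only the refreshed partition survives
print(write(history, refresh, "dynamic_overwrite"))  # all three years kept, 2023 replaced
```

In this toy model, `dynamic_overwrite` is the behavior wanted here: the refreshed year replaces its old partition while the older years survive.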
Related
I have an external Hive table, partitioned on the date column. Each date holds data for multiple A/B test experiments.
I need to create a job that drops experiments which ended more than 6 months ago.
Dropping data in an external Hive partitioned table drops the entire partition, in this case the data for one whole date. Is there a way to drop only part of a date?
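A common workaround (an addition, not from the thread): since a partition is the smallest unit Hive can drop, rewrite the partition keeping only the rows that should survive. A HiveQL sketch; the table and column names are placeholders:

```sql
-- Rewrite one date partition, keeping only experiments that ended
-- less than 6 months (roughly 180 days) ago.
INSERT OVERWRITE TABLE ab_tests PARTITION (dt = '2020-06-01')
SELECT experiment_id, metric, end_date
FROM ab_tests
WHERE dt = '2020-06-01'
  AND end_date > date_sub(current_date, 180);
```

Note that the partition column `dt` is not selected, since it is fixed by the static `PARTITION` clause.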
I have a couple of years of data in a BigQuery table partitioned by day. I want to replace only the data of the last 30 days; however, when I use CREATE OR REPLACE TABLE in BigQuery, it replaces the entire table with just the new dates' partitions. Is there any way to update only those partitions without losing the previous data?
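One option (an addition, not from the thread) is to use DML instead of CREATE OR REPLACE TABLE, so only the affected partitions are touched. A sketch, assuming the table is partitioned on a date column; all table and column names are placeholders:

```sql
-- Remove only the last 30 days of data, then re-insert the refreshed rows.
DELETE FROM `project.dataset.my_table`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);

INSERT INTO `project.dataset.my_table`
SELECT * FROM `project.dataset.refreshed_data`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);
```

Because the filter is on the partitioning column, BigQuery prunes the untouched partitions, and the older history is left intact.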
Recently, I have been working on converting date-suffixed tables into tables partitioned by ingestion time. However, with partitioned tables, how do we know whether a certain date simply contains no data or the table was not created successfully?
Here are more details.
Previously, daily tables were created, and it was fine for some of them to be empty because no results met the criteria. For example:
daily_table_20200601 (100 rows)
daily_table_20200602 (0 rows)
daily_table_20200603 (10 rows)
In this case, I can see that the table daily_table_20200602 exists, so I know my scheduled job ran successfully.
After switching to partitioned tables using ingestion time, I am writing into the table daily_table every day, for example:
daily_table$20200601 (100 rows)
daily_table$20200602 (0 rows)
daily_table$20200603 (10 rows)
But how do we know whether the partition daily_table$20200602 was created successfully or is just empty?
Also, there is something interesting. I am using the API to check whether the partition exists; see the following code:
from google.cloud import bigquery

client = bigquery.Client()
dataset_ref = client.dataset('dataset_name')
table_ref = dataset_ref.table("daily_table$20210101")
client.get_table(table_ref)
The result shows that the table exists. So is there a way to check whether the partition for a certain date exists or not?
There is no separate date table for every partition; partitioning doesn't create a separate table per partition. It is similar to partitioning in a relational database.
The ingestion-time partitioning method adds pseudo columns, _PARTITIONTIME and _PARTITIONDATE for daily partitioning and _PARTITIONTIME for hourly partitioning, which contain the timestamp of the beginning of the day (or hour) in which the data was inserted, and the table is partitioned accordingly.
for this code:
dataset_ref = client.dataset('dataset_name')
table_ref = dataset_ref.table("daily_table$20210101")
client.get_table(table_ref)
This will succeed as long as the partitioned table itself exists, regardless of whether that particular partition contains any data.
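A more direct check (an addition, not from the thread): assuming a BigQuery release that exposes `INFORMATION_SCHEMA.PARTITIONS`, you can inspect the partition metadata instead of calling `get_table`. Note that a day to which zero rows were written may not appear here at all, so checking the load job's completion status remains the most reliable signal that the job ran:

```sql
-- List the partitions that have actually been written, with their row counts.
SELECT partition_id, total_rows, last_modified_time
FROM `dataset_name.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'daily_table'
ORDER BY partition_id;
```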
How do I rename a table in BigQuery using Standard SQL or Legacy SQL?
I'm trying with Standard SQL, but it gives the following error:
RENAME TABLE dataset.old_table_name TO dataset.new_table_name;
Statement not supported: RenameStatement at [1:1]
Does that mean there is no method (SQL query) that can rename a table?
I just want to change a non-partitioned table into a partitioned table.
You can achieve this in a two-step process:
Step 1 - Export your table to Google Cloud Storage.
Step 2 - Load the file from GCS back into BigQuery as a new table with a partitioning column.
Both steps are free of charge.
Still, keep in mind some limitations of partitioned tables, such as the number of partitions: it is 4,000 per table as of today - https://cloud.google.com/bigquery/quotas#partitioned_tables
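The two steps can be sketched with the `bq` command-line tool; bucket, table, and column names are placeholders, so treat this as a template rather than runnable commands:

```shell
# Step 1: export the table to GCS (Avro keeps the schema with the data)
bq extract --destination_format=AVRO dataset.old_table 'gs://my-bucket/export/part-*.avro'

# Step 2: load it back as a new table, partitioned on a date column
bq load --source_format=AVRO --time_partitioning_field=date_col \
  dataset.new_table 'gs://my-bucket/export/part-*.avro'
```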
Currently it is not possible to rename a table in BigQuery, as explained in this document. You will have to create another table by following the steps given by Mikhail. Note that there is still some charge for table storage, but it is minimal. See this doc for detailed information.
You can use the query below; it will create a new table, partitioned on the given column, containing the distinct records from the old table.
CREATE OR REPLACE TABLE `dataset.new_table` PARTITION BY DATE(date_time_column) AS SELECT DISTINCT * FROM `dataset.old_table`
I have an external Hive table which is partitioned on load_date (DD-MM-YYYY). However, the very first partition, say 01-01-2000, has all the data from 1980 through 2000. How can I further partition that old data by year while keeping the existing data (load dates after 01-01-2000) available as is?
First load the data of the '01-01-2000' partition into a table, then use a dynamic partition insert to re-insert it partitioned by year. This might solve your problem.
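The dynamic-partition approach above can be sketched like this; the table and column names are assumptions, and the target table is assumed to be partitioned by a year column:

```sql
-- Allow Hive to create partitions from the data itself.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Re-insert the pre-2000 rows, partitioned by the year of the actual event date.
INSERT OVERWRITE TABLE history_by_year PARTITION (yr)
SELECT t.col1, t.col2, t.event_date, year(t.event_date) AS yr
FROM history t
WHERE t.load_date = '01-01-2000';
```

The dynamic partition column (`yr`) must come last in the SELECT list; Hive uses its value to route each row into the right partition.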