BigQuery - How to keep partition in target table

I need to select rows from a partitioned table and save the result into another table. How can I keep each record's _PARTITIONTIME the same as in the source table? I mean not only the value of _PARTITIONTIME, but the whole partitioning feature, so that I can run further queries on the target table using time decorators and the like.
(I'm using Datalab notebooks)
%%sql -d standard --module TripData
SELECT
HardwareId,
TripId,
StartTime,
StopTime
FROM
`myproject.mydataset.TripData`
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 * 24 HOUR),DAY)
AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(),DAY)

You cannot do this for multiple partitions at once!
You should do it one partition at a time, specifying the target partition with a decorator: targetTable$yyyymmdd
Note: you first need to create the target table as a partitioned table with the matching schema.
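
The one-partition-at-a-time approach is easy to script. Below is a minimal sketch in plain Python that only builds the destination names and the per-partition queries. `targetTable` is the placeholder from the answer, both helper names are hypothetical, and submitting each query with its destination is left to whatever client you use (not shown here).

```python
from datetime import date, timedelta

def partition_decorators(table, last_day, days):
    """Return 'table$yyyymmdd' for `days` consecutive days ending at last_day."""
    first_day = last_day - timedelta(days=days - 1)
    return [
        f"{table}${(first_day + timedelta(days=i)):%Y%m%d}"
        for i in range(days)
    ]

def partition_query(day):
    """Build a query that reads exactly one source partition (hypothetical helper)."""
    return (
        "SELECT HardwareId, TripId, StartTime, StopTime\n"
        "FROM `myproject.mydataset.TripData`\n"
        f"WHERE _PARTITIONTIME = TIMESTAMP('{day:%Y-%m-%d}')"
    )

# One destination per day; run partition_query(day) into the matching target.
for target in partition_decorators("targetTable", date(2016, 5, 19), 3):
    print(target)
```

Each query result would then be written to the decorator for the same day, so the rows land in the partition they came from.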

Related

Performing Date Math on Hive Partition Columns

My data is partitioned by day in the standard Hive format:
/year=2020/month=10/day=01
/year=2020/month=10/day=02
/year=2020/month=10/day=03
/year=2020/month=10/day=04
...
I want to query all data from the last 60 days using Amazon Athena (i.e., Presto). I want this query to use the partition columns (year, month, day) so that only the necessary partition files are scanned. Assuming I can't change the file partition format, what is the best approach to this problem?
You don't have to use year, month, day as the partition keys for the table. You can have a single partition key called date and add the partitions like this:
ALTER TABLE the_table ADD
PARTITION (`date` = '2020-10-01') LOCATION 's3://the-bucket/data/year=2020/month=10/day=01'
PARTITION (`date` = '2020-10-02') LOCATION 's3://the-bucket/data/year=2020/month=10/day=02'
...
With this setup you can even set the type of the partition key to date:
PARTITIONED BY (`date` date)
Now you have a table with a date column typed as a DATE, and you can use any of the date and time functions to do calculations on it.
What you won't be able to do with this setup is use MSCK REPAIR TABLE to load partitions, but you really shouldn't do that anyway – it's extremely slow and inefficient and really something you only do when you have a couple of partitions to load into a new table.
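
Since the partitions have to be added explicitly with this setup, the statement can be generated. A minimal sketch in plain Python, assuming the bucket layout from the question; the function name is hypothetical and the output only mirrors the ALTER TABLE syntax shown above.

```python
from datetime import date, timedelta

def add_partitions_sql(table, bucket_prefix, start, end):
    """Build one ALTER TABLE statement covering every day from start to end inclusive."""
    lines = [f"ALTER TABLE {table} ADD"]
    day = start
    while day <= end:
        # Map the single `date` key onto the existing year=/month=/day= layout
        location = f"{bucket_prefix}/year={day:%Y}/month={day:%m}/day={day:%d}"
        lines.append(f"PARTITION (`date` = '{day.isoformat()}') LOCATION '{location}'")
        day += timedelta(days=1)
    return "\n".join(lines)

statement = add_partitions_sql(
    "the_table", "s3://the-bucket/data", date(2020, 10, 1), date(2020, 10, 2)
)
print(statement)
```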
An alternative to the approach proposed by Theo is to use the following syntax, e.g.:
select ... from my_table where year||month||day between '20200630' and '20201010'
This works when the year, month and day columns are strings with zero-padded month and day values. It's particularly useful for querying across months.
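
For the "last 60 days" case from the question, the two bounds can be computed rather than typed by hand. A minimal sketch in plain Python; the function name is hypothetical, and it assumes the concatenated value is always eight characters (yyyymmdd), which is what makes the string comparison follow calendar order.

```python
from datetime import date, timedelta

def last_n_days_filter(today, n):
    """Return a predicate on year||month||day covering the n days up to today."""
    start = today - timedelta(days=n)
    return f"year||month||day BETWEEN '{start:%Y%m%d}' AND '{today:%Y%m%d}'"

predicate = last_n_days_filter(date(2020, 10, 10), 60)
print(predicate)

# Zero-padding matters: strings compare character by character.
assert "20200630" < "20201010"   # eight characters, sorts chronologically
assert "2020630" > "20201010"    # seven characters, sorts in the wrong place
```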

Bigquery Schedule query to load data to a particular partition

I am using the BigQuery scheduled query functionality to run a query every 30 minutes.
My destination table is a partitioned table, and the partitioning column is 'event_date'.
The scheduled query copies today's data from source_table to dest_table every 30 minutes
(like select * from source_table where event_date = CURRENT_DATE()),
but I would like it to WRITE_TRUNCATE only the existing partition, without truncating the whole table (since I don't want to duplicate today's data every 30 minutes).
Currently, when I schedule this query with partition_field set to event_date and WRITE_TRUNCATE, it truncates the whole table, and the previous data is lost. Is there something I am missing?
Instead of specifying a destination table, you can use MERGE to truncate only one partition.
Unfortunately it is more expensive, because you also pay for deleting the data from dest_table. (Insert is still free.)
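MERGE dest_table t
USING (
  -- Only today's rows should replace today's partition
  SELECT * FROM source_table WHERE event_date = CURRENT_DATE()
) s
ON FALSE
-- Nothing matches, so every target row in today's partition is deleted
WHEN NOT MATCHED BY SOURCE AND t.event_date = CURRENT_DATE() THEN DELETE
-- ... and every row from the source is inserted
WHEN NOT MATCHED BY TARGET THEN INSERT ROW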
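
To see why this replaces only one partition, here is a toy in-memory model of the statement's effect. It is plain Python, not BigQuery code: rows are dicts, the function name is hypothetical, and `source_rows` stands for whatever the USING clause yields. Because ON FALSE never matches, every target row counts as "not matched by source" and every source row as "not matched by target".

```python
from datetime import date

def merge_replace_partition(dest_rows, source_rows, partition_day):
    """Mimic the MERGE: delete the target rows of one day, then insert all source rows."""
    # WHEN NOT MATCHED BY SOURCE AND event_date = <day> THEN DELETE
    kept = [row for row in dest_rows if row["event_date"] != partition_day]
    # WHEN NOT MATCHED BY TARGET THEN INSERT ROW
    return kept + list(source_rows)

today = date(2023, 2, 20)
dest = [
    {"event_date": date(2023, 2, 19), "value": "old"},
    {"event_date": today, "value": "stale"},
]
source = [{"event_date": today, "value": "fresh"}]

result = merge_replace_partition(dest, source, today)
print(result)
```

Running it twice with the same source gives the same result, which is the behaviour wanted from a query scheduled every 30 minutes.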

Creating dynamic partition in Range partitioning

I have the following scenario.
Suppose I have a table with 3 partitions: one for 20190201, one for 20190202 and one for 20190210.
The requirement I have been given is that a partition should be created automatically for whichever date we pass.
Using dynamic SQL, I am able to create a partition after the max partition, e.g. 20190211, but if I try to create a partition for 20190205, it gives an error.
Is there any way to create the partition at run time without data loss, even when a higher partition already exists?
We have been told not to use interval partitioning.
This is very simple.
When creating the table, use interval partitioning on the date column.
You can choose the partition interval as hour/day/month, whichever you like.
Then, whenever you insert new data, each row goes to the correct partition based on its date value, and a new partition is created if it does not exist yet.
Use the syntax below when creating the table:
partition by range ( date_col )
interval ( NUMTODSINTERVAL(1,'day') )
( partition p1 values less than ( date '2016-01-01' ))
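
As an illustration of how a one-day interval maps a date to a partition, the sketch below computes the range a row would fall into. It is a plain-Python model, not Oracle code, under the assumption that intervals are counted from the upper bound of the last range partition (2016-01-01 in the example above); the names are hypothetical.

```python
from datetime import date, timedelta

TRANSITION_POINT = date(2016, 1, 1)  # upper bound of partition p1 in the example
INTERVAL = timedelta(days=1)

def partition_range(row_date):
    """Return (lower_inclusive, upper_exclusive) of the partition holding row_date."""
    if row_date < TRANSITION_POINT:
        # Rows before the transition point belong to the range partition p1
        return (None, TRANSITION_POINT)
    # Whole intervals elapsed since the transition point
    steps = (row_date - TRANSITION_POINT) // INTERVAL
    lower = TRANSITION_POINT + steps * INTERVAL
    return (lower, lower + INTERVAL)

print(partition_range(date(2019, 2, 5)))
```

In this model a row for 20190205 gets its own one-day range even though a later partition such as 20190210 already exists.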

BigQuery, date partitioned tables and decorator

I am familiar with using table decorators to query a table, for example, as it was a week ago or for data inserted over a certain date range.
Introducing date-partitioned tables revealed a pseudo column called _PARTITIONTIME. Using a date decorator syntax, you can add records to a certain partition in the table.
I was wondering whether the pseudo column _PARTITIONTIME is also used behind the scenes to support table decorators, or something similarly straightforward.
If yes, can it be accessed/changed, as we do with the pseudo column of partitioned tables?
Is it called _PARTITIONTIME or _INSERTIONTIME? Of course, both didn't work. :)
First, check whether the table is actually partitioned by listing its partitions:
SELECT TIMESTAMP(partition_id)
FROM [dataset.partitioned_table$__PARTITIONS_SUMMARY__]
If it is not, you will get the error: Cannot read partition information from a table that is not partitioned
Another important point: to select the value of _PARTITIONTIME, you must use an alias.
SELECT
_PARTITIONTIME AS pt,
field1
FROM
mydataset.table1
In a WHERE clause the alias is not required; it is only mandatory when _PARTITIONTIME appears in the SELECT list.
#legacySQL
SELECT
field1
FROM
mydataset.table1
WHERE
_PARTITIONTIME > DATE_ADD(TIMESTAMP('2016-04-15'), -5, "DAY")
You can always reference a single partition of a partitioned table with a decorator: mydataset.table$20160519

BigQuery table partitioning by month

I can't find any documentation relating to this. Is time_partitioning_type=DAY the only way to partition a table in BigQuery? Can this parameter take any other values besides a date?
Note that even if you partition on day granularity, you can still write your queries to operate at the level of months using an appropriate filter on _PARTITIONTIME. For example,
#standardSQL
SELECT * FROM MyDatePartitionedTable
WHERE DATE_TRUNC(EXTRACT(DATE FROM _PARTITIONTIME), MONTH) = '2017-01-01';
This selects all rows from January 2017.
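
The filter works because truncating a date to its month maps every day of that month to the first of the month. A small plain-Python sketch of that mapping; the helper name is hypothetical and only mirrors what DATE_TRUNC(..., MONTH) does for dates.

```python
from datetime import date

def date_trunc_month(day):
    """Return the first day of the month containing `day`."""
    return day.replace(day=1)

# A few daily partitions around the month boundaries
partition_days = [
    date(2016, 12, 31),
    date(2017, 1, 1),
    date(2017, 1, 31),
    date(2017, 2, 1),
]

# Same predicate as the query: truncated date equals 2017-01-01
selected = [d for d in partition_days if date_trunc_month(d) == date(2017, 1, 1)]
print(selected)
```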
Unfortunately not. BigQuery currently only supports date-partitioned tables.
https://cloud.google.com/bigquery/docs/partitioned-tables
BigQuery offers date-partitioned tables, which means that the table is divided into a separate partition for each date
It seems like this would work:
#standardSQL
CREATE OR REPLACE TABLE `My_Partition_Table`
PARTITION BY event_month
OPTIONS (
description="this is a table partitioned by month"
) AS
SELECT
DATE_TRUNC(DATE(some_event_timestamp), month) as event_month,
*
FROM `TableThatNeedsPartitioning`
For those who run into the error "Too many partitions produced by query, allowed 4000, query produces at least X partitions", caused by BigQuery's limit of 4000 partitions (as of 2023.02), you can do the following:
CREATE OR REPLACE TABLE `My_Partition_Table`
PARTITION BY DATE_TRUNC(date_column, MONTH)
OPTIONS (
description="This is a table partitioned by month"
) AS
-- Your query
Basically, take @david-salmela's answer, but move the DATE_TRUNC part into the PARTITION BY clause.
It seems to work exactly like PARTITION BY date_column when querying the table (e.g. WHERE date_column = "2023-02-20"), but my understanding is that, in terms of cost, you always read the data for the whole month.
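
A quick way to check whether a date range would exceed the limit is to count the partitions it produces at each granularity. A minimal sketch in plain Python, assuming every day in the range has at least one row; the function name and the example dates are hypothetical.

```python
from datetime import date

PARTITION_LIMIT = 4000  # limit quoted in the error message above

def count_partitions(start, end):
    """Count daily and monthly partitions for dates from start to end inclusive."""
    days = (end - start).days + 1
    months = (end.year - start.year) * 12 + (end.month - start.month) + 1
    return days, months

days, months = count_partitions(date(2010, 1, 1), date(2023, 2, 20))
print("daily partitions:", days, "over limit:", days > PARTITION_LIMIT)
print("monthly partitions:", months, "over limit:", months > PARTITION_LIMIT)
```

For a range of roughly thirteen years, daily partitioning goes over the limit while monthly partitioning stays far below it.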