Do I need to use the pseudo column _PARTITIONTIME when querying from a column partitioned table? - google-bigquery

I created a time-partitioned table on BigQuery by using a date column from the table itself:
new_table.time_partitioning = bigquery.TimePartitioning(field='date')
I query the data by a simple request as follows:
SELECT * FROM t where date="2020-04-08"
My question is whether this is sufficient to query the partitioning, and thus reduce costs, or do I need to add also the pseudo columns _PARTITIONTIME as outlined in the section on Querying Partitioned Tables?
SELECT * FROM t where _PARTITIONTIME = TIMESTAMP("2020-04-08")

Quick answer is SELECT * FROM t where date="2020-04-08" is good enough for you to engage "partition pruning" and reduce cost.
Longer answer is always consult UI to see if partition filter is properly engaged for certain query:
SELECT * FROM `bigquery-public-data.crypto_bitcoin.transactions`
WHERE block_timestamp_month >= "2020-01-01"
This month ->
Year to date ->

Related

Specify the partition # based on date range for that pkey value

We have a DW query that needs to extract data from a very large table around 10 TB which is partitioned by datetime column lets say time to purge data based on this column everyday. So my understanding is each partition has worth a day of data. From storage (SSMS GUI) tab I see # of partitions is 1995.
There is no clustered index on this table as its mostly intended for write operations. Just a design by vendor.
SELECT
a.*
FROM dbo.VLTB AS a
CROSS APPLY
(
VALUES($PARTITION.a_func(a.time))
) AS c (pid)
WHERE c.pid = 1896;
Currently query submitted is as
SELECT * from dbo.VLTB
WHERE time >= convert(datetime,'20210601',112)
AND time < convert(datetime,'20210602',112)
So replacing inequality predicates with equality to look in that days specific partition might help. Users via app can control dates when sending but how will they manage if we want them to use partition # as per first query
Question
How do I find a way in above query to find partition number for that day rather than me inserting like for 06/01 I had to give 1896 part#. Is there a better way to have script find the part# to avoid all partitions being scanned and can insert correct part# in where clause query?
Thank you

How do I query one particular partition in Google BigQuery?

I have a partitioned table, with partitions based on timestamp column, like this:
...
partition by date_trunc(update_date, month)
...
Now, how do I query partitions in this table?
Afaik, I can't use _PARTITIONDATE pseudo column, since this is not an ingestion-time partitioned table. Do I just filter my query using date_trunc(update_date, month)?
select * from my_project.my_dataset.information_schema.partitions
where table_name = 'partitioned_table'
Here I can get partition_ids, but I can't address them in my query either.
I'm not sure, but whenever you want to to test if a partition is working: try typing out a SELECT * statement without a filter, and don't run it but check the estimated query size. Then try filtering on what you believe is the partition, also without running, and see if the estimated query size is reduced.
In this case try filtering on date_trunc(update_date, month) and see if it reduces the query size

BigQuery: cost of querying tables partitioned by ingestion time vs date/timestamp partitioned

We are trying to build (or better say rebuild) our DWH in the cloud based on BigQuery. We decided to use 'partitioned by date field' tables (like a 'created_date' field) for our raw data instead of ingestion time partitions because with this feature we can load data easely and then query it with "group by" partition date column, build datamarts bla bla bla. We supposed that this partition method will increase queries speed and reduce it cost (versus non-partitioned tables - yes), BUT we've discovered than when you querying table with WHERE by partition field (like 'select count(*) from table where created_date=current_date'), it will cost money.
Our old-style ingestion time partitioned table queries with WHERE _PARTITIONTIME ='' were FREE! (like 'select count(*) from table where _PARTITIONTIME=current_date')
For example:
1) select value1 from table1 where _PARTITIONTIME = current_date
2) select value1 from table1 where created_date = current_date
3) select count(*) from table1 where _PARTITIONTIME = current_date
The second query costs more, because it will scan 2 columns. Its logical. But not fair((( The 3rd query is absolutely free btw!
This is very sad situation, because there is NO ANY WARNING about this 'side effect' in the documentation. This feature designed to make DB developers life easier (i guess), and it positioned as best practice feature and highly recommended by Google. But nobody said that it will cost you additional money also!
So the question is can we somehow query date-field partitioned tables using partition key for free? Is there any other pseudocolumn or method of filtering by partition key available if you use date/timestamp field based partitioning?
(ps: you guys from google must add some pseudocolumn for the date/timestamp partition method if it does not exist).
Thnx!
So the question is can we somehow query date-field partitioned tables
using partition key for free?
The answer is No, querying the partition will not be free.
Is there any other pseudocolumn or method of filtering by partition
key available if you use date/timestamp field based partitioning?
If you want partitioning by date, this can only be achieved using ingestion-time partitioning with the _PARTITIONTIME pseudocolumn or using dates value in a selected date/timestamp value columns. Currently there is no alternative option available. Keep in mind that one of the main goals of partitioning is reducing the amount of data being scanned mainly by reducing the number of rows that are scanned.
You guys from google must add some pseudocolumn for the date/timestamp partition method if it does not exist
I understand that you would like to have some pseudocolumn for the data column partitioned method, but could you please elaborate a bit more what values you would like to see in this partition in your original post?
Edit: A feature request has been opened on your behalf. You can follow it here

BigQuery, date partitioned tables and decorator

I am familiar with using table decorators to query a table, for example, as it was a week ago or for data inserted over a certain date range.
Introducing date-partitioned tables revealed a pseudo column called _PARTITIONTIME. Using a date decorator syntax, you can add records to a certain partition in the table.
I was wondering if the pseudo column _PARTITIONTIME is also used, behind the scene, to support table decorators or something that straightforward.
If yes, can it be accessed/changed, as we do with the pseudo column of partitioned tables?
Is it called _PARTITIONTIME or _INSERTIONTIME? Of course, both didn't work. :)
First check if indeed the table is partitioned by reading out partitions
SELECT TIMESTAMP(partition_id)
FROM [dataset.partitioned_table$__PARTITIONS_SUMMARY__]
In case not you will get error: Cannot read partition information from a table that is not partitioned
then another important step: To select the value of _PARTITIONTIME, you must use an alias.
SELECT
_PARTITIONTIME AS pt,
field1
FROM
mydataset.table1
but when you use in WHERE it's not mandatory, only when it's in select.
#legacySQL
SELECT
field1
FROM
mydataset.table1
WHERE
_PARTITIONTIME > DATE_ADD(TIMESTAMP('2016-04-15'), -5, "DAY")
you can always reference one partitioned table with the decorator: mydataset.table$20160519

BigQuery table partitioning by month

I can't find any documentation relating to this. Is time_partitioning_type=DAY the only way to partition a table in BigQuery? Can this parameter take any other values besides a date?
Note that even if you partition on day granularity, you can still write your queries to operate at the level of months using an appropriate filter on _PARTITIONTIME. For example,
#standardSQL
SELECT * FROM MyDatePartitionedTable
WHERE DATE_TRUNC(EXTRACT(DATE FROM _PARTITIONTIME), MONTH) = '2017-01-01';
This selects all rows from January of this year.
Unfortunately not. BigQuery currently only supports date-partitioned tables.
https://cloud.google.com/bigquery/docs/partitioned-tables
BigQuery offers date-partitioned tables, which means that the table is divided into a separate partition for each date
It seems like this would work:
#standardSQL
CREATE OR REPLACE TABLE `My_Partition_Table`
PARTITION BY event_month
OPTIONS (
description="this is a table partitioned by month"
) AS
SELECT
DATE_TRUNC(DATE(some_event_timestamp), month) as event_month,
*
FROM `TableThatNeedsPartitioning`
For those that run into the error "Too many partitions produced by query, allowed 4000, query produces at least X partitions", due to the 4000 partitions BigQuery limit as of 2023.02, you can do the following:
CREATE OR REPLACE TABLE `My_Partition_Table`
PARTITION BY DATE_TRUNC(date_column, MONTH)
OPTIONS (
description="This is a table partitioned by month"
) AS
-- Your query
Basically, take #david-salmela 's answer, but move the DATE_TRUNC part to the PARTITION BY section.
It seems to work exactly like PARTITION BY date_column in terms of querying the table (e.g. WHERE date_column = "2023-02-20"), but my understanding is that you always retrieve data for a whole month in terms of cost.