BigQuery, date partitioned tables and decorator - google-bigquery

I am familiar with using table decorators to query a table, for example, as it was a week ago or for data inserted over a certain date range.
Introducing date-partitioned tables revealed a pseudo column called _PARTITIONTIME. Using a date decorator syntax, you can add records to a certain partition in the table.
I was wondering whether the pseudo column _PARTITIONTIME is also used, behind the scenes, to support table decorators, or something that straightforward.
If yes, can it be accessed/changed, as we do with the pseudo column of partitioned tables?
Is it called _PARTITIONTIME or _INSERTIONTIME? Of course, both didn't work. :)

First, check whether the table is actually partitioned by reading its partitions:
SELECT TIMESTAMP(partition_id)
FROM [dataset.partitioned_table$__PARTITIONS_SUMMARY__]
If it is not, you will get the error: Cannot read partition information from a table that is not partitioned
Then another important step: to select the value of _PARTITIONTIME, you must use an alias.
SELECT
_PARTITIONTIME AS pt,
field1
FROM
mydataset.table1
In a WHERE clause the alias is not required; it is only mandatory when _PARTITIONTIME appears in the SELECT list.
#legacySQL
SELECT
field1
FROM
mydataset.table1
WHERE
_PARTITIONTIME > DATE_ADD(TIMESTAMP('2016-04-15'), -5, "DAY")
You can always reference a single partition of a partitioned table with the decorator: mydataset.table$20160519
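As a small illustration of the decorator format mentioned above, here is a minimal Python sketch that builds the `table$YYYYMMDD` string from a date. The helper name `partition_decorator` is hypothetical and not part of any BigQuery client library; it only formats the string.

```python
from datetime import date


def partition_decorator(table: str, day: date) -> str:
    """Return 'table$YYYYMMDD', the decorator form for one daily partition."""
    return f"{table}${day:%Y%m%d}"


print(partition_decorator("mydataset.table", date(2016, 5, 19)))
# prints mydataset.table$20160519
```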

Related

BigQuery, Wildcard table over non partitioning tables and field based partitioning tables is not yet supported

I'm trying to run a simple query with a wildcard table using standardSQL on Bigquery. Here's the code:
#standardSQL
SELECT dataset_id, SUM(totals.visits) AS sessions
FROM `dataset_*`
WHERE _TABLE_SUFFIX BETWEEN '20150518' AND '20210406'
GROUP BY 1
My sharded dataset contains one table per day since 18/05/2015, so the earliest table is 'dataset_20150518'.
The error is: 'Wildcard table over non partitioning tables and field based partitioning tables is not yet supported, first normal table dataset_test, first column table dataset_20150518.'
I've tried different kinds of selects and aggregations, but the error persists. I just want to query all tables in that timeframe.
This is because all tables matched by a wildcard must have the same schema. In your case, the wildcard also picks up dataset_test, which does not have the same schema as the others (is dataset_test a partitioned table?).
You should be able to get around this limitation by deleting dataset_test and any other tables with a different schema, or by running this query:
#standardSQL
SELECT dataset_id, SUM(totals.visits) AS sessions
FROM `dataset_20*`
WHERE _TABLE_SUFFIX BETWEEN '150518' AND '210406'
GROUP BY 1
Official documentation
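To see why narrowing the prefix helps, here is a rough Python sketch of how a wildcard prefix and _TABLE_SUFFIX relate. It is only a model of the matching logic, not BigQuery's implementation; the helper names `table_suffix` and `matching_tables` and the three table names in the list are assumptions for illustration.

```python
def table_suffix(table_name, prefix):
    """Return the part of table_name after prefix, or None if `prefix*` does not match."""
    if table_name.startswith(prefix):
        return table_name[len(prefix):]
    return None


def matching_tables(tables, prefix, low, high):
    """Tables matched by `prefix*` whose suffix lies lexically between low and high."""
    result = []
    for name in tables:
        suffix = table_suffix(name, prefix)
        if suffix is not None and low <= suffix <= high:
            result.append(name)
    return result


tables = ["dataset_20150518", "dataset_20210406", "dataset_test"]

# `dataset_*` also matches dataset_test (its suffix is "test").
print([t for t in tables if table_suffix(t, "dataset_") is not None])

# `dataset_20*` cannot match dataset_test, and the suffixes lose the leading "20".
print(matching_tables(tables, "dataset_20", "150518", "210406"))
```

The point of the sketch is that the wildcard decides which tables are considered at all, while the _TABLE_SUFFIX filter only narrows that set afterwards.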

Do I need to use the pseudo column _PARTITIONTIME when querying from a column partitioned table?

I created a time-partitioned table on BigQuery by using a date column from the table itself:
new_table.time_partitioning = bigquery.TimePartitioning(field='date')
I query the data by a simple request as follows:
SELECT * FROM t where date="2020-04-08"
My question is whether this is sufficient to query only the relevant partition, and thus reduce costs, or whether I also need to add the pseudo column _PARTITIONTIME as outlined in the section on Querying Partitioned Tables?
SELECT * FROM t where _PARTITIONTIME = TIMESTAMP("2020-04-08")
The quick answer is that SELECT * FROM t where date="2020-04-08" is good enough to engage "partition pruning" and reduce cost.
The longer answer is to always check the UI to see whether the partition filter is properly engaged for a given query:
SELECT * FROM `bigquery-public-data.crypto_bitcoin.transactions`
WHERE block_timestamp_month >= "2020-01-01"
(Screenshots omitted: "This month" and "Year to date")

BigQuery: cost of querying tables partitioned by ingestion time vs date/timestamp partitioned

We are trying to build (or rather rebuild) our DWH in the cloud based on BigQuery. We decided to use tables partitioned by a date field (like a 'created_date' field) for our raw data instead of ingestion-time partitions, because with this feature we can load data easily and then query it with "group by" on the partition date column, build datamarts, and so on. We assumed that this partitioning method would increase query speed and reduce cost (versus non-partitioned tables it does), BUT we've discovered that when you query a table with a WHERE on the partition field (like 'select count(*) from table where created_date=current_date'), it costs money.
Our old-style ingestion time partitioned table queries with WHERE _PARTITIONTIME ='' were FREE! (like 'select count(*) from table where _PARTITIONTIME=current_date')
For example:
1) select value1 from table1 where _PARTITIONTIME = current_date
2) select value1 from table1 where created_date = current_date
3) select count(*) from table1 where _PARTITIONTIME = current_date
The second query costs more because it scans 2 columns. It's logical, but not fair((( The 3rd query is absolutely free, btw!
This is a very sad situation, because there is NO WARNING about this 'side effect' in the documentation. This feature is designed to make DB developers' lives easier (I guess), and it is positioned as a best practice and highly recommended by Google. But nobody said that it would also cost you additional money!
So the question is can we somehow query date-field partitioned tables using partition key for free? Is there any other pseudocolumn or method of filtering by partition key available if you use date/timestamp field based partitioning?
(ps: you guys from google must add some pseudocolumn for the date/timestamp partition method if it does not exist).
Thnx!
So the question is can we somehow query date-field partitioned tables using partition key for free?
The answer is No, querying the partition will not be free.
Is there any other pseudocolumn or method of filtering by partition key available if you use date/timestamp field based partitioning?
If you want partitioning by date, this can only be achieved using ingestion-time partitioning with the _PARTITIONTIME pseudocolumn or by partitioning on the values of a selected date/timestamp column. Currently there is no alternative option available. Keep in mind that one of the main goals of partitioning is to reduce the amount of data scanned, mainly by reducing the number of rows that are scanned.
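To make the cost difference between the three example queries in the question concrete, here is a toy Python model of bytes scanned within one partition. The column sizes are made-up numbers, and the assumption that a pseudocolumn filter contributes no scanned bytes is taken from the asker's observation, not from any documented billing formula.

```python
# Hypothetical per-partition column sizes in bytes; not real BigQuery figures.
PARTITION_BYTES = {"value1": 800, "created_date": 400}


def bytes_scanned(referenced_columns):
    """Sum the sizes of the real columns a query references within one partition."""
    return sum(PARTITION_BYTES[column] for column in referenced_columns)


# 1) select value1 ... where _PARTITIONTIME = current_date
query1 = bytes_scanned(["value1"])
# 2) select value1 ... where created_date = current_date
query2 = bytes_scanned(["value1", "created_date"])
# 3) select count(*) ... where _PARTITIONTIME = current_date
query3 = bytes_scanned([])

print(query1, query2, query3)
```

Under these assumptions the second query is more expensive only because the partitioning column is a real column that has to be read, while partition pruning still limits the scan to one partition in all three cases.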
You guys from google must add some pseudocolumn for the date/timestamp partition method if it does not exist
I understand that you would like to have some pseudocolumn for the date column partitioning method, but could you please elaborate a bit more, in your original post, on what values you would like to see in it?
Edit: A feature request has been opened on your behalf. You can follow it here

How can I avoid and/or clean duplicated row in BigQuery?

How should I import data into BigQuery on a daily basis when I have potentially duplicated rows?
Here is a bit of context. I'm updating data on a daily basis from a spreadsheet to BigQuery. I'm using Google Apps Script with a simple WRITE_APPEND method.
Sometimes I'm importing data I've already imported the day before. So I'm wondering how I can avoid this?
Can I build a SQL query to clean duplicate rows from my table every day? Or is it possible to detect duplicates even before importing them (with some specific command in my job definition, for example)?
Thanks!
Step 1: Have a sheet with data to be imported
Step 2: Set up your spreadsheet as a federated data source in BigQuery.
Step 3: Use DML to load data into an existing table
(requires #standardSql)
#standardSQL
INSERT INTO `fh-bigquery.tt.test_import_native` (id, data)
SELECT *
FROM `fh-bigquery.tt.test_import_sheet`
WHERE id NOT IN (
SELECT id
FROM `fh-bigquery.tt.test_import_native`
)
WHERE id NOT IN (...) ensures that only rows with new ids are loaded into the table.
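The effect of the NOT IN filter can be sketched in plain Python as an anti-join on id. This is only a model of the query's logic under the assumption that ids are never NULL; the function name `rows_to_insert` and the sample rows are hypothetical.

```python
def rows_to_insert(sheet_rows, native_rows):
    """Return the sheet rows whose id is not already present in the native table."""
    existing_ids = {row["id"] for row in native_rows}
    return [row for row in sheet_rows if row["id"] not in existing_ids]


native = [{"id": 1, "data": "a"}, {"id": 2, "data": "b"}]
sheet = [{"id": 2, "data": "b"}, {"id": 3, "data": "c"}]

# Only the row with the new id 3 would be appended.
print(rows_to_insert(sheet, native))
```

One caveat about the SQL version: if the subquery returns a NULL id, NOT IN matches no rows at all, so it is worth making sure the id column has no NULLs.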
As far as I know, the answer provided by Felipe Hoffa is the most effective way to avoid duplicate rows, since BigQuery does not normalize data when loading it. The reason is that BigQuery performs best with denormalized data [1]. To better understand this, I'd recommend having a look at this SO thread.
I would also suggest using SQL aggregate or analytic functions to clean the duplicate rows in a BigQuery table, as in Felipe Hoffa's or Jordan Tigani's answers to this SO question.
If you have a large partitioned table and only want to remove duplicates in a given range, without scanning and replacing the whole table (to save cost), use the MERGE statement below:
-- WARNING: back up the table before this operation
-- FOR large size timestamp partitioned table
-- -------------------------------------------
-- -- To de-duplicate rows of a given range of a partition table, using surrogate_key as unique id
-- -------------------------------------------
DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles") ;
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");
MERGE INTO `your_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
SELECT k.*
FROM (
SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k
FROM `your_project`.`data_set`.`the_table` AS original_data
WHERE stamp BETWEEN dt_start AND dt_end
GROUP BY surrogate_key
)
) AS INTERNAL_SOURCE
ON FALSE
WHEN NOT MATCHED BY SOURCE
AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in partition range
THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW
credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a
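The intended result of the MERGE above can be sketched in plain Python: inside the time range, one row per surrogate_key survives; rows outside the range are left alone, duplicates included. This is a model of the outcome only, not of how BigQuery executes the statement. It keeps the first row seen per key, whereas ARRAY_AGG(... LIMIT 1) may pick any of them. The function name `dedupe_range` and the sample rows are hypothetical.

```python
from datetime import datetime


def dedupe_range(rows, start, end):
    """Keep one row per surrogate_key within [start, end]; leave other rows untouched."""
    kept = []
    seen = set()
    for row in rows:
        in_range = start <= row["stamp"] <= end
        if not in_range:
            kept.append(row)  # outside the range: never deleted
        elif row["surrogate_key"] not in seen:
            seen.add(row["surrogate_key"])
            kept.append(row)  # first row for this key inside the range
    return kept


rows = [
    {"surrogate_key": "a", "stamp": datetime(2019, 9, 18)},
    {"surrogate_key": "a", "stamp": datetime(2019, 9, 18)},
    {"surrogate_key": "b", "stamp": datetime(2019, 9, 19)},
    {"surrogate_key": "a", "stamp": datetime(2019, 10, 1)},
    {"surrogate_key": "a", "stamp": datetime(2019, 10, 1)},
]

result = dedupe_range(rows, datetime(2019, 9, 17), datetime(2019, 9, 22))
print(len(result))
```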

BigQuery table partitioning by month

I can't find any documentation relating to this. Is time_partitioning_type=DAY the only way to partition a table in BigQuery? Can this parameter take any other values besides a date?
Note that even if you partition on day granularity, you can still write your queries to operate at the level of months using an appropriate filter on _PARTITIONTIME. For example,
#standardSQL
SELECT * FROM MyDatePartitionedTable
WHERE DATE_TRUNC(EXTRACT(DATE FROM _PARTITIONTIME), MONTH) = '2017-01-01';
This selects all rows from January 2017.
Unfortunately not. BigQuery currently only supports date-partitioned tables.
https://cloud.google.com/bigquery/docs/partitioned-tables
BigQuery offers date-partitioned tables, which means that the table is divided into a separate partition for each date
It seems like this would work:
#standardSQL
CREATE OR REPLACE TABLE `My_Partition_Table`
PARTITION BY event_month
OPTIONS (
description="this is a table partitioned by month"
) AS
SELECT
DATE_TRUNC(DATE(some_event_timestamp), month) as event_month,
*
FROM `TableThatNeedsPartitioning`
For those that run into the error "Too many partitions produced by query, allowed 4000, query produces at least X partitions", due to the 4000 partitions BigQuery limit as of 2023.02, you can do the following:
CREATE OR REPLACE TABLE `My_Partition_Table`
PARTITION BY DATE_TRUNC(date_column, MONTH)
OPTIONS (
description="This is a table partitioned by month"
) AS
-- Your query
Basically, take @david-salmela's answer, but move the DATE_TRUNC part to the PARTITION BY section.
It seems to work exactly like PARTITION BY date_column in terms of querying the table (e.g. WHERE date_column = "2023-02-20"), but my understanding is that you always retrieve data for a whole month in terms of cost.
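As a rough illustration of why monthly partitioning stays under the partition limit, and why a single-day filter still lands in a whole-month partition, here is a small Python sketch. It only mimics what DATE_TRUNC(date_column, MONTH) does to a date; the helper names and the sample dates are assumptions.

```python
from datetime import date


def month_partition(day: date) -> date:
    """Mimic DATE_TRUNC(day, MONTH): map a date to the first day of its month."""
    return day.replace(day=1)


def count_partitions(days, by_month: bool) -> int:
    """Count distinct partitions produced by daily or monthly partitioning."""
    keys = {month_partition(d) if by_month else d for d in days}
    return len(keys)


days = [date(2023, 2, 1), date(2023, 2, 20), date(2023, 3, 1)]

print(count_partitions(days, by_month=False))  # one partition per distinct day
print(count_partitions(days, by_month=True))   # one partition per distinct month
print(month_partition(date(2023, 2, 20)))      # the partition a 2023-02-20 filter falls into
```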