BigQuery - Partitioning necessity

I am designing a BigQuery table that never expires.
Each row is stored based on a Product ID.
There could be daily inserts, and the same Product ID could be inserted again (to maintain historical data).
A VIEW will be written on this table that reads the latest version of each Product ID based on the last inserted timestamp.
SELECT ARRAY_AGG(PRODUCTS ORDER BY INSERT_TIMESTAMP DESC LIMIT 2)[OFFSET(0)]
FROM dataset1.PRODUCTS
GROUP BY PRODUCTID
Will partitioning this table on INSERT_TIMESTAMP help at all? I don't think so. Please confirm.

The query that you have provided won't receive any benefit from partitioning. To reduce the cost and runtime of the query, you should add a filter (if possible) that restricts INSERT_TIMESTAMP to a specific period of time, such as the last seven days.
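A minimal sketch of that filter, assuming INSERT_TIMESTAMP is a TIMESTAMP column; the seven-day window is illustrative:
SELECT ARRAY_AGG(PRODUCTS ORDER BY INSERT_TIMESTAMP DESC LIMIT 2)[OFFSET(0)]
FROM dataset1.PRODUCTS
WHERE INSERT_TIMESTAMP >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)  -- restrict the scanned window
GROUP BY PRODUCTID
With the table partitioned on INSERT_TIMESTAMP, a constant filter like this is what allows partitions to be pruned at all.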

It depends on how you prefer to use the table. If the data doesn't grow exponentially, then you can keep the structure you are currently using. If you think the persisted data will grow huge in the future, then partitioning the table and querying within a specified time range is a good way to plan. You may also create a daily/weekly/monthly (up to you) materialized view that maintains the latest insert timestamp of every product ID, so that you can combine your materialized view and ARRAY_AGG query with a definitive range of INSERT_TIMESTAMP for all product IDs (see the sketch after the query below).
SELECT
  ARRAY_AGG(PRODUCTS ORDER BY INSERT_TIMESTAMP DESC LIMIT 2)[OFFSET(0)]
FROM
  dataset1.PRODUCTS
WHERE
  INSERT_TIMESTAMP >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 MONTH)  -- "last X months"; 3 is illustrative
GROUP BY
  PRODUCTID
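A minimal sketch of the materialized-view idea, assuming the dataset1.PRODUCTS schema above; the view name is illustrative:
CREATE MATERIALIZED VIEW dataset1.PRODUCTS_LATEST_TS AS
SELECT
  PRODUCTID,
  MAX(INSERT_TIMESTAMP) AS LATEST_INSERT_TIMESTAMP  -- latest version per product
FROM
  dataset1.PRODUCTS
GROUP BY
  PRODUCTID;
Joining this view back to the base table on PRODUCTID and LATEST_INSERT_TIMESTAMP could then bound the ARRAY_AGG query to a known timestamp range.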

Related

BigQuery clustering: hit clustering with multiple keys

We have a table that is partitioned on time (512 MB per partition), and it also has a clustering key on customer_id and time.
Up to now we have had these queries, which work well:
SELECT column FROM TABLE WHERE customer_id = 'key' and time > '2021-11-10'
SELECT column FROM TABLE WHERE customer_id IN ('key1', 'key2') and time > '2021-11-10'
Today we are trying these queries:
SELECT column FROM TABLE WHERE customer_id IN (SELECT customer_id FROM customers) AND time > '2021-11-10'
We see that this query does not use the clustering, resulting in a lot more data being read out of BigQuery. I then found this article, explaining that complex filtering does not work with clustering: https://cloud.google.com/bigquery/docs/querying-clustered-tables#do_not_use_clustered_columns_in_complex_filter_expressions
Is there a solution to define a list of IDs outside the query and inject it into the query? (Right now we need to generate the list of IDs in code.)
Thanks in advance.
Concluding the discussion from the comments:
Partitioning and clustering are used to improve query performance and control the cost of querying large amounts of data.
Partitioning segments the data into partitions, while clustering organizes the data based on specified columns, i.e. the clustered columns.
As mentioned in this documentation, for clustering to work efficiently, your table/partition should be approximately 1 GB or greater.
In your case, if you're trying clustering on 512 MB of data, there won't be any significant difference in query performance. You should prefer clustering over partitioning if partitioning results in a small amount of data per partition (approximately less than 1 GB).
Refer to this documentation for more information.
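As a hypothetical sketch, a table matching the layout described in the question would be declared roughly like this (names and types are illustrative):
CREATE TABLE dataset.events
(
  customer_id STRING,
  time        TIMESTAMP,
  event_value STRING
)
PARTITION BY DATE(time)        -- one partition per day of `time`
CLUSTER BY customer_id, time;  -- pruning needs simple, literal filters on customer_id
The literal IN list in the second working query can be pruned against the clustered blocks, while the subquery form cannot, per the documentation linked above.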

Specify the partition # based on date range for that pkey value

We have a DW query that needs to extract data from a very large table, around 10 TB, which is partitioned on a datetime column (let's say time) so that data can be purged based on this column every day. So my understanding is that each partition holds one day's worth of data. From the storage tab (SSMS GUI) I see the number of partitions is 1995.
There is no clustered index on this table, as it is mostly intended for write operations; that's a design choice by the vendor.
A specific partition can be targeted by its number like this:
SELECT a.*
FROM dbo.VLTB AS a
CROSS APPLY
(
    VALUES ($PARTITION.a_func(a.time))
) AS c (pid)
WHERE c.pid = 1896;
The query currently being submitted is:
SELECT * FROM dbo.VLTB
WHERE time >= CONVERT(datetime, '20210601', 112)
  AND time < CONVERT(datetime, '20210602', 112)
So replacing the inequality predicates with an equality that looks in that day's specific partition might help. Users can control the dates sent via the app, but how will they manage if we want them to use a partition number, as in the first query?
Question
How do I make the above query find the partition number for a given day, rather than hard-coding it (for 06/01 I had to supply partition number 1896)? Is there a better way for the script to find the partition number, so that scanning all partitions is avoided and the correct partition number is inserted into the WHERE clause? A sketch of one option follows below.
Thank you
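One hedged sketch, assuming the partition function a_func from the query above: $PARTITION accepts any expression of the partitioning column's type, so the partition number can be computed from the date itself rather than hard-coded. Whether the optimizer actually eliminates the other partitions should be verified in the execution plan.
DECLARE @day datetime = '20210601';  -- supplied by the app/user

SELECT a.*
FROM dbo.VLTB AS a
WHERE $PARTITION.a_func(a.time) = $PARTITION.a_func(@day);  -- both sides use the same partition function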

BigQuery: querying the latest partition, bytes to be processed vs. actually processed

I'm struggling to efficiently query the latest partition of a table, using a date or datetime field. The first approach was to filter like this:
SELECT *
FROM my_table
WHERE observation_date = (SELECT MAX(observation_date) FROM my_table)
But according to BigQuery's processing estimate, that scans the entire table and does not use the partitions. Google even states in their documentation that this happens. It does work if I use an exact value for the partition:
SELECT *
FROM my_table
WHERE observation_date = CURRENT_DATE
But if the table is not up to date, then the query will not return any results and my automatic processes will fail. If I include an offset like observation_date = DATE_SUB(CURRENT_DATE, INTERVAL 2 DAY), I am likely to miss the latest partition.
What is the best practice to get the latest partition efficiently?
What makes this worse is that BigQuery's estimate of the bytes to be processed by the active query does not match what was actually processed, unless I'm misinterpreting those numbers. See the screenshot below with the mismatching values.
[Screenshot: BigQuery UI showing apparently mismatching processed bytes]
Finally a couple of scenarios that I also tested:
If I store a max_date with a DECLARE statement first, as suggested in this post, the estimate seems to work, but it is not clear why. However, the bytes actually processed after running the query are no different from the case that filters the latest partition in the WHERE clause. A sketch of this workaround follows below.
Using the same declared max_date on a table that is both partitioned and clustered, the estimate works only when using a filter for the partition, but fails if I include a filter for the cluster.
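For reference, a minimal sketch of the two-step DECLARE workaround described above; it improves the estimate, though per the observations here the actually billed bytes were unchanged:
DECLARE max_date DATE;

-- Step 1: find the latest partition (reads only the observation_date column).
SET max_date = (SELECT MAX(observation_date) FROM my_table);

-- Step 2: filter on the now-constant value.
SELECT *
FROM my_table
WHERE observation_date = max_date;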
After some iterations I got an answer from Google, and although it doesn't resolve the issue, it acknowledges that it happens.
Tables partitioned with DATE or DATETIME fields cannot be efficiently queried for their latest partition. The best practice remains to filter with something like WHERE observation_date = (SELECT MAX(observation_date) FROM my_table) and that will scan the whole table.
They made a note to try to improve this in the future, but we have to deal with it for now. I hope this helps somebody who was trying to do the same thing.

BigQuery: cost of querying tables partitioned by ingestion time vs date/timestamp partitioned

We are trying to build (or, better said, rebuild) our DWH in the cloud based on BigQuery. We decided to use tables partitioned by a date field (like a created_date field) for our raw data, instead of ingestion-time partitions, because with this feature we can load data easily, then query it with GROUP BY on the partition date column, build data marts, and so on. We assumed this partition method would increase query speed and reduce cost (versus non-partitioned tables: yes), BUT we've discovered that when you query the table with a WHERE on the partition field (like select count(*) from table where created_date = current_date), it costs money.
Our old-style ingestion-time partitioned table queries with a WHERE on _PARTITIONTIME were FREE (like select count(*) from table where _PARTITIONTIME = current_date)!
For example:
1) select value1 from table1 where _PARTITIONTIME = current_date
2) select value1 from table1 where created_date = current_date
3) select count(*) from table1 where _PARTITIONTIME = current_date
The second query costs more, because it has to scan two columns. That's logical, but not fair. The third query is absolutely free, by the way!
This is a very sad situation, because there is NO WARNING about this 'side effect' in the documentation. This feature was designed to make DB developers' lives easier (I guess), and it is positioned as a best-practice feature highly recommended by Google. But nobody said that it would also cost you additional money!
So the question is can we somehow query date-field partitioned tables using partition key for free? Is there any other pseudocolumn or method of filtering by partition key available if you use date/timestamp field based partitioning?
(PS: you guys from Google should add some pseudocolumn for the date/timestamp partition method if it does not exist.)
Thanks!
So the question is can we somehow query date-field partitioned tables using partition key for free?
The answer is No, querying the partition will not be free.
Is there any other pseudocolumn or method of filtering by partition key available if you use date/timestamp field based partitioning?
If you want partitioning by date, this can only be achieved using ingestion-time partitioning with the _PARTITIONTIME pseudocolumn, or by using date values in a chosen date/timestamp column. Currently there is no alternative option available. Keep in mind that one of the main goals of partitioning is to reduce the amount of data being scanned, mainly by reducing the number of rows scanned.
You guys from google must add some pseudocolumn for the date/timestamp partition method if it does not exist
I understand that you would like to have some pseudocolumn for the date-column partitioning method, but could you please elaborate a bit more, in your original post, on what values you would like to see in this pseudocolumn?
Edit: A feature request has been opened on your behalf. You can follow it here.

Need suggestion on splitting a table in BigQuery based on a non-date column along with date partitioning

We have a date-partitioned table with 5 years of data (with daily incremental loads), running into millions and millions of records. To improve performance, we are thinking of splitting the table based on a non-date field (id), as all queries will include a WHERE clause on that column (id), and also partitioning each of the split tables by date so that we can query a smaller dataset within a date range. We will not be using wildcard tables, since we will know the id; we plan to append it to the table name and run the query against that specific table. I need to know whether that would be a good option to pursue to improve performance and reduce query cost.
[Update]: We went ahead and split the tables based on the id column (tablename_id), made each table date-partitioned, and clustered it on 4 other columns (the maximum supported) that are commonly used in queries. With that we were able to get better performance and also reduced the data accessed by each query. Based on our testing it looks like a good option to pursue, as long as wildcard querying of tables is avoided, until BigQuery supports partitioning on non-date/non-datetime columns.
We split the tables based on the id column, creating multiple tables. Each of the split tables is partitioned on the date column. Apart from that, we clustered each table on 4 other columns as needed. Below is the performance on a sample dataset. The old table (UserInfo) has more than 500,000 rows. The stats we captured compare, for a given date range and id, the old (non-split/combined) table and the split table (split based on the id) in terms of the amount of data processed and the time taken for the same query.
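A hypothetical DDL sketch of the layout described above: one table per id value, each date-partitioned and clustered on four commonly filtered columns (all names are illustrative):
CREATE TABLE dataset.UserInfo_123  -- one table per id: tablename_id
(
  event_date DATE,
  col_a STRING,
  col_b STRING,
  col_c STRING,
  col_d STRING
)
PARTITION BY event_date
CLUSTER BY col_a, col_b, col_c, col_d;  -- 4 clustering columns is BigQuery's maximum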
This is not possible: BigQuery doesn't support partitioning on non-date columns.
There's a feature request for it. I suggest subscribing to it to keep receiving information regarding its availability.