Need suggestions on splitting a table in BigQuery based on a non-date column along with date partitioning - google-bigquery

We have a date-partitioned table with 5 years of data (with daily incremental loads), running into many millions of records. To improve performance, we are thinking of splitting the table based on a non-date field (id), since all queries will include a WHERE clause on that column (id), and then date-partitioning each of the split tables so that we can query a smaller dataset with a date range. We will not be using wildcard tables: since we will know the id, we plan to append it to the table name and run the query against that specific table. We need to know whether that would be a good option to pursue to improve performance and reduce query cost.
[Update]: We went ahead and split the tables based on the id column (tablename_id), made each table date-partitioned, and clustered it on 4 other columns (the maximum supported) that are commonly used in queries. With that we got better performance and also reduced the data accessed by each query. Based on our testing, it looks like a good option to pursue as long as wildcard querying of the tables is avoided, at least until BigQuery supports partitioning on non-date/non-datetime columns.

We split the tables based on the id column, creating multiple tables (one per id). Each of the split tables is partitioned on the date column. On top of that, each is a clustered table on 4 other columns as needed. Below is the performance on a sample dataset. The old table (UserInfo) has more than 500,000 rows. The stats we captured compare, for a given date range and id, the old (non-split/combined) table against the split table (split on id) in terms of the amount of data processed and the time taken for the same query.
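For reference, the shape of the DDL is sketched below; the project/dataset names, the date column, and the four clustering columns are placeholders for illustration, not our real schema:

-- One split table per id (here id = 123), date-partitioned and
-- clustered on up to four commonly filtered columns.
CREATE TABLE `myproject.mydataset.userinfo_123`
PARTITION BY event_date
CLUSTER BY col_a, col_b, col_c, col_d
AS
SELECT *
FROM `myproject.mydataset.userinfo`
WHERE id = 123;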

This is not possible. BigQuery doesn't support partitioning on non-date columns.
There's a feature request for it. I suggest subscribing to it to keep receiving updates on its availability.

Related

BigQuery clustering: hit clustering with multiple keys

We have a table which is partitioned on time (512 MB per partition), and it also has a cluster key on customer_id and time.
Up to now we had these queries, which are working well:
SELECT column FROM TABLE WHERE customer_id = 'key' and time > '2021-11-10'
SELECT column FROM TABLE WHERE customer_id IN ('key1', 'key2') and time > '2021-11-10'
Today we are trying these queries:
SELECT column FROM TABLE WHERE customer_id IN (SELECT customer_id FROM customers) AND time > '2021-11-10'
We see that this query does not use the clustering, resulting in a lot more data being read from BigQuery. I then found this article, which explains that complex filtering does not work with clustering: https://cloud.google.com/bigquery/docs/querying-clustered-tables#do_not_use_clustered_columns_in_complex_filter_expressions
Is there a solution to define a list of ids outside the query and inject it into the query? (Right now we have to generate the list of ids in code.)
Thanks in advance,
Regards
Concluding the discussion from the comments:
Partitioning and clustering are used to improve query performance and control the cost of querying large amounts of data.
Partitioning segments the data into partitions, and clustering organises the data based on specified columns, i.e. the clustered columns.
As mentioned in this documentation, for clustering to work efficiently your table/partition should be approximately 1 GB or larger.
In your case, trying clustering on 512 MB of data won't make any significant difference in query performance. Prefer clustering over partitioning if partitioning would result in a small amount of data per partition (approximately less than 1 GB).
Refer to this documentation for more information.
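As for injecting the list of ids from outside the query: one possible workaround (a sketch only; the table and column names are taken from the question, and the pruning behaviour should be verified on your data) is BigQuery scripting, which turns the id list into literal constants so the final filter is no longer a complex expression:

DECLARE id_list STRING;

-- Build a comma-separated list of quoted ids.
SET id_list = (
  SELECT STRING_AGG(FORMAT("'%s'", customer_id), ", ")
  FROM customers
);

-- Inject the literals into the final query; clustering can prune on a
-- constant IN list, unlike on IN (subquery).
EXECUTE IMMEDIATE FORMAT("""
  SELECT column
  FROM `mydataset.mytable`
  WHERE customer_id IN (%s)
    AND time > '2021-11-10'
""", id_list);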

Google BigQuery clustered table not reducing query size when running query with WHERE clause on clustered field

I have a Google BigQuery table of 500,000 rows that I have set up to be partitioned by a TIMESTAMP field called Date and clustered by a STRING field called EventCategory (this is just a sample of a table that is over 500 million rows).
I have a duplicate of the table that is not partitioned and not clustered.
I run the following query on both tables:
SELECT
*
FROM
`table_name`
WHERE
EventCategory = "email"
There are only 2,400 rows where EventCategory is "email". When I run the query on the non-clustered table and on the clustered table, the reported bytes processed are essentially the same (result screenshots omitted).
Here is the schema of both the non-clustered and the clustered table:
Date TIMESTAMP NULLABLE
UserId STRING NULLABLE
EventCategory STRING NULLABLE
EventAction STRING NULLABLE
EventLabel STRING NULLABLE
EventValue STRING NULLABLE
There is basically no difference between the two queries in how much data they look through, and I can't figure out why. I have confirmed that the clustered table is partitioned and clustered: in the BigQuery UI the table details actually say so, and running a query filtering by Date greatly reduces the size of the data searched and shows a much smaller estimated query size.
Any help here would be greatly appreciated!
UPDATE:
If I change the query to:
SELECT
*
FROM
`table_name`
WHERE
EventCategory = "ad"
I get the following result:
There are 53,640 rows where EventCategory is "ad", and it looks like clustering did result in less table data being scanned, albeit not much less (529.2 MB compared to 586 MB).
So it looks like clustering is working, but the data is not clustered properly in the table? How would I fix that? I have tried re-creating the table multiple times using DDL, and even saving the table data to JSON in GCS and importing it into a new partitioned and clustered table, but nothing has changed.
Does the date partitioning sit on top of the clustering? Meaning that BigQuery first groups by date and then clusters within those date groups? If so, I think that would probably explain it, but it would render clustering not very useful.
If you have less than 100MB of data per day, clustering won't do much for you - you'll probably get one <=100MB cluster of data for each day.
You haven't mentioned how many days of data you have (# of partitions, as Mikhail asked), but since the total data scanned is 500MB, I'll guess that you have at least 5 days of data, and less than 100MB per day.
Hence the results you are getting seem to be the expected results.
See an example of this at work here:
How can I improve the amount of data queried with a partitioned+clustered table?
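In practice, to actually see the scan shrink on a table like this, you would combine the partition filter with the cluster filter; a sketch using the schema from the question (the date range here is an arbitrary assumption):

-- Partition pruning on Date first, then cluster pruning on EventCategory
-- within the surviving partitions.
SELECT *
FROM `table_name`
WHERE Date >= TIMESTAMP('2021-01-01')
  AND Date < TIMESTAMP('2021-02-01')
  AND EventCategory = "email";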
The reason clustering wasn't helping very much was specific to the table data. The table held event-based data partitioned by day and then clustered by EventCategory (data is clustered within each day's partition). Since every day has a large number of rows for each EventCategory type, querying the entire table for a specific EventCategory still has to search every single partition, and almost every partition will contain some data with that EventCategory, so almost every cluster has to be searched too.
The data are partitioned by day, and within each partition they are clustered.
Clustering works best when you load whole partitions (days) at once, or export a partition (day) to Google Cloud Storage (which should be free) and import it again into another table. When we tried loading something like 4 GB of JSONs at once, the difference was something like 60/10.
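One way to rewrite a table in a single pass so that every partition is fully re-clustered is CREATE OR REPLACE ... AS SELECT; a sketch with placeholder table names, reusing the schema from the question:

CREATE OR REPLACE TABLE `mydataset.events_reclustered`
PARTITION BY DATE(Date)        -- Date is the TIMESTAMP column
CLUSTER BY EventCategory
AS
SELECT *
FROM `mydataset.events`;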

BigQuery - What is the difference between Wildcard and partitioned tables [duplicate]

I'm trying to understand whether there is a difference in BigQuery (in cost, or in what you can query, for example) between:
Creating one table per day (like my_table_2018_02_06)
Creating a time-partitioned table (my-table with time partitioning by day).
Thanks!
Short explanation: querying multiple tables using Wildcard Tables was the proposed alternative before BigQuery had a partitioning mechanism available. The natural evolution was to add the Partitioned Table feature, and currently there is an alpha release consisting of column-based time partitioning, i.e. letting the user define which column (of DATE or TIMESTAMP data type) will be used for the partitioning.
So BigQuery engineers are currently adding new features to table partitioning rather than to the legacy Wildcard Tables methodology, and I'd suggest that you work with partitioned tables.
Long explanation: you are comparing two approaches that in fact are used with the same purpose, but which have different implications:
Wildcard Tables: some time ago, when table partitioning was not a feature supported by BigQuery, Wildcard Tables were the way to query multiple tables using concise SQL queries. A Wildcard Table represents the union of all the tables that match the wildcard expression specified in the SQL statement. However, Wildcard Tables have some limitations, such as:
Do not support views.
Do not support cached results (queries containing wildcard tables are billed every time they are run, even if the "cached results" option is checked).
Only work with native BigQuery storage (cannot work with external tables [Bigtable, Storage or Drive]).
Only available in standard SQL.
Partitioned Tables: these are single tables that are divided into segments, split by date. There is a lot of documentation regarding how to work with Partitioned Tables. Regarding pricing, each partition in a Partitioned Table is considered an independent entity, so if a partition was not updated for the last 90 days, that data is considered long-term storage and is billed with the appropriate discount (as would happen with a normal table). Finally, Partitioned Tables are here to stay, so more features are coming, such as column-based partitioning, which is currently in alpha; you can follow its status in this Public Issue Tracker post. On the other hand, there are also some current limitations to be considered:
Maximum of 2500 partitions per Partitioned Table.
Maximum of 2000 partition updates per table per day.
Maximum of 50 partition updates every 10 seconds.
So in general, it would be advisable to work with Partitioned Tables over multiple tables using Wildcard Tables. However, you should always consider your use case and see which one of the possibilities meets your requirements better.
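To make the comparison concrete, here is a sketch of the two access patterns (project and dataset names are placeholders, and the second query assumes an ingestion-time partitioned table):

-- Sharded day tables, queried through a wildcard:
SELECT *
FROM `myproject.mydataset.my_table_*`
WHERE _TABLE_SUFFIX BETWEEN '2018_02_01' AND '2018_02_06';

-- A single day-partitioned table, pruned via the partition pseudo-column:
SELECT *
FROM `myproject.mydataset.my_table`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2018-02-01') AND TIMESTAMP('2018-02-06');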
One thing to add to your decision criteria here is caching and usage of legacy vs standard SQL.
Since the syntax in standard SQL for selecting multiple tables uses a wildcard, there is no way for the query result to be cached.
Interestingly, the query result would have been cached if legacy SQL were used; just converting the query to standard SQL disables caching.
This may be important to consider, at least in some cases more than others.
Thank you,
Hazem
Not exactly a time partition, but one can benefit from both worlds: wildcard "partitions" and real partitions, to slice the data even further. Below is an example where we first use the date suffix to select only the table holding data from a particular date, and then use actual partitioning within the table to limit the amount of data scanned even further.
Create the first partitioned table, with a date suffix
CREATE TABLE `test_2021-01-05` (x INT64, y INT64)
PARTITION BY RANGE_BUCKET(y, GENERATE_ARRAY(0, 500, 1));
insert `test_2021-01-05` (x,y) values (5,1);
insert `test_2021-01-05` (x,y) values (5,2);
insert `test_2021-01-05` (x,y) values (5,3);
Create the second partitioned table, with a date suffix
CREATE TABLE `test_2021-01-04` (x INT64, y INT64)
PARTITION BY RANGE_BUCKET(y, GENERATE_ARRAY(0, 500, 1));
insert `test_2021-01-04` (x,y) values (4,1);
insert `test_2021-01-04` (x,y) values (4,2);
Select all the data from both tables using wildcard notation; 80B is the whole test set
select * from `test_*`
-- 80B, all the data
Select data from just one table, which is like partitioning on date
select * from `test_*`
where _TABLE_SUFFIX = "2021-01-05"
-- 48B
Select data both from one table (where I am interested in one date) and from only one partition
select * from `test_*`
where _TABLE_SUFFIX = "2021-01-05"
and y = 1
-- 16B, that was the goal
Select data from just one partition across all the tables
select * from `test_*`
where y = 1
-- 32B, only one partition from both tables
The ultimate goal was to limit the data scanned when reading, thus reducing the cost and increasing performance.

Best Index For Partitioned Table

I am querying a fairly large table that has been range-partitioned (by someone else) by date into one partition per day. On average there are about 250,000 records per day. Queries are frequently for a range of days: usually one day, a 7-day week, or a calendar month. Right now querying for more than 2 weeks is not performing well (we have a normal date index). If I query for more than about 5 days it doesn't use the index; with an index hint it performs OK from about 5 days to 14 days, but beyond that the index hint doesn't help much.
Given that the hint does better than the optimizer, I am gathering statistics on the table.
However, my question going forward is: in general, if I wanted to create an index on the date field in the table, is it best to create a range-partitioned index? Is it best to create a range index with a daily range similar to the table partitioning? What would be the best strategy?
This is Oracle 11g.
Thanks,
Related to your question: the partitioning strategy will depend on how you are going to query the data; the best strategy is to query as few partitions as possible. E.g. if you are going to run monthly reports, you'd rather create monthly range partitioning, not daily range partitioning. If all your queries concern data from within a couple of days, then daily range partitioning is fine.
Given the numbers you provided, in my opinion you over-partition the data.
P.S. Querying each partition requires additional reads (compared to a single partition), so the optimizer opts for a full table scan to reduce the reading of indexes.
Try creating a global index on the date column. If the index is partitioned and you select, let's say, 14 days, then Oracle has to read 14 index partitions. With a single index over the entire table, i.e. a "global index", it has to read only one index.
Note that when you truncate or drop a partition, you then have to rebuild the index afterwards.
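A sketch of the two options (table and column names are placeholders); note that a non-partitioned index on a partitioned table is global by default:

-- Global: one index segment spanning all partitions.
CREATE INDEX trx_date_gidx ON trx (trx_date);

-- The local alternative (one index segment per table partition) would be:
-- CREATE INDEX trx_date_lidx ON trx (trx_date) LOCAL;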
I'm guessing that you could be writing your SQL wrong.
You said you're querying by date. If your date column has a time part and you want to extract records from one day, from a specific time of the day, e.g. 20:00-21:00, then yes, an index would be beneficial, and I would recommend a local index for this (partitioned by day, just like the table).
But since your queries span a range of days, it seems this is not the case and you just want all the data (maybe filtered by some other attributes). If so, a partition full scan will always be much faster than index access... provided you benefit from partition pruning! Because if not, and you're actually performing a full table scan, this is expected to be very, very slow (in most cases).
So what could go wrong? Are you using a plain date in the WHERE clause? Note that:
SELECT * FROM trx WHERE trx_date = to_date('2014-04-03', 'YYYY-MM-DD');
will scan only one partition, whereas:
SELECT * FROM trx WHERE trunc(trx_date) = to_date('2014-04-03', 'YYYY-MM-DD');
will scan all partitions, because you apply a function to the partitioning key and the optimizer can no longer determine which partitions to scan.
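If the goal of the trunc() was to match a whole day regardless of the time component, a pruning-friendly rewrite is a half-open range on the raw column:

SELECT * FROM trx
WHERE trx_date >= to_date('2014-04-03', 'YYYY-MM-DD')
  AND trx_date <  to_date('2014-04-04', 'YYYY-MM-DD');
-- Only the partition(s) for that day are scanned, because the
-- partitioning key is used without a function applied to it.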
It would be much easier to tell for sure if you provided table definition, total number of partitions, sample data and your queries with explain plans. If possible, please edit your question and include more details.