How does partitioning in BigQuery work? - google-bigquery

Hi All: I am trying to understand how partitioned tables work. I have a sales table of size 12.9 MB, partitioned by day on a date column. My assumption is that when I filter the table on this date column, BigQuery will reduce the amount of data processed. However, it doesn't seem to work that way, and I would like to understand why.
In the query below, I am filtering sales.date using a subquery. When I execute the query as such, it processes the entire table of 12.9 MB.
However, if I replace the subquery with the actual date (the same value the subquery returns), then the amount of data processed is 4.9 MB.
The subquery alone processes 630 KB of data. If my understanding is right, shouldn't the query below process 4.9 MB + 630 KB ≈ 5.6 MB? But it still processes 12.9 MB. Can someone explain what's happening here?
SELECT
sales.*
FROM `my-project.transaction_data.sales_table` sales
WHERE DATE(sales.date) >= DATE_SUB(DATE((SELECT MAX(temp.date) FROM `my-project.transaction_data.sales_table` temp)), INTERVAL 2 YEAR)
ORDER BY sales.customer, sales.date

Can someone explain what’s happening here?
This is expected behavior.
In general, partition pruning will reduce query cost when the filters can be evaluated at the outset of the query, without requiring any subquery evaluations or data scans.
Complex queries that require the evaluation of multiple stages of a query in order to resolve the predicate (such as inner queries or subqueries) will not prune partitions from the query.
See more at Querying partitioned tables.
A possible workaround is to use scripting: first calculate the actual date and assign it to a variable, then use that variable in the query, thus eliminating the subquery.
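A minimal sketch of that scripting approach, reusing the table and column names from the question (and assuming sales.date is a TIMESTAMP, since the original query wraps it in DATE()):
DECLARE max_date DATE;
-- Resolve the subquery once and store the result in a scripting variable.
SET max_date = (SELECT DATE(MAX(temp.date)) FROM `my-project.transaction_data.sales_table` temp);
-- The filter value is now known before the main query runs, so the date partitions can be pruned.
SELECT sales.*
FROM `my-project.transaction_data.sales_table` sales
WHERE DATE(sales.date) >= DATE_SUB(max_date, INTERVAL 2 YEAR)
ORDER BY sales.customer, sales.date;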

Related

Should time partition column be set as clustered column in BigQuery? [duplicate]

We are using a public dataset to benchmark BigQuery. We took the same table and partitioned it by day, but it's not clear we are getting many benefits. What's a good balance?
SELECT sum(score)
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`
WHERE creation_date > "2019-01-01"
Takes 1 second, and processes 270.7MB.
Same, with partitions:
SELECT sum(score)
FROM `temp.questions_partitioned`
WHERE creation_date > "2019-01-01"
Takes 2 seconds and processes 14.3 MB.
So we see a benefit in MBs processed, but the query is slower.
What's a good strategy to decide when to partition?
(from an email I received today)
When partitioning a table, you need to consider having enough data for each partition. Think of each partition as a separate file: opening 365 files might be slower than opening one huge file.
In this case, the table used for the benchmark has 1.6 GB of data for 2019 (through June). That's 1.6 GB / 180 days ≈ 9 MB of data per daily partition.
For such a small amount of data, arranging it in daily partitions won't bring much benefit. Consider partitioning the data by year instead. See the following question to learn how:
Partition by week/month/quarter/year to get over the partition limit?
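Newer BigQuery releases also support coarser time-unit partitioning directly, so a yearly layout can be declared without workarounds. A sketch, assuming creation_date is a TIMESTAMP as in the public table (the temp.questions_yearly name is just an example):
CREATE TABLE `temp.questions_yearly`
PARTITION BY TIMESTAMP_TRUNC(creation_date, YEAR)  -- one partition per year instead of per day
AS
SELECT *
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`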
Another alternative is not partitioning the table at all, and instead using clustering to sort the data by date. Then BigQuery can choose the ideal size of each block.
If you want to run your own benchmarks, do this:
CREATE TABLE `temp.questions_partitioned`
PARTITION BY DATE(creation_date)
AS
SELECT *
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`
vs no partitions, just clustering by date:
CREATE TABLE `temp.questions_clustered`
PARTITION BY fake_date
CLUSTER BY creation_date
AS
SELECT *, DATE('2000-01-01') fake_date
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`
Then my query over the clustered table would be:
SELECT sum(score)
FROM `temp.questions_clustered`
WHERE creation_date > "2019-01-01"
And it took 0.5 seconds, 17 MB processed.
Compared:
Raw table: 1 sec, 270.7MB
Partitioned: 2 sec, 14.3 MB
Clustered: 0.5 sec, 17 MB
We have a winner! Clustering organized the daily data (which isn't much for this table) into more efficient blocks than strictly partitioning it by day.
It's also interesting to look at the execution details for each query on these tables:
Slot time consumed
Raw table: 10.683 sec
Partitioned: 7.308 sec
Clustered: 0.718 sec
As you can see, the query over the raw table used a lot of slots (parallelism) to get the results in 1 second: 50 workers processed the whole table with multiple years of data, reading 17.7M rows. The query over the partitioned table also had to use a lot of slots, but that's because each slot was assigned a smallish daily partition; that read used 153 parallel workers over 0.9M rows. The clustered query, by contrast, was able to use a very small amount of slot time: the data was well organized to be read by 57 parallel workers, reading 1.12M rows.
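As an aside, the fake_date trick above was only needed because clustering originally required a partitioned table; newer BigQuery releases can cluster an unpartitioned table directly. A sketch (the temp.questions_clustered_only name is just an example):
CREATE TABLE `temp.questions_clustered_only`
CLUSTER BY creation_date  -- no PARTITION BY needed; BigQuery picks the block sizes
AS
SELECT *
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`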
See also:
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
How can I improve the amount of data queried with a partitioned+clustered table?
how clustering works in BigQuery


How to partition based on the month and year in Azure SQL Data Warehouse

I am going to use ADF to copy 5 billion rows to Azure SQL Data Warehouse. Azure SQL DWH will distribute the table into 60 distributions by default, but I want to add another 50 partitions based on month and year, as follows:
PARTITION ( DateP RANGE RIGHT FOR VALUES
(
'2015-01-01', '2015-02-01', ......'2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01',.......
))
However, the column that I am using to partition the table includes date and time together:
2015-01-01 00:30:00
Do you think my partitioning approach is correct?
5B rows / (50 partitions x 60 Distributions) = 1.7M rows/partition on average
That's probably too many partitions, but if you have a lot of single-month queries it might be worth it. You would definitely want to defragment your columnstores after load.
I tend to agree with David that this is probably overkill for the number of partitions. You'll want to make sure that you have a fairly even distribution of data, and at roughly 1.7M rows per partition you'll be on the low side. You can probably move to quarter-based partitions (e.g., boundaries of '2018-01-01', '2018-04-01', '2018-07-01', '2018-10-01') and still get good query performance. This would give you 4 partitions a year since 2015 (or 20 total). So the math is:
5B rows / (20 partitions * 60 distributions) = 4.167M rows/partition.
While the number of partitions does matter for partition-elimination scenarios, this is a fact table with columnstore indexes, which will do an additional level of index segment elimination at query time. Over-partitioning can make the situation worse rather than better.
The guideline from Microsoft is that, when sizing partitions for columnstore-indexed tables in Azure SQL DW, each partition should contain a MINIMUM of 60 million rows; anything lower may not give optimum performance. The logic is that there should be at least 1 million rows per distribution per partition, and since every table is spread across 60 distributions, each partition effectively exists in all 60 of them, so the minimum works out to 60M rows per partition.
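For reference, a sketch of what a quarterly layout could look like in a dedicated SQL pool CREATE TABLE. The table name, the columns other than DateP, and the hash key are hypothetical, and only the 2018 boundaries are written out:
CREATE TABLE dbo.FactSales
(
    DateP       DATETIME2      NOT NULL,   -- partitioning column from the question
    CustomerKey INT            NOT NULL,   -- hypothetical distribution key
    Amount      DECIMAL(18, 2) NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION
    (
        DateP RANGE RIGHT FOR VALUES
        ('2018-01-01', '2018-04-01', '2018-07-01', '2018-10-01')
    )
);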

Understanding why SQL query is taking so long

I have a fairly large SQL query written. Below is a simplification of the issue I am seeing.
SELECT *
FROM dbo.MyTransactionDetails TDTL
JOIN dbo.MyTransactions TRANS
on TDTL.ID = TRANS.ID
JOIN dbo.Customer CUST
on TRANS.CustID = CUST.CustID
WHERE TDTL.DetailPostTime > CONVERT(datetime, '2015-05-04 10:25:53', 120)
AND TDTL.DetailPostTime < CONVERT(datetime, '2015-05-04 19:25:53', 120)
The MyTransactionDetails contains about 7 million rows and MyTransactions has about 300k rows.
The above query takes about 10 minutes to run which is insane. All indexes have been reindexed and there is an index on all the ID columns.
Now if I add the lines below to the WHERE clause, the query takes about 1 second.
AND TRANS.TransBeginTime > CONVERT(datetime, '2015-05-05 10:25:53', 120)
AND TRANS.TransBeginTime < CONVERT(datetime, '2015-05-04 19:25:53', 120)
I know the contents of the database, and TransBeginTime is almost identical to DetailPostTime, so these extra WHERE clauses shouldn't filter out much more than the JOIN already does.
Why is the addition of these so much faster?
The problem is that I cannot use the filter on TransBeginTime, as it is not guaranteed that the transaction detail will be posted on the same date.
EDIT: I should also add that the execution plan says that 50% of the time is taken up by MyTransactionDetails
The percentages shown in the plan (both estimated and actual) are estimates based on the assumption that the estimated row counts are correct. In bad cases the percentages can be totally wrong, to the point that a reported 1% is actually 95%.
To figure out what is actually happening, turn on "statistics io". That will tell you the logical I/O count per table, and getting that down usually means the time goes down too.
You can also look at the actual plan; there are a lot of things that can cause slowness, like scans, sorts, key lookups, spools, etc. If you include both the statistics I/O output and the execution plan (preferably the actual XML, not just the picture) it is a lot easier to figure out what's going wrong.
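A minimal way to collect both from SSMS before running the query (these SET options are standard SQL Server commands):
-- Report logical/physical reads per table and CPU/elapsed time in the Messages tab.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- Optionally return the actual execution plan as XML alongside the results.
SET STATISTICS XML ON;
-- Run the query here, then check the Messages tab for logical reads per table.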

BigQuery gives Response Too Large error for whole dataset but not for equivalent subqueries

I have a table in BigQuery with the following fields:
time,a,b,c,d
time is a string in ISO8601 format but with a space, a is an integer from 1 to 16000, and the other columns are strings. The table contains one month's worth of data, and there are a few million records per day.
The following query fails with "response too large":
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day,b,c,d,count(a),count(distinct a, 1000000)
from [myproject.mytable]
group by day,b,c,d
order by day,b,c,d asc
However, this query works (the data starts at 2012-01-01)
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day,
b,c,d,count(a),count(distinct a)
from [myproject.mytable]
where UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) = UTC_USEC_TO_DAY(PARSE_UTC_USEC('2012-01-01 00:00:00'))
group by day,b,c,d
order by day,b,c,d asc
This looks like it might be related to this issue. However, because of the group by clause, the top query is equivalent to repeatedly calling the second query. Is the query planner not able to handle this?
Edit: To clarify my test data:
I am using fake test data that I generated. I originally used several fields and tried to get hourly summaries for a month (grouping by hour, where hour is defined via an alias in the SELECT part of the query). When that failed I tried switching to daily summaries. When that failed I reduced the columns involved. That also failed when using count(distinct xxx, 1000000), but it worked when I did just one day's worth. (It also works if I remove the 1000000 parameter, but since that does work with the one-day query, it seems the query planner is not separating things as I would expect.)
The column used for count(distinct) has cardinality 16,000, and the GROUP BY columns have cardinalities 2 and 20, for a total of only about 1,200 expected rows (roughly 30 days × 2 × 20). Column values are quite short, around ten characters.
How many results do you expect? There is currently a limitation of about 64 MB on the total size of results that are allowed. If you're expecting millions of rows as a result, then this may be an expected error.
If the number of results isn't extremely large, it may be that the size problem is not in the final response but in the internal calculation. Specifically, if there are too many results from the GROUP BY, the query can run out of memory. One possible solution is to change "GROUP BY" to "GROUP EACH BY", which alters the way the query is executed. This is a feature that is currently experimental and, as such, is not yet documented.
For your query, since the GROUP BY references fields aliased in the SELECT, you might need to do this:
select day, b, c, d, count(a), count(distinct a, 1000000)
FROM (
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day, b, c, d
from [myproject.mytable]
)
group EACH by day,b,c,d
order by day,b,c,d asc