Should time partition column be set as clustered column in BigQuery? [duplicate] - google-bigquery

We are using a public dataset to benchmark BigQuery. We took the same table and partitioned it by day, but it's not clear we are getting many benefits. What's a good balance?
SELECT sum(score)
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`
WHERE creation_date > "2019-01-01"
Takes 1 second, and processes 270.7MB.
Same, with partitions:
SELECT sum(score)
FROM `temp.questions_partitioned`
WHERE creation_date > "2019-01-01"
Takes 2 seconds and processes 14.3 MB.
So we see a benefit in MBs processed, but the query is slower.
What's a good strategy to decide when to partition?
(from an email I received today)

When partitioning a table, you need to consider having enough data for each partition. Think of each partition as a separate file: opening 365 small files can be slower than reading one big one.
In this case, the table used for the benchmark has 1.6 GB of data for 2019 (through June in this snapshot). That's 1.6 GB / 180 ≈ 9 MB of data for each daily partition.
For such a small amount of data, arranging it in daily partitions won't bring much benefit. Consider partitioning the data by year instead. See the following question to learn how:
Partition by week/month/quarter/year to get over the partition limit?
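(BigQuery has since added monthly and yearly time-unit partitioning, so a yearly-partitioned version can also be created directly. A minimal sketch; the table name questions_partitioned_by_year is made up for the example, and creation_date is assumed to be a TIMESTAMP, as in the dataset above:)
CREATE TABLE `temp.questions_partitioned_by_year`
PARTITION BY TIMESTAMP_TRUNC(creation_date, YEAR)  -- one partition per year instead of per day
AS
SELECT *
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`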
Another alternative is not partitioning the table at all, and instead using clustering to sort the data by date. Then BigQuery can choose the ideal size of each block.
If you want to run your own benchmarks, do this:
CREATE TABLE `temp.questions_partitioned`
PARTITION BY DATE(creation_date)
AS
SELECT *
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`
vs no partitions, just clustering by date:
CREATE TABLE `temp.questions_clustered`
PARTITION BY fake_date    -- dummy partition column: clustering required a partitioned table when this answer was written
CLUSTER BY creation_date  -- the column the data is actually organized by
AS
SELECT *, DATE('2000-01-01') fake_date  -- constant date, so everything lands in a single partition
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`
Then my query over the clustered table would be:
SELECT sum(score)
FROM `temp.questions_clustered`
WHERE creation_date > "2019-01-01"
And it took 0.5 seconds, 17 MB processed.
Compared:
Raw table: 1 sec, 270.7 MB
Partitioned: 2 sec, 14.3 MB
Clustered: 0.5 sec, 17 MB
We have a winner! Clustering organized the daily data (which isn't much for this table) into more efficient blocks than strictly partitioning it by day.
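(Note that BigQuery has since added support for clustering a table without partitioning it at all, so the fake_date trick above is no longer required. A minimal sketch, using a hypothetical table name:)
CREATE TABLE `temp.questions_clustered_only`
CLUSTER BY creation_date  -- no PARTITION BY needed anymore
AS
SELECT *
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`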
It's also interesting to look at the execution details for each query on these tables:
Slot time consumed
Raw table: 10.683 sec
Partitioned: 7.308 sec
Clustered: 0.718 sec
As you can see, the query over the raw table used a lot of slots (parallelism) to get the results in 1 second: 50 workers processed the whole table, spanning multiple years of data and reading 17.7M rows. The query over the partitioned table also had to use a lot of slots, because each slot was assigned a smallish daily partition; that read used 153 parallel workers over 0.9M rows. The clustered query, in contrast, needed very little slot time: the data was organized into blocks that 57 parallel workers could read efficiently, covering 1.12M rows.
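If you want to pull these execution statistics yourself rather than reading them off the query plan in the UI, one option (my addition, not how the numbers above were collected) is the INFORMATION_SCHEMA jobs view. A sketch, assuming your jobs run in the US multi-region:
SELECT
  job_id,
  total_slot_ms / 1000 AS slot_seconds,              -- slot time consumed
  total_bytes_processed / POW(10, 6) AS mb_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC
LIMIT 10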
See also:
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
How can I improve the amount of data queried with a partitioned+clustered table?
how clustering works in BigQuery

Related

SSAS Tabular Data Cube - Fact Table with partition of 160K row partition takes almost an hour to process...Why?

I have an SSAS Tabular data model I developed with VS. In this cube there are many fact and dimension tables with lots of measures. HOWEVER, there is one fact table that has 158 million rows in total, and processing all 158 million rows in this one fact table takes over an hour. To speed up that processing time I decided to create two partitions based on a date column. Partition 1 has historical data and when loaded has 157 million rows; Partition 2 (one month of data) has about 160,000 rows, so VERY VERY small. I only want to process Partition 2 daily. Unfortunately, when I process just Partition 2 the processing time is still almost an hour. How can it be that simply refreshing a 160K-row partition takes 58 minutes? It seems like it is still trying to process the full table.
I will say that when I try to process a separate table that only has 200K rows in total, I am able to process it in under 30 seconds. Shouldn't Partition 2 above also process in under a minute? What would I be doing wrong here, and why would it take so long to process a small partition?
In Summary:
Table A = 158,000,000 rows = 1 hour 13 min to process the full table
  Partition 1 = 157,840,000 rows = 1 hour to process in full
  Partition 2 = 160,000 rows = 58 minutes to process in full
Table B = 200,000 rows = 30 seconds to process in full
  Partition 1 = 200,000 rows = 30 seconds to process
Shouldn't Table A/Partition 2 take 30 seconds to process just like Table B?
I just want to fully process Partition 2 of Table A. I expected the processing time to be under 5 minutes, similar to the Table B result. Instead, processing Partition 2 with its 160K rows takes almost the same time as the entire Table A (Partition 1 + 2).
If you have calculated columns or DAX tables that refer to this table, they will have to be recalculated after the partition loads, which can result in an extended load time. You might be able to test for this by creating a new table with a filter matching the partition and seeing how long it takes to load.
I would also make sure the sort on the table is set to the date.

How does partitioning in BigQuery work?

Hi all: I am trying to understand how partitioned tables work. I have a sales table of size 12.9 MB, partitioned by day on a date column. My assumption is that when I filter the table using this date column, the amount of data processed by BigQuery will be reduced. However, it doesn't seem to work that way, and I would like to understand the reason.
In the query below, I am filtering sales.date using a subquery. When I execute the query as such, it processes the entire table of 12.9 MB.
However, if I replace the subquery with the actual date (the same value the subquery returns), then the amount of data processed is 4.9 MB.
The subquery alone processes 630 KB of data. If my understanding is right, shouldn’t the below given query process 4.9 MB + 630 KB = ~ 5.6 MB? But, it still processes 12.9 MB. Can someone explain what’s happening here?
SELECT
  sales.*
FROM `my-project.transaction_data.sales_table` sales
WHERE DATE(sales.date) >= DATE_SUB(DATE((SELECT MAX(temp.date) FROM `my-project.transaction_data.sales_table` temp)), INTERVAL 2 YEAR)
ORDER BY sales.customer, sales.date
This is expected behavior.
In general, partition pruning will reduce query cost when the filters can be evaluated at the outset of the query, without requiring any subquery evaluations or data scans.
Complex queries that require the evaluation of multiple stages of a query in order to resolve the predicate (such as inner queries or subqueries) will not prune partitions from the query.
See more at Querying partitioned tables.
A possible workaround is to use scripting: first calculate the actual date and assign it to a variable, then use that variable in the query, thus eliminating the subquery.
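A sketch of that scripting workaround, using the table name from the question (the column types are an assumption on my part):
DECLARE max_date DATE;

-- Resolve the subquery once, as a separate statement
SET max_date = (
  SELECT MAX(DATE(date))
  FROM `my-project.transaction_data.sales_table`
);

-- The filter now compares the partitioning column against a known constant value
SELECT
  sales.*
FROM `my-project.transaction_data.sales_table` sales
WHERE DATE(sales.date) >= DATE_SUB(max_date, INTERVAL 2 YEAR)
ORDER BY sales.customer, sales.date;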

What's a good balance to decide when to partition a table in BigQuery?


How to partition based on the month and year in Azure SQL Data Warehouse

I am going to use ADF to copy 5 billion rows to Azure SQL Data Warehouse. Azure SQL DW will distribute the table into 60 distributions by default, but I also want to add another 50 partitions based on the month and year, as follows:
PARTITION ( DateP RANGE RIGHT FOR VALUES
(
'2015-01-01', '2015-02-01', ...... '2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', .......
))
But the column that I am using to partition the table includes date and time together:
2015-01-01 00:30:00
Do you think my partitioning approach is correct?
5B rows / (50 partitions x 60 distributions) = 1.7M rows per partition per distribution on average
That's probably too many partitions, but if you have a lot of single-month queries it might be worth it. You would definitely want to defragment your columnstores after load.
I tend to agree with David that this is probably overkill for the number of partitions. You'll want to make sure that you have a pretty even distribution of data, and with roughly 1.7M rows per partition per distribution you'll be on the lower side. You can probably move to quarter-based partitions (e.g., '2017-12-31', '2018-03-01', '2018-06-30') to get good results for query performance. This would give you 4 partitions a year since 2015 (or 20 total). So the math is:
5B rows / (20 partitions * 60 distributions) = 4.167M rows per partition per distribution.
While the number of partitions does matter for partition elimination scenarios, this is a fact table with columnstore indexes, which will do an additional level of segment elimination at query time. Over-partitioning can make the situation worse rather than better.
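For illustration only, a hedged sketch of what quarter-based RANGE RIGHT boundaries might look like in Azure SQL DW DDL. Only DateP comes from the question; the other columns and the sales_id distribution key are made up for the example:
CREATE TABLE dbo.FactSales
(
    sales_id  BIGINT NOT NULL,      -- hypothetical distribution key
    DateP     DATETIME2 NOT NULL,   -- partitioning column from the question
    amount    DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = HASH(sales_id),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( DateP RANGE RIGHT FOR VALUES
        ('2015-04-01', '2015-07-01', '2015-10-01', '2016-01-01',
         '2016-04-01', '2016-07-01', '2016-10-01', '2017-01-01',
         '2017-04-01', '2017-07-01', '2017-10-01', '2018-01-01',
         '2018-04-01', '2018-07-01', '2018-10-01', '2019-01-01')
    )
);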
The guideline from Microsoft specifies that when sizing partitions, especially for columnstore-indexed tables in Azure SQL DW, the volume should be a MINIMUM of 60 million rows PER partition. Anything lower may NOT give optimum performance. The logic is that there must be a MINIMUM of 1M rows per distribution per partition; since every partition is spread across the sixty distributions, the minimum works out to 60M rows per partition.

Price of my project in Google Big Query pricing

I am a little confused about the pricing of Google BigQuery. I need to get the final price of my project in Google BigQuery. So, how much will 1 month of using BigQuery cost if the project needs the following?
10 GB of new data will be added each day.
25 million inserts will be made each day to a given table; each insert will be 0.4 KB.
Each day, 1,000 queries will be fired over the whole current table of stored data.
All data will be collected (without deletion) for 1 year.
With the information you provided, I'm guessing you'll need to use streaming inserts, as normal uploads to BigQuery datasets have 2 daily limits (daily destination table update limit: 1,000 updates per table per day; load jobs per project per day: 50,000, including failures) that would be hard to work around.
The price of your project will be composed from the 3 parts: storage, streaming inserts and queries.
BigQuery Storage
For the first month you'll pay $0.0067 per day for each lot of 10 GB ($0.02 per GB per month * 10 GB * 1/30 months). So for 30 days the total cost will be about 4 USD (estimated for daily added lots of 10 GB). For the next 11 months it will cost you 0.02 * 300 GB * 11 months = 66 USD.
If a table is not edited for 90 consecutive days, the price of storage for that table automatically drops by 50 percent to $0.01 per GB per month. Also, if you don't access the data you can transfer it to a bucket and choose the Nearline Storage class, which costs $0.01 per GB, resulting in a cost of 11 months * 300 GB * 0.01 = 33 USD.
Streaming inserts
The cost for streaming inserts, for 25,000,000 daily inserts billed at 1 KB each (1 KB is the minimum size per individual row): 25 GB per day → 750 GB per month.
Total price per month: 37.5 USD (750 GB * $0.05 per GB successfully inserted).
Query price
I roughly estimated that each query would require 1 GB of data to be processed, resulting in 1 TB per day (1,000 queries * 1 GB), so it will cost $5/day → 150 USD per month.
As this is the most expensive part of your project, you should estimate it carefully. I'd advise you to run some tests with the queries you plan to use, maybe against the public datasets, and see how much they will cost. Also keep in mind these best practices for limiting query costs.
The total cost is 4 + 66 + 37.5 + 150 = 257.5 USD, not far from the estimated price given by the BigQuery pricing calculator ($255.40).
As this is an estimate, I excluded from the calculations the free quotas of 10 GB/month of free storage and 1 TB/month for queries.