BigQuery: Max Date for Time Partition on Custom Date Column - google-bigquery

I am currently working on the Optimization of a huge table in Google's BigQuery. The tables has approximately 19 billions records resulting in a total size of 5.2 TB. In order to experiment on performance with regards to clustering and time partitioning, I duplicated the table with a Time Partitioning on a custom DATE MyDate column which is frequently used in queries.
When performing a query with a WHERE clause (for instance, WHERE(MyDate) = "2022-08-08") on the time partitioned table, the query is quicker and only reads around 20 GB compared to the 5.2 TB consumed by the table without partition. So far, so good.
My issue, however, arises when applying an aggregated function, i.e. in my case a MAX(MyDate): the query on the partitioned and the non-partitioned tables read the same amount of data and execute in roughly the same time. However, I would have expected the query on the partitioned table to be way quicker as it only needs to scan a single partition.
There seem to be workarounds by fetching the dataset's metadata (information schema) as described here. However, I would like to avoid solutions like this as it adds complexity to our queries.
Are there a more elegant ways to get the MAX of a time-partitioned BigQuery table based on a custom column without scanning the whole table or fetching metadata from the information schema?

Related

SAP HANA PARTITIONED TABLE CALCULATION VIEW RUNNING SLOW IN COMPARISON TO NON-PARTITIONED TABLE CALCULATION VIEWE

I have large size table , close to 1 GB and the size of this table is growing every week, it has total rows as 190 millions, I started getting alerts from HANA to partition this table, so I planned to partition this with column which is frequently used in Where clause.
My HANA System is scale out system with 8 nodes.
In order to compare the partition query performance difference with this un-partitioned table, I created calculation views on top of this un-partitioned table and recorded the query performance.
I partitioned this table using HASH method and by number of servers, and recorded the query performance. By this way I would have good data distribution across servers. I created calculation view and recorded query performance.
To my surprise I have found that my un-partitioned table calculation view query is performing better in comparison to partitioned table calculation view.
This was really shock. Not sure why non partitioned table Calculation view responding better to partitioned table Calculation view.
I have plan viz output files but not sure where to attach it.
Let me know why this is the behaviour?
Ok, this is not a straight-forward question that can be answered correctly as such.
What I can do though is to list some factors that likely will play a role here:
a non-partitioned table needs a single access to the table structure while the partitioned version requires at least one access for each partition
if the SELECT is not actually providing a WHERE condition that can be evaluated by the HASH function used for the partitioning, then all partitions always have to be evaluated and no partition pruning can take place.
HASH partitioning does not take any additional knowledge about the data into account, which means that similar data does not get stored together. This has a negative impact on data compression. Also, each partition requires its own set of value dictionaries for the columns where a single-partition/non-partitioned table only needs one dictionary per column.
You mentioned that you are using a scale-out system. If the table partitions are distributed across the different nodes, then every query will result in cross-node network communication. That is an additional workload and waiting time that simply does not exist with non-partitioned tables.
When joining partitioned tables each partition of the first table has to be joined with each partition of the second table, if no partition-wise join is possible.
There are other/more potential reasons for why a query against partitioned tables can be slower than against a non-partitioned table. All this is extensively explained in the SAP HANA Administration Guide.
As a general guidance, tables should only be partitioned if that cannot be avoided and when the access pattern of queries are well understood. It is definitively not a feature that you just "switch on" and everything will just work fine.

How do i create partitioned table from non partitioned table in BigQuery using query?

I have tried this solution Migrating from non-partitioned to Partitioned tables, but i get this error.
"Error: Cannot query rows larger than 100MB limit."
Job ID: sandbox-kiana-analytics:bquijob_4a1b2032_15d2c7d17f3.
Vidhya,
I internally looked at the query you sent to BigQuery, and can see that as part of your query, you are using ARRAY_AGG() to put all data for a day in one row. This results in very large rows, which ultimately exceed Big Query's 100MB per-row limit. This is a rather complex and inefficient way of partitioning the data. Instead, I suggest using the built-in support for data partitioning provided by BigQuery (example here). In this approach, you can create an empty date-partitioned table, and add day-partition data to it for each day.

Bigquery Tier 1 exceeded for partitioned table but not for by day tables

We have two tables in bigquery: one is large (a couple billion rows) and on a 'table-per-day' basis, the other is date partitioned, has the exact same schema but a small subset of the rows (~100 million rows).
I wanted to run a (standard-sql) query with a subselect in form of a join (same when subselect is in the where clause) on the small partitioned dataset. I was not able to run it because tier 1 was exceeded.
If I run the same query on the big dataset (that contains the data I need and a lot of other data) it runs fine.
I do not understand the reason for this.
Is it because:
Partitioned tables need more resources to query
Bigquery has some internal rules that the ratio of data processed to resources needed must meet a certain threshold, i.e. I was not paying enough when I queried the small dataset given the amount of resources I needed.
If 1. is true, we could simply make the small dataset also be on a 'table-per-day' basis in order to solve the issue. But before we do that though we would like to know if it is really going to solve our problem.
Details on the queries:
Big datset
queries 11 GB, runs 50 secs, Job Id remilon-study:bquijob_2adced71_15bf9a59b0a
Small dataset
Job Id remilon-study:bquijob_5b09f646_15bf9acd941
I'm an engineer on BigQuery and I just took a look at your jobs but it looks like your second query has an additional filter with a nested clause that your first query does not. It is likely that that extra processing is making your query exceed your tier. I would recommend running the queries in the BigQuery UI and looking at the Explanation tab to see how the queries differ in the query plan.
If you try running the exact same query (modifying only the partition syntax) for both tables and still get the same error I would recommend filing a bug.

Bigquery Shard Vs Bigquery Partition

I have a table with 340GB of data, but we use only last one week of data. So to minimize the cost planning to move this data to partition table or shard tables.
I have done some experiment with shard tables and partition. I have created partition table and loaded two days worth of data(two partitions) and created two shard tables(Individual tables). I tried to pull last two days worth of data.
Full table - 27sec
Partition Table - 33 sec
shard tables - 91 sec
Please let me know which way is best. Based on the experiment result is giving quick when I run against full table but full table will scan.
Thanks,
From GCP official documentation on Partitioning versus Sharding you should use Partitioned tables.
Partitioned tables perform better than tables sharded by date. When
you create date-named tables, BigQuery must maintain a copy of the
schema and metadata for each date-named table. Also, when date-named
tables are used, BigQuery might be required to verify permissions for
each queried table. This practice also adds to query overhead and
impacts query performance. The recommended best practice is to use
partitioned tables instead of date-sharded tables.
The difference in performance seems to be due to some background optimizations that have run on the non-partitioned table, but are yet to run on the partitioned table (since the data is newer).

What is a good size (# of rows) to partition a table to really benefit?

I.E. if we have got a table with 4 million rows.
Which has got a STATUS field that can assume the following value: TO_WORK, BLOCKED or WORKED_CORRECTLY.
Would you partition on a field which will change just one time (most of times from to_work to worked_correctly)? How many partitions would you create?
The absolute number of rows in a partition is not the most useful metric. What you really want is a column which is stable as the table grows, and which delivers on the potential benefits of partitioning. These are: availability, tablespace management and performance.
For instance, your example column has three values. That means you can have three partitions, which means you can have three tablespaces. So if a tablespace becomes corrupt you lose one third of your data. Has partitioning made your table more available? Not really.
Adding or dropping a partition makes it easier to manage large volumes of data. But are you ever likely to drop all the rows with a status of WORKED_CORRECTLY? Highly unlikely. Has partitioning made your table more manageable? Not really.
The performance benefits of partitioning come from query pruning, where the optimizer can discount chunks of the table immediately. Now each partition has 1.3 million rows. So even if you query on STATUS='WORKED_CORRECTLY' you still have a huge number of records to winnow. And the chances are, any query which doesn't involve STATUS will perform worse than it did against the unpartitioned table. Has partitioning made your table more performant? Probably not.
So far, I have been assuming that your partitions are evenly distributed. But your final question indicates that this is not the case. Most rows - if not all - rows will end up in the WORKED_CORRECTLY. So that partition will become enormous compared to the others, and the chances of benefits from partitioning become even more remote.
Finally, your proposed scheme is not elastic. As the current volume each partition would have 1.3 million rows. When your table grows to forty million rows in total, each partition will hold 13.3 million rows. This is bad.
So, what makes a good candidate for a partition key? One which produces lots of partitions, one where the partitions are roughly equal in size, one where the value of the key is unlikely to change and one where the value has some meaning in the life-cycle of the underlying object, and finally one which is useful in the bulk of queries run against the table.
This is why something like DATE_CREATED is such a popular choice for partitioning of fact tables in data warehouses. It generates a sensible number of partitions across a range of granularities (day, month, or year are the usual choices). We get roughly the same number of records created in a given time span. Data loading and data archiving are usually done on the basis of age (i.e. creation date). BI queries almost invariably include the TIME dimension.
The number of rows in a table isn't generally a great metric to use to determine whether and how to partition the table.
What problem are you trying to solve? Are you trying to improve query performance? Performance of data loads? Performance of purging your data?
Assuming you are trying to improve query performance? Do all your queries have predicates on the STATUS column? Are they doing single row lookups of rows? Or would you want your queries to scan an entire partition?