I'm working with a BigQuery partitioned table. The partition is based on a Timestamp column in the data (rather than ingestion-based). We're streaming data into this table at a rate of several million rows per day.
We noticed that our queries based on specific days were scanning much more data than they should in a partitioned table.
Here is the current state of the UNPARTITIONED partition:
I'm assuming that little blip at the bottom-right is normal (streaming buffer for the rows inserted this morning), but there is this massive block of data between mid-November and early-December that lives in the UNPARTITIONED partition, instead of being sent to the proper daily partitions (the partitions for that period don't appear to exist at all in __PARTITIONS_SUMMARY__).
My two questions are:
Is there a particular reason why these rows would not have been partitioned correctly, while data before and after that period is fine?
Is there a way to 'flush' the UNPARTITIONED partition, i.e. force BigQuery to dispatch the rows to their correct daily partition?
I faced a similar type of issue where a lot of rows stayed unpartitioned in a column-based partitioned table. So, what I observed that some records are not partitioned due to the source of the streaming insert. For the soulition, I update the table using the update and set a partitioned date where the partitioned column date is null. For safer side make sure that partitioned date column should not be nullable.
Related
I am currently working on the Optimization of a huge table in Google's BigQuery. The tables has approximately 19 billions records resulting in a total size of 5.2 TB. In order to experiment on performance with regards to clustering and time partitioning, I duplicated the table with a Time Partitioning on a custom DATE MyDate column which is frequently used in queries.
When performing a query with a WHERE clause (for instance, WHERE(MyDate) = "2022-08-08") on the time partitioned table, the query is quicker and only reads around 20 GB compared to the 5.2 TB consumed by the table without partition. So far, so good.
My issue, however, arises when applying an aggregated function, i.e. in my case a MAX(MyDate): the query on the partitioned and the non-partitioned tables read the same amount of data and execute in roughly the same time. However, I would have expected the query on the partitioned table to be way quicker as it only needs to scan a single partition.
There seem to be workarounds by fetching the dataset's metadata (information schema) as described here. However, I would like to avoid solutions like this as it adds complexity to our queries.
Are there a more elegant ways to get the MAX of a time-partitioned BigQuery table based on a custom column without scanning the whole table or fetching metadata from the information schema?
My team and I have been using crate for one of our projects over the passed few years. We have a table with hundreds of millions of records and performance is key.
As we've developed more and more features on this project, we've ran into interesting problem. We have a column on this table labeled 'persist_date' which is when the record actually got persisted into the table. These dates may not always align and we could have a start_date of 2021-06-21 with a persist_date of 2021-10-14.
All of our queries up this point have easily been able to add a partition against start_date. Now we are encountering a problem which requires us to use a non-partitioned column (persist_date) to query against.
As I understand it, crateDB is really performant but only when you query against 1 specific partition at a time. My question now is how would I go about creating a partition for this other date column without duplicated my data? Is there anything other than a partition that might help, like the way the table is clustered?
You could use both columns as partition values.
e.g.
CREATE TABLE two_parted (a TEXT, b TEXT, val DOUBLE) PARTITIONED BY (a,b);
If either a or b are used in a selection, this would limit queries to shards that have either value. However this could lead to more shards, so you might want to partitions not on a daily, but weekly or monthly basis.
I have large size table , close to 1 GB and the size of this table is growing every week, it has total rows as 190 millions, I started getting alerts from HANA to partition this table, so I planned to partition this with column which is frequently used in Where clause.
My HANA System is scale out system with 8 nodes.
In order to compare the partition query performance difference with this un-partitioned table, I created calculation views on top of this un-partitioned table and recorded the query performance.
I partitioned this table using HASH method and by number of servers, and recorded the query performance. By this way I would have good data distribution across servers. I created calculation view and recorded query performance.
To my surprise I have found that my un-partitioned table calculation view query is performing better in comparison to partitioned table calculation view.
This was really shock. Not sure why non partitioned table Calculation view responding better to partitioned table Calculation view.
I have plan viz output files but not sure where to attach it.
Let me know why this is the behaviour?
Ok, this is not a straight-forward question that can be answered correctly as such.
What I can do though is to list some factors that likely will play a role here:
a non-partitioned table needs a single access to the table structure while the partitioned version requires at least one access for each partition
if the SELECT is not actually providing a WHERE condition that can be evaluated by the HASH function used for the partitioning, then all partitions always have to be evaluated and no partition pruning can take place.
HASH partitioning does not take any additional knowledge about the data into account, which means that similar data does not get stored together. This has a negative impact on data compression. Also, each partition requires its own set of value dictionaries for the columns where a single-partition/non-partitioned table only needs one dictionary per column.
You mentioned that you are using a scale-out system. If the table partitions are distributed across the different nodes, then every query will result in cross-node network communication. That is an additional workload and waiting time that simply does not exist with non-partitioned tables.
When joining partitioned tables each partition of the first table has to be joined with each partition of the second table, if no partition-wise join is possible.
There are other/more potential reasons for why a query against partitioned tables can be slower than against a non-partitioned table. All this is extensively explained in the SAP HANA Administration Guide.
As a general guidance, tables should only be partitioned if that cannot be avoided and when the access pattern of queries are well understood. It is definitively not a feature that you just "switch on" and everything will just work fine.
I am checking the feasibility of moving from Redshift to BigQuery. I need help in implementing the below use case on BigQuery.
We have a by day product performance table which is a date partitioned table. It is called product_performance_by_day. There is a row for every product that was sold each day. Every day we process the data at the end of the day and put it in the partition for that day. Then we aggregate this by day performance data over the last 30 days and put it in the table called product_performance_last30days. This aggregation saves querying time and in the case of BigQuery will save the cost as well since it will scan less data.
Here is how we do it in Redshift currently -
We put the aggregated data in a new table e.g. product_performance_last30days_temp. Then drop the product_performance_last30days table and rename product_performance_last30days_temp to product_performance_last30days. So there is very minimal downtime for product_performance_last30days table.
How can we do the same thing in the BigQuery?
Currently, BigQuery does not support renaming tables or materialized views or table aliases. And since we want to save the aggregated data in the same table every day we cannot use destination table if the table is not empty.
You can overwrite the same table by using writeDisposition Specifies the action that occurs if the destination table already exists.
The following values are supported:
WRITE_TRUNCATE: If the table already exists, BigQuery overwrites the table data.
WRITE_APPEND: If the table already exists, BigQuery appends the data to the table.
WRITE_EMPTY: If the table already exists and contains data, a 'duplicate' error is returned in the job result.
The default value is WRITE_EMPTY.
Each action is atomic and only occurs if BigQuery is able to complete the job successfully. Creation, truncation and append actions occur as one atomic update upon job completion.
For RENAMING tables look on this answer.
I'm trying to use dataflow to stream into BQ partitioned table.
The documentation says that:
Data in the streaming buffer has a NULL value for the _PARTITIONTIME column.
I can see that's the case when inserting rows into a date partitioned table.
Is there a way to be able to set the partition time of the rows I want to insert so that BigQuery can infer the correct partition?
So far I've tried doing: tableRow.set("_PARTITIONTIME", milliessinceepoch);
but I get hit with a no such field exception.
As of a month or so ago, you can stream into a specific partition of a date-partitioned table. For example, to insert into partition for date 20160501 in table T, you can call insertall with table name T$20160501
AFAIK, as of writing, BigQuery does not allow specifying the partition manually per row - it is inferred from the time of insertion.
However, as an alternative to BigQuery's built-in partitioned tables feature, you can use Dataflow's feature for streaming to multiple BigQuery tables at the same time: see Sharding BigQuery output tables.