I am currently working on the Optimization of a huge table in Google's BigQuery. The tables has approximately 19 billions records resulting in a total size of 5.2 TB. In order to experiment on performance with regards to clustering and time partitioning, I duplicated the table with a Time Partitioning on a custom DATE MyDate column which is frequently used in queries.
When performing a query with a WHERE clause (for instance, WHERE(MyDate) = "2022-08-08") on the time partitioned table, the query is quicker and only reads around 20 GB compared to the 5.2 TB consumed by the table without partition. So far, so good.
My issue, however, arises when applying an aggregated function, i.e. in my case a MAX(MyDate): the query on the partitioned and the non-partitioned tables read the same amount of data and execute in roughly the same time. However, I would have expected the query on the partitioned table to be way quicker as it only needs to scan a single partition.
There seem to be workarounds by fetching the dataset's metadata (information schema) as described here. However, I would like to avoid solutions like this as it adds complexity to our queries.
Are there a more elegant ways to get the MAX of a time-partitioned BigQuery table based on a custom column without scanning the whole table or fetching metadata from the information schema?
We have the following scenario:
We have an existing table containing approx. 15 billion records. It was not explicitly partitioned on creation.
We are creating a copy of this table with partitions, hoping for faster read time on certain types of queries.
Our tables are on Databricks Cloud, and we use Databricks Delta.
We commonly filter by two columns, one of which is the ID of an entity (350k distinct values) and one of which is the date at which an event occurred (31 distinct values so far, but increasing every day!).
So, in creating our new table, we ran a query like this:
CREATE TABLE the_new_table
USING DELTA
PARTITIONED BY (entity_id, date)
AS SELECT
entity_id,
another_id,
from_unixtime(timestamp) AS timestamp,
CAST(from_unixtime(timestamp) AS DATE) AS date
FROM the_old_table
This query has run for 48 hours and counting. We know that it is making progress, because we have found around 250k prefixes corresponding to the first partition key in the relevant S3 prefix, and there are certainly some big files in the prefixes that exist.
However, we're having some difficulty monitoring exactly how much progress has been made, and how much longer we can expect this to take.
While we waited, we tried out a query like this:
CREATE TABLE a_test_table (
entity_id STRING,
another_id STRING,
timestamp TIMESTAMP,
date DATE
)
USING DELTA
PARTITIONED BY (date);
INSERT INTO a_test_table
SELECT
entity_id,
another_id,
from_unixtime(timestamp) AS timestamp,
CAST(from_unixtime(timestamp) AS DATE) AS date
FROM the_old_table
WHERE CAST(from_unixtime(timestamp) AS DATE) = '2018-12-01'
Notice the main difference in the new table's schema here is that we partitioned only on date, not on entity id. The date we chose contains almost exactly four percent of the old table's data, which I want to point out because it's much more than 1/31. Of course, since we are selecting by a single value that happens to be the same thing we partitioned on, we are in effect only writing one partition, vs. the probably hundred thousand or so.
The creation of this test table took 16 minutes using the same number of worker-nodes, so we would expect (based on this) that the creation of a table 25x larger would only take around 7 hours.
This answer appears to partially acknowledge that using too many partitions can cause the problem, but the underlying causes appear to have greatly changed in the last couple of years, so we seek to understand what the current issues might be; the Databricks docs have not been especially illuminating.
Based on the posted request rate guidelines for S3, it seems like increasing the number of partitions (key prefixes) should improve performance. The partitions being detrimental seems counter-intuitive.
In summary: we are expecting to write many thousands of records in to each of many thousands of partitions. It appears that reducing the number of partitions dramatically reduces the amount of time it takes to write the table data. Why would this be true? Are there any general guidelines on the number of partitions that should be created for data of a certain size?
You should partition your data by date because it sounds like you are continually adding data as time passes chronologically. This is the generally accepted approach to partitioning time series data. It means that you will be writing to one date partition each day, and your previous date partitions are not updated again (a good thing).
You can of course use a secondary partition key if your use case benefits from it (i.e. PARTITIONED BY (date, entity_id))
Partitioning by date will necessitate that your reading of this data will always be made by date as well, to get the best performance. If this is not your use case, then you would have to clarify your question.
How many partitions?
No one can give you answer on how many partitions you should use because every data set (and processing cluster) is different. What you do want to avoid is "data skew", where one worker is having to process huge amounts of data, while other workers are idle. In your case that would happen if one clientid was 20% of your data set, for example. Partitioning by date has to assume that each day has roughly the same amount of data, so each worker is kept equally busy.
I don't know specifically about how Databricks writes to disk, but on Hadoop I would want to see each worker node writing it's own file part, and therefore your write performance is paralleled at this level.
I am not a databricks expert at all but hopefully this bullets can help
Number of partitions
The number of partitions and files created will impact the performance of your job no matter what, especially using s3 as data storage however this number of files should be handled easily by a cluster of descent size
Dynamic partition
There is a huge difference between partition dynamically by your 2 keys instead of one, let me try to address this in more details.
When you partition data dynamically, depending on the number of tasks and the size of the data, a big number of small files could be created per partition, this could (and probably will) impact the performance of next jobs that will require use this data, especially if your data is stored in ORC, parquet or any other columnar format. Note that this will require only a map only job.
The issue explained before, is addressed in different ways, being the most common the file consolidation. For this, data is repartitioned with the purpose of create bigger files. As result, shuffling of data will be required.
Your queries
For your first query, the number of partitions will be 350k*31 (around 11MM!), which is really big considering the amount of shuffling and task required to handle the job.
For your second query (which takes only 16 minutes), the number of required tasks and shuffling required is much more smaller.
The number of partitions (shuffling/sorting/tasks scheduling/etc) and the time of your job execution does not have a linear relationship, that is why the math doesn't add up in this case.
Recomendation
I think you already got it, you should split your etl job in 31 one different queries which will allow to optimize the execution time
My recommendations in case of occupying partitioned columns is
Identify the cardinality of all the columns and select those that have a finite amount in time, therefore exclude identifiers and date columns
Identify the main search to the table, perhaps it is date or by some categorical field
Generate sub columns with a finite cardinality in order to speed up the search example in the case of dates it is possible to decompose it into year, month, day, etc. , or in the case of integer identifiers, decompose them into the integer division of these IDs% [1,2,3 ...]
As I mentioned earlier, using columns with a high cardinality to partition, will cause poor performance, by generating a lot of files which is the worst working case.
It is advisable to work with files that do not exceed 1 GB for this when creating the delta table it is recommended to occupy "coalesce (1)"
If you need to perform updates or insertions, specify the largest number of partitioned columns to rule out the inceserary cases of file reading, which is very effective to reduce times.
I have large size table , close to 1 GB and the size of this table is growing every week, it has total rows as 190 millions, I started getting alerts from HANA to partition this table, so I planned to partition this with column which is frequently used in Where clause.
My HANA System is scale out system with 8 nodes.
In order to compare the partition query performance difference with this un-partitioned table, I created calculation views on top of this un-partitioned table and recorded the query performance.
I partitioned this table using HASH method and by number of servers, and recorded the query performance. By this way I would have good data distribution across servers. I created calculation view and recorded query performance.
To my surprise I have found that my un-partitioned table calculation view query is performing better in comparison to partitioned table calculation view.
This was really shock. Not sure why non partitioned table Calculation view responding better to partitioned table Calculation view.
I have plan viz output files but not sure where to attach it.
Let me know why this is the behaviour?
Ok, this is not a straight-forward question that can be answered correctly as such.
What I can do though is to list some factors that likely will play a role here:
a non-partitioned table needs a single access to the table structure while the partitioned version requires at least one access for each partition
if the SELECT is not actually providing a WHERE condition that can be evaluated by the HASH function used for the partitioning, then all partitions always have to be evaluated and no partition pruning can take place.
HASH partitioning does not take any additional knowledge about the data into account, which means that similar data does not get stored together. This has a negative impact on data compression. Also, each partition requires its own set of value dictionaries for the columns where a single-partition/non-partitioned table only needs one dictionary per column.
You mentioned that you are using a scale-out system. If the table partitions are distributed across the different nodes, then every query will result in cross-node network communication. That is an additional workload and waiting time that simply does not exist with non-partitioned tables.
When joining partitioned tables each partition of the first table has to be joined with each partition of the second table, if no partition-wise join is possible.
There are other/more potential reasons for why a query against partitioned tables can be slower than against a non-partitioned table. All this is extensively explained in the SAP HANA Administration Guide.
As a general guidance, tables should only be partitioned if that cannot be avoided and when the access pattern of queries are well understood. It is definitively not a feature that you just "switch on" and everything will just work fine.
I have a table with 340GB of data, but we use only last one week of data. So to minimize the cost planning to move this data to partition table or shard tables.
I have done some experiment with shard tables and partition. I have created partition table and loaded two days worth of data(two partitions) and created two shard tables(Individual tables). I tried to pull last two days worth of data.
Full table - 27sec
Partition Table - 33 sec
shard tables - 91 sec
Please let me know which way is best. Based on the experiment result is giving quick when I run against full table but full table will scan.
Thanks,
From GCP official documentation on Partitioning versus Sharding you should use Partitioned tables.
Partitioned tables perform better than tables sharded by date. When
you create date-named tables, BigQuery must maintain a copy of the
schema and metadata for each date-named table. Also, when date-named
tables are used, BigQuery might be required to verify permissions for
each queried table. This practice also adds to query overhead and
impacts query performance. The recommended best practice is to use
partitioned tables instead of date-sharded tables.
The difference in performance seems to be due to some background optimizations that have run on the non-partitioned table, but are yet to run on the partitioned table (since the data is newer).
What is the optimal size for external table partition?
I am planning to partition table by year/month/day and we are getting about 2GB of data daily.
Optimal table partitioning is such that matching to your table usage scenario.
Partitioning should be chosen based on:
how the data is being queried (if you need to work mostly with daily data then partition by date).
how the data is being loaded (parallel threads should load their own
partitions, not overlapped)
2Gb is not too much even for one file, though it again depends on your usage scenario. Avoid unnecessary complex and redundant partitions like (year, month, date) - in this case date is enough for partition pruning.
Hive partitions definition will be stored in the metastore, therefore too many partitions will take much space in the metastore.
Partitions will be stored as directories in the HDFS, therefore many partitions keys will produce hirarchical directories which make their scanning slower.
Your query will be executed as a MapReduce job, therefore it's useless to make too tiny partitions.
It's case depending, think how your data will be queried. For your case I prefer one key defined as 'yyyymmdd', hence we will get 365 partitions / year, only one level in the table directory and 2G data / partition which is nice for a MapReduce job.
For the completness of the answer, if you use Hive < 0.12, make your partition key string typed, see here.
Usefull blog here.
Hive partitioning is most effective in cases where the data is sparse. By sparse I mean that the data internally has visible partitions such as by year, month or day.
In your case, partitioning by date doesn't make much sense as each day will have 2 Gb of data which is not too big to handle. Partitioning by week or month makes more sense as it will optimize the query time and will not create too many small partition files.