BigQuery Shard vs BigQuery Partition - google-bigquery

I have a table with 340 GB of data, but we use only the last one week of data. So, to minimize cost, I am planning to move this data to a partitioned table or sharded tables.
I have done some experiments with sharded tables and partitioning. I created a partitioned table and loaded two days' worth of data (two partitions), and I created two sharded tables (individual tables). I then tried to pull the last two days' worth of data.
Full table - 27 sec
Partitioned table - 33 sec
Sharded tables - 91 sec
Please let me know which way is best. Based on the experiment, the full table returns results quickest, but a query against the full table scans all of the data.
Thanks,

From the GCP official documentation on Partitioning versus Sharding, you should use partitioned tables:
Partitioned tables perform better than tables sharded by date. When
you create date-named tables, BigQuery must maintain a copy of the
schema and metadata for each date-named table. Also, when date-named
tables are used, BigQuery might be required to verify permissions for
each queried table. This practice also adds to query overhead and
impacts query performance. The recommended best practice is to use
partitioned tables instead of date-sharded tables.
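As a rough sketch of the recommended approach (project, dataset, table, and column names here are invented for illustration), a date-partitioned table can be created and then queried so that only the partitions for the last week are scanned and billed:

```sql
-- Create a table partitioned on the DATE column event_date
-- (hypothetical project/dataset/table/column names).
CREATE TABLE `my_project.my_dataset.events_partitioned`
PARTITION BY event_date AS
SELECT * FROM `my_project.my_dataset.events`;

-- Only the last 7 days of partitions are scanned and billed,
-- because the filter is on the partitioning column.
SELECT *
FROM `my_project.my_dataset.events_partitioned`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);
```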

The difference in performance seems to be due to some background optimizations that have run on the non-partitioned table, but are yet to run on the partitioned table (since the data is newer).

Related

BigQuery: Max Date for Time Partition on Custom Date Column

I am currently working on the optimization of a huge table in Google's BigQuery. The table has approximately 19 billion records, resulting in a total size of 5.2 TB. In order to experiment with performance with regard to clustering and time partitioning, I duplicated the table with time partitioning on a custom DATE column, MyDate, which is frequently used in queries.
When performing a query with a WHERE clause (for instance, WHERE MyDate = "2022-08-08") on the time-partitioned table, the query is quicker and only reads around 20 GB, compared to the 5.2 TB read by the non-partitioned table. So far, so good.
My issue, however, arises when applying an aggregate function, in my case MAX(MyDate): the queries on the partitioned and the non-partitioned tables read the same amount of data and execute in roughly the same time. However, I would have expected the query on the partitioned table to be way quicker, as it only needs to scan a single partition.
There seem to be workarounds by fetching the dataset's metadata (information schema) as described here. However, I would like to avoid solutions like this as it adds complexity to our queries.
Is there a more elegant way to get the MAX of a time-partitioned BigQuery table based on a custom column, without scanning the whole table or fetching metadata from the information schema?
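For reference, the metadata workaround mentioned above looks roughly like this (project, dataset, and table names are placeholders); it avoids the full scan but, as noted, adds complexity to the queries:

```sql
-- Read the newest daily partition id from table metadata instead of
-- scanning the data (placeholder project/dataset/table names).
SELECT MAX(PARSE_DATE('%Y%m%d', partition_id)) AS max_date
FROM `my_project.my_dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'my_table'
  AND partition_id NOT IN ('__NULL__', '__UNPARTITIONED__');
```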

SAP HANA Partitioned Table Calculation View Running Slow in Comparison to Non-Partitioned Table Calculation View

I have a large table, close to 1 GB in size, and it is growing every week; it has a total of 190 million rows. I started getting alerts from HANA to partition this table, so I planned to partition it on a column that is frequently used in the WHERE clause.
My HANA system is a scale-out system with 8 nodes.
In order to compare the query performance of the partitioned table with that of the un-partitioned table, I created calculation views on top of the un-partitioned table and recorded the query performance.
I then partitioned the table using the HASH method, with the number of partitions equal to the number of servers, so that I would get a good data distribution across servers. I created a calculation view on top of it and recorded the query performance.
To my surprise, I found that the calculation view on the un-partitioned table performs better than the calculation view on the partitioned table.
This was a real shock; I am not sure why the calculation view on the non-partitioned table responds faster than the one on the partitioned table.
I have PlanViz output files, but I am not sure where to attach them.
Can you let me know why this is the behaviour?
Ok, this is not a straightforward question that can be answered correctly as such.
What I can do though is to list some factors that likely will play a role here:
a non-partitioned table needs a single access to the table structure, while the partitioned version requires at least one access for each partition
if the SELECT does not actually provide a WHERE condition that can be evaluated by the HASH function used for the partitioning, then all partitions always have to be evaluated and no partition pruning can take place.
HASH partitioning does not take any additional knowledge about the data into account, which means that similar data does not get stored together. This has a negative impact on data compression. Also, each partition requires its own set of value dictionaries for the columns where a single-partition/non-partitioned table only needs one dictionary per column.
You mentioned that you are using a scale-out system. If the table partitions are distributed across the different nodes, then every query will result in cross-node network communication. That is an additional workload and waiting time that simply does not exist with non-partitioned tables.
When joining partitioned tables each partition of the first table has to be joined with each partition of the second table, if no partition-wise join is possible.
There are other/more potential reasons for why a query against partitioned tables can be slower than against a non-partitioned table. All this is extensively explained in the SAP HANA Administration Guide.
As general guidance, tables should only be partitioned if that cannot be avoided and when the access pattern of the queries is well understood. It is definitely not a feature that you just "switch on" and everything will just work fine.
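To illustrate the partition-pruning point above: pruning can only happen when the WHERE condition restricts the partitioning column itself. A minimal sketch, with invented table and column names:

```sql
-- HASH-partition an (invented) transactions table on CUSTOMER_ID
-- across 8 partitions, matching the number of scale-out nodes.
CREATE COLUMN TABLE TRANSACTIONS (
    TX_ID       BIGINT,
    CUSTOMER_ID INTEGER,
    TX_DATE     DATE,
    AMOUNT      DECIMAL(15,2)
)
PARTITION BY HASH (CUSTOMER_ID) PARTITIONS 8;

-- Pruning possible: the filter is on the partitioning column.
SELECT * FROM TRANSACTIONS WHERE CUSTOMER_ID = 4711;

-- No pruning: TX_DATE is not the partitioning column,
-- so all 8 partitions have to be evaluated.
SELECT * FROM TRANSACTIONS WHERE TX_DATE = '2020-01-01';
```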

How to increase number of reducers during insert into partitioned clustered transactional table?

We have a clustered transactional table (10k buckets) which seems to be inefficient for the following two use cases
merges with daily deltas
queries based on date range.
What we want to do is to partition the table by date and thus create a partitioned clustered transactional table. The daily volume suggests the number of buckets should be around 1-3, but inserting into the newly created table produces only number_of_buckets reduce tasks, which is too slow and causes issues with merges on the reducers due to limited hard drive space.
Both issues are solvable (for instance, we could split the data into several chunks and start separate jobs to insert into the target table in parallel, using n_jobs*n_buckets reduce tasks, though that would result in several reads of the source table), but I believe there should be a right way to do this, so the question is: what is this right way?
P.S. Hive version: 1.2.1000.2.6.4.0-91
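For context, the target table described above would be declared roughly like this in Hive 1.2 (table name, columns, and the bucket count are illustrative); with classic ACID inserts, the number of reduce tasks is tied to the bucket count:

```sql
-- Illustrative DDL for a partitioned, clustered, transactional table.
-- ACID tables in Hive 1.2 must be bucketed and stored as ORC.
CREATE TABLE target_table (
    id      BIGINT,
    payload STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (id) INTO 3 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');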

Indexes vs Partitions in hive

How are indexes in Hive different from partitions? Both improve query performance as far as I know, so in what way do they differ?
In which situations would I use indexing, and in which partitioning?
Can I use them together?
Kindly suggest
Partitions allow users to store data files in different HDFS directories (based on a chosen column, date for example, if you want to organize your data files by date), thus minimizing the number of files to scan when users run queries.
Indexes help in fetching data faster, but they require index tables to be built, in which the indexed data is stored. This leads to storing the data twice.
Partition:
Imagine that you have a table keeping transactions created by your applications. This table gets bigger day by day.
If you partition this table on a daily interval, the database creates something like a separate table for each day, but you still see only one table. This makes your daily queries more efficient.
Index:
An index is used to access your table records quickly.
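A small sketch showing both features side by side (table, column, and index names are made up; note that Hive indexes were later removed in Hive 3.0):

```sql
-- Partitioned table: each dt value becomes its own HDFS directory.
CREATE TABLE transactions (
    id     BIGINT,
    amount DOUBLE
)
PARTITIONED BY (dt STRING);

-- Compact index on a non-partition column; the index data is kept
-- in a separate index table, so the indexed data is stored twice.
CREATE INDEX idx_transactions_id
ON TABLE transactions (id)
AS 'COMPACT'
WITH DEFERRED REBUILD;

ALTER INDEX idx_transactions_id ON transactions REBUILD;
```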

SQL Server 2012 - Columnstore Indexes - Reporting Solution

We (as a team) are in the process of putting together an audit reporting solution for a huge online transactional website.
Our auditing solution is to enable CDC on the source tables, track every change that happens to the objects, and push the changes into destination tables for reporting.
As of now we have a one-to-one mapping between source and destination tables.
There will be only inserts in the destination, and no updates or deletes.
So the audit tables will eventually grow larger than the actual source tables, as they keep a history of changes.
My plan is to flatten the destination tables into fewer tables based on subject/module, enable columnstore indexes, and then use those for reporting.
Is there any suggestion on the above approach, or is there an alternative?
I would recommend that you rather keep the table structure in a single table and have a look at Partitioned Tables and Indexes:
SQL Server supports table and index partitioning. The data of
partitioned tables and indexes is divided into units that can be
spread across more than one filegroup in a database. The data is
partitioned horizontally, so that groups of rows are mapped into
individual partitions. All partitions of a single index or table must
reside in the same database. The table or index is treated as a single
logical entity when queries or updates are performed on the data.
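As a hedged sketch of combining partitioning with a columnstore index for the reporting workload (all object names here are invented), a date-partitioned audit table might look like this. Note that SQL Server 2012 supports only nonclustered columnstore indexes, and a table with one becomes read-only, so new data is typically loaded via partition switching:

```sql
-- Monthly RANGE RIGHT partition function on the audit date
-- (invented object names throughout).
CREATE PARTITION FUNCTION pf_AuditDate (datetime2(0))
AS RANGE RIGHT FOR VALUES ('2013-01-01', '2013-02-01', '2013-03-01');

CREATE PARTITION SCHEME ps_AuditDate
AS PARTITION pf_AuditDate ALL TO ([PRIMARY]);

CREATE TABLE dbo.AuditHistory (
    AuditId    BIGINT IDENTITY NOT NULL,
    ChangeDate DATETIME2(0)    NOT NULL,
    ModuleId   INT             NOT NULL,
    OldValue   NVARCHAR(4000)  NULL,
    NewValue   NVARCHAR(4000)  NULL
) ON ps_AuditDate (ChangeDate);

-- Nonclustered columnstore index for reporting queries; the index is
-- automatically aligned with the table's partition scheme.
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_AuditHistory
ON dbo.AuditHistory (ChangeDate, ModuleId, OldValue, NewValue);
```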