Unsure if I cluster correctly. Basicly I am looking at GCP Billing Info of say 50 clients. Each client has a Billing_ID and I cluster on that billing_ID. I use the clustered table for a data studio dashboard
See the the SQL query below to see what I do right now
CREATE OR REPLACE TABLE `dashboardgcp`
PARTITION BY DATE(usage_start_time)
CLUSTER BY billing_account_id
AS
SELECT
*
FROM
`datagcp`
WHERE
usage_start_time BETWEEN TIMESTAMP('2019-01-01')
AND TIMESTAMP(CURRENT_DATE)
It is succesfully clustered like this, I am just not a noticeable query performance increase!
So I thought by clustering it with billing_ID I should see an increase in dashboard performance
Please consider the following points:
Cluster structure
A Cluster field is composed of an array of fields, like boxes, outer to inner, As state in BigQuery link
When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
This means As #Gordon wrote, in your query the WHERE part needs to start from the outer field to the inner one to make the most out of your cluster field. In your case, if the userId is part of the WHERE you need to change your cluster field to match this
Cluster limitation
Cluster typically works better for query who scan over 1GB of data, So if you are not scanning this amount of data you won't see the improvement you are looking for
Cluster with Ingestion tables
Assuming your dara is not static and you keep adding data to your table, datagcp, you need to be aware that cluster indexing is a process which BigQuery perform off-line to the insert operation and a separate one to partitioning.
The side effect is that your clustering "weaken" over-time. To solve this you will need to use merge command to re-build your cluster in order to get the most out of your cluster
From the docs:
“Over time, as more and more operations modify a table, the degree to which the data is sorted begins to weaken, and the table becomes partially sorted”.
Related
I'm comparing the performance of clustering with that of partitioning.
Comparing a partitioned table with a clustered table, the accessed data size of the clustered table is sometimes bigger than that of the partitioned table. (e.g., clustering 122.4MB vs partitioning 35.6MB)
I expect this is due to the limitation of the cluster's minimum data size.
Is there any way to know the limit? Or is there any other
cause of the difference of accessed data size?
Edit
I found the posts 1, 2 by ex-Google.
Post 2 said that "each cluster of data in BigQuery has a minimum size.", and Post 1 said that "If you have less than 100MB of data per day, clustering won't do much for you".
From these posts, I inferred that the cause of the large size of the clustered table is a minimum size of a cluster.
Clusters are not like partitions. In fact there is no guarantee that there will be one cluster per column value (or if you use multiple columns for each combination of them). This is also why BigQuery cannot give you a good estimation of how much data the query will use before running it (like it does for partitions). Meanwhile, different partitions use different memory blocks.
Also, consider that BigQuery perform Auto-clustering (for free) therefore changing all the clusters. This is done so that the table will have more efficient clusters. This is required because when you insert/delete data the clusters results in very skewed clusters resulting in inefficient queries. This will results in data scanned by the same query even if data has not been inserted/deleted if in between BigQuery performed auto-clustering.
Another effect of this implementation is that a single table have a maximum number of partitions (4000). However, you do not have any restriction on the number of keys used for clustering.
So, single clusters in BigQuery may contains multiple clustering values and the underling clustered data blocks may change automatically due to auto-clustering.
When specifying TIMESTAMP column as partition - The data is saved on the disk by the partition allows each access.
Now, BigQuery allows to also define up to 4 columns which will used as cluster field.
If I get it correctly the partition is like PK and the cluster fields are like indexes.
So this means that the cluster fields has nothing to do with how records are saved on the disk?
If I get it correctly the partition is like PK
This is not correct, Partition is not used to identify a row in the table rather enable BigQuery to Store each partitioned data in a different segment so when you scan a table by Partition you ONLY scan the specified partitions and thus reduce your scanning cost
cluster fields are like indexes
This is correct cluster fields are used as pointers to records in the table and enable quick/minimal cost access to data regardless to the partition. This means Using cluster fields you can query a table cross partition with minimal cost
I like #Felipe image from his medium post which gives nice visualization on how data is stored.
Note: Partitioning happens on the time of the insert while clustering happens as a background job performed by BigQuery
I have large size table , close to 1 GB and the size of this table is growing every week, it has total rows as 190 millions, I started getting alerts from HANA to partition this table, so I planned to partition this with column which is frequently used in Where clause.
My HANA System is scale out system with 8 nodes.
In order to compare the partition query performance difference with this un-partitioned table, I created calculation views on top of this un-partitioned table and recorded the query performance.
I partitioned this table using HASH method and by number of servers, and recorded the query performance. By this way I would have good data distribution across servers. I created calculation view and recorded query performance.
To my surprise I have found that my un-partitioned table calculation view query is performing better in comparison to partitioned table calculation view.
This was really shock. Not sure why non partitioned table Calculation view responding better to partitioned table Calculation view.
I have plan viz output files but not sure where to attach it.
Let me know why this is the behaviour?
Ok, this is not a straight-forward question that can be answered correctly as such.
What I can do though is to list some factors that likely will play a role here:
a non-partitioned table needs a single access to the table structure while the partitioned version requires at least one access for each partition
if the SELECT is not actually providing a WHERE condition that can be evaluated by the HASH function used for the partitioning, then all partitions always have to be evaluated and no partition pruning can take place.
HASH partitioning does not take any additional knowledge about the data into account, which means that similar data does not get stored together. This has a negative impact on data compression. Also, each partition requires its own set of value dictionaries for the columns where a single-partition/non-partitioned table only needs one dictionary per column.
You mentioned that you are using a scale-out system. If the table partitions are distributed across the different nodes, then every query will result in cross-node network communication. That is an additional workload and waiting time that simply does not exist with non-partitioned tables.
When joining partitioned tables each partition of the first table has to be joined with each partition of the second table, if no partition-wise join is possible.
There are other/more potential reasons for why a query against partitioned tables can be slower than against a non-partitioned table. All this is extensively explained in the SAP HANA Administration Guide.
As a general guidance, tables should only be partitioned if that cannot be avoided and when the access pattern of queries are well understood. It is definitively not a feature that you just "switch on" and everything will just work fine.
How small should a table using Diststyle ALL be in Amazon Redshift?
It says here: http://dwbitechguru.blogspot.com/2014/11/performance-tuning-in-amazon-redshift.html
that for vey small tables, redshift should use diststyle ALL instead of EVEN or KEY. How Small is small? If I was to specify a row number in the where clause of the query: select relname, reldiststyle from pg_class how many rows should I specify?
It really depends on the cluster size you are using. DISTSTYLE ALL will copy the data of your table to all nodes - to mitigate data transfer requirement across nodes. You can find out the size of your table and Redshift nodes available size, if you can afford to copy table multiple times per node, do it!
Also, if you have a requirement of joining other tables with this table very very frequently, like in 70% of your queries, I believe it is worth the space if you want better query performance.
If your Join keys across tables are same in terms of cardinality, then you can also afford to distribute all tables on that key so that similar keys lie in same node which will obviate replication of data.
I would suggest trying out the two options above, and comparing average query run times of around 10 queries and then come to a decision.
By considering a Star Schema, the distribution style All is normally used for dimension tables. Doing this have the advantage to speed up joins, let's explain this through an example. If we would like to obtain the quantity saled of each product by country, we would require to join the fact_sales with the dim_store table on store_id key.
So, setting diststyle all on dim_store enable us to do a JOIN result in parallel compared to the disvantage of shuffling when enabling diststyle even. However, you can let Redshift automatically handle an optimal distribution style by setting distyle auto, for more info check this link.
When i am running queries in VirtualBox Sandbox with hive. I feel Select count(*) is too much slower than the Select *.
Can anyone explain what is going on behind?
And why this delay is happening?
select * from table
It can be a Map only job But
Select Count(*) from table
It can be a Map and Reduce job
Hope this helps.
There are three types of operations that a hive query can perform.
In order of cheapest and fastest to more expensive and slower here they are.
A hive query can be a metadata only request.
Show tables, describe table are examples. In these queries the hive process performs a lookup in the metadata server. The metadata server is a SQL database, probably MySQL, but the actual DB is configurable.
A hive query can be an hdfs get request.
Select * from table, would be an example. In this case hive can return the results by performing an hdfs operation. hadoop fs -get, more or less.
A hive query can be a Map Reduce job.
Hive has to ship the jar to hdfs, the jobtracker queues the tasks, the tasktracker execute the tasks, the final data is put into hdfs or shipped to the client.
The Map Reduce job has different possibilities as well.
It can be a Map only job.
Select * from table where id > 100 , for example all of that logic can be applied on the mapper.
It can be a Map and Reduce job,
Select min(id) from table;
Select * from table order by id ;
It can also lead to multiple map Reduce passes, but I think the above summarizes some behaviors.
This is because the DB is using clustered primary keys so the query searches each row for the key individually, row by agonizing row, not from an index.
Run optimize table. This will ensure that the data pages are
physically stored in sorted order. This could conceivably speed up a
range scan on a clustered primary key.
create an additional non-primary index on just the change_event_id
column. This will store a copy of that column in index pages which be
much faster to scan. After creating it, check the explain plan to
make sure it's using the new index