How small should a table using Diststyle ALL be in Amazon Redshift?
It says here: http://dwbitechguru.blogspot.com/2014/11/performance-tuning-in-amazon-redshift.html
that for very small tables, Redshift should use DISTSTYLE ALL instead of EVEN or KEY. How small is small? If I were to put a row-count threshold in the WHERE clause of the query select relname, reldiststyle from pg_class, how many rows should I specify?
It really depends on the size of your cluster. DISTSTYLE ALL copies your table's data to every node - which removes the need to transfer that data across nodes at query time. Find out the size of your table and the storage available on your Redshift nodes; if you can afford to keep a copy of the table on every node, do it!
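A quick way to check, assuming you have access to the system views (the table name below is just a placeholder):

-- size is reported in 1 MB blocks; tbl_rows is the total row count
SELECT "table", diststyle, size, tbl_rows
FROM svv_table_info
WHERE "table" = 'my_dim_table';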
Also, if you need to join other tables with this table very frequently - say in 70% of your queries - I believe the extra space is worth it for the better query performance.
If your join keys across tables have similar cardinality, you can instead distribute all the tables on that key, so that matching keys land on the same node and no replication of data is needed.
I would suggest trying out the two options above, comparing the average run times of around 10 queries, and then coming to a decision.
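One hedged way to compare run times is to pull them from the system tables after running each variant; the LIKE filter below is purely illustrative:

-- average runtime in ms over the last hour for queries touching the table
SELECT AVG(DATEDIFF(ms, starttime, endtime)) AS avg_ms
FROM stl_query
WHERE querytxt LIKE '%dim_store%'
  AND starttime > DATEADD(hour, -1, GETDATE());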
Considering a star schema, distribution style ALL is normally used for dimension tables. This has the advantage of speeding up joins. Let's explain this through an example: if we want to obtain the quantity sold of each product by country, we need to join fact_sales with the dim_store table on the store_id key.
So, setting diststyle all on dim_store lets the join run in parallel on every node, compared to the disadvantage of shuffling when using diststyle even. However, you can also let Redshift pick an optimal distribution style automatically by setting diststyle auto; for more info check this link.
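A minimal sketch of the two alternatives, with column definitions assumed purely for illustration:

CREATE TABLE dim_store (
    store_id INT,
    country  VARCHAR(64)
)
DISTSTYLE ALL;

-- or let Redshift decide:
-- CREATE TABLE dim_store (store_id INT, country VARCHAR(64)) DISTSTYLE AUTO;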
BigQuery cost scenarios
When I query a large unioned table - partitioned by a date field and clustered by a clientkey field - for a specific client's data, it appears to process more data than if I just queried that client's table individually. It's the same query over what should be the exact same data, yet the cost is massively different.
Does anyone know why it costs more to query a partitioned/clustered unioned table compared to the same data from the individual client-specific table?
I'm trying to make the case for still keeping this data unioned and partitioned+clustered as opposed to individual datasets! Thanks!
There is a factor which may affect your scenario; however, this factor is not part of any contract, so this answer may become irrelevant over time.
The assumptions are:
the partitioned table is clustered
the individual table is also clustered
the query utilized clustering and touched only a small amount of data (compared with the cluster block size)
In that case, the cluster block size might affect the cost. Since the individual table is much smaller than the partitioned table, it tends to have a smaller cluster block size. The query is ultimately charged by the combined size of the blocks that get scanned.
Need help in optimizing the query below. Please suggest.
Db: Redshift
Sort keys:
order: install_ts
order_item: install_ts
sub_order: install_ts
sub_order_item: install_ts
Dist key: not added
Query (extracting only selected columns from the tables below, not all):
select *,
       rank() OVER (PARTITION BY oo.id, oi.id
                    ORDER BY i.suborder_id DESC) AS suborder_rank
FROM order_item oi
LEFT JOIN order oo ON oo.id = oi.order_id
LEFT JOIN sub_order_item i ON i.order_item_id = oi.id
LEFT JOIN sub_order s ON i.suborder_id = s.id
WHERE
(
    (oo.update_ts between '2021-04-13 18:29:59' and '2021-04-14 18:29:59' AND oo.create_ts >= current_date-500)
    OR oo.create_ts between '2021-04-13 18:29:59' and '2021-04-14 18:29:59'
    OR oi.create_ts between '2021-04-13 18:29:59' and '2021-04-14 18:29:59'
    OR (oi.update_ts between '2021-04-13 18:29:59' and '2021-04-14 18:29:59' AND oi.create_ts >= current_date-500)
)
Without knowing your data in more detail and the full query load you want to optimize, it will be difficult to get things exactly right. However, I see a number of things in this query that are POTENTIAL areas for significant performance issues. I do this work for many clients, and while the optimization methods in Redshift differ from other databases, there are a number of things you can do.
First off you need to remember that Redshift is a networked cluster of computers acting as a coherent database. This allows it to scale to very large database sizes and achieve horizontally scalable compute performance. It also means that there are network cables in play in the middle of your query.
This leads to the issue I see most often - massive amounts of data being redistributed across these network cables during the query. In your case you have not assigned a distribution key, which leads to random distribution of your data. But when you perform those joins on "id", all the rows with the same id need to be moved to one slice (compute element) of the Redshift cluster. Basically all your data is being transferred around your cluster, and this takes time. Assigning a common distribution key that is also a common join (or group by) column, such as id, will greatly reduce this network traffic.
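A hedged sketch, assuming id / order_id really is your dominant join column (order is a reserved word, hence the quotes):

-- redistribute the tables so joining rows already sit on the same slice
ALTER TABLE "order" ALTER DISTKEY id;
ALTER TABLE order_item ALTER DISTKEY order_id;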
The next most common issue is scanning (or writing) too much data from (or to) disk. The disks on Redshift are fast, but the data is often massive and still has to move through cables (faster than the network, but it still takes time). Redshift helps here by storing metadata alongside your data on disk, and when it can, it will skip reading unneeded data based on your WHERE clauses. In your case your WHERE clauses do reduce the needed rows (I hope), which is good - but they filter on columns that are not the sort key. If these columns correlate well with the sort key this may not be a problem, but if they don't, you could be reading far more data than you need. You will want to look at the metadata for these tables to see how well your sort keys are working for you. Also, Redshift only reads the columns you reference in your query, so if you don't ask for a column its data is not read from disk. Having "*" in your select reads all the columns - if you don't need all of them, think about listing only the needed columns.
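A hedged way to check this from the system views (table names taken from your sort key list above):

-- unsorted = % of rows out of sort order; stats_off = staleness of table statistics
SELECT "table", unsorted, stats_off
FROM svv_table_info
WHERE "table" IN ('order', 'order_item', 'sub_order', 'sub_order_item');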
If your query works on more data than fits in memory during execution, the extra data needs to spill to disk. This creates writes of the swapped data to disk and then reads of it back in, and it can happen multiple times during the query. I don't know whether this is happening in your case, but you will want to check whether your query is spilling.
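For example (a sketch - replace 12345 with your query id from stl_query):

-- any row with is_diskbased = 't' means that step spilled to disk
SELECT query, step, rows, workmem, is_diskbased
FROM svl_query_summary
WHERE query = 12345
  AND is_diskbased = 't';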
The thing that impacts everything else is issues in the query itself, and these take a number of forms. If your joins are not properly qualified, the amount of data can explode, making the two issues above much worse. Window functions can be expensive, especially when they partition by something other than the distribution key, but also when they operate on more data than is needed. There are many others, but these are two areas that look like they may apply to your query - I just cannot tell from the info provided.
I'd start by looking at the explain plan and the actual execution information from the system tables. Big numbers for rows, cost, or time will lead you to the areas that likely need optimization. Like I said, I do this work for customers all the time and can help here as time allows, but much more information would be needed to understand what is really impacting your query.
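As a starting point, Redshift also logs optimizer alerts (missing statistics, nested loops, and so on) that often point straight at the problem; a hedged sketch:

-- recent alerts with the suggested fix for each
SELECT query, trim(event) AS event, trim(solution) AS solution
FROM stl_alert_event_log
ORDER BY event_time DESC
LIMIT 20;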
I'm comparing the performance of clustering with that of partitioning.
Comparing a partitioned table with a clustered table, the accessed data size of the clustered table is sometimes bigger than that of the partitioned table. (e.g., clustering 122.4MB vs partitioning 35.6MB)
I expect this is due to the limitation of the cluster's minimum data size.
Is there any way to know the limit? Or is there any other cause for the difference in accessed data size?
Edit
I found posts 1 and 2 by an ex-Googler.
Post 2 said that "each cluster of data in BigQuery has a minimum size.", and Post 1 said that "If you have less than 100MB of data per day, clustering won't do much for you".
From these posts, I inferred that the cause of the large size of the clustered table is a minimum size of a cluster.
Clusters are not like partitions. In fact, there is no guarantee that there will be one cluster per column value (or, if you cluster on multiple columns, per combination of values). This is also why BigQuery cannot give you a good estimate of how much data a query will use before running it (as it does for partitions). Different partitions, meanwhile, use separate storage blocks.
Also, consider that BigQuery performs auto-clustering (for free), which reorganizes all the clusters. This is done so that the table keeps efficient clusters: as you insert and delete data, the clusters become skewed, which makes queries inefficient. A consequence is that the amount of data scanned by the very same query can change even when no data has been inserted or deleted, if BigQuery performed auto-clustering in between.
Another effect of this implementation is that a single table has a maximum number of partitions (4,000), while there is no such restriction on the number of distinct values in the clustering keys. So a single cluster in BigQuery may contain multiple clustering values, and the underlying clustered data blocks may change automatically due to auto-clustering.
Unsure if I cluster correctly. Basically, I am looking at GCP billing info for, say, 50 clients. Each client has a billing_account_id and I cluster on that billing_account_id. I use the clustered table for a Data Studio dashboard.
See the SQL query below for what I do right now:
CREATE OR REPLACE TABLE `dashboardgcp`
PARTITION BY DATE(usage_start_time)
CLUSTER BY billing_account_id
AS
SELECT
*
FROM
`datagcp`
WHERE
usage_start_time BETWEEN TIMESTAMP('2019-01-01')
AND TIMESTAMP(CURRENT_DATE)
It is successfully clustered like this; I'm just not seeing a noticeable query performance increase!
I thought that by clustering on billing_account_id I would see an increase in dashboard performance.
Please consider the following points:
Cluster structure
A cluster field is composed of an array of fields, like boxes, outer to inner. As stated in the BigQuery documentation:
When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
This means, as #Gordon wrote, that the WHERE part of your query needs to filter from the outer field to the inner one to make the most of your cluster field. In your case, if userId is part of the WHERE, you need to change your cluster field to match this, as sketched below.
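A hedged sketch - user_id here is hypothetical, standing in for whatever inner field your dashboard filters on:

CREATE OR REPLACE TABLE `dashboardgcp`
PARTITION BY DATE(usage_start_time)
CLUSTER BY billing_account_id, user_id  -- outer to inner, matching the WHERE
AS
SELECT * FROM `datagcp`;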
Cluster limitation
Clustering typically works better for queries that scan over 1GB of data, so if you are not scanning that much data you won't see the improvement you are looking for.
Cluster with Ingestion tables
Assuming your data is not static and you keep adding data to your table, datagcp, you need to be aware that cluster indexing is a process BigQuery performs offline, separately from the insert operation and separately from partitioning.
The side effect is that your clustering "weakens" over time. To solve this, you will need to use the MERGE command to re-build your clusters in order to get the most out of them.
From the docs:
“Over time, as more and more operations modify a table, the degree to which the data is sorted begins to weaken, and the table becomes partially sorted”.
I am joining two large tables in Hive (one is over 1 billion rows, one is about 100 million rows) like so:
create table joinedTable as select t1.id, ... from t1 join t2 ON (t1.id = t2.id);
I have bucketed the two tables in the same way, clustering by id into 100 buckets for each, but the query is still taking a long time.
Any suggestions on how to speed this up?
Since you bucketed the data by the join key, you could use the Bucket Map Join. For that, the amount of buckets in one table must be a multiple of the amount of buckets in the other table. It can be activated by executing set hive.optimize.bucketmapjoin=true; before the query. If the tables don't meet the conditions, Hive will simply perform the normal inner join. A sketch of the setup is shown below.
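For illustration, a minimal sketch with hypothetical column lists, where one bucket count is a multiple of the other (the question's tables both use 100 buckets, which also qualifies):

CREATE TABLE t1 (id BIGINT, payload STRING) CLUSTERED BY (id) INTO 100 BUCKETS;
CREATE TABLE t2 (id BIGINT, payload STRING) CLUSTERED BY (id) INTO 50 BUCKETS;

set hive.optimize.bucketmapjoin=true;
-- the join itself is written exactly as before:
SELECT t1.id FROM t1 JOIN t2 ON (t1.id = t2.id);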
If both tables have the same amount of buckets and the data is sorted by the bucket keys, Hive can perform the faster Sort-Merge Join. To activate it, you have to execute the following commands:
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
You can find some visualizations of the different join techniques under https://cwiki.apache.org/confluence/download/attachments/27362054/Hive+Summit+2011-join.pdf.
As I see it, the answer is a bit more complicated than what #Adrian Lange offered.
First you must understand a very important difference between BucketJoin and Sort-Merge Bucket Join (SMBJ):
To perform a bucketjoin, "the amount of buckets in one table must be a multiple of the amount of buckets in the other table", as stated before, and in addition hive.optimize.bucketmapjoin must be set to true.
When you issue a join, Hive will convert it into a bucketjoin if the above condition holds - BUT pay attention: Hive will not enforce the bucketing! This means that creating the table as bucketed isn't enough for the table to actually be bucketed into the specified amount of buckets, as Hive doesn't enforce this unless hive.enforce.bucketing is set to true (in which case the amount of buckets is actually set by the amount of reducers in the final stage of the query inserting data into the table).
On the performance side, please note that when using a bucketjoin, a single task reads the "smaller" table into the distributed cache before the mappers access it and do the join - this stage would probably be very, very long and ineffective when your table has ~100m rows!
Afterwards the join is done the same as in a regular join, in the reducers.
To perform an SMBJ, both tables have to have the exact same amount of buckets, on the same columns, and sorted by these columns, in addition to setting hive.optimize.bucketmapjoin.sortedmerge to true.
As in the previous optimization, Hive doesn't enforce the bucketing and the sorting, but rather assumes you made sure the tables are actually bucketed and sorted (not only by definition, but by setting hive.enforce.sorting or by manually sorting the data while inserting it) - this is very important, as it may lead to wrong results in both cases.
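A hedged sketch of a setup that satisfies those conditions, with illustrative column lists:

-- make Hive actually enforce bucketing and sorting on insert
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

CREATE TABLE t1 (id BIGINT, payload STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 100 BUCKETS;
CREATE TABLE t2 (id BIGINT, payload STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 100 BUCKETS;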
On the performance side, this optimization is way more efficient, for the following reasons:
Each mapper reads both buckets and there is no single task contention for distributed cache loading
The join being performed is a merge-sort join, as the data is already sorted, which is far more efficient.
Please note the following considerations:
in both cases, set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat; should be executed
in both cases, a /*+ MAPJOIN(b) */ hint should be applied in the query (right after the SELECT, where b is the smaller table), as sketched below
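A hedged sketch of the hint placement, using the tables from the question (with t2 assumed to be the smaller one):

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

SELECT /*+ MAPJOIN(t2) */ t1.id
FROM t1 JOIN t2 ON (t1.id = t2.id);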
How many buckets?
This should be viewed from the following angle: the consideration should apply mainly to the bigger table, as it has the most impact here, and the resulting configuration is then applied to the smaller table as a must. As a rule of thumb, each bucket should contain between 1 and 3 blocks, probably somewhere near 2 blocks. So if your block size is 256MB, it seems reasonable to me to have ~512MB of data in each bucket of the bigger table, and it becomes a simple division issue.
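For example, under those assumptions: with a 256MB block size and a ~512MB-per-bucket target, a bigger table of roughly 50GB would come out to about 50GB / 512MB ≈ 100 buckets - which happens to match the bucket count used in the question.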
Also, don't forget that these optimizations alone won't always guarantee a faster query time.
Let's say you choose to do an SMBJ: this adds the cost of sorting the two tables prior to running the join - so the more times you run your query, the less you are "paying" for this sorting stage.
Sometimes a simple join will lead to the best performance, none of the above optimizations will help, and you will have to optimize the regular join process either at the application/logical level or by tuning MapReduce/Hive settings like memory usage, parallelism, etc.
I don't think "the amount of buckets in one table must be a multiple of the amount of buckets in the other table" is a must criterion for bucket map join; we can have the same number of buckets as well.