Hive - Efficient join of two tables - optimization

I am joining two large tables in Hive (one is over 1 billion rows, one is about 100 million rows) like so:
create table joinedTable as select t1.id, ... from t1 join t2 ON (t1.id = t2.id);
I have bucketed the two tables in the same way, clustering by id into 100 buckets for each, but the query is still taking a long time.
Any suggestions on how to speed this up?

Since you bucketed the data by the join keys, you could use a Bucket Map Join. For that, the number of buckets in one table must be a multiple of the number of buckets in the other table. It can be activated by executing set hive.optimize.bucketmapjoin=true; before the query. If the tables don't meet the conditions, Hive will simply perform a normal inner join.
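For example, a minimal sketch (table definitions and column names are illustrative) in which the bucket counts satisfy the multiple condition:

create table t1 (id bigint, val string)
clustered by (id) into 100 buckets;

create table t2 (id bigint, val string)
clustered by (id) into 50 buckets;   -- 100 is a multiple of 50

set hive.optimize.bucketmapjoin=true;
create table joinedTable as
select t1.id, t1.val, t2.val as val2
from t1 join t2 on (t1.id = t2.id);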
If both tables have the same number of buckets and the data is sorted by the bucket keys, Hive can perform the faster Sort-Merge Join. To activate it, you have to execute the following commands:
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
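A sketch of table definitions that would meet these conditions (names are illustrative):

create table t1 (id bigint, val string)
clustered by (id) sorted by (id) into 100 buckets;

create table t2 (id bigint, val string)
clustered by (id) sorted by (id) into 100 buckets;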
You can find some visualizations of the different join techniques under https://cwiki.apache.org/confluence/download/attachments/27362054/Hive+Summit+2011-join.pdf.

As I see it, the answer is a bit more complicated than what @Adrian Lange offered.
First you must understand a very important difference between a bucket join and a Sort-Merge Bucket Join (SMBJ):
To perform a bucket join, "the number of buckets in one table must be a multiple of the number of buckets in the other table", as stated before, and in addition hive.optimize.bucketmapjoin must be set to true.
When you issue a join, Hive will convert it into a bucket join if the above condition holds, BUT pay attention: Hive will not enforce the bucketing! This means that creating the table as bucketed isn't enough for the data to actually be bucketed into the specified number of buckets, since Hive doesn't enforce this unless hive.enforce.bucketing is set to true (which makes Hive set the number of reducers in the final stage of the query inserting data into the table equal to the number of buckets).
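For example (staging_t1 is a hypothetical source table), enforcing the bucketing while loading:

-- one reducer per bucket, so the data actually lands in 100 buckets
set hive.enforce.bucketing=true;
insert overwrite table t1
select id, val from staging_t1;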
On the performance side, please note that when using a bucket join, a single task reads the "smaller" table into the distributed cache before the mappers access it and do the join. This stage would probably be very, very long and ineffective when your table has ~100m rows!
Afterwards the join will be done the same as in a regular join done in the reducers.
To perform an SMBJ, both tables have to have the exact same number of buckets, on the same columns, and be sorted by these columns, in addition to setting hive.optimize.bucketmapjoin.sortedmerge to true.
As in the previous optimization, Hive doesn't enforce the bucketing and the sorting, but rather assumes you made sure that the tables are actually bucketed and sorted (not only by definition, but by setting hive.enforce.sorting or by manually sorting the data while inserting it). This is very important, as neglecting it may lead to wrong results in both cases.
On the performance side, this optimization is far more efficient, for the following reasons:
Each mapper reads both buckets directly, so there is no single-task contention for distributed cache loading.
The join being performed is a sort-merge join, as the data is already sorted, which is much more efficient.
Please note the following considerations:
in both cases, set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat; should be executed before the query
in both cases, a /*+ MAPJOIN(b) */ hint should be applied in the query (right after the select keyword, where b is the smaller table), as sketched below
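For example (a sketch, with b as the alias of the smaller table):

select /*+ MAPJOIN(b) */ a.id, a.val, b.val as val2
from big_table a
join small_table b on (a.id = b.id);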
How many buckets?
This should be viewed from the following angle: the consideration should apply strictly to the bigger table, as it has more impact in this regard, and later the same configuration will be applied to the smaller table as a must. As a rule of thumb, each bucket should contain between 1 and 3 blocks, probably somewhere near 2 blocks. So if your block size is 256MB, it seems reasonable to me to have ~512MB of data in each bucket of the bigger table, so this becomes a simple division issue.
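For example, with a 256MB block size and a ~512MB-per-bucket target, a bigger table holding roughly 1TB of data would come out to about 1TB / 512MB ≈ 2000 buckets (these numbers are purely illustrative).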
Also, don't forget that these optimizations alone won't always guarantee a faster query time.
Let's say you choose to do an SMBJ. This adds the cost of sorting the 2 tables prior to running the join, so the more times you run your query, the less you are "paying" for this sorting stage.
Sometimes a simple join will lead to the best performance, none of the above optimizations will help, and you will have to optimize the regular join process either at the application/logical level or by tuning MapReduce/Hive settings like memory usage, parallelism etc.

I don't think "the number of buckets in one table must be a multiple of the number of buckets in the other table" is a must criterion for a bucket map join; we can have the same number of buckets as well.

Related

Unioned table costs more to query in BigQuery than individual tables?

BigQuery cost scenarios
When I query a large unioned table - partitioned by a date field and clustered by a clientkey field - for a specific client's data, it appears to process more data than if I just queried that client's table individually. It is the same query and should be the exact same data from different tables, yet the cost is massively different.
Does anyone know why it costs more to query a partitioned/clustered unioned table compared to the same data from the individual client-specific table?
I'm trying to make the case for still keeping this data unioned and partitioned+clustered as opposed to individual datasets! Thanks!
There is a factor which may affect your scenario; however, this factor is not a contract, so this answer may become irrelevant over time.
The assumptions are:
the partitioned table is clustered
the individual table is also clustered
the query utilized clustering and touched only a small amount of data (compared with the cluster block size)
In such a case, the cluster block size might affect the cost. Since the individual table is much smaller than the partitioned table, the individual table tends to have a smaller cluster block size. The query is eventually charged by the combined size of the blocks that get scanned.

Redshift : Query Optimisation

Need help in optimizing the below query. Please suggest.
Db: Redshift
Sort keys:
order: install_ts
order_item: install_ts
suborder: install_ts
suborder_item: install_ts
Dist key: not added
Query (only selected columns are extracted from the tables below, not all):
SELECT *,
       RANK() OVER (PARTITION BY oo.id, oi.id
                    ORDER BY i.suborder_id DESC) AS suborder_rank
FROM order_item oi
LEFT JOIN order oo ON oo.id = oi.order_id
LEFT JOIN sub_order_item i ON i.order_item_id = oi.id
LEFT JOIN sub_order s ON i.suborder_id = s.id
WHERE
(
  (oo.update_ts BETWEEN '2021-04-13 18:29:59' AND '2021-04-14 18:29:59' AND oo.create_ts >= current_date - 500)
  OR oo.create_ts BETWEEN '2021-04-13 18:29:59' AND '2021-04-14 18:29:59'
  OR oi.create_ts BETWEEN '2021-04-13 18:29:59' AND '2021-04-14 18:29:59'
  OR (oi.update_ts BETWEEN '2021-04-13 18:29:59' AND '2021-04-14 18:29:59' AND oi.create_ts >= current_date - 500)
);
Without knowing your data in more detail, and without knowing the full query load you want to optimize, it is difficult to get things exactly right. However, I see a number of things in this query that are POTENTIAL areas for significant performance issues. I do this work for many clients, and while the optimization methods in Redshift differ from other databases, there are a number of things you can do.
First off you need to remember that Redshift is a networked cluster of computers acting as a coherent database. This allows it to scale to very large database sizes and achieve horizontally scalable compute performance. It also means that there are network cables in play in the middle of your query.
This leads to the issue I see most often: massive amounts of data being redistributed across these network cables during the query. In your case you have not assigned a distribution key, which will lead to random distribution of your data. But when you perform those joins based on "id", all the rows with the same id need to be moved to one slice (compute element) of the Redshift cluster. Basically all your data is being transferred around your cluster, and this takes time. Assigning a common distribution key that is also a common join (or group by) column, such as id, will greatly reduce this network traffic.
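As a sketch using the table names from the question (which joins are worth co-locating depends on your full workload; note that order may need quoting, as it is a reserved word):

-- co-locate the two sides of the main join on the same slice
alter table "order" alter distkey id;
alter table order_item alter distkey order_id;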
The next most common issue is scanning (or writing) too much data from (or to) disk. The disks on Redshift are fast, but the data is often massive, and it still moves through cables (faster than the network, but it still takes time). Redshift helps here by storing metadata alongside your data on disk; when it can, it will skip reading unneeded data based on your WHERE clauses. In your case you have WHERE clauses that are reducing the needed rows (I hope), which is good, but you are filtering on columns that are not the sort key. If these columns correlate well with the sort key this may not be a problem, but if they don't, you could be reading more data than you need. You will want to look at the metadata for these tables to see how well your sort keys are working for you. Also, Redshift only reads the columns you reference in your query, so if you don't ask for a column, its data is not read from disk. Having "*" in your SELECT reads all the columns; if you don't need them all, think about listing only the needed columns.
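For instance (a sketch; the column list is illustrative, and order is quoted since it is a reserved word), selecting only the needed columns and adding a predicate on the sort key lets Redshift skip both columns and blocks:

SELECT oi.id, oi.order_id, oo.create_ts       -- only the columns you need
FROM order_item oi
LEFT JOIN "order" oo ON oo.id = oi.order_id
WHERE oi.install_ts >= '2021-04-13'           -- predicate on the sort key
  AND oi.create_ts BETWEEN '2021-04-13 18:29:59' AND '2021-04-14 18:29:59';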
If your query works on data that doesn't fit in memory during execution, the extra data will need to spill to disk. This creates writes of the swapped data to disk, and then reads of it back in, and this can happen multiple times during the query. I don't know if this is happening in your case, but you will want to check whether you are spilling during your query.
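One way to check is the is_diskbased flag in the SVL_QUERY_SUMMARY system view (a sketch; substitute your own query id):

-- steps with is_diskbased = 't' spilled to disk
select query, step, rows, workmem, is_diskbased
from svl_query_summary
where query = 123456   -- your query id
order by step;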
The thing that impacts everything else is issues in the query itself, and these take a number of forms. If your joins are not properly qualified, the amount of data can explode, making the two issues above much worse. Window functions can be expensive, especially when the partition differs from the distribution key, but also when the amount of data being operated on is larger than needed. There are many others, but these are two possible areas that look like they may apply to your query; I just cannot tell from the info provided.
I'd start by looking at the explain plan and the actual execution information from the system tables. Big numbers for rows, cost, or time will lead you to the areas that likely need optimization. Like I said, I do this work for customers all the time and can help here as time allows, but much more information will be needed to understand what is really impacting your query.

How to partition 10 billion row SQL tables quickly using AWS?

I have a SQL database of data delivered in a normalized format with several tables that have several billions of rows of data. I have decided to partition the large tables into separate tables by itemId since when I query the data I only care about 1 item at a time. I would end up having 5000+ tables at the end after partitioning the data. The problem is, partitioning the data takes about 25 minutes to build a single table for 1 item.
5000 items x 25 minutes = 86.8 days
It would take over 86 days to fully partition my entire SQL database. My entire database is about 2.5TB.
Is this something I can leverage AWS for, to parallelize at the item level? Can I use AWS Database Migration Service to host the database in its current form, and then use an AWS process to churn through all of the 5000 queries to partition the big tables into 5000 smaller tables with 2M rows each?
If not, is this something I just have to throw more hardware at to make it run faster (CPU or RAM)?
Thanks in advance.
This doesn't seem like a good strategy. For one thing, simple arithmetic shows that 10,000,000,000 rows at 5,000 rows per item results in 2,000,000 partitions in the table.
The limit in Redshift (by default) is 1,000,000 partitions per table:
Amazon Redshift Spectrum has the following quotas when using the Athena or AWS Glue data catalog:
A maximum of 10,000 databases per account.
A maximum of 100,000 tables per database.
A maximum of 1,000,000 partitions per table.
A maximum of 10,000,000 partitions per account.
You should re-think your partitioning strategy. Or perhaps your problem is not suitable for Redshift. There may be other database strategies more suitable for your use-case. (This is not the forum for recommending specific software solutions, however.)
Use the itemid as the sortkey and distkey. If the table is vacuumed properly and you select one itemid, this should give good results, with access time almost as good as with a single table. The distkey is used to distribute the data between slices, which means each itemid's blocks are stored together on the same slice, making retrieval of all of them faster. Having itemid also be the sortkey means that, for itemids with small row counts that all exist on the same slice, finding the rows within the table's blocks on that slice is as fast as possible.
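A sketch of such a table definition (the non-key columns are illustrative):

create table item_data (
  itemid bigint,
  event_ts timestamp,
  value double precision
)
distkey (itemid)
sortkey (itemid);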
Creating a separate table for each item, where every other attribute of the table remains the same, doesn't seem logical. If the data format is the same, then keep the data in the same table unless there is a particular problem to overcome.
If you set the itemId as the SORTKEY on a Redshift table, then Redshift will be able to skip over the blocks that do not contain a desired value (when using WHERE itemId = 'xxx'). This will be highly efficient.
Admittedly, trying to keep such a large table sorted would probably make VACUUM too hard to run. It would still work reasonably well without the SORTKEY, since blocks can still be skipped, but not as efficiently, because the data for a given itemId would be spread over more blocks.

Redshift performance difference between CTAS and select count

I have query A, which mostly left joins several different tables.
When I do:
select count(1) from (
A
);
the query returns the count in approximately 40 seconds. The count is not big, at around 2.8M rows.
However, when I do:
create table tbl as A;
where A is the same query, it takes approximately 2 hours to complete. Query A returns 14 columns (not many), and all the tables used in the query are:
Vacuumed;
Analyzed;
Distributed across all nodes (DISTSTYLE ALL);
Encoded/Compressed (except on their sortkeys).
Any ideas on what should I look at?
When using CREATE TABLE AS (CTAS), a new table is created. This involves copying all 2.8 million rows of data. You didn't state the size of your table, but this could conceivably involve a lot of data movement.
CTAS does not copy the DISTKEY or SORTKEY. The CREATE TABLE AS documentation says that the default DISTSTYLE is EVEN. Therefore, the CTAS operation would also have involved redistributing the data amongst the nodes. Since the source tables were DISTSTYLE ALL, at least the data was available on each node for distribution, so this shouldn't have been too bad.
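If the new table should keep a particular layout, you can state it explicitly in the CTAS; a sketch (the sortkey column is illustrative):

create table tbl
diststyle all
sortkey (id)   -- pick the column your queries filter or join on
as A;          -- A being the same query as in the question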
If your original table DDL included compression, then these settings would probably have been copied across. If the DDL did not specify compression, then the copy to the new table might have triggered the automatic compression analysis, which involves loading 100,000 rows, choosing a compression type for each column, dropping that data and then starting the load again. This could consume some time.
Finally, it comes down to the complexity of Query A. It is possible that Redshift was able to optimize the count query by reading very little data from disk, because it realized that very few columns (or perhaps no columns) needed to be read from disk to produce the count. This really depends upon the contents of that query.
It could simply be that you've got a very complex query that takes a long time to process (work that wasn't done as part of the count). If the query involves many JOIN and WHERE clauses, it could be optimized by wise use of DISTKEY and SORTKEY values.
CREATE TABLE AS writes all the data returned by the query to disk, while a count query does not; that explains the difference. Writing all the rows is a more expensive operation than reading a row count.

How small should a table using Diststyle ALL be in Amazon Redshift?

How small should a table using Diststyle ALL be in Amazon Redshift?
It says here: http://dwbitechguru.blogspot.com/2014/11/performance-tuning-in-amazon-redshift.html
that for very small tables, Redshift should use DISTSTYLE ALL instead of EVEN or KEY. How small is small? If I were to specify a row count in the WHERE clause of the query select relname, reldiststyle from pg_class, how many rows should I specify?
It really depends on the size of the cluster you are using. DISTSTYLE ALL copies the data of your table to every node, to mitigate the data transfer requirement across nodes. You can find out the size of your table and the available size of your Redshift nodes; if you can afford to copy the table once per node, do it!
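For example (a sketch), SVV_TABLE_INFO reports the per-table size in 1MB blocks:

-- size is in 1 MB blocks; tbl_rows is the estimated row count
select "table", diststyle, size, tbl_rows
from svv_table_info
order by size desc;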
Also, if you have a requirement of joining other tables with this table very frequently, say in 70% of your queries, I believe it is worth the space if you want better query performance.
If your join keys across tables are similar in cardinality, then you can also afford to distribute all the tables on that key, so that matching keys lie on the same node, which obviates the replication of data.
I would suggest trying out the two options above, comparing the average query run times of around 10 queries, and then coming to a decision.
Considering a star schema, the distribution style ALL is normally used for dimension tables. Doing this has the advantage of speeding up joins; let's explain this through an example. If we would like to obtain the quantity sold of each product by country, we would need to join the fact_sales table with the dim_store table on the store_id key.
So, setting diststyle all on dim_store enables the join to be computed in parallel, compared with the disadvantage of shuffling when using diststyle even. However, you can also let Redshift handle an optimal distribution style automatically by setting diststyle auto; for more info, check this link.
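A sketch using the example's names (the column definitions are illustrative):

-- replicate the dimension to every node so joins against it stay local
create table dim_store (
  store_id bigint,
  country varchar(64)
)
diststyle all;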