Hive join query optimisation


Table A
---------
col1, col2,Adate,qty
Table B
-------
col2,cost,Bdate
The table sizes are as follows:
A: 1 million
B: 700k
Consider this query:
SELECT
A.col1,
A.col2,
B.Bdate bdate,
SUM(qty)*COLLECT_LIST(cost)[0] price
FROM A
JOIN B
ON (A.col2 = B.col2 AND A.Adate <= B.Bdate)
GROUP BY
A.col1,
A.col2,
B.bdate;
The above Hive query takes more than 3 hours on a cluster of 4 slaves (8 GB memory, 100 GB disk) and 1 master (16 GB memory, 100 GB disk).
Can this query be optimized? If so, where?

Use Tez and mapjoin.
set hive.auto.convert.join=true; --this enables map-join
set hive.mapjoin.smalltable.filesize=25000000; --adjust for your smaller table to fit in memory
set hive.execution.engine=tez;
Also this computation is not memory-efficient:
SUM(qty)*COLLECT_LIST(cost)[0] price
COLLECT_LIST gathers every cost value in the group into an array. The array is not deduplicated (it contains a value from every row in the group) and not ordered (there is no distribute + sort before collect_list). Depending on your data the array can be large, since it has as many elements as the group has rows. Taking element [0] then picks an arbitrary cost from the group, so collecting the whole array just to read one arbitrary element is wasteful. If it does not matter which cost is taken, min(cost), max(cost) or another scalar aggregate will consume less memory. You can also use the first_value analytic function (it may require a sub-query, but it is memory-efficient too).
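As a rough illustration of the scalar-aggregate rewrite, here is a minimal Python sketch that runs the reshaped query against an in-memory SQLite database. SQLite is only a stand-in for Hive, the sample rows are invented, and MIN(cost) is assumed to be an acceptable replacement for COLLECT_LIST(cost)[0] (that is, it does not matter which cost in the group is used).

```python
import sqlite3

# In-memory SQLite stands in for Hive; the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (col1 TEXT, col2 TEXT, Adate TEXT, qty INTEGER);
CREATE TABLE B (col2 TEXT, cost REAL, Bdate TEXT);
INSERT INTO A VALUES ('x', 'k1', '2020-01-01', 2), ('x', 'k1', '2020-01-02', 3);
INSERT INTO B VALUES ('k1', 10.0, '2020-01-05'), ('k1', 12.0, '2020-01-05');
""")

# Same shape as the original query, but with a scalar aggregate (MIN)
# instead of building an array per group and reading element [0].
rows = conn.execute("""
SELECT A.col1, A.col2, B.Bdate AS bdate,
       SUM(A.qty) * MIN(B.cost) AS price
FROM A
JOIN B ON A.col2 = B.col2 AND A.Adate <= B.Bdate
GROUP BY A.col1, A.col2, B.Bdate
""").fetchall()

print(rows)
```

Unlike an array of every cost in the group, the MIN aggregate only has to keep one running value per group.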

I will try to give you some advice to improve query performance in Hive.
Check the execution engine you are using
set hive.execution.engine;
If your execution engine is mr (MapReduce), you may be able to use Apache Spark or Apache Tez instead, both of which are generally faster than MapReduce.
set hive.execution.engine=tez;
Join queries are computationally expensive and can be slow, especially when you’re joining three or more tables, or if you’re working with very large data.
One strategy that can be used to remedy this problem is to join the data in advance and store the pre-joined result in a separate table, which you can then query.
This is one way of denormalizing a normalized database to make it easier to run analytic queries.
This approach of pre-joining tables has some costs, but it can make analytic queries easier to write and faster to run.
There are some other techniques for improving Hive query performance:
Join table ordering (Largest table last)
As with any type of tuning, it is important to understand the internal workings of the system. When Hive executes a join, it needs to select which table is streamed and which table is cached.
Hive streams the last table in the JOIN statement, so we need to ensure that the streamed table is the larger of the two.
A: 1 million B: 700k
Hence, when these two tables are joined, it is important that the larger table (here A) comes last in the query.
Bucketing
Bucketing stores data in separate files, not separate subdirectories like partitioning.
It divides the data in an effectively random way, not in a predictable way like partitioning.
When records are inserted into a bucketed table, Hive computes hash codes of the values in the specified bucketing column and uses these hash codes to divide the records into buckets.
For this reason, bucketing is sometimes called hash partitioning.
The goal of bucketing is to distribute records evenly across a predefined number of buckets.
Bucketing can improve the performance of joins if all the joined tables are bucketed on the join key column.
For more on bucketing, see the page of the Hive Language Manual describing bucketed tables (BucketedTables) and bucketing-in-hive.
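The hash-based assignment described above can be sketched in Python. This is only an illustration of the idea: CRC32 stands in for Hive's own hash function, and NUM_BUCKETS, bucket_for, bucketize and bucketed_join are hypothetical names, not Hive APIs. It shows why bucketing both tables on the join key helps: matching keys always land in the same bucket number, so each bucket only needs to be compared with its counterpart.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, fixed when the table is defined


def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Map a key to a bucket number. CRC32 is a stand-in for Hive's hash."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucketize(rows, key_index, num_buckets=NUM_BUCKETS):
    """Group rows into buckets by the hash of the bucketing column."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row[key_index], num_buckets)].append(row)
    return buckets


def bucketed_join(left, right, left_key, right_key):
    """Join two row lists bucket by bucket instead of all-against-all."""
    left_buckets = bucketize(left, left_key)
    right_buckets = bucketize(right, right_key)
    joined = []
    for bucket_no, left_rows in left_buckets.items():
        # Only the bucket with the same number can hold matching keys.
        for l_row in left_rows:
            for r_row in right_buckets.get(bucket_no, []):
                if l_row[left_key] == r_row[right_key]:
                    joined.append(l_row + r_row)
    return joined


table_a = [("c1", "k1"), ("c2", "k2"), ("c3", "k1")]
table_b = [("k1", 10), ("k3", 5)]
joined = bucketed_join(table_a, table_b, left_key=1, right_key=0)
print(sorted(joined))
```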
Partitioning
Partitioning is a way of dividing a table into related parts based on the values of particular columns like date, city, and department.
Each table in Hive can have one or more partition keys to identify a particular partition.
Partitioning makes it easy to run queries on slices of the data.
apache-hive-partitions

Related

How to improve SQL query in Spark when updating table? ('NOT IN' in subquery)

I have a Dataframe in Spark which is registered as a table called A and has 1 billion records and 10 columns. First column (ID) is Primary Key.
I also have another Dataframe which is registered as a table called B and has 10,000 records and 10 columns (same columns as table A, first column (ID) is Primary Key).
Records in Table B are 'Update records'. So I need to update all 10,000 records in table A with records in table B.
I tried first with this SQL query:
select * from A where ID not in (select ID from B) and then to union that with table B. The approach is OK, but the first query (select * from A where ID not in (select ID from B)) is extremely slow (hours on a moderate cluster).
Then I tried to speed up first query with LEFT JOIN:
select A.* from A left join B on (A.ID = B.ID ) where B.ID is null
That approach seems fine logically, but it takes way too much memory for the Spark containers (YARN for exceeding memory limits. 5.6 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memory).
What would be a better/faster/less memory consumption approach?

I would go with the left join too, rather than NOT IN.
Some advice to reduce memory requirements and improve performance:
Check that the large table is uniformly distributed by the join key (ID). If it is not, some tasks will be heavily loaded and others nearly idle, which causes serious slowness. Do a groupBy on ID and count to measure this.
If the join key is naturally skewed, add more columns to the join condition while keeping the result the same. More columns may increase the chance that the data is shuffled uniformly. This is a little hard to achieve.
Memory demand depends on the number of parallel tasks running and the volume of data per task being executed in an executor. Reducing either or both will reduce memory pressure; the job will run slower, but that is better than crashing. I would reduce the volume of data per task by creating more partitions. Say you have 10 partitions for 1B rows; make it 200 to reduce the volume per task. Use repartition on table A. Don't create too many partitions, because that causes inefficiency; 10K partitions may be a bad idea.
There are some parameters to be tweaked, which are explained here.
The small table with 10K rows should be broadcast automatically because it is small. If not, you can increase the broadcast limit and apply a broadcast hint.
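For reference, the logic of the left-join approach (keep the rows of A that have no match in B, then append B) can be checked on a tiny dataset. The sketch below uses an in-memory SQLite database purely as a stand-in for Spark SQL; the val column and the sample rows are invented, and none of the Spark tuning above (repartitioning, broadcast) is represented here.

```python
import sqlite3

# SQLite stands in for Spark SQL; tables and rows are invented examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (ID INTEGER PRIMARY KEY, val TEXT);
CREATE TABLE B (ID INTEGER PRIMARY KEY, val TEXT);
INSERT INTO A VALUES (1, 'old-1'), (2, 'old-2'), (3, 'old-3');
INSERT INTO B VALUES (2, 'new-2');
""")

# Anti-join: rows of A with no matching ID in B, then add the update rows.
updated = conn.execute("""
SELECT A.* FROM A LEFT JOIN B ON A.ID = B.ID WHERE B.ID IS NULL
UNION ALL
SELECT * FROM B
ORDER BY ID
""").fetchall()

print(updated)
```

Row 2 comes from B and the other rows come from A unchanged, which is the intended "update" result.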

Reduce the amount of data scanned by Athena when using aggregate functions

The below query scans 100 MB of data.
select * from table where column1 = 'val' and partition_id = '20190309';
However the below query scans 15 GB of data (there are over 90 partitions)
select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table);
How can I optimize the second query to scan the same amount of data as the first?

There are two problems here: the efficiency of the scalar subquery above, select max(partition_id) from table, and the one #PiotrFindeisen pointed out around dynamic filtering.
The first problem is that queries over the partition keys of a Hive table are a lot more complex than they appear. Most folks would think that if you want the max value of a partition key, you can simply execute a query over the partition keys, but that doesn't work because Hive allows partitions to be empty (and it also allows non-empty files that contain no rows). Specifically, the scalar subquery above, select max(partition_id) from table, requires Trino (formerly PrestoSQL) to find the max partition containing at least one row. The ideal solution would be to have perfect stats in Hive, but short of that the engine would need custom logic for Hive that opens files of the partitions until it finds a non-empty one.
If you are sure that your warehouse does not contain empty partitions (or if you are OK with the implications of that), you can replace the scalar subquery with one over the hidden $partitions table:
select *
from table
where column1 = 'val' and
partition_id = (select max(partition_id) from "table$partitions");
The second problem is the one #PiotrFindeisen pointed out, and has to do with the way that queries are planned and executed. Most people would look at the above query, see that the engine should obviously figure out the value of select max(partition_id) from "table$partitions" during planning, inline that into the plan, and then continue with optimization. Unfortunately, that is a pretty complex decision to make generically, so the engine instead simply models this as a broadcast join, where one part of the execution figures out that value, and broadcasts the value to the rest of the workers. The problem is the rest of the execution has no way to add this new information into the existing processing, so it simply scans all of the data and then filters out the values you are trying to skip. There is a project in progress to add this dynamic filtering, but it is not complete yet.
This means the best you can do today, is to run two separate queries: one to get the max partition_id and a second one with the inlined value.
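The two-query workaround can be sketched as follows. This is a minimal Python illustration, not an Athena client: an in-memory SQLite database stands in for the engine, the events table and its rows are invented, and a real setup would issue both statements through whatever Athena client is in use. The point is only that the second statement contains the partition value as a literal, so the planner sees it up front.

```python
import sqlite3

# SQLite stands in for Athena; the "events" table and rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (column1 TEXT, partition_id TEXT);
INSERT INTO events VALUES
  ('val', '20190308'), ('val', '20190309'), ('other', '20190309');
""")

# Query 1: find the latest partition value.
(max_partition_id,) = conn.execute(
    "SELECT max(partition_id) FROM events"
).fetchone()

# The value is inlined as a literal, so make sure it looks like a
# partition id before pasting it into SQL text.
if not max_partition_id.isdigit():
    raise ValueError(f"unexpected partition id: {max_partition_id!r}")

# Query 2: same filter as the original, with the value inlined.
query = (
    "SELECT * FROM events "
    f"WHERE column1 = 'val' AND partition_id = '{max_partition_id}'"
)
rows = conn.execute(query).fetchall()
print(query)
print(rows)
```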
BTW, the hidden "$partitions" table was added in Presto 0.199, and we fixed some minor bugs in 0.201. I'm not sure which version Athena is based on, but I believe it is pretty far out of date (the current release at the time I'm writing this answer is 309).

EDIT: Presto removed the __internal_partitions__ table in their 0.193 release so I'd suggest not using the solution defined in the Slow aggregation queries for partition keys section below in any production systems since Athena 'transparently' updates presto versions. I ended up just going with the naive SELECT max(partition_date) ... query but also using the same lookback trick outlined in the Lack of Dynamic Filtering section. It's about 3x slower than using the __internal_partitions__ table, but at least it won't break when Athena decides to update their presto version.
----- Original Post -----
So I've come up with a fairly hacky way to accomplish this for date-based partitions on large datasets for when you only need to look back over a few partitions'-worth of data for a match on the max, however, please note that I'm not 100% sure how brittle the usage of the information_schema.__internal_partitions__ table is.
As #Dain noted above, there are really two issues. The first being how slow an aggregation of the max(partition_date) query is, and the second being Presto's lack of support for dynamic filtering.
Slow aggregation queries for partition keys
To solve the first issue, I'm using the information_schema.__internal_partitions__ table which allows me to get quick aggregations on the partitions of a table without scanning the data inside the files. (Note that partition_value, partition_key, and partition_number in the below queries are all column names of the __internal_partitions__ table and not related to your table's columns)
If you only have a single partition key for your table, you can do something like:
SELECT max(partition_value) FROM information_schema.__internal_partitions__
WHERE table_schema = 'DATABASE_NAME' AND table_name = 'TABLE_NAME'
But if you have multiple partition keys, you'll need something more like this:
SELECT max(partition_date) as latest_partition_date from (
SELECT max(case when partition_key = 'partition_date' then partition_value end) as partition_date, max(case when partition_key = 'another_partition_key' then partition_value end) as another_partition_key
FROM information_schema.__internal_partitions__
WHERE table_schema = 'DATABASE_NAME' AND table_name = 'TABLE_NAME'
GROUP BY partition_number
)
WHERE
-- ... Filter down by values for e.g. another_partition_key
These queries should run fairly quickly (mine run in about 1-2 seconds) without scanning through the actual data in the files, but again, I'm not sure if there are any gotchas with using this approach.
Lack of Dynamic Filtering
I'm able to mitigate the worst effects of the second problem for my specific use-case because I expect there to always be a partition within a finite amount of time back from the current date (e.g. I can guarantee any data-production or partition-loading issues will be remedied within 3 days). It turns out that Athena does do some pre-processing when using presto's datetime functions, so this does not have the same types of issues with Dynamic Filtering as using a sub-query.
So you can change your query to limit how far it will look back for the actual max using the datetime functions so that the amount of data scanned will be limited.
SELECT * FROM "DATABASE_NAME"."TABLE_NAME"
WHERE partition_date >= cast(date '2019-06-25' - interval '3' day as varchar) -- Will only scan partitions from 3 days before '2019-06-25'
AND partition_date = (
-- Insert the partition aggregation query from above here
)

I don't know if it is still relevant, but just found out:
Instead of:
select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table);
Use:
select a.* from table a
inner join (select max(partition_id) max_id from table) b on a.partition_id=b.max_id
where column1 = 'val';
I think it has something to do with optimizations of joins to use partitions.

5+ Intermediate SQL Tables to Arrive at Desired Table, Postgres

I am generating reports on electoral data that group voters into their age groups, and then assign those age groups a quartile, before finally returning the table of age groups and quartiles.
By the time I arrive at the table with the schema and data that I want, I have created 7 intermediate tables that might as well be deleted at this point.
My question is, is it plausible that so many intermediate tables are necessary? Or is this a sign that I am "doing it wrong?"
Technical Specifics:
Postgres 9.4
I am chaining tables, starting with the raw database tables and successively transforming the table closer to what I want. For instance, I do something like:
CREATE TABLE gm.race_code_and_turnout_count AS
SELECT race_code, count(*)
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
And then I do
CREATE TABLE gm.race_code_and_percent_of_total_turnout AS
SELECT race_code, count, round((count::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.race_code_and_turnout_count
And that first table goes off in a second branch:
CREATE TABLE gm.race_code_and_turnout_percentage AS
SELECT t1.race_code, round((t1.count::numeric / t2.count)*100,2) as turnout_percentage
FROM gm.race_code_and_turnout_count AS t1
JOIN gm.race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
So each table is building on the one before it.

While temporary tables are used a lot in SQL Server (mainly to overcome the peculiar locking behaviour that it has) it is far less common in Postgres (and your example uses regular tables, not temporary tables).
Usually the overhead of creating a new table is higher than letting the system store intermediate results on disk.
From my experience, creating intermediate tables usually only helps if:
you have a lot of data that is aggregated and can't be aggregated in memory
the aggregation drastically reduces the data volume to be processed so that the next step (or one of the next steps) can handle the data in memory
you can efficiently index the intermediate tables so that the next step can make use of those indexes to improve performance.
you re-use a pre-computed result several times in different steps
The above list is not complete, and using this approach can also be beneficial if only some of these conditions are true.
If you keep creating those tables, create them at least as temporary or unlogged tables to minimize the I/O overhead that comes with writing that data and thus keep as much data in memory as possible.
However I would always start with a single query instead of maintaining many different tables (that all need to be changed if you have to change the structure of the report).
For example your first two queries from your question can easily be combined into a single query with no performance loss:
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code;
This is going to be faster than writing the data twice to disk (including all transactional overhead).
If you stack your queries using common table expressions Postgres will automatically store the data on disk if it gets too big, if not it will process it in-memory. When manually creating the tables you force Postgres to write everything to disk.
So you might want to try something like this:
with race_code_and_turnout_count as (
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
), race_code_and_total_count as (
select ....
from ....
), race_code_and_turnout_percentage as (
SELECT t1.race_code,
round((t1.cnt::numeric / t2.count)*100,2) as turnout_percentage
FROM race_code_and_turnout_count AS t1
JOIN race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
)
select *
from ....;
and see how that performs.
If you don't re-use the intermediate steps more than once, writing them as a derived table instead of a CTE might be faster in Postgres due to the way the optimizer works, e.g.:
SELECT t1.race_code,
round((t1.cnt::numeric / t2.count)*100,2) as turnout_percentage
FROM (
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
) AS t1
JOIN race_code_and_total_count AS t2
ON t1.race_code = t2.race_code

If it performs well and results in the right output, I see nothing wrong with it. I do however suggest to use (local) temporary tables if you need intermediate tables.
Your series of queries can always be optimized to use fewer intermediate steps. Do that if you feel your reports start performing poorly.

How to utilize table partition in oracle database in effective manner?

I have created a partitioned table as
CREATE TABLE orders_range(order_id NUMBER
,client_id NUMBER
,order_date DATE)
PARTITION BY RANGE(order_date)
(PARTITION orders2011 VALUES LESS THAN (to_date('1/1/2012','dd/mm/yyyy'))
,PARTITION orders2012 VALUES LESS THAN (to_date('1/1/2013','dd/mm/yyyy'))
,PARTITION orders2013 VALUES LESS THAN (MAXVALUE));
When I select the records using
SELECT * FROM ORDERS_RANGE partition(orders2011);
the explain plan shows a CPU cost of 75,
but when I run a normal query with a WHERE clause the CPU cost is only 6. What, then, is the performance advantage of table partitioning?
Can anyone explain this in detail?
Thanks in advance.

First, you generally can't directly compare the cost of two different plans running against two different objects. It is entirely possible that one plan with a cost of 10,000 will run much more quickly than a different plan with a cost of 10. You can compare the cost of two different plans for a single SQL statement within a single 10053 trace (so long as you remember that these are estimates and if the optimizer estimates incorrectly, many cost values are incorrect and the optimizer is likely to pick a less efficient plan). It may make sense to compare the cost between two different queries if you are trying to work out the algorithm the optimizer is using for a particular step but that's pretty unusual.
Second, in your example, you're not inserting any data. Generally, if you're going to partition a table, you're doing so because you have multiple GB of data in that table. If you compare something like
SELECT *
FROM unpartitioned_table_with_1_billion_rows
vs
SELECT *
FROM partitioned_table_with_1_billion_rows
WHERE partition_key = date '2014-04-01' -- Restricts the data to only 10 million rows
the partitioned approach will, obviously, be more efficient not least of all because you're only reading the 10 million rows in the April 1 partition rather than the 1 billion rows in the table.
If the table has no data, it's possible that the query against the partitioned table would be a tiny bit less efficient since you've got to do more things in the course of parsing the query. But reading 0 rows from a 0 row table is going to take essentially no time either way so the difference in parse time is likely to be irrelevant.
In general, you wouldn't ever use the ORDERS_RANGE partition(orders2011) syntax to access data. In addition to hard-coding the partition name, which means that you'd often be resorting to dynamic SQL to assemble the query, you'd be doing a lot more hard parsing, putting more pressure on the shared pool, and risking a mistake if someone changed the partitioning on the table. It makes far more sense to supply a predicate on the partition key and to let Oracle work out how to appropriately prune the partitions. In other words
SELECT *
FROM orders_range
WHERE order_date < date '2012-01-01'
would be a much more sensible query.
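To make the idea of pruning concrete, here is a small Python sketch of how a predicate on the partition key narrows down which range partitions need to be read. It mirrors the bounds of the orders_range table from the question, but it is only a conceptual model and not how Oracle implements pruning; PARTITIONS and partitions_for_less_than are hypothetical names, and date.max stands in for MAXVALUE.

```python
from datetime import date

# Exclusive upper bounds of the range partitions, mirroring orders_range.
# date.max is a stand-in for MAXVALUE.
PARTITIONS = [
    ("orders2011", date(2012, 1, 1)),
    ("orders2012", date(2013, 1, 1)),
    ("orders2013", date.max),
]


def partitions_for_less_than(bound):
    """Return the partitions that can hold rows with order_date < bound."""
    selected = []
    lower = date.min  # inclusive lower bound of the current partition
    for name, upper in PARTITIONS:
        # The partition holds lower <= order_date < upper, so it can
        # contain a matching row only if its lower bound is below `bound`.
        if lower < bound:
            selected.append(name)
        lower = upper
    return selected


# WHERE order_date < date '2012-01-01' only needs the first partition.
pruned = partitions_for_less_than(date(2012, 1, 1))
print(pruned)
```

With the predicate from the query above, only orders2011 is selected, without the partition name ever appearing in the SQL.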

Speeding up aggregations for a large table in Oracle

I am trying to see how to improve performance for aggregation queries in an Oracle database. The system is used to run financial series simulations.
Here is the simplified set-up:
The first table table1 has the following columns
date | id | value
It is read-only, has about 100 million rows and is indexed on id, date
The second table table2 is generated by the application according to user input, is relatively small (300K rows) and has this layout:
id | start_date | end_date | factor
After the second table is generated, I need to compute totals as follows:
select date, sum(value * nvl(factor,1)) as total
from table1
left join table2 on table1.id = table2.id
and table1.date between table2.start_date and table2.end_date group by date
My issue is that this is slow, taking up to 20-30 minutes if the second table is particularly large. Is there a generic way to speed this up, perhaps trading off storage space and execution time, ideally, to achieve something running in under a minute?
I am not a database expert and have been reading Oracle performance tuning docs but was not able to find anything appropriate for this. The most promising idea I found was OLAP cubes, but I understand this would help only if my second table was fixed and I simply needed to apply different filters on the data.

First, to provide any real insight, you'd need to determine the execution plan that Oracle is producing for the slow query.
You say the second table is ~300K rows - yes that's small compared to 100M but since you have a range condition in the join between the two tables, it's hard to say how many rows from table1 are likely to be accessed in any given execution of the query. If a large proportion of the table is accessed, but the query optimizer doesn't recognize that, the index may actually be hurting instead of helping.
You might benefit from re-organizing table1 as an index-organized table, since you already have an index that covers most of the columns. But all I can say from the information so far is that it might help, but it might not.

Apart from indexes, also try the following. My two cents!
Try running this query with the PARALLEL hint to employ multiple processors: /*+ PARALLEL(table1,4) */.
NVL is evaluated for millions of rows, which will have some impact. Is there any way the data can be organised to avoid it?
If you know the dates in advance, you could split this query into two steps: first fetch the ids from TABLE2 using the start date and end date, then JOIN that result to TABLE1 using a view or temp table. This way the index (with id as its leading edge) is used optimally.
Thanks!
