Hortonworks HDP 2.3.0 - Hive 0.14
Table T1 (partitioned on col1, no buckets, ORC): approx. 120 million rows, 6 GB data size
Table T2 (partitioned on col2, no buckets, ORC): approx. 200 million rows, 6 MB data size
Query: T1 LEFT OUTER JOIN T2 ON (t1.col3 = t2.col3)
The above query runs a long time in the last reducer phase, in both Tez and MR mode.
I also tried auto convert join set to true/false and an explicit map join.
Still the query sits in the last reducer phase and never finishes.
FYI: if the data size of T2 is either 9 KB or 1 GB, the query finishes.
The problem may be that there are too many bytes/rows per reducer. If execution is stuck in a single last reducer, it is most probably data skew. To check, select the top 5 col3 values by count from both tables; you have skew when a large share of the records (say 30%) have the same key value. If it is skew, try joining the skewed key separately and UNION ALL it with the join of all the other keys. Something like this:
select * from
T1 left outer join t2 on ( t1.col3 = t2.col3 )
where t1.col3 = SKEW_VALUE
union all
select * from
T1 left outer join t2 on ( t1.col3 = t2.col3 )
where t1.col3 <> SKEW_VALUE
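To verify the skew first, a quick frequency check like this (a sketch using the column name from the question) shows whether a single key value dominates:
select col3, count(*) as cnt from T1 group by col3 order by cnt desc limit 5;
select col3, count(*) as cnt from T2 group by col3 order by cnt desc limit 5;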
If the application is stuck in the whole last reducer stage, not in a single reducer or a few reducers, then check the bytes-per-reducer Hive setting; maybe it is too high:
set hive.exec.reducers.bytes.per.reducer=67108864;
Also, have you tried giving a size to auto convert join? Try a size larger than the small table, so that it can fit into memory:
set hive.auto.convert.join.noconditionaltask.size = 10000000;
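For reference, a minimal sketch of the map-join settings involved (the size is in bytes and must exceed the small table's size; the value below is just the one from above):
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;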
My BigQuery job, which was executing fine until yesterday, started failing with the error below.
Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN
Query:
with result as (
select
*
from
(
select * from `project.dataset_stage.non_split_daily_temp`
union all
select * from `project.dataset_stage.split_daily_temp`
)
)
select
*
from
result final
where
not (
exists
(
select
1
from
`project.dataset.bqt_sls_cust_xref` target
where
final.sls_dte = target.sls_dte and
final.rgs_id = target.rgs_id
) and
unlinked = 'Y' and
cardmatched = 'Y'
)
Can someone please assist me with this? I would like to know the reason for the sudden break and how to fix this issue permanently.
Thank you for the suggestion.
We figured out the cause of the issue; the reason is below.
When one writes a correlated subquery like
select T2.col, (select count(*) from T1 where T1.col = T2.col) from T2
technically the SQL text implies that the subquery needs to be re-executed for every row of T2.
If T2 has a billion rows, then we would need to scan T1 a billion times. That would take forever and the query would never finish.
The cost of executing the query drops from O(size T1 * size T2) to O(size T1 + size T2) if it is implemented as below:
select any_value(T2.col), count(T1.col) from
T2 left join T1 on T1.col = T2.col
group by T2.primary_key
BigQuery errors out if it can't find a way to optimize a correlated subquery into linear cost O(size T1 + size T2).
We have plenty of patterns that we recognize and rewrite for correlated subqueries, but apparently the new view definition made the subquery too complex and the query optimizer was unable to find a way to run it with a linear-complexity algorithm.
Google will probably fix the issue by identifying a better algorithm.
I don't know why it suddenly broke, but it seems your query can be rewritten with an OUTER JOIN:
with result as (
select
*
from
(
select * from `project.dataset_stage.non_split_daily_temp`
union all
select * from `project.dataset_stage.split_daily_temp`
)
)
select
*
from
result final LEFT OUTER JOIN `project.dataset.bqt_sls_cust_xref` target
ON final.sls_dte = target.sls_dte and
final.str_id = target.str_id and
final.rgs_id = target.rgs_id
where
target.<id_column> IS NULL AND -- no join match found, equivalent to NOT EXISTS (<correlated sub-query>)
was_unlinked_run = 'Y' and
is_card_matched = 'Y'
For example, does the first query get processed differently than the second query?
Query 1
SELECT t1.var1, t2.var2 FROM table1 t1
INNER JOIN table2 t2
ON t1.key = t2.key
WHERE t2.ID = 'ABCD'
Query 2
SELECT t1.var1, t2.var2 FROM table1 t1
INNER JOIN (
SELECT var2, key FROM table2
WHERE ID = 'ABCD'
) t2
ON t1.key = t2.key
At a glance, it seems as if the second query would be more efficient - table2 is reduced before the join begins, whereas the first query appears to join the tables first, then reduce later. I'm using Teradata, if it matters.
It depends on the vendor, version, and configuration.
An older Teradata version or legacy configuration might spool the sub-query as a first step for Query 2, leading to reduced performance compared to Query 1, depending on the tables' primary indexes and the join algorithm.
I would suggest avoiding this kind of "optimization".
P.S.
Check whether you get the same execution plan for both queries or two different plans.
Check the query log for AMPCPUTime (for a start).
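A minimal sketch of both checks, assuming DBQL query logging is enabled (DBC.QryLogV and the columns below are the usual DBQL objects, but what is populated depends on your site's logging setup):
EXPLAIN
SELECT t1.var1, t2.var2 FROM table1 t1
INNER JOIN table2 t2 ON t1.key = t2.key
WHERE t2.ID = 'ABCD';

-- after running both variants, compare their CPU and I/O
SELECT QueryText, AMPCPUTime, TotalIOCount
FROM DBC.QryLogV
WHERE UserName = USER
ORDER BY StartTime DESC;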
Let's say I have three tables: t1 (a fact table with about 1 billion rows), t2 (an empty table, 0 rows), and t0 (a dimension table). All of them have properly collected statistics. In addition, there is a view v0:
REPLACE VIEW v0
AS SELECT * from t1
union
SELECT * from t2;
Let's look at these three queries:
1) Select * from t1 inner join t0 on t1.id = t0.id; -- Optimizer correctly estimates 1 bln rows
2) Select * from t2 inner join t0 on t2.id = t0.id; -- Optimizer correctly estimates 0 rows
3) Select * from v0 inner join t0 on v0.id = t0.id; -- Optimizer locks t1 and t2 for read; it correctly estimates that it will get 1 bln rows from t1, but for no clear reason estimates the same number, 1 bln, from table t2.
What is going on here? Is it a bug or a feature?
PS. The original query, which is too big to show here, didn't finish in 35 minutes. After leaving just t1, it successfully finished in 15 minutes.
TD Release: 15.10.03.07
TD Version: 15.10.03.09
It's not the same number for the 2nd Select; it's the overall number of rows in the spool after the 2nd Select, which is 1 billion plus 0.
And your query was running slowly because you used UNION, which defaults to DISTINCT; running that on a billion rows is really expensive.
Better to switch to UNION ALL instead.
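Applied to the view above, that would be:
REPLACE VIEW v0
AS SELECT * from t1
UNION ALL
SELECT * from t2;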
Should I avoid IN() because it is slower than EXISTS()?
SELECT * FROM TABLE1 t1 WHERE EXISTS (SELECT 1 FROM TABLE2 t2 WHERE t1.ID = t2.ID)
VS
SELECT * FROM TABLE1 t1 WHERE t1.ID IN(SELECT t2.ID FROM TABLE2 t2)
From my investigation, I set SHOWPLAN_ALL and get the same execution plan and estimated cost for both. The index (PK) is used, with a seek, in both queries. No difference.
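Beyond the plan comparison, a minimal sketch of comparing actual I/O and CPU for the two forms (STATISTICS IO/TIME are standard SQL Server session settings):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT * FROM TABLE1 t1 WHERE EXISTS (SELECT 1 FROM TABLE2 t2 WHERE t1.ID = t2.ID);
SELECT * FROM TABLE1 t1 WHERE t1.ID IN (SELECT t2.ID FROM TABLE2 t2);
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;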
What other scenarios or cases would make a big difference between the two queries? Or is the optimizer just optimizing both for me, so that I get the same execution plan?
Do neither. Do this:
SELECT DISTINCT T1.*
FROM TABLE1 t1
JOIN TABLE2 t2 ON t1.ID = t2.ID;
This will outperform anything else by orders of magnitude.
Both queries will produce the same execution plan (assuming no indexes were created): two table scans and one nested loop (join).
The join suggested by Bohemian will do a Hash Match instead of the loop, which I've always heard (and here is a proof: Link) is the worst kind of join.
Between IN and EXISTS (your actual question), EXISTS returns better performance (take a look at: Link).
If your table T2 has a lot of records, EXISTS is the better approach hands down, because as soon as your database finds a record that matches your requirement, the condition evaluates to true and the scan of T2 stops. However, with the IN clause, you're scanning TABLE2 for every row in TABLE1.
IN is better than EXISTS when you have a literal list of values, or only a few values coming from the subquery.
To expand my answer a little, based on an Ask Tom answer:
A SELECT with IN, for example:
Select * from T1 where x in ( select y from T2 )
is usually processed as:
select *
from t1, ( select distinct y from t2 ) t2
where t1.x = t2.y;
The subquery is evaluated, distinct'ed, indexed (or hashed or sorted) and then joined to the original table (typically).
An EXISTS query like:
select * from t1 where exists ( select null from t2 where y = x )
is processed more like:
for x in ( select * from t1 )
loop
if ( exists ( select null from t2 where y = x.x ) )
then
OUTPUT THE RECORD
end if
end loop
It always results in a full scan of T1 whereas the first query can make use of an index on T1(x).
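A sketch of the indexes in question (the names are made up): an index on T1(x) serves the IN form, and an index on T2(y) is what makes each EXISTS probe fast, as mentioned below.
create index t1_x_idx on t1 (x);
create index t2_y_idx on t2 (y);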
When is EXISTS appropriate and when is IN appropriate?
Use EXISTS when... the subquery on T2 is huge and takes a long time, T1 is relatively small, and executing ( select null from t2 where y = x.x ) is very, very fast.
Use IN when... the result of the subquery is small -- then IN is typically more appropriate.
If both the subquery and the outer table are huge -- either might work as well as the other; it depends on the indexes and other factors.
Here is my issue: I am selecting and doing multiple joins to get the correct items. It pulls in a fair number of rows, above 100,000. This query takes more than 5 minutes when the date range is set to 1 year.
I don't know if it's possible, but I am afraid that a user might extend the date range to, say, ten years and crash it.
Does anyone know how I can speed this up? Here is the query.
SELECT DISTINCT t1.first_name, t1.last_name, t1.email
FROM table1 AS t1
INNER JOIN table2 AS t2 ON t1.CU_id = t2.O_cid
INNER JOIN table3 AS t3 ON t2.O_ref = t3.I_oref
INNER JOIN table4 AS t4 ON t3.I_pid = t4.P_id
INNER JOIN table5 AS t5 ON t4.P_cat = t5.C_id
WHERE t1.subscribe =1
AND t1.Cdate >= $startDate
AND t1.Cdate <= $endDate
AND t5.store =2
I am not the greatest with MySQL, so any help would be appreciated!
Thanks in advance!
UPDATE
Here is the EXPLAIN you asked for:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t5 ref PRIMARY,C_store_type,C_id,C_store_type_2 C_store_type_2 1 const 101 Using temporary
1 SIMPLE t4 ref PRIMARY,P_cat P_cat 5 alphacom.t5.C_id 326 Using where
1 SIMPLE t3 ref I_pid,I_oref I_pid 4 alphacom.t4.P_id 31
1 SIMPLE t2 eq_ref O_ref,O_cid O_ref 28 alphacom.t3.I_oref 1
1 SIMPLE t1 eq_ref PRIMARY PRIMARY 4 alphacom.t2.O_cid 1 Using where
Also, I added an index to table5 and table4 because they don't really change; however, the other tables get around 500-1000 new entries a month... I heard you should add an index to a table that gets that many new entries. Is this true?
I'd try the following:
First, ensure there are indexes on the following tables and columns (each set of columns in parentheses should be a separate index; a DDL sketch follows the list):
table1: (subscribe, CDate), (CU_id)
table2: (O_cid), (O_ref)
table3: (I_oref), (I_pid)
table4: (P_id), (P_cat)
table5: (C_id, store)
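A sketch of that DDL (the index names are made up):
CREATE INDEX idx_table1_subscribe_cdate ON table1 (subscribe, CDate);
CREATE INDEX idx_table1_cu_id ON table1 (CU_id);
CREATE INDEX idx_table2_o_cid ON table2 (O_cid);
CREATE INDEX idx_table2_o_ref ON table2 (O_ref);
CREATE INDEX idx_table3_i_oref ON table3 (I_oref);
CREATE INDEX idx_table3_i_pid ON table3 (I_pid);
CREATE INDEX idx_table4_p_id ON table4 (P_id);
CREATE INDEX idx_table4_p_cat ON table4 (P_cat);
CREATE INDEX idx_table5_c_id_store ON table5 (C_id, store);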
Second, if adding the above indexes didn't improve things as much as you'd like, try rewriting the query as
SELECT DISTINCT t1.first_name, t1.last_name, t1.email FROM
(SELECT CU_id, first_name, last_name, email
FROM table1
WHERE subscribe = 1 AND
CDate >= $startDate AND
CDate <= $endDate) AS t1
INNER JOIN table2 AS t2
ON t1.CU_id = t2.O_cid
INNER JOIN table3 AS t3
ON t2.O_ref = t3.I_oref
INNER JOIN table4 AS t4
ON t3.I_pid = t4.P_id
INNER JOIN (SELECT C_id FROM table5 WHERE store = 2) AS t5
ON t4.P_cat = t5.C_id
I'm hoping here that the first sub-select would cut down significantly on the number of rows to be considered for joining, hopefully making the subsequent joins do less work. Ditto the reasoning behind the second sub-select on table5.
In any case, mess with it. I mean, ultimately it's just a SELECT - you can't really hurt anything with it. Examine the plans that are generated by each different permutation and try to figure out what's good or bad about each.
Share and enjoy.
Make sure your date columns and all the columns you are joining on are indexed.
Using a range (non-equality) comparison on your dates means every row has to be checked, which is inherently slower than an equality comparison.
Also, using DISTINCT adds an extra comparison to the logic that your optimizer is running behind the scenes. Eliminate that if possible.
Well, first, make a subquery to decimate table1 down to just the records you actually want to go to all the trouble of joining...
SELECT DISTINCT t1.first_name, t1.last_name, t1.email
FROM (
SELECT first_name, last_name, email, CU_id FROM table1 WHERE
table1.subscribe = 1
AND table1.Cdate >= $startDate
AND table1.Cdate <= $endDate
) AS t1
INNER JOIN table2 AS t2 ON t1.CU_id = t2.O_cid
INNER JOIN table3 AS t3 ON t2.O_ref = t3.I_oref
INNER JOIN table4 AS t4 ON t3.I_pid = t4.P_id
INNER JOIN table5 AS t5 ON t4.P_cat = t5.C_id
WHERE t5.store = 2
Then start looking at modifying the directionality of the joins.
Additionally, if t5.store is only very rarely 2, then flip this idea around: construct the t5 subquery, then join it back and back and back.
At present, your query is returning all matching rows on table2-table5, just to establish whether t5.store = 2. If any of table2-table5 have a significantly higher row count than table1, this may be greatly increasing the number of rows processed - consequently, the following query may perform significantly better:
SELECT DISTINCT t1.first_name, t1.last_name, t1.email
FROM table1 AS t1
WHERE t1.subscribe =1
AND t1.Cdate >= $startDate
AND t1.Cdate <= $endDate
AND EXISTS
(SELECT NULL FROM table2 AS t2
INNER JOIN table3 AS t3 ON t2.O_ref = t3.I_oref
INNER JOIN table4 AS t4 ON t3.I_pid = t4.P_id
INNER JOIN table5 AS t5 ON t4.P_cat = t5.C_id AND t5.store =2
WHERE t1.CU_id = t2.O_cid);
Try adding indexes on the fields that you join. It may or may not improve the performance.
Moreover, it also depends on the storage engine that you are using. If you are using InnoDB, check your configuration parameters. I faced a similar problem, because InnoDB's default configuration doesn't scale as well as MyISAM's default configuration.
As everyone says, make sure you have indexes.
You can also check whether your server is set up properly, so that it can hold more of, or maybe the entire, dataset in memory.
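For InnoDB, the main knob for keeping the working set in memory is the buffer pool; a minimal my.cnf sketch (the value is purely illustrative, size it to your data and available RAM):
[mysqld]
innodb_buffer_pool_size = 2G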
Without an EXPLAIN, there's not much to work with. Also keep in mind that MySQL will look at your JOIN and iterate through all possible solutions before executing the query, which can take time. Once you have the optimal JOIN order from the EXPLAIN, you can try to force that order in your query, eliminating this step for the optimizer.
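One way to force the order is MySQL's STRAIGHT_JOIN, which joins tables in the order they are written; a sketch against the query from the question (list the tables in the order the EXPLAIN chose):
SELECT DISTINCT t1.first_name, t1.last_name, t1.email
FROM table1 AS t1
STRAIGHT_JOIN table2 AS t2 ON t1.CU_id = t2.O_cid
STRAIGHT_JOIN table3 AS t3 ON t2.O_ref = t3.I_oref
STRAIGHT_JOIN table4 AS t4 ON t3.I_pid = t4.P_id
STRAIGHT_JOIN table5 AS t5 ON t4.P_cat = t5.C_id
WHERE t1.subscribe = 1
AND t1.Cdate >= $startDate
AND t1.Cdate <= $endDate
AND t5.store = 2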
It sounds like you should think about delivering subsets (paging) or limiting the results some other way, unless there is a reason the users need every possible row all at once. Typically 100K rows is more than the average person can digest.
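A minimal paging sketch in MySQL (the page size and ordering columns are illustrative):
SELECT DISTINCT t1.first_name, t1.last_name, t1.email
FROM table1 AS t1
INNER JOIN table2 AS t2 ON t1.CU_id = t2.O_cid
INNER JOIN table3 AS t3 ON t2.O_ref = t3.I_oref
INNER JOIN table4 AS t4 ON t3.I_pid = t4.P_id
INNER JOIN table5 AS t5 ON t4.P_cat = t5.C_id
WHERE t1.subscribe = 1
AND t1.Cdate >= $startDate
AND t1.Cdate <= $endDate
AND t5.store = 2
ORDER BY t1.last_name, t1.first_name
LIMIT 100 OFFSET 0;  -- next page: OFFSET 100, then 200, and so on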