Comparing efficiency of Hive queries with different join orders - sql

Consider the following two queries in Hive:
SELECT *
FROM A
INNER JOIN B
INNER JOIN C
    ON A.COL = B.COL
    AND A.COL = C.COL
and
SELECT *
FROM A
INNER JOIN B
    ON A.COL = B.COL
INNER JOIN C
    ON A.COL = C.COL
Question: Are the two queries computationally the same or different? In other words, to get the fastest results, should I prefer one over the other, or does it not matter? Thanks.

On Hive 1.2 (also tested on Hive 2.3), both on Tez, the optimizer is intelligent enough to derive the ON condition for the join with table B and performs two INNER JOINs, each with its own correct ON condition.
I checked this on a simple query:
with A as (
select stack(3,1,2,3) as id
),
B as (
select stack(3,1,2,3) as id
),
C as (
select stack(3,1,2,3) as id
)
select * from A
inner join B
inner join C
ON A.id = B.id AND A.id = C.id
The EXPLAIN command shows that both joins are executed as a map-join on a single mapper, and each join has its own join condition. This is the explain output:
Map 1
File Output Operator [FS_17]
Map Join Operator [MAPJOIN_27] (rows=1 width=12)
Conds:FIL_24.col0=RS_12.col0(Inner),FIL_24.col0=RS_14.col0(Inner),HybridGraceHashJoin:true,Output:["_col0","_col1","_col2"]
At first I thought the first query would do a CROSS join with table B and the join with C would then reduce the dataset, but both queries work the same (the same plan, the same execution), thanks to the optimizer.
I also tested the same queries with map-join switched off (set hive.auto.convert.join=false;) and again got exactly the same plan for both. I did not test it on really big tables, so you had better double-check.
So, computationally both queries are the same on Hive 1.2 and Hive 2.3, both for map-join and for merge join on the reducer.
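If you want to double-check this on your own (bigger) tables, here is a minimal sketch of the comparison, reusing the table and column names from the question:
-- Compare the plans of the two forms; both should show the same join
-- operators with the same per-join ON conditions.
EXPLAIN
SELECT *
FROM A
INNER JOIN B
INNER JOIN C
    ON A.COL = B.COL AND A.COL = C.COL;

EXPLAIN
SELECT *
FROM A
INNER JOIN B ON A.COL = B.COL
INNER JOIN C ON A.COL = C.COL;

-- Optionally repeat with map-join disabled to inspect the shuffle-join plan:
-- set hive.auto.convert.join=false;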

Related

Converting Nested Subqueries into Mini Queries

I have a lot of trouble reading nested subqueries - I personally prefer to write several mini queries and work from there. I understand that more advanced SQL users find it more efficient to write nested subqueries.
For instance, in the following query:
select distinct b.table_b, a.*
from table_a a
inner join table_c b
on a.id_1 = b.id_1
inner join (
    select a.id_1, max(a.var_1) as max_var_1
    from table_a a
    group by a.id_1
) c
    on a.id_1 = b.id_1 and a.var_1 = c.max_var_1
Problem: I am trying to turn this into several different queries:
#PART 1:
create table table_1 as
select distinct b.table_b, a.*
from table_a a
inner join table_c b
    on a.id_1 = b.id_1

#PART 2:
create table table_2 as
select a.id_1, max(a.var_1) as max_var_1
from table_a a
group by a.id_1

#PART 3 (final result: final_table)
create table final_table as
select a.*, b.*
from table_1 a
inner join table_2 b
    on a.id_1 = b.id_1 and a.var_1 = b.max_var_1
My Question: Can someone please tell me if this is correct? Is this how the above nested subquery can be converted into 3 mini queries?
Subqueries usually only need to be materialized into separate tables when you use them multiple times. Even then, if the subquery returns many records, inserting them into a separate table is not recommended: with a plain SELECT the database only reads data from disk, while an INSERT also has to write to disk, so inserting many records can take much longer than just selecting them.
P.S. When a subquery does need to be materialized, CREATE TEMPORARY TABLE is mostly used.
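For illustration, a minimal sketch of that approach for the aggregate in this question (the name max_var_tmp is made up, and exact temporary-table syntax varies by database):
-- Materialize the aggregate once, then join to it like a regular table.
-- max_var_tmp is a hypothetical name; syntax differs between databases.
create temporary table max_var_tmp as
select id_1, max(var_1) as max_var_1
from table_a
group by id_1;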
Another good option is a CTE (Common Table Expression). With a CTE, the database can keep the result of the SELECT in memory and execute the subquery only once; if the subquery is then referenced multiple times, the database reuses that result instead of re-executing it.
For the performance of your query, only #PART 2 is worth splitting out; the others are unnecessary. But for even better performance, I recommend writing the query without inserting at all, using a CTE. For example:
with sub_query as (
    select id_1, max(var_1) as max_var_1
    from table_a
    group by id_1
)
select distinct b.table_b, a.*
from table_a a
inner join table_c b
    on a.id_1 = b.id_1
inner join sub_query c
    on a.id_1 = b.id_1 and a.var_1 = c.max_var_1;

What is a good way to make multiple full outer joins?

I'd like to know if anyone knows an elegant and scalable method to full outer join multiple tables, given that I might want to regularly add new tables to the join.
For now my method consists of full joining table A with table B, storing the result as a CTE, then full joining that CTE to table C, storing the result as cte2, full joining cte2 to table D... you get the idea.
Creating a new CTE every time I want to add another table to the join is not very practical, but every other solution I have found so far has the same issue: there is always some kind of endless nesting, either of CTEs or of selects (like SELECT blabla FROM (SELECT blabla2 FROM ...)).
Is there any way I don't know of that would let me perform this multiple full join without falling into an endless chain of CTEs?
Thanks
EDIT: Sorry, it seems it wasn't clear enough.
When I perform a multiple full join in one query like:
SELECT a.*, b.*, c.*
FROM tableA a
FULL JOIN tableB b
    ON a.id = b.id
FULL JOIN tableC c
    ON a.id = c.id
If an id is present in tableB and tableC but not in tableA, my result will contain two rows where there should be one, because I joined b to a and c to a, but not b to c. That's why I need to full join the result of the full join of a and b to c.
So if I have, let's say, five tables instead of three, I need to full join the result of the full join of the result of the full join of the result of the full join... x)
This fiddle illustrates the problem.
If you want the rows from tables B and C to join, you need to accommodate the fact that the data may come from table B and not A. The easiest way is probably to use COALESCE.
Your join should therefore look like:
SELECT a.*, b.*, c.*
FROM tableA a
FULL JOIN tableB b ON a.id = b.id
FULL JOIN tableC c ON COALESCE(a.id, b.id) = c.id
-- FULL JOIN tableD d ON COALESCE(a.id, b.id, c.id) = d.id
-- FULL JOIN tableE e ON COALESCE(a.id, b.id, c.id, d.id) = e.id
Most databases that support FULL JOIN also support USING, which is the simplest way to do what you want:
SELECT *
FROM tableA a
FULL JOIN tableB b USING (id)
FULL JOIN tableC c USING (id);
The semantics of USING mean that the join column is exposed only once, holding the first non-NULL value available, so each subsequent join matches on that coalesced value.
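For example, a minimal sketch with made-up inline data (PostgreSQL-style syntax) showing why this avoids the duplicated row described in the question:
-- id 3 exists only in tableB and tableC, not in tableA.
WITH tableA(id) AS (VALUES (1), (2)),
     tableB(id) AS (VALUES (2), (3)),
     tableC(id) AS (VALUES (3))
SELECT *
FROM tableA a
FULL JOIN tableB b USING (id)
FULL JOIN tableC c USING (id);
-- Returns one row per id (1, 2, 3): the second FULL JOIN matches on the
-- coalesced id column, so id 3 does not split into two rows.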

Postgresql LATERAL vs INNER JOIN

JOIN
SELECT *
FROM a
INNER JOIN (
SELECT b.id, Count(*) AS Count
FROM b
GROUP BY b.id ) AS b ON b.id = a.id;
LATERAL
SELECT *
FROM a,
LATERAL (
SELECT Count(*) AS Count
FROM b
WHERE a.id = b.id ) AS b;
I understand that here the plain join computes the subquery once and then merges it with the main query, whereas the LATERAL subquery is executed once for each row of the outer FROM.
It seems to me that if the join collapses many rows of b into one group then it will be more efficient, but if the relationship is one to one then LATERAL would be the better choice. Am I thinking about this right?
If I understand you right you are asking which of the two statements is more efficient.
You can test that yourself using EXPLAIN (ANALYZE), and I guess that the answer depends on the data:
If there are few rows in a, the LATERAL join will probably be more efficient if there is an index on b(id).
If there are many rows in a, the first query will probably be more efficient, because it can use a hash or merge join.
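Here is a sketch of how to run that comparison yourself; the index name b_id_idx is made up, and creating the index at all is an assumption based on the advice above:
-- Helps the LATERAL (nested-loop) plan when a has few rows.
CREATE INDEX IF NOT EXISTS b_id_idx ON b (id);

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM a
INNER JOIN (
    SELECT b.id, count(*) AS count
    FROM b
    GROUP BY b.id
) AS b ON b.id = a.id;

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM a,
LATERAL (
    SELECT count(*) AS count
    FROM b
    WHERE a.id = b.id
) AS b;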

SQL Joining On Multiple Tables

If I was to have a SQL query as such:
SELECT * FROM tableA a
INNER JOIN TABLEB b on a.item = b.item
INNER JOIN TABLEC c on a.item = c.item
LEFT JOIN TABLED d on d.item = c.item
Am I right in assuming the following:
Firstly, Table A is joined with Table B
Table C is joined with Table A independently of statement 1
Table D is left joined with Table C independently of statements 1 and 2
The results of statements 1, 2 and 3 are listed all together by the select
No, you are not correct. SQL is a descriptive language, not a procedural language. A SQL query describes the result set but does not specify how the query is actually executed.
In fact, almost all databases execute queries using directed acyclic graphs (DAGs). The operators in the graphs have names, but they do not correspond to the clauses in SQL.
In other words, the query is compiled to a language that you would not recognize. Along the way, it is optimized. A key part of the optimization is determining the order of the join operations.
In terms of interpreting what the query means, the joins are grouped from left to right (in the absence of parentheses):
SELECT *
FROM ((tableA a
       INNER JOIN TABLEB b ON a.item = b.item
      )
      INNER JOIN TABLEC c ON a.item = c.item
     )
     LEFT JOIN TABLED d ON d.item = c.item
I believe this is basically what you have described.

Refactor SQL statement with lots of common INNER JOINS and columns

I need suggestions on how to refactor the following SQL expression. As you can see, all the selected columns except col_N are the same, and all the inner joins except the last one in the two sub-queries are the same. This is just a snippet of my code, so I am not including the WHERE clause I have in my query. FYI: this is part of a stored procedure used by an SSRS report, and performance is a big concern for me due to thousands of records:
SELECT col_A
, col_B, col_C,...
, '' As[col_N]
FROM table_A
INNER JOIN table_B
INNER JOIN table_C
INNER JOIN table_D1
UNION
SELECT col_A
, col_B, col_C,...
, (select E.field_2 from table_E AS E where D2.field_1 = E.field_1 AND A.field_1 = E.field_2) AS [col_N]
FROM table_A as A
INNER JOIN table_B
INNER JOIN table_C
INNER JOIN table_D2 as D2
Jean's first suggestion of creating a view by joining A, B and C worked. I created a temp table by joining A, B and C and then used it to achieve a significant performance improvement (query time was cut in half for a couple thousand records)!
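For reference, a rough T-SQL sketch of that refactor. The temp table name #ABC and all join keys (A.id, B.a_id, C.a_id, D1.a_id, D2.a_id) are hypothetical, since the snippet above omits the real ON conditions:
-- Do the common joins once into a temp table.
SELECT A.field_1, col_A, col_B, col_C
INTO #ABC
FROM table_A AS A
INNER JOIN table_B AS B ON B.a_id = A.id    -- hypothetical condition
INNER JOIN table_C AS C ON C.a_id = A.id;   -- hypothetical condition

-- Then UNION only the parts that differ.
SELECT col_A, col_B, col_C, '' AS [col_N]
FROM #ABC
INNER JOIN table_D1 AS D1 ON D1.a_id = #ABC.field_1    -- hypothetical condition
UNION
SELECT col_A, col_B, col_C,
       (SELECT E.field_2
        FROM table_E AS E
        WHERE D2.field_1 = E.field_1
          AND #ABC.field_1 = E.field_2) AS [col_N]
FROM #ABC
INNER JOIN table_D2 AS D2 ON D2.a_id = #ABC.field_1;   -- hypothetical condition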