Hive join with OR condition - hive

I've got a pretty big table TAB1 and I need to select the rows from it that match the condition
(TAB1.KEY1 = TAB2.KEY1 OR TAB1.KEY1 = TAB2.KEY2) AND TAB1.KEY2 = TAB2.KEY3
where TAB2 is a very small table.
I can't do it just by joining these two tables in Hive because Hive doesn't support joins with OR conditions. I tried to split the condition using a UNION clause, but it seems much more expensive because it needs two joins against the big table.
Is there any better way to do the job? P.S. I use Hive 0.13

Map joins should only be used on the assumption that one table is much smaller than the other, and I don't think that's the case here.
However, the solutions that mention a map-side join also contain the actual generic solution, that is, changing the query from a JOIN to a JOIN/WHERE combination.
SELECT ... FROM TAB1 JOIN TAB2 ON (TAB1.KEY2 = TAB2.KEY3 )
WHERE (TAB1.KEY1 = TAB2.KEY1 OR TAB1.KEY1 = TAB2.KEY2)
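To sanity-check that the JOIN/WHERE form selects the same rows as the two-join UNION from the question, here is a minimal sketch. It runs on SQLite through Python's sqlite3 module rather than on Hive, and the ID column and the sample rows are made up for illustration. It only demonstrates result equivalence, not Hive performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TAB1 (ID INTEGER, KEY1 TEXT, KEY2 TEXT);  -- ID is a hypothetical column
    CREATE TABLE TAB2 (KEY1 TEXT, KEY2 TEXT, KEY3 TEXT);
    INSERT INTO TAB1 VALUES (1, 'a', 'x'), (2, 'b', 'x'), (3, 'c', 'x'), (4, 'a', 'y');
    INSERT INTO TAB2 VALUES ('a', 'b', 'x'), ('z', 'z', 'y');
""")

# Equi-join on the AND-ed key, OR condition moved into WHERE.
join_where = conn.execute("""
    SELECT t1.ID FROM TAB1 t1
    JOIN TAB2 t2 ON t1.KEY2 = t2.KEY3
    WHERE t1.KEY1 = t2.KEY1 OR t1.KEY1 = t2.KEY2
    ORDER BY 1
""").fetchall()

# The UNION rewrite: one join per branch of the OR.
union_of_joins = conn.execute("""
    SELECT t1.ID FROM TAB1 t1
    JOIN TAB2 t2 ON t1.KEY2 = t2.KEY3 AND t1.KEY1 = t2.KEY1
    UNION
    SELECT t1.ID FROM TAB1 t1
    JOIN TAB2 t2 ON t1.KEY2 = t2.KEY3 AND t1.KEY1 = t2.KEY2
    ORDER BY 1
""").fetchall()

print(join_where, union_of_joins)
```

Both queries return the same IDs here, but the first one touches TAB1 only once.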

Try a Hive map-side join:
select /*+ MAPJOIN(t2) */ t1.* from TAB1 t1 join TAB2 t2 on t1.KEY2 = t2.KEY3
where t1.KEY1 = t2.KEY1 OR t1.KEY1 = t2.KEY2

Related

SQL join on multiple columns or on single calculated column

I'm migrating the backend of a budget database from Access to SQL Server and I ran into an issue.
I have 2 tables (let's call them t1 and t2) that share many fields in common: Fund, Department, Object, Subcode, TrackingCode, Reserve, and FYEnd.
If I want to join the tables to find records where all 7 fields match, I can create an inner join using each field:
SELECT *
FROM t1
INNER JOIN t2
ON t1.Fund = t2.Fund
AND t1.Department = t2.Department
AND t1.Object = t2.Object
AND t1.Subcode = t2.Subcode
AND t1.TrackingCode = t2.TrackingCode
AND t1.Reserve = t2.Reserve
AND t1.FYEnd = t2.FYEnd;
This works, but it runs very slowly. When the backend was in Access, I was able to solve the problem by adding a calculated column to both tables. It basically just concatenated the fields using "-" as a delimiter. The revised query is as follows:
SELECT *
FROM t1 INNER JOIN t2
ON CalculatedColumn = CalculatedColumn
This looks cleaner and runs much faster. The problem is when I moved t1 and t2 to SQL Server, the same query gives me an error message:
I'm new to SQL Server. Can anyone explain what's going on here? Is there a setting I need to change for the calculated column?
Posting this as an answer from my comment.
Usually, this is an issue with mismatched data types between the two columns referenced. Check and make sure the data types of the two fields (CompositeID) are the same.
You have to calculate the columns before joining them as the ON clause can only access columns for the table.
It is no good to have two identical tables anyway so you should rethink your design completely.
SELECT t1a.*,t2a.*
FROM (SELECT CalculatedColumn, * FROM t1) t1a INNER JOIN (SELECT CalculatedColumn, * FROM t2 ) t2a
ON t1a.CalculatedColumn = t2a.CalculatedColumn
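One thing worth checking independently of data types is that the ON clause in the question does not say which table each CalculatedColumn belongs to. The sketch below shows what that does on SQLite via Python's sqlite3 (not SQL Server, whose exact error text is not given in the question). The tables are cut down to two of the seven fields and the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reduced, hypothetical versions of t1 and t2 (only two of the seven fields).
    CREATE TABLE t1 (Fund TEXT, Department TEXT, CalculatedColumn TEXT);
    CREATE TABLE t2 (Fund TEXT, Department TEXT, CalculatedColumn TEXT);
    INSERT INTO t1 VALUES ('100', '20', '100-20'), ('200', '30', '200-30');
    INSERT INTO t2 VALUES ('100', '20', '100-20'), ('300', '40', '300-40');
""")

# Unqualified column names: the engine cannot tell t1's column from t2's.
try:
    conn.execute(
        "SELECT * FROM t1 INNER JOIN t2 ON CalculatedColumn = CalculatedColumn"
    )
    unqualified_error = None
except sqlite3.OperationalError as exc:
    unqualified_error = str(exc)

# Qualifying both sides with the table name removes the ambiguity.
qualified = conn.execute("""
    SELECT t1.Fund, t1.Department
    FROM t1 INNER JOIN t2
    ON t1.CalculatedColumn = t2.CalculatedColumn
""").fetchall()

print(unqualified_error)
print(qualified)
```

Whether this is the error the asker actually hit is an assumption. Qualifying the columns is harmless either way.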

joining three tables between each other

Is it possible to join three tables in this way?
select T1.[...],T2.[...],T3.[...]
from T1
full outer join T2 on T1.[key]=T2.[key]
full outer join T3 on T1.[key]=T3.[key]
full outer join T2 on T2.[key]=T3.[key]
My question is: is this a valid form?
And if not, is there a way to do such an operation?
It is "valid" but the full joins are not correct. The on conditions will change them to some other type of join.
Your query has other errors. But I speculate that you want:
select T1.[...], T2.[...], T3.[...]
from T1 full join
     T2
     on T2.[key] = T1.[key] full join
     T3
     on T3.[key] = coalesce(T2.[key], T1.[key]);
It is possible to join three tables, and your example could run with some changes, but you have syntax and scoping errors in the FROM clause.
Even those aside, I don't think it will do what you intend it to do. You'll probably want to use GROUP BY
See the examples / discussion here :
Multiple FULL OUTER JOIN on multiple tables
I also used this site as a source, as it's been a while since I've touched SQL; it may be helpful to you too:
https://learnsql.com/blog/how-to-join-3-tables-or-more-in-sql/
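For engines or versions without FULL JOIN, a different technique gives the same shape of result when the key is unique in each table: build the set of all keys with UNION and LEFT JOIN every table onto it. This is a swapped-in alternative, not the query from the answer above. The sketch runs on SQLite via Python's sqlite3, and the column names k, a, b, c and the rows are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables; k stands in for [key].
    CREATE TABLE T1 (k INTEGER, a TEXT);
    CREATE TABLE T2 (k INTEGER, b TEXT);
    CREATE TABLE T3 (k INTEGER, c TEXT);
    INSERT INTO T1 VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO T2 VALUES (2, 'b2'), (3, 'b3');
    INSERT INTO T3 VALUES (3, 'c3'), (4, 'c4');
""")

# all_keys holds every key that occurs in any table; each table is then
# attached with a LEFT JOIN, so a missing side shows up as NULL (None).
rows = conn.execute("""
    SELECT all_keys.k, T1.a, T2.b, T3.c
    FROM (SELECT k FROM T1 UNION SELECT k FROM T2 UNION SELECT k FROM T3) AS all_keys
    LEFT JOIN T1 ON T1.k = all_keys.k
    LEFT JOIN T2 ON T2.k = all_keys.k
    LEFT JOIN T3 ON T3.k = all_keys.k
    ORDER BY all_keys.k
""").fetchall()

for row in rows:
    print(row)
```

Every key appears exactly once, which is what the coalesce in the FULL JOIN version is there to achieve.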

ORACLE join multiple tables performance

I have a somewhat complex question.
Let's say that I have 7 tables (20mil+ rows each) (Table1, Table2 ...) with corresponding pk (pk1, pk2, ....) (cardinality among all tables is 1:1)
I want to get my final table (using hash join) as:
Create table final_table as select
t1.column1,
t2.column2,
t3.column3,
t4.column4,
t5.column5,
t6.column6,
t7.column7
from table1 t1
join table2 t2 on t1.pk1 = t2.pk2
join table3 t3 on t1.pk1 = t3.pk3
join table4 t4 on t1.pk1 = t4.pk4
join table5 t5 on t1.pk1 = t5.pk5
join table6 t6 on t1.pk1 = t6.pk6
join table7 t7 on t1.pk1 = t7.pk7
I would like to know whether it would be faster to create partial tables first and then the final table, like this:
Create table partial_table1 as select
t1.pk1,
t1.column1,
t2.column2
from table1 t1
join table2 t2 on t1.pk1 = t2.pk2
create table partial_table2 as select
t1.pk1, t1.column1, t1.column2,
t3.column3
from partial_table1 t1
join table3 t3 on t1.pk1 = t3.pk3
create table partial_table3 as select
t1.pk1, t1.column1, t1.column2, t1.column3,
t4.column4
from partial_table2 t1
join table4 t4 on t1.pk1 = t4.pk4
...
...
...
I know it depends on RAM (because I want to use hash join), actual server usage, etc. I am not looking for a specific answer; I am looking for an explanation of why and in what situations it would be better to use partial results, or why it would be better to do all 7 joins in one select.
Thanks, I hope that my question is easy to understand.
In general, it is not better to create temporary tables. SQL engines have an optimization phase, and this optimization phase should do well at figuring out the best query plan.
In the case of a bunch of joins, this is mostly about join order, use of indexes, and the optimal algorithm.
This is a good default attitude. Does it mean that temporary tables are never useful for performance optimization? Not at all. Here are some exceptions:
The optimizer generates a suboptimal query plan. In this case, query hints can push the optimizer in the right direction. And, temporary tables can help.
Indexing the temporary tables. Sometimes an index on the temporary tables can be a big win for performance. The optimizer might not pick this up.
Re-use of temporary tables across queries.
For your particular goal of using hash joins, you can use a query hint to ensure that the optimizer does what you would like. I should note that if the joins are on primary keys, then a hash join might not be the optimal algorithm.
It is not a good idea to create temporary tables in your database. To optimize your query for reporting purposes or faster results, try using views; that can lead to much better results.
For your specific case, can you explain a bit more why you want a hash join in particular? The optimizer will determine the best plan by itself, and you don't need to worry about the type of join it performs.
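As a small illustration of the two approaches being compared, the sketch below builds the final table once with a single multi-join statement and once through an intermediate table, then checks that the contents match. It uses SQLite via Python's sqlite3 instead of Oracle, three tables instead of seven, and invented rows. The partial table carries pk1 along so that the next join has a key to use. It says nothing about which approach is faster on 20-million-row tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (pk1 INTEGER PRIMARY KEY, column1 TEXT);
    CREATE TABLE table2 (pk2 INTEGER PRIMARY KEY, column2 TEXT);
    CREATE TABLE table3 (pk3 INTEGER PRIMARY KEY, column3 TEXT);
    INSERT INTO table1 VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO table2 VALUES (1, 'b1'), (2, 'b2');
    INSERT INTO table3 VALUES (1, 'c1'), (2, 'c2');

    -- Approach 1: all joins in one statement; the optimizer picks the plan.
    CREATE TABLE final_single AS
    SELECT t1.column1, t2.column2, t3.column3
    FROM table1 t1
    JOIN table2 t2 ON t1.pk1 = t2.pk2
    JOIN table3 t3 ON t1.pk1 = t3.pk3;

    -- Approach 2: materialize a partial result, then join the next table onto it.
    CREATE TABLE partial_table1 AS
    SELECT t1.pk1, t1.column1, t2.column2
    FROM table1 t1
    JOIN table2 t2 ON t1.pk1 = t2.pk2;

    CREATE TABLE final_staged AS
    SELECT p.column1, p.column2, t3.column3
    FROM partial_table1 p
    JOIN table3 t3 ON p.pk1 = t3.pk3;
""")

single = conn.execute("SELECT * FROM final_single ORDER BY column1").fetchall()
staged = conn.execute("SELECT * FROM final_staged ORDER BY column1").fetchall()
print(single == staged)
```

The staged version writes and re-reads every intermediate table, which is the extra work the single statement lets the optimizer avoid.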

which way is better performance (select nested tables or join)? [duplicate]

SELECT * FROM dbo.table1,
dbo.table2 AS T2,
dbo.table3 AS T3,
dbo.table4 AS T4
WHERE dbo.table1.ID = T2.ID
AND T2.ID = T3.ID
AND T3.ID = T4.ID
(OR)
SELECT
*
FROM dbo.table1 T1
INNER JOIN dbo.table2 T2 ON T1.ID = T2.ID
INNER JOIN dbo.table3 T3 ON T2.ID = T3.ID
INNER JOIN dbo.table4 T4 ON T3.ID = T4.ID
There is no difference between the two. It is better to stay away from “comma joins” because a) the ANSI join syntax is more expressive and you’re going to use it anyway for LEFT JOIN, and mixing styles is asking for trouble, so you might as well just use one style; b) ANSI style is clearer.
Both will take the same time to execute; there is no performance difference.
Without the JOIN keyword, the tables are combined as a cross join, which produces every combination of rows from two or more tables. That means if table2 has 6 rows and table3 has 3 rows, a cross join will result in 18 rows. There is no relationship established between the two tables; you literally just produce every possible combination.
With an inner join, column values from one row of a table are combined with column values from another row of another (or the same) table to form a single row of data.
If a WHERE clause is added to a cross join, it behaves as an inner join as the WHERE imposes a limiting factor.
In both of the cases you mentioned above, there won't be any difference in the way the SQL engine executes them in the background. The only thing that affects performance is how effective your indexes are on the joining columns (for the join syntax) or on the WHERE clause columns (for the comma-separated table names).
So just make sure you have proper indexes, up-to-date statistics, etc.
One more important thing: you are using SELECT *. If possible, select only the columns you are interested in.
Both are joins. The first is implicit, which performs a cross join filtered by the WHERE clause, as pointed out in the previous answer; the second is explicit inner join notation. It should not make a difference in terms of performance, though.
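The row counts mentioned above are easy to reproduce. This is a minimal sketch on SQLite via Python's sqlite3 with made-up single-column tables. It confirms that 6 rows crossed with 3 rows gives 18 combinations, and that the comma form with a WHERE clause returns the same rows as the explicit INNER JOIN. It does not measure performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table2 (ID INTEGER)")
conn.execute("CREATE TABLE table3 (ID INTEGER)")
conn.executemany("INSERT INTO table2 VALUES (?)", [(i,) for i in range(1, 7)])  # 6 rows
conn.executemany("INSERT INTO table3 VALUES (?)", [(i,) for i in range(1, 4)])  # 3 rows

# Comma-separated tables with no WHERE clause: every combination of rows.
cross_count = conn.execute("SELECT COUNT(*) FROM table2, table3").fetchone()[0]

# Implicit join: comma syntax plus a WHERE condition.
implicit = conn.execute("""
    SELECT T2.ID, T3.ID
    FROM table2 AS T2, table3 AS T3
    WHERE T2.ID = T3.ID
    ORDER BY T2.ID
""").fetchall()

# Explicit join: the same condition in an ON clause.
explicit = conn.execute("""
    SELECT T2.ID, T3.ID
    FROM table2 T2
    INNER JOIN table3 T3 ON T2.ID = T3.ID
    ORDER BY T2.ID
""").fetchall()

print(cross_count, implicit == explicit)
```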

INNER JOIN with complex condition dramatically increases the execution time

I have 2 tables with several identical fields needed to be linked in JOIN condition. E.g. in each table there are fields: P1, P2. I want to write the following join query:
SELECT ... FROM Table1
INNER JOIN
Table2
ON Table1.P1 = Table2.P1
OR Table1.P2 = Table2.P2
OR Table1.P1 = Table2.P2
OR Table1.P2 = Table2.P1
When the tables are huge, this query takes a very long time to execute.
I tried to test how long a query with only one condition would take. First, I modified the tables so that all data from P1 and P2 were copied as new rows into Table1 and Table2 under a single field P. So my query is simple:
SELECT ... FROM Table1 INNER JOIN Table2 ON Table1.P = Table2.P
The result was more than surprising: the execution time dropped from many hours (the first case) to 2-3 seconds!
Why is it so different? Does it mean that complex conditions always reduce performance? How can I fix this? Would indexing P1 and P2 help? I want to keep the first DB schema and not move to a single field P.
The reason the queries are different is because of the join strategies being used by the optimizer. There are basically four ways that two tables can be joined:
1. "Hash join": Creates a hash table on one of the tables, which it uses to look up the values in the second.
2. "Merge join": Sorts both tables on the key and then reads the results sequentially for the join.
3. "Index lookup": Uses an index to look up values in one table.
4. "Nested loop": Compares each value in each table to all the values in the other table.
(And there are variations on these, such as using an index instead of a table, working with partitions, and handling multiple processors.) Unfortunately, in SQL Server Management Studio both (3) and (4) are shown as nested loop joins. If you look more closely, you can tell the difference from the parameters in the node.
In any case, your original join is one of the first three -- and it goes fast. These joins can basically only be used on "equi-joins". That is, when the condition joining the two tables includes an equality operator.
When you switch from a single equality to an "in" or set of "or" conditions, the join condition has changed from an equijoin to a non-equijoin. My observation is that SQL Server does a lousy job of optimization in this case (and, to be fair, I think other databases do pretty much the same thing). Your performance hit is the hit of going from a good join algorithm to the nested loops algorithm.
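To make the difference concrete, here is a toy sketch in plain Python of a nested loop join and a hash join. It is only a model of the two algorithms and not how SQL Server implements them. The tables are lists of (P1, P2) tuples with invented values. The point is that the hash join needs a single key to hash on, while an OR of four comparisons has no such key and falls back to comparing every pair.

```python
from collections import defaultdict


def nested_loop_join(left, right, predicate):
    """Compare every left row with every right row: len(left) * len(right) checks."""
    return [(l, r) for l in left for r in right if predicate(l, r)]


def hash_join(left, right, left_key, right_key):
    """Build a hash table on `right`, then probe it once per left row.

    Only works for equality on one key, i.e. an equi-join.
    """
    buckets = defaultdict(list)
    for r in right:
        buckets[right_key(r)].append(r)
    return [(l, r) for l in left for r in buckets.get(left_key(l), [])]


# Hypothetical rows: (P1, P2)
table1 = [(1, 10), (2, 20), (3, 30)]
table2 = [(1, 99), (20, 2), (7, 8)]

# Equi-join on P1 = P1: both algorithms apply and agree.
equi_hash = hash_join(table1, table2, lambda l: l[0], lambda r: r[0])
equi_loop = nested_loop_join(table1, table2, lambda l, r: l[0] == r[0])

# The OR condition from the question: there is no single key to hash on,
# so in this model only the nested loop can evaluate it.
or_matches = nested_loop_join(
    table1,
    table2,
    lambda l, r: l[0] == r[0] or l[1] == r[1] or l[0] == r[1] or l[1] == r[0],
)

print(equi_hash)
print(or_matches)
```

With n rows on each side, the nested loop does on the order of n * n comparisons, while the hash join does roughly n inserts and n probes. That gap matches the hours-versus-seconds difference described in the question.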
Without testing, I might suggest some of the following strategies.
Build an index on P1 and P2 in both tables. SQL Server might use the index even for a non-equijoin.
Use the union query suggested in another solution. Each query should be correctly optimized.
Assuming these are 1-1 joins, you can also do this as a set of multiple joins:
from table1 t1 left outer join
     table2 t2_11
     on t1.p1 = t2_11.p1 left outer join
     table2 t2_12
     on t1.p1 = t2_12.p2 left outer join
     table2 t2_21
     on t1.p2 = t2_21.p1 left outer join
     table2 t2_22
     on t1.p2 = t2_22.p2
And then use case/coalesce logic in the SELECT to get the value that you actually want. Although this may look more complicated, it should be quite efficient.
You can use four queries and UNION their results:
SELECT ... FROM Table1
INNER JOIN
Table2
ON Table1.P1 = Table2.P1
UNION
SELECT ... FROM Table1
INNER JOIN
Table2
ON Table1.P1 = Table2.P2
UNION
SELECT ... FROM Table1
INNER JOIN
Table2
ON Table1.P2 = Table2.P1
UNION
SELECT ... FROM Table1
INNER JOIN
Table2
ON Table1.P2 = Table2.P2
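A quick way to convince yourself that the four-way UNION matches the OR join is to run both on a small data set. This is a sketch on SQLite via Python's sqlite3. The ID columns and the rows are assumptions added so that there is something to select in place of the elided column list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- ID is a hypothetical column used to identify matched row pairs.
    CREATE TABLE Table1 (ID INTEGER, P1 INTEGER, P2 INTEGER);
    CREATE TABLE Table2 (ID INTEGER, P1 INTEGER, P2 INTEGER);
    INSERT INTO Table1 VALUES (1, 1, 10), (2, 2, 20), (3, 5, 5);
    INSERT INTO Table2 VALUES (1, 1, 99), (2, 20, 2), (3, 5, 5);
""")

select = "SELECT a.ID AS id1, b.ID AS id2 FROM Table1 a INNER JOIN Table2 b ON {}"
branches = ["a.P1 = b.P1", "a.P1 = b.P2", "a.P2 = b.P1", "a.P2 = b.P2"]

# Original form: one join with all four conditions OR-ed together.
or_join = conn.execute(
    select.format(" OR ".join(branches)) + " ORDER BY 1, 2"
).fetchall()

# Rewrite: one equi-join per condition, merged with UNION (which removes duplicates).
union_rows = conn.execute(
    " UNION ".join(select.format(cond) for cond in branches) + " ORDER BY 1, 2"
).fetchall()

# UNION ALL keeps a pair once per condition it satisfies, so it can over-count.
union_all_count = len(
    conn.execute(
        " UNION ALL ".join(select.format(cond) for cond in branches)
    ).fetchall()
)

print(or_join == union_rows, len(or_join), union_all_count)
```

Note the trade-off. UNION is needed here because a pair of rows can satisfy more than one condition, but UNION also collapses rows that are genuinely duplicated in the selected columns, which the OR join would keep. Selecting a unique key from each table, as this sketch does, avoids that.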
Does using CTEs help performance?
;WITH Table1_cte
AS
(
SELECT
...
[P] = P1
FROM Table1
UNION
SELECT
...
[P] = P2
FROM Table1
)
, Table2_cte
AS
(
SELECT
...
[P] = P1
FROM Table2
UNION
SELECT
...
[P] = P2
FROM Table2
)
SELECT ... FROM Table1_cte x
INNER JOIN
Table2_cte y
ON x.P = y.P
I suspect, as far as the processor is concerned, the above is just different syntax for the same complex conditions.