Optimising multiple join in hive


I have four Hive tables:
A - 1.2 billion records and 250 GB
B - 4 billion records and 1 TB
C - 30 billion records and 2 TB
D - 2 billion records and 100 GB
None of the tables is partitioned.
A is the parent of B (one-to-many foreign key relation), B is the parent of C (one-to-many foreign key relation) and C is the parent of D (one-to-many foreign key relation).
I have to join these tables. What would be the best approach?
I need to create a table E with columns from A, B, C and D; duplicate values in the columns from A, B and C are OK.

The tables are rather big, so a map join is not an option in this case.
If one row in A matches many rows in B, one row in B matches many in C, and one row in C matches many in D, then joining them all at once obviously multiplies the number of rows enormously.
This is quite normal join behaviour. Say A has 10 keys and B has 100 rows for each key in A: the join produces 10 x 100 = 1000 rows if the join key in A is unique, and even more if it is not. This results in a huge dataset on the join reducer.
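The arithmetic above can be checked on a toy dataset. This is only a sketch: it uses Python's built-in sqlite3 module instead of Hive, and the table layout (`A(id)`, `B(a_id, val)`) is made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for Hive here; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE B (a_id INTEGER, val INTEGER)")

# 10 unique keys in A, 100 rows in B for each key.
conn.executemany("INSERT INTO A VALUES (?)", [(k,) for k in range(10)])
conn.executemany(
    "INSERT INTO B VALUES (?, ?)",
    [(k, v) for k in range(10) for v in range(100)],
)

# Each A row is repeated once per matching B row.
joined_rows = conn.execute(
    "SELECT COUNT(*) FROM A JOIN B ON A.id = B.a_id"
).fetchone()[0]
print(joined_rows)  # 1000
```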
I suppose your final goal is to aggregate rows. In that case the best approach would be to pre-aggregate rows to the required grain and join the aggregated datasets:
select A.*, B.* --aggregate here if necessary
from
(select <some aggregation here> from A group by <key>) A
join
(select <some aggregation here> from B group by <key>) B
on A.key=B.key
and so on...
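To make the pre-aggregation idea concrete, here is a small sketch, again using sqlite3 rather than Hive. The columns `amount` and `qty` and the `SUM` aggregates are assumptions picked for the example; the real aggregation depends on what table E needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (id INTEGER, amount INTEGER);   -- hypothetical columns
CREATE TABLE B (a_id INTEGER, qty INTEGER);
INSERT INTO A VALUES (1, 10), (1, 20), (2, 5);
INSERT INTO B VALUES (1, 1), (1, 2), (1, 3), (2, 4);
""")

# Joining first multiplies rows, so a SUM taken afterwards is inflated.
naive_rows, inflated_total = conn.execute(
    "SELECT COUNT(*), SUM(A.amount) FROM A JOIN B ON A.id = B.a_id"
).fetchone()

# Aggregating each side to one row per key first keeps the join small.
aggregated = conn.execute("""
    SELECT agg_a.id, agg_a.total_amount, agg_b.total_qty
    FROM (SELECT id, SUM(amount) AS total_amount FROM A GROUP BY id) agg_a
    JOIN (SELECT a_id, SUM(qty) AS total_qty FROM B GROUP BY a_id) agg_b
      ON agg_a.id = agg_b.a_id
    ORDER BY agg_a.id
""").fetchall()

print(naive_rows, inflated_total)  # 7 95
print(aggregated)                  # [(1, 30, 6), (2, 5, 4)]
```

The join-first version produces 7 rows and a total of 95, although the amounts in A only add up to 35. The pre-aggregated join produces one row per key.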

Not sure if it is the best approach.
I have created intermediate tables for all four tables, partitioned on a common column.
I then ran the join query incrementally, one partition at a time.
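A rough sketch of that partition-by-partition loop, using sqlite3 as a stand-in for Hive. The table names (`a_part`, `b_part`, `e`), the `part` column and the sample data are all hypothetical, and it assumes that matching rows always fall in the same partition. In Hive each iteration would presumably write one partition of E instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical intermediate tables sharing a 'part' column.
CREATE TABLE a_part (part TEXT, a_id INTEGER, a_val TEXT);
CREATE TABLE b_part (part TEXT, b_id INTEGER, a_id INTEGER, b_val TEXT);
CREATE TABLE e (part TEXT, a_id INTEGER, a_val TEXT, b_id INTEGER, b_val TEXT);
INSERT INTO a_part VALUES ('p1', 1, 'a1'), ('p2', 2, 'a2');
INSERT INTO b_part VALUES
    ('p1', 10, 1, 'b10'), ('p1', 11, 1, 'b11'), ('p2', 20, 2, 'b20');
""")

# Collect the partition values first, then join one partition at a time.
parts = [row[0] for row in
         conn.execute("SELECT DISTINCT part FROM a_part ORDER BY part")]

for part in parts:
    conn.execute("""
        INSERT INTO e
        SELECT a.part, a.a_id, a.a_val, b.b_id, b.b_val
        FROM a_part a
        JOIN b_part b ON a.a_id = b.a_id AND a.part = b.part
        WHERE a.part = ?
    """, (part,))
    conn.commit()  # each partition is committed separately

rows_in_e = conn.execute("SELECT COUNT(*) FROM e").fetchone()[0]
print(parts, rows_in_e)  # ['p1', 'p2'] 3
```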

Related

Almost equal table with different running time

I’m using Oracle. I have two tables, A and B.
Table A has 8000 rows and five columns.
Table B has 5500 rows and the same five columns.
All 5500 rows in table B are also contained in table A, and they are identical.
I have a query like:
With t1 as (select distinct(id) from table A/B)
,T2 as (select a.id, c.value, d.value from Table A/B a
Join table c c on “conditions”
Join table d d on “conditions”
) select * from t2
The query with table A works fine, but with table B it hangs indefinitely.
Data types and other properties are the same in table A and table B.
Where should I look for the problem?
I compared the explain plans, but the only difference is the row “PX PARTITION HASH JOIN-FILTER” for table A versus “PX BLOCK ITERATOR ADAPTIVE” for table B.

Joining two tables (with a many to one relationship) taking long

I have two tables: Table 1 has a column X, and Table 2 has a column that relates to column X.
In Table 1 the entries of column X are repeated (i.e. entries can be {1,1,2,3,5,4,4,4,9}).
In Table 2 the entries of the related column are not repeated, but every distinct entry from Table 1 appears.
So there is a one-to-many relationship.
Now I want to join the two tables (as seen in the code below), and the performance is extremely slow!
Any ideas?
> CTE_DE_Normalised AS
(
SELECT *
FROM Table1 a
JOIN Table2 b
ON a.Id = b.Table1Id
)

left join on MS SQL 2008 R2

I'm trying to left join two tables. Table A contains 100 unique records with fields field_a_1, field_a_2 and field_a_3. The combination of field_a_1 and field_a_2 is unique.
Table B has millions of records with multiple fields. field_b_1 corresponds to field_a_1 and field_b_2 corresponds to field_a_2.
I join the two tables together like this:
select a.*, b.*
from a
left join b
on field_a_1 = field_b_1
and field_a_2 = field_b_2
Instead of getting 100 records, I get multi-million records. Why is this?

Because table B has multiple rows for each table A entry.
For example:
TableA (ID)
1
2
3
TableB (ID, data)
1 hello
1 world
1 foo
1 bar
2 data
2 words
2 more
3 words
3 boring
If you left join from TableA to TableB, you will get a row for every TableB record that matches a TableA record, i.e. all of them.
Can you explain what results you are looking for?

Because a left join returns all of the rows from the first table + all of the matching rows from the second table. Which of the millions of matching rows did you expect to get?

Whether you use a left join or an inner join doesn't really make a difference here. A JOIN will return all rows that match the join condition. So if table B has millions of rows that match the JOIN criteria, then all of those rows will be returned.
Depending on what you wish to accomplish, you should consider using the DISTINCT keyword, or GROUP BY with aggregate functions.
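The TableA/TableB example from the first answer can be reproduced with a short script. This sketch uses Python's sqlite3 module rather than SQL Server, but the join semantics are the same. The GROUP BY query is just one possible way to get back to one row per TableA record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableA (ID INTEGER);
CREATE TABLE TableB (ID INTEGER, data TEXT);
INSERT INTO TableA VALUES (1), (2), (3);
INSERT INTO TableB VALUES
    (1, 'hello'), (1, 'world'), (1, 'foo'), (1, 'bar'),
    (2, 'data'), (2, 'words'), (2, 'more'),
    (3, 'words'), (3, 'boring');
""")

# The left join returns one row per matching TableB row, not per TableA row.
left_join_count = conn.execute(
    "SELECT COUNT(*) FROM TableA a LEFT JOIN TableB b ON a.ID = b.ID"
).fetchone()[0]

# Grouping by the TableA key collapses the result to one row per TableA record.
per_key = conn.execute("""
    SELECT a.ID, COUNT(b.ID)
    FROM TableA a LEFT JOIN TableB b ON a.ID = b.ID
    GROUP BY a.ID
    ORDER BY a.ID
""").fetchall()

print(left_join_count)  # 9
print(per_key)          # [(1, 4), (2, 3), (3, 2)]
```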

Should I use a temp table?

I have a report query that takes 4 minutes, well over the 30-second maximum imposed on us.
I notice that it has a LOT of INNER JOINs. One of them joins to a Person table, which has millions of rows. I'm wondering if it would be more efficient to break up the query, with something like the following.
Assume all keys are indexed.
Table C has 8 million records, Table B has 6 million records, Table A has 400,000 records.
SELECT Fields
FROM TableA A
INNER JOIN TableB B
ON b.key = a.key
INNER JOIN TableC C
ON C.key = b.CKey
WHERE A.id = AnInput
Or
SELECT *
INTO TempTableC
FROM TableC
WHERE id = AnInput
-- TempTableC now has 1000 records
Then
SELECT Fields
FROM TableA A
INNER JOIN TableB B --Maybe put this into a filtered temp table?
ON b.key = a.key
INNER JOIN TempTableC c
ON c.AField = b.aField
WHERE a.id = AnInput
Basically, bring the result sets into temp tables, then join.

If your Person table is indexed correctly, then the INNER JOIN should not be causing such a problem. Check that you have an index created on column(s) that are joined to in all your tables. Using temp tables for what appears to be a relatively simple query seems to be papering over the cracks of an inadequate database design.
As others have said, the only way to be sure is to post your query plan.

Consolidate 2 tables via a mapping table - Full Joins?

Briefly described, I have 2 tables that have 'equivalent' rows in each other. The equivalencies are maintained in a 3rd mapping table (which maps ID A to ID B). I want to create a consolidated view that shows:
All entries that exist in Table A but have no equivalent in Table B (1 row each)
All entries that exist in Table B but have no equivalent in Table A (1 row each)
All entries that exist in both Table A and B (single row per A/B match)
It's easier to explain graphically...
I have the following scenario (shown in picture linked below):
Current Scenario
I'm sure this is much simpler than it seems - I've been chewing on this for a little while and can't get it workable.

How about just
select a.ID as A_ID, a.Desc as A_Desc, b.ID as B_ID, b.Desc as B_DESC
from Table_A as a left outer join Mapping_Table as m on a.ID = m.A_ID
full outer join Table_B as b on m.B_ID = b.ID
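The three required cases can be checked on toy data. This sketch is not the query above: to stay runnable on older SQLite builds that lack FULL OUTER JOIN, it swaps in two LEFT JOINs plus a UNION ALL for the unmatched Table_B rows, which should return the same rows. The sample rows are made up, and the description column is named `descr` because `Desc` is a reserved word.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table_A (ID INTEGER, descr TEXT);       -- hypothetical data
CREATE TABLE Table_B (ID INTEGER, descr TEXT);
CREATE TABLE Mapping_Table (A_ID INTEGER, B_ID INTEGER);
INSERT INTO Table_A VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
INSERT INTO Table_B VALUES (10, 'b10'), (20, 'b20'), (30, 'b30');
INSERT INTO Mapping_Table VALUES (1, 10), (2, 20);
""")

rows = conn.execute("""
    -- Every Table_A row, with its Table_B equivalent when one is mapped.
    SELECT a.ID AS A_ID, a.descr AS A_Desc, b.ID AS B_ID, b.descr AS B_Desc
    FROM Table_A a
    LEFT JOIN Mapping_Table m ON a.ID = m.A_ID
    LEFT JOIN Table_B b ON m.B_ID = b.ID
    UNION ALL
    -- Table_B rows that no Table_A row maps to.
    SELECT NULL, NULL, b.ID, b.descr
    FROM Table_B b
    WHERE NOT EXISTS (
        SELECT 1
        FROM Table_A a2
        JOIN Mapping_Table m ON a2.ID = m.A_ID
        WHERE m.B_ID = b.ID
    )
""").fetchall()

for row in rows:
    print(row)
```

With this data the result has four rows: two matched pairs, one row that exists only in Table_A (ID 3) and one that exists only in Table_B (ID 30).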
