Multiple left outer joins on Hive

In Hive, I have two tables as shown below:
SELECT * FROM p_test;
OK
p_test.id p_test.age
01 1
02 2
01 10
02 11
Time taken: 0.07 seconds, Fetched: 4 row(s)
SELECT * FROM p_test2;
OK
p_test2.id p_test2.height
02 172
01 170
Time taken: 0.053 seconds, Fetched: 2 row(s)
I need the age difference between consecutive rows of the same user in the p_test table, so I number the rows with ROW_NUMBER and self-join in HiveQL as follows:
SELECT *
FROM
(SELECT *, ROW_NUMBER() OVER(partition by id order by age asc) rn FROM p_test) t1
LEFT JOIN
(SELECT *, ROW_NUMBER() OVER(partition by id order by age asc) rn FROM p_test) t2
ON t2.id=t1.id AND t1.rn=(t2.rn+1)
LEFT JOIN
(SELECT * FROM p_test2) t_2
ON t_2.id = t1.id;
The result is:
t1.id t1.age t1.rn t2.id t2.age t2.rn t_2.id t_2.height
01 1 1 NULL NULL NULL 01 170
01 10 2 01 1 1 01 170
02 11 1 NULL NULL NULL 02 172
02 2 2 02 11 1 02 172
Time taken: 60.773 seconds, Fetched: 4 row(s)
So far so good. However, if I move the condition that joins t1 and t2 down to the last ON clause, as shown below:
SELECT *
FROM
(SELECT *, ROW_NUMBER() OVER(partition by id order by age asc) rn FROM p_test) t1
LEFT JOIN
(SELECT *, ROW_NUMBER() OVER(partition by id order by age asc) rn FROM p_test) t2
LEFT JOIN
(SELECT * FROM p_test2) t_2
ON t_2.id = t1.id
AND t2.id=t1.id AND t1.rn=(t2.rn+1);
I get this unexpected result:
t1.id t1.age t1.rn t2.id t2.age t2.rn t_2.id t_2.height
01 1 1 01 1 1 NULL NULL
01 1 1 01 10 2 NULL NULL
01 1 1 02 11 1 NULL NULL
01 1 1 02 2 2 NULL NULL
01 10 2 01 1 1 01 170
01 10 2 01 10 2 NULL NULL
01 10 2 02 11 1 NULL NULL
01 10 2 02 2 2 NULL NULL
02 11 1 01 1 1 NULL NULL
02 11 1 01 10 2 NULL NULL
02 11 1 02 11 1 NULL NULL
02 11 1 02 2 2 NULL NULL
02 2 2 01 1 1 NULL NULL
02 2 2 01 10 2 NULL NULL
02 2 2 02 11 1 02 172
02 2 2 02 2 2 NULL NULL
It seems that the condition I moved to the last line no longer works. This has bothered me for a long time. Thanks in advance to anyone who can explain it.

In your second query, the LEFT JOIN with t2 has no ON condition, so it is transformed into a CROSS JOIN. This is why you get duplication: each of the 4 rows in subquery t1 is paired with each of the 4 rows in t2, which gives 4x4=16 rows.
The ON condition does work, but it applies only to the last LEFT JOIN, the one with the t_2 subquery. It is checked only to decide which t_2 rows to join; it does not affect the first CROSS JOIN (the LEFT JOIN without an ON condition) at all.
Every join should have its own ON condition, except cross joins.
See also this answer about joins without ON condition behavior: https://stackoverflow.com/a/46843832/2700344
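The behavior described above can be reproduced outside Hive. The following is only a sketch: it uses Python's sqlite3 module as a stand-in engine (an assumption, not Hive itself) and writes the second query with an explicit CROSS JOIN, which is what the answer says the ON-less LEFT JOIN becomes. The age column is stored as an INTEGER here, also an assumption, so row numbers for id 02 differ from the question's output, where 11 sorts before 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE p_test  (id TEXT, age INTEGER);   -- INTEGER age is an assumption
CREATE TABLE p_test2 (id TEXT, height INTEGER);
INSERT INTO p_test  VALUES ('01', 1), ('02', 2), ('01', 10), ('02', 11);
INSERT INTO p_test2 VALUES ('02', 172), ('01', 170);
""")

# Same ranked subquery that the question uses for both t1 and t2.
ranked = "(SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY age) AS rn FROM p_test)"

# One ON clause per join: the t1/t2 condition restricts the first join.
per_join = conn.execute(f"""
    SELECT t1.id, t1.age, t2.age, t_2.height
    FROM {ranked} t1
    LEFT JOIN {ranked} t2 ON t2.id = t1.id AND t1.rn = t2.rn + 1
    LEFT JOIN p_test2 t_2 ON t_2.id = t1.id
""").fetchall()

# Explicit form of the second query: cross join first, then a single ON clause
# that only decides which t_2 rows attach to each (t1, t2) pair.
single_on = conn.execute(f"""
    SELECT t1.id, t1.age, t2.age, t_2.height
    FROM {ranked} t1
    CROSS JOIN {ranked} t2
    LEFT JOIN p_test2 t_2
      ON t_2.id = t1.id AND t2.id = t1.id AND t1.rn = t2.rn + 1
""").fetchall()

matched = [row for row in single_on if row[3] is not None]
print(len(per_join), len(single_on), len(matched))  # prints: 4 16 2
```

Only 2 of the 16 rows receive a height, which mirrors the two non-NULL t_2 rows in the question's unexpected output.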
BTW, you can do the same without the t2 join at all by using the lag or lead analytic functions to calculate values ordered by age.
Like this:
lag(age) over(partition by id order by age) -- to get the previous age for the same id
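As a rough illustration of the lag approach, here is a minimal sketch that computes the age difference per id without any self-join. It runs on SQLite through Python's sqlite3 module rather than Hive, and it assumes age is numeric; both are assumptions made so that the example is runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE p_test (id TEXT, age INTEGER)")  # INTEGER age is an assumption
conn.executemany(
    "INSERT INTO p_test VALUES (?, ?)",
    [("01", 1), ("02", 2), ("01", 10), ("02", 11)],
)

# LAG(age) returns the previous age within the same id; the first row of
# each id has no predecessor, so its difference is NULL (None in Python).
age_diffs = conn.execute("""
    SELECT id, age,
           age - LAG(age) OVER (PARTITION BY id ORDER BY age) AS age_diff
    FROM p_test
    ORDER BY id, age
""").fetchall()

for row in age_diffs:
    print(row)
```

Joining p_test2 for the height would then need only one LEFT JOIN with its own ON condition.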

Related

In Bigquery: How to pick max(date) row while performing full outer join in case of duplicates?

I'm performing a full outer join to combine two tables in BigQuery in order to get all rows and columns from both tables.
select distinct t1.Org,t1.begindate,t1.enddate,<fetch unit based on enddate> as f_Unit
from table1 t1
full outer join table2 t2
on t1.Org = t2.Org
The problem is that both tables have some rows with the same value in every column except the enddate and Unit columns:
table1
Org Store Product begindate enddate FalUnit
01 12 xx 2020-04-16 9999-12-31 5
01 13 yy 2011-03-23 null 0
table2
Org Store Product begindate enddate Unit
01 12 xx null null 1
01 14 zz null null 3
In that case I have to pick the max(enddate) and its respective Unit as well.
Output_Table
Org Store Product begindate enddate FalUnit Unit f_Unit
01 12 xx 2020-04-16 9999-12-31 5 null 5
01 13 yy 2011-03-23 null 0 null 0
01 14 zz null null null 3 3
How can I include this condition in the query, or is there another approach that does not use joins?
Any help will be appreciated.

Hmmm . . . I am thinking of a prioritization. Something like this:
select t1.*
from table1 t1
union all
select t2.*
from table2 t2
where not exists (select 1
from table1 t1
where t1.org = t2.org and t1.store = t2.store and t1.product = t2.product
);
At the very least, this will return your specified results for the specified data in the question.
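The prioritization idea can be checked against the question's sample rows. This is a sketch under stated assumptions: it runs on SQLite via Python's sqlite3 instead of BigQuery, stores every key column as text, and lists the columns explicitly (with the unit exposed under the hypothetical name f_unit) instead of using t1.* and t2.*.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (org TEXT, store TEXT, product TEXT,
                     begindate TEXT, enddate TEXT, falunit INTEGER);
CREATE TABLE table2 (org TEXT, store TEXT, product TEXT,
                     begindate TEXT, enddate TEXT, unit INTEGER);
INSERT INTO table1 VALUES ('01', '12', 'xx', '2020-04-16', '9999-12-31', 5),
                          ('01', '13', 'yy', '2011-03-23', NULL, 0);
INSERT INTO table2 VALUES ('01', '12', 'xx', NULL, NULL, 1),
                          ('01', '14', 'zz', NULL, NULL, 3);
""")

# table1 rows always win; a table2 row is kept only when table1 has no row
# with the same (org, store, product) key.
rows = conn.execute("""
    SELECT org, store, product, enddate, falunit AS f_unit FROM table1
    UNION ALL
    SELECT org, store, product, enddate, unit FROM table2 AS t2
    WHERE NOT EXISTS (
        SELECT 1 FROM table1 AS t1
        WHERE t1.org = t2.org AND t1.store = t2.store AND t1.product = t2.product
    )
    ORDER BY store
""").fetchall()

for row in rows:
    print(row)
```

The f_unit values come out as 5, 0 and 3, matching the Output_Table in the question for this data.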

Conditional filter with row numbers

Below is sample code that returns an ID, a date, a value, and a row number partitioned by ID and ordered by meeting date:
SELECT
c.ID
,m.CONTACT_DATE
,d.TEST
,row_number() over(partition by c.ID
order by m.CONTACT_DATE desc
) [rn]
FROM COMMUNITY C
INNER JOIN MEETING m
ON c.ID = m.CONTACT_ID
LEFT JOIN DISCUSSION d
ON m.DISCUSSION_TEST = d.TEST
Running this query returns results like:
ID CONTACT_DATE TEST rn
01 2017-05-01 NULL 1
01 2017-04-01 1 2
01 2017-03-01 NULL 3
02 2017-08-01 NULL 1
02 2017-09-01 NULL 2
02 2017-10-01 1 3
03 2017-02-01 NULL 1
03 2017-01-01 NULL 2
What I'd like to do is group by ID to get the most recent CONTACT_DATE (i.e. place the query in subquery T, then WHERE T.rn = 1 GROUP BY T.ID).
However, if an ID has any value under TEST, I instead want the most recent CONTACT_DATE that has a value, like below:
ID CONTACT_DATE TEST rn
01 2017-04-01 1 2
02 2017-10-01 1 3
03 2017-02-01 NULL 1
How can I filter for the most recent CONTACT_DATE that has a value under TEST, while still getting the most recent CONTACT_DATE if all values for that ID are NULL?

You can change your row_number ordering:
row_number() over(partition by c.ID
order by CASE WHEN d.TEST IS NOT NULL THEN 1 ELSE 2 END
, m.CONTACT_DATE desc
)
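To see the reordered row_number in action, here is a small sketch. It assumes the three joined tables have already been flattened into one hypothetical table named meetings, and it runs on SQLite through Python's sqlite3 rather than SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "meetings" is a hypothetical, pre-joined stand-in for COMMUNITY/MEETING/DISCUSSION.
conn.execute("CREATE TABLE meetings (id TEXT, contact_date TEXT, test INTEGER)")
conn.executemany("INSERT INTO meetings VALUES (?, ?, ?)", [
    ("01", "2017-05-01", None), ("01", "2017-04-01", 1), ("01", "2017-03-01", None),
    ("02", "2017-08-01", None), ("02", "2017-09-01", None), ("02", "2017-10-01", 1),
    ("03", "2017-02-01", None), ("03", "2017-01-01", None),
])

# Rows with a TEST value sort first within each id, then newest date first,
# so rn = 1 is the newest row with a value, or the newest row overall when
# every TEST for that id is NULL.
picked = conn.execute("""
    SELECT id, contact_date, test
    FROM (
        SELECT id, contact_date, test,
               ROW_NUMBER() OVER (
                   PARTITION BY id
                   ORDER BY CASE WHEN test IS NOT NULL THEN 1 ELSE 2 END,
                            contact_date DESC
               ) AS rn
        FROM meetings
    )
    WHERE rn = 1
    ORDER BY id
""").fetchall()

for row in picked:
    print(row)
```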

How to join two tables to get the following result?

I'd like to join two tables.
TABLE_A
GROUP0 GROUP1 SUM_A
---------------------------
01 A 100
01 B 200
04 D 700
TABLE_B
GROUP0 GROUP1 SUM_B
---------------------------
01 300
01 A 350
02 B 400
03 C 500
How to join the tables to get the following result?
GROUP0 GROUP1 SUM_A SUM_B
------------------------------------------------
01 0 300
01 A 100 350
01 B 200 0
02 B 0 400
03 C 0 500
04 D 700 0

You want everything in the second table and then matching rows or new group0 in the first table.
I think this is the join logic:
select coalesce(t1.group0, t2.group0) as group0,
coalesce(t1.group1, t2.group1) as group1,
t1.sum_a, t2.sum_b
from table1 t1 full outer join
table2 t2
on t1.group0 = t2.group0
where (t2.group0 is not null and (t1.group1 = t2.group1 or t1.group0 is null)) or
t2.group0 is null;
This logic is easier with union all:
select t2.group0, t2.group1, t1.sum_a, t2.sum_b
from table2 t2 left join
table1 t1
on t2.group0 = t1.group0 and t2.group1 = t1.group1
union all
select t1.group0, t1.group1, t1.sum_a, 0
from table1 t1
where not exists (select 1 from table2 t2 where t2.group0 = t1.group0);
EDIT:
The modified question is quite different from the original. That is a simple full outer join:
select coalesce(t1.group0, t2.group0) as group0,
coalesce(t1.group1, t2.group1) as group1,
coalesce(t1.sum_a, 0) as sum_a, coalesce(t2.sum_b, 0) as sum_b
from table1 t1 full outer join
table2 t2
on t1.group0 = t2.group0 and t1.group1 = t2.group1;
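The expected result can be checked with a short sketch. Because older SQLite versions have no FULL OUTER JOIN, the code below swaps in an emulation (LEFT JOIN plus UNION ALL with NOT EXISTS) on both keys; that substitution, the use of SQLite via Python's sqlite3, and storing the blank GROUP1 as an empty string are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (group0 TEXT, group1 TEXT, sum_a INTEGER);
CREATE TABLE table2 (group0 TEXT, group1 TEXT, sum_b INTEGER);
INSERT INTO table1 VALUES ('01', 'A', 100), ('01', 'B', 200), ('04', 'D', 700);
-- the blank GROUP1 in TABLE_B is stored as '' (an assumption)
INSERT INTO table2 VALUES ('01', '', 300), ('01', 'A', 350),
                          ('02', 'B', 400), ('03', 'C', 500);
""")

# First branch: every table2 row, with the matching table1 sum or 0.
# Second branch: table1 rows that have no partner in table2.
rows = conn.execute("""
    SELECT t2.group0, t2.group1, COALESCE(t1.sum_a, 0) AS sum_a, t2.sum_b
    FROM table2 AS t2
    LEFT JOIN table1 AS t1
      ON t1.group0 = t2.group0 AND t1.group1 = t2.group1
    UNION ALL
    SELECT t1.group0, t1.group1, t1.sum_a, 0
    FROM table1 AS t1
    WHERE NOT EXISTS (
        SELECT 1 FROM table2 AS t2
        WHERE t2.group0 = t1.group0 AND t2.group1 = t1.group1
    )
    ORDER BY 1, 2
""").fetchall()

for row in rows:
    print(row)
```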

sql join with not in column include in result

I have two tables like
id Name allocation
2 Ash 15
3 Alam 18
4 Rifat 20
and
Date Id Present
24 2 10
24 3 15
25 2 10
25 3 12
25 4 12
Now I want to get the following result
Date Id Alloc Present
24 2 15 10
24 3 18 15
24 4 20 NULL
25 2 15 10
25 3 18 12
25 4 20 12
I've tried a JOIN query, but it does not give the desired result.
How can I get the above result?

SELECT
t1.id
, dd.date
, t1.allocation
, t2.present
FROM
table1 AS t1 --- all items
CROSS JOIN
( SELECT DISTINCT date
FROM table2
) AS dd --- all dates
LEFT JOIN
table2 AS t2 --- present allocations
ON t2.id = t1.id
AND t2.date = dd.date ;
Tested at SQL-Fiddle (thank you @JW.)
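For readers without access to the fiddle, the query above can be re-run locally. This is a sketch that assumes SQLite (through Python's sqlite3) behaves like the original engine for this query, and that the first table's id is an integer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER, name TEXT, allocation INTEGER);
CREATE TABLE table2 (date INTEGER, id INTEGER, present INTEGER);
INSERT INTO table1 VALUES (2, 'Ash', 15), (3, 'Alam', 18), (4, 'Rifat', 20);
INSERT INTO table2 VALUES (24, 2, 10), (24, 3, 15),
                          (25, 2, 10), (25, 3, 12), (25, 4, 12);
""")

# The cross join builds every (item, date) pair first; the left join then
# attaches the attendance row when one exists and leaves NULL otherwise.
rows = conn.execute("""
    SELECT dd.date, t1.id, t1.allocation, t2.present
    FROM table1 AS t1
    CROSS JOIN (SELECT DISTINCT date FROM table2) AS dd
    LEFT JOIN table2 AS t2
      ON t2.id = t1.id AND t2.date = dd.date
    ORDER BY dd.date, t1.id
""").fetchall()

for row in rows:
    print(row)
```

The missing (24, 4) pair shows up with present set to None, which is the NULL row the question asks for.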

There is also an interesting other way:
SELECT
t2.date,
t1.id,
t1.allocation,
MAX(CASE WHEN t1.id = t2.id THEN t2.Present ELSE NULL END)
FROM
table1 t1, table2 t2
GROUP BY
t1.id, t2.date, t1.allocation
ORDER BY
t2.date, t1.id
Stolen from: SQL query inner join with 0 values
http://www.sqlfiddle.com/#!3/d2ded/44

How would you do this using SQL Server 2005?

Let's say I have a table table1 with the following structure:
id date v1 v2 v3 v4 ... vn
------------------------------
1 03 Y N 89 77 ... x
1 04 N N 9 7 ... i
1 05 N Y 6 90 ... j
1 06 N Y 9 34 ... i
1 07 N Y 0 88 ... i
2 03 N N 9 77 ... f
2 04 Y Y 90 7 ... y
2 05 Y N 6 90 ... v
2 06 N Y 9 34 ... i
2 07 N N 10 88 ... i
As you can see, the table has five rows for each id. I'd like to create two new columns:
summarystory := computed for rows with a date between 05 and 07; it is the sum of v3 over the current row and the two preceding rows.
Let me explain this better: the first two rows (date 03 and 04) must have NULL values, but for the row with date=05 it is the sum of the last three v3 values, i.e. 89+9+6=104. Likewise, the row with date=06 must equal 9+6+9=24. This has to be done for each id and each date.
This is the desired result:
id date v3 summarystory
-------------------------
1 03 89 NULL
1 04 9 NULL
1 05 6 104
1 06 9 24
1 07 0 15
2 03 9 NULL
2 04 90 NULL
2 05 6 105
2 06 9 105
2 07 10 25
VcountYN := the number of Y values in each row (based only on variables v1 and v2). So, for instance, for the first row VcountYN=1. This variable must be computed for all rows.
Any help is much appreciated.

Here's how to do the computations. Turning it into the new table is left as an exercise:
-- SQL 2012 version
Select
t.id,
t.[date],
Case When [Date] Between 5 And 7 Then
Sum(v3) over (
partition by
id
order by
[date]
rows between
2 preceding and current row
) Else Null End,
Case When v1 = 'Y' Then 1 Else 0 End +
Case When v2 = 'Y' Then 1 Else 0 End
From
table1 t;
-- SQL 2005 version
Select
t1.id,
t1.[date],
Case When t1.[date] Between 5 And 7 Then t1.v3 + IsNull(t2.v3, 0) + IsNull(t3.v3, 0) Else Null End,
Case When t1.v1 = 'Y' Then 1 Else 0 End +
Case When t1.v2 = 'Y' Then 1 Else 0 End
From
table1 t1
Left Outer Join
table1 t2
On t1.id = t2.id and t1.[date] = t2.[date] + 1
Left Outer Join
table1 t3
On t2.id = t3.id and t2.[date] = t3.[date] + 1
http://sqlfiddle.com/#!6/a1c45/2
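Both versions can be compared on the question's data. The following sketch runs them on SQLite via Python's sqlite3 (an assumption; SQL Server's IsNull becomes IFNULL and the bracketed [date] becomes a plain column name) and keeps only the columns needed for the two computed values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, date INTEGER, v1 TEXT, v2 TEXT, v3 INTEGER)")
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?, ?)", [
    (1, 3, "Y", "N", 89), (1, 4, "N", "N", 9), (1, 5, "N", "Y", 6),
    (1, 6, "N", "Y", 9), (1, 7, "N", "Y", 0),
    (2, 3, "N", "N", 9), (2, 4, "Y", "Y", 90), (2, 5, "Y", "N", 6),
    (2, 6, "N", "Y", 9), (2, 7, "N", "N", 10),
])

# Shared expression for VcountYN: one point per 'Y' in v1 and v2.
count_yn = ("CASE WHEN t1.v1 = 'Y' THEN 1 ELSE 0 END + "
            "CASE WHEN t1.v2 = 'Y' THEN 1 ELSE 0 END")

# Window version: sum of v3 over the current row and the two before it.
windowed = conn.execute(f"""
    SELECT t1.id, t1.date,
           CASE WHEN t1.date BETWEEN 5 AND 7 THEN
               SUM(t1.v3) OVER (PARTITION BY t1.id ORDER BY t1.date
                                ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
           END AS summarystory,
           {count_yn} AS vcountyn
    FROM table1 AS t1
    ORDER BY t1.id, t1.date
""").fetchall()

# Self-join version: t2 is the previous date, t3 the one before that.
self_joined = conn.execute(f"""
    SELECT t1.id, t1.date,
           CASE WHEN t1.date BETWEEN 5 AND 7
                THEN t1.v3 + IFNULL(t2.v3, 0) + IFNULL(t3.v3, 0)
           END AS summarystory,
           {count_yn} AS vcountyn
    FROM table1 AS t1
    LEFT JOIN table1 AS t2 ON t1.id = t2.id AND t1.date = t2.date + 1
    LEFT JOIN table1 AS t3 ON t2.id = t3.id AND t2.date = t3.date + 1
    ORDER BY t1.id, t1.date
""").fetchall()

print(windowed == self_joined)
for row in windowed:
    print(row)
```

Note that the self-join relies on dates being consecutive integers within each id, as they are in the sample data.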
