how to get the difference of tables in hive

I have two tables, A and B, and I want to get all the entries that are in A but not in B. Both tables are partitioned by dt, so I tried the following:
1) select A.* from A left join B on A.key=B.key where B.key is null and A.dt=20170101 and B.dt=20170101 -- wrong result
2) select A.* from A left join B on (A.key=B.key and A.dt=20170101 and B.dt=20170101) -- wrong result
3) select A1.* from (select * from A where dt=20170101) A1 left join (select * from B where dt=20170101) B1 on A1.key=B1.key -- correct result
Why don't 1) and 2) work? I'm so confused...

1) select A.* from A left join B on A.key=B.key where B.key is null and A.dt=20170101 and B.dt=20170101 -- wrong result
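In a left join, a row of A with no match gets NULL in every column of B, so B.key is null and B.dt=20170101 can never both be true. The where clause therefore rejects every row, which basically turns your query into: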
select A.*
from A
inner join B
on 1=0
2) select A.* from A left join B on (A.key=B.key and A.dt=20170101 and B.dt=20170101) -- wrong result
A.dt=20170101 is applied only to the join condition, not to the result. A left join keeps every row of A whether or not the on condition matches, so you get the rows of A for every dt.
3) select A1.* from (select * from A where dt=20170101) A1 left join (select * from B where dt=20170101) B1 on A1.key=B1.key -- correct result
These would give you the same result as 3):
select A.*
from A
left join B
on A.key = B.key
and B.dt = 20170101
where A.dt = 20170101
or:
select A.*
from A
left join B
on A.key = B.key
and A.dt = B.dt
where A.dt = 20170101
This is a SQL Server demo, but it might help illustrate: http://rextester.com/JCZENB83359
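To see the three behaviours side by side, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for Hive. That substitution is an assumption made for convenience, and the sample rows are invented. Note that a true "in A but not in B" query also needs B.key is null in the where clause, which is included in the last query of the sketch.

```python
import sqlite3

# In-memory SQLite database standing in for the Hive tables (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (key INTEGER, dt INTEGER);
    CREATE TABLE B (key INTEGER, dt INTEGER);
    -- Invented sample data: key 2 exists in B, but only in another partition.
    INSERT INTO A VALUES (1, 20170101), (2, 20170101), (3, 20170102);
    INSERT INTO B VALUES (1, 20170101), (2, 20170102);
""")

# 1) B.dt filter in WHERE: unmatched rows have B.dt NULL, so nothing survives.
q1 = conn.execute("""
    SELECT A.key FROM A LEFT JOIN B ON A.key = B.key
    WHERE B.key IS NULL AND A.dt = 20170101 AND B.dt = 20170101
""").fetchall()

# 2) A.dt filter in ON: every row of A comes back, including other partitions.
q2 = conn.execute("""
    SELECT A.key, A.dt FROM A
    LEFT JOIN B ON (A.key = B.key AND A.dt = 20170101 AND B.dt = 20170101)
    ORDER BY A.key
""").fetchall()

# Anti-join: B's partition filter in ON, A's filter and the NULL test in WHERE.
q3 = conn.execute("""
    SELECT A.key FROM A
    LEFT JOIN B ON A.key = B.key AND B.dt = 20170101
    WHERE A.dt = 20170101 AND B.key IS NULL
    ORDER BY A.key
""").fetchall()

print(q1)  # rows from query 1
print(q2)  # rows from query 2
print(q3)  # rows from the anti-join
```

With this data, query 1 returns nothing, query 2 returns all three rows of A (including the 20170102 one), and the anti-join returns only key 2.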

Related

Conditional join in SQL Server dependent on other table values

I need to decide which table to use in a join depending on values in another table.
I tried using CASE and COALESCE but couldn't get it to work.
TableA has A and B and C and many other columns
TableB has ID and NAME columns
TableC has ID and NAME columns
My select statement is:
Select A.D, A.E, A.F From TableA A
If A.E = 1 then the following join should be used
left outer join TableB B ON A.B = B.ID
and B.NAME should be returned in the select statement
If A.E = 2 then the following join should be used
left outer join TableC C ON A.B = C.ID
and C.NAME should be returned in the select statement

Just add your conditions to the joins, and then use a CASE expression to pull the correct field into your result set, e.g.:
select A.D, A.E, A.F
, case when B.[Name] is not null then B.[Name] else C.[Name] end [Name]
from TableA A
left outer join TableB B ON A.B = B.ID and A.E = 1
left outer join TableC C ON A.B = C.ID and A.E = 2

Join TableA with a union of TableB and TableC, where each branch carries an extra column (1 for TableB, 2 for TableC), and match that column against a.E in the ON clause:
select
a.D, a.E, a.F, u.NAME
from tablea a
left join (
select *, 1 col from tableb
union all
select *, 2 col from tablec
) u on a.B = u.id and a.E = u.col
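As a rough check that the two approaches above agree, the sketch below runs both against invented rows. It uses Python's sqlite3 module instead of SQL Server (an assumption for the sake of a runnable example), and the sample values such as 'from_b' are hypothetical.

```python
import sqlite3

# SQLite stands in for SQL Server here (assumption); data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (B INTEGER, D TEXT, E INTEGER, F TEXT);
    CREATE TABLE TableB (ID INTEGER, NAME TEXT);
    CREATE TABLE TableC (ID INTEGER, NAME TEXT);
    INSERT INTO TableA VALUES (1, 'd1', 1, 'f1'), (1, 'd2', 2, 'f2'), (9, 'd3', 1, 'f3');
    INSERT INTO TableB VALUES (1, 'from_b');
    INSERT INTO TableC VALUES (1, 'from_c');
""")

# Approach 1: put the A.E test into each join condition, then pick the name.
with_case = conn.execute("""
    SELECT a.D, a.E,
           CASE WHEN b.NAME IS NOT NULL THEN b.NAME ELSE c.NAME END AS NAME
    FROM TableA a
    LEFT JOIN TableB b ON a.B = b.ID AND a.E = 1
    LEFT JOIN TableC c ON a.B = c.ID AND a.E = 2
    ORDER BY a.D
""").fetchall()

# Approach 2: tag each lookup table with a constant and join on the tag.
with_union = conn.execute("""
    SELECT a.D, a.E, u.NAME
    FROM TableA a
    LEFT JOIN (
        SELECT ID, NAME, 1 AS col FROM TableB
        UNION ALL
        SELECT ID, NAME, 2 AS col FROM TableC
    ) u ON a.B = u.ID AND a.E = u.col
    ORDER BY a.D
""").fetchall()

print(with_case)
print(with_union)
```

Both queries pick the name from TableB when E is 1, from TableC when E is 2, and return NULL when there is no matching ID.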

Query Logic best approach

I'm after the data obtained by my two queries plus any other data from the driving table. I'm using the following code but have a feeling my results are wrong.
select * from(
select * from tbl_a a
inner join tbl_b b on (a.id = b.id and a.col_a = b.col_b and a.col_c = '1')
union all
select * from tbl_a a
inner join tbl_b b on (a.col_a = b.col_b and a.col_c = '1')
where (1=1)
and a.id <> b.id
and a.start_time <= b.u_start_time
and a.end_time >= b.u_end_time
union all
select * from tbl_a a
where a.another_id
NOT IN ( -- either query above)
) results;
I'd just like to know if this makes sense or how I could possibly simplify some of this...

Here is a query covering the first two unions; it is not clear what the condition for the third union is:
SELECT *
FROM
tbl_a a
left join tbl_b b on b.id = a.id and b.col_b = a.col_a
left join tbl_b b1 on a.col_a= b1.col_b and a.id<>b1.id and a.start_time<=b1.u_start_time and a.end_time>=b1.u_end_time
WHERE
a.col_c='1'
and COALESCE(b.id,b1.id) is not null

sql - multiple layers of correlated subqueries

I have table A, B and C
I want to return all entries in table A that do not exist in table B and of that list do not exist in table C.
select * from table_A as a
where not exists (select 1 from table_B as b
where a.id = b.id)
This gives me the first result: the entries in A that are not in B. But now I want only those entries of this result that are also not in C.
I tried flavours of:
select * from table_A as a
where not exists (select 1 from table_B as b
where a.id = b.id)
AND
where not exists (select 1 from table_C as c
where a.id = c.id)
But that isn't the correct logic. Is there a way to store the results of the first query and then select from that result the entries that do not exist in table C? I'm not sure how to do that. I appreciate the help.

Try this:
select * from (
select a.*, b.id as b_id, c.id as c_id
from table_A as a
left outer join table_B as b on a.id = b.id
left outer join table_C as c on c.id = a.id
) T
where b_id is null
and c_id is null
Another implementation is this:
select a1.*
from table_A as a1
inner join (
select a.id from table_A as a
except
select b.id from table_B as b
except
select c.id from table_C as c
) as a2 on a1.id = a2.id
Note the restrictions on the form of the sub-query as described here. The second implementation, by most succinctly and clearly describing the desired operation to SQL Server, is likely to be the most efficient.

You have two WHERE clauses in (the external part of) your second query. That is not valid SQL. If you remove the second one, it should work as expected:
select * from table_A as a
where not exists (select 1 from table_B as b
where a.id = b.id)
AND
not exists (select 1 from table_C as c -- WHERE removed
where a.id = c.id) ;
Tested in SQL-Fiddle (thanks @Alexander)
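For comparison, here is a small sketch that runs the double NOT EXISTS next to the LEFT JOIN and EXCEPT formulations on invented data. It uses Python's sqlite3 module rather than SQL Server, which is an assumption made only to keep the example self-contained.

```python
import sqlite3

# SQLite stands in for SQL Server (assumption); ids are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_A (id INTEGER);
    CREATE TABLE table_B (id INTEGER);
    CREATE TABLE table_C (id INTEGER);
    INSERT INTO table_A VALUES (1), (2), (3), (4);
    INSERT INTO table_B VALUES (2);
    INSERT INTO table_C VALUES (3);
""")

# Two NOT EXISTS predicates combined with AND under a single WHERE.
not_exists = conn.execute("""
    SELECT a.id FROM table_A AS a
    WHERE NOT EXISTS (SELECT 1 FROM table_B AS b WHERE a.id = b.id)
      AND NOT EXISTS (SELECT 1 FROM table_C AS c WHERE a.id = c.id)
    ORDER BY a.id
""").fetchall()

# LEFT JOIN both tables and keep rows where neither side matched.
left_join = conn.execute("""
    SELECT a.id FROM table_A AS a
    LEFT JOIN table_B AS b ON a.id = b.id
    LEFT JOIN table_C AS c ON a.id = c.id
    WHERE b.id IS NULL AND c.id IS NULL
    ORDER BY a.id
""").fetchall()

# Set difference; note that EXCEPT also removes duplicate ids.
except_ids = conn.execute("""
    SELECT id FROM table_A
    EXCEPT SELECT id FROM table_B
    EXCEPT SELECT id FROM table_C
    ORDER BY id
""").fetchall()

print(not_exists, left_join, except_ids)
```

All three return ids 1 and 4 here. They can differ when table_A contains duplicate ids, because EXCEPT returns each id once.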

How about using LEFT JOIN?
SELECT a.*
FROM TableA a
LEFT JOIN TableB b
ON a.ID = b.ID
LEFT JOIN TableC c
ON a.ID = c.ID
WHERE b.ID IS NULL AND
c.ID IS NULL

One more option with NOT EXISTS operator
SELECT *
FROM dbo.test71 a
WHERE NOT EXISTS(
SELECT 1
FROM (SELECT b.ID
FROM dbo.test72 b
UNION ALL
SELECT c.ID
FROM dbo.test73 c) x
WHERE a.ID = x.ID
)
Option from @ypercube. Thanks for the present ;)
SELECT *
FROM dbo.test71 a
WHERE NOT EXISTS(
SELECT 1
FROM dbo.test72 b
WHERE a.ID = b.ID
UNION ALL
SELECT 1
FROM dbo.test73 c
WHERE a.ID = c.ID
);

I do not like "not exists", but if for some reason it seems more logical to you, you can give your first query an alias and then apply another "not exists" clause to it. Something like:
SELECT * FROM
( select * from tableA as a
where not exists (select 1 from tableB as b
where a.id = b.id) )
AS A_NOT_IN_B
WHERE NOT EXISTS (
SELECT 1 FROM tableC as c
WHERE c.id = A_NOT_IN_B.id
)

Left Join Multiple Tables and Avoid Duplicates

I have two tables with a 1:n relationship to my base table, both of which I want to LEFT JOIN.
-------------------------------------
 Table A      Table B      Table C
-------------------------------------
|ID|DATA|    |ID|DATA|    |ID|DATA|
-------------------------------------
 1   A1       1   B1       1   C1
 -            -            1   C2
I'm using:
SELECT * FROM TableA a
LEFT JOIN TableB b
ON a.Id = b.Id
LEFT JOIN TableC c
ON a.Id = c.Id
But this is showing duplicates for TableB:
1 A1 B1 C1
1 A1 B1 C2
How can I write this join to ignore the duplicates? Such as:
1 A1 B1 C1
1 A1 null C2

I think you need some extra logic to get what you want: when the same b row repeats across several result rows, keep its value only on the first one. You can identify the repeats using row_number() and then use case logic to make subsequent values NULL:
select a.id, a.val,
(case when row_number() over (partition by b.id, b.seqnum order by b.id) = 1 then b.val
end) as bval,
c.val as cval
from TableA a left join
(select b.*, row_number() over (partition by b.id order by b.id) as seqnum
from tableB b
) b
on a.id = b.id left join
tableC c
on a.id = c.id
I don't think you want to join B and C on id alone, because you will get multiple rows. If B has 2 rows for an id and C has 3, then you will get 6. I suspect that you just want 3. To achieve this, you want to do something like:
select *
from (select b.*, row_number() over (partition by b.id order by b.id) as seqnum
from TableB b
) b full outer join
(select c.*, row_number() over (partition by c.id order by c.id) as seqnum
from TableC c
) c
on b.id = c.id and
b.seqnum = c.seqnum join
TableA a
on a.id = coalesce(b.id, c.id)
This is enumerating the "B" and "C" lists, and then joining them by position on the list. It uses a full outer join to get the full length of the longer list.
The last join references both tables so TableA can be used as a filter. Extra ids in B and C won't appear in the results.
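The row_number() idea from this answer can be tried on the sample rows from the question. The sketch below is a hedged adaptation rather than the answer's exact query: it uses Python's sqlite3 module (window functions need SQLite 3.25 or later), uses the question's DATA column instead of val, and orders the window by c.DATA so the result is deterministic. Those choices are assumptions.

```python
import sqlite3

# SQLite stands in for the original database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (ID INTEGER, DATA TEXT);
    CREATE TABLE TableB (ID INTEGER, DATA TEXT);
    CREATE TABLE TableC (ID INTEGER, DATA TEXT);
    INSERT INTO TableA VALUES (1, 'A1');
    INSERT INTO TableB VALUES (1, 'B1');
    INSERT INTO TableC VALUES (1, 'C1'), (1, 'C2');
""")

rows = conn.execute("""
    SELECT a.ID, a.DATA,
           -- Show b.DATA only on the first output row of each B row.
           CASE WHEN row_number() OVER (PARTITION BY b.ID, b.seqnum
                                        ORDER BY c.DATA) = 1
                THEN b.DATA END AS bdata,
           c.DATA AS cdata
    FROM TableA a
    LEFT JOIN (
        -- Number the B rows within each ID so repeats can be told apart.
        SELECT b.*, row_number() OVER (PARTITION BY b.ID ORDER BY b.DATA) AS seqnum
        FROM TableB b
    ) b ON a.ID = b.ID
    LEFT JOIN TableC c ON a.ID = c.ID
    ORDER BY a.ID, b.seqnum, c.DATA
""").fetchall()

for row in rows:
    print(row)
```

On the question's data this yields one row with B1 and C1 and a second row where the B value is NULL next to C2, which matches the desired output.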

Do you want to use DISTINCT?
SELECT distinct * FROM TableA a
LEFT JOIN TableB b
ON a.Id = b.Id
LEFT JOIN TableC c
ON a.Id = c.Id

Do it as a UNION, i.e.
SELECT a.ID, b.ID, c.Id
FROM TableA a
INNER JOIN TableB b ON a.Id = b.Id
LEFT JOIN TableC c ON a.Id = c.Id
UNION
SELECT a.ID, Null, c.Id
FROM TableA a
LEFT JOIN TableC c ON a.Id = c.Id
That is, one SELECT to bring back the first row and another to bring back the second row. It's a bit rough because I don't know anything about the data you are trying to read, but the principle is sound. You may need to rework it a bit.

Aliasing derived table which is a union of two selects

I can't get the syntax right for aliasing the derived table correctly:
SELECT * FROM
(SELECT a.*, b.*
FROM a INNER JOIN b ON a.B_id = b.B_id
WHERE a.flag IS NULL AND b.date < NOW()
UNION
SELECT a.*, b.*
FROM a INNER JOIN b ON a.B_id = b.B_id
INNER JOIN c ON a.C_id = c.C_id
WHERE a.flag IS NOT NULL AND c.date < NOW())
AS t1
ORDER BY RAND() LIMIT 1
I'm getting a Duplicate column name of B_id. Any suggestions?

The problem isn't the union, it's the select a.*, b.* in each of the inner select statements - since a and b both have B_id columns, that means you have two B_id cols in the result.
You can fix that by changing the selects to something like:
select a.*, b.col_1, b.col_2 -- repeat for columns of b you need
In general, I'd avoid using select table1.* in queries you're using from code (rather than just interactive queries). If someone adds a column to the table, various queries can suddenly stop working.
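The duplicated column is easy to see by inspecting the result's column names. The sketch below uses Python's sqlite3 module instead of MySQL (an assumption; SQLite does not raise the same error, so the sketch only shows the duplicate names and how listing columns explicitly avoids them). The label column and the sample rows are hypothetical.

```python
import sqlite3

# SQLite stands in for MySQL (assumption); columns other than B_id are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (B_id INTEGER, flag INTEGER);
    CREATE TABLE b (B_id INTEGER, label TEXT);
    INSERT INTO a VALUES (1, NULL);
    INSERT INTO b VALUES (1, 'x');
""")

# a.* and b.* both contribute a B_id column to the result.
cur = conn.execute("SELECT a.*, b.* FROM a INNER JOIN b ON a.B_id = b.B_id")
dup_names = [d[0] for d in cur.description]

# Listing b's columns explicitly leaves a single B_id.
cur = conn.execute("SELECT a.*, b.label FROM a INNER JOIN b ON a.B_id = b.B_id")
fixed_names = [d[0] for d in cur.description]

print(dup_names)
print(fixed_names)
```

A derived table needs unique column names, which is why MySQL rejects the first form once it is wrapped in a subquery with an alias.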

In your derived table, you are retrieving the column B_id, which exists in both table a and table b, so you need to choose one of them or give them aliases:
SELECT * FROM
(SELECT a.*, b.[all columns except B_id]
FROM a INNER JOIN b ON a.B_id = b.B_id
WHERE a.flag IS NULL AND b.date < NOW()
UNION
SELECT a.*, b.[all columns except B_id]
FROM a INNER JOIN b ON a.B_id = b.B_id
INNER JOIN c ON a.C_id = c.C_id
WHERE a.flag IS NOT NULL AND c.date < NOW())
AS t1
ORDER BY RAND() LIMIT 1

First, you could use UNION ALL instead of UNION. The two subqueries have no rows in common because their conditions on a.flag are mutually exclusive.
Another way you could write it, is:
SELECT a.*, b.*
FROM a
INNER JOIN b
ON a.B_id = b.B_id
WHERE ( a.flag IS NULL
AND b.date < NOW()
)
OR
( a.flag IS NOT NULL
AND EXISTS
( SELECT *
FROM c
WHERE a.C_id = c.C_id
AND c.date < NOW()
)
)
ORDER BY RAND()
LIMIT 1
