Hive: combining a full outer join with NOT IN?

I have a Hive query that looks like this:
SELECT
a.uid,
a.order_id
FROM table_a a
FULL OUTER JOIN
(
SELECT
uid,
order_id
FROM table_b
) b
ON (a.uid = b.uid AND a.order_id = b.order_id)
This query returns a set of uids and order_ids.
I also have a black_listed table that holds a set of uids, and I want those blacklisted uids excluded from the final result.
Is there a way to add this remove-blacklisted-uids step to the query above, so that everything happens in a single query?
For example, if black_listed contains uid1 and uid2, neither uid should appear in the final result of the first query.

This can be done with a left join against the blacklist, keeping only the rows that find no match:
SELECT
a.uid,
a.order_id
FROM table_a a
FULL OUTER JOIN
(
SELECT
uid,
order_id
FROM table_b
) b
ON (a.uid = b.uid AND a.order_id = b.order_id)
LEFT JOIN black_listed bl on bl.id = a.uid
WHERE bl.id IS NULL
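The anti-join pattern used above (LEFT JOIN, then keep rows where the right side is NULL) can be tried out in a small sketch. It uses Python's sqlite3 rather than Hive so that it runs anywhere, and the orders table is a hypothetical stand-in for the result of the full outer join; the id column name on black_listed follows the answer. One caveat worth keeping in mind: in a FULL OUTER JOIN, a.uid is NULL for rows that exist only in table_b, so those rows never match the blacklist; joining on COALESCE(a.uid, b.uid) is one way to cover them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- 'orders' is a hypothetical stand-in for the joined result set
    CREATE TABLE orders (uid TEXT, order_id INTEGER);
    CREATE TABLE black_listed (id TEXT);
    INSERT INTO orders VALUES ('uid1', 10), ('uid2', 20), ('uid3', 30);
    INSERT INTO black_listed VALUES ('uid1'), ('uid2');
""")

# Anti-join: keep only the rows for which the LEFT JOIN found no blacklist entry.
rows = conn.execute("""
    SELECT o.uid, o.order_id
    FROM orders o
    LEFT JOIN black_listed bl ON bl.id = o.uid
    WHERE bl.id IS NULL
    ORDER BY o.uid
""").fetchall()

print(rows)  # [('uid3', 30)]
```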

Related

Pull columns from series of joins in SQL

I am stuck on a problem at work: I need to pull two columns from a base table and one column from a series of joins.
Please note that I cannot provide real data, so I am using dummy column and table names; the real project has hundreds of columns.
SELECT A.Name, B.Age, D.Sal
FROM A
LEFT JOIN B ON A.iD = B.id AND B.Date = CURRENT_DATE
/* join A and B and return the distinct values of B.XYZ */
INNER JOIN C ON C.iD = B.XYZ
/* join B and C; take the C.YYY column for the next join */
INNER JOIN D ON D.id = C.YYY
/* take the D.Sal column from this join */
WHERE A.Dept = 'IT'
I have written this query, but it takes forever to run because the B.XYZ column has a lot of duplicates. How can I get the distinct values of B.XYZ from that join?
To join table B, first build a distinct set of the columns you need from B, then join to that.
SELECT
A.Name,
B.Age,
D.Sal
From A
LEFT JOIN ( -- Instead of all cols (*), just id, Date, Age and xyz might do
SELECT DISTINCT * FROM B
) B ON A.iD = B.id AND B.Date = CURRENT_DATE
--(/* join A and B table and return distinct column which is B.XYZ */)
INNER JOIN C ON C.iD = B.XYZ
--(/*join B and C take C.YYY column for next join */)
INNER JOIN D ON D.id = C.YYY
--(/* Take out the D.Sal column from this join */)
WHERE A.Dept='IT'
You say you get the same rows multifold, because for a b.id, date and age you get the same xyz more than once, or so I understand it.
One option is to join with a subquery that gets the distinct data:
SELECT a.name, dist_b.age, d.sal
FROM a
LEFT JOIN
(
SELECT DISTINCT id, date, age, xyz FROM b
) dist_b ON dist_b.id = a.id and dist_b.date = CURRENT_DATE
INNER JOIN c ON c.id = dist_b.xyz
INNER JOIN d ON d.id = c.yyy
WHERE a.dept = 'IT';
Of course you can even move the date condition inside the subquery:
SELECT a.name, dist_b.age, d.sal
FROM a
LEFT JOIN
(
SELECT DISTINCT id, age, xyz FROM b WHERE date = CURRENT_DATE
) dist_b ON dist_b.id = a.id
INNER JOIN c ON c.id = dist_b.xyz
INNER JOIN d ON d.id = c.yyy
WHERE a.dept = 'IT';
Your LEFT OUTER JOIN doesn't work, by the way. Because you inner join the subsequent tables, a match must exist, so your outer join effectively becomes an inner join. For the outer join to work, you would have to outer join the subsequent tables, too.
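The last point is easy to check on toy data. The sketch below uses Python's sqlite3 as a stand-in engine, with made-up rows and trimmed-down versions of tables a, b and c: Bob has no row in b, so he survives only when the join to c is also an outer join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, name TEXT);
    CREATE TABLE b (id INTEGER, xyz INTEGER);
    CREATE TABLE c (id INTEGER, yyy INTEGER);
    INSERT INTO a VALUES (1, 'Ann'), (2, 'Bob');  -- Bob has no row in b
    INSERT INTO b VALUES (1, 100);
    INSERT INTO c VALUES (100, 7);
""")

# LEFT JOIN followed by INNER JOIN: the inner join discards the NULL-extended row.
left_then_inner = conn.execute("""
    SELECT a.name, c.yyy
    FROM a
    LEFT JOIN b ON b.id = a.id
    INNER JOIN c ON c.id = b.xyz
    ORDER BY a.id
""").fetchall()

# Outer joins all the way down: the unmatched row is preserved.
left_all_the_way = conn.execute("""
    SELECT a.name, c.yyy
    FROM a
    LEFT JOIN b ON b.id = a.id
    LEFT JOIN c ON c.id = b.xyz
    ORDER BY a.id
""").fetchall()

print(left_then_inner)   # [('Ann', 7)]
print(left_all_the_way)  # [('Ann', 7), ('Bob', None)]
```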

SQL Server: how to write a DELETE statement with a GROUP BY

I am using SQL Server 2008.
I have a SELECT query as follows:
SELECT
Apples.ID, COUNT(p.Apples_ID)
FROM
Apples
LEFT JOIN
Table_B tb ON Apples.ID = tb.Apples_ID
LEFT JOIN
Table_C tc ON tb.xID = tc.xID
LEFT JOIN
Pips p ON tb.Apples_ID = p.Apples_ID
WHERE
tc.X IS NULL
GROUP BY
Apples.ID
The tables are:
Apples, which has a unique entry (ID) for each apple.
Pips, which can hold dozens of pips belonging to one apple.
Table_B and Table_C, which are mapping tables used to refine the search.
I need to group the results because I do not want one Apples row for each and every pip an apple can have. The SELECT statement works and returns a list of unique Apple IDs.
I now want to DELETE these Apples. I changed my statement to:
DELETE Apples
FROM Apples
LEFT JOIN Table_B tb ON Apples.ID = tb.Apples_ID
LEFT JOIN Table_C tc ON tb.xID = tc.xID
LEFT JOIN Pips p ON tb.Apples_ID = p.Apples_ID
WHERE tc.X IS NULL
GROUP BY Apples.ID
but got a syntax error on the GROUP BY.
I tried:
DELETE x
FROM
(SELECT Apples.ID
FROM Apples
LEFT JOIN Table_B tb ON Apples.ID = tb.Apples_ID
LEFT JOIN Table_C tc ON tb.xID = tc.xID
LEFT JOIN Pips p ON tb.Apples_ID = p.Apples_ID
WHERE tc.X IS NULL
GROUP BY Apples.ID) x;
But I got an error:
View or function not updatable because the modification affects multiple base tables
How can I delete these rows I have identified in the SELECT, without using a temporary table or script?
As others have pointed out, the sub-query approach can be adapted to work by using an IN ( ... ) clause on a normal single-table delete. This is the simplest way of adapting any select statement to a delete:
DELETE FROM Apples
WHERE ID IN (
-- Sub-query selecting a single column of ID values
)
The sub-query can then be as complex as you like, using GROUP BY, HAVING, etc, as long as it only has one column in the SELECT list.
In your specific case, however, there is no need:
You have no HAVING clause, so the COUNT() doesn't change the rows to delete
The LEFT JOIN to the Pips table has no effect on the result other than the COUNT()
Mentioning the same row twice in a DELETE has no effect, so eliminating duplicates is unnecessary
You can therefore simplify this particular case without using the sub-query:
DELETE Apples
FROM Apples
LEFT JOIN Table_B tb ON Apples.ID = tb.Apples_ID
LEFT JOIN Table_C tc ON tb.xID = tc.xID
WHERE tc.X IS NULL
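As a rough check of the IN ( ... ) approach described above, here is a small sketch using Python's sqlite3 in place of SQL Server (the joined DELETE ... FROM form is T-SQL specific, so only the sub-query form is shown). The sample rows are made up: apple 1 maps to a Table_C row with a non-NULL X, apple 2 maps twice with no Table_C match, and apple 3 has no mapping at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Apples (ID INTEGER PRIMARY KEY);
    CREATE TABLE Table_B (Apples_ID INTEGER, xID INTEGER);
    CREATE TABLE Table_C (xID INTEGER, X TEXT);
    INSERT INTO Apples VALUES (1), (2), (3);
    INSERT INTO Table_B VALUES (1, 10), (2, 20), (2, 21);
    INSERT INTO Table_C VALUES (10, 'keep');
""")

# The sub-query may return ID 2 twice; IN does not care about duplicates,
# so neither GROUP BY nor DISTINCT is needed for the delete to be correct.
conn.execute("""
    DELETE FROM Apples
    WHERE ID IN (
        SELECT a.ID
        FROM Apples a
        LEFT JOIN Table_B tb ON a.ID = tb.Apples_ID
        LEFT JOIN Table_C tc ON tb.xID = tc.xID
        WHERE tc.X IS NULL
    )
""")

remaining = [row[0] for row in conn.execute("SELECT ID FROM Apples ORDER BY ID")]
print(remaining)  # [1]
```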
DELETE FROM Apples WHERE ID IN
(
SELECT a.ID FROM Apples a
LEFT JOIN Table_B tb ON a.ID = tb.Apples_ID
LEFT JOIN Table_C tc ON tb.xID = tc.xID
LEFT JOIN Pips p ON tb.Apples_ID = p.Apples_ID
WHERE tc.X IS NULL
GROUP BY a.ID
)
Are you trying to achieve this?
DELETE FROM APPLES WHERE ID IN
(
SELECT Apples.ID FROM Apples
LEFT JOIN Table_B tb ON Apples.ID = tb.Apples_ID
LEFT JOIN Table_C tc ON tb.xID = tc.xID
LEFT JOIN Pips p ON tb.Apples_ID = p.Apples_ID
WHERE tc.X IS NULL
GROUP BY Apples.ID
);
The only condition that actually affects the result is tc.X IS NULL. It is NULL either when there is no matching row or when a row matches but its X field is NULL:
DELETE FROM Apples
WHERE ID IN
(
SELECT Apples.ID FROM Apples
LEFT JOIN Table_B tb ON Apples.ID = tb.Apples_ID
LEFT JOIN Table_C tc ON tb.xID = tc.xID
WHERE tc.X IS NULL
);

Select and update on the rows in a table by using joins multiple table in SQL Server

I am trying to update 1.2 million rows in a table whose data was inserted incorrectly by a legacy application. I am not very good at writing efficient SQL queries, as this is the first time I have worked with a data set this large.
I have written the query below, and it takes a very long time to run. I have commented out my update logic in the statement.
SELECT T1.Old_id,
T1.Report_id,
T2.New_id /* update a set file_id = T2.new_id*/
FROM
(SELECT A.File_id AS Old_id,
A.Id AS Report_id,
A.User_id AS User_id
FROM A
INNER JOIN B ON A.Id = B.A_id
INNER JOIN C ON B.Id = C.B_id
INNER JOIN D ON C.Id = D.C_id
INNER JOIN E ON D.Id = E.D_id
WHERE E.Name = 'student_report') AS T1
LEFT JOIN
(SELECT Max(C.Report_id) AS New_id,
C.Created_by AS User_id
FROM C
INNER JOIN D ON C.Id = D.C_id
INNER JOIN E ON D.Id = E.D_id
WHERE E.Name = 'teacher_report'
GROUP BY C.Created_by) AS T2 ON T1.User_id = T2.User_id /* where a.id = T1.report_id*/
I need to set file_id in table A to the report_id from C. With a small data set, the SELECT works fine and gives the intended result, but on the server, which has 1.2 million rows, it takes an extremely long time.
Is there a way to combine those two sub-queries into one and make it work for the UPDATE as well? The UPDATE also fails because the second sub-query has a GROUP BY.
The main problem is joining on sub-queries.
The second problem: when the same result set is used more than once, you should put the common result set in a CTE or a #temp table.
CREATE TABLE #temp (B_id int, Report_id int, Created_by int, Name varchar(100))

-- Run the shared B -> C -> D -> E join once and keep both report types
INSERT INTO #temp (B_id, Report_id, Created_by, Name)
SELECT C.B_id, C.Report_id, C.Created_by, E.Name
FROM B
INNER JOIN C ON B.Id = C.B_id
INNER JOIN D ON C.Id = D.C_id
INNER JOIN E ON D.Id = E.D_id
WHERE E.Name IN ('teacher_report', 'student_report')

;WITH CTE AS
(
SELECT MAX(t.Report_id) AS New_id,
t.Created_by AS User_id
FROM #temp t
WHERE t.Name = 'teacher_report'
GROUP BY t.Created_by
)
SELECT A.File_id AS Old_id,
A.Id AS Report_id,
CTE.New_id /* update a set file_id = CTE.New_id */
FROM A
INNER JOIN B ON A.Id = B.A_id
INNER JOIN #temp t ON B.Id = t.B_id
LEFT JOIN CTE ON CTE.User_id = A.User_id
WHERE t.Name = 'student_report'
My script is not tested, so you may need to fix minor bugs.
In the temp table, carefully define all the columns this query requires, along with their data types.
Analyze the query cost using the execution plan. Find the table that causes the delay, then check whether it is properly indexed.

Select using LEFT OUTER JOIN with condition

I have two tables Table A and Table B
Table A
1. *id*
2. *name*
Table B
1. *A.id*
2. *datetime*
I want to select
1. *A.id*
2. *A.name*
3. *B.datetime*
I want this even if table B does not contain a row with A.id for the specific day; in that case the datetime column should be NULL.
e.g
Table A contains
1. *(1 , Haris)*
2. *(2, Hashsim)*
Table B Contains following for today's date.
1. *(1, '2014-12-26 08:00:00')*
The query should return two rows, with ids 1 and 2, instead of only id 1.
Using LEFT OUTER JOIN with a WHERE clause effectively turns it into an INNER JOIN. How can I work around that?
SELECT A.id, A.name, b.datetime
FROM A
LEFT Outer JOIN B on B.id = A.id
Use LEFT OUTER JOIN to get all the rows from the left table; rows that have no match will have NULL in the right table's columns. Put the date condition in the ON clause rather than the WHERE clause so that unmatched rows are kept:
SELECT A.id,
A.name,
B.[datetime]
FROM tableA A
LEFT OUTER JOIN tableB B
ON A.Id = B.id
AND B.[datetime] < @date
SELECT a.id, a.name, b.datetime
FROM A
LEFT JOIN B on B.aid = a.id
WHERE coalesce(B.datetime, '1900-01-01') < @MyDateTime
Select A.id,A.name,B.datetime
from tableA A
Left join
(
SELECT B.ID,B.datetime
FROM tableB B
WHERE B.datetime <= 'myDateTime'
)B
ON A.id = B.ID
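The difference between filtering in ON and filtering in WHERE can be seen with the sample rows from the question. This is a sketch in Python's sqlite3 rather than SQL Server; the a_id column name and the day boundaries passed as parameters are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER, name TEXT);
    CREATE TABLE B (a_id INTEGER, datetime TEXT);  -- a_id is an assumed column name
    INSERT INTO A VALUES (1, 'Haris'), (2, 'Hashsim');
    INSERT INTO B VALUES (1, '2014-12-26 08:00:00');
""")
day_start, day_end = "2014-12-26 00:00:00", "2014-12-27 00:00:00"

# Date filter in the ON clause: unmatched rows of A are kept, with NULL for B.
in_on = conn.execute("""
    SELECT A.id, A.name, B.datetime
    FROM A
    LEFT JOIN B ON B.a_id = A.id
               AND B.datetime >= ? AND B.datetime < ?
    ORDER BY A.id
""", (day_start, day_end)).fetchall()

# Same filter in WHERE: the NULL-extended row fails the comparison and is dropped.
in_where = conn.execute("""
    SELECT A.id, A.name, B.datetime
    FROM A
    LEFT JOIN B ON B.a_id = A.id
    WHERE B.datetime >= ? AND B.datetime < ?
    ORDER BY A.id
""", (day_start, day_end)).fetchall()

print(in_on)     # [(1, 'Haris', '2014-12-26 08:00:00'), (2, 'Hashsim', None)]
print(in_where)  # [(1, 'Haris', '2014-12-26 08:00:00')]
```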

Multi join tables with aggregate using main query parameter in subquery

Quick and easy question.
Say I am joining two tables, with a main query and a sub-query. The sub-query pulls out one extra column for my result set. The LEFT JOIN accounts for the fact that, if there is no matching row in table b, I still want all the rows from table a.
select
a.*, b.sumb
from
ta a
left join
(select
b.uid, sum(b.amount) as sumb
from tb b
group by b.uid) b on a.uid = b.uid
where
a.eid = 'value';
Works great. The problem is that I need to limit the rows summed by the inner query to the matching year; otherwise the query just sums everything.
Something like this:
select
a.*, b.sumb
from
ta a
left join
(select
b.uid, sum(b.amount) as sumb
from tb b
where b.year = a.year
group by b.uid) b on a.uid = b.uid
where
a.eid = 'value';
Unfortunately, this WHERE clause throws an error:
The multi-part identifier "a.year" could not be bound.
Can someone with the knowhow point me in the right direction please?
You want an additional join and group by column:
select a.*, b.sumb
from ta a left join
(select b.uid, b.year, sum(b.amount) as sumb
from tb b
group by b.uid, b.year
) b
on a.uid = b.uid and a.year = b.year
where a.eid = 'value';
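A quick way to convince yourself that grouping by (uid, year) and joining on both columns gives per-year sums is to run it on a few rows. The sketch below uses Python's sqlite3 instead of SQL Server, with invented sample data; the table and column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ta (uid INTEGER, year INTEGER, eid TEXT);
    CREATE TABLE tb (uid INTEGER, year INTEGER, amount INTEGER);
    INSERT INTO ta VALUES (1, 2013, 'value'), (1, 2014, 'value'), (2, 2014, 'value');
    INSERT INTO tb VALUES (1, 2013, 5), (1, 2014, 10), (1, 2014, 20);
""")

# The derived table cannot see a.year, so it groups by year itself and
# the year match moves into the join condition.
rows = conn.execute("""
    SELECT a.uid, a.year, b.sumb
    FROM ta a
    LEFT JOIN (
        SELECT uid, year, SUM(amount) AS sumb
        FROM tb
        GROUP BY uid, year
    ) b ON a.uid = b.uid AND a.year = b.year
    WHERE a.eid = 'value'
    ORDER BY a.uid, a.year
""").fetchall()

print(rows)  # [(1, 2013, 5), (1, 2014, 30), (2, 2014, None)]
```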