Select non-duplicate records from a Hive join query - sql

I have the following Hive query:
select *
from A
left outer join B
on A.ID = B.ID
where B.ID IS NULL
The result produces duplicate data but I need only non-duplicate records.
After some research, I tried the below query:
select *
from (
select *
from A
left outer join on B
where A.ID = B.ID AND B.ID IS NULL ) join_result
group by join_result.ID
It's showing an ambiguous column reference ID error.
I do not know the column names of table A.
Please help me find a solution.
Thank you.

Hmmm . . . How about selecting only the columns from A:
Select A.*
from A left outer join
B
on A.ID = B.ID
where B.ID IS NULL;
I removed the B columns because they are not needed.
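To see why dropping the B columns matters, here is a minimal sketch. It assumes SQLite (through Python's sqlite3 module) as a stand-in for Hive, and the sample rows and the val column are made up for illustration. With select *, the result carries an ID column from each table, which is the likely source of the ambiguous column reference in the outer query; with A.* only the columns of A remain.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for Hive here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (ID INTEGER, val TEXT);
    CREATE TABLE B (ID INTEGER);
    INSERT INTO A VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO B VALUES (1), (1);
""")

# SELECT * returns the columns of A followed by the columns of B,
# so the result contains an ID column from each table.
cur = conn.execute(
    "SELECT * FROM A LEFT OUTER JOIN B ON A.ID = B.ID WHERE B.ID IS NULL"
)
star_columns = [d[0] for d in cur.description]

# SELECT A.* keeps only the columns of A.
cur = conn.execute(
    "SELECT A.* FROM A LEFT OUTER JOIN B ON A.ID = B.ID "
    "WHERE B.ID IS NULL ORDER BY A.ID"
)
a_columns = [d[0] for d in cur.description]
rows = cur.fetchall()

print(star_columns)
print(a_columns)
print(rows)
```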

One of your join columns may have NULL values. Whenever a join key is NULL, that row will not match, because NULL never compares equal to anything, including another NULL. Try replacing the NULL with a default value while joining, using NVL or COALESCE; pick a default that cannot occur as a real ID. I was looking for the same answer and found your post, but there was no solution. Since I found one, I wanted to post it here so that someone else can benefit.
select *
from A
left outer join B
on coalesce(A.ID,000) = coalesce(B.ID,000)
where B.ID IS NULL
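The NULL behaviour can be checked with a small sketch. It assumes SQLite (via Python's sqlite3) in place of Hive, and the val and tag columns and the sample rows are hypothetical. It only compares the plain join condition with the COALESCE one, using 0 as a default that does not occur as a real ID.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for Hive here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (ID INTEGER, val TEXT);
    CREATE TABLE B (ID INTEGER, tag TEXT);
    INSERT INTO A VALUES (1, 'x'), (NULL, 'y');
    INSERT INTO B VALUES (1, 'b1'), (NULL, 'b-null');
""")

# NULL = NULL evaluates to NULL (None in Python), not true.
null_eq = conn.execute("SELECT NULL = NULL").fetchone()[0]

# Plain equality: the row of A with a NULL key finds no partner in B.
plain = conn.execute(
    "SELECT A.val, B.tag FROM A LEFT OUTER JOIN B "
    "ON A.ID = B.ID ORDER BY A.val"
).fetchall()

# COALESCE on both sides: the two NULL keys are mapped to 0 and now match.
coalesced = conn.execute(
    "SELECT A.val, B.tag FROM A LEFT OUTER JOIN B "
    "ON COALESCE(A.ID, 0) = COALESCE(B.ID, 0) ORDER BY A.val"
).fetchall()

print(null_eq)
print(plain)
print(coalesced)
```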

Related

SQL inner join with conditional selection

I am new to SQL. Let's say I have two tables, table_A and table_B, and I want to create a view, view_1, from the two of them.
table_A:
id    foo
1     d
2     e
null  f

table_B:
id    name
1     a
2     b
3     c
When I use this query:
SELECT DISTINCT table_A.id, table_B.name
FROM table_A
INNER JOIN table_B ON table_B.id = table_A.id
The row with the null id in table_A does not appear in view_1, because it has no match in table_B. I want view_1 to show this null row as well, like:
id    name
1     a
2     b
null  no entry
Should I create a fourth table? I couldn't find a way.
Try this query:
SELECT DISTINCT a.id,
       (CASE WHEN b.name IS NULL OR b.name = '' THEN 'No Entry' ELSE b.name END) AS name
FROM table_A a
LEFT JOIN table_B b ON a.id = b.id
You are looking for an outer join. Thus you keep all table_A rows and join table_B rows where they exist. If no match exists, the table_B columns in the joined row are NULL.
You can replace NULLs with a value using COALESCE.
SELECT a.id, COALESCE(b.name, 'no entry') AS name
FROM table_a a
LEFT OUTER JOIN table_b b ON b.id = a.id
ORDER BY a.id NULLS LAST;
You haven't tagged your request with your DBMS. Not all DBMS support the NULLS LAST clause.
Please note that there is no DISTINCT in my query. It is not needed. And every time you think you must use DISTINCT, think twice. SELECT DISTINCT is very seldom needed. Most often it is used because the query is flawed and causes the undesired duplicates itself.
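As a quick check of the outer join plus COALESCE approach, here is a sketch that loads the sample rows from the question into SQLite through Python's sqlite3 (the choice of SQLite is an assumption, since the DBMS is not known). Because not every DBMS or version supports NULLS LAST, the sketch orders by "a.id IS NULL" first, which has the same effect here.

```python
import sqlite3

# The sample rows from the question, loaded into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_A (id INTEGER, foo TEXT);
    CREATE TABLE table_B (id INTEGER, name TEXT);
    INSERT INTO table_A VALUES (1, 'd'), (2, 'e'), (NULL, 'f');
    INSERT INTO table_B VALUES (1, 'a'), (2, 'b'), (3, 'c');
""")

# Keep every table_A row; fill the missing name with a default.
rows = conn.execute("""
    SELECT a.id, COALESCE(b.name, 'no entry') AS name
    FROM table_A a
    LEFT OUTER JOIN table_B b ON b.id = a.id
    ORDER BY a.id IS NULL, a.id  -- stand-in for NULLS LAST
""").fetchall()

for row in rows:
    print(row)
```

No DISTINCT is needed for this data, because table_B has at most one row per id.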

LEFT JOIN THREE tables

How do I create a SQL query to select only the data distinct to table A,
as in the image?
Thanks
One method is minus:
select . . .
from a
minus
select . . .
from b
minus
select . . .
from c;
Or, not exists:
select a.*
from a
where not exists (select 1 from b where . . . ) and
not exists (select 1 from c where . . . );
You don't clarify what the matching conditions are, so I've used . . . for generality.
These two versions are not the same. The first returns unique combinations of columns from a where those same columns are not in b or c. The second returns all columns from a, where another set is not in b or c.
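The difference between the two versions shows up as soon as a contains duplicates. The following sketch assumes SQLite via Python's sqlite3 (SQLite spells MINUS as EXCEPT) and a hypothetical single id column as the matching condition.

```python
import sqlite3

# Hypothetical data: a contains the value 1 twice.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER);
    CREATE TABLE c (id INTEGER);
    INSERT INTO a VALUES (1), (1), (2);
    INSERT INTO b VALUES (2);
    INSERT INTO c VALUES (3);
""")

# Set difference: duplicates collapse into a single row.
except_rows = conn.execute(
    "SELECT id FROM a EXCEPT SELECT id FROM b EXCEPT SELECT id FROM c"
).fetchall()

# Anti-join with NOT EXISTS: every qualifying row of a is kept.
not_exists_rows = conn.execute("""
    SELECT a.id FROM a
    WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
      AND NOT EXISTS (SELECT 1 FROM c WHERE c.id = a.id)
""").fetchall()

print(except_rows)
print(not_exists_rows)
```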
If you must use LEFT JOIN to implement what is really an anti join, then do this:
SELECT *
FROM a
LEFT JOIN b ON b.a_id = a.a_id
LEFT JOIN c ON c.a_id = a.a_id
WHERE b.a_id IS NULL
AND c.a_id IS NULL
This reads:
FROM: Get all rows from a
LEFT JOIN: Optionally get the matching rows from b and c as well
WHERE: In fact, no. Keep only those rows from a, for which there was no match in b and c
Using NOT EXISTS() is a more elegant way to run an anti-join, though. I tend to not recommend NOT IN() because of the delicate implications around three valued logic - which can lead to not getting any results at all.
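The three-valued-logic pitfall can be reproduced in a few lines. This is a sketch assuming SQLite via Python's sqlite3, with hypothetical tables a(a_id) and b(a_id), where b contains one NULL.

```python
import sqlite3

# Hypothetical data: b.a_id contains a NULL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (a_id INTEGER);
    CREATE TABLE b (a_id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (NULL);
""")

# "2 NOT IN (1, NULL)" evaluates to NULL, not true, so no row passes.
not_in_rows = conn.execute(
    "SELECT a_id FROM a WHERE a_id NOT IN (SELECT a_id FROM b)"
).fetchall()

# NOT EXISTS only asks whether a matching row exists, so the NULL is harmless.
not_exists_rows = conn.execute("""
    SELECT a.a_id FROM a
    WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.a_id = a.a_id)
    ORDER BY a.a_id
""").fetchall()

print(not_in_rows)
print(not_exists_rows)
```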
Side note on using Venn diagrams for joins
A lot of people like using Venn diagrams to illustrate joins. I think this is a bad habit. Venn diagrams model set operations (like UNION, INTERSECT, or in your case EXCEPT / MINUS) very well. Joins are filtered cross products, which is an entirely different kind of operation. I've blogged about it here.
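A small sketch of the "filtered cross product" point, assuming SQLite via Python's sqlite3 and hypothetical tables a(a_id) and b(a_id): a join can return more rows than either set operation would, because each row of a is paired with every matching row of b.

```python
import sqlite3

# Hypothetical data: three rows of b reference the same row of a.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (a_id INTEGER);
    CREATE TABLE b (a_id INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (1), (1), (1);
""")

# An inner join ...
join_rows = conn.execute(
    "SELECT a.a_id FROM a JOIN b ON b.a_id = a.a_id"
).fetchall()

# ... gives the same rows as a cross product followed by a filter.
cross_rows = conn.execute(
    "SELECT a.a_id FROM a CROSS JOIN b WHERE b.a_id = a.a_id"
).fetchall()

# A set operation, by contrast, yields each value once.
intersect_rows = conn.execute(
    "SELECT a_id FROM a INTERSECT SELECT a_id FROM b"
).fetchall()

print(join_rows)
print(cross_rows)
print(intersect_rows)
```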
Select what isn't in B nor C nor in A inner join B inner join C
Select * from A
where A.id not in ( select coalesce(b.id,c.id) AS ID
from b full outer join c on (b.id=c.id) )
Or, since you don't need a join, you can avoid it:
select * from A
where a.id not in (select B.ID from B union all select C.ID from C)
I would do like this:
SELECT t1.name
FROM table1 t1
LEFT JOIN table2 t2 ON t2.name = t1.name
WHERE t2.name IS NULL
Someone has already asked a question related to yours; you should see it
here

join or merge two table based on id merge

I have two tables:
I am looking for results like the ones shown at the end.
I tried UNION (only similar columns can be merged), LEFT JOIN, and RIGHT JOIN, but I am getting repeated fields in the NULL areas. What other options are there to get NULLs without repeated columns?
A full join would get all results from both tables.
select
A.ID,
A.ColA,
A.ColB,
B.ColC,
B.ColD
from TableA A
full join TableB B on A.ID = B.ID
Here is a good post to understand joins
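To illustrate what the full join returns, here is a sketch with made-up rows for TableA and TableB, run in SQLite through Python's sqlite3. Older SQLite versions do not support FULL JOIN, so the sketch emulates it with a LEFT JOIN plus the unmatched rows of TableB; on a DBMS with FULL JOIN, the query above should give the same rows.

```python
import sqlite3

# Hypothetical rows: ID 1 exists only in TableA, ID 3 only in TableB.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (ID INTEGER, ColA TEXT, ColB TEXT);
    CREATE TABLE TableB (ID INTEGER, ColC TEXT, ColD TEXT);
    INSERT INTO TableA VALUES (1, 'a1', 'b1'), (2, 'a2', 'b2');
    INSERT INTO TableB VALUES (2, 'c2', 'd2'), (3, 'c3', 'd3');
""")

# Emulated full join: all rows of TableA with their matches,
# plus the rows of TableB that have no match in TableA.
rows = conn.execute("""
    SELECT A.ID, A.ColA, A.ColB, B.ColC, B.ColD
    FROM TableA A LEFT JOIN TableB B ON A.ID = B.ID
    UNION ALL
    SELECT A.ID, A.ColA, A.ColB, B.ColC, B.ColD
    FROM TableB B LEFT JOIN TableA A ON A.ID = B.ID
    WHERE A.ID IS NULL
""").fetchall()

for row in rows:
    print(row)
```

Note that the row coming only from TableB has a NULL in the A.ID column; selecting COALESCE(A.ID, B.ID) instead would fill it in.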
You can try distinct:
select distinct * from
tableA a,
tableB b
where a.id = b.id;
It will not give any duplicate tuples.

Hive and selecting non matching records

I have two tables, A and B, and I need to select the records of A that have no match in B (that is, A minus B).
A has multiple columns and B has a single column (ID).
I tried the query below, but it takes too much time:
Select * from A where A.ID <> ( select B.ID from B).
I also tried:
Select * from A left outer join on B where A.ID = B.ID AND B.ID IS NULL
It shows me the wrong result.
Please help me find a solution.
Thank you.
Put the join condition in the ON clause and use the WHERE clause to filter:
Select * from A left outer join B on A.ID = B.ID where B.ID IS NULL
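Here is a sketch of why the placement matters, assuming SQLite via Python's sqlite3 instead of Hive and hypothetical sample rows. The attempt in the question is not valid syntax as written, so the first query is only one possible reading of it (an unconditional join with both conditions in WHERE): A.ID = B.ID and B.ID IS NULL can never both be true, so nothing comes back.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for Hive here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (ID INTEGER, val TEXT);
    CREATE TABLE B (ID INTEGER);
    INSERT INTO A VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO B VALUES (1), (2);
""")

# One reading of the attempt: both conditions end up in WHERE.
# They contradict each other, so the result is empty.
both_in_where = conn.execute("""
    SELECT A.* FROM A LEFT OUTER JOIN B ON 1 = 1
    WHERE A.ID = B.ID AND B.ID IS NULL
""").fetchall()

# Join condition in ON, NULL test in WHERE: rows of A without a match in B.
anti_join = conn.execute("""
    SELECT A.* FROM A LEFT OUTER JOIN B ON A.ID = B.ID
    WHERE B.ID IS NULL
""").fetchall()

print(both_in_where)
print(anti_join)
```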

How do I find records that are not joined?

I have two tables that are joined together.
A has many B
Normally you would do:
select * from a,b where b.a_id = a.id
to get all of the records from a that have a record in b.
How do I get just the records in a that do not have anything in b?
select * from a where id not in (select a_id from b)
Or, as some other people on this thread say:
select a.* from a
left outer join b on a.id = b.a_id
where b.a_id is null
select * from a
left outer join b on a.id = b.a_id
where b.a_id is null
The following image will help to understand SQL LEFT JOIN:
Another approach:
select * from a where not exists (select * from b where b.a_id = a.id)
The "exists" approach is useful if there is some other "where" clause you need to attach to the inner query.
SELECT id FROM a
EXCEPT
SELECT a_id FROM b;
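For comparison, here is a sketch that runs the four variants from this thread (NOT IN, outer join, NOT EXISTS, EXCEPT) against the same made-up data in SQLite via Python's sqlite3. The data contains no NULLs and a.id is unique, which is the situation in which all four agree.

```python
import sqlite3

# Hypothetical data: a.id is unique and b.a_id contains no NULLs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (a_id INTEGER NOT NULL);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (1), (3);
""")

queries = {
    "not in": "SELECT id FROM a WHERE id NOT IN (SELECT a_id FROM b)",
    "outer join": "SELECT a.id FROM a LEFT OUTER JOIN b ON a.id = b.a_id "
                  "WHERE b.a_id IS NULL",
    "not exists": "SELECT id FROM a WHERE NOT EXISTS "
                  "(SELECT * FROM b WHERE b.a_id = a.id)",
    "except": "SELECT id FROM a EXCEPT SELECT a_id FROM b",
}

# Run each variant and collect its result rows.
results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}

for name, rows in results.items():
    print(name, rows)
```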
You will probably get a lot better performance (than using 'not in') if you use an outer join:
select * from a left outer join b on a.id = b.a_id where b.a_id is null;
SELECT <columns>
FROM a WHERE id NOT IN (SELECT a_id FROM b)
With a single join it is pretty fast, but when removing records from a database that has about 50 million records and four or more joins due to foreign keys, it takes a few minutes.
It is much faster to use a WHERE NOT IN condition like this:
select a.* from a
where a.id NOT IN(SELECT DISTINCT a_id FROM b where a_id IS NOT NULL)
-- And for more joins
AND a.id NOT IN(SELECT DISTINCT a_id FROM c where a_id IS NOT NULL)
I can also recommend this approach for deleting when cascade delete is not configured.
This query takes only a few seconds.
The first approach is
select a.* from a where a.id not in (select b.ida from b)
the second approach is
select a.*
from a left outer join b on a.id = b.ida
where b.ida is null
The first approach is very expensive. The second approach is better.
With PostgreSQL 9.4, I ran EXPLAIN on both queries: the first query has a cost of cost=0.00..1982043603.32,
whereas the join query has a cost of cost=45946.77..45946.78.
For example, I search for all products that are not compatible with any vehicle. I have 100k products and more than 1m compatibilities.
select count(*) from product a left outer join compatible c on a.id=c.idprod where c.idprod is null
The join query took about 5 seconds, whereas the subquery version had not finished after 3 minutes.
Another way of writing it
select a.*
from a
left outer join b
on a.id = b.id
where b.id is null
Ouch, beaten by Nathan :)
This will protect you from nulls in the IN clause, which can cause unexpected behavior.
select * from a where id not in (select [a id] from b where [a id] is not null)
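A minimal sketch of the guarded version, assuming SQLite via Python's sqlite3 and a hypothetical column name a_id in place of [a id]. With a NULL in b, the filtered subquery still returns the unmatched rows of a.

```python
import sqlite3

# Hypothetical data: b.a_id contains a NULL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (a_id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (NULL);
""")

# Filtering NULLs out of the subquery keeps NOT IN from evaluating to NULL.
guarded_rows = conn.execute("""
    SELECT id FROM a
    WHERE id NOT IN (SELECT a_id FROM b WHERE a_id IS NOT NULL)
    ORDER BY id
""").fetchall()

print(guarded_rows)
```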