SQL: Eliminating duplicates with specific conditions - sql

I have a SQL database that SOMETIMES has duplicate values, but only in one column (phone number). If there is a duplicate, the other attributes in the same row are filled in with NULL. In other cases, the phone number is not duplicated, but still has NULL values in the rows. Ex:
first_name   last_name   phone_number
--------------------------------------
john         smith       123-456-7890
NULL         NULL        123-456-7890
NULL         NULL        456-789-1011
carry        smith       121-314-1516
I'm trying to write a query that eliminates cases where the phone number is duplicated and the other values in the row are NULL, to get:
first_name   last_name   phone_number
--------------------------------------
john         smith       123-456-7890
NULL         NULL        456-789-1011
carry        smith       121-314-1516
Any ideas?

In cases like this you probably want a NOT EXISTS clause. For each row with NULL names, this does a lookup to see if there are any other records with the same phone number and populated name fields. Rows that already have a name are always kept.
select
first_name,
last_name,
phone_number
from
phone_numbers pn
where
pn.first_name is not null
or pn.last_name is not null
or not exists (
select 1
from phone_numbers pn2
where pn2.phone_number = pn.phone_number
and pn2.first_name is not null
and pn2.last_name is not null
)
Although I'm not sure it's perfect. If there is a case where two records have the same phone number and both have NULL names, then both would be returned, so that kind of duplicate is not removed.
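As a hedged sketch of the NOT EXISTS idea, the snippet below runs a variant of it against the question's four sample rows in an in-memory SQLite database through Python's sqlite3 module. The table name phone_numbers comes from the answer. The choice of SQLite, the explicit guard that always keeps named rows, and the ORDER BY are assumptions of this sketch, not something the question specifies.

```python
import sqlite3

# In-memory database holding the question's sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE phone_numbers (first_name TEXT, last_name TEXT, phone_number TEXT)"
)
conn.executemany(
    "INSERT INTO phone_numbers VALUES (?, ?, ?)",
    [
        ("john", "smith", "123-456-7890"),
        (None, None, "123-456-7890"),
        (None, None, "456-789-1011"),
        ("carry", "smith", "121-314-1516"),
    ],
)

# Keep a row if it carries a name, or if no named row shares its phone number.
query = """
SELECT first_name, last_name, phone_number
FROM phone_numbers pn
WHERE pn.first_name IS NOT NULL
   OR pn.last_name IS NOT NULL
   OR NOT EXISTS (
        SELECT 1
        FROM phone_numbers pn2
        WHERE pn2.phone_number = pn.phone_number
          AND (pn2.first_name IS NOT NULL OR pn2.last_name IS NOT NULL)
   )
ORDER BY phone_number
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

On the sample data this keeps john, carry, and the lone NULL row for 456-789-1011, and drops only the NULL row that duplicates john's number.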

One way might be to use a subquery to list each phone_number only once, and then outer join it to the records that have non-NULL names. Something like this:
SELECT *
FROM
(SELECT phone_number AS root_phone_number
FROM table
GROUP BY phone_number
) AS phonenumbers
LEFT OUTER JOIN
(SELECT *
FROM table
WHERE first_name IS NOT NULL
) as notnulls
ON phonenumbers.root_phone_number = notnulls.phone_number

Here is a fast way to do it: left join each row to the rows that make it redundant, and then add a WHERE clause that keeps only the rows where the join found nothing. Every row that meets the join requirements WILL NOT be in the results.
I call this an Exclusionary Left Join.
SELECT main.*
FROM tableyoudidnotname main
LEFT JOIN tableyoudidnotname sub on
main.phone_number = sub.phone_number
and main.first_name is null
and main.last_name is null
and sub.first_name is not null
WHERE sub.phone_number is null

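I would use a CTE for it. Note that this deletes the duplicate rows from the table rather than just filtering them out of a result. Here's the code that does it.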
with cte as (
select phone_number from phone_numbers
group by phone_number
having count(*) > 1
)
delete phone_numbers
where phone_number in (select phone_number from cte)
and first_name is null and last_name is null
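The delete-based approach can be tried out the same way. This is a minimal sketch, assuming SQLite (which accepts a WITH clause in front of DELETE FROM) and the question's sample rows; the CTE name dup is illustrative, and the DELETE syntax in other engines may differ slightly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE phone_numbers (first_name TEXT, last_name TEXT, phone_number TEXT)"
)
conn.executemany(
    "INSERT INTO phone_numbers VALUES (?, ?, ?)",
    [
        ("john", "smith", "123-456-7890"),
        (None, None, "123-456-7890"),
        (None, None, "456-789-1011"),
        ("carry", "smith", "121-314-1516"),
    ],
)

# Delete NULL-named rows, but only for phone numbers that occur more than once.
conn.execute(
    """
    WITH dup AS (
        SELECT phone_number
        FROM phone_numbers
        GROUP BY phone_number
        HAVING COUNT(*) > 1
    )
    DELETE FROM phone_numbers
    WHERE phone_number IN (SELECT phone_number FROM dup)
      AND first_name IS NULL
      AND last_name IS NULL
    """
)

remaining = conn.execute(
    "SELECT first_name, last_name, phone_number FROM phone_numbers ORDER BY phone_number"
).fetchall()
print(remaining)
```

One caveat of this design: if a phone number appears twice and both rows have NULL names, both rows are deleted.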

Related

Fetch Parent Record anyway even if child condition doesn't meet without using ON

I have two tables Employee and Address.
One Employee can have multiple Address.
Here I want to fetch 'active employee details' and 'active address details' of a particular emp_id. I can achieve this by below query :
Table Structure:
Employee(emp_id,name,is_active)
Address(add_id,emp_id,address,is_active)
Query:
SELECT * FROM EMPLOYEE e
LEFT OUTER JOIN ADDRESS a
ON e.emp_id=a.emp_id
WHERE e.is_active='A'
AND a.is_active='A';
With the above query, no employee details are returned if the employee has no active address. I want to return the active employee's details anyway, even if there is no active address.
Note: as I am using Hibernate, I am looking for a query that does not use ON. Only a WHERE clause can be used here.
Kindly suggest.
You need to put a.is_active='A' in the ON clause, not in the WHERE clause:
SELECT * FROM EMPLOYEE e
LEFT OUTER JOIN ADDRESS a
ON e.emp_id=a.emp_id AND a.is_active='A'
WHERE e.is_active='A';
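To see the difference between the two placements, here is a small sketch using Python's sqlite3 module. The lowercase table names and the two sample employees are assumptions made for illustration; only the column layout comes from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employee (emp_id INTEGER, name TEXT, is_active TEXT);
    CREATE TABLE address (add_id INTEGER, emp_id INTEGER, address TEXT, is_active TEXT);
    -- Employee 1 has only an inactive address, employee 2 has an active one.
    INSERT INTO employee VALUES (1, 'NAME1', 'A'), (2, 'NAME2', 'A');
    INSERT INTO address VALUES (1, 1, 'addr1', 'N'), (2, 2, 'addr2', 'A');
    """
)

# Filter on the right table in WHERE: the NULL-extended row fails the test
# a.is_active = 'A', so the LEFT JOIN behaves like an INNER JOIN.
where_filtered = conn.execute(
    """
    SELECT e.emp_id, a.address
    FROM employee e
    LEFT OUTER JOIN address a ON e.emp_id = a.emp_id
    WHERE e.is_active = 'A' AND a.is_active = 'A'
    ORDER BY e.emp_id
    """
).fetchall()

# Filter on the right table in ON: unmatched employees survive with NULLs.
on_filtered = conn.execute(
    """
    SELECT e.emp_id, a.address
    FROM employee e
    LEFT OUTER JOIN address a ON e.emp_id = a.emp_id AND a.is_active = 'A'
    WHERE e.is_active = 'A'
    ORDER BY e.emp_id
    """
).fetchall()

print(where_filtered)  # [(2, 'addr2')]
print(on_filtered)     # [(1, None), (2, 'addr2')]
```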
Since you have restrictions on using conditions in the ON clause, you can try the approach below. It returns rows where the address is active or no address is available (assuming that the is_active column is never null in the address table).
Schema and insert statements:
create table EMPLOYEES(emp_id int, name varchar(20), is_active varchar(10));
create table Address(add_id int ,emp_id int ,address varchar(50),is_active varchar(10));
insert into EMPLOYEES values (1,'NAME1','A');
insert into Address values(1,1,'addr1','N');
insert into Address values(2,1,'addr1','N');
Query:
SELECT * FROM EMPLOYEES e
LEFT OUTER JOIN (select * from Address where is_active='A') a
ON e.emp_id=a.emp_id
WHERE e.is_active='A'
AND (a.is_active='A' or a.is_active is null);
Output:
EMP_ID   NAME    IS_ACTIVE   ADD_ID   EMP_ID   ADDRESS   IS_ACTIVE
-------------------------------------------------------------------
1        NAME1   A           null     null     null      null

full outer join on two keys

I am trying to merge two tables on phone number, so that if a phone is found in either of the tables the rest of the fields are joined.
There are scenarios where the phone doesn't exist in both tables. In that case the tables should be joined on email_id: first check whether the phone matches, and if not, check for an email_id match. If neither matches, drop the record.
select COALESCE(icici.phone, hsbc.phone) as phone,
COALESCE(icici.email_id, hsbc.email_id) as email_id, city
from credit_card.icici
full outer join credit_card.hsbc on icici.phone = hsbc.phone
or icici.email_id = hsbc.email_id
limit 10
But I am getting this error
ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable join conditions
SQL state: 0A000
Is there a way to solve it, or is there a better way to do this?
You can union the result of a left and a right join:
SELECT COALESCE(icici.phone, hsbc.phone) as phone,
COALESCE(icici.email_id, hsbc.email_id) as email_id, city
FROM credit_card.icici
LEFT OUTER JOIN credit_card.hsbc on icici.phone = hsbc.phone
OR icici.email_id = hsbc.email_id
UNION (
SELECT COALESCE(icici.phone, hsbc.phone) as phone,
COALESCE(icici.email_id, hsbc.email_id) as email_id, city
FROM credit_card.icici
RIGHT OUTER JOIN credit_card.hsbc on icici.phone = hsbc.phone
OR icici.email_id = hsbc.email_id
WHERE icici.id IS NULL
)
Note that the right join should contribute only the rows that have no match in the left table, because the matched rows are already produced by the left join. Those rows are selected with the WHERE clause, for example by checking the left table's primary key for NULL.
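The same emulation can be sketched end to end with Python's sqlite3 module. Everything concrete here is an assumption for illustration: the unqualified table names, the sample rows, and the omission of the city column (the question does not say which table it lives in). To avoid relying on RIGHT JOIN, which older SQLite versions lack, the second branch uses NOT EXISTS to pick the hsbc rows that have no match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE icici (id INTEGER PRIMARY KEY, phone TEXT, email_id TEXT);
    CREATE TABLE hsbc  (id INTEGER PRIMARY KEY, phone TEXT, email_id TEXT);
    INSERT INTO icici VALUES (1, '111', 'a@x.com'), (2, NULL, 'b@x.com'), (3, '333', 'c@x.com');
    INSERT INTO hsbc  VALUES (1, '111', 'a2@x.com'), (2, '222', 'b@x.com'), (3, '444', 'd@x.com');
    """
)

# Branch 1: every icici row, with its hsbc match (by phone or email) if any.
# Branch 2: the hsbc rows that matched no icici row at all.
query = """
SELECT COALESCE(i.phone, h.phone)       AS phone,
       COALESCE(i.email_id, h.email_id) AS email_id
FROM icici i
LEFT JOIN hsbc h ON i.phone = h.phone OR i.email_id = h.email_id
UNION ALL
SELECT h.phone, h.email_id
FROM hsbc h
WHERE NOT EXISTS (
    SELECT 1 FROM icici i
    WHERE i.phone = h.phone OR i.email_id = h.email_id
)
ORDER BY 1
"""
merged = conn.execute(query).fetchall()
for row in merged:
    print(row)
```

UNION ALL is enough here because the two branches cannot produce the same source row twice; a plain UNION would additionally collapse rows that merely look identical.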
Use union all and aggregation:
select phone, max(email_id) as email_id
from ((select phone, email_id
from credit_card.icici
) union all
(select phone, email_id
from credit_card.hsbc
)
) cc
group by phone,
(case when phone is null then email_id end);

Will this left join on same table ever return data?

In SQL Server, on a re-engineering project, I'm walking through some old sprocs, and I've come across this bit. I've hopefully captured the essence in this example:
Example Table
SELECT * FROM People
Id | Name
-------------------------
1 | Bob Slydell
2 | Jim Halpert
3 | Pamela Landy
4 | Bob Wiley
5 | Jim Hawkins
Example Query
SELECT a.*
FROM (
SELECT DISTINCT Id, Name
FROM People
WHERE Id > 3
) a
LEFT JOIN People b
ON a.Name = b.Name
WHERE b.Name IS NULL
Please disregard formatting, style, and query efficiency issues here. This example is merely an attempt to capture the exact essence of the real query I'm working with.
After looking over the real, more complex version of the query, I burned it down to this above, and I cannot for the life of me see how it would ever return any data. The LEFT JOIN should always exclude everything that was just selected because of the b.Name IS NULL check, right? (and it being the same table). If a row from People was found where b.Name IS NULL evals to true, then shouldn't that mean that data found in People a was never found? (impossible?)
Just to be very clear, I'm not looking for a "solution". The code is what it is. I'm merely trying to understand its behavior for the purpose of re-engineering it.
If this code indeed never returns results, then I'll conclude it was written incorrectly and use that knowledge during the re-engineering.
If there is a valid data scenario where it would/could return results, then that will be news to me and I'll have to go back to the books on SQL Joins! #DrivenCrazy
Yes. There are circumstances where this query will retrieve rows.
The query
SELECT a.*
FROM (
SELECT DISTINCT Id, PName
FROM People
WHERE Id > 3
) a
LEFT JOIN People b
ON a.PName = b.PName
WHERE b.PName IS NULL;
is roughly (maybe even exactly) equivalent to...
select distinct Id, PName
from People
where Id > 3 and PName is null;
Why?
Tested it using this code (mysql).
create table People (Id int, PName varchar(50));
insert into People (Id, Pname)
values (1, 'Bob Slydell'),
(2, 'Jim Halpert'),
(3,'Pamela Landy'),
(4,'Bob Wiley'),
(5,'Jim Hawkins');
insert into People (Id, PName) values (6,null);
Now run the query. You get
6, Null
I don't know if your schema allows null Name.
What value can a.PName have such that a.PName = b.PName finds no match and b.PName is Null?
Well, it's written right there: Null.
Can we prove that there is no other case where a row is returned?
Suppose there is a value for (Id,PName) such that PName is not null and a row is returned.
In order to satisfy the condition...
where b.PName is null
...such a value must include a PName that does not match any PName in the People table.
All a.PName and all b.PName values are drawn from People.PName ...
So a.PName may not match itself.
The only scalar value in SQL that does not equal itself is Null.
Therefore if there are no rows with Null PName this query will not return a row.
That's my proposed casual proof.
This is very confusing code. So #DrivenCrazy is appropriate.
The meaning of the query is exactly "return people with id > 3 and a null as name", i.e. it may return data but only if there are null-values in the name:
SELECT DISTINCT Id, PName
FROM People
WHERE Id > 3 and PName is null
The proof for this is rather simple, if we consider the meaning of the left join condition ... LEFT JOIN People b ON a.PName = b.PName together with the (overall) condition where b.PName is null:
Generally, a condition where PName = PName is true if and only if PName is not null, and it has exactly the same meaning as where PName is not null. Hence, the left join will match only tuples where pname is not null, but any matching row will subsequently be filtered out by the overall condition where pname is null.
Hence, the left join cannot introduce any new rows in the query, and it cannot reduce the set of rows of the left hand side (as a left join never does). So the left join is superfluous, and the only effective condition is where PName is null.
LEFT JOIN ON returns the rows that INNER JOIN ON returns plus unmatched rows of the left table extended by NULL for the right table columns. If the ON condition does not allow a matched row to have NULL in some column (like b.NAME here being equal to something) then the only NULLs in that column in the result are from unmatched left hand rows. So keeping rows with NULL for that column as the result gives exactly the rows unmatched by the INNER JOIN ON. (This is an idiom. In some cases it can also be expressed via NOT IN or EXCEPT.)
In your case the left table has distinct People rows with a.Id > 3 and the right table has all People rows. So the only a rows unmatched in a.Name = b.Name are those where a.Name IS NULL. So the WHERE returns those rows extended by NULLs.
SELECT * FROM
(SELECT DISTINCT * FROM People WHERE Id > 3 AND Name IS NULL) a
LEFT JOIN People b ON 1=0;
But then you SELECT a.*. So the entire query is just
SELECT DISTINCT * FROM People WHERE Id > 3 AND Name IS NULL;
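The argument can be checked with a quick experiment. This sketch uses Python's sqlite3 module with the question's People rows plus one extra row whose Name is NULL; that sixth row is an assumption added to trigger the case being discussed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE People (Id INTEGER, Name TEXT);
    INSERT INTO People VALUES
        (1, 'Bob Slydell'), (2, 'Jim Halpert'), (3, 'Pamela Landy'),
        (4, 'Bob Wiley'), (5, 'Jim Hawkins'),
        (6, NULL);  -- extra row, not in the question's sample data
    """
)

# NULL = NULL evaluates to NULL (unknown), not true, so a NULL name never
# satisfies the join condition, not even against its own row.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

rows = conn.execute(
    """
    SELECT a.*
    FROM (SELECT DISTINCT Id, Name FROM People WHERE Id > 3) a
    LEFT JOIN People b ON a.Name = b.Name
    WHERE b.Name IS NULL
    """
).fetchall()

print(null_equals_null)  # None
print(rows)              # [(6, None)]
```

Without the sixth row, the same query returns nothing, which matches the conclusion that only NULL names can get through.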
Sure. A left join can return data even if the join is done on the same table.
According to your query
"SELECT a.*
FROM (
SELECT DISTINCT Id, Name
FROM People
WHERE Id > 3
) a
LEFT JOIN People b
ON a.Name = b.Name
WHERE b.Name IS NULL"
it returns no rows for your sample data because of the final filter b.Name IS NULL. Without that filter it would return the 2 records with Id > 3.

How to fetch the non matching rows in Oracle

Can anyone help me fetch the non matching rows from two tables in Oracle?
Table: Names
Class_id Stud_name
S001 JAMES
S001 PETER
S002 MARK
Table: Course
Course_id Stud_name
S001 JAMES
S001 KEITH
S002 MARK
Output
I need the rows to display as
CLASS ID STUD_NAME_FROM_NAME_TABLE STUD_NAME_FROM_COURSE_TABLE
---------------------------------------------------------------------
S001 PETER KEITH
I have used Oracle joins to fetch the non matching names:
SELECT *
FROM Names, Course
WHERE Names.Class_id=Course.Course_id
AND Names.Stud_name<>Course.Stud_name
This query is returning duplicate rows.
If you insist on Join you can use this one:
SELECT *
FROM Names
FULL OUTER JOIN Course ON Names.Class_id=Course.Course_id
AND Names.Stud_name = Course.Stud_name
WHERE Names.Stud_name IS NULL or Course.Stud_name IS NULL
Fetches unmatched rows in Names table
SELECT * FROM Names
WHERE
NOT EXISTS
(SELECT 'x' from Course
WHERE
Names.Class_id = Course.Course_id AND
Names.Stud_name = Course.Stud_name)
Fetches unmatched rows in Names and Course too!
SELECT Names.Class_id,Names.Stud_name,C1.Stud_name
FROM Names , Course C1
WHERE Names.Class_id = C1.Course_id AND
NOT EXISTS
(SELECT 'x' from Course C2
WHERE
Names.Class_id = C2.Course_id AND
Names.Stud_name = C2.Stud_name);
When you ask for unmatching rows I assume that you want rows that exist in names but not in course.
If this is the case you're probably after
select * from names
where (class_id, stud_name ) not in
(select course_id, stud_name from course);
Your query returned duplicate rows because, for each row in names, it selected all rows in course that satisfied the where condition.
So, for the row S001, PETER in names, it found that both S001, JAMES and S001, KEITH matched that condition; thus, that row was "returned" twice.
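The duplication is easy to reproduce. The following is a minimal sketch using Python's sqlite3 module and the question's sample rows; SQLite stands in for Oracle here, which is an assumption, but the join semantics shown are standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Names  (Class_id TEXT, Stud_name TEXT);
    CREATE TABLE Course (Course_id TEXT, Stud_name TEXT);
    INSERT INTO Names  VALUES ('S001', 'JAMES'), ('S001', 'PETER'), ('S002', 'MARK');
    INSERT INTO Course VALUES ('S001', 'JAMES'), ('S001', 'KEITH'), ('S002', 'MARK');
    """
)

# The question's join: pairs every name with every *different* name in the
# same class, so one unmatched student can appear several times.
joined = conn.execute(
    """
    SELECT n.Stud_name, c.Stud_name
    FROM Names n
    JOIN Course c ON n.Class_id = c.Course_id
    WHERE n.Stud_name <> c.Stud_name
    ORDER BY 1, 2
    """
).fetchall()

# NOT EXISTS asks a yes/no question per Names row, so each row appears once.
unmatched = conn.execute(
    """
    SELECT n.Class_id, n.Stud_name
    FROM Names n
    WHERE NOT EXISTS (
        SELECT 1 FROM Course c
        WHERE c.Course_id = n.Class_id AND c.Stud_name = n.Stud_name
    )
    """
).fetchall()

print(joined)     # [('JAMES', 'KEITH'), ('PETER', 'JAMES'), ('PETER', 'KEITH')]
print(unmatched)  # [('S001', 'PETER')]
```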
EDIT Since it is not clear if stud_name is a primary key, or unique (and on second sight I think it's not), you'd probably want a
select * from names
where not exists (
select 1 from course where
names.class_id = course.course_id and
names.stud_name = course.stud_name
)
Edit II if you insist on using a join (as per your comment) you might want to try a
select distinct names.* from...
Hope it helps you
with not_in_class as
(select a.*
from Names a
where not exists ( select 'x'
from course b
where b.Course_id = a.class_id
and a.Stud_name = b.Stud_name)),
not_in_course as
(select b.*
from course b
where not exists ( select 'x'
from Names a
where b.Course_id = a.class_id
and a.Stud_name = b.Stud_name))
select x.class_id,
x.Stud_name NOT_IN_CLASS,
y.stud_name NOT_IN_COURSE
from not_in_class x, not_in_course y
where x.class_id = y.course_id
Output
| CLASS_ID | NOT_IN_CLASS | NOT_IN_COURSE |
|----------|--------------|---------------|
| S001 | PETER | KEITH |
The only problem is that this works when there is a single mismatch for a given id. If there are multiple mismatches in both tables for the same id, the final join pairs every mismatch on one side with every mismatch on the other, so the query needs to be reworked for that case.
Well, I am not sure if I understand correctly what you are asking. I think you want a list of all IDs where the student list in class table and course table differs. Then you want to show the id and the students that are in class but not in course and the students that are in course but not in class.
To do so you would full outer join the tables. That gives you students that are in both class and course, students that are in class and not in course, and students that are in course and not in class. Then filter the results to rows where either class_id or course_id is null, which leaves the students missing in course or in class. Finally, group by id and list the students.
select coalesce(class.class_id, course.course_id) as id
, listagg(class.stud_name, ',') within group (order by class.stud_name) as missing_in_course
, listagg(course.stud_name, ',') within group (order by course.stud_name) as missing_in_class
from class
full outer join course
on (class.class_id = course.course_id and class.stud_name = course.stud_name)
where class.class_id is null or course.course_id is null
group by coalesce(class.class_id, course.course_id);
Here is the SQL fiddle showing how it works: http://sqlfiddle.com/#!4/8aaaa/2
EDIT: In Oracle 9i there is no listagg. You can use the undocumented function wm_concat instead:
select coalesce(class.class_id, course.course_id) as id
, wm_concat(class.stud_name) as missing_in_course
, wm_concat(course.stud_name) as missing_in_class
from class
full outer join course
on (class.class_id = course.course_id and class.stud_name = course.stud_name)
where class.class_id is null or course.course_id is null
group by coalesce(class.class_id, course.course_id);

joining tables while keeping the Null values

I have two tables:
Users: ID, first_name, last_name
Networks: user_id, friend_id, status
I want to select all values from the users table but I want to display the status of specific user (say with id=2) while keeping the other ones as NULL. For instance:
If I have users:
id first_name last_name
------------------------
1 John Smith
2 Tom Summers
3 Amy Wilson
And in networks:
user_id friend_id status
------------------------------
2 1 friends
I want to run the search as John Smith against all other users, so I want to get:
id first_name last_name status
------------------------------------
2 Tom Summers friends
3 Amy Wilson NULL
I tried doing LEFT JOIN and then WHERE statement but it didn't work because it excluded the rows that have relations with other users but not this user.
I can do this using UNION statement but I was wondering if it's at all possible to do it without UNION.
You need to put your condition into the ON clause of the LEFT JOIN.
Select
u.first_name,
u.last_name,
n.status
From users u
Left Join networks n On ( ( n.user_id = 1 And n.friend_id = u.id )
Or ( n.friend_id = 1 And n.user_id = u.id ) )
Where u.id <> 1
This should return you all users (except for John Smith) and status friend if John Smith is either friend of this user, or this user is friend of John Smith.
You probably don't need a WHERE clause, and instead of that, put the condition into the "ON" clause that follows your "LEFT JOIN". That should fix your issues. Also, make sure that the main table is on the left side of the left join, otherwise, you should use a right join.
In addition to the (correct) replies above that such conditions should go in the ON clause, if you really want to put them in the WHERE clause for some reason, just add a condition that the value can be null.
WHERE (networks.friend_id = 2 OR networks.friend_id IS NULL)
From what you've described, it should be a case of joining a subset of networks to users.
select id, first_name, last_name, status
from users u
left join networks n on u.id = n.user_id
and n.friend_id = 1
where id <> 1;
The left join will keep rows from users that do not have a matching row in networks and adding the and n.friend_id = 1 limits when the 'friends' status is returned. Lastly, you may choose to exclude the row from users that you are running the query for.
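A short sketch of the behaviour, using Python's sqlite3 module and the question's sample rows. The second networks row (Amy linked to Tom with status 'pending') is an assumption added to reproduce the situation the question describes, where a user has relations with other users but not with John Smith.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER, first_name TEXT, last_name TEXT);
    CREATE TABLE networks (user_id INTEGER, friend_id INTEGER, status TEXT);
    INSERT INTO users VALUES (1, 'John', 'Smith'), (2, 'Tom', 'Summers'), (3, 'Amy', 'Wilson');
    -- (3, 2, 'pending') is a hypothetical extra row, not in the question.
    INSERT INTO networks VALUES (2, 1, 'friends'), (3, 2, 'pending');
    """
)

# Condition in ON: only John's relations are joined, everyone else gets NULL.
on_rows = conn.execute(
    """
    SELECT u.id, u.first_name, u.last_name, n.status
    FROM users u
    LEFT JOIN networks n ON u.id = n.user_id AND n.friend_id = 1
    WHERE u.id <> 1
    ORDER BY u.id
    """
).fetchall()

# Condition in WHERE with an "OR ... IS NULL" escape: Amy is joined to her
# relation with Tom first, and that row is then filtered out entirely.
where_rows = conn.execute(
    """
    SELECT u.id, u.first_name, u.last_name, n.status
    FROM users u
    LEFT JOIN networks n ON u.id = n.user_id
    WHERE u.id <> 1 AND (n.friend_id = 1 OR n.friend_id IS NULL)
    ORDER BY u.id
    """
).fetchall()

print(on_rows)     # [(2, 'Tom', 'Summers', 'friends'), (3, 'Amy', 'Wilson', None)]
print(where_rows)  # [(2, 'Tom', 'Summers', 'friends')]
```

This is why the ON placement is usually the safer choice: the WHERE variant only works while no user has relations with anyone other than the user being searched for.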