SQL query needs to match similar records

I have a very large table of contacts, and I am building an interface to help my client de-dupe it. Here is an example of the table content:
id | firstname | lastname | email | address1 | addres2 | verifiedAt |
1 | James | johnson | james#test.com | | | |
2 | David | bloggs | james#bloggs.com | | | |
3 | John | nobel | james#nobel.com | | | |
4 | Terry | jacket | james#jacket.com | | | 05/05/2013 |
5 | James | johnson | james#johnson.com| | | |
6 | James | privett | james#test.com | | | |
I need to write a query that will return the first contact that has another contact in the same table where either the email addresses match or the firstname + lastname match.
Is this possible in a single query?
Thanks in advance

Try this (SQL Fiddle).
SELECT DISTINCT *
FROM
( SELECT
MIN(id) as [id]
FROM mytable
GROUP BY email
HAVING COUNT(*) > 1
UNION ALL
SELECT
MIN(id) as [id]
FROM mytable
GROUP BY firstName,lastName
HAVING Count(*) > 1 )dups
JOIN myTable t
ON t.Id = dups.id

This works (SQLFiddle DEMO):
SELECT a.* FROM mytable a
JOIN (
SELECT email
FROM mytable
GROUP BY email
HAVING count(*) > 1
) b ON a.email = b.email
UNION
SELECT a.* FROM mytable a
JOIN (
SELECT firstname, lastname
FROM mytable
GROUP BY firstname, lastname
HAVING count(*) > 1
) b ON a.firstname = b.firstname AND a.lastname = b.lastname
To make sure that this query runs fast, create at least the following indexes:
CREATE INDEX i1 ON mytable(email);
CREATE INDEX i2 ON mytable(firstname, lastname);

One method:
with cte as
(select c.*,
row_number() over (partition by email order by id) rnem,
count(*) over (partition by email) ctem,
row_number() over (partition by firstname, lastname order by id) rnfl,
count(*) over (partition by firstname, lastname) ctfl
from contacts c)
select * from cte
where (ctem > 1 and rnem = 1) or (ctfl > 1 and rnfl = 1)
SQLFiddle here.
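To compare what the answers above return, here is a minimal sketch that loads the sample rows into an in-memory SQLite database from Python. The table name `contacts` and the use of SQLite are assumptions made for illustration; the question does not name a database engine. The first query keeps only the lowest id of each duplicate group, while the second (the UNION of joins) returns every member of a group.

```python
import sqlite3

# Sample rows copied from the question (address and verifiedAt columns omitted).
rows = [
    (1, "James", "johnson", "james#test.com"),
    (2, "David", "bloggs", "james#bloggs.com"),
    (3, "John", "nobel", "james#nobel.com"),
    (4, "Terry", "jacket", "james#jacket.com"),
    (5, "James", "johnson", "james#johnson.com"),
    (6, "James", "privett", "james#test.com"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT, email TEXT)"
)
conn.executemany("INSERT INTO contacts VALUES (?, ?, ?, ?)", rows)

# Lowest id of every group that shares an email or a first + last name.
first_of_group_sql = """
SELECT DISTINCT id FROM (
    SELECT MIN(id) AS id FROM contacts GROUP BY email HAVING COUNT(*) > 1
    UNION ALL
    SELECT MIN(id) AS id FROM contacts GROUP BY firstname, lastname HAVING COUNT(*) > 1
) AS dups
ORDER BY id
"""
first_ids = [r[0] for r in conn.execute(first_of_group_sql)]

# Every row that belongs to a duplicate group (the UNION-of-joins answer).
all_members_sql = """
SELECT a.id FROM contacts a
JOIN (SELECT email FROM contacts GROUP BY email HAVING COUNT(*) > 1) b
  ON a.email = b.email
UNION
SELECT a.id FROM contacts a
JOIN (SELECT firstname, lastname FROM contacts
      GROUP BY firstname, lastname HAVING COUNT(*) > 1) b
  ON a.firstname = b.firstname AND a.lastname = b.lastname
ORDER BY id
"""
member_ids = [r[0] for r in conn.execute(all_members_sql)]

print(first_ids)   # [1]
print(member_ids)  # [1, 5, 6]
```

Which of the two shapes you want depends on the interface: the first gives one representative per group, the second gives the full set of rows to review.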

Related

How to handle duplicates created by LEFT JOIN

LEFT TABLE:
+------+---------+--------+
| Name | Surname | Salary |
+------+---------+--------+
| Foo | Bar | 100 |
| Foo | Kar | 300 |
| Fo | Ba | 35 |
+------+---------+--------+
RIGHT TABLE:
+------+-------+
| Name | Bonus |
+------+-------+
| Foo | 10 |
| Foo | 20 |
| Foo | 50 |
| Fo | 10 |
| Fo | 100 |
| F | 1000 |
+------+-------+
DESIRED OUTPUT:
+------+---------+--------+-------+
| Name | Surname | Salary | Bonus |
+------+---------+--------+-------+
| Foo | Bar | 100 | 80 |
| Foo | Kar | 300 | 0 |
| Fo | Ba | 35 | 110 |
+------+---------+--------+-------+
The closest I get is this:
SELECT
a.Name,
Surname,
sum(Salary),
sum(Bonus)
FROM (SELECT
Name,
Surname,
sum(Salary) as Salary
FROM input
GROUP BY 1,2) a LEFT JOIN (SELECT Name,
SUM(Bonus) as Bonus
FROM input2
GROUP BY 1) b
ON a.Name = b.Name
GROUP BY 1,2;
Which gives:
+------+---------+-------------+------------+
| Name | Surname | sum(Salary) | sum(Bonus) |
+------+---------+-------------+------------+
| Fo | Ba | 35 | 110 |
| Foo | Bar | 100 | 80 |
| Foo | Kar | 300 | 80 |
+------+---------+-------------+------------+
I can't figure out how to get rid of the Bonus duplication. The ideal solution for me would be the one shown in the 'DESIRED OUTPUT': add the Bonus to only one record per Name, and add 0 for the other records with the same Name.
You can use row_number():
select l.*, (case when l.seqnum = 1 then r.bonus else 0 end) as bonus
from (select l.*, row_number() over (partition by name order by salary) as seqnum
from "left" l
) l left join
(select r.name, sum(bonus) as bonus
from "right" r
group by r.name
) r
on r.name = l.name
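As a quick check of the row_number() approach, the sketch below runs it against the sample data in an in-memory SQLite database from Python. It assumes SQLite 3.25 or later (for window functions), uses the table names `input` and `input2` from the question, and adds a COALESCE so that names with no bonus rows get 0 instead of NULL; that last part is an addition for illustration, not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE input  (Name TEXT, Surname TEXT, Salary INTEGER);
CREATE TABLE input2 (Name TEXT, Bonus INTEGER);
INSERT INTO input  VALUES ('Foo','Bar',100), ('Foo','Kar',300), ('Fo','Ba',35);
INSERT INTO input2 VALUES ('Foo',10), ('Foo',20), ('Foo',50),
                          ('Fo',10), ('Fo',100), ('F',1000);
""")

query = """
SELECT l.Name, l.Surname, l.Salary,
       -- only the first row of each Name receives the summed bonus
       CASE WHEN l.seqnum = 1 THEN COALESCE(r.Bonus, 0) ELSE 0 END AS Bonus
FROM (SELECT i.*,
             ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Salary) AS seqnum
      FROM input i) l
LEFT JOIN (SELECT Name, SUM(Bonus) AS Bonus
           FROM input2
           GROUP BY Name) r
       ON r.Name = l.Name
ORDER BY l.Name DESC, l.Surname
"""
result = conn.execute(query).fetchall()

for row in result:
    print(row)
# ('Foo', 'Bar', 100, 80)
# ('Foo', 'Kar', 300, 0)
# ('Fo', 'Ba', 35, 110)
```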
Try ROW_NUMBER() partitioned by Name. This gives each duplicate of a Name a different number. You can then return the bonus when this number is 1 and 0 otherwise. The code can look something like this:
SELECT
a.Name,
a.Surname,
a.Salary,
CASE WHEN a.Duplicate_Order = 1
THEN COALESCE(b.Bonus, 0)
ELSE 0
END AS Bonus
FROM (SELECT
Name,
Surname,
SUM(Salary) AS Salary,
ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Surname) AS Duplicate_Order
FROM input
GROUP BY Name, Surname) a
LEFT JOIN (SELECT Name,
SUM(Bonus) AS Bonus
FROM input2
GROUP BY Name) b
ON a.Name = b.Name;
Hope that helps!
You can use a correlated subquery with sum() to compute the bonus column, and then subtract the previous row's bonus with the lag() window function, so that every row after the first one for the same name gets 0:
select Name, Surname, Salary,
bonus - lag(bonus::int,1,0) over (partition by name order by salary) as bonus
from
(
select i1.*,
( select sum(Bonus)
from input2 i2
where i1.Name = i2.Name
group by i2.Name ) as bonus
from input i1
) ii
order by name desc, surname;
Demo

SQL - SELECT duplicates between IDs, but not show records if duplicates occur for same ID

I have the following table (simplified from the real table) at the moment:
+----+-------+-------+
| ID | Name | Phone |
+----+-------+-------+
| 1 | Tom | 123 |
| 1 | Tom | 123 |
| 1 | Tom | 123 |
| 2 | Mark | 321 |
| 2 | Mark | 321 |
| 3 | Kate | 321 |
+----+-------+-------+
My desired output in the SELECT statement is:
+----+------+-------+
| ID | Name | Phone |
+----+------+-------+
| 2 | Mark | 321 |
| 3 | Kate | 321 |
+----+------+-------+
I want to select duplicates only when they occur between two different IDs (like Mark and Kate sharing the same phone number), but not to show any records for IDs that share the same phone number with themselves only (like Tom).
Could someone advise how this can be achieved?
You can use an EXISTS condition with a correlated subquery to ensure that another record exists that has the same phone and a different id. We also need DISTINCT to remove the duplicates in the resultset.
SELECT DISTINCT id, name, phone
FROM mytable t
WHERE EXISTS (
SELECT 1
FROM mytable t1
WHERE t1.phone = t.phone AND t1.id <> t.id
)
Demo on DB Fiddle:
| id | name | phone |
| --- | ---- | ----- |
| 2 | Mark | 321 |
| 3 | Kate | 321 |
You can use window functions for this:
select t.*
from (select t.*,
row_number() over (partition by phone, name order by id) as seqnum,
min(id) over (partition by phone) as min_id,
max(id) over (partition by phone) as max_id
from t
) t
where seqnum = 1 and min_id <> max_id;
Another method uses aggregation and a window function:
select phone, name, id
from (select phone, name, id,
count(*) over (partition by phone) as num_ids
from t
group by phone, name, id
) pn
where num_ids > 1;
Both of these have the advantage over the exists solution (GMB's) that they refer to the "table" only once. That can be a big advantage if the table is a complex view or query. If performance is an issue, I would encourage you to test several variants to see which works best.
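To see that the EXISTS query and the aggregation-plus-window query agree on the sample data, here is a small sketch using Python's sqlite3 module. The table name `t` follows the window-function answer, and SQLite 3.25 or later is assumed for `COUNT(*) OVER`; neither is specified in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INTEGER, name TEXT, phone TEXT);
INSERT INTO t VALUES (1,'Tom','123'), (1,'Tom','123'), (1,'Tom','123'),
                     (2,'Mark','321'), (2,'Mark','321'), (3,'Kate','321');
""")

# EXISTS variant: keep rows whose phone also appears under a different id.
exists_sql = """
SELECT DISTINCT o.id, o.name, o.phone
FROM t AS o
WHERE EXISTS (SELECT 1 FROM t AS i
              WHERE i.phone = o.phone AND i.id <> o.id)
ORDER BY o.id
"""

# Aggregation + window variant: collapse exact duplicates with GROUP BY,
# then count how many distinct (name, id) pairs share each phone.
window_sql = """
SELECT id, name, phone
FROM (SELECT id, name, phone,
             COUNT(*) OVER (PARTITION BY phone) AS num_ids
      FROM t
      GROUP BY phone, name, id) pn
WHERE num_ids > 1
ORDER BY id
"""

exists_rows = conn.execute(exists_sql).fetchall()
window_rows = conn.execute(window_sql).fetchall()

print(exists_rows)  # [(2, 'Mark', '321'), (3, 'Kate', '321')]
print(window_rows)  # [(2, 'Mark', '321'), (3, 'Kate', '321')]
```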
You can also use a correlated subquery with GROUP BY and HAVING, as below:
SELECT t.id, t.name, MAX(t.phone) AS phone
FROM mytable t
GROUP BY t.id, t.name
HAVING 1 = MAX(
CASE
WHEN t.phone IN (SELECT t2.phone FROM mytable t2
WHERE t2.id <> t.id) THEN 1 ELSE 0
END)

SQL Server aggregate functions - how to?

Input table contains 2 columns i.e. name and dept
+------+------+
| name | dept |
+------+------+
| A | 123 |
| B | 456 |
| A | 789 |
| C | 123 |
| A | 456 |
| B | 789 |
+------+------+
Output is
name
-----
A
Here A works in all 3 depts (123, 456, 789). How can I retrieve the names of the people who work in all 3 depts?
This might help you.
SELECT NAME
FROM TABLE1
GROUP BY NAME
HAVING COUNT(DISTINCT DEPT) =
(
SELECT COUNT(DISTINCT DEPT)
FROM TABLE1
)
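The idea in this query is to compare each name's distinct department count with the distinct department count of the whole table. A minimal sketch of that check, run through Python's sqlite3 module on the sample data (SQLite and the lowercase table name `table1` are assumptions for illustration; the question targets SQL Server):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (name TEXT, dept INTEGER);
INSERT INTO table1 VALUES ('A',123), ('B',456), ('A',789),
                          ('C',123), ('A',456), ('B',789);
""")

query = """
SELECT name
FROM table1
GROUP BY name
-- keep a name only if it covers every distinct dept in the table
HAVING COUNT(DISTINCT dept) = (SELECT COUNT(DISTINCT dept) FROM table1)
"""
names = [r[0] for r in conn.execute(query)]

print(names)  # ['A']
```

If nobody works in every department, the query simply returns no rows.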
Here's one option using window functions:
select distinct name
from (
select name,
count(*) over (partition by name) cnt,
max(dr) over () overallcnt
from (
select name, dept, dense_rank() over (order by dept) dr
from yourtable
group by name, dept
) d
) t
where cnt = overallcnt
Try this:
SELECT NAME
FROM TABLE1
GROUP BY NAME
HAVING COUNT(DISTINCT DEPT)=(SELECT COUNT(DISTINCT DEPT) FROM TABLE1 )

SQL: Select only one row of table with same value

I'm a bit new to SQL, and for my project I need to do some database sorting and filtering.
Let's assume my database looks like this:
==========================================
| id | email | name
==========================================
| 1 | 123#test.com | John
| 2 | 234#test.com | Peter
| 3 | 234#test.com | Steward
| 4 | 123#test.com | Ethan
| 5 | 542#test.com | Bob
| 6 | 123#test.com | Patrick
==========================================
What should I do so that only the last row for each email is returned:
==========================================
| id | email | name
==========================================
| 3 | 234#test.com | Steward
| 5 | 542#test.com | Bob
| 6 | 123#test.com | Patrick
==========================================
Thanks in advance!
SQL Query:
SELECT * FROM test.test1 WHERE id IN (
SELECT MAX(id) FROM test.test1 GROUP BY email
);
Hope this solves your problem. Thanks.
A generic way to do this in SQL is to use the ANSI standard row_number() function:
select t.*
from (select t.*, row_number() over (partition by email order by id desc) as seqnum
from t
) t
where seqnum = 1;
Note that ORDER BY with LIMIT 1 is shorter, but it returns a single row for the whole table, not one row per email:
SELECT *
FROM mytable
ORDER BY email DESC
LIMIT 1;
You can use the following query to get the MAX id value per email:
SELECT email, MAX(id)
FROM mytable
GROUP BY email
Using the above query as a derived table you can obtain the whole record:
SELECT t1.*
FROM mytable AS t1
JOIN (
SELECT email, MAX(id) AS id
FROM mytable
GROUP BY email
) AS t2 ON t1.id = t2.id
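As a quick check, the derived-table join can be run against the sample rows with Python's sqlite3 module. This is a sketch only; SQLite and the ORDER BY added for a stable result order are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (id INTEGER PRIMARY KEY, email TEXT, name TEXT);
INSERT INTO mytable VALUES
    (1,'123#test.com','John'),   (2,'234#test.com','Peter'),
    (3,'234#test.com','Steward'),(4,'123#test.com','Ethan'),
    (5,'542#test.com','Bob'),    (6,'123#test.com','Patrick');
""")

query = """
SELECT t1.*
FROM mytable AS t1
JOIN (
    -- highest id per email identifies the "last" row of each group
    SELECT email, MAX(id) AS id
    FROM mytable
    GROUP BY email
) AS t2 ON t1.id = t2.id
ORDER BY t1.id
"""
last_rows = conn.execute(query).fetchall()

for row in last_rows:
    print(row)
# (3, '234#test.com', 'Steward')
# (5, '542#test.com', 'Bob')
# (6, '123#test.com', 'Patrick')
```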

Select duplicate records info

I have a person table:
Phone | Id1 | Id2 | Fname | Lname| Street
111111111 | A1 | 1000 | David | Luck | 123 Main Street
111111111 | A2 | 1001 | David | Luck | blank
111111111 | A3 | 1002 | David | Luck | blank
222222222 | B1 | 2000 | Smith | Nema | blank
333333333 | C1 | 3000 | Lanyn | Buck | 456 Street
I would like to have the result below:
Phone | Id1 | Id2 | Fname | Lname| Street
111111111 | A1 | 1000 | David | Luck | 123 Main Street
222222222 | B1 | 2000 | Smith | Nema | blank
333333333 | C1 | 3000 | Lanyn | Buck | 456 Street
What SQL Server 2008 query should I use to pick, for each duplicated phone, the record that has street info? Thanks
You want to choose a particular row. This is where the window function row_number() is most useful. The challenge is finding the right order by clause:
select p.Phone, p.Id1, p.Id2, p.Fname, p.Lname, p.Street
from (select p.*,
row_number() over (partition by phone
order by (case when street is not null then 0 else 1 end),
id2
) as seqnum
from person p
) p
where seqnum = 1
The function row_number() assigns a sequential number to rows with the same value of phone (based on the partition by clause). The one with non-blank street and lowest id2 gets a value of 1. If none exist, then the one with the lowest id2 gets the value. That is the one chosen by the outer filter.
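The ordering described above can be tried out with Python's sqlite3 module. This sketch assumes that "blank" in the sample table means NULL and that SQLite 3.25 or later is available for row_number(); the question itself targets SQL Server 2008, where the same query shape applies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (Phone TEXT, Id1 TEXT, Id2 INTEGER, Fname TEXT, Lname TEXT, Street TEXT)"
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("111111111", "A1", 1000, "David", "Luck", "123 Main Street"),
        ("111111111", "A2", 1001, "David", "Luck", None),  # blank street assumed NULL
        ("111111111", "A3", 1002, "David", "Luck", None),
        ("222222222", "B1", 2000, "Smith", "Nema", None),
        ("333333333", "C1", 3000, "Lanyn", "Buck", "456 Street"),
    ],
)

query = """
SELECT ranked.Phone, ranked.Id1, ranked.Street
FROM (SELECT p.*,
             ROW_NUMBER() OVER (
                 PARTITION BY Phone
                 -- rows with a street sort first, then the lowest Id2 wins
                 ORDER BY (CASE WHEN Street IS NOT NULL THEN 0 ELSE 1 END), Id2
             ) AS seqnum
      FROM person p) ranked
WHERE ranked.seqnum = 1
ORDER BY ranked.Phone
"""
picked = conn.execute(query).fetchall()

for row in picked:
    print(row)
# ('111111111', 'A1', '123 Main Street')
# ('222222222', 'B1', None)
# ('333333333', 'C1', '456 Street')
```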
If your street is an empty string '' when not populated with an actual address, you can use this to get your results (with NULL streets, the join on Street would not match the phones that have no address at all):
SELECT a.*
FROM Person a
JOIN (SELECT Phone, MAX(Street) AS Street
FROM Person
GROUP BY Phone
)b
ON a.Phone = b.Phone
AND a.Street = b.Street
Demo: SQL Fiddle
If your street is literally the string 'blank', the above would not return the desired results. In that case, rank the rows explicitly so that real addresses come first:
SELECT a.*
FROM Person a
JOIN ( SELECT Phone, Street,
ROW_NUMBER() OVER (PARTITION BY Phone
ORDER BY CASE WHEN Street = 'blank' THEN 1 ELSE 0 END) AS rn
FROM Person
)b
ON a.Phone = b.Phone
AND a.Street = b.Street
WHERE b.rn = 1
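Try this, which keeps the row with the lowest Id2 for each phone (in the sample data, that is the row with the street):
select a.* from Table1 a
inner join
(
select Phone, min(Id2) as Id2 from Table1
group by Phone
) as b
on a.Phone = b.Phone
and a.Id2 = b.Id2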