SQL Server: finding duplicates based on the first few characters of a column

I want to find duplicates based on the first three characters of the surname. Is there a way to do that in SQL? I can compare the whole name, but how do we compare only the first few characters?
Below is my table:
custid  forename  surname  dateofbirth
----------------------------------------
1       David     John     16-09-1985
2       David     Jon      16-09-1985
3       Sarah     Smith    10-08-2015
4       Peter     Proca    11-06-2011
5       Peter     Proka    11-06-2011
This is the query I am currently running to compare whole names:
SELECT
y.custid, y.forename, y.surname
FROM
customers y
INNER JOIN
(SELECT
forename, surname, COUNT(*) AS CountOf
FROM customers
GROUP BY forename, surname
HAVING COUNT(*) > 1) dt ON y.forename = dt.forename AND y.surname = dt.surname

You can use left():
select c.*
from (select c.*, count(*) over (partition by left(surname, 3)) as cnt
from customers c
) c
where cnt > 1
order by surname;
You can also include the forename in the partition by if you mean a matching forename plus the first three letters of the surname.

You can use exists as follows:
select t.* from t
Where exists
(select 1 from t tt
Where left(t.surname, 3) = left(tt.surname, 3) and t.custid <> tt.custid
)
order by t.surname;
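To check the prefix idea against the question's own rows, here is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server. That swap is an assumption made only so the example runs anywhere: SQLite has no left(), so substr(surname, 1, n) stands in for left(surname, n), and a GROUP BY/HAVING join replaces the window function. The function name surname_prefix_duplicates is hypothetical.

```python
import sqlite3

# Sample rows copied from the question's customers table.
CUSTOMERS = [
    (1, "David", "John", "16-09-1985"),
    (2, "David", "Jon", "16-09-1985"),
    (3, "Sarah", "Smith", "10-08-2015"),
    (4, "Peter", "Proca", "11-06-2011"),
    (5, "Peter", "Proka", "11-06-2011"),
]


def surname_prefix_duplicates(rows, prefix_len=3):
    """Return customers whose surname prefix is shared with another row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers "
        "(custid INTEGER, forename TEXT, surname TEXT, dateofbirth TEXT)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)
    # substr(x, 1, n) plays the role of SQL Server's left(x, n).
    query = """
        SELECT c.custid, c.forename, c.surname
        FROM customers c
        JOIN (SELECT substr(surname, 1, :n) AS prefix
              FROM customers
              GROUP BY substr(surname, 1, :n)
              HAVING COUNT(*) > 1) d
          ON substr(c.surname, 1, :n) = d.prefix
        ORDER BY c.surname, c.custid
    """
    result = conn.execute(query, {"n": prefix_len}).fetchall()
    conn.close()
    return result


three = surname_prefix_duplicates(CUSTOMERS, 3)
two = surname_prefix_duplicates(CUSTOMERS, 2)
print(three)  # only Proca / Proka share a three-character prefix
print(two)    # with two characters, John / Jon match as well
```

Note that with three characters "John" and "Jon" give the prefixes "Joh" and "Jon", so they are not reported as duplicates; only Proca and Proka are. Shortening the prefix to two characters picks up both pairs.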

Related

SQL Server: Duplicates but based on specific criteria

I am trying to find duplicates based on forename, surname, and dateofbirth in my database. Below is what I have
Customers table:
custid cust_refno forename surname dateofbirth
1 10 David John 10-02-1980
2 20 Peter Broad 15-08-1978
3 30 Sarah Holly 16-09-1982
4 40 Mathew Mark 25-08-2001
5 50 Matt Mark 25-08-2001
Address table:
addid cust_refno addresstype line1
1 10 address No. 10, Mineview Road
2 10 address No. 20, Mineview Lane
3 20 address Rockview cottage, blackthorn
4 30 mobile 0504135864
5 40 address No. 64, New Lane
6 40 mobile 0504896532
7 50 address No. 11, John's cottage
Some customers have multiple addresses, so they are not duplicates. I am trying to find a way to avoid displaying those as duplicates. Can you advise how I can do that?
My query:
SELECT DISTINCT t.FORENAME, t.SURNAME, t.CUST_REFNO, t.DATE_OF_BIRTH , a.LINE1 FROM CUSTOMERS AS t
LEFT OUTER JOIN dbo.ADDRESS a
ON t.CUST_REFNO = a.CUST_REFNO
INNER JOIN (
SELECT FORENAME, SURNAME, DATE_OF_BIRTH
FROM CUSTOMERS GROUP BY FORENAME, SURNAME, DATE_OF_BIRTH
HAVING COUNT(*) > 1) AS td
ON t.FORENAME = td.FORENAME AND t.DATE_OF_BIRTH = td.DATE_OF_BIRTH
AND t.SURNAME = td.SURNAME
WHERE a.addresstype = 'address'
My result is:
Forename  surname  cust_refno  dateofbirth  line1
David     John     10          10-02-1980   No. 10, Mineview Road
David     John     10          10-02-1980   No. 20, Mineview Lane
But in reality it is not a duplicate. It's just that the addresses are different. Is there a way to compare the cust_refno and see if it already exists, so that even if the address is different, a row with the same cust_refno does not show again?
If you want to get the customers with duplicate addresses, you can count how many times a customer has the same address and return only those with more than one:
SELECT t.FORENAME, t.SURNAME, t.CUST_REFNO, t.DATE_OF_BIRTH , a.LINE1
FROM CUSTOMERS AS t INNER JOIN ADDRESS a ON t.CUST_REFNO = a.CUST_REFNO
GROUP BY t.FORENAME, t.SURNAME, t.CUST_REFNO, t.DATE_OF_BIRTH , a.LINE1
HAVING COUNT(a.LINE1) > 1
You can use window functions to filter out customers with more than one address. Then aggregation can be used to return the duplicates:
select forename, surname, dateofbirth
from customers c join
(select a.*,
count(*) over (partition by cust_refno) as cnt
from address a
where addresstype = 'address'
) a
on c.cust_refno = a.cust_refno
where cnt = 1
group by forename, surname, dateofbirth
having count(*) > 1;
If you want the full customer record, just use window functions twice:
select c.*
from (select c.*,
count(*) over (partition by forename, surname, dateofbirth) as cnt
from customers c
) c join
(select a.*,
count(*) over (partition by cust_refno) as cnt
from address a
where addresstype = 'address'
) a
on c.cust_refno = a.cust_refno
where a.cnt = 1 and c.cnt > 1;
You can use the analytic functions count and row_number as follows:
select * from
(SELECT t.FORENAME, t.SURNAME, t.CUST_REFNO, t.DATE_OF_BIRTH ,
a.LINE1,
row_number() over (partition by t.FORENAME, t.SURNAME, t.DATE_OF_BIRTH
order by t.CUST_REFNO) as rn, -- SQL Server rejects an integer constant such as 1 here
count(1) over (partition by t.FORENAME, t.SURNAME, t.DATE_OF_BIRTH) as cnt
FROM CUSTOMERS AS t
LEFT OUTER JOIN dbo.ADDRESS a ON t.CUST_REFNO = a.CUST_REFNO
WHERE a.addresstype = 'address') t
where cnt > 1 and rn = 1
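A different angle from the answers above, sketched under stated assumptions: decide who is a duplicate on the customers table alone, before any address is joined in, so extra addresses cannot inflate the count. The sketch uses Python's sqlite3 instead of SQL Server so it can run anywhere. The row with cust_refno 60 is hypothetical (it is not in the question) and is there only so that a genuine duplicate exists; the function name and the address_count column are also made up for illustration.

```python
import sqlite3

# Rows copied from the question.
CUSTOMERS = [
    (1, 10, "David", "John", "10-02-1980"),
    (2, 20, "Peter", "Broad", "15-08-1978"),
    (3, 30, "Sarah", "Holly", "16-09-1982"),
    (4, 40, "Mathew", "Mark", "25-08-2001"),
    (5, 50, "Matt", "Mark", "25-08-2001"),
]
# Hypothetical extra customer, added only so that a true duplicate exists.
HYPOTHETICAL_DUPLICATE = (6, 60, "David", "John", "10-02-1980")

ADDRESSES = [
    (1, 10, "address", "No. 10, Mineview Road"),
    (2, 10, "address", "No. 20, Mineview Lane"),
    (3, 20, "address", "Rockview cottage, blackthorn"),
    (4, 30, "mobile", "0504135864"),
    (5, 40, "address", "No. 64, New Lane"),
    (6, 40, "mobile", "0504896532"),
    (7, 50, "address", "No. 11, John's cottage"),
]


def duplicate_customers(customers, addresses):
    """One row per customer sharing forename, surname and dateofbirth."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (custid INTEGER, cust_refno INTEGER, "
        "forename TEXT, surname TEXT, dateofbirth TEXT)"
    )
    conn.execute(
        "CREATE TABLE address (addid INTEGER, cust_refno INTEGER, "
        "addresstype TEXT, line1 TEXT)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?, ?)", customers)
    conn.executemany("INSERT INTO address VALUES (?, ?, ?, ?)", addresses)
    # Duplicates are counted on customers only; addresses are summarised by a
    # correlated subquery, so they never multiply the customer rows.
    query = """
        SELECT c.cust_refno, c.forename, c.surname, c.dateofbirth,
               (SELECT COUNT(*) FROM address a
                WHERE a.cust_refno = c.cust_refno
                  AND a.addresstype = 'address') AS address_count
        FROM customers c
        JOIN (SELECT forename, surname, dateofbirth
              FROM customers
              GROUP BY forename, surname, dateofbirth
              HAVING COUNT(*) > 1) d
          ON c.forename = d.forename
         AND c.surname = d.surname
         AND c.dateofbirth = d.dateofbirth
        ORDER BY c.cust_refno
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


print(duplicate_customers(CUSTOMERS, ADDRESSES))
print(duplicate_customers(CUSTOMERS + [HYPOTHETICAL_DUPLICATE], ADDRESSES))
```

With the question's data alone nothing is reported, even though customer 10 has two addresses. Once the hypothetical second David John is added, each of the two customers appears exactly once.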

Name of Teacher with Highest Wage - recursive CTE

I am trying to get the max salary of each dept and display the first name of the teacher who earns it as a separate column. So dept 1 may have 4 rows, but a single name is shown for the max salary. I'm using SQL Server.
WITH TeacherList AS (
SELECT Teachers.FirstName, Teachers.LastName,
Teachers.FacultyID, TeacherID, 1 AS LVL, PrincipalTeacherID AS ManagerID
FROM dbo.Teachers
WHERE PrincipalTeacherID IS NULL
UNION ALL
SELECT Teachers.FirstName, Teachers.LastName,
Teachers.FacultyID, Teachers.TeacherID, TeacherList.LVL + 1, Teachers.PrincipalTeacherID
FROM dbo.Teachers
INNER JOIN TeacherList ON Teachers.PrincipalTeacherID = TeacherList.TeacherID
WHERE Teachers.PrincipalTeacherID IS NOT NULL)
SELECT * FROM TeacherList;
Sample output:
Teacher First Name | Teacher Last Name | Faculty | Highest Paid In Faculty
Eric               | Smith             | 1       | Eric
Alex               | John              | 1       | Eric
Jessica            | Sewel             | 1       | Eric
Aaron              | Gaye              | 2       | Aaron
Bob                | Turf              | 2       | Aaron
I'm not sure from your description, but this will return all teachers, with the last column holding the name of the highest-paid teacher in the faculty.
select tr.FirstName,
tr.LastName,
tr.FacultyID,
th.FirstName
from Teachers tr
join (
select FacultyID, max(pay) highest_pay
from Teachers
group by FacultyID
) t on tr.FacultyID = t.FacultyID
join Teachers th on th.FacultyID = t.FacultyID and
th.pay = t.highest_pay
This will produce an unexpected result (duplicate rows) if more than one teacher in a faculty shares the highest salary. In that case you can use window functions as follows:
select tr.FirstName,
tr.LastName,
tr.FacultyID,
t.FirstName
from Teachers tr
join
(
select t.FirstName,
t.FacultyID
from
(
select t.*,
row_number() over (partition by FacultyID order by pay desc) rn
from Teachers t
) t
where t.rn = 1
) t on tr.FacultyID = t.FacultyID
This will display just one arbitrary teacher among those with the highest salary in each faculty.
You can do this with a CROSS APPLY.
SELECT FirstName, LastName, FacultyID, HighestPaid
FROM Teachers t
CROSS APPLY (SELECT TOP 1 FirstName AS HighestPaid
FROM Teachers
WHERE FacultyID = t.FacultyID
ORDER BY Salary DESC) ca
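The CROSS APPLY with TOP 1 can be tried out with a rough equivalent in SQLite, sketched here through Python's sqlite3: a correlated scalar subquery with ORDER BY ... LIMIT 1. The names come from the sample output in the question, but the Salary figures are invented for illustration, and breaking ties on TeacherID is an assumption added to make the result deterministic.

```python
import sqlite3

# Names and faculties follow the sample output; the salaries are made up.
TEACHERS = [
    (1, "Eric", "Smith", 1, 5000),
    (2, "Alex", "John", 1, 4000),
    (3, "Jessica", "Sewel", 1, 4500),
    (4, "Aaron", "Gaye", 2, 6000),
    (5, "Bob", "Turf", 2, 3000),
]


def with_highest_paid(teachers):
    """List every teacher next to the top earner's first name in the faculty."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Teachers (TeacherID INTEGER, FirstName TEXT, "
        "LastName TEXT, FacultyID INTEGER, Salary INTEGER)"
    )
    conn.executemany("INSERT INTO Teachers VALUES (?, ?, ?, ?, ?)", teachers)
    # The scalar subquery stands in for CROSS APPLY (SELECT TOP 1 ...).
    # Ordering by TeacherID after Salary breaks ties in a repeatable way.
    query = """
        SELECT t.FirstName, t.LastName, t.FacultyID,
               (SELECT h.FirstName
                FROM Teachers h
                WHERE h.FacultyID = t.FacultyID
                ORDER BY h.Salary DESC, h.TeacherID
                LIMIT 1) AS HighestPaid
        FROM Teachers t
        ORDER BY t.FacultyID, t.TeacherID
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


rows = with_highest_paid(TEACHERS)
for row in rows:
    print(row)
```

Because the subquery returns at most one row per faculty, no teacher is duplicated when two people share the top salary.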

SQL select statement avoiding duplicated rows based on primary key

I have a table employee with two columns: empid (primary key) and name. Suppose it has the three rows below.
EmpID Name
---------------
11 Name1
12 Name2
11 Name3
How would I write a select statement that returns records while excluding the two rows that share a duplicated empid? I used a query like:
select empid, name
from(select empid, name, row_number() over(partition by empid order by empid desc) rnk
from t)a
where a.rnk=1
But this query returns:
EmpID Name
---------------
11 Name1
12 Name2
What I need is:
EmpID Name
---------------
12 Name2
Try this query; it returns only the row 12 Name2:
select empid, name from employee a
join (
select empid , count(empid) as count1 from employee
group by empid
having count(empid)=1 ) b on a.empid=b.empid
select empid, name
from(select empid, name,count(*) over(partition by empid) cnt from t) t
where cnt=1
An anti join using NOT EXISTS might be the fastest approach:
SELECT empID, Name
FROM T
WHERE NOT EXISTS (SELECT 1 FROM T AS T2 WHERE T2.EmpID = T.EmpID AND T2.Name <> T.Name);
I have done no testing, so it is possible that the optimiser can also generate an anti semi-join from a count = 1 predicate, but NOT EXISTS gives it the best possible chance of getting to that plan.
Would not SELECT max(empid) as empid, name from employee group by name having count(distinct empid) < 2 work?
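The suggestions above are easy to compare on the three sample rows. This is a small sketch with Python's sqlite3 (not SQL Server, an assumption made so it runs without a server); the table is called employee throughout, and the three query variable names are only labels for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No primary key constraint here, since the sample data repeats empid 11.
conn.execute("CREATE TABLE employee (empid INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [(11, "Name1"), (12, "Name2"), (11, "Name3")],
)

# Keep only the empids that occur exactly once.
group_having = conn.execute("""
    SELECT a.empid, a.name
    FROM employee a
    JOIN (SELECT empid FROM employee
          GROUP BY empid
          HAVING COUNT(*) = 1) b ON a.empid = b.empid
""").fetchall()

# Anti join: drop a row if another row has the same empid, different name.
not_exists = conn.execute("""
    SELECT e.empid, e.name
    FROM employee e
    WHERE NOT EXISTS (SELECT 1 FROM employee e2
                      WHERE e2.empid = e.empid AND e2.name <> e.name)
""").fetchall()

# Grouping by name instead: every name has a single empid, so all rows pass.
by_name = conn.execute("""
    SELECT MAX(empid) AS empid, name
    FROM employee
    GROUP BY name
    HAVING COUNT(DISTINCT empid) < 2
    ORDER BY name
""").fetchall()
conn.close()

print(group_having)
print(not_exists)
print(by_name)
```

On this data the first two queries return only 12 Name2, while the variant grouped by name returns all three rows, because each name appears with just one empid. So that last suggestion does not filter out the duplicated empid.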

SQL: select from same table and same column, just different counts

I have a table called names, and I want to select 2 names from the distinct names (after grouping with count(*), as uniq), and then another 2 names from the entire sample pool.
firstname
John
John
Jessica
Mary
Jessica
John
David
Walter
So the first 2 names would select from a pool of John, Jessica, and Mary etc giving them equal chances of being selected, while the second 2 names will select from the entire pool, so obvious bias will be given to John and Jessica with multiple rows.
I'm sure there's a way to do this but I just can't figure it out. I want to do something like
SELECT uniq.firstname
FROM (SELECT firstname, count(*) as count from names GROUP BY firstname) uniq
limit 2
AND
SELECT firstname
FROM (SELECT firstname from names) limit 2
Is this possible? Appreciate any pointers.
I think you are close but you need some randomness for the sampling:
(SELECT uniq.firstname
FROM (SELECT firstname, count(*) as count from names GROUP BY firstname) uniq
ORDER BY rand()
limit 2
)
UNION ALL
(SELECT firstname
FROM names
ORDER BY rand()
limit 2
)
You can use RAND or a similar function to achieve this, depending on the database.
MySQL:
SELECT firstname
FROM (SELECT firstname, COUNT(*) as count FROM names GROUP BY firstname) AS uniq
ORDER BY RAND()
LIMIT 2
PostgreSQL:
SELECT firstname
FROM (SELECT firstname, COUNT(*) as count FROM names GROUP BY firstname) AS uniq
ORDER BY RANDOM()
LIMIT 2
Microsoft SQL Server:
SELECT TOP 2 firstname
FROM (SELECT firstname, COUNT(*) as count FROM names GROUP BY firstname) AS uniq
ORDER BY NEWID()
IBM DB2:
SELECT firstname , RAND() as IDX
FROM (SELECT firstname, COUNT(*) as count FROM names GROUP BY firstname) AS uniq
ORDER BY IDX FETCH FIRST 2 ROWS ONLY
Oracle:
SELECT firstname
FROM(SELECT firstname, COUNT(*) as count FROM names GROUP BY firstname ORDER BY dbms_random.value )
WHERE rownum in (1,2)
Follow a similar approach to select from the entire pool.
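To see both samples side by side, here is a minimal sketch in Python's sqlite3, where RANDOM() is the SQLite counterpart of the functions listed above. The two queries are run separately instead of being combined with UNION ALL, which is a simplification on my part; the function name sample_names is hypothetical.

```python
import sqlite3

# The firstname column from the question.
NAMES = ["John", "John", "Jessica", "Mary", "Jessica", "John", "David", "Walter"]


def sample_names(names, k=2):
    """Pick k names from the distinct names and k from the whole pool."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE names (firstname TEXT)")
    conn.executemany("INSERT INTO names VALUES (?)", [(n,) for n in names])
    # Each distinct name is one candidate, so all have an equal chance.
    unique_pick = [row[0] for row in conn.execute(
        """
        SELECT firstname
        FROM (SELECT firstname FROM names GROUP BY firstname) AS uniq
        ORDER BY RANDOM()
        LIMIT ?
        """,
        (k,),
    )]
    # Each row is one candidate, so repeated names are more likely.
    pool_pick = [row[0] for row in conn.execute(
        "SELECT firstname FROM names ORDER BY RANDOM() LIMIT ?",
        (k,),
    )]
    conn.close()
    return unique_pick, pool_pick


unique_pick, pool_pick = sample_names(NAMES)
print(unique_pick, pool_pick)
```

The first list never repeats a name, whereas the second can return John twice because three separate rows carry that name.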

sql command to display highest repeated field in a column

How do I display the most frequently repeated value in a column in SQL?
For example, if a column contains:
jack
jack
john
john
john
how do I display the most repeated value (i.e. john) from the above column?
select chairman
from mytable
group by chairman
HAVING COUNT(*) = (
select TOP 1 COUNT(*)
from mytable
group by chairman
ORDER BY COUNT(*) DESC)
select name from persons
group by name
having count(*) = (
select count(*) from persons
group by name
order by count(*) desc
limit 1)
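The LIMIT version can be tried directly in SQLite. Below is a small sketch through Python's sqlite3 using the persons/name naming from the second query; the function name most_repeated and the final ORDER BY name (added so tied values come back in a stable order) are assumptions of this example.

```python
import sqlite3


def most_repeated(values):
    """Return the value(s) that occur most often, ties included."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE persons (name TEXT)")
    conn.executemany("INSERT INTO persons VALUES (?)", [(v,) for v in values])
    # The subquery finds the highest group count; the outer query keeps
    # every group whose count equals it.
    query = """
        SELECT name
        FROM persons
        GROUP BY name
        HAVING COUNT(*) = (SELECT COUNT(*)
                           FROM persons
                           GROUP BY name
                           ORDER BY COUNT(*) DESC
                           LIMIT 1)
        ORDER BY name
    """
    result = [row[0] for row in conn.execute(query)]
    conn.close()
    return result


winner = most_repeated(["jack", "jack", "john", "john", "john"])
tied = most_repeated(["jack", "jack", "jack", "john", "john", "john"])
print(winner)
print(tied)
```

Comparing against the top count, rather than simply taking the first row, means that values tied for first place are all returned.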