SQL: Finding duplicate records based on custom criteria - sql

I need to find duplicates across two tables based on custom criteria. The following rule determines whether a record is a duplicate; if it is, only the most recent one should be shown:
If the Employee Name and all of the EmployeePolicy CoverageId(s) exactly match those of another record, the record is considered a duplicate.
--Employee Table
EmployeeId Name Salary
543 John 54000
785 Alex 63000
435 John 75000
123 Alex 88000
333 John 67000
--EmployeePolicy Table
EmployeePolicyId EmployeeId CoverageId
1 543 8888
2 543 7777
3 785 5555
4 435 8888
5 435 7777
6 123 4444
7 333 8888
8 333 7776
For example, the duplicates in the example above are the following:
EmployeeId Name Salary
543 John 54000
435 John 75000
This is because they are the only ones that have a matching name in the Employee table and also have exactly the same CoverageIds in the EmployeePolicy table.
Note: EmployeeId 333, also with Name = John, is not a match because his CoverageIds are not all the same as the other Johns' CoverageIds.
At first I tried to find duplicates the old-fashioned way, by grouping records with HAVING COUNT(*) > 1, but I quickly realized that it would not work: my criteria define a duplicate in English, but in SQL the rows carry different CoverageIds, so they are NOT considered duplicates.
By that same accord, I tried something like:
-- Create a TMP table
INSERT INTO #tmp
SELECT *
FROM Employee e join EmployeePolicy ep on e.EmployeeId = ep.EmployeeId
SELECT info.*
FROM
(
SELECT
tmp.*,
ROW_NUMBER() OVER(PARTITION BY tmp.Name, tmp.CoverageId ORDER BY tmp.EmployeeId DESC) AS RowNum
FROM #tmp tmp
) info
WHERE
info.RowNum = 1 AND
Again, this does not work because SQL does not see this as duplicates. Not sure how to translate my English definition of duplicate into SQL definition of duplicate.
Any help is most appreciated.

The easiest way is to concatenate the policies into a string. That, alas, is cumbersome in SQL Server. Here is a set-based approach:
with ep as (
select ep.*, count(*) over (partition by employeeid) as cnt
from employeepolicy ep
)
select ep.employeeid, ep2.employeeid
from ep join
ep ep2
on ep.employeeid < ep2.employeeid and
ep.CoverageId = ep2.CoverageId and
ep.cnt = ep2.cnt
group by ep.employeeid, ep2.employeeid, ep.cnt
having count(*) = ep.cnt -- all match
The idea is to match the coverages for different employees. A simple criterion is that the number of coverages needs to match. Then, it checks that the number of matching coverages equals that count.
Note: This puts the employee id pairs in a single row. You can join back to the employees table to get the additional information.
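As a rough way to try the pairwise-matching idea against the sample data from the question, here is a minimal sketch that runs it from Python on an in-memory SQLite database. Several things here are assumptions rather than part of the answer: SQLite stands in for SQL Server, the helper names `build_sample_db` and `find_duplicate_pairs` are made up for illustration, the per-employee count uses a correlated subquery instead of a window function, and the sketch also compares `Name`, which the query above leaves out. It further assumes that no employee holds the same CoverageId twice.

```python
import sqlite3

# Sample rows copied from the question.
EMPLOYEES = [(543, "John", 54000), (785, "Alex", 63000), (435, "John", 75000),
             (123, "Alex", 88000), (333, "John", 67000)]
POLICIES = [(1, 543, 8888), (2, 543, 7777), (3, 785, 5555), (4, 435, 8888),
            (5, 435, 7777), (6, 123, 4444), (7, 333, 8888), (8, 333, 7776)]

# Pair up employees with the same name, the same number of coverages,
# and as many matching coverages as that number.
PAIRS_SQL = """
WITH ep AS (
    SELECT p.EmployeeId, p.CoverageId, e.Name,
           (SELECT COUNT(*) FROM EmployeePolicy x
             WHERE x.EmployeeId = p.EmployeeId) AS cnt
    FROM EmployeePolicy p
    JOIN Employee e ON e.EmployeeId = p.EmployeeId
)
SELECT a.EmployeeId, b.EmployeeId
FROM ep a
JOIN ep b
  ON a.EmployeeId < b.EmployeeId
 AND a.Name = b.Name
 AND a.CoverageId = b.CoverageId
 AND a.cnt = b.cnt
GROUP BY a.EmployeeId, b.EmployeeId, a.cnt
HAVING COUNT(*) = a.cnt
ORDER BY a.EmployeeId, b.EmployeeId
"""


def build_sample_db():
    """Create an in-memory database holding the question's two tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Employee (EmployeeId INTEGER, Name TEXT, Salary INTEGER)")
    conn.execute("CREATE TABLE EmployeePolicy "
                 "(EmployeePolicyId INTEGER, EmployeeId INTEGER, CoverageId INTEGER)")
    conn.executemany("INSERT INTO Employee VALUES (?, ?, ?)", EMPLOYEES)
    conn.executemany("INSERT INTO EmployeePolicy VALUES (?, ?, ?)", POLICIES)
    return conn


def find_duplicate_pairs(conn):
    """Return (lower EmployeeId, higher EmployeeId) pairs that are duplicates."""
    return conn.execute(PAIRS_SQL).fetchall()


if __name__ == "__main__":
    print(find_duplicate_pairs(build_sample_db()))
```

On the sample data this pairs EmployeeId 435 with 543 and leaves out 333, whose second CoverageId differs.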

I have not tested the T-SQL but I believe the following should give you the output you are looking for.
;WITH CTE_Employee
AS
(
SELECT E.[Name]
,E.[EmployeeId]
,P.[CoverageId]
,E.[Salary]
FROM Employee E
INNER JOIN EmployeePolicy P ON E.EmployeeId = P.EmployeeId
)
, CTE_DuplicateCoverage
AS
(
SELECT E.[Name]
,E.[CoverageId]
FROM CTE_Employee E
GROUP BY E.[Name], E.[CoverageId]
HAVING COUNT(*) > 1
)
SELECT E.[EmployeeId]
,E.[Name]
,MAX(E.[Salary]) AS [Salary]
FROM CTE_Employee E
INNER JOIN CTE_DuplicateCoverage D ON E.[Name] = D.[Name] AND E.[CoverageId] = D.[CoverageId]
GROUP BY E.[EmployeeId], E.[Name]
HAVING COUNT(*) > 1
ORDER BY E.[EmployeeId]

Related

Access Query: Subtract last 2 values, specific to ID

Help appreciated! My table is set up as follows (fake data, TableName = GAD7):
PatientID Date Value
Sam 10/21/2022 15
George 06/12/2022 7
Luke 09/03/2021 11
Sam 05/15/2020 20
George 12/02/2017 2
George 01/01/1992 6
So I potentially have multiple rows for the same patient, with different dates.
I need to create a query that subtracts the last two (most recent) values for each patient.
So my query would show only those with 2+ records. Negative values are fine/expected.
My successful query would then show:
PatientID (LastScore - 2nd_toLastScore)
Sam -5.0
George 5.0
Luke is not shown because he only has one value
I was able to formulate a query to show only those PatientIDs with >= 2 records and last date and last value. I am not sure how to get the second from last date/value AND THEN subtract those values.
The SQL view :
SELECT GAD7.PatientID, Count(GAD7.PatientID) AS CountOfPatientID, Last(GAD7.TestDate) AS LastDate, Last(GAD7.Score) AS LastScore
FROM GAD7
GROUP BY GAD7.PatientID
HAVING (((Count(GAD7.PatientID))>=2))
ORDER BY GAD7.PatientID;
Consider:
Query1: Score1
SELECT GAD7.*
FROM GAD7
WHERE 1=(SELECT Count(*)+1 FROM GAD7 AS G7
WHERE G7.PatientID=GAD7.PatientID AND G7.TestDate>GAD7.TestDate);
Query2: Score2
SELECT GAD7.*
FROM GAD7
WHERE 2=(SELECT Count(*)+1 FROM GAD7 AS G7
WHERE G7.PatientID=GAD7.PatientID AND G7.TestDate>GAD7.TestDate);
Query3:
SELECT Score1.PatientID, [Score1].[Score]-[Score2].[Score] AS D
FROM Score1 INNER JOIN Score2 ON Score1.PatientID = Score2.PatientID;
The SQL statements could be nested for an all-in-one query.
Or this all-in-one version using TOP N to pull previous Score:
SELECT GAD7.*, (SELECT TOP 1 Score FROM GAD7 AS Dupe
WHERE Dupe.PatientID = GAD7.PatientID AND Dupe.TestDate<GAD7.TestDate
ORDER BY Dupe.TestDate DESC) AS PrevScore
FROM GAD7 WHERE PatientID IN
(SELECT PatientID FROM GAD7 GROUP BY PatientID HAVING Count(*)>1)
AND 1=(SELECT Count(*)+1 FROM GAD7 AS G7 WHERE G7.PatientID=GAD7.PatientID AND G7.TestDate>GAD7.TestDate);
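To check the "latest score minus previous score" logic against the question's sample rows, here is a minimal sketch using Python's built-in sqlite3 module. It is not Access SQL: SQLite's `LIMIT 1` plays the role of `TOP 1`, the dates are stored as ISO strings so that text comparison orders them correctly, and the function name `last_two_difference` is made up for illustration. It assumes a patient never has two rows with the same TestDate.

```python
import sqlite3

# Sample rows from the question, with dates rewritten as YYYY-MM-DD.
ROWS = [
    ("Sam", "2022-10-21", 15),
    ("George", "2022-06-12", 7),
    ("Luke", "2021-09-03", 11),
    ("Sam", "2020-05-15", 20),
    ("George", "2017-12-02", 2),
    ("George", "1992-01-01", 6),
]

DIFF_SQL = """
SELECT g.PatientID,
       g.Score - (SELECT p.Score
                  FROM GAD7 AS p
                  WHERE p.PatientID = g.PatientID
                    AND p.TestDate < g.TestDate
                  ORDER BY p.TestDate DESC
                  LIMIT 1) AS Diff
FROM GAD7 AS g
WHERE NOT EXISTS (SELECT 1 FROM GAD7 AS n        -- g is the latest row
                  WHERE n.PatientID = g.PatientID
                    AND n.TestDate > g.TestDate)
  AND EXISTS (SELECT 1 FROM GAD7 AS p            -- and an earlier row exists
              WHERE p.PatientID = g.PatientID
                AND p.TestDate < g.TestDate)
ORDER BY g.PatientID
"""


def last_two_difference(rows):
    """Return (PatientID, latest score - previous score) for patients with 2+ rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE GAD7 (PatientID TEXT, TestDate TEXT, Score INTEGER)")
    conn.executemany("INSERT INTO GAD7 VALUES (?, ?, ?)", rows)
    result = conn.execute(DIFF_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(last_two_difference(ROWS))
```

Luke does not appear in the result because he has no earlier row to subtract.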

How to get the newest address for each person

I have created a simple Access database as an example of the table structure and a query. Two tables, person (3 records) and address (5 records), make it possible to capture multiple addresses for each person. I am fine with normal conditional statements, but this one is throwing me for a loop...
I am looking for a query that will return only the newest address for a given person.
Current sql for the query:
SELECT Person.PersonID_PK, Address.Address, Address.StatusDate
FROM Person INNER JOIN Address ON Person.[PersonID_PK] = Address.PersonID_FK;
My current query returns:
EmployeeID_PK Address StatusDate
1 12 Elm St, MN 23569 11/13/2017
1 15 Apple Ln, NY 12345 7/15/2018
2 30 Mulberry, TN 38456 6/11/2018
2 10 Lonesome Pine, KY 15487 12/4/2018
3 100 Plaze Place, LA 14563 6/17/2018
I need to return each person along with the greatest(newest) StatusDate
My expected return should be:
EmployeeID_PK Address StatusDate
1 15 Apple Ln, NY 12345 7/15/2018
2 10 Lonesome Pine, KY 15487 12/4/2018
3 100 Plaze Place, LA 14563 6/17/2018
You can use a correlated subquery:
select a.*
from address as a
where a.statusdate = (select max(a2.statusdate)
from address as a2
where a2.PersonID_FK = a.PersonID_FK
);
Using a CTE with a ranking function works neatly in SQL Server (Access SQL supports neither CTEs nor window functions).
;WITH empaddress AS
(
SELECT person.personid_pk,
address.address,
address.statusdate,
Dense_rank() OVER (partition BY person.personid_pk ORDER BY address.statusdate DESC) AS d_rank
FROM person
INNER JOIN address
ON person.[PersonID_PK] = address.personid_fk
)
SELECT personid_pk,
address,
statusdate
FROM empaddress
WHERE d_rank = 1;
Thanks for the help. I modified Gordon's code to support my query, the following provided the answer.
SELECT Person.PersonID_PK, Address.Address, Address.StatusDate
FROM Person
INNER JOIN Address
ON Person.[PersonID_PK] = Address.PersonID_FK
WHERE (((Address.StatusDate)=(SELECT MAX(Address.StatusDate)
FROM Address
WHERE Person.PersonID_PK = Address.PersonID_FK)));
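The correlated-subquery approach can be tried outside Access as well. Below is a minimal sketch using Python's sqlite3 module and the rows shown in the question. SQLite is only a stand-in for Access here, the dates are rewritten as ISO strings so that `MAX` picks the newest one, and the function name `newest_addresses` is an assumption made for illustration.

```python
import sqlite3

# (PersonID_FK, Address, StatusDate) rows from the question, dates as YYYY-MM-DD.
ADDRESSES = [
    (1, "12 Elm St, MN 23569", "2017-11-13"),
    (1, "15 Apple Ln, NY 12345", "2018-07-15"),
    (2, "30 Mulberry, TN 38456", "2018-06-11"),
    (2, "10 Lonesome Pine, KY 15487", "2018-12-04"),
    (3, "100 Plaze Place, LA 14563", "2018-06-17"),
]

NEWEST_SQL = """
SELECT p.PersonID_PK, a.Address, a.StatusDate
FROM Person AS p
JOIN Address AS a ON p.PersonID_PK = a.PersonID_FK
WHERE a.StatusDate = (SELECT MAX(a2.StatusDate)
                      FROM Address AS a2
                      WHERE a2.PersonID_FK = a.PersonID_FK)
ORDER BY p.PersonID_PK
"""


def newest_addresses(rows):
    """Return one (person id, address, date) row per person: the newest address."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Person (PersonID_PK INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE Address (PersonID_FK INTEGER, Address TEXT, StatusDate TEXT)")
    person_ids = sorted({row[0] for row in rows})
    conn.executemany("INSERT INTO Person VALUES (?)", [(pid,) for pid in person_ids])
    conn.executemany("INSERT INTO Address VALUES (?, ?, ?)", rows)
    result = conn.execute(NEWEST_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in newest_addresses(ADDRESSES):
        print(row)
```

If a person had two addresses sharing the same newest StatusDate, both rows would be returned; a tie-breaker column would be needed to pick one.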

Oracle SQL - finding sets that contain another set

Say I have a table A that contains a list of potential employees' IDs and their professional skills in the form of a skill code:
ID | skill code
005 12
005 3
007 42
007 8
013 6
013 22
013 18
And I have another table B that lists several job position ID's and their corresponding required skill ID's:
Job ID | skill code
1 3
1 32
1 21
1 44
2 15
2 62
.
.
.
How can I find out which job IDs a specific person is qualified for? I need to select all job IDs that contain all of the person's skills. For instance, if I need to find all job IDs that employee ID 003 is qualified for, how would I structure an Oracle SQL query to get this information?
I want to be able to enter any employee ID in a WHERE clause to find what jobs that person is qualified for.
An idea would be to count the number of skills for every person and job:
SELECT A.id as person_id,
B.JOB_ID
FROM A
JOIN B
ON A.skill_code=B.skill_code
GROUP BY a.id, b.job_id
HAVING count(*) = (select count(*) from b b2 where b2.job_id = b.job_id);
Not tested and assuming that tables are well normalized.
UPDATE after the OP's comment.
The request is for all the jobs that require all of a person's skills:
SELECT A.id as person_id,
B.JOB_ID
FROM A
JOIN B
ON A.skill_code=B.skill_code
GROUP BY a.id, b.job_id
HAVING count(*) = (select count(*) from a a2 where a2.id = a.id);
Update2: The question was updated with:
I want to be able to enter any employee ID in a WHERE clause to find what jobs that person is qualified for.
For this, you just add WHERE a.id = :emp_id to the first query. (above group by)
Try this one
WITH b1 AS
(SELECT job_id,
skill,
COUNT(*) over (partition BY job_id order by job_id) rr
FROM b
) ,
res1 AS
(SELECT a.id,
b1.job_id,
rr,
COUNT(*) over (partition BY id, job_id order by id) rr2
FROM A
JOIN B1
ON A.skill=B1.skill
)
SELECT id, job_id FROM res1 WHERE rr=rr2
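The counting idea (compare the number of matched skills with the person's total number of skills) can be sketched with Python's sqlite3 module. This is an illustration under assumptions, not Oracle code: the rows below are hypothetical, since the question's sample data is elided, the function name `jobs_covering_person` is made up, and the `:emp_id` named parameter stands in for an Oracle bind variable. It assumes neither table contains duplicate rows.

```python
import sqlite3

# Hypothetical rows, chosen so that each listed person fits exactly one job.
PERSON_SKILLS = [("005", 12), ("005", 3), ("007", 42), ("007", 8)]
JOB_SKILLS = [(1, 3), (1, 12), (1, 21), (2, 42), (2, 8), (3, 3), (3, 44)]

# A job qualifies when the number of skills it shares with the person
# equals the total number of skills the person has.
COVER_SQL = """
SELECT j.job_id
FROM A AS p
JOIN B AS j ON p.skill_code = j.skill_code
WHERE p.id = :emp_id
GROUP BY j.job_id
HAVING COUNT(*) = (SELECT COUNT(*) FROM A AS p2 WHERE p2.id = :emp_id)
ORDER BY j.job_id
"""


def jobs_covering_person(emp_id):
    """Return the job ids whose required skills include every skill of emp_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE A (id TEXT, skill_code INTEGER)")
    conn.execute("CREATE TABLE B (job_id INTEGER, skill_code INTEGER)")
    conn.executemany("INSERT INTO A VALUES (?, ?)", PERSON_SKILLS)
    conn.executemany("INSERT INTO B VALUES (?, ?)", JOB_SKILLS)
    rows = conn.execute(COVER_SQL, {"emp_id": emp_id}).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(jobs_covering_person("005"))
```

Job 3 shares skill 3 with person 005 but is not returned, because it does not list skill 12. To test the opposite direction (the person has every skill a job requires), the count in the HAVING clause would be taken over B per job instead.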

Ranking Aggregate Field in Access Query

I am trying to rank an aggregate field in Access, but my efforts have been in vain, with errors about referencing. I am ranking with a subquery, but the problem arises from the alias names that result from averaging a field. The code is below:
SELECT [Exams].[StudentID],
Avg([Exams].[Biology]) AS [AvgBiology],
(SELECT Avg(T.Biology) AS [TAvgBiology],
Count(*)
FROM [Exams] AS T
WHERE T.[TAvgBiology] > [AvgBiology])
+ 1 AS Rank
FROM [Exams]
GROUP BY [Exams].[StudentID]
ORDER BY Avg([Exams].[Biology]) DESC;
Errors that come about state: "You have selected a subquery that can return more than one value blah blah...please use the Exist keyword.. ".
From the code above I think you get the gist of what I am trying to achieve.
Start with the basic GROUP BY query Gordon Linoff suggested to compute the average Biology for each StudentID.
SELECT
e.StudentID,
Avg(e.Biology) AS AvgBiology
FROM Exams AS e
GROUP BY e.StudentID
Save that query as qryAvgBiology and then use it in another query where you compute Rank.
SELECT
q.StudentID,
q.AvgBiology,
(
(
SELECT Count(*)
FROM qryAvgBiology AS q2
WHERE q2.AvgBiology > q.AvgBiology
)
+1
) AS Rank
FROM qryAvgBiology AS q
ORDER BY 3;
For example, if qryAvgBiology returns this result set ...
StudentID AvgBiology
--------- ----------
1 70
2 80
3 90
The ranking query will transform it to this ...
StudentID AvgBiology Rank
--------- ---------- ----
3 90 1
2 80 2
1 70 3
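The two-step approach (average first, then count how many averages are higher) can be checked with a short sketch using Python's sqlite3 module. The exam rows are hypothetical, chosen to yield the averages 70, 80 and 90 from the example; a CTE named `avg_bio` stands in for the saved Access query qryAvgBiology, and the function name `rank_students` is an assumption for illustration.

```python
import sqlite3

# Hypothetical (StudentID, Biology) rows: averages are 70, 80 and 90.
EXAMS = [(1, 60), (1, 80), (2, 80), (3, 85), (3, 95)]

RANK_SQL = """
WITH avg_bio AS (
    SELECT StudentID, AVG(Biology) AS AvgBiology
    FROM Exams
    GROUP BY StudentID
)
SELECT q.StudentID,
       q.AvgBiology,
       1 + (SELECT COUNT(*)
            FROM avg_bio AS q2
            WHERE q2.AvgBiology > q.AvgBiology) AS Ranking
FROM avg_bio AS q
ORDER BY 3
"""


def rank_students(exams):
    """Return (StudentID, average, rank) rows, best average first."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Exams (StudentID INTEGER, Biology INTEGER)")
    conn.executemany("INSERT INTO Exams VALUES (?, ?)", exams)
    result = conn.execute(RANK_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in rank_students(EXAMS):
        print(row)
```

Students with equal averages receive the same rank under this scheme, and the next rank is skipped.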
I assume your basic query is:
SELECT e.StudentId, Avg(e.Biology) AS AvgBiology
FROM exams as e
GROUP BY e.StudentId;
(Square braces don't help me understand the query at all.)
I think the following will work in Access:
SELECT e.StudentId, Avg(e.Biology) AS AvgBiology,
(SELECT 1 + COUNT(*)
FROM (SELECT e.StudentId, Avg(e.Biology) AS AvgBiology
FROM exams as e
GROUP BY e.StudentId
) e2
WHERE e2.AvgBiology > Avg(e.Biology)
) as ranking
FROM exams as e
GROUP BY e.StudentId;

How do I write a standard SQL GROUP BY that includes columns not in the GROUP BY clause

Let's say I have a table called Customer, defined like this:
Id Name DepartmentId Hired
1 X 101 2001/01/01
2 Y 102 2002/01/01
3 Z 102 2003/01/01
And I want to retrieve the date of the last hiring in each department.
Obviously I would do this
SELECT c.DepartmentId, MAX(c.Hired)
FROM Customer c
GROUP BY c.DepartmentId
Which returns:
101 2001/01/01
102 2003/01/01
But what do I do if I want to return the name of the guy hired? I.e. I would want this result set:
101 2001/01/01 X
102 2003/01/01 Z
Note that the following does not work, as it would return three rows rather than the two I'm looking for:
SELECT c.DepartmentId, c.Name, MAX(c.Hired)
FROM Customer c
GROUP BY c.DepartmentId
I can't remember seeing a query that achieves this.
NOTE: It's not acceptable to join on the Hired field, as that would not be guaranteed to be accurate.
A subselect would do the job and would handle the case where more than one person was hired in the same department on the same day:
SELECT c.DepartmentId, c.Name, c.Hired from Customer c,
(SELECT DepartmentId, MAX(Hired) as MaxHired
FROM Customer
GROUP BY DepartmentId) as sub
WHERE c.DepartmentId = sub.DepartmentId AND c.Hired = sub.MaxHired
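A quick way to see the subselect approach produce the desired result is the sketch below, which uses Python's sqlite3 module and the three Customer rows from the question. SQLite is an assumption here (the question does not name a database), the join is written with explicit JOIN syntax instead of the comma form, and the function name `latest_hires` is made up for illustration.

```python
import sqlite3

# (Id, Name, DepartmentId, Hired) rows from the question.
CUSTOMERS = [
    (1, "X", 101, "2001/01/01"),
    (2, "Y", 102, "2002/01/01"),
    (3, "Z", 102, "2003/01/01"),
]

LATEST_SQL = """
SELECT c.DepartmentId, c.Hired, c.Name
FROM Customer AS c
JOIN (SELECT DepartmentId, MAX(Hired) AS MaxHired
      FROM Customer
      GROUP BY DepartmentId) AS sub
  ON c.DepartmentId = sub.DepartmentId
 AND c.Hired = sub.MaxHired
ORDER BY c.DepartmentId, c.Name
"""


def latest_hires(customers):
    """Return (DepartmentId, Hired, Name) for the most recent hire(s) per department."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Customer "
                 "(Id INTEGER, Name TEXT, DepartmentId INTEGER, Hired TEXT)")
    conn.executemany("INSERT INTO Customer VALUES (?, ?, ?, ?)", customers)
    result = conn.execute(LATEST_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in latest_hires(CUSTOMERS):
        print(row)
```

The dates compare correctly as text only because they are written year first; if two people were hired in the same department on the same latest date, both rows come back.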
Standard Sql:
select *
from Customer C
where exists
(
-- Linq to Sql puts NULL here instead ;-)
-- In fact, you can even put 1/0 here and it would not cause a division-by-zero error
-- An RDBMS does not evaluate the select list of an EXISTS subquery
SELECT NULL
FROM Customer
where c.DepartmentId = DepartmentId
GROUP BY DepartmentId
having c.Hired = MAX(Hired)
)
If SQL Server happens to support tuple testing, this is the most succinct:
select *
from Customer
where (DepartmentId, Hired) in
(select DepartmentId, MAX(Hired)
from Customer
group by DepartmentId)
SELECT a.*
FROM Customer AS a
JOIN
(SELECT DepartmentId, MAX(Hired) AS Hired
FROM Customer GROUP BY DepartmentId) AS b
USING (DepartmentId,Hired);