Identifying duplicates within a partition with different IDs

I am new to SQL and data analysis.
I have a scenario I am trying to identify using SQL partitions.
Basically, I want to find duplicates (same first_name, last_name, suffix code and ZIP code), but only if the IDs are different.
This query gives me only partial results, which is not correct. I know I am missing a filter here and there.
SELECT I.PARTY_ID,
       I.FIRST_NM,
       I.LAST_NM,
       I.SFFX_CD,
       A.ZIP_CD,
       ROW_NUMBER() OVER (PARTITION BY I.FIRST_NM,
                                       I.LAST_NM,
                                       I.SFFX_CD,
                                       A.ZIP_CD
                          ORDER BY I.PARTY_ID) AS RN
FROM INDVDL I,
     PARTY_ADDR A
WHERE I.PARTY_ID = A.PARTY_ID
I should only get the rows marked with ** and not the rest:
PARTY_ID FIRST_NM LAST_NM SFFX_CD ZIP_CD RN
886874 John Doe Jr. 45402 1
886874 John Doe Jr. 45406 1
934635 John Doe Jr. 45406 2
886874 John Doe Jr. 45415 1
886874 John Doe Jr. 45415 2
886874 John Doe Jr. 45415 3
886874 John Doe Jr. 45415 4
886874 John Doe Jr. 45415 5
886874 John Doe Jr. 45415 6
**886874 John Doe Jr. 45415 7
**934635 John Doe Jr. 45415 8
934635 John Doe Jr. 45415 9
934635 John Doe Jr. 45415 10

Here is my suggestion. Use window functions to get the minimum and maximum values of PARTY_ID for the groups you have in mind. Then, filter to return only rows where these are different:
SELECT *
FROM (SELECT I.PARTY_ID, I.FIRST_NM, I.LAST_NM, I.SFFX_CD, A.ZIP_CD,
             MIN(I.PARTY_ID) OVER (PARTITION BY I.FIRST_NM, I.LAST_NM, I.SFFX_CD, A.ZIP_CD) AS min_pi,
             MAX(I.PARTY_ID) OVER (PARTITION BY I.FIRST_NM, I.LAST_NM, I.SFFX_CD, A.ZIP_CD) AS max_pi
      FROM INDVDL I JOIN
           PARTY_ADDR A
           ON I.PARTY_ID = A.PARTY_ID
     ) ia
WHERE min_pi <> max_pi;
Note: I fixed your join syntax to use explicit joins. Simple rule: never use commas in the FROM clause.
Also, the subquery lists its columns explicitly, because selecting both i.* and a.* would return PARTY_ID twice, and several databases (SQL Server, for example) reject duplicate column names in a derived table. Add in any other columns you want.
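To see the MIN/MAX filter in action, here is a minimal sketch that runs the same idea through Python's built-in sqlite3 module (this assumes SQLite 3.25 or later for window functions; the original question is probably on a different database). The table and column names come from the question, but the three address rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE INDVDL (PARTY_ID INTEGER, FIRST_NM TEXT, LAST_NM TEXT, SFFX_CD TEXT);
    CREATE TABLE PARTY_ADDR (PARTY_ID INTEGER, ZIP_CD TEXT);
    -- Hypothetical sample rows: two party IDs share a name; only ZIP 45406 is shared.
    INSERT INTO INDVDL VALUES (886874, 'John', 'Doe', 'Jr.'), (934635, 'John', 'Doe', 'Jr.');
    INSERT INTO PARTY_ADDR VALUES (886874, '45402'), (886874, '45406'), (934635, '45406');
""")

rows = conn.execute("""
    SELECT PARTY_ID, ZIP_CD
    FROM (SELECT I.PARTY_ID, A.ZIP_CD,
                 MIN(I.PARTY_ID) OVER (PARTITION BY I.FIRST_NM, I.LAST_NM, I.SFFX_CD, A.ZIP_CD) AS min_pi,
                 MAX(I.PARTY_ID) OVER (PARTITION BY I.FIRST_NM, I.LAST_NM, I.SFFX_CD, A.ZIP_CD) AS max_pi
          FROM INDVDL I
          JOIN PARTY_ADDR A ON I.PARTY_ID = A.PARTY_ID) ia
    WHERE min_pi <> max_pi   -- keep only groups that contain more than one PARTY_ID
    ORDER BY PARTY_ID
""").fetchall()

print(rows)
```

With this sample data, the 45402 row drops out because only one PARTY_ID lives at that ZIP code, while both rows for 45406 are kept.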

Related

Stuff() Not Grouping Accurately

I am using an older version of SQL Server and trying to convert rows to concatenated columns. From researching here on Stack Overflow, I see that I should be using STUFF(). However, when I attempt to replicate the answers I found here, I can't get the grouping correct. Instead of concatenating the names tied to my GROUP BY column, it concatenates every single row and then repeats that result for every group.
My base table #Temp is laid out as such:
CleanName     FullName        Total
Doe, Jane     DO, JANE        4
Doe, Jane     DOE, JANE S.    15
Doe, Jane     Doe, J.         23
Smith, John   Smith, J.       4
Smith, John   Smith, Jon      10
Smith, John   Smith, John     103
I am trying to get results like this:
CleanName     Concat_FullName                       Sum(Total)
Doe, Jane     DO, JANE; DOE, JANE S.; Doe, J.       42
Smith, John   Smith, J.; Smith, Jon; Smith, John    117
This is what I tried running, based on my research on Stack Overflow:
SELECT
STAND_PRESC_NAME,
CONCAT_FULLNAME = STUFF(( SELECT '; ' + FULLNAME
FROM #TEMP
FOR XML PATH(''), TYPE).value('.', 'VARCHAR(MAX)'),1,1,''),
SUM(TOTAL)
FROM #TEMP
GROUP BY STAND_PRESC_NAME
However, the result had every row concatenated together for each group, which is not what I want:
CleanName     Concat_FullName                                                        Sum(Total)
Doe, Jane     DO, JANE; DOE, JANE S.; Doe, J.; Smith, J.; Smith, Jon; Smith, John    42
Smith, John   DO, JANE; DOE, JANE S.; Doe, J.; Smith, J.; Smith, Jon; Smith, John    117
How do I need to alter my STUFF() usage to appropriately group by CleanName?

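You forgot to correlate the subquery with the outer query. Without a WHERE clause, it concatenates every row of #TEMP for each group:
SELECT
    t2.STAND_PRESC_NAME,
    CONCAT_FULLNAME = STUFF((SELECT '; ' + t.FULLNAME
                             FROM #TEMP t
                             WHERE t.STAND_PRESC_NAME = t2.STAND_PRESC_NAME -- this
                             FOR XML PATH(''), TYPE).value('.', 'VARCHAR(MAX)'), 1, 2, ''), -- strip the leading '; ' (two characters)
    SUM(t2.TOTAL)
FROM #TEMP t2
GROUP BY t2.STAND_PRESC_NAME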
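The effect of the correlation can be tried out without SQL Server. The sketch below uses Python's built-in sqlite3 module, so group_concat stands in for the STUFF / FOR XML PATH idiom (a swapped-in technique, not the same syntax), and the table is called temp because #TEMP is SQL Server syntax. The rows are the ones from the question, using its CleanName column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE temp (CleanName TEXT, FullName TEXT, Total INTEGER);
    INSERT INTO temp VALUES
        ('Doe, Jane',   'DO, JANE',      4),
        ('Doe, Jane',   'DOE, JANE S.', 15),
        ('Doe, Jane',   'Doe, J.',      23),
        ('Smith, John', 'Smith, J.',     4),
        ('Smith, John', 'Smith, Jon',   10),
        ('Smith, John', 'Smith, John', 103);
""")

query = """
    SELECT t2.CleanName,
           (SELECT group_concat(t.FullName, '; ')
            FROM temp t
            WHERE t.CleanName = t2.CleanName) AS concat_fullname,  -- the correlation
           SUM(t2.Total) AS total
    FROM temp t2
    GROUP BY t2.CleanName
    ORDER BY t2.CleanName
"""

# group_concat does not promise an order, so compare the names as a set.
result = {name: (set(names.split("; ")), total)
          for name, names, total in conn.execute(query)}
print(result)
```

Removing the WHERE line from the inner query reproduces the problem from the question: every group gets all six names.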

Select random records with no duplicates

For an auditing project, I need to select three tracking IDs at random per associate, with no duplicates. Is this possible with SQL?
Sample SQL Server Data:
Associate        Tracking ID
Smith, Mary      TRK65152
Smith, Mary      TRK74183
Smith, Mary      TRK35154
Smith, Mary      TRK23117
Smith, Mary      TRK11889
Jones, Walter    TRK17364
Jones, Walter    TRK91736
Jones, Walter    TRK88234
Jones, Walter    TRK80012
Jones, Walter    TRK55874
Williams, Tony   TRK58142
Williams, Tony   TRK47336
Williams, Tony   TRK13254
Williams, Tony   TRK28596
Williams, Tony   TRK33371

You may use ROW_NUMBER here with a random ordering:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Associate ORDER BY NEWID()) rn
FROM yourTable
)
SELECT Associate, TrackingID
FROM cte
WHERE rn <= 3;
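Here is a rough way to try the random ROW_NUMBER approach locally, sketched with Python's built-in sqlite3 module. SQLite has no NEWID(), so random() is used for the ordering instead (an assumption of this sketch; it needs SQLite 3.25 or later). The rows are the first two associates from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (Associate TEXT, TrackingID TEXT)")
conn.executemany("INSERT INTO yourTable VALUES (?, ?)", [
    ("Smith, Mary", "TRK65152"), ("Smith, Mary", "TRK74183"),
    ("Smith, Mary", "TRK35154"), ("Smith, Mary", "TRK23117"),
    ("Smith, Mary", "TRK11889"),
    ("Jones, Walter", "TRK17364"), ("Jones, Walter", "TRK91736"),
    ("Jones, Walter", "TRK88234"), ("Jones, Walter", "TRK80012"),
    ("Jones, Walter", "TRK55874"),
])

picked = conn.execute("""
    WITH cte AS (
        SELECT Associate, TrackingID,
               -- random() plays the role of NEWID(): shuffle the rows within each associate
               ROW_NUMBER() OVER (PARTITION BY Associate ORDER BY random()) AS rn
        FROM yourTable
    )
    SELECT Associate, TrackingID
    FROM cte
    WHERE rn <= 3
""").fetchall()

for associate, tracking_id in picked:
    print(associate, tracking_id)
```

Each source row gets exactly one row number, so the same row cannot be picked twice. If the table itself held the same tracking ID more than once for an associate, it would need de-duplicating first.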

How to make a DISTINCT CONCAT statement?

New SQL developer here. How do I make a DISTINCT CONCAT statement?
Here is my statement without the DISTINCT keyword:
COLUMN Employee FORMAT a25;
SELECT CONCAT(CONCAT(EMPLOYEEFNAME, ' '), EMPLOYEELNAME) AS "Employee", JOBTITLE "Job Title"
FROM Employee
ORDER BY EMPLOYEEFNAME;
Here is its output:
Employee Job Title
------------------------- -------------------------
Bill Murray Cable Installer
Bill Murray Cable Installer
Bob Smith Project Manager
Bob Smith Project Manager
Frank Herbert Network Specilist
Henry Jones Technical Support
Homer Simpson Programmer
Jane Doe Programmer
Jane Doe Programmer
Jane Doe Programmer
Jane Fonda Project Manager
John Jameson Cable Installer
John Jameson Cable Installer
John Carpenter Technical Support
John Carpenter Technical Support
John Jameson Cable Installer
John Carpenter Technical Support
John Carpenter Technical Support
Kathy Smith Network Specilist
Mary Jane Project Manager
Mary Jane Project Manager
21 rows selected
If I used the DISTINCT keyword, I should get only 11 rows. However,
when I use SELECT DISTINCT CONCAT I get an error.

One option is to use GROUP BY:
SELECT CONCAT(CONCAT(EMPLOYEEFNAME, ' '), EMPLOYEELNAME) AS "Employee",
JOBTITLE AS "Job Title"
FROM Employee
GROUP BY CONCAT(CONCAT(EMPLOYEEFNAME, ' '), EMPLOYEELNAME),
JOBTITLE
ORDER BY "Employee"
Another option, if you really want to use DISTINCT, would be to wrap your current query in a subquery:
SELECT DISTINCT t."Employee", -- a quoted alias must be referenced with the same quotes and case
       t."Job Title"
FROM
(
    SELECT CONCAT(CONCAT(EMPLOYEEFNAME, ' '), EMPLOYEELNAME) AS "Employee",
           JOBTITLE AS "Job Title"
    FROM Employee
) t
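As a quick check that GROUP BY and DISTINCT de-duplicate the same way, here is a small sketch using Python's built-in sqlite3 module. It is only an approximation of the original setup: the || operator stands in for the nested CONCAT calls, the aliases full_name and job_title are made up for this sketch, and only a handful of rows from the question's output are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EMPLOYEEFNAME TEXT, EMPLOYEELNAME TEXT, JOBTITLE TEXT)")
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?)", [
    ("Bill", "Murray", "Cable Installer"),
    ("Bill", "Murray", "Cable Installer"),
    ("Bob", "Smith", "Project Manager"),
    ("Bob", "Smith", "Project Manager"),
    ("Homer", "Simpson", "Programmer"),
])

# Option 1: group by the concatenated name and the job title.
grouped_rows = conn.execute("""
    SELECT EMPLOYEEFNAME || ' ' || EMPLOYEELNAME AS full_name, JOBTITLE AS job_title
    FROM Employee
    GROUP BY EMPLOYEEFNAME || ' ' || EMPLOYEELNAME, JOBTITLE
    ORDER BY full_name
""").fetchall()

# Option 2: DISTINCT over a subquery that builds the concatenated name.
distinct_rows = conn.execute("""
    SELECT DISTINCT t.full_name, t.job_title
    FROM (SELECT EMPLOYEEFNAME || ' ' || EMPLOYEELNAME AS full_name, JOBTITLE AS job_title
          FROM Employee) t
    ORDER BY t.full_name
""").fetchall()

print(grouped_rows)
print(distinct_rows)
```

Both queries collapse the five sample rows to three.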

How do I transpose multiple rows to columns in SQL

This is my first time posting a question on here.
I work at a university and have a table of student IDs and their supervisors. Some students have one supervisor, and some have two or three, depending on their subject.
The table looks like this:
ID Supervisor
1 John Doe
2 Peter Jones
2 Sarah Jones
3 Peter Jones
3 Sarah Jones
4 Stephen Davies
4 Peter Jones
4 Sarah Jones
5 John Doe
I want to create a view that turns that into this:
ID   Supervisor 1     Supervisor 2   Supervisor 3
1    John Doe
2    Peter Jones      Sarah Jones
3    Peter Jones      Sarah Jones
4    Stephen Davies   Peter Jones    Sarah Jones
5    John Doe
I have looked at PIVOT functions, but don't think it matches my needs.
Any help is greatly appreciated.

PIVOT was the right clue; it only needs a little 'extra' :)
DECLARE @tt TABLE (ID INT, Supervisor VARCHAR(128));
INSERT INTO @tt (ID, Supervisor)
VALUES
    (1,'John Doe'),
    (2,'Peter Jones'),
    (2,'Sarah Jones'),
    (3,'Peter Jones'),
    (3,'Sarah Jones'),
    (4,'Stephen Davies'),
    (4,'Peter Jones'),
    (4,'Sarah Jones'),
    (5,'John Doe');
SELECT *
FROM
(
    SELECT
        ID,
        -- the 'extra': label each student's supervisors 'Supervisor 1', 'Supervisor 2', ...
        'Supervisor ' + CAST(ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Supervisor) AS VARCHAR(128)) AS supervisor_id,
        Supervisor
    FROM @tt
) AS tt
PIVOT (
    MAX(Supervisor) FOR
    supervisor_id IN ([Supervisor 1], [Supervisor 2], [Supervisor 3])
) AS piv;
Result:
ID   Supervisor 1   Supervisor 2   Supervisor 3
1    John Doe       NULL           NULL
2    Peter Jones    Sarah Jones    NULL
3    Peter Jones    Sarah Jones    NULL
4    Peter Jones    Sarah Jones    Stephen Davies
5    John Doe       NULL           NULL
You will notice that the assignment to Supervisor X is done by ordering on the Supervisor value itself (a VARCHAR), which is why Stephen Davies ends up as Supervisor 3 for ID 4. If you want a different order, you might want to include an [Ordering] column and change the window to ROW_NUMBER() OVER(PARTITION BY ID ORDER BY [Ordering]). E.g., an [Ordering] column could be an INT IDENTITY(1,1). I'll leave that as an exercise for you if that's what's really needed.
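For what it's worth, the ordering-column idea can be sketched as follows with Python's built-in sqlite3 module. Two things here are assumptions rather than part of the answer: SQLite has no PIVOT, so conditional aggregation (MAX over CASE) is swapped in, and the ordering column is a hypothetical AUTOINCREMENT key that plays the role of INT IDENTITY(1,1).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- ordering is a hypothetical column that records insertion order
    CREATE TABLE tt (ordering INTEGER PRIMARY KEY AUTOINCREMENT, ID INTEGER, Supervisor TEXT);
    INSERT INTO tt (ID, Supervisor) VALUES
        (1, 'John Doe'),
        (2, 'Peter Jones'), (2, 'Sarah Jones'),
        (4, 'Stephen Davies'), (4, 'Peter Jones'), (4, 'Sarah Jones');
""")

rows = conn.execute("""
    SELECT ID,
           MAX(CASE WHEN rn = 1 THEN Supervisor END) AS supervisor_1,
           MAX(CASE WHEN rn = 2 THEN Supervisor END) AS supervisor_2,
           MAX(CASE WHEN rn = 3 THEN Supervisor END) AS supervisor_3
    FROM (SELECT ID, Supervisor,
                 -- number supervisors by insertion order instead of alphabetically
                 ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ordering) AS rn
          FROM tt) numbered
    GROUP BY ID
    ORDER BY ID
""").fetchall()

for row in rows:
    print(row)
```

With insertion order driving the numbering, Stephen Davies comes out as the first supervisor for ID 4, matching the layout asked for in the question.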

Alias scoping in SQL

I'm having an issue with a complex query on an SQLite3 database that I think has to do with a misunderstanding on my part of how to refer to columns in a results table returned by a select statement, especially when aliases are involved.
Here is an example table - a list of movie IDs with a row for each actor working on the movie:
CREATE TABLE movie_actor (imdb_id TEXT, actor TEXT);
INSERT INTO movie_actor VALUES('44r4', 'John Doe');
INSERT INTO movie_actor VALUES('44r4', 'Jane Doe');
INSERT INTO movie_actor VALUES('44r4', 'Jermaine Doe');
INSERT INTO movie_actor VALUES('44r4', 'Jacob Doe');
INSERT INTO movie_actor VALUES('55r5', 'John Doe');
INSERT INTO movie_actor VALUES('55r5', 'Jane Doe');
INSERT INTO movie_actor VALUES('55r5', 'Nathan Deer');
INSERT INTO movie_actor VALUES('66r6', 'Bob Duck');
INSERT INTO movie_actor VALUES('66r6', 'John Doe');
INSERT INTO movie_actor VALUES('66r6', 'Jermaine Doe');
INSERT INTO movie_actor VALUES('66r6', 'Jane Doe');
INSERT INTO movie_actor VALUES('77r7', 'John Doe');
I am trying to find out how many times each pair of actors worked with each other across all movies. I decided to go about this with a self-join, but ran into issues where I would get record pairs such as "John Doe, Jane Doe, 3" and "Jane Doe, John Doe, 3". These are really the same thing, and I want to count only the first version. This is the code that resulted:
SELECT DISTINCT
CASE WHEN d.actor_1 > d.actor_2 THEN d.actor_1 ELSE d.actor_2 END d.actor_1,
CASE WHEN d.actor_2 > d.actor_1 THEN d.actor_2 ELSE d.actor_1 END d.actor_2,
d.v
FROM (
SELECT c.actor_1 AS actor_1, c.actor_2 AS actor_2, COUNT(*) AS v
FROM (
SELECT a.actor AS actor_1, b.actor AS actor_2
FROM movie_actor a JOIN movie_actor b ON a.imdb_id=b.imdb_id
) AS c
WHERE c.actor_1 <> c.actor_2
GROUP BY c.actor_1, c.actor_2
HAVING COUNT(*) > 2
ORDER BY COUNT(*) DESC
LIMIT 20
)
AS d
This doesn't run, but I can't figure out why. My assumption is that I am not using aliases properly, but I really don't know. Any ideas?

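One reason your query does not run is that a column alias cannot be qualified: END d.actor_1 would have to be END AS actor_1. We get a simpler query if we add the condition a.actor < b.actor. This excludes pairs with equal actors and, at the same time, removes the need to swap actors.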
SELECT
a.actor AS actor_1, b.actor AS actor_2, COUNT(*) AS v
FROM
movie_actor a
INNER JOIN movie_actor b
ON a.imdb_id = b.imdb_id
WHERE
a.actor < b.actor
GROUP BY a.actor, b.actor
ORDER BY COUNT(*) DESC, a.actor, b.actor
LIMIT 20
Note: A join conceptually builds every combination of rows that satisfies the join condition. Therefore, for movie 55r5 (which has 3 actors), it first generates the following 3 x 3 = 9 pairs:
John Doe John Doe
John Doe Jane Doe
John Doe Nathan Deer
Jane Doe John Doe
Jane Doe Jane Doe
Jane Doe Nathan Deer
Nathan Deer John Doe
Nathan Deer Jane Doe
Nathan Deer Nathan Deer
Then the WHERE clause excludes all pairs where a.actor >= b.actor, and we get:
John Doe Nathan Deer
Jane Doe John Doe
Jane Doe Nathan Deer
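The query can be checked against the sample data from the question. This is a small sketch using Python's built-in sqlite3 module (the question is about SQLite3 anyway); the only change is LIMIT 3 instead of LIMIT 20, to keep the output short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie_actor (imdb_id TEXT, actor TEXT)")
conn.executemany("INSERT INTO movie_actor VALUES (?, ?)", [
    ("44r4", "John Doe"), ("44r4", "Jane Doe"), ("44r4", "Jermaine Doe"), ("44r4", "Jacob Doe"),
    ("55r5", "John Doe"), ("55r5", "Jane Doe"), ("55r5", "Nathan Deer"),
    ("66r6", "Bob Duck"), ("66r6", "John Doe"), ("66r6", "Jermaine Doe"), ("66r6", "Jane Doe"),
    ("77r7", "John Doe"),
])

pairs = conn.execute("""
    SELECT a.actor AS actor_1, b.actor AS actor_2, COUNT(*) AS v
    FROM movie_actor a
    INNER JOIN movie_actor b ON a.imdb_id = b.imdb_id
    WHERE a.actor < b.actor          -- keeps each unordered pair exactly once
    GROUP BY a.actor, b.actor
    ORDER BY COUNT(*) DESC, a.actor, b.actor
    LIMIT 3
""").fetchall()

for actor_1, actor_2, v in pairs:
    print(actor_1, "|", actor_2, "|", v)
```

Because of the a.actor < b.actor condition, the pair always appears with the alphabetically smaller name first, so "John Doe, Jane Doe" never shows up as a separate row.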

Generate the distinct pairs first, then count them.
select actor_1, actor_2, count(*)
from (select distinct a.imdb_id, a.actor as actor_1, b.actor as actor_2
from movie_actor a
inner join movie_actor b on a.imdb_id = b.imdb_id
where a.actor < b.actor) x
group by actor_1, actor_2
order by actor_1, actor_2;
actor_1 actor_2 count(*)
---------- ---------- ----------
Bob Duck Jane Doe 1
Bob Duck Jermaine D 1
Bob Duck John Doe 1
Jacob Doe Jane Doe 1
Jacob Doe Jermaine D 1
Jacob Doe John Doe 1
Jane Doe Jermaine D 2
Jane Doe John Doe 3
Jane Doe Nathan Dee 1
Jermaine D John Doe 2
John Doe Nathan Dee 1
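One case where de-duplicating first makes a difference is when the table could contain the same (imdb_id, actor) row more than once. That is an assumption for illustration; the sample data in the question has no such rows. A minimal sketch with Python's built-in sqlite3 module and one made-up duplicate row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie_actor (imdb_id TEXT, actor TEXT)")
conn.executemany("INSERT INTO movie_actor VALUES (?, ?)", [
    ("44r4", "John Doe"), ("44r4", "Jane Doe"),
    ("55r5", "John Doe"), ("55r5", "Jane Doe"),
    ("55r5", "John Doe"),  # hypothetical accidental duplicate row
])

# Inner query that builds the pairs, with or without DISTINCT.
pair_source = """
    SELECT {distinct} a.imdb_id, a.actor AS actor_1, b.actor AS actor_2
    FROM movie_actor a
    INNER JOIN movie_actor b ON a.imdb_id = b.imdb_id
    WHERE a.actor < b.actor
"""
count_sql = "SELECT actor_1, actor_2, COUNT(*) FROM ({src}) x GROUP BY actor_1, actor_2"

naive = conn.execute(count_sql.format(src=pair_source.format(distinct=""))).fetchall()
deduped = conn.execute(count_sql.format(src=pair_source.format(distinct="DISTINCT"))).fetchall()

print(naive)    # the duplicate row inflates the count
print(deduped)  # one count per movie the pair shares
```

Here the duplicate row makes the plain count report three shared movies for Jane Doe and John Doe, while the DISTINCT version reports the two they actually share.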
