How do I create a SQL Distinct query and add some additional fields - sql

I have the following query that selects combinations of first and last names and show me dupes. It works, not problems here.
I want to include three other fields for reference; Id, cUser, and cDate. These additional fields, however, should not be used to determine duplicates as I'd likely not end up with any duplicates.
SELECT * FROM
(SELECT FirstName, LastName, COUNT(*) as "Count"
FROM Contacts
WHERE ContactTypeID = 1
GROUP BY LastName,FirstName
) AS X
WHERE COUNT > 1
ORDER BY COUNT DESC
Any suggestions? Thanks!

SELECT *
FROM (
SELECT *, COUNT(*) OVER (PARTITION BY FirstName, LastName) AS cnt
FROM Contacts
WHERE ContactTypeId = 1
) q
WHERE cnt > 1
ORDER BY
cnt DESC
This will return all fields for each of the duplicated records.

If these fields are always the same then you can include them in GROUP BY and it will not affect the detection of duplicates
If they are not then you must decide what kind of aggregate function you will apply on them, for example MAX() or MIN() would work and would give you some indication of which values are associated with some of the attributes for the duplicates.
Otherwise, if you want to see all of the records you can join back to the source
SELECT X2.* FROM
(SELECT FirstName, LastName, COUNT(*) as "Count"
FROM Contacts
WHERE ContactTypeID = 1
GROUP BY LastName,FirstName
) AS X INNER JOIN Contact X2 ON X.LastName = X2.LastName AND X.FirstName = X2.FirstName
WHERE COUNT > 1
ORDER BY COUNT DESC

Related

Postgresql query: update status of limit number of records based on group size

I have a postgresql table contains a list of email addresses. The table has three columns, Email, EmailServer (e.g., gmail.com, outlook.com, msn.com, and yahoo.com.ca etc.), and Valid (boolean).
Now, I want to group those emails by EmailServer and then update the first 3 records of each large group (count >=6) as Valid = true while leaving the rest of each group as Valid = false.
I failed to get the wanted output by below query:
UPDATE public."EmailContacts"
SET "Valid"=true
WHERE "EmailServer" IN (
SELECT "EmailServer"
FROM public."EmailContacts"
GROUP by "EmailServer"
HAVING count(*) >=6
LIMIT 5)
Please help to modify so as to get the expected results. Would be greatly appreciated for any kind of your help!
WITH major_servers AS (
SELECT email_server
FROM email_address
GROUP by email_server
HAVING count(*) >=6
),
enumerated_emails AS (
SELECT email,
email_server,
row_number() OVER (PARTITION BY email_server ORDER BY email) AS row_number --TODO:: ORDER BY email - attention
FROM email_address
WHERE email_server IN (SELECT email_server FROM major_servers)
)
UPDATE email_address
SET valid = true
WHERE email IN (SELECT email
FROM enumerated_emails ee
WHERE ee.row_number <= 3);
The first query major_servers finds major groups where more than 5 email servers exist.
The second query enumerated_emails enumerates emails by their natural order (see a TODO comment, I think you should choose another ORDER BY criteria) which belong to major groups using window function row_number().
The last query updates the first 3 rows in each major server group.
Find the sql-fiddle here.
You need to get the servers, then order the mails from which one and then perform the update. Something like this:
WITH DataSourceServers AS
(
SELECT "EmailServer"
FROM public."EmailContacts"
GROUP by "EmailServer"
HAVING count(*) >=6
),DataSourceEmails AS
(
SELECT "Email", row_number() OVER (PARTITION BY "EmailServer" ORDER BY "Email") AS rn
FROM public."EmailContacts"
WHERE "EmailServer" IN (SELECT "EmailServer" FROM DataSourceServers)
)
UPDATE public."EmailContacts"
SET "Valid" = true
FROM public."EmailContacts" E
INNER JOIN DataSourceEmails SE
WHERE E."EmailServer" = SE."EmailServer"
AND E."Email" = SE."Email"
AND SE.rn <= 3;

SQL loop through an array and run a query multiple times

I have an array with some status ids
[1,2,3,4,5]
I want to run this query
SELECT firstname, lastname FROM calls
WHERE status_id = "STATUS ID"
ORDER BY RANDOM() LIMIT 50
What would the syntax be to do a loop though the status_id array and run the query each time?
I want to end up with 250 rows with each group of 50 having a one of the status_id's.
This won't be terribly efficient, but it's the best I can think of:
select firstname, lastname
FROM (
SELECT firstname,
lastname,
row_number() over (partition by status_id order by random()) as rn
FROM calls
WHERE status_id = ANY (ARRAY [1,2,3,4,5])
) t
where rn <= 50;
The inner select will retrieve all rows with the desired values for status_id and will give each row a random row number for each value in status_id. The outer select will then only select 50 rows for each status (unless there are less than 50 for that specific status of course)
Use In operator:
SELECT firstname, lastname FROM calls WHERE status_id IN [1,2,3,4,5]

Can we use join with in same table while using group by function?

For instance, I have a table with columns below:
pk_id,address,first_name,last_name
and I have a query like this to display the first name ans last name that are repetitive(duplicates)
select first_name,last_name
from table
group by first_name,last_name
having count(*)>1;
but the above query just returns first and last names but I want to display pk_id and address too that are tied to these duplicate first and last names
Can we use joins to do this on the same table.Please help!!
A simple way of doing is to build a view with the pk_id and the count of duplicates. Once you have it, it is only a matter of using a JOIN on the base table, and a filter to only keep rows having a duplicate:
SELECT T.*
FROM T
JOIN (SELECT "pk_id",
COUNT(*) OVER(PARTITION BY "first_name", "last_name") cnt
FROM T) V
ON T."pk_id" = V."pk_id"
WHERE cnt > 1
See http://sqlfiddle.com/#!4/3ecd0/9
You have to call it from an outer query, like this:
select * from table
where first_name||last_name in
(select first_name||last_name from
(select first_name, last_name, count( * )
from table
group by first_name,last_name
having count( * ) > 1
)
)
note: you may not need to concatenate the 2 fields, but I haven't tested thaT.
with
my_duplicates as
(
select
first_name,
last_name
from
my_table
group by
first_name,
last_name
having
count(*) > 1
)
select
bb.pk_id,
bb.address,
bb.first_name,
bb.last_name
from
my_duplicates aa
join my_table bb on
(
aa.first_name = bb.first_name
and
aa.last_name = bb.last_name
)
order by
bb.last_name,
bb.first_name,
bb.pk_id

Select a NON-DISTINCT column in a query that return distincts rows

The following query returns the results that I need but I have to add the ID of the row to then update it. If I add the ID directly in the select statement it will return me more results then I need because each ID is unique so the DISTINCT statement see the line as unique.
SELECT DISTINCT ucpse.MemberID, ucpse.ProductID, ucpse.UserID
FROM UserCustomerProductSalaryExceptions as ucpse
WHERE EXISTS (SELECT NULL
FROM UserCustomerProductSalaryExceptions as upcse2
WHERE ucpse.userid = upcse2.userid AND ucpse.MemberID = upcse2.MemberID AND ucpse.ProductID = upcse2.ProductID
GROUP BY upcse2.UserID, upcse2.memberid, upcse2.productid
HAVING COUNT(UserID) >= 2
)
So basically I need to add ucpse.ID in the Select statement while keeping DISTINCT values for MemberID,ProductID and UserID.
Any Ideas ?
Thank you
According to you comment:
If the data has been duplicated 67 times for a given employee with a given product and a given client, I need to keep only one of thoses records. It's not important which one, so this is why I use DISTINC to obtain unique combinaison of given employee with a given product and a given client.
You can use MIN() or MAX() and GROUP BY instead of DISTINCT
SELECT MAX(ucpse.ID) AS ID, ucpse.MemberID, ucpse.ProductID, ucpse.UserID
FROM UserCustomerProductSalaryExceptions as ucpse
WHERE EXISTS (SELECT NULL
FROM UserCustomerProductSalaryExceptions as upcse2
WHERE ucpse.userid = upcse2.userid AND ucpse.MemberID = upcse2.MemberID AND ucpse.ProductID = upcse2.ProductID
GROUP BY upcse2.UserID, upcse2.memberid, upcse2.productid
HAVING COUNT(UserID) >= 2
)
GROUP BY ucpse.MemberID, ucpse.ProductID, ucpse.UserID
UPDATE:
From you comments I think the below query is what you need
DELETE FROM UserCustomerProductSalaryExceptions
WHERE ID NOT IN ( SELECT MAX(ucpse.ID) AS ID
FROM #UserCustomerProductSalaryExceptions
GROUP BY ucpse.MemberID, ucpse.ProductID, ucpse.UserID
HAVING COUNT(ucpse.ID) >= 2
)
If all you want is to delete the duplicates, this will do it:
WITH X AS
(SELECT ID,
ROW_NUMBER() OVER (PARTITION BY MemberID, ProductID, UserID ORDER BY ID) AS DupRowNum<br
FROM UserCustomerProductSalaryExceptions
)
DELETE X WHERE DupRowNum > 1
ID's not necessary - try:
UPDATE uu SET
<your settings here>
FROM UserCustomerProductSalaryExceptions uu
JOIN ( <paste your entire query above here>
) uc ON uc.MemberID=uu.MemberId AND uc.ProductID=uu.ProductId AND uc.UserID=uu.UserId
From the sound of your data structure (which I would STRONGLY advise normalizing as soon as possible), it sounds like you should be updating all the records. It sounds as if each duplicate is important because it contains some information about an employee's relation to a customer or product.
I would probably update all the records. Try this:
UPDATE UCPSE
SET
--Do your updates here
FROM UserCustomerProductSalaryExceptions as ucpse
JOIN
(
SELECT UserID, MemberID, ProductID
FROM UserCustomerProductSalaryExceptions
GROUP BY UserID, MemberID, ProductID
HAVING COUNT(UserID) >= 2
) T
ON ucpse.UserID = T.UserID AND ucpse.MemberID = T.MemberID AND ucpse.ProductID = T.ProductID

How do I see if there are multiple rows with an identical value in particular column?

I'm looking for an efficient way to exclude rows from my SELECT statement WHERE more than one row is returned with an identical value for a certain column.
Specifically, I am selecting a bunch of accounts, but need to exclude accounts where more than one is found with the same SSN associated.
this will return all SSNs with exactly 1 row
select ssn,count(*)
from SomeTable
group by ssn
having count(*) = 1
this will return all SSNs with more than 1 row
select ssn,count(*)
from SomeTable
group by ssn
having count(*) > 1
Your full query would be like this (will work on SQL Server 7 and up)
select a.* from account a
join(
select ssn
from SomeTable
group by ssn
having count(*) = 1) s on a.ssn = s.ssn
For SQL 2005 or above you can try this:
WITH qry AS
(
SELECT a.*,
COUNT(*) OVER(PARTITION BY ssn) dup_count
FROM accounts a
)
SELECT *
FROM qry
WHERE dup_count = 1
For SQL 2000 and 7:
SELECT a.*
FROM accounts a INNER JOIN
(
SELECT ssn
FROM accounts b
GROUP BY ssn
HAVING COUNT(1) = 1
) b ON a.ssn = b.ssn
SELECT *
FROM #Temp
WHERE SSN NOT IN (SELECT ssn FROM #Temp GROUP BY ssn HAVING COUNT(ssn) > 1)
Thank you all for your detailed suggestions. When it was all said and done, I needed to use a correlated subquery. Essentially, this is what I had to do:
SELECT acn, ssn, [date] FROM Account a
WHERE NOT EXISTS (SELECT 1 FROM Account WHERE ssn = a.ssn AND [date] < a.[date])
Hope this helps someone.
I never updated this... In my final submission, I achieved this through a left join to increase efficiency (the correlated subquery was not acceptable as it took a significant amount of time to run, checking each record against over 150K others).
Here is what had to be done to solve my problem:
SELECT acn, ssn
FROM Account a
LEFT JOIN (SELECT ssn, COUNT(1) AS counter FROM Account
GROUP BY ssn) AS counters
ON a.ssn = counters.ssn
WHERE counter IS NULL OR counter = 0