Distinct column with primary key column - sql

Distinct column count differs when adding the primary key column in the Select query
The count distinct for supplier_payment_terms is 110, but when adding the PK column, the count changes to thousands.
select distinct supplier, unique_id from indirect_spend;
I expect the same record count of 110 when including the PK column in the select. The Select must only include the unique_id of the supplier.

"I expect the same record count of 110 when including the PK column in the select"
Then you expect wrong. SELECT DISTINCT makes all rows in the result distinct, i.e. there are no duplicate rows in the result. Because unique_id is different on every row, every (supplier, unique_id) row is already distinct and nothing is removed.
Besides, imagine two rows (supplier, unique_id): (1, 2) and (1, 5). You say you expect only one row in the result. How is the system supposed to determine which of the two rows to deliver?

You can use aggregation to get example primary keys:
select supplier, min(unique_id), max(unique_id)
from indirect_spend
group by supplier;
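To see the difference on a small scale, here is a minimal sketch using Python's built-in sqlite3 module. The three sample rows are made up for illustration and are not the asker's data:

```python
import sqlite3

# Hypothetical sample data: two suppliers, three rows, unique_id is the PK.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE indirect_spend (unique_id INTEGER PRIMARY KEY, supplier TEXT)"
)
conn.executemany(
    "INSERT INTO indirect_spend (unique_id, supplier) VALUES (?, ?)",
    [(2, "acme"), (5, "acme"), (7, "globex")],
)

# DISTINCT applies to the whole row, so adding the PK makes every row distinct.
distinct_suppliers = conn.execute(
    "SELECT DISTINCT supplier FROM indirect_spend"
).fetchall()
distinct_pairs = conn.execute(
    "SELECT DISTINCT supplier, unique_id FROM indirect_spend"
).fetchall()

# GROUP BY returns one row per supplier, with example keys picked by MIN/MAX.
one_row_per_supplier = conn.execute(
    "SELECT supplier, MIN(unique_id), MAX(unique_id) "
    "FROM indirect_spend GROUP BY supplier ORDER BY supplier"
).fetchall()

print(len(distinct_suppliers), len(distinct_pairs))  # 2 3
print(one_row_per_supplier)  # [('acme', 2, 5), ('globex', 7, 7)]
```

The row count only matches the number of distinct suppliers once the key column is aggregated instead of being added to the DISTINCT list.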

Related

How to break ties when comparing columns in SQL

I am trying to delete duplicates in Postgres. I am using this as the base of my query:
DELETE FROM case_file as p
WHERE EXISTS (
SELECT FROM case_file as p1
WHERE p1.serial_no = p.serial_no
AND p1.cfh_status_dt < p.cfh_status_dt
);
It works well, except that when the cfh_status_dt dates are equal, neither of the records is removed.
For rows that have the same serial_no and the same date, I would like to keep the one that has a registration_no (if any does; this column also contains NULLs).
Is there a way I can do this all in one query, possibly with a CASE expression or another simple comparison?
DELETE FROM case_file AS p
WHERE id NOT IN (
SELECT DISTINCT ON (serial_no) id -- id = PK
FROM case_file
ORDER BY serial_no, cfh_status_dt DESC, registration_no
);
This keeps the (one) latest row per serial_no, choosing the smallest registration_no if there are multiple candidates.
NULL sorts last in default ascending order. So any row with a not-null registration_no is preferred.
If you want the greatest registration_no instead, to still sort NULL values last, use:
...
ORDER BY serial_no, cfh_status_dt DESC, registration_no DESC NULLS LAST
See:
Select first row in each GROUP BY group?
Sort by column ASC, but NULL values first?
If you have no PK (PRIMARY KEY) or other UNIQUE NOT NULL (combination of) column(s) you can use for this purpose, you can fall back to ctid. See:
How do I (or can I) SELECT DISTINCT on multiple columns?
NOT IN is typically not the most efficient way. But this deals with duplicates involving NULL values. See:
How to delete duplicate rows without unique identifier
If there are many duplicates - and you can afford to do so! - it can be (much) more efficient to create a new, pristine table of survivors and replace the old table, instead of deleting the majority of rows in the existing table.
Or create a temporary table of survivors, truncate the old and insert from the temp table. This way depending objects like views or FK constraints can stay in place. See:
How to delete duplicate entries?
Surviving rows are simply:
SELECT DISTINCT ON (serial_no) *
FROM case_file
ORDER BY serial_no, cfh_status_dt DESC, registration_no;
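DISTINCT ON is Postgres-specific. As a rough, runnable illustration of the same tie-breaking idea, the sketch below uses Python's sqlite3 module and swaps in ROW_NUMBER() (needs SQLite 3.25 or later) in place of DISTINCT ON. The four sample rows are assumptions made up for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE case_file ("
    " id INTEGER PRIMARY KEY, serial_no INTEGER,"
    " cfh_status_dt TEXT, registration_no INTEGER)"
)
conn.executemany(
    "INSERT INTO case_file VALUES (?, ?, ?, ?)",
    [
        (1, 100, "2020-01-01", None),  # older row, loses on date
        (2, 100, "2020-02-01", None),  # tie on date, no registration_no
        (3, 100, "2020-02-01", 555),   # tie on date, has registration_no: kept
        (4, 200, "2020-03-01", None),  # only row for its serial_no: kept
    ],
)

# Rank rows per serial_no: latest date first, then rows with a registration_no,
# then the smallest registration_no. Rank 1 is the survivor.
# "registration_no IS NULL" pushes NULLs last explicitly, because SQLite
# (unlike Postgres) sorts NULLs first in ascending order.
conn.execute(
    """
    DELETE FROM case_file
    WHERE id NOT IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY serial_no
                       ORDER BY cfh_status_dt DESC,
                                registration_no IS NULL,
                                registration_no
                   ) AS rn
            FROM case_file
        ) AS ranked
        WHERE rn = 1
    )
    """
)

survivors = conn.execute("SELECT id FROM case_file ORDER BY id").fetchall()
print(survivors)  # [(3,), (4,)]
```

The explicit NULL handling is the part worth double-checking when porting between databases, since default NULL ordering differs between engines.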

Select all rows of a SQL table which do not share a name

I have a table, call it widgets which has columns name and created_at, among others. I want to run a query that returns the count of all the rows of widgets which share the same name and have been created within a millisecond of each other.
This is the query that I have come up with, but it returns a number greater than the total number of rows in the table. Can someone point out where I am going wrong?
SELECT COUNT(DISTINCT "t1"."id")
FROM "tasks" "t1", "tasks" "t2"
WHERE "t1"."name" = "t2"."name"
  AND date_trunc('milliseconds', "t1"."created_at") = date_trunc('milliseconds', "t2"."created_at")
You should add the condition:
and "t1"."id" <> "t2"."id"
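where "id" is the primary key. Without this condition each row also matches itself in the self-join, so every row gets counted. If there is no primary key, you can use the Postgres system column ctid instead: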
and "t1".ctid <> "t2".ctid
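A minimal sketch of the effect of that extra condition, using Python's sqlite3 module. SQLite has no date_trunc, so the made-up created_at values are assumed to be stored already truncated to milliseconds and are compared directly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (id INTEGER PRIMARY KEY, name TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?)",
    [
        (1, "a", "2020-01-01 00:00:00.001"),
        (2, "a", "2020-01-01 00:00:00.001"),  # same name and millisecond as id 1
        (3, "a", "2020-01-01 00:00:05.000"),  # same name, different time
        (4, "b", "2020-01-01 00:00:00.001"),  # different name
    ],
)

base_query = (
    "SELECT COUNT(DISTINCT t1.id) FROM tasks t1, tasks t2 "
    "WHERE t1.name = t2.name AND t1.created_at = t2.created_at"
)

# Without the extra condition every row matches itself, so all rows are counted.
(count_with_self_matches,) = conn.execute(base_query).fetchone()

# Excluding self-matches leaves only rows that have a real duplicate.
(count_true_duplicates,) = conn.execute(
    base_query + " AND t1.id <> t2.id"
).fetchone()

print(count_with_self_matches, count_true_duplicates)  # 4 2
```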

How to get one row from duplicated rows?

I have 2 tables SVC_ServiceTicket and SVC_CustomersVehicle
The table ServiceTicket has a column customerID which is a foreign key to CustomersVehicle, so the customerID column in ServiceTicket can contain duplicate values.
When I do
select sst.ServiceTicketID,sst.CustomerID
from ServiceTicket sst,CustomersVehicle scv
where sst.CustomerID=scv.CV_ID
this gives me duplicate customerID values. My requirement is: if a customerID occurs more than once, I want only its latest service ticket, together with that customerID.
For example, customerID 13 is repeated, so in this case I want the latest service ticket along with the customerID: the values I want are 8008 and 13.
Please tell me how to do this.
Use the aggregate function MAX. I would also recommend using an explicit JOIN.
SELECT MAX(sst.ServiceTicketID) AS ServiceTicketID,sst.CustomerID
FROM ServiceTicket sst
JOIN CustomersVehicle scv ON sst.CustomerID = scv.CV_ID
GROUP BY sst.CustomerID
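Here is a small runnable sketch of the MAX / GROUP BY approach using Python's sqlite3 module. It joins on CustomerID as in the question's own query and assumes that the highest ServiceTicketID is the latest ticket. Apart from ticket 8008 for customer 13, the sample rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CustomersVehicle (CV_ID INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE ServiceTicket ("
    " ServiceTicketID INTEGER PRIMARY KEY,"
    " CustomerID INTEGER REFERENCES CustomersVehicle (CV_ID))"
)
conn.executemany("INSERT INTO CustomersVehicle VALUES (?)", [(12,), (13,)])
conn.executemany(
    "INSERT INTO ServiceTicket VALUES (?, ?)",
    [(8001, 12), (8005, 13), (8008, 13)],  # customer 13 has two tickets
)

# One row per customer; MAX picks the highest (assumed latest) ticket id.
latest_tickets = conn.execute(
    """
    SELECT MAX(sst.ServiceTicketID) AS ServiceTicketID, sst.CustomerID
    FROM ServiceTicket sst
    JOIN CustomersVehicle scv ON sst.CustomerID = scv.CV_ID
    GROUP BY sst.CustomerID
    ORDER BY sst.CustomerID
    """
).fetchall()
print(latest_tickets)  # [(8001, 12), (8008, 13)]
```

If "latest" should instead be decided by a date column, that column would have to drive the choice rather than MAX on the ticket id.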

Logically determine a composite key in SQL

I'm working with an MSSQL table that does not have a primary or unique key constraint defined. There are two fields, let's call them xId and yId, that I believe together would be a composite key, but I want to confirm this by examining the data.
I'm thinking that I should be able to write a SQL count statement that I can compare to the total number of records in the table, which would determine whether the combination of xId and yId (or a third column, if necessary) could in fact act as a composite key. However, I'm having trouble coming up with the right GROUP BY or other type of clause that would confirm or disprove this.
Any ideas?
Use group by and having:
select xid,yid
from table
group by xid,yid
having count(1) > 1
This will show any pairs that are non-unique, so if no rows are returned, it's a good key.
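A quick way to try this out is the following sketch with Python's sqlite3 module. The table name t and the sample rows are hypothetical stand-ins for the real table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (xid INTEGER, yid INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 1, "a"), (1, 2, "b"), (2, 1, "c"), (1, 2, "d")],  # (1, 2) appears twice
)

# Every row returned here is a (xid, yid) pair that occurs more than once.
violations = conn.execute(
    "SELECT xid, yid, COUNT(*) FROM t GROUP BY xid, yid HAVING COUNT(*) > 1"
).fetchall()
print(violations)  # [(1, 2, 2)]
```

An empty result would mean that (xid, yid) is unique in the current data, which says nothing about rows inserted later.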
Just do a count of the total rows of the table, and then do
select count(1)
from(
select xid,yid
from table
group by xid,yid
)a;
If all pairs of xid and yid form a unique identifier, then the two numbers will be the same.
Alternatively, you could count the number of rows for each distinct pair of xid and yid and take the largest of those counts:
select max(num_rows)
from(
select xid,yid,count(1) as num_rows
from table
group by xid,yid
)a;
The result of this query is 1 if and only if (xid,yid) pairs form a unique identifier for your table.
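The count comparison can be wrapped in a small helper, which also makes it easy to test whether adding a third column helps. This is only a sketch with Python's sqlite3 module; is_candidate_key, the table t and the column zid are hypothetical names:

```python
import sqlite3


def is_candidate_key(conn, table, columns):
    """Return True if no combination of `columns` repeats in `table`.

    Identifiers are interpolated into the SQL text, so pass trusted names only.
    """
    cols = ", ".join(columns)
    (total_rows,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    (distinct_combos,) = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT {cols} FROM {table} GROUP BY {cols}) AS a"
    ).fetchone()
    # The columns identify rows uniquely only if both counts agree.
    return total_rows == distinct_combos


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (xid INTEGER, yid INTEGER, zid INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 1, 1), (1, 2, 1), (1, 2, 2)],  # (1, 2) repeats, (1, 2, zid) does not
)

print(is_candidate_key(conn, "t", ["xid", "yid"]))         # False
print(is_candidate_key(conn, "t", ["xid", "yid", "zid"]))  # True
```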
This will list all the problem combinations (if any) of xid, yid, along with how often each occurs:
SELECT
COUNT(*),xid,yid
FROM YourTable
GROUP BY xid,yid
HAVING COUNT(*)>1

How to mark duplicates in an SQL query

I have an SQL query which looks at date-of-birth, last name and a soundex of first name to identify duplicates. The following query finds some 8,000 rows (which I assume means there are around 8,000 duplicate records).
select dob,last_name,soundex(first_name),count(*)
from clients
group by dob,last_name,soundex(first_name)
having count(*) >1
Almost all of the results have a count of 2, a few have a count of 3 where obviously the record existed twice in one of the two databases which were merged.
The next step I need to take is to mark one of the rows (it doesn't really matter which) with a duplicate flag, and to mark each row with the opposite row's key. Is there a way of doing this using SQL?
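This should do what you are after, the UPDATE in one go (MySQL multi-table UPDATE syntax).
UPDATE clients c
INNER JOIN
(
    -- one row per duplicate group; keep = smallest id in the group
    SELECT dob, last_name, soundex(first_name) AS first_name_sdx, MIN(id) AS keep
    FROM clients
    GROUP BY dob, last_name, soundex(first_name)
    HAVING count(*) > 1
) k
ON c.dob = k.dob AND c.last_name = k.last_name AND soundex(c.first_name) = k.first_name_sdx
-- duplicateid stays NULL on the kept row, otherwise it holds the kept row's id
SET duplicateid = NULLIF(k.keep, c.id),
    hasduplicate = (k.keep = c.id)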
It assumes you have 3 columns not stated in the question:
id: primary key
duplicateid: points to the duplicate being kept (NULL on the kept row itself)
hasduplicate: boolean, marks the one to keep
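For experimenting with the idea without a full database server, here is a rough sketch using Python's sqlite3 module. It uses a correlated subquery rather than an UPDATE with a JOIN, and it only fills duplicateid. SOUNDEX is often not compiled into SQLite, so name_key below is a made-up stand-in (upper-cased first letter) and not a real soundex; the sample rows are invented too:

```python
import sqlite3


def name_key(first_name):
    # Stand-in for SOUNDEX, for the demo only: the upper-cased first letter.
    return first_name[:1].upper()


conn = sqlite3.connect(":memory:")
conn.create_function("name_key", 1, name_key)
conn.execute(
    "CREATE TABLE clients ("
    " id INTEGER PRIMARY KEY, dob TEXT, last_name TEXT, first_name TEXT,"
    " duplicateid INTEGER)"
)
conn.executemany(
    "INSERT INTO clients (id, dob, last_name, first_name) VALUES (?, ?, ?, ?)",
    [
        (1, "1980-01-01", "Smith", "Jon"),
        (2, "1980-01-01", "Smith", "John"),  # duplicate of id 1
        (3, "1975-05-05", "Jones", "Mary"),  # no duplicate
    ],
)

# For every row, look up the smallest id among earlier rows in the same group.
# The kept row (smallest id in its group) has no earlier row and stays NULL.
conn.execute(
    """
    UPDATE clients
    SET duplicateid = (
        SELECT MIN(c2.id)
        FROM clients c2
        WHERE c2.dob = clients.dob
          AND c2.last_name = clients.last_name
          AND name_key(c2.first_name) = name_key(clients.first_name)
          AND c2.id < clients.id
    )
    """
)

marked = conn.execute(
    "SELECT id, duplicateid FROM clients ORDER BY id"
).fetchall()
print(marked)  # [(1, None), (2, 1), (3, None)]
```

Rows with a non-NULL duplicateid are the ones flagged as duplicates, and the value points at the row that is kept.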
Well, you could use SELECT DISTINCT, and then mark a single row as "not duplicate" -- then search for rows that are "not duplicate" to find the duplicate.
Here is a query that will give you not only the duplicates, but also the first id inserted (assuming Id is the sequential primary-key column) and the newest id.
Off the top of my head:
select dob, last_name, soundex(first_name) firstnamesoundex, min (Id) OldestId, max (Id) NewestId, Count (*) NumRows
from clients
group by dob,last_name,soundex(first_name)
having count(*) >1
You can use this in a JOIN to do your update:
UPDATE Clients
SET OppositeRowId = DuplicateRows.NewestId
FROM
(
select dob, last_name, soundex(first_name) firstnamesoundex, min (Id) OldestId, max (Id) NewestId, Count (*) NumRows
from clients
group by dob,last_name,soundex(first_name)
having count(*) >1
) DuplicateRows
WHERE
DuplicateRows.OldestId = Clients.Id
All of this assumes that each group contains exactly one duplicate (a count of 2). If a group has more than two rows, you are going to have to try something different.