Logically determine a composite key in SQL - sql

I'm working with an MSSQL table that does not have a primary or unique key constraint defined. There are two fields, let's call them xId and yId, that I believe together would be a composite key, but I want to confirm this by examining the data.
I'm thinking that I should be able to write a SQL count statement that I can compare to the total number of records in the table, which would logically determine whether the combination of xId and yId (or a third column, if necessary) could in fact act as a composite key. However, I'm having trouble coming up with the right GROUP BY or other type of clause that would confirm or disprove this.
Any ideas?

Use group by and having:
select xid,yid
from table
group by xid,yid
having count(1) > 1
This will show any pairs that are non-unique, so if no rows are returned, it's a good key.

Just do a count of the total rows of the table, and then do
select count(1)
from(
select xid,yid
from table
group by xid,yid
)a;
If all pairs of xid and yid form a unique identifier, then the two numbers will be the same.
Alternatively, you could count the rows for each (xid, yid) pair and find the largest such count:
select max(num_rows)
from(
select xid,yid,count(1) as num_rows
from table
group by xid,yid
)a;
The result of this query is 1 if and only if (xid,yid) pairs form a unique identifier for your table.

This will list all the problem combinations (if any) of xid, yid:
SELECT
COUNT(*),xid,yid
FROM YourTable
GROUP BY xid,yid
HAVING COUNT(*)>1
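The checks from the answers above can be tried on a small in-memory table. This is a minimal sketch using Python's sqlite3 module rather than MSSQL; the table name t, the payload column, and the sample rows are made up for illustration.

```python
import sqlite3

# Hypothetical sample data: the pair (1, 2) appears twice,
# so (xid, yid) is NOT a usable composite key here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (xid INTEGER, yid INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 1, "a"), (1, 2, "b"), (2, 1, "c"), (1, 2, "d")],
)

# Total number of rows in the table.
total_rows = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

# Number of distinct (xid, yid) pairs.
distinct_pairs = conn.execute(
    "SELECT COUNT(*) FROM (SELECT xid, yid FROM t GROUP BY xid, yid) AS a"
).fetchone()[0]

# Pairs that occur more than once, with their row counts.
offenders = conn.execute(
    "SELECT xid, yid, COUNT(*) FROM t GROUP BY xid, yid HAVING COUNT(*) > 1"
).fetchall()

# The pair is a candidate key only if both checks agree.
is_candidate_key = total_rows == distinct_pairs and not offenders
print(total_rows, distinct_pairs, offenders, is_candidate_key)
```

GROUP BY puts NULLs into one group, and primary key columns must be non-null anyway, so a separate check for NULLs in xId and yId is worth running before adding the constraint.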

Related

How to break ties when comparing columns in SQL

I am trying to delete duplicates in Postgres. I am using this as the base of my query:
DELETE FROM case_file as p
WHERE EXISTS (
SELECT FROM case_file as p1
WHERE p1.serial_no = p.serial_no
AND p1.cfh_status_dt < p.cfh_status_dt
);
It works well, except that when the cfh_status_dt dates are equal, neither of the records is removed.
For rows that have the same serial_no and the same date, I would like to keep the one that has a registration_no (if any do; this column also has NULLs).
Is there a way I can do this in a single query, possibly with a CASE expression or another simple comparison?
DELETE FROM case_file AS p
WHERE id NOT IN (
SELECT DISTINCT ON (serial_no) id -- id = PK
FROM case_file
ORDER BY serial_no, cfh_status_dt DESC, registration_no
);
This keeps the (one) latest row per serial_no, choosing the smallest registration_no if there are multiple candidates.
NULL sorts last in default ascending order. So any row with a not-null registration_no is preferred.
If you want the greatest registration_no instead, to still sort NULL values last, use:
...
ORDER BY serial_no, cfh_status_dt DESC, registration_no DESC NULLS LAST
See:
Select first row in each GROUP BY group?
Sort by column ASC, but NULL values first?
If you have no PK (PRIMARY KEY) or other UNIQUE NOT NULL (combination of) column(s) you can use for this purpose, you can fall back to ctid. See:
How do I (or can I) SELECT DISTINCT on multiple columns?
NOT IN is typically not the most efficient way. But this deals with duplicates involving NULL values. See:
How to delete duplicate rows without unique identifier
If there are many duplicates - and you can afford to do so! - it can be (much) more efficient to create a new, pristine table of survivors and replace the old table, instead of deleting the majority of rows in the existing table.
Or create a temporary table of survivors, truncate the old and insert from the temp table. This way depending objects like views or FK constraints can stay in place. See:
How to delete duplicate entries?
Surviving rows are simply:
SELECT DISTINCT ON (serial_no) *
FROM case_file
ORDER BY serial_no, cfh_status_dt DESC, registration_no;
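DISTINCT ON is specific to Postgres. As a rough, portable illustration of the same tie-breaking order, here is a sketch that swaps in ROW_NUMBER() and runs it in SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or later). The sample rows are invented, and because SQLite sorts NULLs first in ascending order, the sketch sorts on registration_no IS NULL explicitly instead of relying on the Postgres default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE case_file ("
    " id INTEGER PRIMARY KEY, serial_no INTEGER,"
    " cfh_status_dt TEXT, registration_no INTEGER)"
)
# Hypothetical rows covering the tie case from the question.
conn.executemany(
    "INSERT INTO case_file VALUES (?, ?, ?, ?)",
    [
        (1, 100, "2020-01-01", None),  # older row: delete
        (2, 100, "2020-02-01", None),  # latest date, no registration_no: delete
        (3, 100, "2020-02-01", 555),   # latest date with registration_no: keep
        (4, 200, "2020-03-01", None),  # only row for its serial_no: keep
    ],
)

# Rank rows per serial_no: latest date first, then rows that have a
# registration_no, then the smallest registration_no. Keep rank 1 only.
conn.execute(
    """
    DELETE FROM case_file
    WHERE id NOT IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY serial_no
                       ORDER BY cfh_status_dt DESC,
                                registration_no IS NULL,
                                registration_no
                   ) AS rn
            FROM case_file
        ) AS ranked
        WHERE rn = 1
    )
    """
)

survivors = [row[0] for row in conn.execute("SELECT id FROM case_file ORDER BY id")]
print(survivors)
```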

Distinct column with primary key column

Distinct column count differs when adding the primary key column in the Select query
The count distinct for supplier_payment_terms is 110, but when adding the PK column, the count changes to thousands.
select distinct supplier, unique_id from indirect_spend;
I expect the same record count of 110 when including the PK column in the select. The Select must only include the unique_id of the supplier.
"I expect the same record count of 110 when including the PK column in the select"
Then you expect wrong. SELECT DISTINCT causes all rows appearing in the result to be distinct, i.e. no duplicate rows in the result.
Besides, imagine two rows (supplier-id, unique-id): (1, 2) and (1, 5). You say you expect only one row in the result. How is the system going to determine which of the two rows to deliver?
You can use aggregation to get example primary keys:
select supplier, min(unique_id), max(unique_id)
from indirect_spend
group by supplier;
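To see why the count jumps, here is a small sketch in Python's sqlite3 module. The sample rows are invented; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE indirect_spend (unique_id INTEGER PRIMARY KEY, supplier TEXT)"
)
# Hypothetical data: "acme" has three rows, "globex" has one.
conn.executemany(
    "INSERT INTO indirect_spend VALUES (?, ?)",
    [(1, "acme"), (2, "acme"), (3, "globex"), (5, "acme")],
)

# DISTINCT over supplier alone collapses repeated suppliers.
distinct_suppliers = conn.execute(
    "SELECT DISTINCT supplier FROM indirect_spend"
).fetchall()

# Adding the primary key makes every row distinct again.
distinct_with_pk = conn.execute(
    "SELECT DISTINCT supplier, unique_id FROM indirect_spend"
).fetchall()

# Aggregation picks one representative key per supplier explicitly.
one_per_supplier = conn.execute(
    "SELECT supplier, MIN(unique_id), MAX(unique_id) "
    "FROM indirect_spend GROUP BY supplier ORDER BY supplier"
).fetchall()

print(len(distinct_suppliers), len(distinct_with_pk), one_per_supplier)
```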

Column 'of realationship' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause

I want to retrieve the row with the maximum value of the 'PNRno' column, where PNRno is the primary key of Tktrsrv and has relationships with multiple tables. The code is written as:
Select
PNRcd,PNRno, Tktno, Tno, Tname, Doj, Class, brding, rsrvdupto
from Tktrsrv
GROUP BY PNRno
Having PNRno= Max(PNRno);
Please help me.
When using GROUP BY, you can't use any column in the column list which is not aggregated or mentioned in the GROUP BY clause.
If you want to select just the one row with the maximum value of PNRno, you don't even need GROUP BY; use this query:
Select
PNRcd,PNRno, Tktno, Tno, Tname, Doj, Class, brding, rsrvdupto
from Tktrsrv
WHERE PNRno = (SELECT Max(PNRno) FROM Tktrsrv)
If PNRno is a primary key then you don't need a GROUP BY, as each value can only exist once in the tktrsrv table.
Therefore you can select the max value with
select max(PNRno) from tktrsrv
and if you want to select from another table that has a foreign key to PNRno (tktqueue is a table name I made up):
select * from tktqueue
where PNRno=(select max(PNRno) from tktrsrv);
Likewise, if you want to select the row from tktrsrv whose PNRno is the highest one in another table:
select * from tktrsrv
where PNRno=(select max(PNRno) from tktqueue);
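A minimal sketch of the subquery approach, run in SQLite through Python's sqlite3 module. The table is cut down to three columns and the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, hypothetical version of the Tktrsrv table.
conn.execute(
    "CREATE TABLE Tktrsrv (PNRno INTEGER PRIMARY KEY, Tname TEXT, Class TEXT)"
)
conn.executemany(
    "INSERT INTO Tktrsrv VALUES (?, ?, ?)",
    [(101, "Express A", "2A"), (102, "Express B", "SL"), (103, "Express C", "3A")],
)

# No GROUP BY needed: the scalar subquery finds the largest key,
# and the outer query returns the full row that carries it.
latest = conn.execute(
    "SELECT PNRno, Tname, Class FROM Tktrsrv "
    "WHERE PNRno = (SELECT MAX(PNRno) FROM Tktrsrv)"
).fetchall()
print(latest)
```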

Interview - Detect/remove duplicate entries

How to detect/remove duplicate entries from a database table that has no primary key?
[If we use 'DISTINCT', how do we know which record is the correct one and which is the duplicate?]
delete f
from
(
select ROW_NUMBER()
over (partition by
YourFirstPossibleDuplicateField,
YourSecondPossibleDuplicateField
order by WhateverFieldYouWantSortedBy) as DelId
from YourTable
) as f
where DelId > 1
I created a view that uses PARTITION BY rather than DISTINCT. I needed the most recent entry among records with the same OrderNum and RecordType fields, discarding the others. The partitions are ordered by date, and then the top row of each is selected, like this:
WITH q AS (
SELECT *, ROW_NUMBER()
OVER (PARTITION BY OrderNum, RecordType ORDER BY DateChanged DESC) rn
FROM HistoryTable
)
SELECT * FROM q WHERE rn = 1
If we use 'DISTINCT' how do we know which record is the correct one
and duplicate one?
If you have duplicate rows then it doesn't matter which duplicate is picked, because they are all the same!
I guess when you say "there is no primary key" you actually mean there is no simple single-column 'surrogate' candidate key, such as an incrementing sequence of integers (preferably with no gaps), but that there is a multi-column compound 'natural' candidate key (though it does not comprise all the columns).
If this is the case, you'd look for something to break ties, e.g. a column named DateChanged as per @Dave's answer. Otherwise, you need to pick an arbitrary row; e.g. the answer by @Surfer513 does this using the ROW_NUMBER() windowed function over (YourFirstPossibleDuplicateField, YourSecondPossibleDuplicateField) (i.e. your natural key), then keeping the duplicate that got arbitrarily assigned row number 1.
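The ROW_NUMBER() approach can be tried end to end in SQLite through Python's sqlite3 module (SQLite 3.25 or later for window functions). This is a sketch under assumptions: SQLite cannot delete through a derived table the way the SQL Server statement above does, so it uses SQLite's implicit rowid as a stand-in for the missing key, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No primary key declared; (OrderNum, RecordType) is the natural key.
conn.execute(
    "CREATE TABLE HistoryTable (OrderNum INTEGER, RecordType TEXT, DateChanged TEXT)"
)
conn.executemany(
    "INSERT INTO HistoryTable VALUES (?, ?, ?)",
    [
        (1, "A", "2021-01-01"),
        (1, "A", "2021-03-01"),  # most recent for (1, A): keep
        (1, "B", "2021-02-01"),  # only row for (1, B): keep
        (2, "A", "2021-01-15"),  # most recent for (2, A): keep
        (2, "A", "2021-01-10"),
    ],
)

# Number the rows inside each natural-key group, newest first,
# and delete everything that is not row number 1.
conn.execute(
    """
    DELETE FROM HistoryTable
    WHERE rowid IN (
        SELECT rid FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (
                       PARTITION BY OrderNum, RecordType
                       ORDER BY DateChanged DESC
                   ) AS rn
            FROM HistoryTable
        ) AS q
        WHERE rn > 1
    )
    """
)

remaining = conn.execute(
    "SELECT OrderNum, RecordType, DateChanged FROM HistoryTable "
    "ORDER BY OrderNum, RecordType"
).fetchall()
print(remaining)
```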

How to mark duplicates in an SQL query

I have an SQL query which looks at date-of-birth, last name and a soundex of first name to identify duplicates. The following query finds some 8,000 rows (which I assume means there are around 8,000 duplicate records).
select dob,last_name,soundex(first_name),count(*)
from clients
group by dob,last_name,soundex(first_name)
having count(*) >1
Almost all of the results have a count of 2, a few have a count of 3 where obviously the record existed twice in one of the two databases which were merged.
The next step I need to take is to mark one of the rows (it doesn't really matter which) with a duplicate flag, and to mark each row with the opposite row's key. Is there a way of doing this using SQL?
This should do what you are after, the UPDATE in one go.
-- MySQL-style multi-table UPDATE (JOIN first, then SET)
UPDATE clients c
INNER JOIN
(
select dob,last_name,soundex(first_name) as first_name_soundex,MIN(id) as keep
from clients
group by dob,last_name,soundex(first_name)
having count(*) >1
) k
ON c.dob=k.dob AND c.last_name=k.last_name AND soundex(c.first_name)=k.first_name_soundex
SET duplicateid = NULLIF(k.keep, c.id),
hasduplicate = (k.keep = c.id)
It assumes you have 3 columns not stated in the question
id: primary key
duplicateid: points to the dup being kept
hasduplicate: boolean, marks the one to keep
Well, you could use SELECT DISTINCT, and then mark a single row as "not duplicate" -- then search for rows that are "not duplicate" to find the duplicate.
Here is a query that will give you not only the duplicates, but also the first id inserted (assuming Id is the sequential primary-key column) and the newest id.
OTTOMH
select dob, last_name, soundex(first_name) firstnamesoundex, min (Id) OldestId, max (Id) NewestId, Count (*) NumRows
from clients
group by dob,last_name,soundex(first_name)
having count(*) >1
You can use this in a JOIN to do your update
UPDATE Clients
SET OppositeRowId = DuplicateRows.NewestId
FROM
(
select dob, last_name, soundex(first_name) firstnamesoundex, min (Id) OldestId, max (Id) NewestId, Count (*) NumRows
from clients
group by dob,last_name,soundex(first_name)
having count(*) >1
) DuplicateRows
WHERE
DuplicateRows.OldestId = Clients.Id
All of this assumes that you have one duplicate. If you have more than one, you are going to have to try something different.
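For groups with more than two matching rows, one option is to point every later row at the oldest id in its group. Below is a rough sketch in SQLite through Python's sqlite3 module. It is a simplification, not the query from the answers: the match key is reduced to (dob, last_name) because SQLite's soundex() is only available in builds compiled with it, and the duplicate_of column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clients ("
    " id INTEGER PRIMARY KEY, dob TEXT, last_name TEXT,"
    " duplicate_of INTEGER)"  # hypothetical column: id of the row being kept
)
conn.executemany(
    "INSERT INTO clients (id, dob, last_name) VALUES (?, ?, ?)",
    [
        (1, "1980-01-01", "Smith"),
        (2, "1980-01-01", "Smith"),
        (3, "1975-05-05", "Jones"),
        (4, "1980-01-01", "Smith"),  # third copy of the same person
        (5, "1990-09-09", "Lee"),
    ],
)

# Every row that has an older match gets the oldest id of its group;
# the oldest row itself and unmatched rows stay NULL.
conn.execute(
    """
    UPDATE clients
    SET duplicate_of = (
        SELECT MIN(c2.id) FROM clients AS c2
        WHERE c2.dob = clients.dob AND c2.last_name = clients.last_name
    )
    WHERE EXISTS (
        SELECT 1 FROM clients AS c2
        WHERE c2.dob = clients.dob AND c2.last_name = clients.last_name
          AND c2.id < clients.id
    )
    """
)

marks = conn.execute("SELECT id, duplicate_of FROM clients ORDER BY id").fetchall()
print(marks)
```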