Select all rows of a SQL table which do not share a name - sql

I have a table, call it tasks, which has columns name and created_at, among others. I want to run a query that returns the count of all rows of tasks which share the same name and were created within a millisecond of each other.
This is the query I have come up with, but it returns a number greater than the total number of rows in the table. Can someone point out where I am going wrong?
SELECT COUNT (DISTINCT "t1"."id")
FROM
"tasks" "t1" ,"tasks" "t2"
WHERE
"t1"."name" = "t2"."name"
AND
date_trunc('milliseconds',"t1"."created_at") = date_trunc('milliseconds',"t2"."created_at")

In a self-join every row also matches itself, so each row is counted even when it has no partner. You should add the condition:
and "t1"."id" <> "t2"."id"
where "id" is the primary key. In the absence of a primary key you can use ctid, PostgreSQL's physical row identifier:
and "t1".ctid <> "t2".ctid
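The effect of the extra condition can be checked on a tiny table. This is a minimal sketch using Python's sqlite3 module rather than PostgreSQL; the sample rows are made up, and because SQLite has no date_trunc, the timestamps are stored as text already truncated to milliseconds and compared directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, name TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO tasks (id, name, created_at) VALUES (?, ?, ?)",
    [
        (1, "a", "2020-01-01 00:00:00.001"),
        (2, "a", "2020-01-01 00:00:00.001"),  # same name and timestamp as row 1
        (3, "a", "2020-01-01 00:00:05.000"),  # same name, different timestamp
        (4, "b", "2020-01-01 00:00:00.001"),  # different name
    ],
)

# Stand-in for the original query: plain equality replaces date_trunc (assumption).
BASE = """
    SELECT COUNT(DISTINCT t1.id)
    FROM tasks t1, tasks t2
    WHERE t1.name = t2.name
      AND t1.created_at = t2.created_at
"""

# Without the extra condition every row pairs with itself, so every row is counted.
count_with_self = conn.execute(BASE).fetchone()[0]

# Excluding self-pairs leaves only rows that have a real partner.
count_without_self = conn.execute(BASE + " AND t1.id <> t2.id").fetchone()[0]

print(count_with_self, count_without_self)
```

Only rows 1 and 2 have a genuine partner, so the second count is the one the question is after.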

Related

How to fetch required data using single query in sql developer

I have a subject table with a subject_id column. One row in the table has a NULL subject_id; every other row has a distinct subject_id value.
I am looking for a single query that fetches the data on the basis of subject_id:
Select * from subject where subject_id = x;
If no row is found for x, then it should return the row whose subject_id is NULL.
In general this is a terrible pattern for tables. NULL as a primary key value is only going to cause you pain and suffering in the long run. Using a NULL-keyed row as a default for when your query matches no other rows will lead to strange behavior somewhere unexpected.
The simplest way is to include the NULL row as the last row of the query result and then fetch only the first row. That only works when your query can return at most one valid result.
select *
from subject
where subject_id = ? or subject_id is null
order by subject_id asc nulls last
fetch first 1 row only -- keep only the first row (Oracle 12c+ syntax)
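The fallback behaviour of this query can be sketched with Python's sqlite3 module. The title column and the sample rows are hypothetical, and the ordering is written as `subject_id IS NULL` plus `LIMIT 1` so that it does not depend on NULLS LAST support in the bundled SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subject (subject_id INTEGER, title TEXT)")  # title is hypothetical
conn.executemany(
    "INSERT INTO subject VALUES (?, ?)",
    [(1, "math"), (2, "physics"), (None, "default")],
)


def fetch_subject(subject_id):
    """Return the matching row, or the NULL-keyed row when nothing matches."""
    return conn.execute(
        """
        SELECT subject_id, title
        FROM subject
        WHERE subject_id = ? OR subject_id IS NULL
        ORDER BY subject_id IS NULL, subject_id  -- non-NULL keys sort first
        LIMIT 1
        """,
        (subject_id,),
    ).fetchone()


print(fetch_subject(2))
print(fetch_subject(99))
```

An existing key returns its own row, while an unknown key falls through to the NULL-keyed row.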
Possibly the biggest problem with a NULL PK for your default/placeholder row in subject is that any other table holding a NULL subject_id cannot simply join to that row using x.subject_id = y.subject_id, because NULL = NULL does not evaluate to true.
If you really need such a row, I suggest using -1 instead of NULL as the "not exists" value. It will make your life much easier across the board, especially if you need to join to it.
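The join problem is easy to reproduce. In this sketch (Python's sqlite3 module again) the enrollment table, its columns, and all rows are invented for illustration; the subject table holds both a NULL-keyed and a -1-keyed placeholder so the two can be compared side by side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE subject (subject_id INTEGER, title TEXT);
    CREATE TABLE enrollment (student TEXT, subject_id INTEGER);  -- hypothetical table
    INSERT INTO subject VALUES (1, 'math'), (NULL, 'null default'), (-1, 'sentinel default');
    INSERT INTO enrollment VALUES ('ann', 1), ('bob', NULL), ('cy', -1);
    """
)

# An ordinary equality join: NULL = NULL is not true, so bob finds no partner row.
joined = conn.execute(
    """
    SELECT e.student, s.title
    FROM enrollment e
    JOIN subject s ON e.subject_id = s.subject_id
    ORDER BY e.student
    """
).fetchall()

print(joined)
```

The row keyed with -1 joins like any other value, while the NULL-keyed row silently drops out of the result.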

How to break ties when comparing columns in SQL

I am trying to delete duplicates in Postgres. I am using this as the base of my query:
DELETE FROM case_file as p
WHERE EXISTS (
SELECT FROM case_file as p1
WHERE p1.serial_no = p.serial_no
AND p1.cfh_status_dt < p.cfh_status_dt
);
It works well, except that when the cfh_status_dt dates are equal, neither record is removed.
For rows that have the same serial_no and the same date, I would like to keep the one that has a registration_no (if any does; this column also contains NULLs).
Is there a way I can do this all in one query, possibly with a CASE expression or another simple comparison?
DELETE FROM case_file AS p
WHERE id NOT IN (
SELECT DISTINCT ON (serial_no) id -- id = PK
FROM case_file
ORDER BY serial_no, cfh_status_dt DESC, registration_no
);
This keeps the (one) latest row per serial_no, choosing the smallest registration_no if there are multiple candidates.
NULL sorts last in default ascending order. So any row with a not-null registration_no is preferred.
If you want the greatest registration_no instead, while still sorting NULL values last, use:
...
ORDER BY serial_no, cfh_status_dt DESC, registration_no DESC NULLS LAST
See:
Select first row in each GROUP BY group?
Sort by column ASC, but NULL values first?
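The tie-breaking order can be tried out on a few rows. DISTINCT ON is PostgreSQL-specific, so this sketch swaps in a correlated subquery with ORDER BY ... LIMIT 1 and runs it through Python's sqlite3 module; the sample rows are invented, and `registration_no IS NULL` in the ORDER BY stands in for PostgreSQL's default of sorting NULL last.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE case_file (
           id INTEGER PRIMARY KEY,
           serial_no TEXT,
           cfh_status_dt TEXT,
           registration_no INTEGER)"""
)
conn.executemany(
    "INSERT INTO case_file VALUES (?, ?, ?, ?)",
    [
        (1, "A", "2020-01-01", None),  # older row for serial A
        (2, "A", "2020-01-02", None),  # latest date, no registration_no
        (3, "A", "2020-01-02", 500),   # latest date, has registration_no
        (4, "B", "2020-03-01", None),  # only row for serial B
    ],
)

# Delete every row that is not the best row of its serial_no group.
conn.execute(
    """
    DELETE FROM case_file
    WHERE id <> (
        SELECT p1.id
        FROM case_file AS p1
        WHERE p1.serial_no = case_file.serial_no
        ORDER BY p1.cfh_status_dt DESC,
                 p1.registration_no IS NULL,  -- rows with a value sort first
                 p1.registration_no
        LIMIT 1
    )
    """
)

survivors = [row[0] for row in conn.execute("SELECT id FROM case_file ORDER BY id")]
print(survivors)
```

For serial A the two rows with the latest date tie, and the one with a registration_no wins.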
If you have no PK (PRIMARY KEY) or other UNIQUE NOT NULL (combination of) column(s) you can use for this purpose, you can fall back to ctid. See:
How do I (or can I) SELECT DISTINCT on multiple columns?
NOT IN is typically not the most efficient way. But this deals with duplicates involving NULL values. See:
How to delete duplicate rows without unique identifier
If there are many duplicates - and you can afford to do so! - it can be (much) more efficient to create a new, pristine table of survivors and replace the old table, instead of deleting the majority of rows in the existing table.
Or create a temporary table of survivors, truncate the old table, and insert from the temp table. This way dependent objects like views or FK constraints can stay in place. See:
How to delete duplicate entries?
Surviving rows are simply:
SELECT DISTINCT ON (serial_no) *
FROM case_file
ORDER BY serial_no, cfh_status_dt DESC, registration_no;
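The temp-table route described above might look roughly like this. It is a sketch against SQLite through Python's sqlite3 module, not PostgreSQL: the rows are made up, a correlated subquery stands in for DISTINCT ON, and DELETE without a WHERE clause stands in for TRUNCATE, which SQLite lacks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE case_file (
           id INTEGER PRIMARY KEY,
           serial_no TEXT,
           cfh_status_dt TEXT,
           registration_no INTEGER)"""
)
conn.executemany(
    "INSERT INTO case_file VALUES (?, ?, ?, ?)",
    [
        (1, "A", "2020-01-01", 10),
        (2, "A", "2020-01-01", None),
        (3, "B", "2020-02-01", None),
        (4, "B", "2020-02-02", None),
    ],
)
conn.commit()

conn.executescript(
    """
    -- 1. Copy the one surviving row per serial_no into a temporary table.
    CREATE TEMP TABLE survivors AS
    SELECT p.*
    FROM case_file AS p
    WHERE p.id = (
        SELECT p1.id
        FROM case_file AS p1
        WHERE p1.serial_no = p.serial_no
        ORDER BY p1.cfh_status_dt DESC, p1.registration_no IS NULL, p1.registration_no
        LIMIT 1
    );
    -- 2. Empty the original table but keep its definition in place.
    DELETE FROM case_file;
    -- 3. Refill it from the survivors.
    INSERT INTO case_file SELECT * FROM survivors;
    DROP TABLE survivors;
    """
)

remaining = [row[0] for row in conn.execute("SELECT id FROM case_file ORDER BY id")]
print(remaining)
```

Because the original table is never dropped, anything that references it keeps working.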

Distinct column with primary key column

Distinct column count differs when adding the primary key column in the Select query
The count distinct for supplier_payment_terms is 110, but when adding the PK column, the count changes to thousands.
select distinct supplier, unique_id from indirect_spend;
I expect the same record count of 110 when including the PK column in the select. The Select must only include the unique_id of the supplier.
"I expect the same record count of 110 when including the PK column in the select"
Then your expectation is wrong. SELECT DISTINCT makes all rows in the result distinct as whole rows, i.e. no duplicate rows in the result. Because unique_id differs on every row, every (supplier, unique_id) combination is already distinct.
Besides, imagine two rows (supplier, unique_id): (1, 2) and (1, 5). You say you expect only one row in the result. How is the system going to determine which of the two rows to deliver?
You can use aggregation to get example primary keys:
select supplier, min(unique_id), max(unique_id)
from indirect_spend
group by supplier;
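The difference between the three queries shows up even on four rows. This is a small sketch with Python's sqlite3 module; the supplier names and ids are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE indirect_spend (unique_id INTEGER PRIMARY KEY, supplier TEXT)")
conn.executemany(
    "INSERT INTO indirect_spend VALUES (?, ?)",
    [(1, "acme"), (2, "acme"), (5, "acme"), (7, "globex")],
)

# DISTINCT on the supplier alone collapses repeated suppliers.
distinct_suppliers = conn.execute(
    "SELECT DISTINCT supplier FROM indirect_spend"
).fetchall()

# Adding the primary key makes every row distinct again.
distinct_pairs = conn.execute(
    "SELECT DISTINCT supplier, unique_id FROM indirect_spend"
).fetchall()

# Aggregation returns one row per supplier with example keys.
per_supplier = conn.execute(
    """
    SELECT supplier, MIN(unique_id), MAX(unique_id)
    FROM indirect_spend
    GROUP BY supplier
    ORDER BY supplier
    """
).fetchall()

print(len(distinct_suppliers), len(distinct_pairs), per_supplier)
```

Only the GROUP BY version keeps one row per supplier while still returning a key from the table.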

Logically determine a composite key in SQL

I'm working with an MSSQL table that does not have a primary or unique key constraint defined. There are two fields, let's call them xId and yId, that I believe together would be a composite key, but I want to confirm this by examining the data.
I'm thinking that I should be able to write a SQL count statement that I can compare to the total number of records in the table, which would logically determine whether the combination of xId and yId (or a third column if necessary) could in fact act as a composite key. However, I'm having trouble coming up with the right GROUP BY or other type of clause that would confirm or disprove this.
Any ideas?
Use group by and having:
select xid,yid
from table
group by xid,yid
having count(1) > 1
This will show any pairs that are non-unique, so if no rows are returned, it's a good key.
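As a quick check of the GROUP BY / HAVING approach, here is a sketch using Python's sqlite3 module. The table name records, the payload column, and the rows are placeholders, and the helper duplicate_pairs is introduced only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (xid INTEGER, yid INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [(1, 1, "a"), (1, 2, "b"), (2, 1, "c"), (1, 2, "d")],  # (1, 2) appears twice
)


def duplicate_pairs(connection):
    """Return (xid, yid, row_count) for every pair that occurs more than once."""
    return connection.execute(
        """
        SELECT xid, yid, COUNT(*)
        FROM records
        GROUP BY xid, yid
        HAVING COUNT(*) > 1
        """
    ).fetchall()


offenders = duplicate_pairs(conn)
print(offenders)  # an empty list would mean (xid, yid) is unique in this data
```

An empty result only shows that the current data does not contradict the key; it cannot prove that future rows will respect it.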
Just do a count of the total rows of the table, and then do
select count(1)
from(
select xid,yid
from table
group by xid,yid
)a;
If all pairs of xid and yid form a unique identifier, then the two numbers will be the same.
Alternatively, you could count the rows for each pair of xid and yid and find the largest such count:
select max(num_rows)
from(
select xid,yid,count(1) as num_rows
from table
group by xid,yid
)a;
The result of this query is 1 if and only if (xid,yid) pairs form a unique identifier for your table.
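Both checks can be run side by side. This sketch uses Python's sqlite3 module with a placeholder table named records and invented rows in which one pair repeats, so both tests should report that (xid, yid) is not a key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (xid INTEGER, yid INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [(1, 1, "a"), (1, 2, "b"), (2, 1, "c"), (1, 2, "d")],
)

# Check 1: compare the total row count with the number of distinct pairs.
total_rows = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
distinct_pairs = conn.execute(
    "SELECT COUNT(*) FROM (SELECT xid, yid FROM records GROUP BY xid, yid) AS a"
).fetchone()[0]

# Check 2: the largest group size is 1 exactly when every pair is unique.
max_rows_per_pair = conn.execute(
    """
    SELECT MAX(num_rows)
    FROM (SELECT xid, yid, COUNT(*) AS num_rows
          FROM records
          GROUP BY xid, yid) AS a
    """
).fetchone()[0]

is_key_by_count = total_rows == distinct_pairs
is_key_by_max = max_rows_per_pair == 1
print(total_rows, distinct_pairs, max_rows_per_pair)
```

The two tests always agree; the second needs only one query.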
This will list all the problem combinations (if any) of xid,yid:
SELECT
COUNT(*),xid,yid
FROM YourTable
GROUP BY xid,yid
HAVING COUNT(*)>1

How to mark duplicates in an SQL query

I have an SQL query which looks at date-of-birth, last name and a soundex of first name to identify duplicates. The following query finds some 8,000 rows (which I assume means there are around 8,000 duplicate records).
select dob,last_name,soundex(first_name),count(*)
from clients
group by dob,last_name,soundex(first_name)
having count(*) >1
Almost all of the results have a count of 2. A few have a count of 3, where the record evidently existed twice in one of the two databases that were merged.
The next step I need to take is to mark one of the rows (it doesn't really matter which) with a duplicate flag, and to mark each row with the opposite row's key. Is there a way of doing this using SQL?
This should do what you are after, the UPDATE in one go.
-- MySQL multi-table UPDATE syntax
UPDATE clients c
INNER JOIN
(
select dob,last_name,soundex(first_name) as first_name_soundex,MIN(id) as keep
from clients
group by dob,last_name,soundex(first_name)
having count(*) >1
) k
ON c.dob=k.dob AND c.last_name=k.last_name AND soundex(c.first_name)=k.first_name_soundex
SET duplicateid = NULLIF(k.keep, c.id),
hasduplicate = (k.keep = c.id)
It assumes you have three columns not stated in the question:
id: primary key
duplicateid: points to the dup being kept
hasduplicate: boolean, marks the one to keep
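The intended result of that UPDATE can be sketched in two steps with Python's sqlite3 module. Two assumptions are swapped in here: SQLite does not ship soundex by default, so lower(first_name) is used as a stand-in key, and the join-style UPDATE is replaced by one parameterised UPDATE per duplicate group. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE clients (
           id INTEGER PRIMARY KEY,
           dob TEXT,
           last_name TEXT,
           first_name TEXT,
           duplicateid INTEGER,
           hasduplicate INTEGER)"""
)
conn.executemany(
    "INSERT INTO clients (id, dob, last_name, first_name) VALUES (?, ?, ?, ?)",
    [
        (1, "1980-05-01", "Smith", "Jon"),
        (2, "1980-05-01", "Smith", "jon"),  # duplicate of row 1 under the stand-in key
        (3, "1975-11-30", "Jones", "Ann"),  # no duplicate
    ],
)

# Step 1: find each duplicate group and the id to keep (the smallest one).
groups = conn.execute(
    """
    SELECT dob, last_name, lower(first_name) AS name_key, MIN(id) AS keep
    FROM clients
    GROUP BY dob, last_name, lower(first_name)
    HAVING COUNT(*) > 1
    """
).fetchall()

# Step 2: mark every row of each group relative to the kept id.
for dob, last_name, name_key, keep in groups:
    conn.execute(
        """
        UPDATE clients
        SET duplicateid = NULLIF(?, id),  -- NULL on the kept row itself
            hasduplicate = (? = id)       -- 1 on the kept row, 0 on the others
        WHERE dob = ? AND last_name = ? AND lower(first_name) = ?
        """,
        (keep, keep, dob, last_name, name_key),
    )

marked = conn.execute(
    "SELECT id, duplicateid, hasduplicate FROM clients ORDER BY id"
).fetchall()
print(marked)
```

Rows outside any duplicate group are left untouched, so both of their marker columns stay NULL.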
Well, you could use SELECT DISTINCT, and then mark a single row as "not duplicate" -- then search for rows that are "not duplicate" to find the duplicate.
Here is a query that will give you not only the duplicates, but also the first id inserted (assuming Id is the sequential primary-key column) and the newest id.
Off the top of my head:
select dob, last_name, soundex(first_name) firstnamesoundex, min (Id) OldestId, max (Id) NewestId, Count (*) NumRows
from clients
group by dob,last_name,soundex(first_name)
having count(*) >1
You can use this in a JOIN to do your update
UPDATE Clients
SET OppositeRowId = DuplicateRows.NewestId
FROM
(
select dob, last_name, soundex(first_name) firstnamesoundex, min (Id) OldestId, max (Id) NewestId, Count (*) NumRows
from clients
group by dob,last_name,soundex(first_name)
having count(*) >1
) DuplicateRows
WHERE
DuplicateRows.OldestId = Clients.Id
All of this assumes that you have one duplicate. If you have more than one, you are going to have to try something different.
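For the exactly-one-duplicate case, linking each row of a pair to the other might be sketched as follows with Python's sqlite3 module. This goes slightly beyond the UPDATE above, which sets OppositeRowId on the oldest row only. The rows are invented, exact first_name equality stands in for soundex (not available in SQLite by default), and groups of three or more are skipped on purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE clients (
           Id INTEGER PRIMARY KEY,
           dob TEXT,
           last_name TEXT,
           first_name TEXT,
           OppositeRowId INTEGER)"""
)
conn.executemany(
    "INSERT INTO clients (Id, dob, last_name, first_name) VALUES (?, ?, ?, ?)",
    [
        (1, "1980-05-01", "Smith", "Jon"),
        (2, "1980-05-01", "Smith", "Jon"),
        (3, "1975-11-30", "Jones", "Ann"),
    ],
)

# Only groups of exactly two rows have a well-defined "opposite" row.
pairs = conn.execute(
    """
    SELECT MIN(Id) AS OldestId, MAX(Id) AS NewestId
    FROM clients
    GROUP BY dob, last_name, first_name
    HAVING COUNT(*) = 2
    """
).fetchall()

for oldest_id, newest_id in pairs:
    # Point each row of the pair at the other one.
    conn.execute("UPDATE clients SET OppositeRowId = ? WHERE Id = ?", (newest_id, oldest_id))
    conn.execute("UPDATE clients SET OppositeRowId = ? WHERE Id = ?", (oldest_id, newest_id))

linked = conn.execute("SELECT Id, OppositeRowId FROM clients ORDER BY Id").fetchall()
print(linked)
```

Groups with a count of 3 would need a different rule, for example pointing every extra row at the oldest id.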