SQL to find why a PK candidate has duplicates on an unkeyed table

If my title hurts your head... I'm with you. I don't want to get into why this table exists, except that it is part of a legacy system. The system also does "record level access" (RLA), which I know will be an issue for many tables. The RLA matters here because adding a column would change the table format, and then many very old programs would no longer work...
Apparently adding a PK has been shown not to change the table format. So I've been told that a certain set of keys is guaranteed to be unique. Well, what do you know... it isn't. And now I need to show where it isn't.
All I can think of is:
Get the cross product where the table matches on its candidate primary key.
Somehow get a count column onto the result set for the number of entries where the PK matches itself.
Filter that result set for rows where the count is greater than 1.
I'm going to see whether, if I expand the PK sufficiently, I'll actually find something unique.
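A minimal sketch of that self-join idea (your_table, keyA, and keyB are placeholder names for the table and the candidate-key columns):

SELECT a.keyA, a.keyB, COUNT(*) AS matches
FROM your_table a
JOIN your_table b
  ON a.keyA = b.keyA
 AND a.keyB = b.keyB
GROUP BY a.keyA, a.keyB
HAVING COUNT(*) > 1;  -- a truly unique key only ever matches itself

As the answers below show, though, a plain GROUP BY ... HAVING does the same job without the self-join.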

Remove the constraints / unique indexes, insert the data, and then run this query:
SELECT col1, col2, ..., coln, COUNT(*)
FROM your_table
GROUP BY col1, col2, ..., coln
HAVING COUNT(*) > 1
where col1, col2, ..., coln is the list of columns in your key (one or more columns). The result will be the list of keys that occur more than once together with a count showing how often they occur.

select col1, ... from tab group by col1, ... having count(*)>1;

SELECT * FROM (SELECT ID, COUNT(*) CNT FROM MY_TABLE GROUP BY ID) T WHERE CNT > 1


SQL: If two rows have the same value A delete the row with the lowest value B [duplicate]

I have a table in a PostgreSQL 8.3.8 database, which has no keys/constraints on it, and has multiple rows with exactly the same values.
I would like to remove all duplicates and keep only 1 copy of each row.
There is one column in particular (named "key") which may be used to identify duplicates, i.e. there should only exist one entry for each distinct "key".
How can I do this? (Ideally, with a single SQL command.)
Speed is not a problem in this case (there are only a few rows).
A faster solution is:
DELETE FROM dupes a USING (
    SELECT MIN(ctid) AS ctid, key
    FROM dupes
    GROUP BY key
    HAVING COUNT(*) > 1
) b
WHERE a.key = b.key
AND a.ctid <> b.ctid;
The same result with a correlated subquery:
DELETE FROM dupes a
WHERE a.ctid <> (SELECT min(b.ctid)
                 FROM dupes b
                 WHERE a.key = b.key);
This is fast and concise:
DELETE FROM dupes T1
USING dupes T2
WHERE T1.ctid < T2.ctid -- delete the older versions
AND T1.key = T2.key; -- add more columns if needed
See also my answer at How to delete duplicate rows without unique identifier, which includes more information.
EXISTS is simple and among the fastest for most data distributions:
DELETE FROM dupes d
WHERE EXISTS (
SELECT FROM dupes
WHERE key = d.key
AND ctid < d.ctid
);
From each set of duplicate rows (defined by identical key), this keeps the one row with the minimum ctid.
Result is identical to the currently accepted answer by a_horse. Just faster, because EXISTS can stop evaluating as soon as the first offending row is found, while the alternative with min() has to consider all rows per group to compute the minimum. Speed is of no concern to this question, but why not take it?
You may want to add a UNIQUE constraint after cleaning up, to prevent duplicates from creeping back in:
ALTER TABLE dupes ADD CONSTRAINT constraint_name_here UNIQUE (key);
About the system column ctid:
Is the system column “ctid” legitimate for identifying rows to delete?
If there is any other UNIQUE NOT NULL column in the table (like a PRIMARY KEY), then by all means use it instead of ctid.
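For instance, with a UNIQUE NOT NULL column named id (a name assumed here for illustration), the EXISTS query above becomes:

DELETE FROM dupes d
WHERE EXISTS (
   SELECT FROM dupes
   WHERE key = d.key
   AND   id < d.id
   );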
If key can be NULL and you only want one of those, too, use IS NOT DISTINCT FROM instead of =. See:
How do I (or can I) SELECT DISTINCT on multiple columns?
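A sketch of that NULL-aware variant (same table and column names as above):

DELETE FROM dupes d
WHERE EXISTS (
   SELECT FROM dupes
   WHERE key IS NOT DISTINCT FROM d.key
   AND   ctid < d.ctid
   );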
As IS NOT DISTINCT FROM is slower, you might instead run the original query with = as is, and this in addition:
DELETE FROM dupes d
WHERE key IS NULL
AND EXISTS (
SELECT FROM dupes
WHERE key IS NULL
AND ctid < d.ctid
);
And consider:
Create unique constraint with null columns
For small tables, indexes generally do not help performance. And we need not look further.
For big tables and few duplicates, an existing index on (key) can help (a lot).
For mostly duplicates, an index may add more cost than benefit, as it has to be kept up to date concurrently. Finding duplicates without index becomes faster anyway because there are so many and EXISTS only needs to find one. But consider a completely different approach if you can afford it (i.e. concurrent access allows it): Write the few surviving rows to a new table. That also removes table (and index) bloat in the process. See:
How to delete duplicate entries?
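A sketch of that copy-to-new-table route, using the same dupes/key names as above (note that it recreates the table, so indexes, constraints, and privileges have to be recreated afterwards):

BEGIN;
CREATE TABLE dupes_clean AS
SELECT DISTINCT ON (key) *
FROM dupes
ORDER BY key, ctid;  -- keeps the row with the smallest ctid per key

DROP TABLE dupes;
ALTER TABLE dupes_clean RENAME TO dupes;
COMMIT;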
I tried this:
DELETE FROM tablename
WHERE id IN (SELECT id
FROM (SELECT id,
ROW_NUMBER() OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
FROM tablename) t
WHERE t.rnum > 1);
as provided by the Postgres wiki:
https://wiki.postgresql.org/wiki/Deleting_duplicates
I would use a temporary table:
create table tab_temp as
select distinct f1, f2, f3, fn
from tab;
Then, drop tab and rename tab_temp to tab.
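For completeness, those two steps in PostgreSQL syntax (note that indexes, constraints, and privileges on tab are lost along the way):

DROP TABLE tab;
ALTER TABLE tab_temp RENAME TO tab;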
I had to create my own version. The version written by @a_horse_with_no_name is way too slow on my table (21M rows), and @rapimo's simply doesn't delete dups.
Here is what I use on PostgreSQL 9.5
DELETE FROM your_table
WHERE ctid IN (
    SELECT unnest(array_remove(all_ctids, actid))
    FROM (
        SELECT
            min(b.ctid) AS actid,
            array_agg(ctid) AS all_ctids
        FROM your_table b
        GROUP BY key1, key2, key3, key4
        HAVING count(*) > 1
    ) c
);
Another approach (it works only if you have a unique field, like id, in your table): find one unique id per group of columns, and remove all the other ids that are not in that unique list.
DELETE
FROM users
WHERE users.id NOT IN (SELECT DISTINCT ON (username, email) id FROM users);
PostgreSQL has window functions; you can use rank() to achieve your goal. Sample:
WITH ranked AS (
    SELECT
        id, column1,
        rank() OVER (
            PARTITION BY column1
            ORDER BY id  -- order by the unique id so ranks are distinct within each partition
        ) AS r
    FROM table1
)
DELETE FROM table1 t1
USING ranked
WHERE t1.id = ranked.id AND ranked.r > 1;
Here is another solution that worked for me.
delete from table_name a using table_name b
where a.id < b.id
and a.column1 = b.column1;
How about:
WITH
u AS (SELECT DISTINCT * FROM your_table),
x AS (DELETE FROM your_table)
INSERT INTO your_table SELECT * FROM u;
I had been concerned about execution order (would the DELETE happen before the SELECT DISTINCT?), but it works fine: in PostgreSQL, all statements in a WITH see the same snapshot of the data, so the DELETE cannot affect what u selects.
It also has the added bonus of not needing any knowledge about the table structure.
Here is a solution using PARTITION BY and the virtual ctid column, which works like a primary key, at least within a single session:
DELETE FROM dups
USING (
SELECT
ctid,
(
ctid != min(ctid) OVER (PARTITION BY key_column1, key_column2 [...])
) AS is_duplicate
FROM dups
) dups_find_duplicates
WHERE dups.ctid = dups_find_duplicates.ctid
AND dups_find_duplicates.is_duplicate
A subquery is used to mark all rows as duplicates or not, based on whether they share the same "key columns", but not the same ctid, as the "first" one found in the "partition" of rows sharing the same keys.
In other words, "first" is defined as:
min(ctid) OVER (PARTITION BY key_column1, key_column2 [...])
Then, all rows where is_duplicate is true are deleted by their ctid.
From the documentation, ctid represents (emphasis mine):
The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. A primary key should be used to identify logical rows.
Well, none of these solutions would work if the id itself is duplicated, which is my use case. Then the solution is simple:
myTable:
id name
0 value
0 value
0 value
1 value1
1 value1
create table dedupMyTable as select distinct * from myTable;
delete from myTable;
insert into myTable select * from dedupMyTable;
select * from myTable;
id name
0 value
1 value1
Well, you shouldn't have duplicate ids in your table, unless it has no PK constraint or the platform simply doesn't support one, such as Hive / data lake tables.
Better to pay attention when loading your data to avoid duplicates over IDs.
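For example, one PostgreSQL-style way to dedupe at load time (staging_table is an assumed name for the incoming data):

INSERT INTO myTable (id, name)
SELECT DISTINCT ON (id) id, name  -- keep one row per id
FROM staging_table
ORDER BY id;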
DELETE FROM tracking_order
WHERE
mvd_id IN ( -- the column you need to deduplicate
SELECT
mvd_id
FROM (
SELECT
mvd_id,thoi_gian_gui,
ROW_NUMBER() OVER (
PARTITION BY mvd_id
ORDER BY thoi_gian_gui desc) AS row_num
FROM
tracking_order
) s_alias
WHERE row_num > 1)
AND thoi_gian_gui IN ( -- the column used to compare duplicates, e.g. last update time
SELECT
thoi_gian_gui
FROM (
SELECT
thoi_gian_gui,
ROW_NUMBER() OVER (
PARTITION BY mvd_id
ORDER BY thoi_gian_gui desc) AS row_num
FROM
tracking_order
) s_alias
WHERE row_num > 1)
With this code, I removed all duplicates (7,800,445 rows), keeping only one copy of each row, in 7 min 28 sec.
This worked well for me. I had a table, terms, that contained duplicate values. I ran a query to populate a temp table with the ids of all the duplicate rows, then ran a delete statement using those ids. value is the column that contained the duplicates.
CREATE TEMP TABLE dupids AS
select id from (
select value, id, row_number()
over (partition by value order by value)
as rownum from terms
) tmp
where rownum >= 2;
delete from terms where id in (select id from dupids);

Copying all rows from one table to another without writing out all of the columns

I'm trying to copy over all rows from one table into another that are distinct on one column (Using a Postgresql database). I know that this can be done like so:
INSERT INTO table2(col1, col2, col3, ...)
SELECT
DISTINCT ON (col1) col1, col2, col3, ...
FROM table1;
The problem I'm having is that table1 has 100+ columns and so I don't want to write out all of the column names. I tried to do something like:
INSERT INTO table2 (*)
SELECT
DISTINCT ON (col1) *
FROM table1;
which resulted in a syntax error. Could someone please provide a code snippet with the correct syntax?
If the columns exactly line up, you can use:
INSERT INTO table2
SELECT DISTINCT ON (col1) t1.*
FROM table1 t1
ORDER BY col1;
Very importantly: When using DISTINCT ON, you should always have an ORDER BY, where the keys for the ORDER BY match the expressions in parentheses.
Leaving out the explicit columns in the INSERT is dangerous -- precisely because there might be some slip-up (columns out of order or a different number of columns). Sometimes when you are writing scripts and you know that the destination table really does match the source table, though, it can be handy.

SQL select distinct by 2 or more columns

I have a table with a lot of columns, and what I need to do is write a select that takes only unique values. The main problem is that I need to check three columns at the same time: rows count as duplicates when all three columns match (each value compared within its own column, not between columns). The idea is something like distinct(column1, column2, column3).
Any ideas? Or you need more information, because I'm not sure if everybody gets what I have in mind.
Here is an example. The select should return two rows from it: one where the last column is Yes, and the other where it is No.
This is exactly what the distinct keyword is for:
SELECT distinct col1, col2, col3
FROM mytable

Fast way to eyeball possible duplicate rows in a table?

Similar: How can I delete duplicate rows in a table
I have a feeling this is impossible and I'm going to have to do it the tedious way, but I'll see what you guys have to say.
I have a pretty big table, about 4 million rows, and 50-odd columns. It has a column that is supposed to be unique, Episode. Unfortunately, Episode is not unique - the logic behind this was that occasionally other fields in the row change, despite Episode being repeated. However, there is an actually unique column, Sequence.
I want to try and identify rows that have the same episode number, but something different between them (aside from sequence), so I can pick out how often this occurs, and whether it's worth allowing for or I should just nuke the rows and ignore possible mild discrepancies.
My hope is to create a table that shows the Episode number, and a column for each table column, identifying the value on both sides, where they are different:
SELECT a.Episode,
CASE WHEN a.Value1<>b.Value1
THEN a.Value1 + ',' + b.Value1
ELSE '' END AS Value1,
CASE WHEN a.Value2<>b.Value2
THEN a.Value2 + ',' + b.Value2
ELSE '' END AS Value2
FROM Table1 a INNER JOIN Table1 b ON a.Episode = b.Episode
WHERE a.Value1<>b.Value1
OR a.Value2<>b.Value2
(That is probably full of holes, but the idea of highlighting changed values comes through, I hope.)
Unfortunately, making a query like that for fifty columns is pretty painful. Obviously, it doesn't exactly have to be rock-solid if it will only be used the once, but at the same time, the more copy-pasted code there is, the more likely something will be missed. As far as I know, I can't just use SELECT DISTINCT, since Sequence is distinct, so the same row will pop up as different.
Does anyone have a query or function that might help? Either something that will output a query result similar to the above, or a different solution? As I said, right now I'm not really looking to remove the duplicates, just identify them.
Use:
SELECT DISTINCT t.*
FROM TABLE t
ORDER BY t.episode --, and whatever other columns
DISTINCT is just shorthand for writing a GROUP BY with all the columns involved. Grouping by all the columns will show you all the unique groups of records associated with the episode column in this case. So there's a risk of not having an accurate count of duplicates, but you will have the values so you can decide what to remove when you get to that point.
50 columns is a lot, but setting the ORDER BY will allow you to eyeball the list. Another alternative would be to export the data to Excel if you don't want to construct the ORDER BY, and use Excel's sorting.
UPDATE
I didn't catch that the sequence column would be a unique value, but in that case you'd have to provide a list of all the columns you want to see, i.e.:
SELECT DISTINCT t.episode, t.column1, t.column2 --etc.
FROM TABLE t
ORDER BY t.episode --, and whatever other columns
There's no notation that will let you use t.* but not this one column. Once the sequence column is omitted from the output, the duplicates will become apparent.
Instead of typing out all 50 columns, you could do this:
select column_name from information_schema.columns where table_name = 'your table name'
then paste them into a query that groups by all of the columns EXCEPT sequence, and filters by count > 1:
select
count(episode)
, col1
, col2
, col3
, ...
from YourTable
group by
col1
, col2
, col3
, ...
having count(episode) > 1
This should give you a list of all the rows that share an episode number (though without the sequence or episode values themselves). Here's the rub: you will need to join this result set back to YourTable on ALL the columns except sequence and episode, since you don't have those columns here.
Here's where I like to use SQL to generate more SQL. This should get you started:
select 't1.' + column_name + ' = t2.' + column_name
from information_schema.columns where table_name = 'YourTable'
You'll plug in those join parameters to this query:
select * from YourTable t1
inner join (
select
count(episode) 'epcount'
, col1
, col2
, col3
, ...
from YourTable
group by
col1
, col2
, col3
, ...
having count(episode) > 1
) t2 on
...plug in all those join parameters here...
A SELECT COUNT(*) compared against a count of the distinct non-sequence columns should show you without having to guess. You can get your columns by viewing your table definition, so you can copy/paste your non-sequence columns.
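A sketch of that comparison (Table1 as in the question; col1 and col2 stand in for the rest of the non-Sequence columns):

SELECT
    (SELECT COUNT(*) FROM Table1) AS total_rows,
    (SELECT COUNT(*)
     FROM (SELECT DISTINCT Episode, col1, col2 FROM Table1) d) AS distinct_rows;

If the two numbers differ, some rows repeat identically except for Sequence, and the gap tells you how many.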
I think something like this is what you want:
select *
from t
where t.episode in (select episode from t group by episode having count(episode) > 1)
order by episode
This will give all rows that have episodes that are duplicated. Non-duplicate rows should stick out fairly obviously.
Of course, if you have access to some sort of scripting, you could just write a script to generate your query for you. It seems pretty straight-forward. (i.e. describe t and iterate over all the fields).
Also, your query should have some sort of ordering, like FROM Table1 a INNER JOIN Table1 b ON a.Episode = b.Episode AND a.Sequence < b.Sequence; otherwise each mismatched pair will show up twice.
A relatively simple solution that Ponies sparked:
SELECT t.*
FROM Table t
INNER JOIN ( SELECT episode
FROM Table
GROUP BY Episode
HAVING COUNT(*) > 1
) AS x ON t.episode = x.episode
And then, copy-paste into Excel, and use this as conditional highlighting for the entire result set:
=AND($C2=$C1,A2<>A1)
Column C is Episode. This way, you get a visual highlight when the data's different from the row above (as long as both rows have the same value for episode).
Generate and store a hash key for each row, designed so the hash values mirror your definition of sameness. Depending on the complexity of your rows, updating the hash might be a simple trigger on modifying the row. Query for duplicates of the hash key; those are your "very probably" identical rows.
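A minimal sketch of that idea in T-SQL, assuming SQL Server 2012+ for CONCAT, with col1..col3 standing in for the ~50 real non-Sequence columns:

-- Persisted computed column hashing everything except Sequence
ALTER TABLE Table1 ADD RowHash AS
    HASHBYTES('MD5', CONCAT(Episode, '|', col1, '|', col2, '|', col3)) PERSISTED;

-- Hash values occurring more than once mark the "very probably" identical rows
SELECT RowHash, COUNT(*) AS copies
FROM Table1
GROUP BY RowHash
HAVING COUNT(*) > 1;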

Corrupt SQL Server Index?

I'm encountering a very strange problem concerning what appears to be a corrupt index of some kind. Not corrupt in the sense that dbcc checkdb will pick it up, but corrupt in the sense that it has rows that it shouldn't have.
I have two tables, TableA and TableB. For the purposes of my application, some rows are considered functionally duplicate, meaning while not all the column values are the same, the row is treated as a dup by my app. To filter these out, I created a view, called vTableAUnique. The view is defined as follows:
SELECT a.*
FROM TableA a
INNER JOIN
(
SELECT ID, ROW_NUMBER() OVER
(PARTITION By Col1
ORDER BY Col1) AS Num
FROM TableA
) numbered ON numbered.ID = a.ID
WHERE numbered.Num = 1
The view returns one row from TableA for each distinct value of Col1. For this example, let's say that TableA has 10 total rows, of which only 8 show up in vTableAUnique.
TableB is basically just a list of values that match the values of Col1 from TableA. In this case, let's say that TableB contains all 8 unique values that appear in vTableAUnique (plus a couple that don't). So the data from TableA, TableB, and vTableAUnique would look like:
TableA (ID, Col1, Col2, Col3)
1,A,X,X
2,A,X,X
3,B,X,X
4,A,X,X
5,E,X,X
6,F,X,X
7,G,X,X
8,H,X,X
9,I,X,X
10,J,X,X
TableB (ID)
A
B
C
D
E
F
G
H
I
J
vTableAUnique (ID, Col1, Col2, Col3)
1,A,X,X
3,B,X,X
5,E,X,X
6,F,X,X
7,G,X,X
8,H,X,X
9,I,X,X
10,J,X,X
So here is the strange part. Sometimes when I join vTableAUnique with TableB on Col1, I get back the non-distinct values from TableA. In other words, rows that do NOT exist in vTableAUnique, but that do exist in TableA, appear when I do the join. If I do the select just off vTableAUnique, I don't get these rows. In this case, I would get back not just rows with the ids of 1,3,5,6,7,8,9,10, but ALSO rows with the ids of 2 and 4!
After banging my head against my desk, I decided to try rebuilding all the indexes in the DB. Sure enough, the problem disappeared; the same query now returned the correct rows. After an indeterminate period of time, however, the problem comes back. DBCC CHECKDB doesn't show any issues, and I'm having a hard time tracking down which index might be causing this.
I'm using SQL Server 2008 Developer Edition on Vista x64.
HELP!
ROW_NUMBER() OVER (PARTITION By Col1 ORDER BY Col1)
is not a stable sort order, it can change from query to query depending on access path.
Your view may return different results being run several times.
Rebuilding indexes seems to affect the sort order.
Use this:
ROW_NUMBER() OVER (PARTITION By Col1 ORDER BY Id)
instead, it guarantees stable sort order.
Script out the indexes and look at the script: was the index created with ALLOW_DUP_ROW? If so, that could be your problem.