delete duplicate rows but keep preferred row - sql

I have a simple database table
create table demo (
id integer PRIMARY KEY,
fv integer,
sv text,
rel_id integer,
FOREIGN KEY (rel_id)
REFERENCES demo(id));
and I want to delete all duplicate rows, grouped by fv and sv. That is already a fairly popular question with great answers.
But I need a twist on that scenario: if a group contains a row where rel_id is NULL, I want to keep that row. In any other case, anything goes.
So by using the following values
insert into demo (id,fv,sv,rel_id)
VALUES (1,1,'somestring',NULL),
(2,2,'somemorestring',1),
(3,1,'anotherstring',NULL),
(4,2,'somemorestring',3),
(5,1,'somestring',3)
Either
id | fv | sv | rel_id
---+----+------------------+-------
1 | 1 | 'somestring' | NULL
2 | 2 | 'somemorestring' | 1
3 | 1 | 'anotherstring' | NULL
or
id | fv | sv | rel_id
---+----+------------------+-------
1 | 1 | 'somestring' | NULL
3 | 1 | 'anotherstring' | NULL
4 | 2 | 'somemorestring' | 3
would be valid results, whereas
id | fv | sv | rel_id
---+----+------------------+-------
3 | 1 | 'anotherstring' | NULL
4 | 2 | 'somemorestring' | 3
5 | 1 | 'somestring' | 3
would not be, because the first entry had NULL as rel_id, which takes precedence over NOT NULL.
I currently have this (which is an answer on the basic duplicate question) as a query to remove duplicates but I am not sure how to continue to modify the query to fit my needs.
DELETE FROM demo
WHERE id NOT IN (SELECT min(id) as id
FROM demo
GROUP BY fv,sv)
The problem: as soon as the NOT NULL entry is inserted into the database before the NULL entry, the NOT NULL one has the smaller id and will not be deleted. It is guaranteed that rel_id always points to an entry where rel_id is NULL, so there is no danger of deleting a referenced entry. It is also guaranteed that no two rows in the same group have rel_id IS NULL, so each (fv, sv) combination has at most one row with rel_id IS NULL in the whole table.
Or as a basic algorithm:
Go over all rows and group them by fv and sv.
Look into each group for a row where rel_id IS NULL. If there is one, keep that row (and delete the rest). Otherwise pick one row of your choice and delete the rest.
sqlfiddle

I seem to have worked it out
DELETE FROM demo
WHERE id NOT IN (SELECT min(id) as id
FROM demo AS out_buff
WHERE rel_id IS NULL OR
NOT EXISTS (SELECT id FROM demo AS in_buff
WHERE rel_id IS NULL AND
in_buff.fv = out_buff.fv AND
in_buff.sv = out_buff.sv)
GROUP BY fv,sv);
The inner SELECT considers either the row whose rel_id is NULL or, when the group has no such row (the NOT EXISTS anti-condition), all rows matching on the GROUP BY columns. But my query looks really inefficient: a naive estimate would put the running time at O(n^2) or worse.
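One way to avoid the correlated NOT EXISTS is to let a single grouped pass choose the id to keep. The following is only a sketch, run through Python's sqlite3 module for convenience (SQLite is an assumption here, not necessarily the database from the question). The COALESCE/CASE expression and the reordered sample rows are illustrative additions, not part of the original query.

```python
import sqlite3

# One grouped pass picks the id to keep per (fv, sv) group: the id of the
# rel_id IS NULL row if the group has one, otherwise the smallest id.
DEDUPE_SQL = """
DELETE FROM demo
WHERE id NOT IN (
    SELECT COALESCE(MIN(CASE WHEN rel_id IS NULL THEN id END), MIN(id))
    FROM demo
    GROUP BY fv, sv
)
"""

def dedupe(rows):
    """Load rows into an in-memory demo table, dedupe, return remaining ids."""
    conn = sqlite3.connect(":memory:")
    # SQLite does not enforce foreign keys unless asked to, so the insert
    # order of referencing and referenced rows does not matter in this sketch.
    conn.execute(
        "CREATE TABLE demo (id INTEGER PRIMARY KEY, fv INTEGER, sv TEXT, "
        "rel_id INTEGER REFERENCES demo(id))"
    )
    conn.executemany("INSERT INTO demo VALUES (?, ?, ?, ?)", rows)
    conn.execute(DEDUPE_SQL)
    remaining = [r[0] for r in conn.execute("SELECT id FROM demo ORDER BY id")]
    conn.close()
    return remaining

# Sample rows from the question.
question_rows = [
    (1, 1, "somestring", None),
    (2, 2, "somemorestring", 1),
    (3, 1, "anotherstring", None),
    (4, 2, "somemorestring", 3),
    (5, 1, "somestring", 3),
]
# Hypothetical variant where the NOT NULL duplicate has the smaller id.
reordered_rows = [
    (1, 1, "somestring", 2),
    (2, 1, "somestring", None),
]

print(dedupe(question_rows))
print(dedupe(reordered_rows))
```

Because MIN ignores NULLs, the CASE branch yields the id of the NULL row when one exists and NULL otherwise, in which case COALESCE falls back to MIN(id). Whether this is actually faster than the NOT EXISTS version depends on the database and its indexes.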

Related

PostgreSQL add new not null column and fill with ids from insert statement

I've got 2 tables.
CREATE TABLE content (
id bigserial NOT NULL,
name text
);
CREATE TABLE data (
id bigserial NOT NULL,
...
);
The tables are already filled with a lot of data.
Now I want to add a new column content_id (NOT NULL) to the data table.
It should be a foreign key to the content table.
Is it possible to automatically create an entry in the content table for each data row and store its id as content_id in the data table?
For example
content
| id | name |
| 1 | abc |
| 2 | cde |
data
| id |... |
| 1 |... |
| 2 |... |
| 3 |... |
Now I need an update statement that creates 3 (in this example) content entries and adds their ids to the data table to get this result:
content
| id | name |
| 1 | abc |
| 2 | cde |
| 3 | ... |
| 4 | ... |
| 5 | ... |
data
| id |... | content_id |
| 1 |... | 3 |
| 2 |... | 4 |
| 3 |... | 5 |
demo:db<>fiddle
According to the answers presented here: How can I add a column that doesn't allow nulls in a Postgresql database?, there are several ways of adding a new NOT NULL column and filling it directly.
Basically there are 3 steps. Choose the best-fitting variant (with or without a transaction, setting a default value first and removing it afterwards, leaving out the NOT NULL constraint first and adding it afterwards, ...).
Step 1: Adding the new column (without the NOT NULL constraint, because the values of the new column are not available at this point). The type is bigint to match content.id, which is bigserial:
ALTER TABLE data ADD COLUMN content_id bigint;
Step 2: Inserting the data into both tables in a row:
WITH inserted AS ( -- 1
INSERT INTO content
SELECT
generate_series(
(SELECT MAX(id) + 1 FROM content),
(SELECT MAX(id) FROM content) + (SELECT COUNT(*) FROM data)
),
'dummy text'
RETURNING id
), matched AS ( -- 2
SELECT
d.id AS data_id,
i.id AS content_id
FROM (
SELECT
id,
row_number() OVER ()
FROM data
) d
JOIN (
SELECT
id,
row_number() OVER ()
FROM inserted
) i ON i.row_number = d.row_number
) -- 3
UPDATE data d
SET content_id = s.content_id
FROM (
SELECT * FROM matched
) s
WHERE d.id = s.data_id;
Executing several statements one after another, each using the results of the previous one, can be achieved using WITH clauses (CTEs). The numbers refer to the comments in the query:
1. Insert data into the content table: this generates an integer series starting at the MAX() + 1 value of content's current id values, with as many records as the data table. Afterwards the new ids are returned.
2. Now we need to match the current records of the data table with the new ids. For both sides, we use the row_number() window function to generate a consecutive row count for each record. Because the insert result and the data table have the same number of records, this can be used as the join criterion, so we can match the id column of the data table with the new content id values.
3. This matched data can be used in the final update of the new content_id column.
Step 3: Add the NOT NULL constraint
ALTER TABLE data ALTER COLUMN content_id SET NOT NULL;
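For comparison, the same outcome can be reached procedurally, one row at a time. This is a minimal sketch using Python's sqlite3 module rather than PostgreSQL, so it swaps the data-modifying CTE for a plain loop. The payload column and the function name backfill_content_ids are made up for the example, and Step 3 is left out.

```python
import sqlite3

def backfill_content_ids(conn):
    """Create one content row per data row and store its id in data.content_id."""
    # Step 1: add the column without NOT NULL, since it has no values yet.
    conn.execute(
        "ALTER TABLE data ADD COLUMN content_id INTEGER REFERENCES content(id)"
    )
    # Step 2: insert a content row for every data row and remember its new id.
    data_ids = [row[0] for row in conn.execute("SELECT id FROM data ORDER BY id")]
    for data_id in data_ids:
        cur = conn.execute("INSERT INTO content (name) VALUES (?)", ("dummy text",))
        conn.execute(
            "UPDATE data SET content_id = ? WHERE id = ?", (cur.lastrowid, data_id)
        )
    conn.commit()

# Sample tables from the question; "payload" stands in for the elided columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE content (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO content (id, name) VALUES (?, ?)", [(1, "abc"), (2, "cde")]
)
conn.executemany(
    "INSERT INTO data (id, payload) VALUES (?, ?)", [(1, "x"), (2, "y"), (3, "z")]
)

backfill_content_ids(conn)
mapping = conn.execute("SELECT id, content_id FROM data ORDER BY id").fetchall()
print(mapping)
```

The loop issues two statements per row, so for large tables the single-statement CTE approach above is likely the better fit. The sketch mainly shows what the CTE has to accomplish.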

Deleting duplicate rows with primary keys that are connected to other tables

A process was causing duplicate rows in a table where there were not supposed to be any. There are several great answers to deleting duplicate rows online. But, what if those duplicates with ID primary keys all have data in other tables tied to them?
Is there a way to delete all duplicates in the first table and migrate all data tied to those keys to the single PK ID that wasn't deleted?
For example:
TABLE 1
+-------+----------+----------+------------+
| ID(PK)| Model | ItemType | Color |
+-------+----------+----------+------------+
| 1 | 4 | B | Red |
| 2 | 4 | B | Red |
| 3 | 5 | A | Blue |
+-------+----------+----------+------------+
TABLE 2
+-------+----------+---------+
| ID(PK)| OtherID | Type |
+-------+----------+---------+
| 1 | 1 | Type1 |
| 2 | 1 | Type2 |
| 3 | 2 | Type3 |
| 4 | 2 | Type4 |
| 5 | 2 | Type5 |
+-------+----------+---------+
So I would theoretically want to delete the entry with ID: 2 from TABLE 1, and then have the OtherID fields in TABLE 2 switch to 1. This would actually be needed for X number of tables. This particular situation has 4 tables connected to its ID PK.
You cannot do this automatically. But you can do this with some queries. First, you set all the foreign keys to the correct id, which is presumably the smallest one:
with ids as (
select t1.*, min(id) over (partition by Model, ItemType, Color) as min_id
from table1 t1
)
update t2
set t2.otherid = ids.min_id
from table2 t2 join
ids
on t2.otherid = ids.id
where ids.id <> ids.min_id;
Then delete the ids that are either duplicated or not referenced in table2 (depending on which you actually want):
with ids as (
select t1.*, min(id) over (partition by Model, ItemType, Color) as min_id
from table1 t1
)
delete from ids
where id <> min_id;
Note: If the database has concurrent users, you might want to put it in single user mode for this operation or lock the tables so they are not modified during these two operations.
To do this right, you want to wrap everything in a single transaction and perform this during a regular maintenance period. Anything else could leave things as inconsistent as they are now.
Make a determination as to which "key" you will use.
Update all of the child tables to use the new "key" where the value is the old "key".
There should now be no FK dependencies on the duplicate records, so delete them.
Once all ambiguities are resolved, place a unique constraint on (ItemType,Color) (or whatever the real columns are).
If there are a lot of instances, you may need to write a script to handle this and use the information in sys.foreign_keys and sys.foreign_key_columns to determine which records to update and in which order.
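The repoint-then-delete sequence can be sketched end to end as follows. This uses Python's sqlite3 module instead of SQL Server, so the update is written as a correlated subquery rather than UPDATE ... FROM. The lowercase table and column names and the function name merge_duplicates are assumptions for the example, and only one child table is handled.

```python
import sqlite3

def merge_duplicates(conn):
    """Repoint child rows to the surviving parent, then delete the duplicates."""
    with conn:  # one transaction: commits on success, rolls back on error
        # Point every child row at the smallest id among its parent's duplicates.
        conn.execute("""
            UPDATE table2
            SET otherid = (
                SELECT MIN(k.id)
                FROM table1 AS d
                JOIN table1 AS k
                  ON k.model = d.model
                 AND k.itemtype = d.itemtype
                 AND k.color = d.color
                WHERE d.id = table2.otherid
            )
            WHERE otherid IN (SELECT id FROM table1)
        """)
        # Nothing references the duplicates any more, so they can be removed.
        conn.execute("""
            DELETE FROM table1
            WHERE id NOT IN (
                SELECT MIN(id) FROM table1 GROUP BY model, itemtype, color
            )
        """)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (id INTEGER PRIMARY KEY, model INTEGER, "
    "itemtype TEXT, color TEXT)"
)
conn.execute(
    "CREATE TABLE table2 (id INTEGER PRIMARY KEY, "
    "otherid INTEGER REFERENCES table1(id), type TEXT)"
)
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [(1, 4, "B", "Red"), (2, 4, "B", "Red"), (3, 5, "A", "Blue")],
)
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?, ?)",
    [(1, 1, "Type1"), (2, 1, "Type2"), (3, 2, "Type3"),
     (4, 2, "Type4"), (5, 2, "Type5")],
)

merge_duplicates(conn)
parents = [r[0] for r in conn.execute("SELECT id FROM table1 ORDER BY id")]
child_refs = [r[0] for r in conn.execute("SELECT otherid FROM table2 ORDER BY id")]
print(parents, child_refs)
```

With several child tables, the UPDATE would be repeated once per table inside the same transaction before the DELETE runs.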

Update statement to set a column based the maximum row of another table

I have a Family table:
SELECT * FROM Family;
id | Surname | Oldest | Oldest_Age
---+----------+--------+-------
1 | Byre | NULL | NULL
2 | Summers | NULL | NULL
3 | White | NULL | NULL
4 | Anders | NULL | NULL
The Family.Oldest column is not yet populated. There is another table of Children:
SELECT * FROM Children;
id | Name | Age | Family_FK
---+----------+------+--------
1 | Jake | 8 | 1
2 | Martin | 7 | 2
3 | Sarah | 10 | 1
4 | Tracy | 12 | 3
where many children (or no children) can be associated with one family. I would like to populate the Oldest column using an UPDATE ... SET ... statement that sets it to the Name and Oldest_Age of the oldest child in each family. Finding the name of each oldest child is a problem that is solved quite well here: How can I SELECT rows with MAX(Column value), DISTINCT by another column in SQL?
However, I don't know how to use the result of this in an UPDATE statement to update the column of an associated table using the h2 database.
The following is ANSI-SQL syntax that solves this problem:
update family f
set oldest = (select name
from children c
where c.family_fk = f.id
order by age desc
fetch first 1 row only
)
In h2, I think you would use limit 1 instead of fetch first 1 row only.
EDIT:
For two columns -- alas -- the solution is two subqueries:
update family f
set oldest = (select name
from children c
where c.family_fk = f.id
order by age desc
limit 1
),
oldest_age = (select age
from children c
where c.family_fk = f.id
order by age desc
limit 1
);
Some databases (such as SQL Server, Postgres, and Oracle) support lateral joins that can help with this. row_number() can also help solve this problem. Unfortunately, H2 doesn't support this functionality.
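The two-subquery update can be tried out on the sample data with a short script. This sketch uses Python's sqlite3 module rather than H2, so the exact syntax H2 accepts may differ. The tie-breaker on c.id is an addition so that both subqueries pick the same child when two children share the maximum age.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE family (id INTEGER PRIMARY KEY, surname TEXT, "
    "oldest TEXT, oldest_age INTEGER)"
)
conn.execute(
    "CREATE TABLE children (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, "
    "family_fk INTEGER REFERENCES family(id))"
)
conn.executemany(
    "INSERT INTO family (id, surname) VALUES (?, ?)",
    [(1, "Byre"), (2, "Summers"), (3, "White"), (4, "Anders")],
)
conn.executemany(
    "INSERT INTO children VALUES (?, ?, ?, ?)",
    [(1, "Jake", 8, 1), (2, "Martin", 7, 2), (3, "Sarah", 10, 1), (4, "Tracy", 12, 3)],
)

# One correlated subquery per target column; both use the same ordering so
# they select the same child row. Families without children get NULLs.
conn.execute("""
    UPDATE family
    SET oldest = (
            SELECT c.name FROM children AS c
            WHERE c.family_fk = family.id
            ORDER BY c.age DESC, c.id
            LIMIT 1
        ),
        oldest_age = (
            SELECT c.age FROM children AS c
            WHERE c.family_fk = family.id
            ORDER BY c.age DESC, c.id
            LIMIT 1
        )
""")
conn.commit()

families = conn.execute(
    "SELECT id, surname, oldest, oldest_age FROM family ORDER BY id"
).fetchall()
print(families)
```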

Oracle Hierarchical query with condition on the whole tree

I need, using a hierarchical (or other) query, to select tree-structured data where a certain condition must hold for the whole tree (i.e. all the nodes in the tree).
That means that if a single node of a tree violates the condition, then the tree is not selected at all (not even the other nodes of that tree that do comply with the condition, so the complete tree is thrown away).
Also I want to select all such trees, with all of their nodes, where the condition holds for every node (i.e. not just one such tree but all of them).
EDIT:
Consider this example of a table of files that are connected to each other through the parent_id column so they form trees. There is also a foreign key owner_id, which references another table's primary key.
PK file_id | name | parent_id | owner_id
----------------------------------------
1 | a1 | null | null -- root of one tree
2 | b1 | 1 | null
3 | c1 | 1 | null
4 | d1 | 2 | 100
5 | a2 | null | null -- root of another tree
6 | b2 | 5 | null
7 | c2 | 6 | null
8 | d2 | 7 | null
Column parent_id has a foreign key constraint to file_id column (making the hierarchies).
And there is one more table (let's call it the junction table) where (among others) the foreign keys file_id are stored in a many-to-one relationship to the table of files above:
FK file_id | other data
-----------------------
1 | ...
1 | ...
3 | ...
Now the query I need is to select all such whole trees of files where following conditions are met for each and every file in that tree:
owner_id of the file is null
and the file has no related records in the junction table (there are no records referencing the file by file_id FK)
For the example above, the query should result in:
file_id | name | parent_id | owner_id
---------------------------------------
5 | a2 | null | null
6 | b2 | 5 | null
7 | c2 | 6 | null
8 | d2 | 7 | null
All nodes make a whole tree as it is in the table (no missing children or parents) and each of the nodes holds to the conditions above (has no owner and no relation in junction table).
This generates the tree with a simple hierarchical query - which is really only needed to establish the root file_id for each row - while joining to junction to check for a record there. That can get duplicates, which is OK at that stage. The analytic version of max() is then applied to the intermediate result set to determine whether your conditions are met for any row with the same root:
select file_id, name, parent_id, owner_id
from (
select file_id, name, parent_id, owner_id,
max(j_id) over (partition by root_id) as max_j_id,
max(owner_id) over (partition by root_id) as max_o_id
from (
select f.*, j.file_id as j_id,
connect_by_root f.file_id as root_id
from files f
left outer join junction j
on j.file_id = f.file_id
connect by prior f.file_id = f.parent_id
start with f.parent_id is null
)
)
where max_j_id is null
and max_o_id is null
order by file_id;
FILE_ID NAME PARENT_ID OWNER_ID
--------- ------ ----------- ----------
5 a2 (null) (null)
6 b2 5 (null)
7 c2 6 (null)
8 d2 7 (null)
The innermost query gets the root and any matching junction records (with duplicates). The next level adds the analytic max owner and junction value (if there is one), giving the same result to every row for the same root. The outer query then filters out any rows which have either value for any row.
SQL Fiddle.
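Outside Oracle, where CONNECT BY and connect_by_root are not available, the same idea (tag every row with its root, then reject roots that have any offending row) can be expressed with a recursive CTE. The sketch below runs on SQLite through Python's sqlite3 module. The CTE names tree and bad_roots are made up for the example, and it replaces the analytic max() with a NOT IN filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (file_id INTEGER PRIMARY KEY, name TEXT, "
    "parent_id INTEGER REFERENCES files(file_id), owner_id INTEGER)"
)
conn.execute("CREATE TABLE junction (file_id INTEGER REFERENCES files(file_id))")
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?)",
    [
        (1, "a1", None, None), (2, "b1", 1, None), (3, "c1", 1, None),
        (4, "d1", 2, 100),
        (5, "a2", None, None), (6, "b2", 5, None), (7, "c2", 6, None),
        (8, "d2", 7, None),
    ],
)
conn.executemany("INSERT INTO junction VALUES (?)", [(1,), (1,), (3,)])

WHOLE_TREES_SQL = """
WITH RECURSIVE tree(file_id, root_id) AS (
    -- Roots are their own root.
    SELECT file_id, file_id FROM files WHERE parent_id IS NULL
    UNION ALL
    -- Children inherit the root of their parent.
    SELECT f.file_id, t.root_id
    FROM files AS f
    JOIN tree AS t ON f.parent_id = t.file_id
),
bad_roots AS (
    -- Roots of trees containing at least one node that breaks a condition.
    SELECT DISTINCT t.root_id
    FROM tree AS t
    JOIN files AS f ON f.file_id = t.file_id
    WHERE f.owner_id IS NOT NULL
       OR EXISTS (SELECT 1 FROM junction AS j WHERE j.file_id = f.file_id)
)
SELECT f.file_id, f.name, f.parent_id, f.owner_id
FROM files AS f
JOIN tree AS t ON t.file_id = f.file_id
WHERE t.root_id NOT IN (SELECT root_id FROM bad_roots)
ORDER BY f.file_id
"""

whole_trees = conn.execute(WHOLE_TREES_SQL).fetchall()
print(whole_trees)
```

On the sample data the first tree is rejected because file 4 has an owner and files 1 and 3 appear in the junction table, so only the second tree survives.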

Removing duplicate SQL records to permit a unique key

I have a table ('sales') in a MySQL DB which should rightfully have had a unique constraint enforced to prevent duplicates. Removing the dupes first and then setting the constraint is proving a bit tricky.
Table structure (simplified):
id (unique, autoinc)
product_id
The goal is to enforce uniqueness for product_id. The de-duping policy I want to apply is to remove all duplicate records except the most recently created, i.e. the one with the highest id.
Or to put another way, I would like to delete only duplicate records, excluding the ids matched by the following query whilst also preserving the existing non-duped records:
select id
from sales s
inner join (select product_id,
max(id) as maxId
from sales
group by product_id
having count(product_id) > 1) groupedByProdId on s.product_id = groupedByProdId.product_id
and s.id = groupedByProdId.maxId
I've struggled with this on two fronts: writing the query to select the correct records to delete, and the MySQL restriction that a subquery in a DELETE cannot select from the same table that rows are being removed from.
I checked out this answer and it seemed to deal with the subject, but it seems specific to sql-server, though I wouldn't rule out this question duplicating another.
In reply to your comment, here's a query that works in MySQL:
delete YourTable
from YourTable
inner join YourTable yt2
on YourTable.product_id = yt2.product_id
and YourTable.id < yt2.id
This would only remove duplicate rows. The inner join will filter out the latest row for each product, even if no other rows for the same product exist.
P.S. If you try to alias the table after FROM, MySQL requires you to specify the name of the database, like:
delete <DatabaseName>.yt
from YourTable yt
inner join YourTable yt2
on yt.product_id = yt2.product_id
and yt.id < yt2.id;
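Perhaps use ALTER IGNORE TABLE ... ADD UNIQUE KEY. Note that the IGNORE clause of ALTER TABLE was removed in MySQL 5.7.4, so this only works on older versions.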
For example:
describe sales;
+------------+---------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+---------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| product_id | int(11) | NO | | NULL | |
+------------+---------+------+-----+---------+----------------+
select * from sales;
+----+------------+
| id | product_id |
+----+------------+
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 3 |
| 5 | 3 |
| 6 | 2 |
+----+------------+
ALTER IGNORE TABLE sales ADD UNIQUE KEY idx1(product_id), ORDER BY id DESC;
Query OK, 6 rows affected (0.03 sec)
Records: 6 Duplicates: 3 Warnings: 0
select * from sales;
+----+------------+
| id | product_id |
+----+------------+
| 6 | 2 |
| 5 | 3 |
| 2 | 1 |
+----+------------+
See this pythian post for more information.
Note that the ids end up in reverse order. I don't think this matters, since order of the ids should not matter in a database (as far as I know!). If this displeases you however, the post linked to above shows a way to solve this problem too. However, it involves creating a temporary table which requires more hard drive space than the in-place method I posted above.
I might do the following in sql-server to eliminate the duplicates:
DELETE FROM Sales
FROM Sales
INNER JOIN Sales b ON Sales.product_id = b.product_id AND Sales.id < b.id
It looks like the analogous delete statement for mysql might be:
DELETE FROM Sales
USING Sales
INNER JOIN Sales b ON Sales.product_id = b.product_id AND Sales.id < b.id
This type of problem is easier to solve with CTEs and Ranking functions, however, you should be able to do something like the following to solve your problem:
Delete Sales
Where Exists(
Select 1
From Sales As S2
Where S2.product_id = Sales.product_id
And S2.id > Sales.Id
Having Count(*) > 0
)
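The delete-then-constrain sequence from the answers above can be checked on the sample data with a short script. This is a sketch on SQLite through Python's sqlite3 module, not MySQL. It uses the EXISTS form because SQLite has no DELETE ... JOIN, and the index name idx_sales_product is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "product_id INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO sales (id, product_id) VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 2), (4, 3), (5, 3), (6, 2)],
)

with conn:  # delete and index creation succeed or fail together
    # Remove every row for which a newer row (higher id) with the same
    # product_id exists; the newest row per product is never matched.
    conn.execute("""
        DELETE FROM sales
        WHERE EXISTS (
            SELECT 1 FROM sales AS s2
            WHERE s2.product_id = sales.product_id
              AND s2.id > sales.id
        )
    """)
    # With the duplicates gone, the unique constraint can be added.
    conn.execute("CREATE UNIQUE INDEX idx_sales_product ON sales (product_id)")

remaining = conn.execute("SELECT id, product_id FROM sales ORDER BY id").fetchall()
print(remaining)

# A new duplicate is now rejected by the unique index.
try:
    conn.execute("INSERT INTO sales (product_id) VALUES (1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```

Unlike the ALTER IGNORE approach, this keeps the surviving rows in their original id order.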