Deleting duplicate rows with primary keys that are connected to other tables - SQL

A process was causing duplicate rows in a table where there were not supposed to be any. There are several great answers online about deleting duplicate rows. But what if those duplicates' ID primary keys all have data in other tables tied to them?
Is there a way to delete all duplicates in the first table and migrate all data tied to those keys to the single PK ID that wasn't deleted?
For example:
TABLE 1
+-------+----------+----------+------------+
| ID(PK)| Model | ItemType | Color |
+-------+----------+----------+------------+
| 1 | 4 | B | Red |
| 2 | 4 | B | Red |
| 3 | 5 | A | Blue |
+-------+----------+----------+------------+
TABLE 2
+-------+----------+---------+
| ID(PK)| OtherID | Type |
+-------+----------+---------+
| 1 | 1 | Type1 |
| 2 | 1 | Type2 |
| 3 | 2 | Type3 |
| 4 | 2 | Type4 |
| 5 | 2 | Type5 |
+-------+----------+---------+
So I would theoretically want to delete the entry with ID: 2 from TABLE 1, and then have the OtherID fields in TABLE 2 switch to 1. This would actually be needed for X number of tables. This particular situation has 4 tables connected to its ID PK.

You cannot do this automatically, but you can do it with a couple of queries. First, set all the foreign keys to the correct id, which is presumably the smallest one:
with ids as (
     select t1.*,
            min(id) over (partition by Model, ItemType, Color) as min_id
     from table1 t1
)
update t2
set t2.otherid = ids.min_id
from table2 t2 join
     ids
     on t2.otherid = ids.id
where ids.id <> ids.min_id;
Then delete the ids that are either duplicated or not referenced in table2 (depending on which you actually want):
with ids as (
     select t1.*,
            min(id) over (partition by Model, ItemType, Color) as min_id
     from table1 t1
)
delete from ids
where id <> min_id;
Note: If the database has concurrent users, you might want to put it in single user mode for this operation or lock the tables so they are not modified during these two operations.
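For example, a minimal sketch of the locking route, assuming SQL Server; the dummy SELECTs exist only to take exclusive table locks, which HOLDLOCK keeps until the commit:
BEGIN TRANSACTION;

-- take exclusive locks on both tables for the rest of the transaction
SELECT COUNT(*) FROM table1 WITH (TABLOCKX, HOLDLOCK);
SELECT COUNT(*) FROM table2 WITH (TABLOCKX, HOLDLOCK);

-- ... run the UPDATE and DELETE from above here ...

COMMIT TRANSACTION;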

To do this right, you want to wrap everything in a single transaction and perform it during a regular maintenance period. Anything else could leave things as inconsistent as they are now.
1. Make a determination as to which "key" you will use.
2. Update all of the child tables to use the new "key" wherever they hold the old "key".
3. Once there are no FK dependencies left on the duplicate records, delete them.
4. Once all ambiguities are resolved, place a unique constraint on (ItemType, Color) (or whatever the real columns are).
If there are a lot of instances, you may need to write a script to handle this, using the information in sys.foreign_keys and sys.foreign_key_columns to determine which records to update and in which order, as sketched below.
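For example, a hedged starting point for such a script is the catalog query below (SQL Server; 'table1' stands in for the real parent table), which lists every table and column referencing the duplicated key:
SELECT OBJECT_NAME(fk.parent_object_id)     AS referencing_table,
       c.name                               AS referencing_column,
       OBJECT_NAME(fk.referenced_object_id) AS referenced_table
FROM sys.foreign_keys fk
JOIN sys.foreign_key_columns fkc
  ON fkc.constraint_object_id = fk.object_id
JOIN sys.columns c
  ON c.object_id = fkc.parent_object_id
 AND c.column_id = fkc.parent_column_id
WHERE fk.referenced_object_id = OBJECT_ID('table1');
Each row returned is a child table whose foreign key column needs the UPDATE treatment before the duplicates can be deleted.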

Related

delete duplicate rows but keep preferred row

I have a simple database table
create table demo (
    id integer PRIMARY KEY,
    fv integer,
    sv text,
    rel_id integer,
    FOREIGN KEY (rel_id)
        REFERENCES demo(id));
and I want to delete all duplicate rows grouped by fv and sv, which is already a fairly popular question with great answers.
But I need a twist on that scenario: in cases where rel_id is NULL, I want to keep that row. In any other case, anything goes.
So by using the following values
insert into demo (id,fv,sv,rel_id)
VALUES (1,1,'somestring',NULL),
       (2,2,'somemorestring',1),
       (3,1,'anotherstring',NULL),
       (4,2,'somemorestring',3),
       (5,1,'somestring',3);
Either
id | fv | sv | rel_id
---+----+------------------+-------
1 | 1 | 'somestring' | NULL
2 | 2 | 'somemorestring' | 1
3 | 1 | 'anotherstring' | NULL
or
id | fv | sv | rel_id
---+----+------------------+-------
1 | 1 | 'somestring' | NULL
3 | 1 | 'anotherstring' | NULL
4 | 2 | 'somemorestring' | 3
would be valid results, whereas
id | fv | sv | rel_id
---+----+------------------+-------
3 | 1 | 'anotherstring' | NULL
4 | 2 | 'somemorestring' | 3
5 | 1 | 'somestring' | 3
would not be, as the first entry had NULL as rel_id, which takes precedence over NOT NULL.
I currently have this query (an answer to the basic duplicate question) to remove duplicates, but I am not sure how to modify it to fit my needs:
DELETE FROM demo
WHERE id NOT IN (SELECT min(id) as id
                 FROM demo
                 GROUP BY fv, sv)
As soon as a NOT NULL entry is inserted into the database before the NULL entry, it is the NOT NULL one that gets kept. It is guaranteed that rel_id will always point to an entry where rel_id is NULL, so there is no danger of deleting a referenced entry. It is further guaranteed that there will be no two rows in the same group with rel_id IS NULL; therefore a row with rel_id IS NULL is unique for the whole table.
Or as a basic algorithm:
Go over all rows and group them by fv and sv
Look into each group for a row where rel_id IS NULL. If there is one, keep that row (and delete the rest). Else pick one row of your choice and delete the rest.
sqlfiddle
I seem to have worked it out
DELETE FROM demo
WHERE id NOT IN (SELECT min(id) as id
                 FROM demo AS out_buff
                 WHERE rel_id IS NULL OR
                       NOT EXISTS (SELECT id FROM demo AS in_buff
                                   WHERE rel_id IS NULL AND
                                         in_buff.fv = out_buff.fv AND
                                         in_buff.sv = out_buff.sv)
                 GROUP BY fv, sv);
The inner SELECT keeps either only the row whose rel_id is NULL, or, via the anti-condition on the existence of such a row, all rows matching the GROUP BY columns. But the query looks really inefficient; a naive estimate puts the running time at at least O(n^2).
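For what it's worth, if your SQLite is 3.25 or newer (window function support), the NULL-first preference can be expressed in a single ranking pass instead of the correlated subquery; a sketch under that assumption:
DELETE FROM demo
WHERE id NOT IN (SELECT id
                 FROM (SELECT id,
                              ROW_NUMBER() OVER (
                                  PARTITION BY fv, sv
                                  -- (rel_id IS NOT NULL) is 0 for NULL rows, so they sort first
                                  ORDER BY (rel_id IS NOT NULL), id) AS rn
                       FROM demo)
                 WHERE rn = 1);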

Is self-join the way to go on BigQuery when fetching data from multiple repeated fields?

Consider this schema:
key: REQUIRED INTEGER
description: NULLABLE STRING
field: REPEATED RECORD {
field.names: REQUIRED STRING
field.value: NULLABLE FLOAT
}
Where: key is unique per table, and field.names is actually a comma-separated list of properties ("property1","property2","property3"...).
Sample dataset (don't pay attention to the actual values, they are only for demonstration of the structure):
{"key":1,"description":"Cool","field":[{"names":"\"Nice\",\"Wonderful\",\"Woohoo\"", "value":1.2},{"names":"\"Everything\",\"is\",\"Awesome\"", "value":20}]}
{"key":2,"description":"Stack","field":[{"names":"\"Overflow\",\"Exchange\",\"Nice\"", "value":2.0}]}
{"key":3,"description":"Iron","field":[{"names":"\"The\",\"Trooper\"", "value":666},{"names":"\"Aces\",\"High\",\"Awesome\"", "value":333}]}
What I need is a way to query for the values of multiple field.names at once. The output should be like this:
+-----+--------+-------+-------+-------+-------+
| key | desc | prop1 | prop2 | prop3 | prop4 |
+-----+--------+-------+-------+-------+-------+
| 1 | Desc 1 | 1.0 | 2.0 | 3.0 | 4.0 |
| 2 | Desc 2 | 4.0 | 3.0 | 2.0 | 1.0 |
| ... | | | | | |
+-----+--------+-------+-------+-------+-------+
If the same key contains fields with the same queried name, only the first value should be considered.
And here is my query so far:
select all.key as key, all.description as desc,
       t1.col as prop1, t2.col as prop2, t3.col as prop3 // and so on...
from mydataset.mytable all
left join each
     (select key, field.value as col
      from mydataset.mytable
      where lower(field.names) contains '"trooper"'
      group each by key, col
     ) as t1 on all.key = t1.key
left join each
     (select key, field.value as col
      from mydataset.mytable
      where lower(field.names) contains '"awesome"'
      group each by key, col
     ) as t2 on all.key = t2.key
left join each
     (select key, field.value as col
      from mydataset.mytable
      where lower(field.names) contains '"nice"'
      group each by key, col
     ) as t3 on all.key = t3.key
// and so on...
The output of this query would be:
+-----+-------+-------+-------+-------+
| key | desc | prop1 | prop2 | prop3 |
+-----+-------+-------+-------+-------+
| 1 | Cool | null | 20.0 | 1.2 |
| 2 | Stack | null | null | 2.0 |
| 3 | Iron | 666.0 | 333.0 | null |
+-----+-------+-------+-------+-------+
So my question is: is this the way to go? If my user wants, let's say, 200 properties from my table, should I just make 200 self-joins? Is that scalable, considering the table can grow to billions of rows? Is there another way to do the same thing in BigQuery?
Thanks.
Generally speaking, a query with more than 50 joins can start to become problematic, particularly if you're joining large tables. Even with repeated fields, you want to try to scan your tables in one pass wherever possible.
It's useful to note that when you query a table with a repeated field, you are really querying a semi-flattened representation of that table. You can pretend that each repetition is its own row, and apply filters, expressions, and grouping accordingly.
In this case, I think you can probably get away with a single scan:
select
  key,
  description,
  max(if(lower(field.names) contains "trooper", field.value, null))
      within record as prop1,
  max(if(lower(field.names) contains "awesome", field.value, null))
      within record as prop2,
  ...
from mydataset.mytable
In this case, each "prop" field just selects the value corresponding to each desired field name, or null if it doesn't exist, and then aggregates those results using the "max" function. I'm assuming that there's only one occurrence of a field name per key, in which case the specific aggregation function doesn't matter much, since it only exists to collapse nulls. But obviously you should swap it for something more appropriate if needed.
The "within record" syntax tells BigQuery to perform those aggregations only over the repeated fields within a record, and not across the entire table, thus eliminating the need for a "group by" clause at the end.

Oracle Hierarchical query with condition on the whole tree

I need, using a hierarchical (or other) query, to select tree-structured data where a certain condition must hold for the whole tree (i.e. for all the nodes in the tree).
That means that if a single node of a tree violates the condition, then the tree is not selected at all (not even the other nodes of that tree that do comply with the condition; the complete tree is thrown away).
Also I want to select all such trees - all the nodes of every tree where the condition holds for every node (i.e. select not just one such tree but all such trees).
EDIT:
Consider this example of a table of files that are connected to each other through the parent_id column so that they form trees. There is also a foreign key owner_id, which references another table's primary key.
PK file_id | name | parent_id | owner_id
----------------------------------------
1 | a1 | null | null -- root of one tree
2 | b1 | 1 | null
3 | c1 | 1 | null
4 | d1 | 2 | 100
5 | a2 | null | null -- root of another tree
6 | b2 | 5 | null
7 | c2 | 6 | null
8 | d2 | 7 | null
Column parent_id has a foreign key constraint to the file_id column (making the hierarchies).
And there is one more table (let's call it the junction table) where (among other data) foreign keys file_id are stored in a many-to-one relationship to the table of files above:
FK file_id | other data
-----------------------
1 | ...
1 | ...
3 | ...
Now the query I need is to select all such whole trees of files where the following conditions are met for each and every file in that tree:
owner_id of the file is null
and the file has no related records in the junction table (there are no records referencing the file by file_id FK)
For the example above, the query should result in:
file_id | name | parent_id | owner_id
---------------------------------------
5 | a2 | null | null
6 | b2 | 5 | null
7 | c2 | 6 | null
8 | d2 | 7 | null
All nodes make up a whole tree as it is in the table (no missing children or parents) and each of the nodes satisfies the conditions above (has no owner and no relation in the junction table).
This generates the tree with a simple hierarchical query - which is really only needed to establish the root file_id for each row - while joining to junction to check for a record there. That can produce duplicates, which is OK at that stage. The analytic version of max() is then applied to the intermediate result set to determine whether your conditions are met for any row with the same root:
select file_id, name, parent_id, owner_id
from (
  select file_id, name, parent_id, owner_id,
         max(j_id) over (partition by root_id) as max_j_id,
         max(owner_id) over (partition by root_id) as max_o_id
  from (
    select f.*, j.file_id as j_id,
           connect_by_root f.file_id as root_id
    from files f
    left outer join junction j
      on j.file_id = f.file_id
    connect by prior f.file_id = f.parent_id
    start with f.parent_id is null
  )
)
where max_j_id is null
and max_o_id is null
order by file_id;
FILE_ID NAME PARENT_ID OWNER_ID
--------- ------ ----------- ----------
5 a2 (null) (null)
6 b2 5 (null)
7 c2 6 (null)
8 d2 7 (null)
The innermost query gets the root and any matching junction records (with duplicates). The next level adds the analytic max owner and junction value (if there is one), giving the same result to every row for the same root. The outer query then filters out any rows which have either value for any row.
SQL Fiddle.

Need Strategy To Generate Snapshot Of Table Contents After a Series of Random Inserts

There are 10 rooms, each with a set of inventory items. When an item is added to or deleted from a room, a new row gets inserted into an MS-SQL table. I need the latest update for each room.
Take this series of inserts:
id| room| descriptor1| descriptor2| descriptor3|
1 | A   | blue       | 2          | large      |
2 | B   | red        | 1          | small      |
3 | A   | blue       | 1          | large      |
What the resulting table needs to show:
room| descriptor1| descriptor2| descriptor3|
A   | blue       | 1          | large      |
B   | red        | 1          | small      |
Ideally, I would write a trigger that would update a room status table; I could then just query that table (SELECT *) to obtain the result. However, the source table does not belong to me - I only have read access to a constantly updated table - so I need to poll periodically or when I need a report.
How do I do this in MS-SQL? I have some inkling of how I would do it to obtain the status of just one room, something like:
SELECT descriptor1, descriptor2, descriptor3
FROM myTable mt1
WHERE id = (SELECT MAX(id)
            FROM myTable mt2
            WHERE room = 'A');
Since I have 10 rooms, I would need to run this query 10 times. Can this be narrowed down to a single query? What happens when there are 100 rooms? Is there a better way?
Thanks!
Matt
You were very close:
SELECT room, descriptor1, descriptor2, descriptor3
FROM myTable mt1
WHERE id IN (SELECT MAX(id)
             FROM myTable
             GROUP BY room);
Instead of creating a trigger to update a static table, you should look into creating a view.
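For example, a minimal sketch of such a view ('RoomStatus' is a made-up name; the table and columns come from the question):
CREATE VIEW RoomStatus AS
SELECT room, descriptor1, descriptor2, descriptor3
FROM myTable
WHERE id IN (SELECT MAX(id)
             FROM myTable
             GROUP BY room);
A SELECT * FROM RoomStatus then always reflects the latest insert per room, with no polling logic of your own.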

Removing duplicate SQL records to permit a unique key

I have a table ('sales') in a MySQL DB which should rightfully have had a unique constraint enforced to prevent duplicates. Removing the dupes first in order to set the constraint is proving a bit tricky.
Table structure (simplified):
'id (unique, autoinc)'
product_id
The goal is to enforce uniqueness for product_id. The de-duping policy I want to apply is to remove all duplicate records except the most recently created, i.e. the one with the highest id.
Or to put it another way, I would like to delete only duplicate records, excluding the ids matched by the following query, whilst also preserving the existing non-duped records:
select id
from sales s
inner join (select product_id,
                   max(id) as maxId
            from sales
            group by product_id
            having count(product_id) > 1) groupedByProdId
        on s.product_id = groupedByProdId.product_id
       and s.id = groupedByProdId.maxId
I've struggled with this on two fronts - writing the query to select the correct records to delete, and then the constraint in MySQL whereby a subselect in the FROM clause of a DELETE cannot reference the same table from which data is being removed.
I checked out this answer and it seemed to deal with the subject, but it seems specific to SQL Server, though I wouldn't rule out this question duplicating another.
In reply to your comment, here's a query that works in MySQL:
delete YourTable
from YourTable
inner join YourTable yt2
     on YourTable.product_id = yt2.product_id
     and YourTable.id < yt2.id
This only removes duplicate rows: the join condition never matches the latest row for each product, and rows for products with no duplicates match nothing at all, so both are kept.
P.S. If you alias the table after FROM, MySQL requires the delete target to be the alias:
delete yt
from YourTable yt
inner join YourTable yt2
     on yt.product_id = yt2.product_id
     and yt.id < yt2.id;
Perhaps use ALTER IGNORE TABLE ... ADD UNIQUE KEY.
For example:
describe sales;
+------------+---------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+---------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| product_id | int(11) | NO | | NULL | |
+------------+---------+------+-----+---------+----------------+
select * from sales;
+----+------------+
| id | product_id |
+----+------------+
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 3 |
| 5 | 3 |
| 6 | 2 |
+----+------------+
ALTER IGNORE TABLE sales ADD UNIQUE KEY idx1(product_id), ORDER BY id DESC;
Query OK, 6 rows affected (0.03 sec)
Records: 6 Duplicates: 3 Warnings: 0
select * from sales;
+----+------------+
| id | product_id |
+----+------------+
| 6 | 2 |
| 5 | 3 |
| 2 | 1 |
+----+------------+
See this pythian post for more information.
Note that the ids end up in reverse order. I don't think this matters, since the order of the ids should not matter in a database (as far as I know!). If this displeases you, the post linked above shows a way to solve that problem too, but it involves creating a temporary table, which requires more hard drive space than the in-place method I posted above.
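One caveat: ALTER IGNORE TABLE was deprecated and then removed in MySQL 5.7, so on newer servers a rough equivalent (assuming the same sales table as above) is to delete the duplicates first and then add a plain unique key:
-- keep the highest id per product_id
DELETE s1
FROM sales s1
INNER JOIN sales s2
   ON s1.product_id = s2.product_id
  AND s1.id < s2.id;

ALTER TABLE sales ADD UNIQUE KEY idx1 (product_id);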
I might do the following in SQL Server to eliminate the duplicates:
DELETE FROM Sales
FROM Sales
INNER JOIN Sales b ON Sales.product_id = b.product_id AND Sales.id < b.id
It looks like the analogous delete statement for MySQL might be:
DELETE FROM Sales
USING Sales
INNER JOIN Sales b ON Sales.product_id = b.product_id AND Sales.id < b.id
This type of problem is easier to solve with CTEs and ranking functions; however, you should be able to do something like the following to solve your problem:
Delete Sales
Where Exists (
    Select 1
    From Sales As S2
    Where S2.product_id = Sales.product_id
      And S2.id > Sales.id
)
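For completeness, here is a minimal sketch of the CTE/ranking approach alluded to above; this is SQL Server syntax (which allows deleting through a CTE) and is offered only as an illustration, since the question itself is about MySQL:
WITH Ranked AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY product_id
                              ORDER BY id DESC) AS rn
    FROM Sales
)
DELETE FROM Ranked  -- rows are deleted from the underlying Sales table
WHERE rn > 1;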