A process was causing duplicate rows in a table where there were not supposed to be any. There are several great answers to deleting duplicate rows online. But, what if those duplicates with ID primary keys all have data in other tables tied to them?
Is there a way to delete all duplicates in the first table and migrate all data tied to those keys to the single PK ID that wasn't deleted?
For example:
TABLE 1
+-------+----------+----------+------------+
| ID(PK)| Model | ItemType | Color |
+-------+----------+----------+------------+
| 1 | 4 | B | Red |
| 2 | 4 | B | Red |
| 3 | 5 | A | Blue |
+-------+----------+----------+------------+
TABLE 2
+-------+----------+---------+
| ID(PK)| OtherID | Type |
+-------+----------+---------+
| 1 | 1 | Type1 |
| 2 | 1 | Type2 |
| 3 | 2 | Type3 |
| 4 | 2 | Type4 |
| 5 | 2 | Type5 |
+-------+----------+---------+
So I would theoretically want to delete the entry with ID: 2 from TABLE 1, and then have the OtherID fields in TABLE 2 switch to 1. This would actually be needed for X number of tables. This particular situation has 4 tables connected to its ID PK.
You cannot do this automatically. But you can do this with some queries. First, you set all the foreign keys to the correct id, which is presumably the smallest one:
with ids (
select t1.*, min(id) over (partition by Model, ItemType, Color) as min_id
from table1 t1
)
update t2
set t2.otherid = ids.min_id
from table2 t2 join
ids
on t2.otherid = ids.id
where ids.id <> ids.min_id;
Then delete the ids that are either duplicated or not referenced in table2 (depending on which you actually want):
with ids (
select t1.*, min(id) over (partition by Model, ItemType, Color) as min_id
from table1 t1
)
delete from ids
where id <> min_id;
Note: If the database has concurrent users, you might want to put it in single user mode for this operation or lock the tables so they are not modified during these two operations.
To do this right, you want to wrap everything in a single transaction and perform this during a regular maintenance period. Anything else could leave things as inconsistent as they are now.
Make a determination as to which "key" you will use.
Update all of the child tables to use the new "key" where the value is the old "key".
There should be no FK dependencies on the duplicate records, delete them.
Once all ambiguities are resolved, place an unique constraint on (ItemType,Color) (or whatever the real columns are).
If there are a lot of instances, you may need to write a script to handle this and use the information in sys.foreign_keys and sys.foreign_key_columns to determine which records to update and in which order.
I have two basic entities: financial plan and purchase request. Theese two entities are in many-to-many relationship:
CREATE TABLE FinancialPlan
(
ID int NOT NULL,
PRIMARY KEY (ID)
);
CREATE TABLE PurchaseRequest
(
ID int NOT NULL,
PRIMARY KEY (ID)
);
CREATE TABLE FP_PR
(
FP_ID FOREIGN KEY REFERENCES FinancialPlan(ID),
PR_ID FOREIGN KEY REFERENCES PurchaseRequest(ID)
);
Problem: find all requests, related to specified plan, and all plans, related to requests, related to specified plan, ...
Model could be represented as a graph, where each node represents a plan, or a request, and each edge represents a relationship, then the problem could be rephrased as find connected component, which specified node belongs to.
Example:
Plan Request FP_PR
ID | ID | FP_ID|PR_ID|
----| ----| -----|-----|
1 | 1 | 1 |1 |
2 | 2 | 2 |1 |
3 | 3 | 2 |2 |
4 | 3 |2 |
5 | 4 |2 |
5 |3 |
Find connected component of finplan ID=1
Desired output:
FP_ID | PR_ID|
------+------+
1 | 1 |
2 | 1 |
2 | 2 |
3 | 2 |
4 | 2 |
I am currently doing it recursively on app side, which may generate to many requests and hang the DB server, could this be done with some recursive DB approach?
Visualization:
Starting entity is marked by arrow.
Desired output is circled.
SQL Server solution
I guess the main problem is you need to compare by PR_ID then FP_ID. So in recursive part there must be a CASE statement. On 1 run we take data by FP_ID on second by PR_ID and etc with the help of modulo.
DECLARE #fp int = 1
;WITH cte AS (
SELECT f.FP_ID,
f.PR_ID,
1 as lev
FROM #FP_PR f
WHERE f.FP_id = #fp
UNION ALL
SELECT f.FP_ID,
f.PR_ID,
lev+1
FROM cte c
CROSS JOIN #FP_PR f -- You can use INNER JOIN instead
WHERE CASE (lev+1)%2 WHEN 0 THEN f.PR_ID WHEN 1 THEN f.FP_ID END = CASE (lev+1)%2 WHEN 0 THEN c.PR_ID WHEN 1 THEN c.FP_ID END
AND NOT (f.PR_ID = c.PR_ID AND f.FP_ID = c.FP_ID)
)
SELECT *
FROM cte
Output:
FP_ID PR_ID lev
1 1 1
2 1 2
2 2 3
3 2 4
4 2 4
I have a Family table:
SELECT * FROM Family;
id | Surname | Oldest | Oldest_Age
---+----------+--------+-------
1 | Byre | NULL | NULL
2 | Summers | NULL | NULL
3 | White | NULL | NULL
4 | Anders | NULL | NULL
The Family.Oldest column is not yet populated. There is another table of Children:
SELECT * FROM Children;
id | Name | Age | Family_FK
---+----------+------+--------
1 | Jake | 8 | 1
2 | Martin | 7 | 2
3 | Sarah | 10 | 1
4 | Tracy | 12 | 3
where many children (or no children) can be associated with one family. I would like to populate the Oldest column using an UPDATE ... SET ... statement that sets it to the Name and Oldest_Age of the oldest child in each family. Finding the name of each oldest child is a problem that is solved quite well here: How can I SELECT rows with MAX(Column value), DISTINCT by another column in SQL?
However, I don't know how to use the result of this in an UPDATE statement to update the column of an associated table using the h2 database.
The following is ANSI-SQL syntax that solves this problem:
update family
set oldest = (select name
from children c
where c.family_fk = f.id
order by age desc
fetch first 1 row only
)
In h2, I think you would use limit 1 instead of fetch first 1 row only.
EDIT:
For two columns -- alas -- the solution is two subqueries:
update family
set oldest = (select name
from children c
where c.family_fk = f.id
order by age desc
limit 1
),
oldest_age = (select age
from children c
where c.family_fk = f.id
order by age desc
limit 1
);
Some databases (such as SQL Server, Postgres, and Oracle) support lateral joins that can help with this. Also, row_number() can also help solve this problem. Unfortunately, H2 doesn't support this functionality.
I have a simple database table
create table demo (
id integer PRIMARY KEY,
fv integer,
sv text,
rel_id integer,
FOREIGN KEY (rel_id)
REFERENCES demo(id));
and i want to delete all duplicate rows grouped by fv and sv. Which is already a fairly popular question with great answers.
But I need a twist on that scenario. As in cases where rel_id is NULL I want to keep that row. In any other case anything goes.
So by using the following values
insert into demo (id,fv,sv,rel_id)
VALUES (1,1,'somestring',NULL),
(2,2,'somemorestring',1),
(3,1,'anotherstring',NULL),
(4,2,'somemorestring',3),
(5,1,'somestring',3)
Either
id | fv | sv | rel_id
---+----+------------------+-------
1 | 1 | 'somestring' | NULL
2 | 2 | 'somemorestring' | 1
3 | 1 | 'anotherstring' | NULL
or
id | fv | sv | rel_id
---+----+------------------+-------
1 | 1 | 'somestring' | NULL
3 | 1 | 'anotherstring' | NULL
4 | 2 | 'somemorestring' | 3
would be valid results. Where as
id | fv | sv | rel_id
---+----+------------------+-------
3 | 1 | 'anotherstring' | NULL
4 | 2 | 'somemorestring' | 3
5 | 1 | 'somestring' | 3
would not be. As the first entry had NULL as rel_id which takes presidency above NOT NULL.
I currently have this (which is an answer on the basic duplicate question) as a query to remove duplicates but I am not sure how to continue to modify the query to fit my needs.
DELETE FROM demo
WHERE id NOT IN (SELECT min(id) as id
FROM demo
GROUP BY fv,sv)
As as soon as the NOT NULL entry is inserted into the database before the NULL entry the NOT NULL one will not be deleted. It is guaranteed that rel_id will always point to an entry where rel_id is NULL therefore there is no danger of deleting a referenced entry. Further it is guaranteed that there will be no two rows in the same group with rel_id IS NULL. Therefore a row with rel_id IS NULL is unique for the whole table.
Or as a basic algorithm:
Go over all rows and group them by fv and sv
Look into each group for a row where rel_id IS NULL. If there is keep that row (and delete the rest). Else pick one row of your choice and delete the rest.
sqlfiddle
I seem to have worked it out
DELETE FROM demo
WHERE id NOT IN (SELECT min(id) as id
FROM demo AS out_buff
WHERE rel_id IS NULL OR
NOT EXISTS (SELECT id FROM demo AS in_buff
WHERE rel_id IS NULL AND
in_buff.fv = out_buff.fv AND
in_buff.sv = out_buff.sv)
GROUP BY fv,sv);
by selecting in the inner SELECT either only the row with the rel_id with the value NULL or all rows matching on the GROUP BY arguments, by using the anti-condition to the existence of a row with rel_id IS NULL. But my query looks really ineffective. As a naive assumption would put the running time at at least O(n^2).
I have a group of tables that define some rules that need to be followed, for example:
CREATE TABLE foo.subrules (
subruleid SERIAL PRIMARY KEY,
ruleid INTEGER REFERENCES foo.rules(ruleid),
subrule INTEGER,
barid INTEGER REFERENCES foo.bars(barid)
);
INSERT INTO foo.subrules(ruleid,subrule,barid) VALUES
(1,1,1),
(1,1,2),
(1,2,2),
(1,2,3),
(1,2,4),
(1,3,3),
(1,3,4),
(1,3,5),
(1,3,6),
(1,3,7);
What this is defining is a set of "subrules" that need to be satisfied... if all "subrules" are satisfied then the rule is also satisfied.
In the above example, "subruleid" 1 can be satisfied with a "barid" value of 1 or 2.
Additionally, "subruleid" 2 can be satisfied with a "barid" value of 2, 3, or 4.
Likewise, "subruleid" 3 can be satisfied with a "barid" value of 3, 4, 5, 6, or 7.
I also have a data set that looks like this:
primarykey | resource | barid
------------|------------|------------
1 | A | 1
2 | B | 2
3 | C | 8
The tricky part is that once a "subrule" is satisfied with a "resource", that "resource" can't satisfy any other "subrule" (even if the same "barid" would satisfy the other "subrule")
So, what I need is to evaluate and return the following results:
ruleid | subrule | barid | primarykey | resource
------------|------------|------------|------------|------------
1 | 1 | 1 | 1 | A
1 | 1 | 2 | NULL | NULL
1 | 2 | 2 | 2 | B
1 | 2 | 3 | NULL | NULL
1 | 2 | 4 | NULL | NULL
1 | 3 | 3 | NULL | NULL
1 | 3 | 4 | NULL | NULL
1 | 3 | 5 | NULL | NULL
1 | 3 | 6 | NULL | NULL
1 | 3 | 7 | NULL | NULL
NULL | NULL | NULL | 3 | C
Interestingly, if "primarykey" 3 had a "barid" value of 2 (instead of 8) the results would be identical.
I have tried several methods including a plpgsql function that performs a grouping by "subruleid" with ARRAY_AGG(barid) and building an array from barid and checking if each element in the barid array is in the "subruleid" group via a loop, but it just doesn't feel right.
Is a more elegant or efficient option available?
The following fragment finds solutions, if there are any. The number three (resources) is hardcoded. If only one solution is needed some symmetry-breaker should be added.
If the number of resources is not bounded, I think there could be a solution by enumerating all possible tableaux (Hilbert? mixed-radix?), and selecting from them, after pruning the not-satifying ones.
-- the data
CREATE TABLE subrules
( subruleid SERIAL PRIMARY KEY
, ruleid INTEGER -- REFERENCES foo.rules(ruleid),
, subrule INTEGER
, barid INTEGER -- REFERENCES foo.bars(barid)
);
INSERT INTO subrules(ruleid,subrule,barid) VALUES
(1,1,1), (1,1,2),
(1,2,2), (1,2,3), (1,2,4),
(1,3,3), (1,3,4), (1,3,5), (1,3,6), (1,3,7);
CREATE TABLE resources
( primarykey INTEGER NOT NULL PRIMARY KEY
, resrc varchar
, barid INTEGER NOT NULL
);
INSERT INTO resources(primarykey,resrc,barid) VALUES
(1, 'A', 1) ,(2, 'B', 2) ,(3, 'C', 8)
-- ################################
-- uncomment next line to find a (two!) solution(s)
-- ,(4, 'D', 7)
;
-- all matching pairs of subrules <--> resources
WITH pairs AS (
SELECT sr.subruleid, sr.ruleid, sr.subrule, sr.barid
, re.primarykey, re.resrc
FROM subrules sr
JOIN resources re ON re.barid = sr.barid
)
SELECT
p1.ruleid AS ru1 , p1.subrule AS sr1 , p1.resrc AS one
, p2.ruleid AS ru2 , p2.subrule AS sr2 , p2.resrc AS two
, p3.ruleid AS ru3 , p3.subrule AS sr3 , p3.resrc AS three
-- self-join the pairs, excluding the ones that
-- use the same subrule or resource
FROM pairs p1
JOIN pairs p2 ON p2.primarykey > p1.primarykey -- tie-breaker
JOIN pairs p3 ON p3.primarykey > p2.primarykey -- tie breaker
WHERE 1=1
AND p2.subruleid <> p1.subruleid
AND p2.subruleid <> p3.subruleid
AND p3.subruleid <> p1.subruleid
;
Result (after uncommenting the line with missing resource) :
ru1 | sr1 | one | ru2 | sr2 | two | ru3 | sr3 | three
-----+-----+-----+-----+-----+-----+-----+-----+-------
1 | 1 | A | 1 | 1 | B | 1 | 3 | D
1 | 1 | A | 1 | 2 | B | 1 | 3 | D
(2 rows)
The resources {A,B,C} could of course be hard-coded, but that would prevent the 'D' record (or any other) to serve as the missing link.
Since you are not clarifying the question, I am going with my own assumptions.
subrule numbers are ascending without gaps for each rule.
(subrule, barid) is UNIQUE in table subrules.
If a there are multiple resources for the same barid, assignments are arbitrary among these peers.
As commented, the number of resources matches the number of subrules (which has no effect on my suggested solution).
The algorithm is as follows:
Pick the subrule with the smallest subrule number.
Assign a resource to the lowest barid possible (the first that has a matching resource), which consumes the resource.
After the first resource is matched, skip to the next higher subruleid and repeat 2.
Append all remaining resources after last subrule.
You can implement this with pure SQL using a recursive CTE:
WITH RECURSIVE cte AS ((
SELECT s.*, r.resourceid, r.resource
, CASE WHEN r.resourceid IS NULL THEN '{}'::int[]
ELSE ARRAY[r.resourceid] END AS consumed
FROM subrules s
LEFT JOIN resource r USING (barid)
WHERE s.ruleid = 1
ORDER BY s.subrule, r.barid, s.barid
LIMIT 1
)
UNION ALL (
SELECT s.*, r.resourceid, r.resource
, CASE WHEN r.resourceid IS NULL THEN c.consumed
ELSE c.consumed || r.resourceid END
FROM cte c
JOIN subrules s ON s.subrule = c.subrule + 1
LEFT JOIN resource r ON r.barid = s.barid
AND r.resourceid <> ALL (c.consumed)
ORDER BY r.barid, s.barid
LIMIT 1
))
SELECT ruleid, subrule, barid, resourceid, resource FROM cte
UNION ALL -- add unused rules
SELECT s.ruleid, s.subrule, s.barid, NULL, NULL
FROM subrules s
LEFT JOIN cte c USING (subruleid)
WHERE c.subruleid IS NULL
UNION ALL -- add unused resources
SELECT NULL, NULL, r.barid, r.resourceid, r.resource
FROM resource r
LEFT JOIN cte c USING (resourceid)
WHERE c.resourceid IS NULL
ORDER BY subrule, barid, resourceid;
Returns exactly the result you have been asking for.
SQL Fiddle.
Explain
It's basically an implementation of the algorithm laid out above.
Only take a single match on a single barid per subrule. Hence the LIMIT 1, which requires additional parentheses:
Sum results of a few queries and then find top 5 in SQL
Collecting "consumed" resources in the array consumed and exclude them from repeated assignment with r.resourceid <> ALL (c.consumed). Note in particular how I avoid NULL values in the array, which would break the test.
The CTE only returns matched rows. Add rules and resources without match in the outer SELECT to get the complete result.
Or you open two cursors on the tables subrule and resource and implement the algorithm with any decent programming language (including PL/pgSQL).