Oracle Hierarchical query with condition on the whole tree - sql

I need to select tree-structured data, using a hierarchical (or other) query, where a certain condition must hold for the whole tree (i.e. all the nodes in the tree).
That means that if a single node of a tree violates the condition, the tree is not selected at all (not even the other nodes of that tree that do comply with the condition; the complete tree is thrown away).
I also want to select all such trees, with all of their nodes, where the condition holds for every node (i.e. not just one such tree but all of them).
EDIT:
Consider this example of a table of files that are connected to each other through the parent_id column, so they form trees. There is also a foreign key owner_id, which references another table's primary key.
PK file_id | name | parent_id | owner_id
----------------------------------------
1 | a1 | null | null -- root of one tree
2 | b1 | 1 | null
3 | c1 | 1 | null
4 | d1 | 2 | 100
5 | a2 | null | null -- root of another tree
6 | b2 | 5 | null
7 | c2 | 6 | null
8 | d2 | 7 | null
Column parent_id has a foreign key constraint to file_id column (making the hierarchies).
And there is one more table (let's call it the junction table) where, among other data, file_id foreign keys are stored in a many-to-one relationship to the table of files above:
FK file_id | other data
-----------------------
1 | ...
1 | ...
3 | ...
Now the query I need should select all whole trees of files where the following conditions are met for each and every file in that tree:
owner_id of the file is null
and the file has no related records in the junction table (there are no records referencing the file by file_id FK)
For the example above, the query should result in:
file_id | name | parent_id | owner_id
---------------------------------------
5 | a2 | null | null
6 | b2 | 5 | null
7 | c2 | 6 | null
8 | d2 | 7 | null
All nodes make up a whole tree as it is in the table (no missing children or parents), and each of the nodes satisfies the conditions above (has no owner and no related record in the junction table).

This generates the tree with a simple hierarchical query, which is really only needed to establish the root file_id for each row, while joining to junction to check for a record there. The join can produce duplicates, which is OK at that stage. The analytic version of max() is then applied to the intermediate result set to determine whether any row with the same root has an owner or a junction record:
select file_id, name, parent_id, owner_id
from (
select file_id, name, parent_id, owner_id,
max(j_id) over (partition by root_id) as max_j_id,
max(owner_id) over (partition by root_id) as max_o_id
from (
select f.*, j.file_id as j_id,
connect_by_root f.file_id as root_id
from files f
left outer join junction j
on j.file_id = f.file_id
connect by prior f.file_id = f.parent_id
start with f.parent_id is null
)
)
where max_j_id is null
and max_o_id is null
order by file_id;
FILE_ID NAME PARENT_ID OWNER_ID
--------- ------ ----------- ----------
5 a2 (null) (null)
6 b2 5 (null)
7 c2 6 (null)
8 d2 7 (null)
The innermost query gets the root and any matching junction records (with duplicates). The next level adds the analytic max owner and junction value (if there is one), giving the same result to every row for the same root. The outer query then filters out every tree in which any row has an owner or a junction record, because all rows of that tree carry the non-null max value.
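If you want to try the idea without an Oracle instance, here is a minimal sketch in Python using the standard sqlite3 module. It is not the query above: SQLite has no CONNECT BY, so the root of each row is found with a recursive CTE, and a plain subquery stands in for the analytic max(). Table and column names follow the example in the question; the CTE names tree and bad_roots are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (file_id INTEGER PRIMARY KEY, name TEXT,
                    parent_id INTEGER, owner_id INTEGER);
CREATE TABLE junction (file_id INTEGER);
INSERT INTO files VALUES
  (1,'a1',NULL,NULL),(2,'b1',1,NULL),(3,'c1',1,NULL),(4,'d1',2,100),
  (5,'a2',NULL,NULL),(6,'b2',5,NULL),(7,'c2',6,NULL),(8,'d2',7,NULL);
INSERT INTO junction VALUES (1),(1),(3);
""")

QUERY = """
WITH RECURSIVE tree(file_id, root_id) AS (
    -- every root is its own root
    SELECT file_id, file_id FROM files WHERE parent_id IS NULL
    UNION ALL
    -- children inherit the root of their parent
    SELECT f.file_id, t.root_id
    FROM files f JOIN tree t ON f.parent_id = t.file_id
),
bad_roots AS (
    -- roots of trees containing at least one offending node
    SELECT DISTINCT t.root_id
    FROM tree t JOIN files f ON f.file_id = t.file_id
    WHERE f.owner_id IS NOT NULL
       OR EXISTS (SELECT 1 FROM junction j WHERE j.file_id = f.file_id)
)
SELECT f.file_id, f.name, f.parent_id, f.owner_id
FROM files f JOIN tree t ON t.file_id = f.file_id
WHERE t.root_id NOT IN (SELECT root_id FROM bad_roots)
ORDER BY f.file_id
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

With the sample data only the tree rooted at file 5 survives, because the first tree has both an owner (file 4) and junction records (files 1 and 3).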

Related

Deleting duplicate rows with primary keys that are connected to other tables

A process was causing duplicate rows in a table where there were not supposed to be any. There are several great answers to deleting duplicate rows online. But, what if those duplicates with ID primary keys all have data in other tables tied to them?
Is there a way to delete all duplicates in the first table and migrate all data tied to those keys to the single PK ID that wasn't deleted?
For example:
TABLE 1
+-------+----------+----------+------------+
| ID(PK)| Model | ItemType | Color |
+-------+----------+----------+------------+
| 1 | 4 | B | Red |
| 2 | 4 | B | Red |
| 3 | 5 | A | Blue |
+-------+----------+----------+------------+
TABLE 2
+-------+----------+---------+
| ID(PK)| OtherID | Type |
+-------+----------+---------+
| 1 | 1 | Type1 |
| 2 | 1 | Type2 |
| 3 | 2 | Type3 |
| 4 | 2 | Type4 |
| 5 | 2 | Type5 |
+-------+----------+---------+
So I would theoretically want to delete the entry with ID: 2 from TABLE 1, and then have the OtherID fields in TABLE 2 switch to 1. This would actually be needed for X number of tables. This particular situation has 4 tables connected to its ID PK.
You cannot do this automatically. But you can do this with some queries. First, you set all the foreign keys to the correct id, which is presumably the smallest one:
with ids as (
select t1.*, min(id) over (partition by Model, ItemType, Color) as min_id
from table1 t1
)
update t2
set t2.otherid = ids.min_id
from table2 t2 join
ids
on t2.otherid = ids.id
where ids.id <> ids.min_id;
Then delete the ids that are either duplicated or not referenced in table2 (depending on which you actually want):
with ids as (
select t1.*, min(id) over (partition by Model, ItemType, Color) as min_id
from table1 t1
)
delete from ids
where id <> min_id;
Note: If the database has concurrent users, you might want to put it in single user mode for this operation or lock the tables so they are not modified during these two operations.
To do this right, you want to wrap everything in a single transaction and perform this during a regular maintenance period. Anything else could leave things as inconsistent as they are now.
1. Decide which "key" you will keep for each set of duplicates.
2. Update all of the child tables to use the new "key" where the value is the old "key".
3. There should now be no FK dependencies on the duplicate records, so delete them.
4. Once all ambiguities are resolved, place a unique constraint on (ItemType,Color) (or whatever the real columns are).
If there are a lot of instances, you may need to write a script to handle this and use the information in sys.foreign_keys and sys.foreign_key_columns to determine which records to update and in which order.
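The steps above can be sketched end to end as follows. This is only an illustration in Python with the standard sqlite3 module, not SQL Server: it uses correlated subqueries instead of the updatable CTE, the index name table1_uniq is made up, and the unique constraint is placed on (model, itemtype, color) on the assumption that those three columns define a duplicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER PRIMARY KEY, model INTEGER,
                     itemtype TEXT, color TEXT);
CREATE TABLE table2 (id INTEGER PRIMARY KEY,
                     otherid INTEGER REFERENCES table1(id), type TEXT);
INSERT INTO table1 VALUES (1,4,'B','Red'),(2,4,'B','Red'),(3,5,'A','Blue');
INSERT INTO table2 VALUES (1,1,'Type1'),(2,1,'Type2'),
                          (3,2,'Type3'),(4,2,'Type4'),(5,2,'Type5');
""")

# One transaction: either every step succeeds or nothing changes.
with conn:
    # Steps 1 and 2: repoint each child row at the smallest id of its duplicate set.
    conn.execute("""
        UPDATE table2
        SET otherid = (
            SELECT MIN(k.id)
            FROM table1 k
            JOIN table1 d ON k.model = d.model
                         AND k.itemtype = d.itemtype
                         AND k.color = d.color
            WHERE d.id = table2.otherid)
        WHERE otherid IN (SELECT id FROM table1)
    """)
    # Step 3: delete every row that is not the keeper of its set.
    conn.execute("""
        DELETE FROM table1
        WHERE id NOT IN (SELECT MIN(id) FROM table1
                         GROUP BY model, itemtype, color)
    """)
    # Step 4: stop the duplicates from coming back.
    conn.execute(
        "CREATE UNIQUE INDEX table1_uniq ON table1 (model, itemtype, color)")

kept_ids = [r[0] for r in conn.execute("SELECT id FROM table1 ORDER BY id")]
other_ids = [r[0] for r in conn.execute("SELECT otherid FROM table2 ORDER BY id")]
print(kept_ids, other_ids)
```

With more than one child table, the UPDATE would be repeated once per table before the DELETE.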

Find sql connected component between many to many entities

I have two basic entities: financial plan and purchase request. These two entities are in a many-to-many relationship:
CREATE TABLE FinancialPlan
(
ID int NOT NULL,
PRIMARY KEY (ID)
);
CREATE TABLE PurchaseRequest
(
ID int NOT NULL,
PRIMARY KEY (ID)
);
CREATE TABLE FP_PR
(
FP_ID int FOREIGN KEY REFERENCES FinancialPlan(ID),
PR_ID int FOREIGN KEY REFERENCES PurchaseRequest(ID)
);
Problem: find all requests related to a specified plan, all plans related to those requests, all requests related to those plans, and so on.
The model can be represented as a graph where each node represents a plan or a request and each edge represents a relationship. The problem can then be rephrased as: find the connected component that a specified node belongs to.
Example:
Plan | Request | FP_PR
ID   | ID      | FP_ID | PR_ID
-----|---------|-------|------
1    | 1       | 1     | 1
2    | 2       | 2     | 1
3    | 3       | 2     | 2
4    |         | 3     | 2
5    |         | 4     | 2
     |         | 5     | 3
Find connected component of finplan ID=1
Desired output:
FP_ID | PR_ID|
------+------+
1 | 1 |
2 | 1 |
2 | 2 |
3 | 2 |
4 | 2 |
I am currently doing it recursively on the app side, which may generate too many requests and hang the DB server. Could this be done with some recursive DB approach?
Visualization (image not included): the starting entity is marked by an arrow and the desired output is circled.
SQL Server solution
I guess the main problem is that you need to alternate between matching on PR_ID and matching on FP_ID. So the recursive part needs a CASE expression: one step joins on FP_ID, the next on PR_ID, and so on, switching on the level number modulo 2.
DECLARE @fp int = 1
;WITH cte AS (
SELECT f.FP_ID,
f.PR_ID,
1 as lev
FROM #FP_PR f
WHERE f.FP_id = @fp
UNION ALL
SELECT f.FP_ID,
f.PR_ID,
lev+1
FROM cte c
CROSS JOIN #FP_PR f -- You can use INNER JOIN instead
WHERE CASE (lev+1)%2 WHEN 0 THEN f.PR_ID WHEN 1 THEN f.FP_ID END = CASE (lev+1)%2 WHEN 0 THEN c.PR_ID WHEN 1 THEN c.FP_ID END
AND NOT (f.PR_ID = c.PR_ID AND f.FP_ID = c.FP_ID)
)
SELECT *
FROM cte
Output:
FP_ID PR_ID lev
1 1 1
2 1 2
2 2 3
3 2 4
4 2 4
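For comparison, here is a sketch of the same connected-component search in Python with the standard sqlite3 module. It is a different technique from the query above: SQLite (like PostgreSQL) allows UNION in a recursive CTE, which discards rows that were already found, so no level counter is needed and cycles cannot cause endless recursion. SQL Server requires UNION ALL in recursive CTEs, so this does not carry over as is. The names comp and COMPONENT_SQL are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fp_pr (fp_id INTEGER, pr_id INTEGER);
INSERT INTO fp_pr VALUES (1,1),(2,1),(2,2),(3,2),(4,2),(5,3);
""")

COMPONENT_SQL = """
WITH RECURSIVE comp(fp_id, pr_id) AS (
    -- seed: every edge of the starting plan
    SELECT fp_id, pr_id FROM fp_pr WHERE fp_id = ?
    UNION
    -- grow: any edge sharing a plan or a request with an edge already found
    SELECT f.fp_id, f.pr_id
    FROM fp_pr f
    JOIN comp c ON f.fp_id = c.fp_id OR f.pr_id = c.pr_id
)
SELECT fp_id, pr_id FROM comp ORDER BY fp_id, pr_id
"""

component = conn.execute(COMPONENT_SQL, (1,)).fetchall()
for edge in component:
    print(edge)
```

For plan 1 this returns the five edges of the desired output; the edge (5, 3) belongs to a separate component and is not reached.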

Update statement to set a column based the maximum row of another table

I have a Family table:
SELECT * FROM Family;
id | Surname | Oldest | Oldest_Age
---+----------+--------+-------
1 | Byre | NULL | NULL
2 | Summers | NULL | NULL
3 | White | NULL | NULL
4 | Anders | NULL | NULL
The Family.Oldest column is not yet populated. There is another table of Children:
SELECT * FROM Children;
id | Name | Age | Family_FK
---+----------+------+--------
1 | Jake | 8 | 1
2 | Martin | 7 | 2
3 | Sarah | 10 | 1
4 | Tracy | 12 | 3
where many children (or no children) can be associated with one family. I would like to populate the Oldest and Oldest_Age columns using an UPDATE ... SET ... statement that sets them to the name and age of the oldest child in each family. Finding the name of each oldest child is a problem that is solved quite well here: How can I SELECT rows with MAX(Column value), DISTINCT by another column in SQL?
However, I don't know how to use the result of this in an UPDATE statement to update the columns of an associated table using the H2 database.
The following is ANSI-SQL syntax that solves this problem:
update family f
set oldest = (select name
from children c
where c.family_fk = f.id
order by age desc
fetch first 1 row only
)
In h2, I think you would use limit 1 instead of fetch first 1 row only.
EDIT:
For two columns -- alas -- the solution is two subqueries:
update family f
set oldest = (select name
from children c
where c.family_fk = f.id
order by age desc
limit 1
),
oldest_age = (select age
from children c
where c.family_fk = f.id
order by age desc
limit 1
);
Some databases (such as SQL Server, Postgres, and Oracle) support lateral joins that can help with this. row_number() can also help solve this problem. Unfortunately, H2 doesn't support this functionality.
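As a quick way to check the two-subquery form, here is a small sketch in Python with the standard sqlite3 module. SQLite is only a stand-in here, so the exact syntax accepted by H2 may differ. The extra ", id" in the ORDER BY is my addition so that both subqueries pick the same child when two children share the maximum age.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE family (id INTEGER PRIMARY KEY, surname TEXT,
                     oldest TEXT, oldest_age INTEGER);
CREATE TABLE children (id INTEGER PRIMARY KEY, name TEXT,
                       age INTEGER, family_fk INTEGER);
INSERT INTO family (id, surname) VALUES
  (1,'Byre'),(2,'Summers'),(3,'White'),(4,'Anders');
INSERT INTO children VALUES
  (1,'Jake',8,1),(2,'Martin',7,2),(3,'Sarah',10,1),(4,'Tracy',12,3);
""")

with conn:
    conn.execute("""
        UPDATE family
        SET oldest = (SELECT name FROM children c
                      WHERE c.family_fk = family.id
                      ORDER BY age DESC, id LIMIT 1),
            oldest_age = (SELECT age FROM children c
                          WHERE c.family_fk = family.id
                          ORDER BY age DESC, id LIMIT 1)
    """)

result = conn.execute(
    "SELECT surname, oldest, oldest_age FROM family ORDER BY id").fetchall()
for row in result:
    print(row)
```

Families without children (Anders in the sample data) end up with NULL in both columns, because the subqueries return no row.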

delete duplicate rows but keep preferred row

I have a simple database table
create table demo (
id integer PRIMARY KEY,
fv integer,
sv text,
rel_id integer,
FOREIGN KEY (rel_id)
REFERENCES demo(id));
and I want to delete all duplicate rows grouped by fv and sv, which is already a fairly popular question with great answers.
But I need a twist on that scenario: if a group contains a row where rel_id is NULL, I want to keep that row. In any other case anything goes.
So by using the following values
insert into demo (id,fv,sv,rel_id)
VALUES (1,1,'somestring',NULL),
(2,2,'somemorestring',1),
(3,1,'anotherstring',NULL),
(4,2,'somemorestring',3),
(5,1,'somestring',3)
Either
id | fv | sv | rel_id
---+----+------------------+-------
1 | 1 | 'somestring' | NULL
2 | 2 | 'somemorestring' | 1
3 | 1 | 'anotherstring' | NULL
or
id | fv | sv | rel_id
---+----+------------------+-------
1 | 1 | 'somestring' | NULL
3 | 1 | 'anotherstring' | NULL
4 | 2 | 'somemorestring' | 3
would be valid results, whereas
id | fv | sv | rel_id
---+----+------------------+-------
3 | 1 | 'anotherstring' | NULL
4 | 2 | 'somemorestring' | 3
5 | 1 | 'somestring' | 3
would not be, as the first entry has NULL as rel_id, which takes precedence over NOT NULL.
I currently have this (which is an answer on the basic duplicate question) as a query to remove duplicates but I am not sure how to continue to modify the query to fit my needs.
DELETE FROM demo
WHERE id NOT IN (SELECT min(id) as id
FROM demo
GROUP BY fv,sv)
The problem: if the NOT NULL entry was inserted into the database before the NULL entry, it has the smaller id, so min(id) keeps the NOT NULL row and deletes the NULL one. It is guaranteed that rel_id always points to an entry whose rel_id is NULL, so there is no danger of deleting a referenced entry. It is also guaranteed that no two rows in the same group have rel_id IS NULL, so a row with rel_id IS NULL is unique for its (fv, sv) combination across the whole table.
Or as a basic algorithm:
Go over all rows and group them by fv and sv
Look into each group for a row where rel_id IS NULL. If there is keep that row (and delete the rest). Else pick one row of your choice and delete the rest.
I seem to have worked it out
DELETE FROM demo
WHERE id NOT IN (SELECT min(id) as id
FROM demo AS out_buff
WHERE rel_id IS NULL OR
NOT EXISTS (SELECT id FROM demo AS in_buff
WHERE rel_id IS NULL AND
in_buff.fv = out_buff.fv AND
in_buff.sv = out_buff.sv)
GROUP BY fv,sv);
by selecting in the inner SELECT either only the row whose rel_id is NULL or, if the group has no such row (the NOT EXISTS anti-condition), all rows of the group. But my query looks really inefficient: a naive estimate would put the running time at O(n^2) or worse.
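One possible way around the correlated NOT EXISTS is to let a single GROUP BY pick the keeper: the id of the row with rel_id IS NULL if the group has one, otherwise the smallest id. The sketch below tries this in Python with the standard sqlite3 module. It is an alternative formulation, not the query above, and the helper name dedupe is made up for the example.

```python
import sqlite3

def dedupe(rows):
    """Load rows into a fresh demo table, delete duplicates, return surviving ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE demo (
        id INTEGER PRIMARY KEY, fv INTEGER, sv TEXT,
        rel_id INTEGER REFERENCES demo(id))""")
    conn.executemany("INSERT INTO demo VALUES (?, ?, ?, ?)", rows)
    # MIN() ignores NULLs, so the CASE yields the id of the rel_id IS NULL row
    # when the group has one; COALESCE falls back to the smallest id otherwise.
    conn.execute("""
        DELETE FROM demo
        WHERE id NOT IN (
            SELECT COALESCE(MIN(CASE WHEN rel_id IS NULL THEN id END), MIN(id))
            FROM demo
            GROUP BY fv, sv)
    """)
    return [r[0] for r in conn.execute("SELECT id FROM demo ORDER BY id")]

kept = dedupe([
    (1, 1, 'somestring', None),
    (2, 2, 'somemorestring', 1),
    (3, 1, 'anotherstring', None),
    (4, 2, 'somemorestring', 3),
    (5, 1, 'somestring', 3),
])
print(kept)
```

For the sample data this keeps ids 1, 2 and 3, which is the first of the two valid results listed in the question.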

PostgreSQL 9.3 - Compare two sets of data without duplicating values in first set

I have a group of tables that define some rules that need to be followed, for example:
CREATE TABLE foo.subrules (
subruleid SERIAL PRIMARY KEY,
ruleid INTEGER REFERENCES foo.rules(ruleid),
subrule INTEGER,
barid INTEGER REFERENCES foo.bars(barid)
);
INSERT INTO foo.subrules(ruleid,subrule,barid) VALUES
(1,1,1),
(1,1,2),
(1,2,2),
(1,2,3),
(1,2,4),
(1,3,3),
(1,3,4),
(1,3,5),
(1,3,6),
(1,3,7);
What this defines is a set of "subrules" that need to be satisfied: if all "subrules" are satisfied, then the rule is also satisfied.
In the above example, "subrule" 1 can be satisfied with a "barid" value of 1 or 2.
Additionally, "subrule" 2 can be satisfied with a "barid" value of 2, 3, or 4.
Likewise, "subrule" 3 can be satisfied with a "barid" value of 3, 4, 5, 6, or 7.
I also have a data set that looks like this:
primarykey | resource | barid
------------|------------|------------
1 | A | 1
2 | B | 2
3 | C | 8
The tricky part is that once a "subrule" is satisfied with a "resource", that "resource" can't satisfy any other "subrule" (even if the same "barid" would satisfy the other "subrule")
So, what I need is to evaluate and return the following results:
ruleid | subrule | barid | primarykey | resource
------------|------------|------------|------------|------------
1 | 1 | 1 | 1 | A
1 | 1 | 2 | NULL | NULL
1 | 2 | 2 | 2 | B
1 | 2 | 3 | NULL | NULL
1 | 2 | 4 | NULL | NULL
1 | 3 | 3 | NULL | NULL
1 | 3 | 4 | NULL | NULL
1 | 3 | 5 | NULL | NULL
1 | 3 | 6 | NULL | NULL
1 | 3 | 7 | NULL | NULL
NULL | NULL | NULL | 3 | C
Interestingly, if "primarykey" 3 had a "barid" value of 2 (instead of 8) the results would be identical.
I have tried several methods including a plpgsql function that performs a grouping by "subruleid" with ARRAY_AGG(barid) and building an array from barid and checking if each element in the barid array is in the "subruleid" group via a loop, but it just doesn't feel right.
Is a more elegant or efficient option available?
The following fragment finds solutions, if there are any. The number three (resources) is hardcoded. If only one solution is needed some symmetry-breaker should be added.
If the number of resources is not bounded, I think there could be a solution by enumerating all possible tableaux (Hilbert? mixed-radix?), and selecting from them after pruning the non-satisfying ones.
-- the data
CREATE TABLE subrules
( subruleid SERIAL PRIMARY KEY
, ruleid INTEGER -- REFERENCES foo.rules(ruleid),
, subrule INTEGER
, barid INTEGER -- REFERENCES foo.bars(barid)
);
INSERT INTO subrules(ruleid,subrule,barid) VALUES
(1,1,1), (1,1,2),
(1,2,2), (1,2,3), (1,2,4),
(1,3,3), (1,3,4), (1,3,5), (1,3,6), (1,3,7);
CREATE TABLE resources
( primarykey INTEGER NOT NULL PRIMARY KEY
, resrc varchar
, barid INTEGER NOT NULL
);
INSERT INTO resources(primarykey,resrc,barid) VALUES
(1, 'A', 1) ,(2, 'B', 2) ,(3, 'C', 8)
-- ################################
-- uncomment next line to find a (two!) solution(s)
-- ,(4, 'D', 7)
;
-- all matching pairs of subrules <--> resources
WITH pairs AS (
SELECT sr.subruleid, sr.ruleid, sr.subrule, sr.barid
, re.primarykey, re.resrc
FROM subrules sr
JOIN resources re ON re.barid = sr.barid
)
SELECT
p1.ruleid AS ru1 , p1.subrule AS sr1 , p1.resrc AS one
, p2.ruleid AS ru2 , p2.subrule AS sr2 , p2.resrc AS two
, p3.ruleid AS ru3 , p3.subrule AS sr3 , p3.resrc AS three
-- self-join the pairs, excluding the ones that
-- use the same subrule or resource
FROM pairs p1
JOIN pairs p2 ON p2.primarykey > p1.primarykey -- tie-breaker
JOIN pairs p3 ON p3.primarykey > p2.primarykey -- tie breaker
WHERE 1=1
AND p2.subruleid <> p1.subruleid
AND p2.subruleid <> p3.subruleid
AND p3.subruleid <> p1.subruleid
;
Result (after uncommenting the line with missing resource) :
ru1 | sr1 | one | ru2 | sr2 | two | ru3 | sr3 | three
-----+-----+-----+-----+-----+-----+-----+-----+-------
1 | 1 | A | 1 | 1 | B | 1 | 3 | D
1 | 1 | A | 1 | 2 | B | 1 | 3 | D
(2 rows)
The resources {A,B,C} could of course be hard-coded, but that would prevent the 'D' record (or any other) to serve as the missing link.
Since you are not clarifying the question, I am going with my own assumptions.
subrule numbers are ascending without gaps for each rule.
(subrule, barid) is UNIQUE in table subrules.
If there are multiple resources for the same barid, assignments are arbitrary among these peers.
As commented, the number of resources matches the number of subrules (which has no effect on my suggested solution).
The algorithm is as follows:
1. Pick the subrule with the smallest subrule number.
2. Assign a resource to the lowest barid possible (the first that has a matching resource), which consumes the resource.
3. After the first resource is matched, skip to the next higher subrule and repeat 2.
4. Append all remaining resources after the last subrule.
You can implement this with pure SQL using a recursive CTE:
WITH RECURSIVE cte AS ((
SELECT s.*, r.resourceid, r.resource
, CASE WHEN r.resourceid IS NULL THEN '{}'::int[]
ELSE ARRAY[r.resourceid] END AS consumed
FROM subrules s
LEFT JOIN resource r USING (barid)
WHERE s.ruleid = 1
ORDER BY s.subrule, r.barid, s.barid
LIMIT 1
)
UNION ALL (
SELECT s.*, r.resourceid, r.resource
, CASE WHEN r.resourceid IS NULL THEN c.consumed
ELSE c.consumed || r.resourceid END
FROM cte c
JOIN subrules s ON s.subrule = c.subrule + 1
LEFT JOIN resource r ON r.barid = s.barid
AND r.resourceid <> ALL (c.consumed)
ORDER BY r.barid, s.barid
LIMIT 1
))
SELECT ruleid, subrule, barid, resourceid, resource FROM cte
UNION ALL -- add unused rules
SELECT s.ruleid, s.subrule, s.barid, NULL, NULL
FROM subrules s
LEFT JOIN cte c USING (subruleid)
WHERE c.subruleid IS NULL
UNION ALL -- add unused resources
SELECT NULL, NULL, r.barid, r.resourceid, r.resource
FROM resource r
LEFT JOIN cte c USING (resourceid)
WHERE c.resourceid IS NULL
ORDER BY subrule, barid, resourceid;
Returns exactly the result you have been asking for.
Explain
It's basically an implementation of the algorithm laid out above.
Only take a single match on a single barid per subrule. Hence the LIMIT 1, which requires additional parentheses; see the related answer "Sum results of a few queries and then find top 5 in SQL".
Collect "consumed" resources in the array consumed and exclude them from repeated assignment with r.resourceid <> ALL (c.consumed). Note in particular how I avoid NULL values in the array, which would break the test.
The CTE only returns matched rows. Add rules and resources without match in the outer SELECT to get the complete result.
Alternatively, open two cursors on the tables subrules and resource and implement the algorithm in any decent programming language (including PL/pgSQL).
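As a rough illustration of that procedural route, here is a sketch of the greedy algorithm in plain Python. The function name assign and the tuple layouts are assumptions made for the example: subrules are (ruleid, subrule, barid) and resources are (primarykey, resource, barid), following the tables in the question. Where several resources share a barid, this sketch takes the one with the smallest primarykey, which is just one of the arbitrary choices the assumptions above allow.

```python
def assign(subrules, resources):
    """Greedy matching: one resource per subrule, lowest matching barid first."""
    consumed = set()   # primary keys of resources already used
    matched = {}       # (ruleid, subrule, barid) -> (primarykey, resource)
    for ruleid, subrule in sorted({(r, s) for r, s, _ in subrules}):
        candidates = sorted(b for r, s, b in subrules
                            if (r, s) == (ruleid, subrule))
        for barid in candidates:
            free = [res for res in resources
                    if res[2] == barid and res[0] not in consumed]
            if free:
                pk, name, _ = min(free)   # smallest primarykey among peers
                consumed.add(pk)
                matched[(ruleid, subrule, barid)] = (pk, name)
                break                     # this subrule is satisfied, move on
    # every subrule row, with its resource where one was assigned
    result = [row + matched.get(row, (None, None)) for row in sorted(subrules)]
    # unused resources are appended after the last subrule
    result += [(None, None, None, pk, name)
               for pk, name, _ in sorted(resources) if pk not in consumed]
    return result

subrules = [(1, 1, 1), (1, 1, 2), (1, 2, 2), (1, 2, 3), (1, 2, 4),
            (1, 3, 3), (1, 3, 4), (1, 3, 5), (1, 3, 6), (1, 3, 7)]
resources = [(1, 'A', 1), (2, 'B', 2), (3, 'C', 8)]

result = assign(subrules, resources)
for row in result:
    print(row)
```

Each output row is (ruleid, subrule, barid, primarykey, resource), matching the column order of the result table in the question.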