SQL get leaves of directed acyclic graph - sql

I am quite new to SQL and have a rather basic question. Suppose I'm dealing with the following table structure:
CREATE TABLE nodes (
id INTEGER NOT NULL PRIMARY KEY,
parent INTEGER REFERENCES nodes(id)
);
If we maintain the invariant that a node's parent can never be one of its descendants, then by definition the graph has no cycles. What we are left with is a directed acyclic graph, possibly made up of several disjoint components.
The two questions I have then are:
If we cannot change the structure of the database: What select statement would I have to write to efficiently get all of the leaves in my database? I.e. the ids that don't have any children.
If we can change the structure of the tables: What could we change or add to make this select statement more efficient?
As an example, for a graph with five nodes whose parent links are 3->2, 2->1, and 5->4, the query would output 3 and 5, because they are the only nodes that don't have children.

You can use NOT EXISTS with a correlated subquery that checks for a node whose parent is the current node. For leaves, no such record can exist.
SELECT *
FROM nodes n1
WHERE NOT EXISTS (SELECT *
FROM nodes n2
WHERE n2.parent = n1.id);
Another option is a left join that joins each node to its possible children. If the id on the "children's side" of the join is null, no child exists for the current node, so it's a leaf.
SELECT *
FROM nodes n1
LEFT JOIN nodes n2
ON n2.parent = n1.id
WHERE n2.id IS NULL;
And, leaving denormalization aside, I don't think there's much to change in the table's structure. Indexes could help though. One should be on id (but that's already the case because of the primary key constraint) and one on parent (but again such an index already exists because MySQL creates indexes for foreign key columns).
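To see both queries on the example graph from the question, here is a small sketch using Python's built-in sqlite3 module. The in-memory database and the function name find_leaves are only for illustration; the table definition and both SELECT statements follow the ones above, narrowed to return just the id column.

```python
import sqlite3

def find_leaves():
    """Run both leaf queries on the example graph (3->2, 2->1, 5->4)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE nodes (
            id INTEGER NOT NULL PRIMARY KEY,
            parent INTEGER REFERENCES nodes(id)
        );
        INSERT INTO nodes (id, parent) VALUES
            (1, NULL), (2, 1), (3, 2), (4, NULL), (5, 4);
    """)
    # Variant 1: NOT EXISTS with a correlated subquery.
    not_exists = conn.execute("""
        SELECT n1.id
        FROM nodes n1
        WHERE NOT EXISTS (SELECT 1 FROM nodes n2 WHERE n2.parent = n1.id)
        ORDER BY n1.id
    """).fetchall()
    # Variant 2: LEFT JOIN, keeping only rows with no matching child.
    left_join = conn.execute("""
        SELECT n1.id
        FROM nodes n1
        LEFT JOIN nodes n2 ON n2.parent = n1.id
        WHERE n2.id IS NULL
        ORDER BY n1.id
    """).fetchall()
    conn.close()
    return [r[0] for r in not_exists], [r[0] for r in left_join]

print(find_leaves())
```

Both variants should report nodes 3 and 5, matching the expected output in the question.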

For more complex graph queries, you may use Common Table Expressions (CTEs), standardized in SQL:99 and supported in MySQL since 8.0.1.
But as others pointed out, for the query you're interested in, a simple NOT EXISTS subquery or equivalent is enough. Yet another equivalent to those already posted would be using the EXCEPT set operation:
SELECT id FROM nodes
EXCEPT SELECT parent FROM nodes

I would do:
select *
from nodes
where id not in (select parent from nodes where parent is not null)
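The "parent is not null" filter in the NOT IN version matters, because root nodes store NULL as their parent. The sketch below (Python with the standard sqlite3 module, run against SQLite rather than MySQL, with a hypothetical function name) compares the unfiltered NOT IN, the filtered NOT IN, and the EXCEPT variant on the example graph.

```python
import sqlite3

def compare_variants():
    """Compare NOT IN (with and without the NULL filter) and EXCEPT."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE nodes (
            id INTEGER NOT NULL PRIMARY KEY,
            parent INTEGER REFERENCES nodes(id)
        );
        INSERT INTO nodes (id, parent) VALUES
            (1, NULL), (2, 1), (3, 2), (4, NULL), (5, 4);
    """)

    def ids(sql):
        return sorted(row[0] for row in conn.execute(sql))

    # Without the filter, the NULL parents make NOT IN evaluate to
    # NULL (not TRUE) for every candidate leaf, so no row qualifies.
    unfiltered = ids(
        "SELECT id FROM nodes WHERE id NOT IN (SELECT parent FROM nodes)")
    filtered = ids(
        "SELECT id FROM nodes WHERE id NOT IN "
        "(SELECT parent FROM nodes WHERE parent IS NOT NULL)")
    # EXCEPT is a set operation and is not affected by the NULLs here.
    except_ids = ids("SELECT id FROM nodes EXCEPT SELECT parent FROM nodes")
    conn.close()
    return unfiltered, filtered, except_ids

print(compare_variants())
```

On this data the unfiltered NOT IN returns nothing, while the other two return 3 and 5.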

Related

Abap subquery Where Cond [duplicate]

I have a requirement to pull records that do not have history in an archive table. Two fields of each record need to be checked against the archive.
In technical sense my requirement is a left join where right side is 'null' (a.k.a. an excluding join), which in abap openSQL is commonly implemented like this (for my scenario anyways):
Select * from xxxx //xxxx is a result for a multiple table join
where xxxx~key not in (select key from archive_table where [conditions] )
and xxxx~foreign_key not in (select key from archive_table where [conditions] )
Those 2 fields are also checked against 2 more tables, so that would mean a total of 6 subqueries.
Database engines that I have worked with previously usually had some methods to deal with such problems (such as excluding join or outer apply).
For this particular case I will be trying to use ABAP logic with 'for all entries', but I would still like to know if it is possible to use the results of a sub-query to check more than 1 field, or to use another form of excluding join logic on multiple fields using SQL (without involving the application server).
I have tested quite a few variations of sub-queries in the life-cycle of the program I was making. NOT EXISTS with a multiple field check (shortened example below) to exclude based on 2 keys works in certain cases.
Performance is acceptable (processing time is about 5 seconds), although it's noticeably slower than the same query when excluding based on 1 field.
Select * from xxxx //xxxx is a result for a multiple table inner joins and 1 left join ( 1-* relation )
where NOT EXISTS (
select key from archive_table
where key = xxxx~key OR key = xxxx~foreign_key
)
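As a rough illustration of the NOT EXISTS check on two fields, here is a sketch in Python with the standard sqlite3 module. It runs on SQLite, not on ABAP Open SQL, and the table and column names (main_rows, archive, doc_key, partner_key) are hypothetical stand-ins for xxxx, archive_table, key and foreign_key.

```python
import sqlite3

def rows_without_history():
    """Keep rows where neither of the two fields appears in the archive."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE main_rows (doc_key TEXT, partner_key TEXT);
        CREATE TABLE archive (doc_key TEXT);
        INSERT INTO main_rows VALUES ('A', 'P1'), ('B', 'P2'), ('C', 'P3');
        -- 'A' is archived directly, 'B' is archived through its partner.
        INSERT INTO archive VALUES ('A'), ('P2');
    """)
    rows = conn.execute("""
        SELECT m.doc_key
        FROM main_rows m
        WHERE NOT EXISTS (
            SELECT 1
            FROM archive a
            WHERE a.doc_key = m.doc_key OR a.doc_key = m.partner_key
        )
        ORDER BY m.doc_key
    """).fetchall()
    conn.close()
    return [r[0] for r in rows]

print(rows_without_history())
```

Only row C survives, since A matches on the first field and B on the second.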
EDIT:
With changing requirements (for more filtering) a lot has changed, so I figured I would update this. The construct I marked as XXXX in my example contained a single left join ( where main to secondary table relation is 1-* ) and it appeared relatively fast.
This is where context becomes helpful for understanding the problem:
Initial requirement: pull all vendors without financial records in 3 tables.
Additional requirements: also exclude based on alternative payers (1-* relationship). This is what the example above is based on.
More requirements: also exclude based on alternative payee (*-* relationship between payer and payee).
The many-to-many join multiplied the record count within the construct I labeled XXXX, which in turn produced a lot of unnecessary work. For instance: a single customer with 3 payers and 3 payees produced 9 rows, with a total of 27 fields to check (3 per row), when in reality there are only 7 unique values.
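The row multiplication described above can be reproduced with a toy model. This is a sketch in Python with the standard sqlite3 module; the two tables payers and payees and their columns are a hypothetical simplification, not the actual SAP tables.

```python
import sqlite3

def fan_out():
    """Join one vendor's 3 payers with its 3 payees and count the result."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE payers (vendor TEXT, payer TEXT);
        CREATE TABLE payees (vendor TEXT, payee TEXT);
        INSERT INTO payers VALUES ('V1', 'P1'), ('V1', 'P2'), ('V1', 'P3');
        INSERT INTO payees VALUES ('V1', 'E1'), ('V1', 'E2'), ('V1', 'E3');
    """)
    rows = conn.execute("""
        SELECT p.vendor, p.payer, e.payee
        FROM payers p
        JOIN payees e ON e.vendor = p.vendor
    """).fetchall()
    conn.close()
    fields_to_check = sum(len(row) for row in rows)
    unique_values = {value for row in rows for value in row}
    return len(rows), fields_to_check, len(unique_values)

print(fan_out())
```

The join yields 9 rows and 27 field values, of which only 7 are distinct (one vendor, three payers, three payees).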
At this point, moving left-joined tables from the main query into sub-queries and splitting them gave significantly better performance than any smarter looking alternatives.
select * from lfa1 inner join lfb1
where
( lfa1~lifnr not in ( select lifnr from bsik where bsik~lifnr = lfa1~lifnr )
and lfa1~lifnr not in ( select wyt3~lifnr from wyt3 inner join t024e on wyt3~ekorg = t024e~ekorg and wyt3~lifnr <> wyt3~lifn2
inner join bsik on bsik~lifnr = wyt3~lifn2 where wyt3~lifnr = lfa1~lifnr and t024e~bukrs = lfb1~bukrs )
and lfa1~lifnr not in ( select lfza~lifnr from lfza inner join bsik on bsik~lifnr = lfza~empfk where lfza~lifnr = lfa1~lifnr )
)
and [3 more sets of sub queries like the 3 above, just checking different tables].
My Conclusion:
When exclusion is based on a single field, both NOT IN and NOT EXISTS work. One might be better than the other, depending on the filters you use.
When exclusion is based on 2 or more fields and you don't have many-to-many join in main query, not exists ( select .. from table where id = a.id or id = b.id or... ) appears to be the best.
The moment your exclusion criteria implements a many-to-many relationship within your main query, I would recommend looking for an optimal way to implement multiple sub-queries instead (even having a sub-query for each key-table combination will perform better than a many-to-many join with 1 good sub-query, that looks good).
Anyways, any additional insight into this is welcome.
EDIT2: Although it's slightly off topic, given how my question was about sub-queries, I figured I would post an update. After over a year I had to revisit the solution I worked on to expand it. I learned that proper excluding join works. I just failed horribly at implementing it the first time.
select headers~key
from headers left join items on headers~key = items~key
where items~key is null
if it is possible to use results of a sub-query to check more than 1 field or use another form of excluding join logic on multiple fields
No, it is not possible to check two columns in subquery, as SAP Help clearly says:
The clauses in the subquery subquery_clauses must constitute a scalar
subquery.
Scalar is the keyword here, i.e. the subquery should return exactly one column.
Your subquery can have a multi-column key, and such syntax is completely legit:
SELECT planetype, seatsmax
FROM saplane AS plane
WHERE seatsmax < @wa-seatsmax AND
seatsmax >= ALL ( SELECT seatsocc
FROM sflight
WHERE carrid = @wa-carrid AND
connid = @wa-connid )
however you say that these two fields should be checked against different tables
Those 2 fields are also checked against two more tables
so it's not the case for you. Your only choice seems to be multi-join.
P.S. FOR ALL ENTRIES does not support negation logic, you cannot just use some sort of NOT IN FOR ALL ENTRIES, it won't be that easy.

SQL loop. I want to iterate through a loop containing SELECT results

From a table with column structure (parent, child) I need:
1. For a particular parent I need all children.
2. From the result of (1) I need the children's children too.
For example for parent=1:
parent|child    parent|child    parent|child
1      a        a      d        b      f
       b               e               g
This gets you the information you say you want, I think. Two columns: child and grandchild (if any, or else NULL). Not sure if it's the schema you'd like, since you don't specify. You may add JOINs to increase the recursion depth.
select t1.child, t2.child
from T as t1 left join T as t2
on t1.child = t2.parent
where t1.parent = 1
This works on SQLite; I think it's quite standard. Regarding the schema: if this one doesn't serve you, hopefully it gives you ideas; otherwise please specify more.
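A runnable version of the self-join, as a sketch in Python with the standard sqlite3 module. The sample rows assume the example in the question reads as 1 -> a, b; a -> d, e; b -> f, g, and all values are stored as text; the function name is made up for the example.

```python
import sqlite3

def children_and_grandchildren(parent):
    """Return (child, grandchild) pairs; grandchild is None if absent."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE T (parent TEXT, child TEXT);
        INSERT INTO T VALUES
            ('1', 'a'), ('1', 'b'),
            ('a', 'd'), ('a', 'e'),
            ('b', 'f'), ('b', 'g');
    """)
    rows = conn.execute("""
        SELECT t1.child, t2.child
        FROM T AS t1
        LEFT JOIN T AS t2 ON t1.child = t2.parent
        WHERE t1.parent = ?
        ORDER BY t1.child, t2.child
    """, (parent,)).fetchall()
    conn.close()
    return rows

print(children_and_grandchildren("1"))
print(children_and_grandchildren("a"))
```

For parent 1 every child has grandchildren, so no NULLs appear; for parent a the children d and e have none, so the second column comes back as None.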

Reusing results from a SQL query in a following query in Sqlite

I am using a recursive WITH statement to select all children of a given parent in a table representing tree-structured entries. This is in SQLite (which now supports recursive WITH).
This allows me to very quickly select thousands of records in this tree without suffering the huge performance loss caused by preparing thousands of SELECT statements from the calling application.
WITH RECURSIVE q(Id) AS
(
SELECT Id FROM Entity
WHERE Parent=(?)
UNION ALL
SELECT m.Id FROM Entity AS m
JOIN q ON m.Parent=q.Id
)
SELECT Id FROM q;
Now, suppose I have data related to these entities in an arbitrary number of other tables, which I want to load afterwards. Because the number of tables is arbitrary (in a modular fashion), it is not possible to include the data fetching directly in this query. It must follow it.
But if I then issue a SELECT statement per entity for each related table, the performance gain from selecting the whole tree directly inside SQLite is almost useless, because I will still stall on thousands of subsequent requests, each of which prepares and issues a SELECT statement.
So two questions :
The better solution is to formulate a similar recursive statement for each of the related tables, that will recursively gather the entities from this tree again, and this time select their related data by joining it.
This sounds really more efficient, but it's really tricky to formulate such a statement and I'm a bit lost here.
Now the real mystery is, would there be an even more efficient solution, which would be to somehow keep these results from the last query cached somewhere (the rows with the ids from the entity tree) and join them to the related tables in the following statement without having to recursively iterate over it again ?
Here is a try at the first option, supposing I want to select a field Data from related table Component : is the second UNION ALL legal ?
WITH RECURSIVE q(Data) AS
(
SELECT Id FROM Entity
WHERE Parent=(?)
UNION ALL
SELECT m.Id FROM Entity AS m
JOIN Entity ON m.Id=q.Parent
UNION ALL
SELECT Data FROM Component AS c
JOIN Component ON c.Id=q.Id
)
SELECT Data FROM q;
The documentation says:
 2. The table named on the left-hand side of the AS keyword must appear exactly once in the FROM clause of the right-most SELECT statement of the compound select, and nowhere else.
So your second query is not legal.
However, the CTE behaves like a normal table/view, so you can just join it to the related table:
WITH RECURSIVE q(Id) AS
( ... )
SELECT q.Id, c.Data
FROM q JOIN Component AS c ON q.Id = c.Id
If you want to reuse the computed values in q for multiple queries, there's nothing you can do with CTEs, but you can store them in a temporary table:
CREATE TEMPORARY TABLE q_123 AS
WITH RECURSIVE q(Id) AS
( ... )
SELECT Id FROM q;
SELECT * FROM q_123 JOIN Component ...;
SELECT * FROM q_123 JOIN Whatever ...;
DROP TABLE q_123;
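The temporary-table approach could look like the following sketch in Python with the standard sqlite3 module. The sample rows, the temp table name subtree and the function name are made up for the example, and the recursive step assumes the intent is to walk from a parent down to all of its descendants (joining q on m.Parent = q.Id). The temp table is created with an explicit schema and filled through a WITH ... INSERT statement so that the root id can be passed as an ordinary bound parameter.

```python
import sqlite3

def load_subtree_data(root_id):
    """Collect descendant ids once, then reuse them for a related table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Entity (Id INTEGER PRIMARY KEY, Parent INTEGER);
        CREATE TABLE Component (Id INTEGER, Data TEXT);
        INSERT INTO Entity VALUES (1, NULL), (2, 1), (3, 1), (4, 2), (5, NULL);
        INSERT INTO Component VALUES (2, 'two'), (4, 'four'), (5, 'five');
        CREATE TEMPORARY TABLE subtree (Id INTEGER PRIMARY KEY);
    """)
    # Walk the tree once and store the ids of all descendants.
    conn.execute("""
        WITH RECURSIVE q(Id) AS (
            SELECT Id FROM Entity WHERE Parent = ?
            UNION ALL
            SELECT m.Id FROM Entity AS m JOIN q ON m.Parent = q.Id
        )
        INSERT INTO subtree SELECT Id FROM q
    """, (root_id,))
    ids = [r[0] for r in conn.execute("SELECT Id FROM subtree ORDER BY Id")]
    # Any number of follow-up queries can now join against the stored ids.
    data = conn.execute("""
        SELECT s.Id, c.Data
        FROM subtree AS s
        JOIN Component AS c ON c.Id = s.Id
        ORDER BY s.Id
    """).fetchall()
    conn.execute("DROP TABLE subtree")
    conn.close()
    return ids, data

print(load_subtree_data(1))
```

Each additional related table only needs one more join against subtree, so the recursion runs a single time per tree.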

Delete parent if it's not referenced by any other child

I have an example situation: parent table has a column named id, referenced in child table as a foreign key.
When deleting a child row, how to delete the parent as well if it's not referenced by any other child?
In PostgreSQL 9.1 or later you can do this with a single statement using a data-modifying CTE. This is generally less error prone. It minimizes the time frame between the two DELETEs in which a race condition could lead to surprising results with concurrent operations:
WITH del_child AS (
DELETE FROM child
WHERE child_id = 1
RETURNING parent_id, child_id
)
DELETE FROM parent p
USING del_child x
WHERE p.parent_id = x.parent_id
AND NOT EXISTS (
SELECT 1
FROM child c
WHERE c.parent_id = x.parent_id
AND c.child_id <> x.child_id -- !
);
The child is deleted in any case. I quote the manual:
Data-modifying statements in WITH are executed exactly once, and
always to completion, independently of whether the primary query reads
all (or indeed any) of their output. Notice that this is different
from the rule for SELECT in WITH: as stated in the previous section,
execution of a SELECT is carried only as far as the primary query
demands its output.
The parent is only deleted if it has no other children.
Note the last condition. Contrary to what one might expect, this is necessary, since:
The sub-statements in WITH are executed concurrently with each other
and with the main query. Therefore, when using data-modifying
statements in WITH, the order in which the specified updates actually
happen is unpredictable. All the statements are executed with the same
snapshot (see Chapter 13), so they cannot "see" each others' effects
on the target tables.
Bold emphasis mine.
I used the column name parent_id in place of the non-descriptive id.
Eliminate race condition
To eliminate possible race conditions I mentioned above completely, lock the parent row first. Of course, all similar operations must follow the same procedure to make it work.
WITH lock_parent AS (
SELECT p.parent_id, c.child_id
FROM child c
JOIN parent p ON p.parent_id = c.parent_id
WHERE c.child_id = 12 -- provide child_id here once
FOR NO KEY UPDATE -- locks parent row.
)
, del_child AS (
DELETE FROM child c
USING lock_parent l
WHERE c.child_id = l.child_id
)
DELETE FROM parent p
USING lock_parent l
WHERE p.parent_id = l.parent_id
AND NOT EXISTS (
SELECT 1
FROM child c
WHERE c.parent_id = l.parent_id
AND c.child_id <> l.child_id -- !
);
This way only one transaction at a time can lock the same parent. So it cannot happen that multiple transactions delete children of the same parent, still see other children and spare the parent, while all of the children are gone afterwards. (Updates on non-key columns are still allowed with FOR NO KEY UPDATE.)
If such cases never occur or you can live with it (hardly ever) happening - the first query is cheaper. Else, this is the secure path.
FOR NO KEY UPDATE was introduced with Postgres 9.3. Details in the manual. In older versions use the stronger lock FOR UPDATE instead.
delete from child
where parent_id = 1
After deleting from the child table, do the same in the parent:
delete from parent
where
id = 1
and not exists (
select 1 from child where parent_id = 1
)
The not exists condition will make sure it will only be deleted if it does not exist in the child. You can wrap both delete commands in a transaction:
begin;
first_delete;
second_delete;
commit;
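The two-step variant can be sketched end to end in Python with the standard sqlite3 module. This runs on SQLite rather than PostgreSQL, so it says nothing about the concurrency behaviour discussed above. It is also adapted to delete a single child by its id, as the question asks, and the helper names make_db and delete_child are hypothetical.

```python
import sqlite3

def make_db():
    """Create one parent with two children in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE parent (parent_id INTEGER PRIMARY KEY);
        CREATE TABLE child (
            child_id INTEGER PRIMARY KEY,
            parent_id INTEGER NOT NULL REFERENCES parent(parent_id)
        );
        INSERT INTO parent VALUES (1);
        INSERT INTO child VALUES (10, 1), (11, 1);
    """)
    return conn

def delete_child(conn, child_id):
    """Delete a child, then its parent if no other child references it."""
    with conn:  # one transaction: commit on success, roll back on error
        row = conn.execute(
            "SELECT parent_id FROM child WHERE child_id = ?", (child_id,)
        ).fetchone()
        if row is None:
            return
        parent_id = row[0]
        conn.execute("DELETE FROM child WHERE child_id = ?", (child_id,))
        conn.execute("""
            DELETE FROM parent
            WHERE parent_id = ?
              AND NOT EXISTS (SELECT 1 FROM child c WHERE c.parent_id = ?)
        """, (parent_id, parent_id))

conn = make_db()
delete_child(conn, 10)  # parent 1 still has child 11, so it stays
delete_child(conn, 11)  # last child gone, parent 1 is removed too
```

Because the child is already deleted when the second statement runs, the NOT EXISTS check does not need to exclude the deleted child id, unlike the single-statement CTE version.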

Could this SQL query be made more efficient?

I have a very large table of nodes (cardinality of about 600,000), each record in this table can have one or more types associated with it. There is a node_types table that contains these (30 or so) type definitions.
To connect the two, I have a third table called node_type_relations that simply links node ids to type ids.
I am trying to clean up orphaned node_type_relation entries after a cull of the node table. My query to delete any type relations for which the node no longer exists is:
DELETE FROM node_type_relations WHERE node_id NOT IN (SELECT id FROM nodes)
But judging by the speed at which this is running (one record being deleted per 10 seconds or so), it looks like Postgres is loading up the entire nodes table once for every record in the node_type_relations table (which is about 1.4million records in size).
I was about to dive in and write some code to do it more sensibly when I thought I'd ask here if the query could be turned inside-out somehow. Anything to avoid loading the nodes table more than once.
Thanks as always.
Edit with solution
Executing the query:
DELETE FROM node_type_relations WHERE NOT EXISTS (SELECT 1 FROM nodes WHERE nodes.id=node_type_relations.node_id)
appears to have had the desired effect and deleted all orphaned records (some 170,000) in a matter of seconds.
Maybe do a left join, and then delete where null.
So:
DELETE ntr
FROM node_type_relations ntr
LEFT JOIN nodes n
ON n.id = ntr.node_id
WHERE n.id IS NULL
@lynks found the optimal query for his case himself, with a NOT EXISTS anti-join:
DELETE FROM node_type_relations ntr
WHERE NOT EXISTS (
SELECT 1
FROM nodes n
WHERE n.id = ntr.node_id
);
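A small, self-contained check of the NOT EXISTS delete, sketched in Python with the standard sqlite3 module. It runs on SQLite rather than PostgreSQL, uses a handful of made-up rows and a hypothetical function name, and shows only that the statement removes exactly the orphaned relations; it does not say anything about performance.

```python
import sqlite3

def delete_orphans():
    """Delete relations whose node no longer exists; return the outcome."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE nodes (id INTEGER PRIMARY KEY);
        CREATE TABLE node_type_relations (node_id INTEGER, type_id INTEGER);
        INSERT INTO nodes VALUES (1), (2);
        -- Nodes 3 and 4 were culled, so their relations are orphans.
        INSERT INTO node_type_relations VALUES
            (1, 10), (2, 10), (3, 11), (4, 12);
    """)
    with conn:  # commit the delete as one transaction
        cur = conn.execute("""
            DELETE FROM node_type_relations
            WHERE NOT EXISTS (
                SELECT 1
                FROM nodes n
                WHERE n.id = node_type_relations.node_id
            )
        """)
        deleted = cur.rowcount
    remaining = [r[0] for r in conn.execute(
        "SELECT node_id FROM node_type_relations ORDER BY node_id")]
    conn.close()
    return deleted, remaining

print(delete_orphans())
```

Two rows are removed and the relations for nodes 1 and 2 are kept.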
A solution with JOIN syntax would have to be constructed like this in PostgreSQL:
DELETE FROM node_type_relations d
USING node_type_relations ntr
LEFT JOIN nodes n ON n.id = ntr.node_id
WHERE ntr.node_id = d.node_id
AND n.id IS NULL;