Recursively walking a DAG in an SQL table

I have a graph which consists of two types of nodes, Task and Subtask. Lists of these are stored along with metadata (for now, we can just assume a single metadata string column called "name") in two tables, TaskTable and SubTaskTable. Tasks have subtasks under them, which are connected in the form of a DAG. A Task A might have 5 subtasks st1, st2, st3, st4, and st5.
Their connections are represented in the database in a dependencies table, like so:
| for_task | from_subtask | to_subtask |
|----------+--------------+------------|
| A | 1 | 2 |
| A | 1 | 3 |
| A | 2 | 4 |
| A | 3 | 4 |
| A | 4 | 5 |
So far, so good. However, there is a possibility that another task B will have a similar DAG for its own subtasks, which is okay too.
The third requirement is that a Task should be able to have another Task as a dependency, and there should be some way to "expand" the subtask tree when I take the top-level task. For example, subtask 1 of A depends on task B,
and B itself has its own subtasks (6 to 9 below).
I've changed my dependencies table to hold this information as well, like so:
| for_task | from_task | from_subtask | to_task | to_subtask |
|----------+-----------+--------------+---------+------------|
| A | | 1 | | 2 |
| A | | 1 | | 3 |
| A | | 2 | | 4 |
| A | | 3 | | 4 |
| A | | 4 | | 5 |
| A | | 1 | B | |
| B | | 6 | | 7 |
| B | | 6 | | 8 |
| B | | 7 | | 8 |
| B | | 8 | | 9 |
I have two questions:
1. Is this a good way to store this kind of information?
2. How would I construct a query that will "expand" task B when I get all the tasks for task A, and give me the whole list of subtasks?
For 2, I'd expect something like this:
| from_task | to_task |
|-----------+---------|
| 1 | 2 |
| 1 | 3 |
| 2 | 4 |
| 3 | 4 |
| 4 | 5 |
| 1 | 6 | <-- The subtask here is directly linked to the subtask of B
| 6 | 7 |
| 6 | 8 |
| 7 | 8 |
| 8 | 9 |
No more tasks. Just a tree of subtasks.
This is using PostgreSQL, if that's relevant.

Instinctively, I would create a first table, task, which describes every task independently (possibly including the subtasks as well, to be decided):
CREATE TABLE task
( name text PRIMARY KEY -- can add hereafter any kind of task attributes as new columns
) ;
Then I would keep your first table dependencies renamed as subtask_dependency with only 3 columns and add a foreign key to task :
CREATE TABLE subtask_dependency
( for_task text
, from_subtask text
, to_subtask text
, CONSTRAINT pk_subtask PRIMARY KEY (for_task, from_subtask, to_subtask)
, CONSTRAINT fk_task FOREIGN KEY (for_task) REFERENCES task (name)
MATCH SIMPLE ON UPDATE CASCADE ON DELETE RESTRICT
) ;
and I would create a second table task_dependency the same way :
CREATE TABLE task_dependency
( from_task text
, from_subtask text
, to_task text
, to_subtask text -- this column is not mandatory but it is helpful to store the root subtask of to_task
, CONSTRAINT pk_task PRIMARY KEY (from_task, from_subtask, to_task)
, CONSTRAINT fk_fromtask FOREIGN KEY (from_task) REFERENCES task (name)
MATCH SIMPLE ON UPDATE CASCADE ON DELETE RESTRICT
, CONSTRAINT fk_totask FOREIGN KEY (to_task) REFERENCES task (name)
MATCH SIMPLE ON UPDATE CASCADE ON DELETE RESTRICT
) ;
Then the query to get your expected result with only one level of nested task :
SELECT from_subtask, to_subtask
FROM subtask_dependency AS s
WHERE for_task = 'A'
UNION ALL
SELECT from_subtask, to_subtask
FROM task_dependency
WHERE from_task = 'A'
UNION ALL
SELECT s.from_subtask, s.to_subtask
FROM subtask_dependency AS s
INNER JOIN task_dependency AS t -- the WHERE on t.from_task discards unmatched rows anyway
ON t.to_task = s.for_task
WHERE t.from_task = 'A'
And more generally, the query to get your expected result with multi level nested tasks :
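WITH RECURSIVE list (from_task, from_subtask, to_task, to_subtask) AS
( -- anchor: the root task and the subtask from which it links to a nested task
  SELECT NULL, NULL, from_task, from_subtask
    FROM task_dependency
   WHERE from_task = 'A'
  UNION ALL
  -- recursive step: follow the task-to-task links that start at the (task, subtask) reached so far
  SELECT t.from_task, t.from_subtask, t.to_task, t.to_subtask
    FROM list AS l
   INNER JOIN task_dependency AS t
      ON t.from_task = l.to_task
     AND t.from_subtask = l.to_subtask
)
-- subtask edges of every task collected in list
SELECT s.from_subtask, s.to_subtask
  FROM list AS l
 INNER JOIN subtask_dependency AS s
    ON s.for_task = l.to_task
UNION ALL
-- edges that link a subtask to the root subtask of a nested task
SELECT l.from_subtask, l.to_subtask
  FROM list AS l
 WHERE l.from_subtask IS NOT NULL
 ORDER BY 1, 2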
see dbfiddle
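As a quick way to experiment with the idea, here is a minimal sketch that runs a variation of the recursive query against the sample data. It is not the query above: it assumes Python's sqlite3 module with an in-memory SQLite database as a stand-in for PostgreSQL, it first collects every task reachable from the root (so it also works for a task with no nested task), and the names reachable, EXPAND_SQL and expand are made up for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subtask_dependency (for_task TEXT, from_subtask TEXT, to_subtask TEXT);
CREATE TABLE task_dependency (from_task TEXT, from_subtask TEXT, to_task TEXT, to_subtask TEXT);
INSERT INTO subtask_dependency VALUES
  ('A','1','2'), ('A','1','3'), ('A','2','4'), ('A','3','4'), ('A','4','5'),
  ('B','6','7'), ('B','6','8'), ('B','7','8'), ('B','8','9');
-- subtask 1 of A depends on task B, whose root subtask is 6
INSERT INTO task_dependency VALUES ('A','1','B','6');
""")

# Hypothetical query: collect the root task plus every task nested under it,
# then return the subtask edges of those tasks and the linking edges.
EXPAND_SQL = """
WITH RECURSIVE reachable(task) AS (
    SELECT ?
    UNION
    SELECT t.to_task
    FROM reachable AS r
    JOIN task_dependency AS t ON t.from_task = r.task
)
SELECT s.from_subtask, s.to_subtask
FROM subtask_dependency AS s
JOIN reachable AS r ON r.task = s.for_task
UNION ALL
SELECT t.from_subtask, t.to_subtask
FROM task_dependency AS t
JOIN reachable AS r ON r.task = t.from_task
ORDER BY 1, 2
"""

def expand(connection, root_task):
    """Return all (from_subtask, to_subtask) edges under root_task."""
    return connection.execute(EXPAND_SQL, (root_task,)).fetchall()

edges = expand(conn, "A")
for edge in edges:
    print(edge)
```

Using UNION rather than UNION ALL in the recursive part keeps a task that is reachable by several routes from being visited more than once.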

Related

Which normal form or other formal rule does this database design choice violate?

The project I'm working on is an application that lets you design data entry forms, and automagically generates a schema in an underlying PostgreSQL database
to persist them as well as the browsing and editing UI.
The use case I've encountered this with is a store back-office database, but the app itself intends to be somewhat universal. The administrator creates the following entry forms with the given fields:
Customers
name (text box)
Items
name (text box)
stock (number field)
Order
customer (combo box selecting a customer)
order lines (a grid showing order lines)
OrderLine
item (combo box selecting an item)
count (number field)
When all this is done, the resulting database schema will be equivalent to this:
create table Customers(id serial primary key,
name varchar);
create table Items(id serial primary key,
name varchar,
stock integer);
create table Orders(id serial primary key);
create table OrderLines(id serial primary key,
count integer);
create table Links(id serial primary key,
fk1 integer references Customers(id),
fk2 integer references Items(id),
fk3 integer references Orders(id),
fk4 integer references OrderLines(id));
Links being a special table that stores all the relationships between entities; every row has (usually) two of the foreign keys set to a value, and the rest set to NULL. Whenever a new entry form is added to the application instance, a new foreign key referencing the table for this form is added to Links.
So, suppose our shop stocks some widgets, gizmos, and thingeys. A customer named Adam orders two widgets and three gizmos, and Betty orders four gizmos and five thingeys. The database will contain the following data:
Customers
/----+-------\
| ID | NAME |
| 1 | Adam |
| 2 | Betty |
\----+-------/
Items
/----+---------+-------\
| ID | NAME | STOCK |
| 1 | widget | 123 |
| 2 | gizmo | 456 |
| 3 | thingey | 789 |
\----+---------+-------/
Orders
/----\
| ID |
| 1 |
| 2 |
\----/
OrderLines
/----+-------\
| ID | COUNT |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
\----+-------/
Links
/----+------+------+------+------\
| ID | FK1 | FK2 | FK3 | FK4 |
| 1 | 1 | NULL | 1 | NULL |
| 2 | 2 | NULL | 2 | NULL |
| 3 | NULL | NULL | 1 | 1 |
| 4 | NULL | NULL | 1 | 2 |
| 5 | NULL | NULL | 2 | 3 |
| 6 | NULL | NULL | 2 | 4 |
| 7 | NULL | 1 | NULL | 1 |
| 8 | NULL | 2 | NULL | 2 |
| 9 | NULL | 2 | NULL | 3 |
| 10 | NULL | 3 | NULL | 4 |
\----+------+------+------+------/
(The tables also contain a bunch of timestamps for auditing and soft deletion but I don't think they're relevant here, they just make writing the SQL by the administrator that much messier. The management app is also used to implement a bunch of different use cases, but they're generally primarily data entry, master-detail views, and either scalar fields or selection boxes.)
When I've had to write a join through this thing I'd grumbled about it to my coworker, who replied "well using separate tables for each relationship is one way to do it, this is another..." Leaving aside the obvious-to-me ugliness of the above and the practical issues, I also have a nagging feeling this has to be a violation of some normal form, but it's been a while since college and I'm struggling to figure out which of the criteria apply here.
Is there something stronger than "well, that's just your opinion" that I can use when critiquing this design?
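To make the kind of join described above concrete, here is a rough sketch that loads the sample data and lists what each customer ordered by going through Links three times. It assumes Python's sqlite3 module with an in-memory SQLite database in place of PostgreSQL (so serial becomes INTEGER PRIMARY KEY), and the aliases co, ool and il are invented for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL schema (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Items (id INTEGER PRIMARY KEY, name TEXT, stock INTEGER);
CREATE TABLE Orders (id INTEGER PRIMARY KEY);
CREATE TABLE OrderLines (id INTEGER PRIMARY KEY, "count" INTEGER);
CREATE TABLE Links (
    id  INTEGER PRIMARY KEY,
    fk1 INTEGER REFERENCES Customers(id),
    fk2 INTEGER REFERENCES Items(id),
    fk3 INTEGER REFERENCES Orders(id),
    fk4 INTEGER REFERENCES OrderLines(id)
);
INSERT INTO Customers VALUES (1,'Adam'), (2,'Betty');
INSERT INTO Items VALUES (1,'widget',123), (2,'gizmo',456), (3,'thingey',789);
INSERT INTO Orders VALUES (1), (2);
INSERT INTO OrderLines VALUES (1,2), (2,3), (3,4), (4,5);
INSERT INTO Links VALUES
  (1,1,NULL,1,NULL), (2,2,NULL,2,NULL),
  (3,NULL,NULL,1,1), (4,NULL,NULL,1,2), (5,NULL,NULL,2,3), (6,NULL,NULL,2,4),
  (7,NULL,1,NULL,1), (8,NULL,2,NULL,2), (9,NULL,2,NULL,3), (10,NULL,3,NULL,4);
""")

# Each relationship hop is one more self-join on Links.
ORDER_LINES_SQL = """
SELECT c.name, i.name, ol."count"
FROM Customers AS c
JOIN Links AS co   ON co.fk1 = c.id        -- customer -> order
JOIN Links AS ool  ON ool.fk3 = co.fk3     -- order -> order line
JOIN OrderLines AS ol ON ol.id = ool.fk4
JOIN Links AS il   ON il.fk4 = ol.id       -- order line -> item
JOIN Items AS i    ON i.id = il.fk2
ORDER BY c.id, ol.id
"""

rows = conn.execute(ORDER_LINES_SQL).fetchall()
for row in rows:
    print(row)
```

The joins only work because rows with a NULL in the relevant column drop out of each equality comparison, which is easy to get wrong as more foreign key columns are added.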

SQL - Convert non-null adjacency list to path

I am working with some tables that represent a file system, and I need to select the full path of each folder as a flattened string.
The first table lists the details of each folder:
CREATE TABLE Folders(
FolderID int IDENTITY(1,1) NOT NULL,
[Name] nvarchar(255) NOT NULL)
The second table lists transitive closures of folder relationships:
CREATE TABLE FolderClosures(
FolderClosuresID int IDENTITY(1,1) NOT NULL,
AncestorFolderID int NOT NULL, --Foreign key to Folders.FolderID
DescendantFolderID int NOT NULL, --Foreign key to Folders.FolderID
IsDirect bit NOT NULL)
For sample data, let's assume the following folders exist:
Documents/
Documents/Finance/
Documents/HumanResources/
Documents/HumanResources/Training/
These would be persisted in those tables as follows:
| FolderID | Name |
+----------+----------------+
| 1 | Documents |
| 2 | Finance |
| 3 | HumanResources |
| 4 | Training |
| FolderClosureID | AncestorFolderID | DescendantFolderID | IsDirect |
+-----------------+------------------+--------------------+----------+
| 1 | 1 | 1 | 0 |
| 2 | 2 | 2 | 0 |
| 3 | 1 | 2 | 1 |
| 4 | 3 | 3 | 0 |
| 5 | 1 | 3 | 1 |
| 6 | 4 | 4 | 0 |
| 7 | 1 | 4 | 0 |
| 8 | 3 | 4 | 1 |
Some details to note:
Every folder has an "identity row" in FolderClosures, where AncestorFolderID = DescendantFolderID AND IsDirect = 0.
Every folder that is not a top-level folder has exactly one row in FolderClosures where IsDirect = 1
FolderClosures can contain many rows per folder, where AncestorFolderID <> DescendantFolderID AND IsDirect = 0. Each of these represents a "grandparent" or more distant relationship.
Since no columns are nullable, no rows explicitly state that a given folder is a top-level folder. This can only be discerned by checking that there are no rows in FolderClosures where IsDirect = 1 AND DescendantFolderID = SomeID where SomeID is the ID of the folder in question.
I want to be able to run a query that returns this data:
| FolderID | Path |
+----------+------------------------------------+
| 1 | Documents/ |
| 2 | Documents/Finance/ |
| 3 | Documents/HumanResources/ |
| 4 | Documents/HumanResources/Training/ |
Folders may be nested at unlimited depth, but realistically probably only up to 10 levels. Queries may require returning paths for a few thousand folders.
I've found a lot of advice on creating this type of query when data is persisted as an adjacency list, but I haven't been able to find an answer for a transitive closure setup like this. The adjacency list solutions I've found rely on rows being persisted with nullable parent folder IDs, but that doesn't work here.
How can I get the desired output?
If it helps, I am using SQL Server 2016.

One way to get the desired output is a recursive query. I think the best approach is to use only the rows that have IsDirect = 1, and to anchor the recursion on all folders that have no direct parent in FolderClosures, which should be all your root folders.
WITH FoldersCTE AS (
SELECT F.FolderID, CAST(F.Name as NVARCHAR(max)) Path
FROM Folders F
WHERE NOT EXISTS (
SELECT 1 FROM FolderClosures FC WHERE FC.IsDirect = 1 AND FC.DescendantFolderID = F.FolderID
)
UNION ALL
SELECT F.FolderID, CONCAT(PF.Path, '\', F.Name)
FROM FoldersCTE PF
INNER JOIN FolderClosures FC
ON FC.AncestorFolderID = PF.FolderId
AND FC.IsDirect = 1
INNER JOIN Folders F
ON F.FolderID = FC.DescendantFolderID
)
SELECT *
FROM FoldersCTE
OPTION (MAXRECURSION 1000) --> how many nested levels you think you will have
This produces:
FolderID Path
1 Documents
2 Documents\Finance
3 Documents\HumanResources
4 Documents\HumanResources\Training
Hope it helps.
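The output above uses a backslash separator and no trailing separator, while the question asked for forward slashes with a trailing slash. Below is a small sketch of the same recursive approach adjusted to that format. It assumes Python's sqlite3 module with an in-memory SQLite database instead of SQL Server 2016, so it uses SQLite's || concatenation rather than CONCAT and has no MAXRECURSION option; the variable name PATHS_SQL is made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Folders (FolderID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE FolderClosures (
    FolderClosuresID   INTEGER PRIMARY KEY,
    AncestorFolderID   INTEGER NOT NULL,
    DescendantFolderID INTEGER NOT NULL,
    IsDirect           INTEGER NOT NULL
);
INSERT INTO Folders VALUES
  (1,'Documents'), (2,'Finance'), (3,'HumanResources'), (4,'Training');
INSERT INTO FolderClosures VALUES
  (1,1,1,0), (2,2,2,0), (3,1,2,1), (4,3,3,0),
  (5,1,3,1), (6,4,4,0), (7,1,4,0), (8,3,4,1);
""")

PATHS_SQL = """
WITH RECURSIVE FoldersCTE(FolderID, Path) AS (
    -- anchor: folders that have no direct parent, i.e. the root folders
    SELECT F.FolderID, F.Name || '/'
    FROM Folders AS F
    WHERE NOT EXISTS (
        SELECT 1 FROM FolderClosures AS FC
        WHERE FC.IsDirect = 1 AND FC.DescendantFolderID = F.FolderID
    )
    UNION ALL
    -- recursive step: append each direct child to its parent's path
    SELECT F.FolderID, PF.Path || F.Name || '/'
    FROM FoldersCTE AS PF
    JOIN FolderClosures AS FC
      ON FC.AncestorFolderID = PF.FolderID AND FC.IsDirect = 1
    JOIN Folders AS F ON F.FolderID = FC.DescendantFolderID
)
SELECT FolderID, Path FROM FoldersCTE ORDER BY FolderID
"""

paths = conn.execute(PATHS_SQL).fetchall()
for folder_id, path in paths:
    print(folder_id, path)
```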

SQL query for many-to-many self-join

I have a database table that has a companion many-to-many self-join table alongside it. The primary table is part and the other table is alternate_part (basically, alternate parts are identical to their main part with different #s). Every record in the alternate_part table is also in the part table. To illustrate:
`part`
| part_id | part_number | description |
|---------|-------------|-------------|
| 1 | 00001 | wheel |
| 2 | 00002 | tire |
| 3 | 00003 | window |
| 4 | 00004 | seat |
| 5 | 00005 | wheel |
| 6 | 00006 | tire |
| 7 | 00007 | window |
| 8 | 00008 | seat |
| 9 | 00009 | wheel |
| 10 | 00010 | tire |
| 11 | 00011 | window |
| 12 | 00012 | seat |
`alternate_part`
| main_part_id | alt_part_id |
|--------------|-------------|
| 1 | 5 | // Wheel
| 5 | 1 | // |
| 5 | 9 | // |
| 9 | 5 | // |
| 2 | 6 | // Tire
| 6 | 2 | // |
| ... | ... | // |
I am trying to produce a simple SQL query that will give me a list of all alternates for a main part. The tricky part is: some alternates are only listed as alternates of alternates, it is not guaranteed that every viable alternate for a part is listed as a direct alternate. e.g., if 'Part 3' is an alternate of 'Part 2' which is an alternate of 'Part 1', then Part 3 is an alternate of Part 1 (even if the alternate_part table doesn't list a direct link). The reverse is also true (Part 1 is an alternate of Part 3).
Basically, right now I'm pulling alternates and iterating through them
SELECT p.*, ap.*
FROM part p
INNER JOIN alternate_part ap ON p.part_id = ap.main_part_id
And then going back and doing the same again on those alternates. But, I think there's got to be a better way.
The SQL query I'm looking for will basically give me:
| part_id | alt_part_id |
|---------|-------------|
| 1 | 5 |
| 1 | 9 |
For part_id = 1, even when 1 & 9 are not explicitly linked in the alternates table.
Note: I have no control whatever over the structure of the DB, it is a distributed software solution.
Note 2: It is an Oracle platform, if that affects syntax.

You have to walk the alternates as a hierarchical tree, probably with a CONNECT BY PRIOR query using NOCYCLE,
something like this:
select distinct p.part_id,p.part_number,p.description,c.main_part_id
from part p
left join (
select main_part_id,connect_by_root(main_part_id) real_part_id
from alternate_part
connect by NOCYCLE prior main_part_id = alt_part_id
) c
on p.part_id = c.real_part_id and p.part_id != c.main_part_id
order by p.part_id
You can read full documentation about Hierarchical queries at http://docs.oracle.com/cd/B28359_01/server.111/b28286/queries003.htm
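A recursive common table expression is another way to compute the same transitive closure. The sketch below is not the CONNECT BY query from this answer: it assumes Python's sqlite3 module with an in-memory SQLite database as a stand-in, loads only the sample rows shown in the question, and the names alts and all_alternates are made up for illustration. Oracle's own recursive WITH syntax differs in its details, so treat this as a demonstration of the idea rather than Oracle-ready SQL.

```python
import sqlite3

# In-memory SQLite database standing in for Oracle (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE alternate_part (main_part_id INTEGER, alt_part_id INTEGER);
INSERT INTO alternate_part VALUES
  (1,5), (5,1), (5,9), (9,5),   -- wheels
  (2,6), (6,2);                 -- tires
""")

# UNION (not UNION ALL) drops parts already seen, so the cyclic
# 1 -> 5 -> 1 links cannot make the recursion loop forever.
ALTERNATES_SQL = """
WITH RECURSIVE alts(part_id) AS (
    SELECT ?
    UNION
    SELECT ap.alt_part_id
    FROM alts AS a
    JOIN alternate_part AS ap ON ap.main_part_id = a.part_id
)
SELECT part_id FROM alts WHERE part_id <> ? ORDER BY part_id
"""

def all_alternates(connection, part_id):
    """Return every direct or indirect alternate of part_id."""
    rows = connection.execute(ALTERNATES_SQL, (part_id, part_id)).fetchall()
    return [row[0] for row in rows]

print(all_alternates(conn, 1))
```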

SQL Select Text for multiple foreign keys to lookup table in same row

I have a table similar to the following that holds a history of changes to an item, with the old and new value for a status. The status number is a foreign key to a lookup table that holds the text, e.g. 1 = 'In Inventory', 2 = 'Destroyed', etc.
I want to be able to present this as human readable results and replace the integer keys with the text from the lookup table but I'm not quite sure how to do that as I can't just join on the foreign key.
Demo Database
+---------+-------------+-------------+------------+
| ITEM_ID | OLD_STATUS | NEW_STATUS | TIMESTAMP |
+---------+-------------+-------------+------------+
| 1 | 1 | 2 | 2012-03-25 |
| 1 | 2 | 3 | 2013-12-25 |
| 1 | 3 | 4 | 2015-03-25 |
+---------+-------------+-------------+------------+

You can join on the status table multiple times - something like this:
select i.item_id,
i.old_status,
i.new_status,
i.timestamp,
s1.statustext as old_status_text,
s2.statustext as new_status_text
from items i
join status s1 on i.old_status = s1.statusid
join status s2 on i.new_status = s2.statusid
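Here is a small runnable sketch of the double join, assuming Python's sqlite3 module and an in-memory SQLite database. The table and column names follow the answer (items, status, statusid, statustext); the texts for statuses 3 and 4 are placeholders, since the question only gives the first two.

```python
import sqlite3

# In-memory SQLite database; schema names follow the answer (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE status (statusid INTEGER PRIMARY KEY, statustext TEXT);
CREATE TABLE items (
    item_id     INTEGER,
    old_status  INTEGER REFERENCES status(statusid),
    new_status  INTEGER REFERENCES status(statusid),
    "timestamp" TEXT
);
-- 'Status 3' and 'Status 4' are placeholder texts, not from the question
INSERT INTO status VALUES
  (1,'In Inventory'), (2,'Destroyed'), (3,'Status 3'), (4,'Status 4');
INSERT INTO items VALUES
  (1,1,2,'2012-03-25'), (1,2,3,'2013-12-25'), (1,3,4,'2015-03-25');
""")

# The lookup table is joined once per foreign key, under two aliases.
HISTORY_SQL = """
SELECT i.item_id,
       s1.statustext AS old_status_text,
       s2.statustext AS new_status_text,
       i."timestamp"
FROM items AS i
JOIN status AS s1 ON i.old_status = s1.statusid
JOIN status AS s2 ON i.new_status = s2.statusid
ORDER BY i."timestamp"
"""

history = conn.execute(HISTORY_SQL).fetchall()
for row in history:
    print(row)
```

If either status column can be NULL, a LEFT JOIN on that alias would keep the row instead of dropping it.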

Eliminate full table scan due to BETWEEN (and GROUP BY)

Description
According to the explain command, there is a range that is causing a query to perform a full table scan (160k rows). How do I keep the range condition and reduce the scanning? I expect the culprit to be:
Y.YEAR BETWEEN 1900 AND 2009 AND
Code
Here is the code that has the range condition (the STATION_DISTRICT is likely superfluous).
SELECT
COUNT(1) as MEASUREMENTS,
AVG(D.AMOUNT) as AMOUNT,
Y.YEAR as YEAR,
MAKEDATE(Y.YEAR,1) as AMOUNT_DATE
FROM
CITY C,
STATION S,
STATION_DISTRICT SD,
YEAR_REF Y FORCE INDEX(YEAR_IDX),
MONTH_REF M,
DAILY D
WHERE
-- For a specific city ...
--
C.ID = 10663 AND
-- Find all the stations within a specific unit radius ...
--
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= 50 AND
-- Get the station district identification for the matching station.
--
S.STATION_DISTRICT_ID = SD.ID AND
-- Gather all known years for that station ...
--
Y.STATION_DISTRICT_ID = SD.ID AND
-- The data before 1900 is shaky; insufficient after 2009.
--
Y.YEAR BETWEEN 1900 AND 2009 AND
-- Filtered by all known months ...
--
M.YEAR_REF_ID = Y.ID AND
-- Whittled down by category ...
--
M.CATEGORY_ID = '003' AND
-- Into the valid daily climate data.
--
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
Update
The SQL is performing a full table scan, which results in MySQL performing a "copy to tmp table", as shown here:
+----+-------------+-------+--------+-----------------------------------+--------------+---------+-------------------------------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------------------------+--------------+---------+-------------------------------+--------+-------------+
| 1 | SIMPLE | C | const | PRIMARY | PRIMARY | 4 | const | 1 | |
| 1 | SIMPLE | Y | range | YEAR_IDX | YEAR_IDX | 4 | NULL | 160422 | Using where |
| 1 | SIMPLE | SD | eq_ref | PRIMARY | PRIMARY | 4 | climate.Y.STATION_DISTRICT_ID | 1 | Using index |
| 1 | SIMPLE | S | eq_ref | PRIMARY | PRIMARY | 4 | climate.SD.ID | 1 | Using where |
| 1 | SIMPLE | M | ref | PRIMARY,YEAR_REF_IDX,CATEGORY_IDX | YEAR_REF_IDX | 8 | climate.Y.ID | 54 | Using where |
| 1 | SIMPLE | D | ref | INDEX | INDEX | 8 | climate.M.ID | 11 | Using where |
+----+-------------+-------+--------+-----------------------------------+--------------+---------+-------------------------------+--------+-------------+
Answer
After using the STRAIGHT_JOIN:
+----+-------------+-------+--------+-----------------------------------+---------------+---------+-------------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------------------------+---------------+---------+-------------------------------+------+---------------------------------+
| 1 | SIMPLE | C | const | PRIMARY | PRIMARY | 4 | const | 1 | Using temporary; Using filesort |
| 1 | SIMPLE | S | ALL | PRIMARY | NULL | NULL | NULL | 7795 | Using where |
| 1 | SIMPLE | SD | eq_ref | PRIMARY | PRIMARY | 4 | climate.S.STATION_DISTRICT_ID | 1 | Using index |
| 1 | SIMPLE | Y | ref | PRIMARY,STAT_YEAR_IDX | STAT_YEAR_IDX | 4 | climate.S.STATION_DISTRICT_ID | 1650 | Using where |
| 1 | SIMPLE | M | ref | PRIMARY,YEAR_REF_IDX,CATEGORY_IDX | YEAR_REF_IDX | 8 | climate.Y.ID | 54 | Using where |
| 1 | SIMPLE | D | ref | INDEX | INDEX | 8 | climate.M.ID | 11 | Using where |
+----+-------------+-------+--------+-----------------------------------+---------------+---------+-------------------------------+------+---------------------------------+
Related
http://dev.mysql.com/doc/refman/5.0/en/how-to-avoid-table-scan.html
http://dev.mysql.com/doc/refman/5.0/en/where-optimizations.html
Optimize SQL that uses between clause
Thank you!

One request... It looks like you KNOW your data. Add the keyword STRAIGHT_JOIN and see the results:
SELECT STRAIGHT_JOIN ... the rest of your query...
STRAIGHT_JOIN tells MySQL to join the tables in the order you have listed them. Your CITY table is first in the FROM list, which indicates you expect it to be your primary table. Additionally, your WHERE clause on CITY is the immediate filter. With that being said, it will probably fly through the rest of the query.
Hope it helps... It has worked for me with gov't data of millions of records, queried and joined to 10+ lookup tables, where MySQL was trying to think for me.

In order to do efficient BETWEEN queries, you are going to want a B-tree index on your YEAR column. For example:
CREATE INDEX id_index USING BTREE ON YEAR_REF (YEAR);
BTREE indexes allow for efficient range queries. If this is in fact the root problem, then an index like this should get rid of the full table scan, so that only the part of the table that falls in the range is scanned. Read more about B-trees on Wikipedia.
However, as with any optimisation advice, you should measure to make sure that you don't do more harm than good.

Can you change from searching within a radius to search in a bounding box?
You know the city so you can calculate a bounding box in your application.
Perhaps this
S.LATITUDE_DECIMAL >= latitude_lower and
S.LATITUDE_DECIMAL <= latitude_upper and
S.LONGITUDE_DECIMAL >= longitude_lower and
S.LONGITUDE_DECIMAL <= longitude_upper
could be a little faster?
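One way the application might compute those four bounds is sketched below in Python. It assumes the same Earth radius constant as the query in the question (6371.009 km) and a simple spherical approximation; the function name bounding_box is made up for illustration, and the sketch does not handle boxes that cross the poles or the 180° meridian.

```python
import math

EARTH_RADIUS_KM = 6371.009  # same constant as the query in the question

def bounding_box(lat_deg, lon_deg, radius_km):
    """Return (lat_lower, lat_upper, lon_lower, lon_upper) in degrees.

    Approximation: a degree of longitude shrinks by cos(latitude),
    so the longitude span is widened accordingly.
    """
    dlat = math.degrees(radius_km / EARTH_RADIUS_KM)
    dlon = math.degrees(
        radius_km / (EARTH_RADIUS_KM * math.cos(math.radians(lat_deg)))
    )
    return (lat_deg - dlat, lat_deg + dlat, lon_deg - dlon, lon_deg + dlon)

# Hypothetical city coordinates, used only to show the call.
latitude_lower, latitude_upper, longitude_lower, longitude_upper = bounding_box(
    49.25, -123.1, 50
)
print(latitude_lower, latitude_upper, longitude_lower, longitude_upper)
```

The box is only a cheap prefilter that an index on the latitude and longitude columns could serve; the exact radius test from the original query can still be applied to the stations that pass it.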
