SQL - Convert non-null adjacency list to path

I am working with some tables that represent a file system, and I need to select the full path of each folder as a flattened string.
The first table lists the details of each folder:
CREATE TABLE Folders(
FolderID int IDENTITY(1,1) NOT NULL,
[Name] nvarchar(255) NOT NULL)
The second table lists transitive closures of folder relationships:
CREATE TABLE FolderClosures(
FolderClosuresID int IDENTITY(1,1) NOT NULL,
AncestorFolderID int NOT NULL, --Foreign key to Folders.FolderID
DescendantFolderID int NOT NULL, --Foreign key to Folders.FolderID
IsDirect bit NOT NULL)
For sample data, let's assume the following folders exist:
Documents/
Documents/Finance/
Documents/HumanResources/
Documents/HumanResources/Training/
These would be persisted in those tables as follows:
| FolderID | Name |
+----------+----------------+
| 1 | Documents |
| 2 | Finance |
| 3 | HumanResources |
| 4 | Training |
| FolderClosureID | AncestorFolderID | DescendantFolderID | IsDirect |
+-----------------+------------------+--------------------+----------+
| 1 | 1 | 1 | 0 |
| 2 | 2 | 2 | 0 |
| 3 | 1 | 2 | 1 |
| 4 | 3 | 3 | 0 |
| 5 | 1 | 3 | 1 |
| 6 | 4 | 4 | 0 |
| 7 | 1 | 4 | 0 |
| 8 | 3 | 4 | 1 |
Some details to note:
Every folder has an "identity row" in FolderClosures, where AncestorFolderID = DescendantFolderID AND IsDirect = 0.
Every folder that is not a top-level folder has exactly one row in FolderClosures where IsDirect = 1
FolderClosures can contain many rows per folder, where AncestorFolderID <> DescendantFolderID AND IsDirect = 0. Each of these represents a "grandparent" or more distant relationship.
Since no columns are nullable, no rows explicitly state that a given folder is a top-level folder. This can only be discerned by checking that there are no rows in FolderClosures where IsDirect = 1 AND DescendantFolderID = SomeID where SomeID is the ID of the folder in question.
I want to be able to run a query that returns this data:
| FolderID | Path |
+----------+------------------------------------+
| 1 | Documents/ |
| 2 | Documents/Finance/ |
| 3 | Documents/HumanResources/ |
| 4 | Documents/HumanResources/Training/ |
Folders may be nested at unlimited depth, but realistically probably only up to 10 levels. Queries may require returning paths for a few thousand folders.
I've found a lot of advice on creating this type of query when data is persisted as an adjacency list, but I haven't been able to find an answer for a transitive closure setup like this. The adjacency list solutions I've found rely on rows being persisted with nullable parent folder IDs, but that doesn't work here.
How can I get the desired output?
If it helps, I am using SQL Server 2016.

One way to get the desired output is a recursive query. I think the best approach is to use only the rows with IsDirect = 1, and to anchor the recursion on all folders that have no direct parent in FolderClosures, which should be all your root folders.
WITH FoldersCTE AS (
-- Anchor: root folders, i.e. folders with no direct parent
SELECT F.FolderID, CAST(CONCAT(F.Name, N'/') AS NVARCHAR(max)) Path
FROM Folders F
WHERE NOT EXISTS (
SELECT 1 FROM FolderClosures FC WHERE FC.IsDirect = 1 AND FC.DescendantFolderID = F.FolderID
)
UNION ALL
-- Recursive step: append each direct child to its parent's path
SELECT F.FolderID, CAST(CONCAT(PF.Path, F.Name, N'/') AS NVARCHAR(max))
FROM FoldersCTE PF
INNER JOIN FolderClosures FC
ON FC.AncestorFolderID = PF.FolderID
AND FC.IsDirect = 1
INNER JOIN Folders F
ON F.FolderID = FC.DescendantFolderID
)
SELECT FolderID, Path
FROM FoldersCTE
ORDER BY FolderID
OPTION (MAXRECURSION 1000) --> upper bound on how many nested levels you think you will have (the default is 100)
This produces:
FolderID Path
1 Documents/
2 Documents/Finance/
3 Documents/HumanResources/
4 Documents/HumanResources/Training/
Hope it helps.
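To try the recursive approach without a SQL Server instance, here is a minimal sketch that loads the sample data into an in-memory SQLite database through Python's sqlite3 module. It is an adaptation, not the T-SQL above: SQLite spells the CTE as WITH RECURSIVE and concatenates with ||, and the '/' separator with a trailing slash is chosen to match the output requested in the question.

```python
import sqlite3

# In-memory SQLite stand-in for the SQL Server tables in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Folders (FolderID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE FolderClosures (
    FolderClosuresID   INTEGER PRIMARY KEY,
    AncestorFolderID   INTEGER NOT NULL REFERENCES Folders(FolderID),
    DescendantFolderID INTEGER NOT NULL REFERENCES Folders(FolderID),
    IsDirect           INTEGER NOT NULL
);
INSERT INTO Folders VALUES
    (1, 'Documents'), (2, 'Finance'), (3, 'HumanResources'), (4, 'Training');
INSERT INTO FolderClosures VALUES
    (1, 1, 1, 0), (2, 2, 2, 0), (3, 1, 2, 1), (4, 3, 3, 0),
    (5, 1, 3, 1), (6, 4, 4, 0), (7, 1, 4, 0), (8, 3, 4, 1);
""")

PATH_SQL = """
WITH RECURSIVE FoldersCTE(FolderID, Path) AS (
    -- Anchor: folders that have no IsDirect = 1 row pointing at them (roots).
    SELECT F.FolderID, F.Name || '/'
    FROM Folders F
    WHERE NOT EXISTS (
        SELECT 1 FROM FolderClosures FC
        WHERE FC.IsDirect = 1 AND FC.DescendantFolderID = F.FolderID
    )
    UNION ALL
    -- Recursive step: follow only the direct parent -> child rows.
    SELECT F.FolderID, PF.Path || F.Name || '/'
    FROM FoldersCTE PF
    JOIN FolderClosures FC
      ON FC.AncestorFolderID = PF.FolderID AND FC.IsDirect = 1
    JOIN Folders F
      ON F.FolderID = FC.DescendantFolderID
)
SELECT FolderID, Path FROM FoldersCTE ORDER BY FolderID
"""

paths = conn.execute(PATH_SQL).fetchall()
for folder_id, path in paths:
    print(folder_id, path)
```

Only the IsDirect = 1 rows are walked, so the identity rows and the more distant ancestor rows in FolderClosures never produce duplicate paths.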

Related

Recursively walking a DAG in an SQL table

I have a graph which consists of two types of nodes Task and Subtask. Lists of these are stored along with metadata (for now, we can just assume a single metadata string column called "name") in two tables TaskTable and SubTaskTable. Tasks will have subtasks under them which will be connected in the form of a DAG. A Task A might have 5 subtasks st1, st2, st3, st4, and st5 which are connected like so
This is represented in the database like so in a dependencies table.
| for_task | from_subtask | to_subtask |
|----------+--------------+------------|
| A | 1 | 2 |
| A | 1 | 3 |
| A | 2 | 4 |
| A | 3 | 4 |
| A | 4 | 5 |
So far, so good. However, there is now a possibility that another task B will have a similar DAG for its own subtasks, which is okay too.
The third requirement is that a Task should be able to have another Task as a dependency and there should be some way to "expand" the subtask tree when I take the top level task. For example,
And B itself has subtasks like so
I've changed my dependencies table to hold this information as well like so
| for_task | from_task | from_subtask | to_task | to_subtask |
|----------+-----------+--------------+---------+------------|
| A | | 1 | | 2 |
| A | | 1 | | 3 |
| A | | 2 | | 4 |
| A | | 3 | | 4 |
| A | | 4 | | 5 |
| A | | 1 | B | |
| B | | 6 | | 7 |
| B | | 6 | | 8 |
| B | | 7 | | 8 |
| B | | 8 | | 9 |
I have two questions
Is this a good way to store this kind of information?
How would I construct a query that will "expand" the task B when I get all the tasks for task A, and give me the whole list of subtasks?
For 2, I'd expect something like this
| from_task | to_task |
|-----------+---------|
| 1 | 2 |
| 1 | 3 |
| 2 | 4 |
| 3 | 4 |
| 4 | 5 |
| 1 | 6 | <- the subtask here is directly linked to the subtask of B
| 6 | 7 |
| 6 | 8 |
| 7 | 8 |
| 8 | 9 |
No more tasks. Just a tree of subtasks.
This is using postgreSQL if that's relevant.
Instinctively, I would create a first table task which describes every task independently (possibly including the subtasks too, to be decided):
CREATE TABLE task
( name text PRIMARY KEY -- can add hereafter any kind of task attributes as new columns
) ;
Then I would keep your first table dependencies renamed as subtask_dependency with only 3 columns and add a foreign key to task :
CREATE TABLE subtask_dependency
( for_task text
, from_subtask text
, to_subtask text
, CONSTRAINT pk_subtask PRIMARY KEY (for_task, from_subtask, to_subtask)
, CONSTRAINT fk_task FOREIGN KEY (for_task) REFERENCES task (name)
MATCH SIMPLE ON UPDATE CASCADE ON DELETE RESTRICT
) ;
and I would create a second table task_dependency the same way :
CREATE TABLE task_dependency
( from_task text
, from_subtask text
, to_task text
, to_subtask text -- this column is not mandatory but it is helpful to store the root subtask of to_task
, CONSTRAINT pk_task PRIMARY KEY (from_task, from_subtask, to_task)
, CONSTRAINT fk_fromtask FOREIGN KEY (from_task) REFERENCES task (name)
MATCH SIMPLE ON UPDATE CASCADE ON DELETE RESTRICT
, CONSTRAINT fk_totask FOREIGN KEY (to_task) REFERENCES task (name)
MATCH SIMPLE ON UPDATE CASCADE ON DELETE RESTRICT
) ;
Then the query to get your expected result with only one level of nested task :
SELECT from_subtask, to_subtask
FROM subtask_dependency AS s
WHERE for_task = 'A'
UNION ALL
SELECT from_subtask, to_subtask
FROM task_dependency
WHERE from_task = 'A'
UNION ALL
SELECT s.from_subtask, s.to_subtask
FROM subtask_dependency AS s
INNER JOIN task_dependency AS t -- only subtasks of the tasks that 'A' depends on
ON t.to_task = s.for_task
WHERE t.from_task = 'A'
And more generally, the query to get your expected result with multi level nested tasks :
WITH RECURSIVE list (from_task, from_subtask, to_task, to_subtask) AS
( SELECT NULL, NULL, from_task, from_subtask
FROM task_dependency
WHERE from_task = 'A'
UNION ALL
SELECT t.from_task, t.from_subtask, t.to_task, t.to_subtask
FROM list AS l
INNER JOIN task_dependency AS t
ON t.from_task = l.to_task
AND t.from_subtask = l.to_subtask
)
SELECT s.from_subtask, s.to_subtask
FROM list AS l
INNER JOIN subtask_dependency AS s
ON s.for_task = l.to_task
UNION ALL
SELECT l.from_subtask, l.to_subtask
FROM list AS l
WHERE l.from_subtask IS NOT NULL
ORDER BY 1, 2
see dbfiddle
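As a way to experiment with the two-table design, here is a minimal sketch that runs against an in-memory SQLite database from Python. It is a simplified variant, not the PostgreSQL query above: it first collects every task reachable from 'A' (the CTE name reachable is an assumption for illustration), then returns the subtask edges of those tasks plus the edges that link one task's subtask to another task's root subtask. The row ('A', '1', 'B', '6') encodes the question's "1 -> B" dependency with 6 as B's root subtask.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE task (name TEXT PRIMARY KEY);
CREATE TABLE subtask_dependency (
    for_task TEXT REFERENCES task(name),
    from_subtask TEXT,
    to_subtask TEXT,
    PRIMARY KEY (for_task, from_subtask, to_subtask)
);
CREATE TABLE task_dependency (
    from_task TEXT REFERENCES task(name),
    from_subtask TEXT,
    to_task TEXT REFERENCES task(name),
    to_subtask TEXT,  -- root subtask of to_task
    PRIMARY KEY (from_task, from_subtask, to_task)
);
INSERT INTO task VALUES ('A'), ('B');
INSERT INTO subtask_dependency VALUES
    ('A', '1', '2'), ('A', '1', '3'), ('A', '2', '4'), ('A', '3', '4'), ('A', '4', '5'),
    ('B', '6', '7'), ('B', '6', '8'), ('B', '7', '8'), ('B', '8', '9');
INSERT INTO task_dependency VALUES ('A', '1', 'B', '6');
""")

EXPAND_SQL = """
WITH RECURSIVE reachable(task) AS (
    SELECT :root
    UNION  -- UNION (not UNION ALL) drops repeats, so a cycle cannot loop forever
    SELECT t.to_task
    FROM reachable AS r
    JOIN task_dependency AS t ON t.from_task = r.task
)
-- edges inside each reachable task
SELECT s.from_subtask, s.to_subtask
FROM subtask_dependency AS s
JOIN reachable AS r ON r.task = s.for_task
UNION ALL
-- edges that cross from one task into the root subtask of another
SELECT t.from_subtask, t.to_subtask
FROM task_dependency AS t
JOIN reachable AS r ON r.task = t.from_task
ORDER BY 1, 2
"""

edges = conn.execute(EXPAND_SQL, {"root": "A"}).fetchall()
for from_subtask, to_subtask in edges:
    print(from_subtask, "->", to_subtask)
```

Because the recursion walks task names rather than individual rows, a task that is reached through several dependencies is still expanded only once.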

Joining two tables with grouped elements in Oracle sql

So I have the following two tables (simplified):
Table 1: FOLDERS
ID | DESC_FOLDER | TEMPLATE_ID
---------------------------------
... | ... | ...
20 | Folder 1 | 52
21 | Folder 2 | 55
... | | ...
Table 2: TEMPLATES
ID | DESC_TEMPLATE | GROUP
-----------------------------
... | ... | ...
51 | Template 1 | abc
52 | Template 2 | abc
53 | Template 3 | abc
54 | Template 4 | abc
55 | Template 5 | NULL
... | ... | ...
The result should be a list with all the templates and their corresponding folder.
Expected Result:
DESC_TEMPLATE | DESC_FOLDER
---------------------------
Template 1 | Folder 1
Template 2 | Folder 1
Template 3 | Folder 1
Template 4 | Folder 1
Template 5 | Folder 2
I have problems with the grouped templates, because only one template of each group is connected to the folder. The following sql command obviously only returns the templates directly connected to the folder. How to extend my command to get the desired output?
Select
T.DESC_TEMPLATE,
F.DESC_FOLDER
from
TEMPLATES T,
FOLDERS F
where
T.ID = F.TEMPLATE_ID
Thanks a lot for your help!
I think a window function will solve your problem:
Select T.DESC_TEMPLATE,
-- GROUP is a reserved word in Oracle, so the column name has to be quoted
MAX(F.DESC_FOLDER) OVER (PARTITION BY T."GROUP") as DESC_FOLDER
from TEMPLATES T left join
FOLDERS F
on T.ID = F.TEMPLATE_ID;
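The window-function idea can be checked on the sample rows with a small sketch using SQLite through Python (window functions need SQLite 3.25 or newer). One detail here is an assumption that goes beyond the answer: the partition key falls back to the template's own ID when the group is NULL, so that ungrouped templates do not all share one partition and borrow each other's folder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TEMPLATES (ID INTEGER PRIMARY KEY, DESC_TEMPLATE TEXT, "GROUP" TEXT);
CREATE TABLE FOLDERS (ID INTEGER PRIMARY KEY, DESC_FOLDER TEXT, TEMPLATE_ID INTEGER);
INSERT INTO TEMPLATES VALUES
    (51, 'Template 1', 'abc'), (52, 'Template 2', 'abc'),
    (53, 'Template 3', 'abc'), (54, 'Template 4', 'abc'),
    (55, 'Template 5', NULL);
INSERT INTO FOLDERS VALUES (20, 'Folder 1', 52), (21, 'Folder 2', 55);
""")

SPREAD_SQL = """
SELECT T.DESC_TEMPLATE,
       -- MAX ignores NULLs, so the one folder attached to any member of the
       -- group is spread to every template in that group.
       MAX(F.DESC_FOLDER) OVER (
           PARTITION BY COALESCE(T."GROUP", 'id:' || T.ID)
       ) AS DESC_FOLDER
FROM TEMPLATES T
LEFT JOIN FOLDERS F ON T.ID = F.TEMPLATE_ID
ORDER BY T.ID
"""

template_folders = conn.execute(SPREAD_SQL).fetchall()
for template, folder in template_folders:
    print(template, "|", folder)
```

If a group can be attached to more than one folder, MAX silently picks one of them, so this only fits data where each group has a single folder.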

View Table over Language/Client/Status Table

I would like to simplify my data with a view, MainView, but I am having a hard time figuring it out.
I have a Fact table that is specific to clients, language, and status. The ID in the Fact table comes from a FactLink table that just has a FactLinkID column. The Status table has an Order column that needs to be shown in the aggregate view instead of the StatusID. My Main table references the Fact table in multiple columns.
The end goal will be to be able to query the view table by the compound index of LanguageID, StatusOrder, ClientID more simply than I was before, grabbing the largest specified StatusOrder and the specified ClientID or ClientID 1. So, that is what I was hoping to simplify with the view table.
So,
Main
ID | DescriptionID | DisclaimerID | Other
----+---------------+--------------+-------------
50 | 1 | 2 | Blah
55 | 4 | 3 | Blah Blah
Fact
FactID | LanguageID | StatusID | ClientID | Description
-------+------------+----------+----------+------------
1 | 1 | 1 | 1 | Some text
1 | 2 | 1 | 1 | Otro texto
1 | 1 | 3 | 2 | Modified text
2 | 1 | 1 | 1 | Disclaimer1
3 | 1 | 1 | 1 | Disclaimer2
4 | 1 | 1 | 1 | Some text 2
FactLink
ID
--
1
2
3
4
Status
ID | Order
---+------
1 | 10
2 | 100
3 | 20
MainView
MainID | StatusOrder | LanguageID | ClientID | Description | Disclaimer | Other
-------+-------------+------------+----------+---------------+-------------+------
50 | 10 | 1 | 1 | Some text | Disclaimer1 | Blah
50 | 10 | 2 | 1 | Otro texto | NULL | Blah
50 | 20 | 1 | 2 | Modified text | NULL | Blah
55 | 10 | 1 | 1 | Some text 2 | Disclaimer2 | Blah Blah
Here's how I implemented it with just a single column that references the Fact table:
DROP VIEW IF EXISTS dbo.KeywordView
GO
CREATE VIEW dbo.KeywordView
WITH SCHEMABINDING
AS
SELECT t.KeywordID, f.ClientID, f.Description Keyword, f.LanguageID, s.[Order] StatusOrder
FROM dbo.Keyword t
JOIN dbo.Fact f
ON f.FactLinkID = t.KeywordID
JOIN dbo.Status s
ON f.StatusID = s.StatusID
GO
CREATE UNIQUE CLUSTERED INDEX KeywordIndex
ON dbo.KeywordView (KeywordID, ClientID, LanguageID, StatusOrder)
My previous query queried for everything except for that StatusOrder. But adding in the StatusOrder seems to complicate things. Here's my previous query without the StatusOrder. When I created a view on a table with just a single Fact linked column it greatly simplified things, but extending that to two or more columns has proven difficult!
SELECT
Main.ID,
COALESCE(fDescription.Description, dfDescription.Description) Description,
COALESCE(fDisclaimer.Description, dfDisclaimer.Description) Disclaimer,
Main.Other
FROM Main
LEFT OUTER JOIN Fact fDescription
ON fDescription.FactLinkID = Main.DescriptionID
AND fDescription.ClientID = @clientID
AND fDescription.LanguageID = @langID
AND fDescription.StatusID = @statusID -- This actually needs to get the largest `StatusOrder`, not the `StatusID`.
LEFT OUTER JOIN Fact dfDescription
ON dfDescription.FactLinkID = Main.DescriptionID
AND dfDescription.ClientID = 1
AND dfDescription.LanguageID = @langID
AND dfDescription.StatusID = @statusID
... -- Same for Disclaimer
WHERE Main.ID = 50
I'm not sure if this is the most performant or elegant way to solve this problem, but I finally thought of a way to do it. The problem with the solution below is that it can no longer be indexed, so the next step is to figure out how to do that without having to wrap it in a derived table.
SELECT
x.ID,
x.StatusOrder,
x.LanguageID,
x.ClientID,
x.Other,
MAX(x.Description) AS Description,
MAX(x.Disclaimer) AS Disclaimer
FROM (
-- One row per Description fact, with an empty Disclaimer slot
SELECT
Main.ID,
s.[Order] AS StatusOrder,
f.LanguageID,
f.ClientID,
f.Description,
NULL Disclaimer,
Main.Other
FROM Main
JOIN Fact f
ON f.FactID = Main.DescriptionID
JOIN Status s ON s.StatusID = f.StatusID
UNION ALL
-- One row per Disclaimer fact, with an empty Description slot
SELECT
Main.ID,
s.[Order] AS StatusOrder,
f.LanguageID,
f.ClientID,
NULL Description,
f.Description Disclaimer,
Main.Other
FROM Main
JOIN Fact f
ON f.FactID = Main.DisclaimerID
JOIN Status s ON s.StatusID = f.StatusID
) x
-- MAX skips the NULL placeholders, collapsing both halves into one row
GROUP BY x.ID, x.StatusOrder, x.LanguageID, x.ClientID, x.Other
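To check that the UNION ALL plus GROUP BY pivot yields the MainView rows from the question, here is a minimal sketch on an in-memory SQLite database via Python. The column names (FactID on Fact, StatusID and "Order" on Status) are assumptions pieced together from the sample tables and the KeywordView definition, so adjust them to the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Status (StatusID INTEGER PRIMARY KEY, "Order" INTEGER NOT NULL);
CREATE TABLE Fact (FactID INTEGER, LanguageID INTEGER, StatusID INTEGER,
                   ClientID INTEGER, Description TEXT);
CREATE TABLE Main (ID INTEGER PRIMARY KEY, DescriptionID INTEGER,
                   DisclaimerID INTEGER, Other TEXT);
INSERT INTO Status VALUES (1, 10), (2, 100), (3, 20);
INSERT INTO Fact VALUES
    (1, 1, 1, 1, 'Some text'),     (1, 2, 1, 1, 'Otro texto'),
    (1, 1, 3, 2, 'Modified text'), (2, 1, 1, 1, 'Disclaimer1'),
    (3, 1, 1, 1, 'Disclaimer2'),   (4, 1, 1, 1, 'Some text 2');
INSERT INTO Main VALUES (50, 1, 2, 'Blah'), (55, 4, 3, 'Blah Blah');
""")

MAIN_VIEW_SQL = """
SELECT x.ID, x.StatusOrder, x.LanguageID, x.ClientID,
       MAX(x.Description) AS Description,  -- MAX skips the NULL placeholders
       MAX(x.Disclaimer)  AS Disclaimer,
       x.Other
FROM (
    -- Description facts, with an empty Disclaimer slot
    SELECT m.ID, s."Order" AS StatusOrder, f.LanguageID, f.ClientID,
           f.Description AS Description, NULL AS Disclaimer, m.Other
    FROM Main m
    JOIN Fact f   ON f.FactID = m.DescriptionID
    JOIN Status s ON s.StatusID = f.StatusID
    UNION ALL
    -- Disclaimer facts, with an empty Description slot
    SELECT m.ID, s."Order", f.LanguageID, f.ClientID,
           NULL, f.Description, m.Other
    FROM Main m
    JOIN Fact f   ON f.FactID = m.DisclaimerID
    JOIN Status s ON s.StatusID = f.StatusID
) AS x
GROUP BY x.ID, x.StatusOrder, x.LanguageID, x.ClientID, x.Other
ORDER BY x.ID, x.StatusOrder, x.LanguageID
"""

main_view = conn.execute(MAIN_VIEW_SQL).fetchall()
for row in main_view:
    print(row)
```

Each additional Fact-linked column on Main would need one more UNION ALL branch and one more MAX column, which is the main cost of this shape.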

How to add data or change schema to production database

I am new to working with databases and I want to make sure I understand the best way to add or remove data from a database without making a mess of any related data.
Here is a scenario I am working with:
I have a Tags table with an identity ID column. The tags can be selected via the web application to categorize stories that are submitted by a user. When the database was first seeded, like tags were seeded together in order. As you can see, all the Campuses (cities) are 1-4, the Colleges (subjects) are 5-7, and the Populations are 8-11.
If this database is live in production and the client wants to add a new Campus (City) tag, what is the best way to do this?
All the other city tags are grouped at the top, so it seems like the only option is to insert any new tags at the bottom of the table, where they will take whatever ID is available next. I suppose this is fine, because the DisplayCategory column tells us which category each new tag actually belongs to.
Is this typical? Are there better ways to set up the database or handle this situation so that everything remains more organized?
Thank you
+----+------------------+---------------+-----------------+--------------+--------+----------+
| ID | DisplayName | DisplayDetail | DisplayCategory | DisplayOrder | Active | ParentID |
+----+------------------+---------------+-----------------+--------------+--------+----------+
| 1 | Albany | NULL | 1 | 0 | 1 | NULL |
| 2 | Buffalo | NULL | 1 | 1 | 1 | NULL |
| 3 | New York City | NULL | 1 | 2 | 1 | NULL |
| 4 | Syracuse | NULL | 1 | 3 | 1 | NULL |
| 5 | Business | NULL | 2 | 0 | 1 | NULL |
| 6 | Dentistry | NULL | 2 | 1 | 1 | NULL |
| 7 | Law | NULL | 2 | 2 | 1 | NULL |
| 8 | Student-Athletes | NULL | 3 | 0 | 1 | NULL |
| 9 | Alumni | NULL | 3 | 1 | 1 | NULL |
| 10 | Faculty | NULL | 3 | 2 | 1 | NULL |
| 11 | Staff | NULL | 3 | 3 | 1 | NULL |
+----+------------------+---------------+-----------------+--------------+--------+----------+
The terms "top" and "bottom" aren't really applicable here. "Albany" isn't at the "top" of the table; it's merely at the top of the result you see when you query the table without specifying a sort order. Without an ORDER BY, the database returns rows in whatever order is convenient for it (often following the ID or the physical storage order), and that order is not guaranteed.
Data in the table isn't inherently ordered. If you want to view your tags organized by category, simply order your query by DisplayCategory (and probably by DisplayOrder afterwards), and you'll see your data properly organized. You can also wrap the query in a view for convenience, but keep the ORDER BY in the query that reads from the view, since the row order of a view is generally not guaranteed either.
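A small sketch of that point, using an in-memory SQLite database from Python rather than the production engine: a new campus is appended and simply takes the next ID, yet it lists alongside the other campuses once the query sorts by DisplayCategory and DisplayOrder. The tag name 'Rochester' and the trimmed column list are hypothetical, chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Tags (
    ID INTEGER PRIMARY KEY AUTOINCREMENT,  -- stands in for the identity column
    DisplayName TEXT NOT NULL,
    DisplayCategory INTEGER NOT NULL,
    DisplayOrder INTEGER NOT NULL
)
""")

seed = [
    ("Albany", 1, 0), ("Buffalo", 1, 1), ("New York City", 1, 2), ("Syracuse", 1, 3),
    ("Business", 2, 0), ("Dentistry", 2, 1), ("Law", 2, 2),
    ("Student-Athletes", 3, 0), ("Alumni", 3, 1), ("Faculty", 3, 2), ("Staff", 3, 3),
]
insert_sql = "INSERT INTO Tags (DisplayName, DisplayCategory, DisplayOrder) VALUES (?, ?, ?)"
conn.executemany(insert_sql, seed)

# Hypothetical new campus: appended later, so it receives the next free ID.
conn.execute(insert_sql, ("Rochester", 1, 4))
new_id = conn.execute(
    "SELECT ID FROM Tags WHERE DisplayName = 'Rochester'"
).fetchone()[0]

# The ID plays no part in presentation; the ORDER BY decides what is shown where.
ordered_names = [
    name
    for (name,) in conn.execute(
        "SELECT DisplayName FROM Tags ORDER BY DisplayCategory, DisplayOrder"
    )
]
print(new_id, ordered_names)
```

The ID stays a meaningless surrogate key, and all grouping lives in the two display columns.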

Is there a single query that can update a "sequence number" across multiple groups?

Given a table like below, is there a single-query way to update the table from this:
| id | type_id | created_at | sequence |
|----|---------|------------|----------|
| 1 | 1 | 2010-04-26 | NULL |
| 2 | 1 | 2010-04-27 | NULL |
| 3 | 2 | 2010-04-28 | NULL |
| 4 | 3 | 2010-04-28 | NULL |
To this (note that created_at is used for ordering, and sequence is "grouped" by type_id):
| id | type_id | created_at | sequence |
|----|---------|------------|----------|
| 1 | 1 | 2010-04-26 | 1 |
| 2 | 1 | 2010-04-27 | 2 |
| 3 | 2 | 2010-04-28 | 1 |
| 4 | 3 | 2010-04-28 | 1 |
I've seen some code before that used an @ variable like the following, that I thought might work:
SET @seq = 0;
UPDATE `log` SET `sequence` = @seq := @seq + 1
ORDER BY `created_at`;
But that obviously doesn't reset the sequence to 1 for each type_id.
If there's no single-query way to do this, what's the most efficient way?
Data in this table may be deleted, so I'm planning to run a stored procedure after the user is done editing to re-sequence the table.
You can use another variable storing the previous type_id (@type_id). The query is ordered by type_id, so whenever there is a change in type_id, sequence has to be reset to 1 again.
Set @seq = 0;
Set @type_id = -1;
Update `log`
Set `sequence` = If(@type_id=(@type_id:=`type_id`), (@seq:=@seq+1), (@seq:=1))
Order By `type_id`, `created_at`;
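I don't know MySQL very well, but you could use a subquery, though it may be very slow. MySQL does not let an UPDATE select from its own target table in a plain subquery, so the count goes into a derived table that is joined back:
UPDATE `log`
JOIN (
-- for each row, count the earlier rows of the same type
SELECT l1.id, COUNT(l2.id) + 1 AS seq
FROM `log` AS l1
LEFT JOIN `log` AS l2
ON l2.type_id = l1.type_id AND l2.created_at < l1.created_at
GROUP BY l1.id
) AS ranked ON ranked.id = `log`.id
SET `log`.`sequence` = ranked.seq;
You'll get duplicate sequences, though, if two rows with the same type_id have the same created_at date.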
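The counting idea can be made tie-safe by breaking ties on id. Here is a minimal sketch against an in-memory SQLite database from Python; SQLite accepts a correlated subquery on the table being updated, which MySQL does not, so treat this as an illustration of the logic, not as MySQL code. Row 5 is a hypothetical extra row added to show a created_at tie within one type_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE log (
    id INTEGER PRIMARY KEY,
    type_id INTEGER NOT NULL,
    created_at TEXT NOT NULL,
    sequence INTEGER
);
INSERT INTO log (id, type_id, created_at) VALUES
    (1, 1, '2010-04-26'),
    (2, 1, '2010-04-27'),
    (3, 2, '2010-04-28'),
    (4, 3, '2010-04-28'),
    (5, 1, '2010-04-27');  -- hypothetical: same type_id and date as row 2
""")

# A row's sequence is the number of rows of the same type that sort at or
# before it, ordering by created_at and using id as the tie-breaker.
conn.execute("""
UPDATE log
SET sequence = (
    SELECT COUNT(*)
    FROM log AS l2
    WHERE l2.type_id = log.type_id
      AND (l2.created_at < log.created_at
           OR (l2.created_at = log.created_at AND l2.id <= log.id))
)
""")

sequences = dict(conn.execute("SELECT id, sequence FROM log"))
print(sequences)
```

Like the subquery answer, this does quadratic work per type_id, so it suits an occasional re-sequencing pass more than a hot path.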