Flatten tree structure represented in SQL [duplicate]

I'm using an engineering calculation package and trying to extract some information from it with a built-in reporting tool that allows SQL queries.
An abbreviated example of the SQL table is as follows:
Id | Description | Ref
---|---------------------
1 | system 1 |
3 | block 4 | 6
3 | block 4 | 1
5 | formula1 | 3
6 | f |
7 | something | 1
9 | cheese | 5
The "Ref" column identifies rows that are subrecords of other items.
What I want to do is run a query that produces a list showing all items that appear on each page. As you can see from the table above, "ID" is not a unique key; each item can appear in multiple locations within the table. In the example above:
ID 5 is a subitem of ID 3
ID 3 is a subitem of ID 1 AND ID 6
ID 1 and ID 6 aren't subitems of anything
So effectively it is representing a tree structure:
ID 1
+---- ID 7
+---- ID 3
      +---- ID 5
            +---- ID 9
ID 6
+---- ID 3
      +---- ID 5
            +---- ID 9
What I'm hoping to do is work out which items appear under each top-level item (so the end result should be a table where only top-level items appear in the "Ref" column):
Id | Description | Ref
---|---------------------
1 | system 1 |
3 | block 4 | 6
3 | block 4 | 1
5 | formula1 | 1
5 | formula1 | 6
6 | f |
9 | cheese | 1
9 | cheese | 6
7 | something | 1
The tree structure can be a total of 5 levels deep.
I've been trying to use left joins to build up a list of page references, but I think I'm also going to need to union the result tables (because rows like ID=9, ID=5, and ID=6 have to be duplicated in the final result set). It starts to get a bit messy!
WITH A AS (
    SELECT *
    FROM [RbdBlocks]
),
B AS (
    SELECT [x].[Id],
           [x].[Description],
           [x].[Page] AS Page1,
           [y].[Page] AS Page2
    FROM A AS x
    LEFT OUTER JOIN A AS y
        ON y.Id = x.Page
)
SELECT *
FROM B
The above gives me some of the nested references, but I'm not sure whether there's a better way to pull this data together and handle the recursion, rather than just duplicating the set of queries 4 times.

Have a look at Recursive Common Table Expressions (CTEs). They should be able to accomplish exactly what you need.
Example D on the SQL Docs page is a good starting point.
Basically what you'd do in your case is:
In the "anchor member" of the CTE, select all top-level items
In the "recursive member" of the CTE, join all of the nested children to the top-level item
Recursive CTEs are not really trivial to understand, so be sure to read the docs carefully.
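The anchor/recursive split described above can be tried out with a minimal sketch. It runs the query through Python's built-in sqlite3 module so that it is self-contained; the table name items and the CTE column Root are assumptions made for illustration, and the data is the sample table from the question. SQLite and PostgreSQL spell the construct WITH RECURSIVE, while SQL Server uses plain WITH for the same thing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "items" is a hypothetical name for the table shown in the question.
conn.execute("CREATE TABLE items (Id INTEGER, Description TEXT, Ref INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        (1, "system 1", None),
        (3, "block 4", 6),
        (3, "block 4", 1),
        (5, "formula1", 3),
        (6, "f", None),
        (7, "something", 1),
        (9, "cheese", 5),
    ],
)

FLATTEN = """
WITH RECURSIVE flat (Id, Description, Root) AS (
    -- anchor member: top-level items are their own root
    SELECT Id, Description, Id FROM items WHERE Ref IS NULL
    UNION ALL
    -- recursive member: a child row inherits the root of its parent row
    SELECT c.Id, c.Description, f.Root
    FROM items AS c
    JOIN flat AS f ON c.Ref = f.Id
)
SELECT Id, Description,
       CASE WHEN Id = Root THEN NULL ELSE Root END AS Ref
FROM flat
ORDER BY Id, Ref
"""

rows = conn.execute(FLATTEN).fetchall()
for row in rows:
    print(row)
```

On the sample data this yields the nine rows of the desired result table, with each nested item listed once per top-level item it sits under. The sketch assumes the data contains no cycles; a cyclic Ref chain would make the recursion run forever.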

Related

Minimizing a graph with SQL

Suppose we have a directed graph defined as following:
node | neighbor
-----------------
1 | 2
1 | 3
2 | 4
2 | 3
3 | 4
The table above defines only the edges between pairs of nodes; a pair such as (1,2) means that nodes 1 and 2 are connected by an edge.
I also have a table with the transitive closure of the graph. It holds all possible paths of the graph (for example, (1,3) is present twice because node 3 can be reached from node 1 either directly or via the path 1=>2=>3). Here is the table:
node | neighbor
-----------------
1 | 2
1 | 3
2 | 4
2 | 3
3 | 4
1 | 3
1 | 4
1 | 4
2 | 4
From these two tables, I want to return a minimized graph without losing any reachability. One idea was to return only the edges that are not implied by other edges in the two tables. Here's an example:
(1,2) is in the first table and (2,3) is in the second, and therefore (1,3) can be deleted from the first table because you can reach node 3 from node 1 by passing through node 2.
The output table should then look like this:
node | neighbor
-----------------
1 | 2
2 | 3
3 | 4
How can I write an SQL query that does this?
Here is one approach:
with recursive cte as (
    select node, neighbor, 1 is_initial from graph
    union all
    select c.node, g.neighbor, 0
    from cte c
    inner join graph g on g.node = c.neighbor
)
select node, neighbor
from graph g
where not exists (
    select 1
    from cte c
    where c.node = g.node and c.neighbor = g.neighbor and c.is_initial = 0
)
order by node, neighbor
This uses the first table only (I called it graph). We start by generating all possible paths with a recursive query. This is quite similar to your closure table, but with one extra column, is_initial, that indicates whether the path comes from the original table, or was generated during a further iteration.
Then, all that is left to do is filter the graph to remove tuples that match a "non-initial" path.
Demo on DB Fiddle:
node | neighbor
---: | -------:
1 | 2
2 | 3
3 | 4

Find records which have multiple occurrences in another table array (postgres)

I have a table whose records contain arrays, and another table that holds single string values. I want to get the records that occur multiple times in the other table. The tables are as follows:
Vehicle
veh_id | vehicle_types
-------+---------------------------------------
1 | {"byd_tang","volt","viper","laferrari"}
2 | {"volt","viper"}
3 | {"byd_tang","sonata","jaguarxf"}
4 | {"swift","teslax","mirai"}
5 | {"volt","viper"}
6 | {"viper","ferrariff","bmwi8","viper"}
7 | {"ferrariff","viper","viper","volt"}
vehicle_names
id | vehicle_name
-----+-----------------------
1 | byd_tang
2 | volt
3 | viper
4 | laferrari
5 | sonata
6 | jaguarxf
7 | swift
8 | teslax
9 | mirai
10 | ferrariff
11 | bmwi8
I have a query that gives the output I expect, but it's not optimal and may be expensive.
This is the query:
select veh_name
from vehicle_names dsb
where (select count(*)
       from vehicle dsd
       where dsb.veh_name = ANY (dsd.veh_types)) > 1
The output should be:
byd_tang
volt
viper
One option would be an aggregation query:
SELECT
    vn.id,
    vn.veh_name
FROM vehicle_names vn
INNER JOIN vehicle v
    ON vn.veh_name = ANY (v.veh_types)
GROUP BY
    vn.id,
    vn.veh_name
HAVING
    COUNT(*) > 1;
This only counts a vehicle name which appears in two or more records in the other table. It would not pick up, for example, a single vehicle record with the same name appearing two or more times.

Recursive self join over file data

I know there are many questions about recursive self joins, but they're mostly in a hierarchical data structure as follows:
ID | Value | Parent id
-----------------------------
But I was wondering if there was a way to do this in a specific case that I have where I don't necessarily have a parent id. My data will look like this when I initially load the file.
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,...
2 | *,record,abc,efg,hij,...
3 | ,,1,x,y,z,...
4 | ,,2,q,r,s,...
5 | 3,Formula,5,6,7,8,...
6 | *,record,lmn,opq,rst,...
7 | ,,1,t,u,v,...
8 | ,,2,l,m,n,...
Essentially, it's a CSV file where each row in the table is a line in the file. Lines 1 and 5 identify an object header, and lines 3, 4, 7, and 8 identify the rows belonging to the object. The object header lines can have only 40 attributes, which is why the object is broken up across multiple sections in the CSV file.
What I'd like to do is take the table, separate out the record # column, and join the table with itself multiple times to achieve something like this:
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,5,6,7,8,...
2 | *,record,abc,efg,hij,lmn,opq,rst
3 | ,,1,x,y,z,t,u,v,...
4 | ,,2,q,r,s,l,m,n,...
I know it's probably possible; I'm just not sure where to start. My initial idea was to create a view that separates out the first and second columns, and to use that view to join repeatedly on those two columns. However, I have some problems:
I don't know how many sections will occur in the file for the same object.
The file can contain other objects as well, so joining on the first two columns would be problematic if you have something like:
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,...
2 | *,record,abc,efg,hij,...
3 | ,,1,x,y,z,...
4 | ,,2,q,r,s,...
5 | 3,Formula,5,6,7,8,...
6 | *,record,lmn,opq,rst,...
7 | ,,1,t,u,v,...
8 | ,,2,l,m,n,...
9 | ,4,Data,1,2,3,4,...
10 | *,record,lmn,opq,rst,...
11 | ,,1,t,u,v,...
In the above case, my plan could join rows from the Data object in row 9 with the first rows of the Formula object by matching the record value of 1.
UPDATE
I know this is somewhat confusing. I tried doing this with C# a while back, but I basically had to write a recursive descent parser for the specific file format, and it simply took too long because I had to get the data into the database afterwards and it was too much for Entity Framework. It was taking hours just to convert one file, since these files are excessively large.
Either way, @Nolan Shang has the closest result to what I want. The only difference is this (sorry for the bad formatting):
+----+------------+--------------------------+----------------------------------+
| ID | header     | x                        | value                            |
+----+------------+--------------------------+----------------------------------+
| 1  | 3,Formula, | ,1,2,3,4,5,6,7,8         | 3,Formula,1,2,3,4,5,6,7,8        |
| 2  | ,,         | ,1,x,y,z,t,u,v           | ,1,x,y,z,t,u,v                   |
| 3  | ,,         | ,2,q,r,s,l,m,n           | ,2,q,r,s,l,m,n                   |
| 4  | *,record,  | ,abc,efg,hij,lmn,opq,rst | *,record,abc,efg,hij,lmn,opq,rst |
| 5  | ,4,        | ,Data,1,2,3,4            | ,4,Data,1,2,3,4                  |
| 6  | *,record,  | ,lmn,opq,rst             | ,lmn,opq,rst                     |
| 7  | ,,         | ,1,t,u,v                 | ,1,t,u,v                         |
+----+------------+--------------------------+----------------------------------+
I agree that it would be better to export this to a scripting language and do it there. This will be a lot of work in TSQL.
You've intimated that there are other possible scenarios you haven't shown, so I obviously can't give a comprehensive solution. I'm guessing this isn't something you need to do quickly on a repeated basis. More of a one-time transformation, so performance isn't an issue.
One approach would be to do a LEFT JOIN to a hard-coded table of the possible identifying sub-strings like:
3,Formula,
*,record,
,,1,
,,2,
,4,Data,
Looks like it pretty much has to be human-selected and hard-coded because I can't find a reliable pattern that can be used to SELECT only these sub-strings.
Then you SELECT from this artificially-created table (or derived table, or CTE) and LEFT JOIN to your actual table with a LIKE to get all the rows that use each of these values as their starting substring, strip out the starting characters to get the rest of the string, and use the STUFF..FOR XML trick to build the desired Line.
How you get the ID column depends on what you want. For instance, in your second example, I don't know what ID you want for the ,4,Data,... line. Do you want 5 because that's the next number in the results, or do you want 9 because that's the ID of the first occurrence of that sub-string? Code accordingly. If you want 5, it's a ROW_NUMBER(). If you want 9, you can add an ID column to the artificial table you created at the start of this approach.
BTW, there's really nothing recursive about what you need done, so if you're still thinking in those terms, now would be a good time to stop. This is more of a "Group Concatenation" problem.
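To make the "group concatenation" idea concrete, here is a minimal sketch in plain Python rather than T-SQL. It is only an illustration under assumptions: the PREFIXES list plays the role of the hard-coded table of identifying sub-strings, the function name group_concat is made up for this example, and the sample lines are the first eight rows from the question with the trailing ",..." elision left out.

```python
# Hand-picked identifying prefixes, standing in for the hard-coded table.
PREFIXES = ["3,Formula,", "*,record,", ",,1,", ",,2,"]

# First example from the question, without the trailing ",..." elision.
LINES = [
    (1, "3,Formula,1,2,3,4"),
    (2, "*,record,abc,efg,hij"),
    (3, ",,1,x,y,z"),
    (4, ",,2,q,r,s"),
    (5, "3,Formula,5,6,7,8"),
    (6, "*,record,lmn,opq,rst"),
    (7, ",,1,t,u,v"),
    (8, ",,2,l,m,n"),
]


def group_concat(lines, prefixes):
    """Merge the remainders of all lines that share a starting prefix."""
    remainders = {prefix: [] for prefix in prefixes}
    first_id = {}
    for line_id, text in lines:
        for prefix in prefixes:
            if text.startswith(prefix):
                remainders[prefix].append(text[len(prefix):])
                # Remember the ID of the first line seen for this prefix.
                first_id.setdefault(prefix, line_id)
                break
    return [
        (first_id[prefix], prefix + ",".join(parts))
        for prefix, parts in remainders.items()
        if parts
    ]


merged = group_concat(LINES, PREFIXES)
for line_id, line in merged:
    print(line_id, line)
```

Like the SQL approach it mirrors, this merges every line that starts with the same prefix, so a file containing several objects with identical prefixes would still need an extra grouping key.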
Here is a sample, but it differs somewhat from what you need.
That is because I use the value up to the second comma as the group header, so ,,1 and ,,2 are treated as the same group. It would be better if you could use a parent id to indicate a group.
DECLARE @testdata TABLE(ID int, Line varchar(8000))
INSERT INTO @testdata
SELECT 1,'3,Formula,1,2,3,4,...' UNION ALL
SELECT 2,'*,record,abc,efg,hij,...' UNION ALL
SELECT 3,',,1,x,y,z,...' UNION ALL
SELECT 4,',,2,q,r,s,...' UNION ALL
SELECT 5,'3,Formula,5,6,7,8,...' UNION ALL
SELECT 6,'*,record,lmn,opq,rst,...' UNION ALL
SELECT 7,',,1,t,u,v,...' UNION ALL
SELECT 8,',,2,l,m,n,...' UNION ALL
SELECT 9,',4,Data,1,2,3,4,...' UNION ALL
SELECT 10,'*,record,lmn,opq,rst,...' UNION ALL
SELECT 11,',,1,t,u,v,...'
-- header = everything up to and including the second comma; data = the rest, minus the ',...' marker
;WITH t AS(
    SELECT *, REPLACE(SUBSTRING(t.Line, LEN(c.header)+1, LEN(t.Line)), ',...', '') AS data
    FROM @testdata AS t
    CROSS APPLY(VALUES(LEFT(t.Line, CHARINDEX(',', t.Line, CHARINDEX(',', t.Line)+1)))) c(header)
)
-- concatenate the data of all rows that share a header
SELECT MIN(ID) AS ID, t.header, c.x, t.header+STUFF(c.x,1,1,'') AS value
FROM t
OUTER APPLY(SELECT ','+tb.data FROM t AS tb WHERE tb.header=t.header FOR XML PATH('')) c(x)
GROUP BY t.header, c.x
+----+------------+------------------------------------------+-----------------------------------------------+
| ID | header | x | value |
+----+------------+------------------------------------------+-----------------------------------------------+
| 1 | 3,Formula, | ,1,2,3,4,5,6,7,8 | 3,Formula,1,2,3,4,5,6,7,8 |
| 3 | ,, | ,1,x,y,z,2,q,r,s,1,t,u,v,2,l,m,n,1,t,u,v | ,,1,x,y,z,2,q,r,s,1,t,u,v,2,l,m,n,1,t,u,v |
| 2 | *,record, | ,abc,efg,hij,lmn,opq,rst,lmn,opq,rst | *,record,abc,efg,hij,lmn,opq,rst,lmn,opq,rst |
| 9 | ,4, | ,Data,1,2,3,4 | ,4,Data,1,2,3,4 |
+----+------------+------------------------------------------+-----------------------------------------------+

PostgreSQL: Distribute rows evenly and according to frequency

I'm having trouble with a complex ordering problem. I have the following example data:
table "categories"
id | frequency
1 | 0
2 | 4
3 | 0
table "entries"
id | category_id | type
1 | 1 | a
2 | 1 | a
3 | 1 | a
4 | 2 | b
5 | 2 | c
6 | 3 | d
I want to put the entries rows in an order such that category_id and type are distributed evenly.
More precisely, I want to order entries in such a way that:
category_ids that refer to a category with frequency=0 are distributed evenly, so that a row is followed by a different category_id whenever possible, e.g. category_ids of rows: 1,2,1,3,1,2.
Rows with category_ids of categories with frequency<>0 should be inserted starting near the beginning, with a minimum of frequency rows between them (the gaps should vary). In my example these are the rows with category_id=2. So the result could start with row id #1, then #4, then a minimum of 4 rows of other categories, then #5.
In the end result, rows with the same type should not be next to each other.
Example result:
id | category_id | type
1 | 1 | a
4 | 2 | b
2 | 1 | a
6 | 3 | d
.. some other row ..
.. some other row ..
.. some other row ..
5 | 2 | c
Entries are like a stream of things the user gets (one at a time). The whole ordering should give users some variation. It's just there to avoid presenting similar entries all the time, so it doesn't have to be perfect. The query also does not have to give the same result on each call; using random() is totally fine.
Frequencies are there to give entries of certain categories a higher priority, so that they are not distributed across the whole range but are placed more towards the beginning of the result list. Even if there are a lot of these entries, they should not completely crowd out the frequency=0 entries at the beginning, though.
I'm not sure how to start this. I think I can use window functions and ntile() to distribute rows by category_id and type. But I have no idea how to insert the non-0-category entries afterwards.

How to select using WITH RECURSIVE clause [closed]

I have googled and read through some articles, like
this PostgreSQL manual page
or this blog page,
and tried writing queries myself with moderate success (some of them hang, while others work well and fast),
but so far I cannot completely understand how this magic works.
Can anybody give a very clear explanation demonstrating the semantics and execution process of such queries,
preferably based on typical samples like a factorial calculation or a full tree expansion from an (id,parent_id,name) table?
And what are the basic guidelines and typical mistakes that one should know to write good recursive queries?
First of all, let us try to simplify and clarify the algorithm description given on the manual page. To simplify it, consider only union all in the with recursive clause for now (and union later):
WITH RECURSIVE pseudo-entity-name(column-names) AS (
Initial-SELECT
UNION ALL
Recursive-SELECT using pseudo-entity-name
)
Outer-SELECT using pseudo-entity-name
To clarify it let us describe query execution process in pseudo code:
working-recordset = result of Initial-SELECT
append working-recordset to empty outer-recordset
while( working-recordset is not empty ) begin
    new working-recordset = result of Recursive-SELECT
        taking previous working-recordset as pseudo-entity-name
    append working-recordset to outer-recordset
end
overall-result = result of Outer-SELECT
    taking outer-recordset as pseudo-entity-name
Or, even shorter: the database engine executes the initial select, taking its result rows as the working set. Then it repeatedly executes the recursive select on the working set, each time replacing the contents of the working set with the query result obtained. This process ends when the recursive select returns an empty set. All result rows, given first by the initial select and then by the recursive select, are gathered and fed to the outer select, whose result becomes the overall query result.
This query is calculating factorial of 3:
WITH RECURSIVE factorial(F,n) AS (
SELECT 1 F, 3 n
UNION ALL
SELECT F*n F, n-1 n from factorial where n>1
)
SELECT F from factorial where n=1
The initial select SELECT 1 F, 3 n gives us the initial values: 3 for the argument and 1 for the function value.
The recursive select SELECT F*n F, n-1 n from factorial where n>1 states that each time we multiply the last function value by the last argument value and decrement the argument value.
The database engine executes it like this:
First of all, it executes the initial select, which gives the initial state of the working recordset:
F | n
--+--
1 | 3
Then it transforms the working recordset with the recursive query and obtains its second state:
F | n
--+--
3 | 2
Then third state:
F | n
--+--
6 | 1
In the third state there is no row that satisfies the n>1 condition in the recursive select, so the fourth working set is empty and the loop exits.
Outer recordset now holds all the rows, returned by initial and recursive select:
F | n
--+--
1 | 3
3 | 2
6 | 1
Outer select filters out all intermediate results from outer recordset, showing only final factorial value which becomes overall query result:
F
--
6
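The evaluation loop from the pseudocode can also be mimicked in a few lines of plain Python, which makes it easy to step through the factorial example by hand. This is only a model of the semantics described above, not how any database engine is implemented; the function names run_recursive and factorial_step are made up for this sketch.

```python
def run_recursive(initial_rows, recursive_step):
    """Model of WITH RECURSIVE ... UNION ALL evaluation."""
    working = list(initial_rows)   # result of the initial select
    outer = list(working)          # outer recordset collects every row
    while working:
        # The recursive select sees only the previous working set.
        working = recursive_step(working)
        outer.extend(working)
    return outer


def factorial_step(working):
    # SELECT F*n, n-1 FROM factorial WHERE n > 1
    return [(f * n, n - 1) for f, n in working if n > 1]


outer = run_recursive([(1, 3)], factorial_step)
# Outer select: SELECT F FROM factorial WHERE n = 1
result = [f for f, n in outer if n == 1]

print(outer)   # [(1, 3), (3, 2), (6, 1)]
print(result)  # [6]
```

Dropping the n > 1 filter from factorial_step makes the while loop spin forever, which is the same failure mode as a recursive select without a stop condition.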
And now let us consider table forest(id,parent_id,name):
id | parent_id | name
---+-----------+-----------------
1 | | item 1
2 | 1 | subitem 1.1
3 | 1 | subitem 1.2
4 | 1 | subitem 1.3
5 | 3 | subsubitem 1.2.1
6 | | item 2
7 | 6 | subitem 2.1
8 | | item 3
'Expanding full tree' here means sorting tree items in human-readable depth-first order while calculating their levels and (maybe) paths. Both tasks (of correct sorting and calculating level or path) are not solvable in one (or even any constant number of) SELECT without using WITH RECURSIVE clause (or Oracle CONNECT BY clause, which is not supported by PostgreSQL). But this recursive query does the job (well, almost does, see the note below):
WITH RECURSIVE fulltree(id,parent_id,level,name,path) AS (
    SELECT id, parent_id, 1 as level, name, name||'' as path
    from forest where parent_id is null
    UNION ALL
    SELECT t.id, t.parent_id, ft.level+1 as level, t.name, ft.path||' / '||t.name as path
    from forest t, fulltree ft where t.parent_id = ft.id
)
SELECT * from fulltree order by path
Database engine executes it like this:
Firstly, it executes the initial select, which gives all highest-level items (roots) from the forest table:
id | parent_id | level | name | path
---+-----------+-------+------------------+----------------------------------------
1 | | 1 | item 1 | item 1
8 | | 1 | item 3 | item 3
6 | | 1 | item 2 | item 2
Then, it executes recursive select, which gives all 2nd level items from forest table:
id | parent_id | level | name | path
---+-----------+-------+------------------+----------------------------------------
2 | 1 | 2 | subitem 1.1 | item 1 / subitem 1.1
3 | 1 | 2 | subitem 1.2 | item 1 / subitem 1.2
4 | 1 | 2 | subitem 1.3 | item 1 / subitem 1.3
7 | 6 | 2 | subitem 2.1 | item 2 / subitem 2.1
Then, it executes the recursive select again, retrieving the 3rd level items:
id | parent_id | level | name | path
---+-----------+-------+------------------+----------------------------------------
5 | 3 | 3 | subsubitem 1.2.1 | item 1 / subitem 1.2 / subsubitem 1.2.1
And now it executes recursive select again, trying to retrieve 4th level items, but there are none of them, so the loop exits.
The outer SELECT sets the correct human-readable row order, sorting on path column:
id | parent_id | level | name | path
---+-----------+-------+------------------+----------------------------------------
1 | | 1 | item 1 | item 1
2 | 1 | 2 | subitem 1.1 | item 1 / subitem 1.1
3 | 1 | 2 | subitem 1.2 | item 1 / subitem 1.2
5 | 3 | 3 | subsubitem 1.2.1 | item 1 / subitem 1.2 / subsubitem 1.2.1
4 | 1 | 2 | subitem 1.3 | item 1 / subitem 1.3
6 | | 1 | item 2 | item 2
7 | 6 | 2 | subitem 2.1 | item 2 / subitem 2.1
8 | | 1 | item 3 | item 3
NOTE: The resulting row order remains correct only as long as no item name contains a punctuation character that collates before /. If we rename Item 2 to Item 1 *, the row order breaks: the renamed item lands between Item 1 and its descendants.
A more stable solution is to use the tab character (E'\t') as the path separator in the query (it can be replaced with a more readable separator later, in the outer select or just before display). Tab-separated paths retain the correct order as long as the item names contain no tabs or other control characters, which can easily be checked and ruled out without loss of usability.
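The effect of the separator can be checked with a small experiment. The sketch below is an illustration under assumptions: it uses SQLite through Python's sqlite3 module instead of PostgreSQL, relies on SQLite's default byte-wise text comparison, passes the separator as a query parameter named sep, and uses the forest table with item 2 renamed to item 1 * as in the note above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE forest (id INTEGER, parent_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO forest VALUES (?, ?, ?)",
    [
        (1, None, "item 1"),
        (2, 1, "subitem 1.1"),
        (3, 1, "subitem 1.2"),
        (4, 1, "subitem 1.3"),
        (5, 3, "subsubitem 1.2.1"),
        (6, None, "item 1 *"),  # "item 2" renamed to trigger the problem
        (7, 6, "subitem 2.1"),
        (8, None, "item 3"),
    ],
)

QUERY = """
WITH RECURSIVE fulltree (id, level, path) AS (
    SELECT id, 1, name FROM forest WHERE parent_id IS NULL
    UNION ALL
    SELECT t.id, ft.level + 1, ft.path || :sep || t.name
    FROM forest AS t
    JOIN fulltree AS ft ON t.parent_id = ft.id
)
-- sort on the raw path, but show a readable separator
SELECT id, replace(path, :sep, ' / ') AS shown
FROM fulltree
ORDER BY path
"""


def ordered_ids(separator):
    rows = conn.execute(QUERY, {"sep": separator}).fetchall()
    return [row[0] for row in rows]


slash_order = ordered_ids(" / ")
tab_order = ordered_ids("\t")
print(slash_order)
print(tab_order)
```

With ' / ' as the separator, the renamed root and its child sort ahead of the descendants of item 1; with the tab separator every subtree stays contiguous under its own root.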
It is very simple to modify the last query to expand an arbitrary subtree: you only need to replace the condition parent_id is null with parent_id=1 (for example). Note that this query variant returns all levels and paths relative to Item 1.
And now about typical mistakes. The most notable mistake specific to recursive queries is defining a faulty stop condition in the recursive select, which results in infinite looping.
For example, if we omit the where n>1 condition in the factorial sample above, the recursive select will never return an empty set (because there is no condition to filter out the single row) and the loop will continue infinitely.
That is the most probable reason why some of your queries hang (the other, non-specific but still possible, reason is a very inefficient select, which executes in finite but very long time).
There are not many RECURSIVE-specific querying guidelines to mention, as far as I know. But I would like to suggest a (rather obvious) step-by-step procedure for building a recursive query.
Separately build and debug your initial select.
Wrap it with the scaffolding WITH RECURSIVE construct and begin building and debugging your recursive select.
The recommended scaffolding construct looks like this:
WITH RECURSIVE rec( <Your column names> ) AS (
<Your ready and working initial SELECT>
UNION ALL
<Recursive SELECT that you are debugging now>
)
SELECT * from rec limit 1000
This simplest outer select will output the whole outer recordset, which, as we know, contains all output rows from the initial select and from every execution of the recursive select in the loop, in their original output order, just like in the samples above! The limit 1000 part will prevent hanging, replacing it with oversized output in which you will be able to see the missed stop point.
After debugging the initial and recursive selects, build and debug your outer select.
And now the last thing to mention: the difference between using union and union all in the with recursive clause. union introduces a row uniqueness constraint, which results in two extra lines in our execution pseudocode:
working-recordset = result of Initial-SELECT
discard duplicate rows from working-recordset /*union-specific*/
append working-recordset to empty outer-recordset
while( working-recordset is not empty ) begin
    new working-recordset = result of Recursive-SELECT
        taking previous working-recordset as pseudo-entity-name
    discard duplicate rows and rows that have duplicates in outer-recordset
        from working-recordset /*union-specific*/
    append working-recordset to outer-recordset
end
overall-result = result of Outer-SELECT
    taking outer-recordset as pseudo-entity-name
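One practical consequence of the union-specific steps is that a traversal over cyclic data terminates, because rows that were already produced are discarded and the working set eventually becomes empty. The sketch below illustrates this under assumptions: it uses SQLite through Python's sqlite3 module, and the edge table with its three-node cycle is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical edge list containing a cycle: 1 -> 2 -> 3 -> 1
conn.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
conn.executemany("INSERT INTO edge VALUES (?, ?)", [(1, 2), (2, 3), (3, 1)])

REACHABLE = """
WITH RECURSIVE reach (node) AS (
    SELECT 1
    UNION  -- with UNION ALL this query would never terminate on the cycle
    SELECT e.dst
    FROM edge AS e
    JOIN reach AS r ON e.src = r.node
)
SELECT node FROM reach ORDER BY node
"""

nodes = [row[0] for row in conn.execute(REACHABLE)]
print(nodes)  # [1, 2, 3]
```

The third iteration produces node 1 again; since it is already in the outer recordset, it is dropped, the working set is empty, and the loop exits.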