How to select using WITH RECURSIVE clause [closed] - sql

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
I have googled and read throug some articles like
this postgreSQL manual page
or this blog page
and tried making queries myself with a moderate success (part of them hangs, while others works good and fast),
but so far I can not completely understand how this magic works.
Can anybody give very clear explanation demonstrating such query semantics and execution process,
better based on typical samples like factorial calculation or full tree expansion from (id,parent_id,name) table?
And what are the basic guidlines and typical mistakes that one should know to make good with recursive queries?

First of all, let us try to simplify and clarify algorithm description given on the manual page. To simplify it consider only union all in with recursive clause for now (and union later):
WITH RECURSIVE pseudo-entity-name(column-names) AS (
Initial-SELECT
UNION ALL
Recursive-SELECT using pseudo-entity-name
)
Outer-SELECT using pseudo-entity-name
To clarify it let us describe query execution process in pseudo code:
working-recordset = result of Initial-SELECT
append working-recordset to empty outer-recordset
while( working-recordset is not empty ) begin
new working-recordset = result of Recursive-SELECT
taking previous working-recordset as pseudo-entity-name
append working-recordset to outer-recordset
end
overall-result = result of Outer-SELECT
taking outer-recordset as pseudo-entity-name
Or even shorter - Database engine executes initial select, taking its result rows as working set. Then it repeatedly executes recursive select on the working set, each time replacing contents of the working set with query result obtained. This process ends when empty set is returned by recursive select. And all result rows given firstly by initial select and then by recursive select are gathered and feeded to outer select, which result becomes overall query result.
This query is calculating factorial of 3:
WITH RECURSIVE factorial(F,n) AS (
SELECT 1 F, 3 n
UNION ALL
SELECT F*n F, n-1 n from factorial where n>1
)
SELECT F from factorial where n=1
Initial select SELECT 1 F, 3 n gives us initial values: 3 for argument and 1 for function value.
Recursive select SELECT F*n F, n-1 n from factorial where n>1 states that every time we need to multiply last funcion value by last argument value and decrement argument value.
Database engine executes it like this:
First of all it executes initail select, which gives the initial state of working recordset:
F | n
--+--
1 | 3
Then it transforms working recordset with recursive query and obtain its second state:
F | n
--+--
3 | 2
Then third state:
F | n
--+--
6 | 1
In the third state there is no row which follows n>1 condition in recursive select, so forth working set is loop exits.
Outer recordset now holds all the rows, returned by initial and recursive select:
F | n
--+--
1 | 3
3 | 2
6 | 1
Outer select filters out all intermediate results from outer recordset, showing only final factorial value which becomes overall query result:
F
--
6
And now let us consider table forest(id,parent_id,name):
id | parent_id | name
---+-----------+-----------------
1 | | item 1
2 | 1 | subitem 1.1
3 | 1 | subitem 1.2
4 | 1 | subitem 1.3
5 | 3 | subsubitem 1.2.1
6 | | item 2
7 | 6 | subitem 2.1
8 | | item 3
'Expanding full tree' here means sorting tree items in human-readable depth-first order while calculating their levels and (maybe) paths. Both tasks (of correct sorting and calculating level or path) are not solvable in one (or even any constant number of) SELECT without using WITH RECURSIVE clause (or Oracle CONNECT BY clause, which is not supported by PostgreSQL). But this recursive query does the job (well, almost does, see the note below):
WITH RECURSIVE fulltree(id,parent_id,level,name,path) AS (
SELECT id, parent_id, 1 as level, name, name||'' as path from forest where parent_id is null
UNION ALL
SELECT t.id, t.parent_id, ft.level+1 as level, t.name, ft.path||' / '||t.name as path
from forest t, fulltree ft where t.parent_id = ft.id
)
SELECT * from fulltree order by path
Database engine executes it like this:
Firstly, it executes initail select, which gives all highest level items (roots) from forest table:
id | parent_id | level | name | path
---+-----------+-------+------------------+----------------------------------------
1 | | 1 | item 1 | item 1
8 | | 1 | item 3 | item 3
6 | | 1 | item 2 | item 2
Then, it executes recursive select, which gives all 2nd level items from forest table:
id | parent_id | level | name | path
---+-----------+-------+------------------+----------------------------------------
2 | 1 | 2 | subitem 1.1 | item 1 / subitem 1.1
3 | 1 | 2 | subitem 1.2 | item 1 / subitem 1.2
4 | 1 | 2 | subitem 1.3 | item 1 / subitem 1.3
7 | 6 | 2 | subitem 2.1 | item 2 / subitem 2.1
Then, it executes recursive select again, retrieving 3d level items:
id | parent_id | level | name | path
---+-----------+-------+------------------+----------------------------------------
5 | 3 | 3 | subsubitem 1.2.1 | item 1 / subitem 1.2 / subsubitem 1.2.1
And now it executes recursive select again, trying to retrieve 4th level items, but there are none of them, so the loop exits.
The outer SELECT sets the correct human-readable row order, sorting on path column:
id | parent_id | level | name | path
---+-----------+-------+------------------+----------------------------------------
1 | | 1 | item 1 | item 1
2 | 1 | 2 | subitem 1.1 | item 1 / subitem 1.1
3 | 1 | 2 | subitem 1.2 | item 1 / subitem 1.2
5 | 3 | 3 | subsubitem 1.2.1 | item 1 / subitem 1.2 / subsubitem 1.2.1
4 | 1 | 2 | subitem 1.3 | item 1 / subitem 1.3
6 | | 1 | item 2 | item 2
7 | 6 | 2 | subitem 2.1 | item 2 / subitem 2.1
8 | | 1 | item 3 | item 3
NOTE: Resulting row order will remain correct only while there are no punctuation characters collation-preceeding / in the item names. If we rename Item 2 in Item 1 *, it will break row order, standing between Item 1 and its descendants.
More stable solution is using tab character (E'\t') as path separator in query (which can be substituted by more readable path separator later: in outer select, before displaing to human or etc). Tab separated paths will retain correct order until there are tabs or control characters in the item names - which easily can be checked and ruled out without loss of usability.
It is very simple to modify last query to expand any arbitrary subtree - you need only to substitute condition parent_id is null with perent_id=1 (for example). Note that this query variant will return all levels and paths relative to Item 1.
And now about typical mistakes. The most notable typical mistake specific to recursive queries is defining ill stop conditions in recursive select, which results in infinite looping.
For example, if we omit where n>1 condition in factorial sample above, execution of recursive select will never give an empty set (because we have no condition to filter out single row) and looping will continue infinitely.
That is the most probable reason why some of your queries hang (the other non-specific but still possible reason is very ineffective select, which executes in finite but very long time).
There are not much RECURSIVE-specific querying guidlines to mention, as far as I know. But I would like to suggest (rather obvious) step by step recursive query building procedure.
Separately build and debug your initial select.
Wrap it with scaffolding WITH RECURSIVE construct
and begin building and debugging your recursive select.
The recommended scuffolding construct is like this:
WITH RECURSIVE rec( <Your column names> ) AS (
<Your ready and working initial SELECT>
UNION ALL
<Recursive SELECT that you are debugging now>
)
SELECT * from rec limit 1000
This simplest outer select will output the whole outer recordset, which, as we know, contains all output rows from initial select and every execution of recusrive select in a loop in their original output order - just like in samples above! The limit 1000 part will prevent hanging, replacing it with oversized output in which you will be able to see the missed stop point.
After debugging initial and recursive select build and debug your outer select.
And now the last thing to mention - the difference in using union instead of union all in with recursive clause. It introduces row uniqueness constraint which results in two extra lines in our execution pseudocode:
working-recordset = result of Initial-SELECT
discard duplicate rows from working-recordset /*union-specific*/
append working-recordset to empty outer-recordset
while( working-recordset is not empty ) begin
new working-recordset = result of Recursive-SELECT
taking previous working-recordset as pseudo-entity-name
discard duplicate rows and rows that have duplicates in outer-recordset
from working-recordset /*union-specific*/
append working-recordset to outer-recordset
end
overall-result = result of Outer-SELECT
taking outer-recordset as pseudo-entity-name

Related

Minimizing a graph with SQL

Suppose we have a directed graph defined as following:
node | neighbor
-----------------
1 | 2
1 | 3
2 | 4
2 | 3
3 | 4
the above table defines the only the edges between two nodes, a couple (1,2)for example means that node 1 and 2 are connected by an edge, here is a plot of the graph.
I also have a table of the transitive closure of the graph, this table holds all the possible paths of the graph (for example: (1,3) is present twice because it can be reached either directly or by the path 1=>2=>3), here is the table:
node | neighbor
-----------------
1 | 2
1 | 3
2 | 4
2 | 3
3 | 4
1 | 3
1 | 4
1 | 4
2 | 4
from these two tables, I want to return a minimized graph without losing any reachability, an idea was to only return edges that are not in dependency of the two tables, here's an example:
(1,2) is in the first table and (2,3) is in the second, and therefore (1,3) can be deleted from the first table because you can reach node 3 from 1 passing by node 2
the outuput table should look like this then:
node | neighbor
-----------------
1 | 2
2 | 3
3 | 4
How can I write an SQL query that does this?
Here is one approach:
with recursive cte as (
select node, neighbor, 1 is_initial from graph
union all
select c.node, g.neighbor, 0
from cte c
inner join graph g on g.node = c.neighbor
)
select node, neighbor
from graph g
where not exists (
select 1
from cte c
where c.node = g.node and c.neighbor = g.neighbor and c.is_initial = 0
)
order by node, neighbor
This uses the first table only (I called it graph). We start by generating all possible paths with a recursive query. This is quite similar to your closure table, but with one extra column, is_initial, that indicates whether the path comes from the original table, or was generated during a further iteration.
Then, all that is left to do is filter the graph to remove tuples that match a "non-initial" path.
Demo on DB Fiddle:
node | neighbor
---: | -------:
1 | 2
2 | 3
3 | 4

Flatten tree structure represented in SQL [duplicate]

This question already has an answer here:
SQL Server recursive self join
(1 answer)
Closed 3 years ago.
I'm using an engineering calculation package and trying to extract some information from it in a built in reporting tool that allows SQL query
An abbreviated example SQL tables are as follows:
Id | Description | Ref
---|---------------------
1 | system 1 |
3 | block 4 | 6
3 | block 4 | 1
5 | formula1 | 3
6 | f |
7 | something | 1
9 | cheese | 5
The "Ref" column identifies rows that are subrecords of other items.
What I want to do is run a query that will produce a list that will show all items that appear on a each page. As you can see from the table above "ID" is not the unique key; each item can appear in multiple locations within the table. In the example above:
ID 5 is a subitem of ID3
ID 3 is a subitem of ID 1 AND ID 6
ID 1 and ID 6 aren't subitems of anything
So effectively it is representing a tree structure:
ID 1
+-------- ID 7
|---- ID 3
+---- ID 5
+---- ID 9
ID 6
+---- ID 3
+---- ID 5
+---- ID 9
What I'm hoping to is work out which items appear under each top level item (so the end result should be a table where in the "Ref" column only top level items appear):
Id | Description | Ref
---|---------------------
1 | system 1 |
3 | block 4 | 6
3 | block 4 | 1
5 | formula1 | 1
5 | formula1 | 6
6 | f |
9 | cheese | 1
9 | cheese | 6
7 | something | 1
The tree structure can be a total of 5 levels deep
I've been trying to use left joins to build up a list of page references, but I think I'm also going to need to union results tables (because obviously rows like ID=9, ID=5, and ID = 6 have to be duplicated in the final results set). It starts to get a bit messy!
WITH A
AS (SELECT *
FROM [RbdBlocks]),
B
AS (SELECT [x].[Id],
[x].[Description],
[x].[Page] AS Page1,
[y].[Page] AS Page2,
FROM A AS x
LEFT OUTER JOIN
A AS y
ON y.Id = x.Page)
SELECT *
FROM B
The above gives me some of the nested references, but I'm not sure if there's a better way to get this data together, and to manage the recursion rather than just duplicating the set of queries 4 times?
Have a look at Recursive Common Table Expressions (CTEs). They should be able to accomplish exactly what you need.
Have a look at Example D on the SQL Docs page.
Basically what you'd do in your case is:
In the "anchor member" of the CTE, select all top-level items
In the "recursive member" of the CTE, join all of the nested children to the top-level item
Recursive CTEs are not really trivial to understand, so be sure to read the docs carefully.

PostgreSQL: Distribute rows evenly and according to frequency

I have trouble with a complex ordering problem. I have following example data:
table "categories"
id | frequency
1 | 0
2 | 4
3 | 0
table "entries"
id | category_id | type
1 | 1 | a
2 | 1 | a
3 | 1 | a
4 | 2 | b
5 | 2 | c
6 | 3 | d
I want to put entries rows in an order so that category_id,
and type are distributed evenly.
More precisely, I want to order entries in a way that:
category_ids that refer to a category that has frequency=0 are
distributed evenly - so that a row is followed by a different category_id
whenever possible. e.g. category_ids of rows: 1,2,1,3,1,2.
Rows with category_ids of categories with frequency<>0 should
be inserted from ca. the beginning with a minimum of frequency rows between them
(the gaps should vary). In my example these are rows with category_id=2.
So the result could start with row id #1, then #4, then a minimum of 4 rows of other
categories, then #5.
in the end result rows with same type should not be next to each other.
Example result:
id | category_id | type
1 | 1 | a
4 | 2 | b
2 | 1 | a
6 | 3 | d
.. some other row ..
.. some other row ..
.. some other row ..
5 | 2 | c
entries are like a stream of things the user gets (one at a time).
The whole ordering should give users some variation. It's just there to not
present them similar entries all the time, so it doesn't have to be perfect.
The query also does not have to give the same result on each call - using
random() is totally fine.
frequencies are there to give entries of certain categories a higher
priority so that they are not distributed across the whole range, but are placed more
at the beginning of the result list. Even if there are a lot of these entries, they
should not completely crowd out the frequency=0 entries at the beginning, through.
I'm no sure how to start this. I think I can use window functions and
ntile() to distribute rows by category_id and type.
But I have no idea how to insert the non-0-category-entries afterwards.

TSQL change in query to and query

I have one to many relationship table
ReviewId EffectId
1 | 2
1 | 5
1 | 8
2 | 2
2 | 5
2 | 9
2 | 3
3 | 3
3 | 2
3 | 9
In the site the users select each effect he chooses, and I get all the relevant review.
I make an in query
For example if the user select effects 2 and 5
My query: “
select reviewed from table_name where effected in(2,5)
Now I need get all the review that contain both effect
All reviews that has effect 2 and effect 5
What is the best query to make this?
Important for me that the query will run as quick as possible.
And for this I can also change the table schema (if needed ) like add a cached field that contain all the effect with comma like
Reviewed cachedEffects
1 | ,2,5,8
2 | ,2,5,9,3,
3 | ,3,2,9
You can do it this way:
select reviewid
from
tbl
where effectid in (2,5)
group by reviewid
having count(distinct effectid) > 1
Demo
count (distinct effectid) is used to ensure that the results contain only those reviewIDs which have multiple records with different values of effectID. The where clause is used to filter out based on your filter condition of having both 2 and 5.
The key thing to note here is that we are grouping by reviewID, and also using the count of distinct effectID values to ensure that only those records which have both 2 and 5 are returned. If we did not do so, the query would return all rows which have effectID equal to either 2 or 5.
For improving performance, you could create an index on reviewID.

Access 2007 select first value of query results

I am running into a rather annoying thingy in Access (2007) and I am not sure if this is a feature or if I am asking for the impossible.
Although the actual database structure is more complex, my problem boils down to this:
I have a table with data about Units for specific years. This data comes from different sources and might overlap.
Unit | IYR | X1 | Source |
-----------------------------
A | 2009 | 55 | 1 |
A | 2010 | 80 | 1 |
A | 2010 | 101 | 2 |
A | 2010 | 150 | 3 |
A | 2011 | 90 | 1 |
...
Now I would like the user to select certain sources, order them by priority and then extract one data value for each year.
For example, if the user selects source 1, 2 and 3 and orders them by (3, 1, 2), then I would like the following result:
Unit | IYR | X1 | Source |
-----------------------------
A | 2009 | 55 | 1 |
A | 2010 | 150 | 3 |
A | 2011 | 90 | 1 |
I am able to order the initial table, based on a specific order. I do this with the following query
SELECT Unit, IYR, X1, Source
FROM TestTable
WHERE Source In (1,2,3)
ORDER BY Unit, IYR,
IIf(Source=3,1,IIf(Source=1,2,IIf(Source=2,3,4)))
This gives me the following intermediate result:
Unit | IYR | X1 | Source |
-----------------------------
A | 2009 | 55 | 1 |
A | 2010 | 150 | 3 |
A | 2010 | 80 | 1 |
A | 2010 | 101 | 2 |
A | 2011 | 90 | 1 |
Next step is to only get the first value of each year. I was thinking to use the following query:
SELECT X.Unit, X.IYR, first(X.X1) as FirstX1
FROM (...) AS X
GROUP BY X.Unit, X.IYR
Where (…) is the above query.
Now Access goes bananas. Whatever order I give to the intermediate results, the result of this query is.
Unit | IYR | X1 |
--------------------
A | 2009 | 55 |
A | 2010 | 80 |
A | 2011 | 90 |
In other words, for year 2010 it shows the value of source 1 instead of 3. It seems that Access does not care about the ordering of the nested query when it applies the FIRST() function and sticks to the original ordering of the data.
Is this a feature of Access or is there a different way of achieving the desired results?
Ps: Next step would be to use a self join to add the source column to the results again, but I first need to resolve above problem.
Rather than use first it may be better to determine the MIN Priority and then join back e.g.
SELECT
t.UNIT,
t.IYR,
t.X1,
t.Source ,
t.PrioritySource
FROM
(SELECT
Unit,
IYR,
X1,
Source,
SWITCH ( [Source]=3, 1,
[Source]=1, 2,
[Source]=2, 3) as PrioritySource
FROM
TestTable
WHERE
Source In (1,2,3)
) as t
INNER JOIN
(SELECT
Unit,
IYR,
MIN(SWITCH ( [Source]=3, 1,
[Source]=1, 2,
[Source]=2, 3)) as PrioritySource
FROM
TestTable
WHERE
Source In (1,2,3)
GROUP BY
Unit,
IYR ) as MinPriortiy
ON t.Unit = MinPriortiy.Unit and
t.IYR = MinPriortiy.IYR and
t.PrioritySource = MinPriortiy.PrioritySource
which will produce this result (Note I include Source and priority source for demonstration purposes only)
UNIT | IYR | X1 | Source | PrioritySource
----------------------------------------------
A | 2009 | 55 | 1 | 2
A | 2010 | 150 | 3 | 1
A | 2011 | 90 | 1 | 2
Note the first subquery is to handle the fact that Access won't let you join on a Switch
Yes, FIRST() does use an arbitrary ordering. From the Access Help:
These functions return the value of a specified field in the first or
last record, respectively, of the result set returned by a query. If
the query does not include an ORDER BY clause, the values returned by
these functions will be arbitrary because records are usually returned
in no particular order.
I don't know whether FROM (...) AS X means you are using an ORDER BY inline (assuming that is actually possible) or if you are using a VIEW ('stored Query object') here but either way I assume the ORDER BY is being disregarded (because an ORDER BY should only apply to the final result).
The alternative is to use MIN() (or possibly MAX()).
This is the most concise way I have found to write such queries in Access that require pulling back all columns that correspond to the first row in a group of records that are ordered in a particular way.
First, I added a UniqueID to your table. In this case, it's just an AutoNumber field. You may already have a unique value in your table, in which case you can use that.
This will choose the row with a Source 3 first, then Source 1, then Source 2. If there is a tie, it picks the one with the higher X1 value. If there is a further tie, it is broken by the UniqueID value:
SELECT t.* INTO [Chosen Rows]
FROM TestTable AS t
WHERE t.UniqueID=
(SELECT TOP 1 [UniqueID] FROM [TestTable]
WHERE t.IYR=IYR ORDER BY Choose([Source],2,3,1), X1 DESC, UniqueID)
This yields:
Unit IYR X1 Source UniqueID
A 2009 55 1 1
A 2010 150 3 4
A 2011 90 1 5
I recommend (1) you create an index on the IYR field -- this will dramatically increase your performance for this type of query, and (2) if you have a lot (>~100K) records, this isn't the best choice. I find it works quite well for tables in the 1-70K range. For larger datasets, I like to use my GroupIncrement function to partition each group (similar to SQL Server's ROW_NUMBER() OVER statement).
The Choose() function is a VBA function and may not be clear here. In your case, it sounds like there is some interactivity required. For that, you could create a second table called "Choices", like so:
Rank Choice
1 3
2 1
3 2
Then, you could substitute the following:
SELECT t.* INTO [Chosen Rows]
FROM TestTable AS t
WHERE t.UniqueID=(SELECT TOP 1 [UniqueID] FROM
[TestTable] t2 INNER JOIN [Choices] c
ON t2.Source=c.Choice
WHERE t.IYR=t2.IYR ORDER BY c.[Rank], t2.X1 DESC, t2.UniqueID);
Indexing Source on TestTable and Choice on the Choices table may be helpful here, too, depending on the number of choices required.
Q:
Can you get this to work without the need for surrogate key? For
example what if the unique key is the composite of
{Unit,IYR,X1,Source}
A:
If you have a compound key, you can do it like this-- however I think that if you have a large dataset, it will totally kill the performance of the query. It may help to index all four columns, but I can't say for sure because I don't regularly use this method.
SELECT t.* INTO [Chosen Rows]
FROM TestTable AS t
WHERE t.Unit & t.IYR & t.X1 & t.Source =
(SELECT TOP 1 Unit & IYR & X1 & Source FROM [TestTable]
WHERE t.IYR=IYR ORDER BY Choose([Source],2,3,1), X1 DESC, Unit, IYR)
In certain cases, you may have to coalesce some of the individual parts of the key as follows (though Access generally will coalesce values automatically):
t.Unit & CStr(t.IYR) & CStr(t.X1) & CStr(t.Source)
You could also use a query in your FROM statements instead of the actual table. The query itself would build a composite of the four fields used in the key, and then you'd use the new key name in the WHERE clause of the top SELECT statement, and in the SELECT TOP 1 [key] of the subquery.
In general, though, I will either: (a) create a new table with an AutoNumber field, (b) add an AutoNumber field, (c) add an integer and populate it with a unique number using VBA - this is useful when you get a MaxLocks error when trying to add an AutoNumber, or (d) use an already indexed unique key.