SQL: execution model for recursive CTEs - sql

I'm trying to understand how recursive CTEs are executed and particularly what causes them to terminate. Here's a simple example:
WITH cte_increment(n) AS (
SELECT -- Select 1
0
UNION ALL -- Union
SELECT -- Select 2
n + 1
FROM -- From
cte_increment
WHERE -- Where
n < 6
)
SELECT
*
FROM
cte_increment
;
My current mental model is that when the expression is invoked, the clauses should be executed in this order:
Select 1
From
Where
Select 2
Union
However, I don't think that can be happening because the From clause recursively invokes the same expression, which would restart the same process one level deeper. That would lead to infinite recursions, which would only stop when it hit the recursion limit.
My question is, how does the CTE ever check its termination condition?
Thanks in advance for your help!

So, "recursion" is a little bit of a misnomer here. It is really iterative, starting at the anchor condition. Perhaps a more accurate term would be "induction", in the mathematical sense.
Each iteration only uses the rows from the previous iteration. It doesn't use all the rows from the CTE.
So, the first iteration generates:
0
The second only sees the 1 and generates
1
The third only sees the 2 and generates
2
And so on until you get to where the where condition says "no more".

Related

Infinite loop with recursive SQL query

I can't seem to find the reason behind the infinite loop in this query, nor how to correct it.
Here is the context :
I have a table called mergesWith with this description :
mergesWith: information about neighboring seas. Note that in this relation, for every pair of
neighboring seas (A,B), only one tuple is given – thus, the relation is not symmetric.
sea1: a sea
sea2: a sea.
I want to know every sea accessible from the Mediterranean Sea by navigating. I have opted for a recursive query using "with" :
With
acces(p,d) as (
select sea1 as p, sea2 as d
from MERGESWITH
UNION ALL
select a.p, case when mw.sea1=a.d
then mw.sea2
else mw.sea1
end as d
from acces a, MERGESWITH mw
where a.d=mw.sea1 or a.d=mw.sea2)
select d
from acces
where p= 'Mediterranean Sea';
I think the cause is either the case when or the a.d=mw.sea1 or a.d=mw.sea2 that is not restrictive enough, but I can't seem to pinpoint why.
I get this error message :
32044. 00000 - "cycle detected while executing recursive WITH query"
*Cause: A recursive WITH clause query produced a cycle and was stopped
in order to avoid an infinite loop.
*Action: Rewrite the recursive WITH query to stop the recursion or use
the CYCLE clause.
The cycles are caused by the structure of your query, not by cycles in the data. You ask for the reason for cycling. That should be obvious: at the first iteration, one row of output has d = 'Aegean Sea'. At the second iteration, you will find a row with d = 'Mediterranean Sea', right? Can you now see how this will result in cycles?
Recursive queries have a cycle clause used exactly for this kind of problem. For some reason, even many users who learned the recursive with clause well, and use it all the time, seem unaware of the cycle clause (as well as the unrelated, but equally useful, search clause - used for ordering the output).
In your code, you need to make two changes. Add the cycle clause, and also in the outer query filter for non-cycle rows only. In the cycle clause, you can decide what to call the "cycle" column, and what values to give it. To make this look as similar to connect by queries as possible, I like to call the new column IS_CYCLE and to give it the values 0 (for no cycle) and 1 (for cycle). In the outer query below, add is_cycle to the select list to see what it adds to the recursive query.
Notice the position of the cycle clause: it comes right after the recursive with clause (in particular, after the closing parenthesis at the end of the recursive factored subquery).
with
acces(p,d) as (
select sea1 as p, sea2 as d
from MERGESWITH
UNION ALL
select a.p, case when mw.sea1=a.d
then mw.sea2
else mw.sea1
end as d
from acces a, MERGESWITH mw
where a.d=mw.sea1 or a.d=mw.sea2)
cycle d set is_cycle to 1 default 0 -- add this line
select d
from acces
where p= 'Mediterranean Sea'
and is_cycle = 0 -- and this line
;
Clearly, this would be data-dependent due to cycles in the data. I typically include a lev value when developing recursive CTEs. This makes it simpler to debug them.
So, try something like this:
with acces(p, d, lev) as (
select sea1 as p, sea2 as d, 1 as lev
from MERGESWITH
union all
select a.p,
(case when mw.sea1 = a.d then mw.sea2 else mw.sea1 end) as d,
lev + 1
from acces a join
MERGESWITH mw
on a.d in (mw.sea1, mw.sea2)
where lev < 5)
select d
from acces
where p = 'Mediterranean Sea';
If you find the reason but can't fix the code, ask a new question with sample data and desired results. A DB fiddle of some sort is also helpful.

UPDATED: Using COALESCE / IFNULL to filter recursive cte query

I have an issue with filtering / adding a condition to a recursive CTE query to avoid NULL result. The recursive part of query stops looping once it comes across a gap in the descending series of gids taken from the first non-recursive query (essentially when the x.t_gid = s.g2_t_gid filter in the WHERE is NULL).
x.t_gid and s.g2_t_gid are decsending series' of integers, and x.t_gid(n) = s.g2_t_gid(n+1). There are meant to be gaps in the series of gids but I want the recursive part to just continue onto the next row if it returns a NULL result. See code below.
WITH RECURSIVE snapped_points(t_gid, r_rdname, r_gid, r_ufi, snapped_geom, snapped_distance, g2_t_gid, g2_r_rdname, s_g2_snapped_geom, route_distance) AS (
(SELECT t_gid,
r_rdname,
r_gid,
r_ufi,
snapped_geom AS snapped_geom,
snapped_distance,
g2_t_gid,
g2_r_rdname,
g2_snapped_geom AS s_g2_snapped_geom,
route_distance
FROM x_joined_snapped x
LIMIT 1)
UNION ALL
(SELECT x.t_gid,
x.r_rdname,
x.r_gid,
x.r_ufi,
x.snapped_geom,
x.snapped_distance,
x.g2_t_gid,
x.g2_r_rdname,
x.g2_snapped_geom,
x.route_distance
FROM snapped_points s
INNER JOIN x_joined_snapped x
ON x.t_gid <> s.t_gid
AND (x.t_gid = s.g2_t_gid AND x.snapped_geom = s.s_g2_snapped_geom)
--OR (x.t_gid < s.g2_t_gid), <-- difference between 1s and 24s**
LIMIT 1
)
)
SELECT t_gid, r_gid, r_ufi, r_rdname, snapped_distance, snapped_geom FROM snapped_points
;
What I am aiming to achieve.
From the array of potential snapped points for t_gid(n), choose the row where distance between snapped_geom(n) and g2_snapped_geom(n) is the shortest. If there is only 1 result choose that.
From the array of potential snapped points for t_gid(n-1) (which equals
g2_t_gid(n)), select the subset containing only g2_snapped_geom(n) chosen in the previous step.
From this subset, choose the row where distance between snapped_geom(n-1) and g2_snapped_geom(n-1) is shortest. If there is only 1 choose that. Append to previous result. Loop until you run out of rows.
Once the recursive part hits a gap in the table where x.t_gid <> s.g2_t_gid it just stops looping. I have been able to fix this by adding a OR x.t_gid < s.g2_t_gid to the WHERE clause, but this increases the compute time from 1000ms to 24,000ms.
I have tried using COALESCE but can't get it to work, and recursive CTE queries don't allow the recursive part to be repeated in the query.
I am a complete noob so I'm sure this code could be made prettier and more efficient. Any help would be greatly appreciated.

Is it possible to use the result of a subquery in a case statement of the same outer query?

I am writing a search routine with a ranking algorithm and would like to get this in one pass.
My Ideal query would be something like this....
select *, (select top 1 wordposition
from wordpositions
where recordid=items.pk_itemid and wordid=79588 and nextwordid=64502
) as WordPos,
case when WordPos<11 then 1 else case WordPos<50 then 2 else case WordPos<100 then 3 else 4 end end end end as rank
from items
Is it possible to use WordPos in a case right there? It's generating an error on me , Invalid column name 'WordPos'.
I know I can redo the subquery for each case but I think it would actually re-run the case wouldn't it?
For example:
select *, case when (select top 1 wordposition from wordpositions where recordid=items.pk_itemid and wordid=79588 and nextwordid=64502)<11 then 1 else case (select top 1 wordposition from wordpositions where recordid=items.pk_itemid and wordid=79588 and nextwordid=64502)<50 then 2 else case (select top 1 wordposition from wordpositions where recordid=items.pk_itemid and wordid=79588 and nextwordid=64502)<100 then 3 else 4 end end end end as rank from items
That works....but is it really re-running the identical query each time?
It's hard to tell from the tests as the first time it runs it's slow but subsequent runs are quick....it's caching...so would that mean that the first time it ran it for the first row, the subsequent three times it would get the result from cache?
Just curious what the best way to do this would be...
Thank you!
Ryan
You can do this using a subquery. I will stick with your SQL Server syntax, even though the question is tagged mysql:
select i.*,
(case when WordPos < 11 then 1
when WordPos < 50 then 2
when WordPos < 100 then 3
else 4
end) as rank
from (select i.*,
(select top 1 wpwordposition
from wordpositions wp
where recordid=i.pk_itemid and wordid=79588 and nextwordid=64502
) as WordPos
from items i
) i;
This also simplifies the case statement. You do not need nested case statements to handle multiple conditions, just multiple where clauses.
No. Identifiers introduced in the output clause (the fact that it comes from a sub-query is irrelevant) cannot be used within the same SELECT statement.
Here are some solutions:
Rewrite the query using a JOIN1, This will eliminate the issue entirely and fits well with RA.
Wrap the entire SELECT with the sub-query within another SELECT with the case. The outer select can access identifiers introduced by the inner SELECT's output clause.
Use a CTE (if SQL Server). This is similar to #2 in that it allows an identifier to be introduced.
While "re-writing" the sub-query for each case is very messy it should still result in an equivalent plan - but view the query profile! - as the results of the query are non-volatile. As such the equivalent sub-queries can be safely moved by the query planner which should move the sub-query/sub-queries to a JOIN to avoid any "re-running" in the first place.
1 Here is a conversion to use a JOIN, which is my preferred method. (I find that if a query can't be written in terms of a JOIN "easily" then it might be asking for the wrong thing or otherwise be showing issues with schema design.)
select
wp.wordposition as WordPos,
case wp.wordposition .. as Rank
from items i
left join wordpositions wp
on wp.recordid = i.pk_itemid
where wp.wordid = 79588
and wp.nextwordid = 64502
I've made assumptions about the multiplicity here (i.e. that wordid is unique) which should be verified. If this multiplicity is not valid and not correctable otherwise (and you're indeed using SQL Server), then I'd recommend using ROW_NUMBER() and a CTE.

RECURSIVE in SQL

I'm learning SQL and had a hard time understanding the following recursive SQL statement.
WITH RECURSIVE t(n) AS (
SELECT 1
UNION ALL
SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;
What is n and t from SELECT sum(n) FROM t;? As far as I could understand, n is a number of t is a set. Am I right?
Also how is recursion triggered in this statement?
The syntax that you are using looks like Postgres. "Recursion" in SQL is not really recursion, it is iteration. Your statement is:
WITH RECURSIVE t(n) AS (
SELECT 1
UNION ALL
SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;
The statement for t is evaluated as:
Evaluate the non-self-referring part (select 1).
Then evaluate the self-referring part. (Initially this gives 2.)
Then evaluation the self-referring part again. (3).
And so on while the condition is still valid (n < 100).
When this is done the t subquery is finished, and the final statement can be evaluated.
This is called a Common Table Expression, or CTE.
The RECURSIVE from the query doesn't mean anything: it's just another name like n or t. What makes things recursive is that the CTE named t references itself inside the expression. To produce the result of the expression, the query engine must therefore recursively build the result, where each evaluation triggers the next. It reaches this point: SELECT n+1 FROM t... and has to stop and evaluate t. To do that, it has to call itself again, and so on, until the condition (n < 100) no longer holds. The SELECT 1 provides a starting point, and the WHERE n < 100 makes it so that the query does not recur forever.
At least, that's how it's supposed to work conceptually. What generally really happens is that the query engine builds the result iteratively, rather than recursively, if it can, but that's another story.
Let's break this apart:
WITH RECURSIVE t(n) AS (
A Common Table Expression (CTE) which is supposed to include a seed query and a recursive query. CTE is called t and returns 1 column: n
The seed query:
SELECT 1
returns ans answer set (in this case a just a single row: 1) and puts a copy of it into the final answer set
Now starts the recursive part:
UNION ALL
The rows returned from the seed query are now processed and n+1 is returned (again a single row answer set: 2) and copied into the final answer set:
SELECT n+1 FROM t WHERE n < 100
If this step returned a non-empty answer set (activity_count > 0) it's repeated (forever).
A WHERE-condition on a calculation like this n+1 is usually used to avoid an endless recursion. One usually knows the maximum possible level based on the data and for complex queries it's too easy to put some conditions wrong ;-)
Finally the answer set is returned:
)
SELECT sum(n) FROM t;
When you simply do a SELECT * FROM t; you'll see all numbers from 1 to 100, it's not a very efficient way to produce this list.
The most important thing to remember is that each step produces a part of the final result and only those rows from the previous step are processed in the next recursion level.

Why do these seemingly similar queries have such drastically different run times?

I'm working with an oracle DB trying to tune some queries and I'm having trouble understanding why working a particular clause in a particular way has such a drastic impact on the query performance. Here is a performant version of the query I'm doing
select * from
(
select a.*, rownum rn from
(
select *
from table_foo
) a where rownum <= 3
) where rn >= 2
The same query by replacing the last two lines with this
) a where rownum >=2 rownum <= 3
)
performs horribly. Several orders of magnitude worse
) a where rownum between 2 and 3
)
also performs horribly. I don't understand the magic from the first query and how to apply it to further similar queries.
My understanding is that the rownum assignment occurs after (or 'as') the row is selected, so any 'ROWNUM >= n' query with n greater than 1 is going to cause trouble. What was explained to me is that the first row is looked at; it is rownum 1, so it doesn't meet the criteria and is thrown away. The next row is looked at; it will still be rownum 1 since the result set is empty, and it doesn't meet the criteria and is thrown away. This process continues until all rows have been read and rejected.
Does the long-running query actually produce any data? Or have you always killed it before it completed?
ROWNUM is a pseudocolumn (not a real column) that is available in a query. ROWNUM will be assigned the numbers 1, 2, 3, 4, ... N, where N is the number of rows in the set ROWNUM is used with. In the first case, you are cutting the number of rows right off the bat, and in the second one you have to look for everything to cut off things that are bigger than 2.