RECURSIVE in SQL

I'm learning SQL and had a hard time understanding the following recursive SQL statement.
WITH RECURSIVE t(n) AS (
SELECT 1
UNION ALL
SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;
What are n and t in SELECT sum(n) FROM t;? As far as I could understand, n is a number and t is a set. Am I right?
Also how is recursion triggered in this statement?

The syntax that you are using looks like Postgres. "Recursion" in SQL is not really recursion, it is iteration. Your statement is:
WITH RECURSIVE t(n) AS (
SELECT 1
UNION ALL
SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;
The statement for t is evaluated as:
Evaluate the non-self-referring part (select 1).
Then evaluate the self-referring part. (Initially this gives 2.)
Then evaluate the self-referring part again. (3).
And so on while the condition is still valid (n < 100).
When this is done the t subquery is finished, and the final statement can be evaluated.
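To check the steps above, here is a minimal sketch that runs the same statement through Python's built-in sqlite3 module. SQLite is an assumption here (the question looks like Postgres), but it accepts the same WITH RECURSIVE syntax; the extra count/min/max columns are only added to show what ends up in t.

```python
import sqlite3

# In-memory SQLite database, used only as a convenient stand-in for Postgres.
conn = sqlite3.connect(":memory:")

QUERY = """
WITH RECURSIVE t(n) AS (
    SELECT 1                             -- non-self-referring part
    UNION ALL
    SELECT n + 1 FROM t WHERE n < 100    -- self-referring part
)
SELECT count(*), min(n), max(n), sum(n) FROM t
"""

row_count, smallest, largest, total = conn.execute(QUERY).fetchone()
print(row_count, smallest, largest, total)  # 100 1 100 5050
```

So t holds the 100 rows 1..100, and the final statement sums them.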

This is called a Common Table Expression, or CTE.
RECURSIVE is a keyword rather than a name like n or t: in Postgres it is what permits the CTE to refer to itself, but it does not make the query recursive on its own. What makes things recursive is that the CTE named t references itself inside the expression. To produce the result of the expression, the query engine must therefore recursively build the result, where each evaluation triggers the next. It reaches this point: SELECT n+1 FROM t... and has to stop and evaluate t. To do that, it has to call itself again, and so on, until the condition (n < 100) no longer holds. The SELECT 1 provides a starting point, and the WHERE n < 100 ensures that the query does not recur forever.
At least, that's how it's supposed to work conceptually. What generally really happens is that the query engine builds the result iteratively, rather than recursively, if it can, but that's another story.

Let's break this apart:
WITH RECURSIVE t(n) AS (
A Common Table Expression (CTE), which is supposed to include a seed query and a recursive query. The CTE is called t and returns one column: n
The seed query:
SELECT 1
returns an answer set (in this case just a single row: 1) and puts a copy of it into the final answer set
Now starts the recursive part:
UNION ALL
The rows returned from the seed query are now processed and n+1 is returned (again a single row answer set: 2) and copied into the final answer set:
SELECT n+1 FROM t WHERE n < 100
If this step returns a non-empty answer set (activity_count > 0), it is repeated, and it keeps repeating (forever, if need be) until a step returns no rows.
A WHERE condition on a calculation like this n+1 is usually used to avoid endless recursion. One usually knows the maximum possible level from the data, and in complex queries it's all too easy to get some condition wrong ;-)
Finally the answer set is returned:
)
SELECT sum(n) FROM t;
If you simply run SELECT * FROM t; you'll see all the numbers from 1 to 100, although this is not a very efficient way to produce that list.
The most important thing to remember is that each step produces a part of the final result and only those rows from the previous step are processed in the next recursion level.
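That last point can be sketched in plain Python. This is only an illustration of the evaluation scheme described above, not how any particular engine is implemented; the names evaluate_recursive_cte and step are made up for the example.

```python
def evaluate_recursive_cte(seed_rows, recursive_step):
    """Emulate UNION ALL evaluation: each step sees only the previous step's rows."""
    result = list(seed_rows)      # final answer set, starts with the seed rows
    working = list(seed_rows)     # rows produced by the previous step only
    while working:                # stop as soon as a step returns no rows
        working = recursive_step(working)
        result.extend(working)
    return result


def step(previous_rows):
    # SELECT n+1 FROM t WHERE n < 100, applied to the previous step's rows
    return [n + 1 for n in previous_rows if n < 100]


numbers = evaluate_recursive_cte([1], step)
print(len(numbers), sum(numbers))  # 100 5050
```

Note that step never receives the whole result, only the rows from the step before it, which is why the loop ends once n reaches 100.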

Related

SQL: execution model for recursive CTEs

I'm trying to understand how recursive CTEs are executed and particularly what causes them to terminate. Here's a simple example:
WITH cte_increment(n) AS (
SELECT -- Select 1
0
UNION ALL -- Union
SELECT -- Select 2
n + 1
FROM -- From
cte_increment
WHERE -- Where
n < 6
)
SELECT
*
FROM
cte_increment
;
My current mental model is that when the expression is invoked, the clauses should be executed in this order:
Select 1
From
Where
Select 2
Union
However, I don't think that can be happening because the From clause recursively invokes the same expression, which would restart the same process one level deeper. That would lead to infinite recursions, which would only stop when it hit the recursion limit.
My question is, how does the CTE ever check its termination condition?
Thanks in advance for your help!

So, "recursion" is a little bit of a misnomer here. It is really iterative, starting at the anchor condition. Perhaps a more accurate term would be "induction", in the mathematical sense.
Each iteration only uses the rows from the previous iteration. It doesn't use all the rows from the CTE.
So, the first iteration generates:
0
The second only sees the 0 and generates
1
The third only sees the 1 and generates
2
And so on until you get to where the where condition says "no more".
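One way to watch this happen is to carry an iteration counter through the CTE. The sketch below assumes SQLite (via Python's sqlite3 module) instead of the asker's database, so the RECURSIVE keyword is added, and the iteration column is an addition for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

QUERY = """
WITH RECURSIVE cte_increment(iteration, n) AS (
    SELECT 1, 0                        -- anchor: iteration 1 produces n = 0
    UNION ALL
    SELECT iteration + 1, n + 1        -- sees only the previous iteration's row
    FROM cte_increment
    WHERE n < 6                        -- an iteration with no rows ends the loop
)
SELECT iteration, n FROM cte_increment ORDER BY iteration
"""

rows = conn.execute(QUERY).fetchall()
for iteration, n in rows:
    print(iteration, n)
```

The row with n = 6 is still produced (from n = 5), but it fails n < 6 in the following iteration, which therefore returns nothing and stops the evaluation.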

Infinite loop with recursive SQL query

I can't seem to find the reason behind the infinite loop in this query, nor how to correct it.
Here is the context :
I have a table called mergesWith with this description :
mergesWith: information about neighboring seas. Note that in this relation, for every pair of
neighboring seas (A,B), only one tuple is given – thus, the relation is not symmetric.
sea1: a sea
sea2: a sea.
I want to know every sea accessible from the Mediterranean Sea by navigating. I have opted for a recursive query using "with" :
With
acces(p,d) as (
select sea1 as p, sea2 as d
from MERGESWITH
UNION ALL
select a.p, case when mw.sea1=a.d
then mw.sea2
else mw.sea1
end as d
from acces a, MERGESWITH mw
where a.d=mw.sea1 or a.d=mw.sea2)
select d
from acces
where p= 'Mediterranean Sea';
I think the cause is either the case when or the a.d=mw.sea1 or a.d=mw.sea2 that is not restrictive enough, but I can't seem to pinpoint why.
I get this error message :
32044. 00000 - "cycle detected while executing recursive WITH query"
*Cause: A recursive WITH clause query produced a cycle and was stopped
in order to avoid an infinite loop.
*Action: Rewrite the recursive WITH query to stop the recursion or use
the CYCLE clause.

The cycles are caused by the structure of your query, not by cycles in the data. You ask for the reason for cycling. That should be obvious: at the first iteration, one row of output has d = 'Aegean Sea'. At the second iteration, you will find a row with d = 'Mediterranean Sea', right? Can you now see how this will result in cycles?
Recursive queries have a cycle clause used exactly for this kind of problem. For some reason, even many users who learned the recursive with clause well, and use it all the time, seem unaware of the cycle clause (as well as the unrelated, but equally useful, search clause - used for ordering the output).
In your code, you need to make two changes. Add the cycle clause, and also in the outer query filter for non-cycle rows only. In the cycle clause, you can decide what to call the "cycle" column, and what values to give it. To make this look as similar to connect by queries as possible, I like to call the new column IS_CYCLE and to give it the values 0 (for no cycle) and 1 (for cycle). In the outer query below, add is_cycle to the select list to see what it adds to the recursive query.
Notice the position of the cycle clause: it comes right after the recursive with clause (in particular, after the closing parenthesis at the end of the recursive factored subquery).
with
acces(p,d) as (
select sea1 as p, sea2 as d
from MERGESWITH
UNION ALL
select a.p, case when mw.sea1=a.d
then mw.sea2
else mw.sea1
end as d
from acces a, MERGESWITH mw
where a.d=mw.sea1 or a.d=mw.sea2)
cycle d set is_cycle to 1 default 0 -- add this line
select d
from acces
where p= 'Mediterranean Sea'
and is_cycle = 0 -- and this line
;
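For comparison, here is a sketch of a different way to get a terminating reachability query on an engine without a cycle clause. It assumes SQLite (through Python's sqlite3 module) and a handful of made-up mergesWith rows, and it relies on UNION instead of UNION ALL so that already-seen seas are not added again. As far as I know Oracle insists on UNION ALL in a recursive with clause, so this is not a replacement for the cycle clause above, only an illustration of the same idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mergesWith (sea1 TEXT, sea2 TEXT)")

# Hypothetical toy data: each neighbouring pair is stored once, as in the question.
pairs = [
    ("Mediterranean Sea", "Aegean Sea"),
    ("Aegean Sea", "Black Sea"),
    ("Atlantic Ocean", "Mediterranean Sea"),
    ("Baltic Sea", "North Sea"),   # not connected to the others in this toy data
]
conn.executemany("INSERT INTO mergesWith VALUES (?, ?)", pairs)

QUERY = """
WITH RECURSIVE reachable(sea) AS (
    SELECT 'Mediterranean Sea'
    UNION                              -- UNION drops rows that were already produced
    SELECT CASE WHEN mw.sea1 = r.sea THEN mw.sea2 ELSE mw.sea1 END
    FROM reachable r
    JOIN mergesWith mw ON r.sea IN (mw.sea1, mw.sea2)
)
SELECT sea FROM reachable
WHERE sea <> 'Mediterranean Sea'
ORDER BY sea
"""

seas = [row[0] for row in conn.execute(QUERY)]
print(seas)  # ['Aegean Sea', 'Atlantic Ocean', 'Black Sea']
```

Seeding the CTE with the starting sea alone, instead of with every row of mergesWith, also keeps the intermediate result small.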

Clearly, this would be data-dependent due to cycles in the data. I typically include a lev value when developing recursive CTEs. This makes it simpler to debug them.
So, try something like this:
with acces(p, d, lev) as (
select sea1 as p, sea2 as d, 1 as lev
from MERGESWITH
union all
select a.p,
(case when mw.sea1 = a.d then mw.sea2 else mw.sea1 end) as d,
lev + 1
from acces a join
MERGESWITH mw
on a.d in (mw.sea1, mw.sea2)
where lev < 5)
select d
from acces
where p = 'Mediterranean Sea';
If you find the reason but can't fix the code, ask a new question with sample data and desired results. A DB fiddle of some sort is also helpful.
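To see what the lev column reveals, here is a small sketch of the capped query run against two made-up rows. It assumes SQLite through Python's sqlite3 module, not Oracle, so the RECURSIVE keyword is added; the data is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mergesWith (sea1 TEXT, sea2 TEXT)")
# Hypothetical toy data: a chain of three seas with no cycle in it.
conn.executemany(
    "INSERT INTO mergesWith VALUES (?, ?)",
    [("Mediterranean Sea", "Aegean Sea"), ("Aegean Sea", "Black Sea")],
)

QUERY = """
WITH RECURSIVE acces(p, d, lev) AS (
    SELECT sea1, sea2, 1 FROM mergesWith
    UNION ALL
    SELECT a.p,
           CASE WHEN mw.sea1 = a.d THEN mw.sea2 ELSE mw.sea1 END,
           a.lev + 1
    FROM acces a
    JOIN mergesWith mw ON a.d IN (mw.sea1, mw.sea2)
    WHERE a.lev < 5                    -- depth cap so the query always ends
)
SELECT d, lev FROM acces
WHERE p = 'Mediterranean Sea'
ORDER BY lev, d
"""

rows = conn.execute(QUERY).fetchall()
for d, lev in rows:
    print(lev, d)

# Levels at which the Aegean Sea shows up as a destination.
aegean_levels = sorted({lev for d, lev in rows if d == "Aegean Sea"})
max_level = max(lev for _, lev in rows)
```

Even with this cycle-free chain, the Aegean Sea comes back at level 3 because each pair is followed in both directions, so without the cap the recursion would keep producing rows.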

UPDATED: Using COALESCE / IFNULL to filter recursive cte query

I have an issue with filtering / adding a condition to a recursive CTE query to avoid NULL result. The recursive part of query stops looping once it comes across a gap in the descending series of gids taken from the first non-recursive query (essentially when the x.t_gid = s.g2_t_gid filter in the WHERE is NULL).
x.t_gid and s.g2_t_gid are descending series of integers, and x.t_gid(n) = s.g2_t_gid(n+1). There are meant to be gaps in the series of gids but I want the recursive part to just continue onto the next row if it returns a NULL result. See code below.
WITH RECURSIVE snapped_points(t_gid, r_rdname, r_gid, r_ufi, snapped_geom, snapped_distance, g2_t_gid, g2_r_rdname, s_g2_snapped_geom, route_distance) AS (
(SELECT t_gid,
r_rdname,
r_gid,
r_ufi,
snapped_geom AS snapped_geom,
snapped_distance,
g2_t_gid,
g2_r_rdname,
g2_snapped_geom AS s_g2_snapped_geom,
route_distance
FROM x_joined_snapped x
LIMIT 1)
UNION ALL
(SELECT x.t_gid,
x.r_rdname,
x.r_gid,
x.r_ufi,
x.snapped_geom,
x.snapped_distance,
x.g2_t_gid,
x.g2_r_rdname,
x.g2_snapped_geom,
x.route_distance
FROM snapped_points s
INNER JOIN x_joined_snapped x
ON x.t_gid <> s.t_gid
AND (x.t_gid = s.g2_t_gid AND x.snapped_geom = s.s_g2_snapped_geom)
--OR (x.t_gid < s.g2_t_gid), <-- difference between 1s and 24s**
LIMIT 1
)
)
SELECT t_gid, r_gid, r_ufi, r_rdname, snapped_distance, snapped_geom FROM snapped_points
;
What I am aiming to achieve.
From the array of potential snapped points for t_gid(n), choose the row where distance between snapped_geom(n) and g2_snapped_geom(n) is the shortest. If there is only 1 result choose that.
From the array of potential snapped points for t_gid(n-1) (which equals
g2_t_gid(n)), select the subset containing only g2_snapped_geom(n) chosen in the previous step.
From this subset, choose the row where distance between snapped_geom(n-1) and g2_snapped_geom(n-1) is shortest. If there is only 1 choose that. Append to previous result. Loop until you run out of rows.
Once the recursive part hits a gap in the table where x.t_gid <> s.g2_t_gid it just stops looping. I have been able to fix this by adding an OR x.t_gid < s.g2_t_gid to the join condition, but this increases the compute time from 1000ms to 24,000ms.
I have tried using COALESCE but can't get it to work, and recursive CTE queries don't allow the recursive part to be repeated in the query.
I am a complete noob so I'm sure this code could be made prettier and more efficient. Any help would be greatly appreciated.

Why do these seemingly similar queries have such drastically different run times?

I'm working with an Oracle DB, trying to tune some queries, and I'm having trouble understanding why writing a particular clause in a particular way has such a drastic impact on query performance. Here is a performant version of the query I'm running:
select * from
(
select a.*, rownum rn from
(
select *
from table_foo
) a where rownum <= 3
) where rn >= 2
The same query with the last two lines replaced by this
) a where rownum >= 2 and rownum <= 3
)
performs horribly, several orders of magnitude worse. Replacing them with
) a where rownum between 2 and 3
)
also performs horribly. I don't understand the magic from the first query and how to apply it to further similar queries.

My understanding is that the rownum assignment occurs after (or 'as') the row is selected, so any 'ROWNUM >= n' query with n greater than 1 is going to cause trouble. What was explained to me is that the first row is looked at; it is rownum 1, so it doesn't meet the criteria and is thrown away. The next row is looked at; it will still be rownum 1 since the result set is empty, and it doesn't meet the criteria and is thrown away. This process continues until all rows have been read and rejected.
Does the long-running query actually produce any data? Or have you always killed it before it completed?
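The behaviour described above can be imitated in a few lines of Python. This is only a model of the explanation in this answer (a row gets the next number, and the counter advances only when the row is kept); select_with_rownum and the sample table are made up, and the sketch does not model how Oracle stops reading early for rownum <= n.

```python
def select_with_rownum(rows, predicate):
    """Imitate ROWNUM: number a candidate row, keep it only if the filter passes."""
    accepted = []
    for row in rows:
        rownum = len(accepted) + 1       # the number this row would get if kept
        if predicate(rownum):
            accepted.append((rownum, row))
    return accepted


table = ["a", "b", "c", "d", "e"]

# Inner query: where rownum <= 3
first_three = select_with_rownum(table, lambda rn: rn <= 3)
# Outer query: where rn >= 2, applied to the already-numbered rows
paged = [(rn, row) for rn, row in first_three if rn >= 2]
# Single level: where rownum between 2 and 3
direct = select_with_rownum(table, lambda rn: 2 <= rn <= 3)

print(paged)   # [(2, 'b'), (3, 'c')]
print(direct)  # []
```

In the single-level form every candidate is numbered 1, fails the filter, and is discarded, so the whole table is read and nothing comes back.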

ROWNUM is a pseudocolumn (not a real column) that is available in a query. ROWNUM will be assigned the numbers 1, 2, 3, 4, ... N, where N is the number of rows in the set ROWNUM is used with. In the first case, you cut the row count off right away (at 3 rows); in the second, every row has to be read before anything can be filtered out.

Big performance difference (1hr to 1 minute ) found in SQL. Can you explain why?

The following queries are taking 70 minutes and 1 minute respectively on a standard machine for 1 million records. What could be the possible reasons?
Query [01:10:00]
SELECT *
FROM cdc.fn_cdc_get_net_changes_dbo_PartitionTest(
CASE WHEN sys.fn_cdc_increment_lsn(0x00)<sys.fn_cdc_get_min_lsn('dbo_PartitionTest')
THEN sys.fn_cdc_get_min_lsn('dbo_PartitionTest')
ELSE sys.fn_cdc_increment_lsn(0x00) END
, sys.fn_cdc_get_max_lsn()
, 'all with mask')
WHERE __$operation <> 1
Modified Query [00:01:10]
DECLARE @MinLSN binary(10)
DECLARE @MaxLSN binary(10)
SELECT @MaxLSN= sys.fn_cdc_get_max_lsn()
SELECT @MinLSN=CASE WHEN sys.fn_cdc_increment_lsn(0x00)<sys.fn_cdc_get_min_lsn('dbo_PartitionTest')
THEN sys.fn_cdc_get_min_lsn('dbo_PartitionTest')
ELSE sys.fn_cdc_increment_lsn(0x00) END
SELECT *
FROM cdc.fn_cdc_get_net_changes_dbo_PartitionTest(
@MinLSN, @MaxLSN, 'all with mask') WHERE __$operation <> 1
[Modified]
I tried to recreate the scenario with a similar function to see if the parameters are evaluated for each row.
CREATE FUNCTION Fn_Test(@a decimal) RETURNS TABLE
AS
RETURN
(
SELECT @a Parameter, Getdate() Dt, PartitionTest.*
FROM PartitionTest
);
SELECT * FROM Fn_Test(RAND(DATEPART(s,GETDATE())))
But I am getting the same value for the column 'Parameter' for all of a million records, processed in 38 seconds.

In your first query, fn_cdc_increment_lsn and fn_cdc_get_min_lsn get executed for every row. In the second example, they are executed just once.

Even deterministic scalar functions are evaluated at least once per row. If the same deterministic scalar function occurs multiple times on the same "row" with the same parameters, I believe only then will it be evaluated once - e.g. in a CASE WHEN fn_X(a, b, c) > 0 THEN fn_X(a, b, c) ELSE 0 END or something like that.
I think your RAND problem is because you continue to reseed:
Repetitive calls of RAND() with the
same seed value return the same
results.
For one connection, if RAND() is
called with a specified seed value,
all subsequent calls of RAND() produce
results based on the seeded RAND()
call. For example, the following query
will always return the same sequence
of numbers.
I have taken to caching scalar function results as you have indicated, even going so far as to precalculate tables of scalar function results and join to them. Something has to be done eventually to make scalar functions perform. Right now, the best option is the CLR; apparently CLR functions far outperform SQL UDFs. Unfortunately, I cannot use them in my current environment.
