Infinite loop with recursive SQL query - sql

I can't seem to find the reason behind the infinite loop in this query, nor how to correct it.
Here is the context :
I have a table called mergesWith with this description :
mergesWith: information about neighboring seas. Note that in this relation, for every pair of
neighboring seas (A,B), only one tuple is given – thus, the relation is not symmetric.
sea1: a sea
sea2: a sea.
I want to know every sea accessible from the Mediterranean Sea by navigating. I have opted for a recursive query using "with" :
With
acces(p,d) as (
select sea1 as p, sea2 as d
from MERGESWITH
UNION ALL
select a.p, case when mw.sea1=a.d
then mw.sea2
else mw.sea1
end as d
from acces a, MERGESWITH mw
where a.d=mw.sea1 or a.d=mw.sea2)
select d
from acces
where p= 'Mediterranean Sea';
I think the cause is either the case when or the a.d=mw.sea1 or a.d=mw.sea2 that is not restrictive enough, but I can't seem to pinpoint why.
I get this error message :
32044. 00000 - "cycle detected while executing recursive WITH query"
*Cause: A recursive WITH clause query produced a cycle and was stopped
in order to avoid an infinite loop.
*Action: Rewrite the recursive WITH query to stop the recursion or use
the CYCLE clause.

The cycles are caused by the structure of your query, not by cycles in the data. You ask for the reason for cycling. That should be obvious: at the first iteration, one row of output has d = 'Aegean Sea'. At the second iteration, you will find a row with d = 'Mediterranean Sea', right? Can you now see how this will result in cycles?
Recursive queries have a cycle clause used exactly for this kind of problem. For some reason, even many users who learned the recursive with clause well, and use it all the time, seem unaware of the cycle clause (as well as the unrelated, but equally useful, search clause - used for ordering the output).
In your code, you need to make two changes. Add the cycle clause, and also in the outer query filter for non-cycle rows only. In the cycle clause, you can decide what to call the "cycle" column, and what values to give it. To make this look as similar to connect by queries as possible, I like to call the new column IS_CYCLE and to give it the values 0 (for no cycle) and 1 (for cycle). In the outer query below, add is_cycle to the select list to see what it adds to the recursive query.
Notice the position of the cycle clause: it comes right after the recursive with clause (in particular, after the closing parenthesis at the end of the recursive factored subquery).
with
acces(p,d) as (
select sea1 as p, sea2 as d
from MERGESWITH
UNION ALL
select a.p, case when mw.sea1=a.d
then mw.sea2
else mw.sea1
end as d
from acces a, MERGESWITH mw
where a.d=mw.sea1 or a.d=mw.sea2)
cycle d set is_cycle to 1 default 0 -- add this line
select d
from acces
where p= 'Mediterranean Sea'
and is_cycle = 0 -- and this line
;

Clearly, this would be data-dependent due to cycles in the data. I typically include a lev value when developing recursive CTEs. This makes it simpler to debug them.
So, try something like this:
with acces(p, d, lev) as (
select sea1 as p, sea2 as d, 1 as lev
from MERGESWITH
union all
select a.p,
(case when mw.sea1 = a.d then mw.sea2 else mw.sea1 end) as d,
lev + 1
from acces a join
MERGESWITH mw
on a.d in (mw.sea1, mw.sea2)
where lev < 5)
select d
from acces
where p = 'Mediterranean Sea';
If you find the reason but can't fix the code, ask a new question with sample data and desired results. A DB fiddle of some sort is also helpful.

Related

Athena/Presto | Can't match ID row on self join

I'm trying to get the bi-grams on a string column.
I've followed the approach here but Athena/Presto is giving me errors at the final steps.
Source code so far
with word_list as (
SELECT
transaction_id,
words,
n,
regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)') as f70,
f70_remittance_info
FROM exploration_transaction
cross join unnest(regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)')) with ordinality AS t (words, n)
where cardinality((regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)'))) > 1
and f70_remittance_info is not null
limit 50 )
select wl1.f70, wl1.n, wl1.words, wl2.f70, wl2.n, wl2.words
from word_list wl1
join word_list wl2
on wl1.transaction_id = wl2.transaction_id
The specific issue I'm having is on the very last line, when I try to self join the transaction ids - it always returns zero rows. It does work if I join only by wl1.n = wl2.n-1 (the position on the array) which is useless if I can't constrain it to a same id.
Athena doesn't support the ngrams function by presto, so I'm left with this approach.
Any clues why this isn't working?
Thanks!
This is speculation. But I note that your CTE is using limit with no order by. That means that an arbitrary set of rows is being returned.
Although some databases materialize CTEs, many do not. They run the code independently each time it is referenced. My guess is that the code is run independently and the arbitrary set of 50 rows has no transaction ids in common.
One solution would be to add order by transacdtion_id in the subquery.

SQL: execution model for recursive CTEs

I'm trying to understand how recursive CTEs are executed and particularly what causes them to terminate. Here's a simple example:
WITH cte_increment(n) AS (
SELECT -- Select 1
0
UNION ALL -- Union
SELECT -- Select 2
n + 1
FROM -- From
cte_increment
WHERE -- Where
n < 6
)
SELECT
*
FROM
cte_increment
;
My current mental model is that when the expression is invoked, the clauses should be executed in this order:
Select 1
From
Where
Select 2
Union
However, I don't think that can be happening because the From clause recursively invokes the same expression, which would restart the same process one level deeper. That would lead to infinite recursions, which would only stop when it hit the recursion limit.
My question is, how does the CTE ever check its termination condition?
Thanks in advance for your help!
So, "recursion" is a little bit of a misnomer here. It is really iterative, starting at the anchor condition. Perhaps a more accurate term would be "induction", in the mathematical sense.
Each iteration only uses the rows from the previous iteration. It doesn't use all the rows from the CTE.
So, the first iteration generates:
0
The second only sees the 1 and generates
1
The third only sees the 2 and generates
2
And so on until you get to where the where condition says "no more".

Tips about optimizing this multi-layered (with many layers of subqueries) SQL query

I need your kind help with this SQL query. I have a query with 6 layers of subqueries that's currently structured like this. I am looking forward to advice how to:
Reduce the layers without repeating the same statement (for example, I could replace 'case when E>200' with '(Case when T2.BB >100 then B+C else B+D end) > 200' and write the statement in layer 1, hence eliminating layer2. I can't do this because in my raw queries I have a computed column that is based on another computed column in its sub-query, which is then calculated based on another computed column in the sub-sub-query... So repeating codes 5/6 time will confuse me and drive me crazy...
Avoid using select 2., select 1. while still keeping all the columns (F,E,A,B,C,D,T2.BB) in the final output. I want to do this because in my raw query there is 5 select.* -- I feel like this causes the servers to do much redundant work and slows down query execution.
Thanks very much for your help!
Select
2.*,
case when E > 200 then 'OK' else 'OH NO' end F
From
(Select
1.*,
Case when T2.BB >100 then B+C else B+D end E
From
(Select
A, B, C, D, T2.BB
From
T1
Join
T2 on T1.A = T2.AA) 1
) 2
try WITH statements, they look cleaner because they help to maintain the order in which you apply the logic, so it will look like this:
WITH
t1 as (
select ...
from src_table
)
,t2 as (
select *, ...
from t1
)
<<as much layers as needed>>
also if you need to reuse something in 2 different places you can reference the prior WITH statement from any following statement, i.e. encapsulate that logic in one place

RECURSIVE in SQL

I'm learning SQL and had a hard time understanding the following recursive SQL statement.
WITH RECURSIVE t(n) AS (
SELECT 1
UNION ALL
SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;
What is n and t from SELECT sum(n) FROM t;? As far as I could understand, n is a number of t is a set. Am I right?
Also how is recursion triggered in this statement?
The syntax that you are using looks like Postgres. "Recursion" in SQL is not really recursion, it is iteration. Your statement is:
WITH RECURSIVE t(n) AS (
SELECT 1
UNION ALL
SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;
The statement for t is evaluated as:
Evaluate the non-self-referring part (select 1).
Then evaluate the self-referring part. (Initially this gives 2.)
Then evaluation the self-referring part again. (3).
And so on while the condition is still valid (n < 100).
When this is done the t subquery is finished, and the final statement can be evaluated.
This is called a Common Table Expression, or CTE.
The RECURSIVE from the query doesn't mean anything: it's just another name like n or t. What makes things recursive is that the CTE named t references itself inside the expression. To produce the result of the expression, the query engine must therefore recursively build the result, where each evaluation triggers the next. It reaches this point: SELECT n+1 FROM t... and has to stop and evaluate t. To do that, it has to call itself again, and so on, until the condition (n < 100) no longer holds. The SELECT 1 provides a starting point, and the WHERE n < 100 makes it so that the query does not recur forever.
At least, that's how it's supposed to work conceptually. What generally really happens is that the query engine builds the result iteratively, rather than recursively, if it can, but that's another story.
Let's break this apart:
WITH RECURSIVE t(n) AS (
A Common Table Expression (CTE) which is supposed to include a seed query and a recursive query. CTE is called t and returns 1 column: n
The seed query:
SELECT 1
returns ans answer set (in this case a just a single row: 1) and puts a copy of it into the final answer set
Now starts the recursive part:
UNION ALL
The rows returned from the seed query are now processed and n+1 is returned (again a single row answer set: 2) and copied into the final answer set:
SELECT n+1 FROM t WHERE n < 100
If this step returned a non-empty answer set (activity_count > 0) it's repeated (forever).
A WHERE-condition on a calculation like this n+1 is usually used to avoid an endless recursion. One usually knows the maximum possible level based on the data and for complex queries it's too easy to put some conditions wrong ;-)
Finally the answer set is returned:
)
SELECT sum(n) FROM t;
When you simply do a SELECT * FROM t; you'll see all numbers from 1 to 100, it's not a very efficient way to produce this list.
The most important thing to remember is that each step produces a part of the final result and only those rows from the previous step are processed in the next recursion level.

Retrieve hierarchical groups ... with infinite recursion

I've a table like this which contains links :
key_a key_b
--------------
a b
b c
g h
a g
c a
f g
not really tidy & infinite recursion ...
key_a = parent
key_b = child
Require a query which will recompose and attribute a number for each hierarchical group (parent + direct children + indirect children) :
key_a key_b nb_group
--------------------------
a b 1
a g 1
b c 1
**c a** 1
f g 2
g h 2
**link responsible of infinite loop**
Because we have
A-B-C-A
-> Only want to show simply the link as shown.
Any idea ?
Thanks in advance
The problem is that you aren't really dealing with strict hierarchies; you're dealing with directed graphs, where some graphs have cycles. Notice that your nbgroup #1 doesn't have any canonical root-- it could be a, b, or c due to the cyclic reference from c-a.
The basic way of dealing with this is to think in terms of graph techniques, not recursion. In fact, an iterative approach (not using a CTE) is the only solution I can think of in SQL. The basic approach is explained here.
Here is a SQL Fiddle with a solution that addresses both the cycles and the shared-leaf case. Notice it uses iteration (with a failsafe to prevent runaway processes) and table variables to operate; I don't think there's any getting around this. Note also the changed sample data (a-g changed to a-h; explained below).
If you dig into the SQL you'll notice that I changed some key things from the solution given in the link. That solution was dealing with undirected edges, whereas your edges are directed (if you used undirected edges the entire sample set is a single component because of the a-g connection).
This gets to the heart of why I changed a-g to a-h in my sample data. Your specification of the problem is straightforward if only leaf nodes are shared; that's the specification I coded to. In this case, a-h and g-h can both get bundled off to their proper components with no problem, because we're concerned about reachability from parents (even given cycles).
However, when you have shared branches, it's not clear what you want to show. Consider the a-g link: given this, g-h could exist in either component (a-g-h or f-g-h). You put it in the second, but it could have been in the first instead, right? This ambiguity is why I didn't try to address it in this solution.
Edit: To be clear, in my solution above, if shared braches ARE encountered, it treats the whole set as a single component. Not what you described above, but it will have to be changed after the problem is clarified. Hopefully this gets you close.
You should use a recursive query. In the first part we select all records which are top level nodes (have no parents) and using ROW_NUMBER() assign them group ID numbers. Then in the recursive part we add to them children one by one and use parent's groups Id numbers.
with CTE as
(
select t1.parent,t1.child,
ROW_NUMBER() over (order by t1.parent) rn
from t t1 where
not exists (select 1 from t where child=t1.parent)
union all
select t.parent,t.child, CTE.rn
from t
join CTE on t.parent=CTE.Child
)
select * from CTE
order by RN,parent
SQLFiddle demo
Painful problem of graph walking using recursive CTEs. This is the problem of finding connected subgraphs in a graph. The challenge with using recursive CTEs is to prevent unwarranted recursion -- that is, infinite loops In SQL Server, that typically means storing them in a string.
The idea is to get a list of all pairs of nodes that are connected (and a node is connected with itself). Then, take the minimum from the list of connected nodes and use this as an id for the connected subgraph.
The other idea is to walk the graph in both directions from a node. This ensures that all possible nodes are visited. The following is query that accomplishes this:
with fullt as (
select keyA, keyB
from t
union
select keyB, keyA
from t
),
CTE as (
select t.keyA, t.keyB, t.keyB as last, 1 as level,
','+cast(keyA as varchar(max))+','+cast(keyB as varchar(max))+',' as path
from fullt t
union all
select cte.keyA, cte.keyB,
(case when t.keyA = cte.last then t.keyB else t.keyA
end) as last,
1 + level,
cte.path+t.keyB+','
from fullt t join
CTE
on t.keyA = CTE.last or
t.keyB = cte.keyA
where cte.path not like '%,'+t.keyB+',%'
) -- select * from cte where 'g' in (keyA, keyB)
select t.keyA, t.keyB,
dense_rank() over (order by min(cte.Last)) as grp,
min(cte.Last)
from t join
CTE
on (t.keyA = CTE.keyA and t.keyB = cte.keyB) or
(t.keyA = CTE.keyB and t.keyB = cte.keyA)
where cte.path like '%,'+t.keyA+',%' or
cte.path like '%,'+t.keyB+',%'
group by t.id, t.keyA, t.keyB
order by t.id;
The SQLFiddle is here.
you might want to check with COMMON TABLE EXPRESSIONS
here's the link