Cycle detection for recursive SQL WITH clause

Assume we have a normal hierarchical table with a parent column pointing to each row's parent. I wanted to build a query that enumerates all ancestors using the SQL WITH clause.
with data_ancestors (par, chi) AS (
    SELECT d.parent, d.dat_id
    FROM data d
    WHERE d.parent IS NOT NULL
    UNION ALL
    SELECT p.parent, a.chi
    FROM data_ancestors a
    JOIN data p ON p.dat_id = a.par
    WHERE p.parent IS NOT NULL
)
select * from data_ancestors where par = 1 order by chi;
The problem here is that although the data should not contain cycles, this is not guaranteed. In that wrong case I want the functionality to degrade gracefully (loops should be arbitrarily broken). However, Oracle ends with an error during execution on such "wrong" data.
I know I can use a different, more Oracle-specific approach such as:
select p.dat_id, a.dat_id from data p, data a where a.dat_id in (
    select d.dat_id from data d
    start with d.dat_id = p.dat_id
    connect by nocycle prior d.dat_id = d.parent
);
or, as suggested in this question, to implement cycle detection myself.
However, are there any other nice solutions (mainly for Oracle, but also for other DBs) that solve the recursion problems with the WITH clause?
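Since 11.2, Oracle's recursive WITH also accepts a CYCLE clause (e.g. CYCLE chi SET is_cycle TO '1' DEFAULT '0') that marks loops instead of raising an error. A portable alternative that works in other databases is to carry the visited path in an extra CTE column and filter out revisits. Below is a minimal sketch of the path-tracking idea, run in SQLite via Python; the table data/dat_id/parent comes from the question, while the cyclic sample rows (4 and 5) are invented for the demonstration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE data (dat_id INTEGER PRIMARY KEY, parent INTEGER);
-- 1 -> 2 -> 3 is a clean chain; 4 and 5 form a deliberate cycle
INSERT INTO data VALUES (1, NULL), (2, 1), (3, 2), (4, 5), (5, 4);
""")

rows = con.execute("""
WITH RECURSIVE data_ancestors (par, chi, path) AS (
    SELECT d.parent, d.dat_id, '/' || d.dat_id || '/'
    FROM data d
    WHERE d.parent IS NOT NULL
    UNION ALL
    SELECT p.parent, a.chi, a.path || p.dat_id || '/'
    FROM data_ancestors a
    JOIN data p ON p.dat_id = a.par
    WHERE p.parent IS NOT NULL
      -- stop instead of erroring when a node reappears on the path
      AND a.path NOT LIKE '%/' || p.dat_id || '/%'
)
SELECT par, chi FROM data_ancestors WHERE par = 1 ORDER BY chi
""").fetchall()
print(rows)   # [(1, 2), (1, 3)] - the cycle rows never reach par = 1
```

The path column grows with depth, so this costs memory on deep trees, but it degrades gracefully on bad data rather than failing.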

Related

force Oracle to process recursive CTE on remote db site (perhaps using DRIVING_SITE hint)

I am trying to fetch data from a remote table. The data is expanded from a seed set of rows in a local table using a recursive CTE. The query is very slow (expanding 300 seed rows to 800 final rows takes 7 minutes).
For other "tiny local, huge remote" cases with no recursive query, the DRIVING_SITE hint works excellently. I also tried exporting the seed set from the local table into an auxiliary table with the same structure on remotedb and, logged in on remotedb, running the query as a pure local query (my_table as p, my_table_seed_copy as i). It took 4 s, which encouraged me to believe that forcing the query to the remote site would make it fast.
What's the correct way to force Oracle to execute a recursive query on the remote site?
with s (id, data) as (
    select p.id, p.data
    from my_table#remotedb p
    where p.id in (select i.id from my_table i)
    union all
    select p.id, p.data
    from s
    join my_table#remotedb p on ...
)
select /*+DRIVING_SITE(p)*/ s.*
from s;
In the query above, I tried
select /*+DRIVING_SITE(p)*/ s.* in main select
select /*+DRIVING_SITE(s)*/ s.* in main select
omitting DRIVING_SITE in whole query
select /*+DRIVING_SITE(x)*/ s.* from s, dual#remotedb x as main select
select /*+DRIVING_SITE(p)*/ p.id, p.data in first inner select
select /*+DRIVING_SITE(p)*/ p.id, p.data in both inner selects
select /*+DRIVING_SITE(p) MATERIALIZE*/ p.id, p.data in both inner selects
(Just for completeness: rewriting to connect by is not applicable in this case; the actual query is more complex and uses constructs that cannot be expressed with connect by.)
All without success (i.e. the data was still returned only after 7 minutes).
The recursive query actually performs a breadth-first search: the seed rows represent the 0th level, and the recursive part finds elements on the n-th level from elements on the (n-1)-th level. The original query was intended to be part of a merge ... using ... clause.
Hence I rewrote the query as a PL/SQL loop. Every iteration generates one level. The merge prevents insertion of duplicates, so eventually no new row is added and the loop exits (the transitive closure is constructed). Pseudocode:
loop
    merge into my_table using (
        select /*+DRIVING_SITE(r)*/ distinct r.* /*###BULKCOLLECT###*/
        from my_table l
        join my_table#remotedb r on ... -- same condition as s and p in original query are joined on
    ) ...
    exit when rows_inserted = 0;
end loop;
The actual code is not so simple, since DRIVING_SITE does not directly work with merge, so we have to transfer the data via a work collection, but that's a different story. Also, the count of inserted rows cannot be easily determined; it must be computed as the difference between the row counts after and before the merge.
The solution is not ideal. Anyway, it is much faster than the recursive CTE (30 s, 13 iterations) because the queries demonstrably utilize the DRIVING_SITE hint.
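The merge-until-no-new-rows loop can be sketched independently of Oracle. Here is a toy version in Python with SQLite, where INSERT OR IGNORE plays the role of merge's duplicate suppression; the tables edges and reached are invented stand-ins for the remote table and the local seed table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- stand-in for the remote table: one row per link of the expansion
CREATE TABLE edges (src INTEGER, dst INTEGER);
INSERT INTO edges VALUES (1, 2), (2, 3), (3, 4), (10, 11);
-- local table holding the seed rows
CREATE TABLE reached (id INTEGER PRIMARY KEY);
INSERT INTO reached VALUES (1);
""")

while True:
    before = con.execute("SELECT COUNT(*) FROM reached").fetchone()[0]
    # one "level" of the breadth-first expansion; the PRIMARY KEY plus
    # INSERT OR IGNORE suppresses rows we already have
    con.execute("""
        INSERT OR IGNORE INTO reached (id)
        SELECT e.dst FROM reached r JOIN edges e ON e.src = r.id
    """)
    after = con.execute("SELECT COUNT(*) FROM reached").fetchone()[0]
    # no new rows inserted -> transitive closure is complete
    if after == before:
        break

closure = sorted(i for (i,) in con.execute("SELECT id FROM reached"))
print(closure)   # [1, 2, 3, 4] - the 10 -> 11 edge is never reached
```

In the Oracle version, the inner SELECT is the piece that carries the DRIVING_SITE hint, which is exactly what the recursive CTE could not do.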
I will leave the question open for some time, in case somebody finds an answer for how to make the recursive query work, or proves it is not possible.

Teradata SQL Tuning: What was the purpose of the code below

I tuned a badly skewed query written by a Teradata consultant a few years back. The same code was a perpetually high-CPU report, and it has gotten worse.
SELECT
    c.child,
    a.username,
    CAST(SUM((a.AmpCPUTime(DEC(18,3))) +
             ZEROIFNULL(a.ParserCPUTime)) AS DECIMAL(18,3))
FROM pdcrinfo.dbqlogtbl a
LEFT OUTER JOIN (
    SELECT queryid, logdate,
           MIN(objectdatabasename) AS objectdatabasename
    FROM pdcrinfo.dbqlobjtbl_hst
    GROUP BY 1,2
) b ON a.queryid = b.queryid
JOIN dbc.children c ON b.objectdatabasename = c.child
WHERE c.parent = 'FINDB'
  AND a.logdate BETWEEN '2015-12-01' AND '2015-12-31'
  AND b.logdate BETWEEN '2015-12-01' AND '2015-12-31'
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3;
I already rewrote the query joining the log & obj tables, which have the same PI, and then doing an EXISTS on the dbc.children table, and it runs fabulously with the same output.
But I thought I just got lucky, because FINDB does not have any children that are view databases.
My question:
I am trying to understand the purpose of
MIN(objectdatabasename)
Most of our table database names precede view database names (which are of the form findb_vw etc.), so I think he may have been trying to eliminate view databases?
The other thing: why the LOJ (I changed it to an IJ)? Since you want a value for objectdatabasename, I think the LOJ does not apply here.
I am not sure, so I am throwing the question open. Just to clarify: I am not looking for tuning tips; I want other perspectives on the MIN(objectdatabasename) code.
You're right, the Left Join is useless (but the optimizer will change it to an Inner Join anyway, so it's just confusing).
The MIN (objectdatabasename) was probably used to avoid multiple rows for the same queryid resulting in duplicate rows (and maybe to remove the view dbs).
But IMHO the main reason for the bad performance is a missing join condition between the DBQL tables. The tables in pdcrinfo should be partitioned by LogDate, and you need to add AND a.LogDate = b.LogDate to the existing ON a.queryid = b.queryid to get a fast join (PI + partitioning); otherwise the optimizer must do some kind of preparation or a more expensive sliding-window join.
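The duplicate-row inflation that the MIN() pre-aggregation prevents is easy to reproduce. Here is a toy sketch in Python with SQLite; the tables log and obj are invented, heavily simplified stand-ins for dbqlogtbl and dbqlobjtbl_hst:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE log (queryid INTEGER, cpu REAL);
INSERT INTO log VALUES (1, 10.0);
-- one query touches several objects, so obj has several rows per queryid
CREATE TABLE obj (queryid INTEGER, objectdatabasename TEXT);
INSERT INTO obj VALUES (1, 'findb'), (1, 'findb_vw'), (1, 'workdb');
""")

# naive join: the single log row is tripled and SUM(cpu) is inflated
naive = con.execute("""
    SELECT SUM(l.cpu) FROM log l JOIN obj o ON l.queryid = o.queryid
""").fetchone()[0]

# pre-aggregating obj to one row per queryid keeps the sum correct; MIN()
# also happens to pick 'findb' over the alphabetically later 'findb_vw'
fixed = con.execute("""
    SELECT SUM(l.cpu)
    FROM log l
    JOIN (SELECT queryid, MIN(objectdatabasename) AS dbname
          FROM obj GROUP BY queryid) o ON l.queryid = o.queryid
""").fetchone()[0]

print(naive, fixed)   # 30.0 10.0
```

So the GROUP BY is what protects the SUM; picking the minimum name is a side effect that only incidentally favors table databases over *_vw view names.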

SQL Server query runs slower when nothing is returned

My query runs slowly when the result set is empty. When there is something to return, it is lightning fast.
;with tree(NodeId,CategoryId,ParentId) as (
select ct.NodeId, ct.CategoryId, ct.ParentId
from dbo.CategoryTree as ct
where ct.ParentId = 6
union all
select t.NodeId, t.CategoryId, t.ParentId from dbo.CategoryTree as t
inner join tree as t2 on t.ParentId = t2.NodeId
), branch(NodeId,CategoryId,ParentId) as
(
select NodeId, CategoryId, ParentId from dbo.CategoryTree as t
where t.NodeId = 6
union all
select NodeId, CategoryId, ParentId
from tree as t
),facil(FacilityId) as(
select distinct fct.FacilityId
from dbo.FacilitiesCategoryTree as fct
inner join branch b on b.NodeId = fct.CategoryNodeId
)
select top 51 f.Id, f.CityId, f.NameGEO,
f.NameENG, f.NameRUS, f.DescrGEO, f.DescrENG,
f.DescrRUS, f.MoneyMin, f.MoneyAvg, f.Lat, f.Lng, f.SortIndex,
f.FrontImgUrl from dbo.Facilities f
inner join facil t2 on t2.FacilityId = f.Id
and f.EnabledUntil > 'Jan 14 2015 10:23PM'
order by f.SortIndex
Principal tables are:
Facilities table holds facilities, 256k records.
CategoryTree is used to group categories in a hierarchy.
NodeId int,
CategoryId int,
ParentId int
FacilitiesCategoryTree is used to link CategoryTree to Facilities.
Given a NodeId, the second CTE returns all the nodes that are descendants of the given node, including the node itself. A third CTE then returns the facility ids that belong to these nodes.
Finally, the last CTE is joined to the actual Facilities table. The result is ordered by SortIndex, which is used to manually set the order of facilities.
This query runs very fast when there is something to return even if I include many more predicates including full-text search and others, but when the given branch does not have any facilities, this query takes approx. 2 seconds to run.
If I exclude the order by clause, the query runs very fast again. All these tables are indexed and the query optimizer does not suggest any improvements.
What do you think is the problem and what can be done to improve the performance of queries with empty results?
Thank you.
Update1:
I am adding execution plans.
http://www.filedropper.com/withorderby
http://www.filedropper.com/withoutorderby
Update2:
I went through the recommendations of oryol and tried saving the facility IDs from tree into a table variable, joining it with the Facilities table, and ordering by SortIndex. It eliminated the problem with empty results but increased the execution time of queries with a result set from 250 ms to 950 ms.
I also changed the query to select from facil and join it to Facilities, and added option (force order). The result was the same as above.
Finally, I denormalized the facility/category mapping table to include SortIndex. It increased the execution time of normal queries slightly, from 250 ms to 300 ms, but it resolved the empty result set problem. I guess I'll stick with this method.
The first thing: you can slightly simplify the first two CTEs into one:
with tree(NodeId,CategoryId,ParentId) as (
select ct.NodeId, ct.CategoryId, ct.ParentId
from dbo.CategoryTree as ct
where ct.NodeId = 6
union all
select t.NodeId, t.CategoryId, t.ParentId from dbo.CategoryTree as t
inner join tree as t2 on t.ParentId = t2.NodeId
)
The main problem is that the optimizer doesn't know, or incorrectly estimates, the number of facilities that will be returned for your categories. And because you need the facilities ordered by SortIndex, the optimizer decides to:
Go through all facilities ordered by SortIndex (using the appropriate index)
Skip rows which are not covered by other filters (EnabledUntil)
Using the given facility Id, look for one matching row among the facilities from the category tree; if it exists, return a result row, otherwise skip this facility
Repeat this iteration until 51 rows have been returned
So, in the worst case (if there are no 51 such facilities, or they have very large SortIndex values), it will require a scan of the whole idx_Facilities_SortIndex, which takes a lot of time.
There are several ways to resolve this issue (including hints that tell the optimizer about row counts or join order); to find the best way, it's better to work with the real database. The first option to try is to change the query to:
Save facility IDs from tree to the table variable
Join it with facilities table and order by SortIndex
Another option (which can also be combined with the first one) is to try the FORCE ORDER query hint. In that case you will need to modify your select statement to select from facil, join it to Facilities, and add the option (force order) query hint to the end of the statement.
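The first option (materialize the tree's facility ids, then join and sort) can be sketched as follows. This uses a SQLite temp table via Python as a stand-in for the T-SQL table variable, and a pared-down, invented version of the question's schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE CategoryTree (NodeId INTEGER, CategoryId INTEGER, ParentId INTEGER);
INSERT INTO CategoryTree VALUES (6, 1, NULL), (7, 2, 6), (8, 3, 7);
CREATE TABLE FacilitiesCategoryTree (FacilityId INTEGER, CategoryNodeId INTEGER);
INSERT INTO FacilitiesCategoryTree VALUES (100, 7), (101, 8);
CREATE TABLE Facilities (Id INTEGER PRIMARY KEY, SortIndex INTEGER);
INSERT INTO Facilities VALUES (100, 2), (101, 1), (999, 3);

-- step 1: materialize the branch's facility ids once (the "table variable")
CREATE TEMP TABLE branch_facilities AS
WITH RECURSIVE tree(NodeId) AS (
    SELECT NodeId FROM CategoryTree WHERE NodeId = 6
    UNION ALL
    SELECT ct.NodeId FROM CategoryTree ct JOIN tree t ON ct.ParentId = t.NodeId
)
SELECT DISTINCT fct.FacilityId
FROM FacilitiesCategoryTree fct
JOIN tree ON tree.NodeId = fct.CategoryNodeId;
""")

# step 2: the top-N sort now runs against a small, known-size id set,
# so an empty branch means an empty join input, not a full index scan
rows = con.execute("""
    SELECT f.Id
    FROM Facilities f
    JOIN branch_facilities b ON b.FacilityId = f.Id
    ORDER BY f.SortIndex
    LIMIT 51
""").fetchall()
print(rows)   # [(101,), (100,)]
```

The point of the materialization is that the optimizer now has an exact row count for the id set before choosing the sort strategy.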
The query without order by selects all facilities from tree and only afterwards extracts the other facility fields from the Facilities table.
Also, it's important to know the actual number of facilities in the tree (according to the estimates in the execution plan without order by, it's really big: 395982). Is this estimate (more or less) correct?
If you really have a large number of facilities returned after joining with the category tree and the facility/categories mapping table, then the best solution will be to denormalize the facility/category mapping table to include SortIndex, and to add an index on NodeId and SortIndex.
So actually, we need to test queries/indexes with real data, or to know various statistics of the data:
Categories amount
Number of facilities per category and total number of rows in facilities / categories mapping table
SortIndex distribution (is it unique?)
etc.

Reusing results from a SQL query in a following query in Sqlite

I am using a recursive WITH statement to select all children of a given parent in a table representing tree-structured entries. This is in Sqlite (which now supports recursive WITH).
This allows me to select thousands of records in this tree very quickly, without suffering the huge performance loss of preparing thousands of select statements from the calling application.
WITH RECURSIVE q(Id) AS
(
SELECT Id FROM Entity
WHERE Parent=(?)
UNION ALL
SELECT m.Id FROM Entity AS m
JOIN q ON m.Parent=q.Id
)
SELECT Id FROM q;
Now, suppose I have data related to these entities in an arbitrary number of other tables, which I want to load subsequently. Because of their arbitrary number (in a modular fashion), it is not possible to include the data fetching directly in this query; it must follow it.
But if I then do a SELECT statement for each related table, all the performance gained by selecting the tree data directly inside Sqlite is almost useless, because I will still stall on thousands of subsequent requests, each of which prepares and issues a select statement.
So, two questions:
Is the better solution to formulate a similar recursive statement for each of the related tables, one that recursively gathers the entities from this tree again and this time selects their related data by joining it?
That sounds much more efficient, but it's really tricky to formulate such a statement, and I'm a bit lost here.
Now the real mystery: would there be an even more efficient solution, which would somehow keep the results of the last query cached somewhere (the rows with the ids from the entity tree) and join them to the related tables in the following statement, without having to recursively iterate over the tree again?
Here is an attempt at the first option, supposing I want to select a field Data from a related table Component: is the second UNION ALL legal?
WITH RECURSIVE q(Data) AS
(
SELECT Id FROM Entity
WHERE Parent=(?)
UNION ALL
SELECT m.Id FROM Entity AS m
JOIN Entity ON m.Id=q.Parent
UNION ALL
SELECT Data FROM Component AS c
JOIN Component ON c.Id=q.Id
)
SELECT Data FROM q;
The documentation says:
 2. The table named on the left-hand side of the AS keyword must appear exactly once in the FROM clause of the right-most SELECT statement of the compound select, and nowhere else.
So your second query is not legal.
However, the CTE behaves like a normal table/view, so you can just join it to the related table:
WITH RECURSIVE q(Id) AS
( ... )
SELECT q.Id, c.Data
FROM q JOIN Component AS c ON q.Id = c.Id
If you want to reuse the computed values in q for multiple queries, there's nothing you can do with CTEs, but you can store them in a temporary table:
CREATE TEMPORARY TABLE q_123 AS
WITH RECURSIVE q(Id) AS
( ... )
SELECT Id FROM q;
SELECT * FROM q_123 JOIN Component ...;
SELECT * FROM q_123 JOIN Whatever ...;
DROP TABLE q_123;
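The temporary-table pattern above is directly runnable. A minimal end-to-end sketch in Python, with invented sample rows for Entity and Component:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Entity (Id INTEGER PRIMARY KEY, Parent INTEGER);
INSERT INTO Entity VALUES (1, NULL), (2, 1), (3, 2), (4, 1), (9, NULL);
CREATE TABLE Component (Id INTEGER, Data TEXT);
INSERT INTO Component VALUES (2, 'two'), (3, 'three'), (9, 'elsewhere');

-- compute the subtree of entity 1 once and keep it
CREATE TEMP TABLE q_123 AS
WITH RECURSIVE q(Id) AS (
    SELECT Id FROM Entity WHERE Parent = 1
    UNION ALL
    SELECT m.Id FROM Entity AS m JOIN q ON m.Parent = q.Id
)
SELECT Id FROM q;
""")

# reuse the materialized ids for as many related tables as needed,
# without re-running the recursion each time
comp = con.execute("""
    SELECT c.Data FROM q_123 JOIN Component AS c ON c.Id = q_123.Id ORDER BY c.Id
""").fetchall()
con.execute("DROP TABLE q_123")
print(comp)   # [('two',), ('three',)] - entity 9 is outside the subtree
```

Each related table costs one prepared statement against the temp table, which is the caching behavior the question asked about.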

OR query performance and strategies with Postgresql

In my application I have a table of application events that is used to generate a user-specific feed. Because the feed is generated with an OR query, I'm concerned about the performance of this heavily used query, and I wonder whether I'm approaching this wrong.
In the application, users can follow both other users and groups. When an action is performed (e.g., a new post is created), a feed_item record is created with the actor_id set to the user's id and the subject_id set to the group id in which the action was performed; actor_type and subject_type are set to the class names of the models. Since users can follow both groups and users, I need to generate a query that checks both the actor_id and subject_id, and it needs to select distinct records to avoid duplicates. Since it's an OR query, I can't use a normal index. And since a record is created every time an action is performed, I expect this table to grow large rather quickly.
Here's the current query (the followings table joins users to feeders, i.e. users and groups):
SELECT DISTINCT feed_items.* FROM "feed_items"
INNER JOIN "followings"
ON (
(followings.feeder_id = feed_items.subject_id
AND followings.feeder_type = feed_items.subject_type)
OR
(followings.feeder_id = feed_items.actor_id
AND followings.feeder_type = feed_items.actor_type)
)
WHERE (followings.follower_id = 42)
ORDER BY feed_items.created_at DESC LIMIT 30 OFFSET 0
So my questions:
Since this is a heavily used query, is there a performance problem here?
Is there any obvious way to simplify or optimize this that I'm missing?
What you have is called an exclusive arc and you're seeing exactly why it's a bad idea. The best approach for this kind of problem is to make the feed item type dynamic:
Feed Items: id, type (A or S for Actor or Subject), subtype (replaces actor_type and subject_type)
and then your query becomes
SELECT DISTINCT fi.*
FROM feed_items fi
JOIN followings f ON f.feeder_id = fi.id AND f.feeder_type = fi.type AND f.feeder_subtype = fi.subtype
or similar.
This may not completely or exactly represent what you need to do but the principle is sound: you need to eliminate the reason for the OR condition by changing your data model in such a way to lend itself to having performant queries being written against it.
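One way to see the principle is to store one (type, subtype) row per feed source instead of the paired actor/subject columns, so that a single equality join replaces the OR. A toy sketch in Python with SQLite; the column feed_source_id and all sample rows are invented for illustration, not the answer's exact schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- remodeled feed: one (type, subtype) pair per row instead of the
-- actor/subject column pair, so a single equality join suffices
CREATE TABLE feed_items (id INTEGER, feed_source_id INTEGER,
                         type TEXT, subtype TEXT);
INSERT INTO feed_items VALUES
    (1, 7, 'A', 'User'),
    (1, 20, 'S', 'Group'),
    (2, 8, 'A', 'User'),
    (2, 21, 'S', 'Group');
CREATE TABLE followings (follower_id INTEGER, feeder_id INTEGER,
                         feeder_type TEXT, feeder_subtype TEXT);
INSERT INTO followings VALUES (42, 7, 'A', 'User'), (42, 21, 'S', 'Group');
""")

# pure AND conditions: a composite index on
# (feed_source_id, type, subtype) could now serve the whole join
ids = con.execute("""
    SELECT DISTINCT fi.id
    FROM feed_items fi
    JOIN followings f ON f.feeder_id = fi.feed_source_id
                     AND f.feeder_type = fi.type
                     AND f.feeder_subtype = fi.subtype
    WHERE f.follower_id = 42
    ORDER BY fi.id
""").fetchall()
print(ids)   # [(1,), (2,)]
```

The trade-off is that each event is now written as several rows, one per source it can be matched on.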
EXPLAIN ANALYZE and time the query to see if there is a problem.
Also, you could try expressing the query as a union:
SELECT x.* FROM
(
SELECT feed_items.* FROM feed_items
INNER JOIN followings
ON followings.feeder_id = feed_items.subject_id
AND followings.feeder_type = feed_items.subject_type
WHERE (followings.follower_id = 42)
UNION
SELECT feed_items.* FROM feed_items
INNER JOIN followings
ON followings.feeder_id = feed_items.actor_id
AND followings.feeder_type = feed_items.actor_type
WHERE (followings.follower_id = 42)
) AS x
ORDER BY x.created_at DESC
LIMIT 30
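That the UNION rewrite returns the same rows as the OR join can be checked on toy data. A sketch in Python with SQLite standing in for PostgreSQL (the SQL is the same; all sample rows are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE feed_items (id INTEGER, actor_id INTEGER, actor_type TEXT,
                         subject_id INTEGER, subject_type TEXT, created_at TEXT);
INSERT INTO feed_items VALUES
    (1, 7, 'User', 20, 'Group', '2015-01-01'),
    (2, 8, 'User', 21, 'Group', '2015-01-02'),
    (3, 7, 'User', 21, 'Group', '2015-01-03');
CREATE TABLE followings (follower_id INTEGER, feeder_id INTEGER, feeder_type TEXT);
INSERT INTO followings VALUES (42, 7, 'User'), (42, 21, 'Group');
""")

# original shape: one join with an OR'ed condition, deduplicated by DISTINCT
or_rows = con.execute("""
    SELECT DISTINCT fi.id FROM feed_items fi
    JOIN followings f
      ON (f.feeder_id = fi.subject_id AND f.feeder_type = fi.subject_type)
      OR (f.feeder_id = fi.actor_id AND f.feeder_type = fi.actor_type)
    WHERE f.follower_id = 42
    ORDER BY fi.id
""").fetchall()

# rewrite: two index-friendly equality joins, deduplicated by UNION
union_rows = con.execute("""
    SELECT id FROM (
        SELECT fi.id FROM feed_items fi
        JOIN followings f ON f.feeder_id = fi.subject_id
                         AND f.feeder_type = fi.subject_type
        WHERE f.follower_id = 42
        UNION
        SELECT fi.id FROM feed_items fi
        JOIN followings f ON f.feeder_id = fi.actor_id
                         AND f.feeder_type = fi.actor_type
        WHERE f.follower_id = 42
    ) AS x
    ORDER BY id
""").fetchall()

print(or_rows, union_rows)   # both [(1,), (2,), (3,)]
```

Item 3 matches both branches and appears once either way; each UNION branch can use its own index, which is the whole point of the rewrite.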
But again, EXPLAIN ANALYZE and benchmark.
To find out whether there is a performance problem, measure it. PostgreSQL can EXPLAIN it for you.
I don't think the query needs simplifying; if you identify a performance problem, you may need to revise your indexes.