Force Oracle to process recursive CTE on remote db site (perhaps using DRIVING_SITE hint)

I am trying to fetch data from a remote table. The data is expanded from a seed set of data in a local table using a recursive CTE. The query is very slow (expanding 300 seed rows to 800 final rows takes 7 minutes).
For other "tiny local, huge remote" cases without a recursive query, the DRIVING_SITE hint works excellently. I also tried exporting the seed set from the local table into an auxiliary table on remotedb with the same structure and, logged in to remotedb, ran the query as a purely local query (my_table as p, my_table_seed_copy as i). It took 4s, which encouraged me to believe that forcing the query to the remote site would make it fast.
What's the correct way to force Oracle to execute the recursive query on the remote site?
with s (id, data) as (
select p.id, p.data
from my_table@remotedb p
where p.id in (select i.id from my_table i)
union all
select p.id, p.data
from s
join my_table@remotedb p on ...
)
select /*+DRIVING_SITE(p)*/ s.*
from s;
In the query above, I tried
select /*+DRIVING_SITE(p)*/ s.* in main select
select /*+DRIVING_SITE(s)*/ s.* in main select
omitting DRIVING_SITE in whole query
select /*+DRIVING_SITE(x)*/ s.* from s, dual@remotedb x as main select
select /*+DRIVING_SITE(p)*/ p.id, p.data in first inner select
select /*+DRIVING_SITE(p)*/ p.id, p.data in both inner selects
select /*+DRIVING_SITE(p) MATERIALIZE*/ p.id, p.data in both inner selects
(just for completeness - rewriting to connect by is not applicable for this case - actually the query is more complex and uses constructs which cannot be expressed by connect by)
All without success (i.e. data returned after 7 minutes).

The recursive query actually performs a breadth-first search: seed rows represent the 0-th level and the recursive part finds elements on the n-th level from elements on the (n-1)-th level. The original query was intended to be part of a merge ... using ... clause.
Hence I rewrote the query as a PL/SQL loop. Every cycle generates one level. The merge prevents insertion of duplicates, so eventually no new row is added and the loop exits (the transitive closure has been constructed). Pseudocode:
loop
merge into my_table using (
select /*+DRIVING_SITE(r)*/ distinct r.* /*###BULKCOLLECT###*/
from my_table l
join my_table@remotedb r on ... -- same condition as s and p in original question are joined on
) ...
exit when rows_inserted = 0;
end loop;
The actual code is not so simple, since DRIVING_SITE does not work directly with merge, so the data has to be transferred via a work collection, but that's a different story. Also, the count of inserted rows cannot be determined easily; it must be computed as the difference between the row count after and before the merge.
The solution is not ideal. Still, it's much faster than the recursive CTE (30s, 13 cycles) because the queries are provably utilizing the DRIVING_SITE hint.
I will leave the question open for some time in case somebody finds a way to make the recursive query work, or proves it is not possible.
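The level-by-level loop described above can be sketched in a runnable form. This is only an illustration under stated assumptions: it uses Python with an in-memory SQLite database instead of Oracle and PL/SQL, `remote_table` is a local stand-in for the table behind the database link, the join condition `r.parent_id = l.id` is hypothetical (the original condition is elided), and `INSERT OR IGNORE` plays the role of the merge. As in the text, the number of inserted rows is computed as the difference between the row count after and before each pass.

```python
import sqlite3


def expand_closure(conn):
    """Add one breadth-first level per pass; stop when a pass adds no rows."""
    passes = 0
    while True:
        before = conn.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]
        # Stand-in for the merge: rows whose id already exists are skipped.
        conn.execute(
            """
            INSERT OR IGNORE INTO my_table (id, data)
            SELECT DISTINCT r.id, r.data
            FROM my_table AS l
            JOIN remote_table AS r ON r.parent_id = l.id  -- hypothetical condition
            """
        )
        after = conn.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]
        passes += 1
        if after == before:  # no new rows: the transitive closure is complete
            return passes


def demo_closure():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE my_table (id INTEGER PRIMARY KEY, data TEXT);
        CREATE TABLE remote_table (id INTEGER PRIMARY KEY, parent_id INTEGER, data TEXT);
        INSERT INTO my_table VALUES (1, 'a');
        INSERT INTO remote_table VALUES
            (1, NULL, 'a'), (2, 1, 'b'), (3, 2, 'c'), (4, 3, 'd'), (5, 99, 'x');
        """
    )
    passes = expand_closure(conn)
    ids = [row[0] for row in conn.execute("SELECT id FROM my_table ORDER BY id")]
    return passes, ids
```

With this sample data the seed row 1 pulls in rows 2, 3 and 4 over three passes, and a fourth pass that adds nothing ends the loop. Row 5 is never reached because its parent is not in the set.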

Related

SQL Server Execute Order

As far as I know, the order of execution in SQL is
FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY
So I am confused by correlated queries like the code below.
Is the FROM/WHERE clause of the outer query executed first, or the SELECT in the inner query? Can anyone give me an idea and an explanation? Thanks
SELECT
*, COUNT(1) OVER(PARTITION BY A) pt
FROM
(SELECT
tt.*,
(SELECT COUNT(id) FROM t WHERE data <= 10 AND ID < tt.ID) AS A
FROM
t tt
WHERE
data > 10) t1
As I know the order of execute in SQL is FROM-> WHERE-> GROUP BY-> HAVING -> SELECT ->ORDER BY
False. False. False. Presumably what you are referring to is this part of the documentation:
The following steps show the logical processing order, or binding
order, for a SELECT statement. This order determines when the objects
defined in one step are made available to the clauses in subsequent
steps.
As the documentation explains, this refers to the scoping rules when a query is parsed. It has nothing to do with the execution order. SQL Server -- as with almost any database -- reserves the ability to rearrange the query however it likes for processing.
In fact, the execution plan is really a directed acyclic graph (DAG), whose components generally do not have a 1-1 relationship with the clauses in a query. SQL Server is free to execute your query in whatever way it decides is best, so long as it produces the result set that you have described.

Reusing results from a SQL query in a following query in Sqlite

I am using a recursive WITH statement to select all children of a given parent in a table representing tree-structured entries. This is in Sqlite (which now supports recursive WITH).
This allows me to select thousands of records in this tree very quickly, without suffering the huge performance loss due to preparing thousands of select statements from the calling application.
WITH RECURSIVE q(Id) AS
(
SELECT Id FROM Entity
WHERE Parent=(?)
UNION ALL
SELECT m.Id FROM Entity AS m
JOIN q ON m.Parent=q.Id
)
SELECT Id FROM q;
Now, suppose I have data related to these entities in an arbitrary number of other tables, which I want to load subsequently. Because their number is arbitrary (in a modular fashion), it is not possible to include the data fetching directly in this query. It must follow it.
But if I then do a SELECT statement for each of the related tables, the performance gain from selecting the whole tree directly inside Sqlite is almost useless, because I will still stall on thousands of subsequent requests, each of which prepares and issues a select statement.
So two questions :
The better solution is to formulate a similar recursive statement for each of the related tables, that will recursively gather the entities from this tree again, and this time select their related data by joining it.
This sounds really more efficient, but it's really tricky to formulate such a statement and I'm a bit lost here.
Now the real mystery is, would there be an even more efficient solution, which would be to somehow keep these results from the last query cached somewhere (the rows with the ids from the entity tree) and join them to the related tables in the following statement without having to recursively iterate over it again ?
Here is a try at the first option, supposing I want to select a field Data from related table Component : is the second UNION ALL legal ?
WITH RECURSIVE q(Data) AS
(
SELECT Id FROM Entity
WHERE Parent=(?)
UNION ALL
SELECT m.Id FROM Entity AS m
JOIN Entity ON m.Id=q.Parent
UNION ALL
SELECT Data FROM Component AS c
JOIN Component ON c.Id=q.Id
)
SELECT Data FROM q;
The documentation says:
 2. The table named on the left-hand side of the AS keyword must appear exactly once in the FROM clause of the right-most SELECT statement of the compound select, and nowhere else.
So your second query is not legal.
However, the CTE behaves like a normal table/view, so you can just join it to the related table:
WITH RECURSIVE q(Id) AS
( ... )
SELECT q.Id, c.Data
FROM q JOIN Component AS c ON q.Id = c.Id
If you want to reuse the computed values in q for multiple queries, there's nothing you can do with CTEs, but you can store them in a temporary table:
CREATE TEMPORARY TABLE q_123 AS
WITH RECURSIVE q(Id) AS
( ... )
SELECT Id FROM q;
SELECT * FROM q_123 JOIN Component ...;
SELECT * FROM q_123 JOIN Whatever ...;
DROP TABLE q_123;
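The temporary-table approach from the answer can be tried end to end with a small script. This is a minimal sketch, assuming Python's built-in `sqlite3` module and made-up sample rows. The `Tag` table is a hypothetical second related table standing in for "Whatever", and the temporary table is created first and then filled with `INSERT ... SELECT` instead of `CREATE TABLE ... AS`, which is merely one convenient way to pass the parent id as a bound parameter.

```python
import sqlite3


def load_related(conn, root_id):
    """Walk the tree once, cache the ids, then reuse them for each related table."""
    conn.execute("CREATE TEMPORARY TABLE subtree (Id INTEGER PRIMARY KEY)")
    conn.execute(
        """
        INSERT INTO subtree
        WITH RECURSIVE q(Id) AS (
            SELECT Id FROM Entity WHERE Parent = ?
            UNION ALL
            SELECT m.Id FROM Entity AS m JOIN q ON m.Parent = q.Id
        )
        SELECT Id FROM q
        """,
        (root_id,),
    )
    # Each follow-up query joins against the cached ids; no recursion is repeated.
    components = conn.execute(
        "SELECT s.Id, c.Data FROM subtree AS s "
        "JOIN Component AS c ON c.Id = s.Id ORDER BY s.Id"
    ).fetchall()
    tags = conn.execute(
        "SELECT s.Id, t.Label FROM subtree AS s "
        "JOIN Tag AS t ON t.Id = s.Id ORDER BY s.Id"
    ).fetchall()
    conn.execute("DROP TABLE subtree")
    return components, tags


def demo_subtree():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Entity (Id INTEGER PRIMARY KEY, Parent INTEGER);
        CREATE TABLE Component (Id INTEGER, Data TEXT);
        CREATE TABLE Tag (Id INTEGER, Label TEXT);  -- hypothetical related table
        INSERT INTO Entity VALUES (1, NULL), (2, 1), (3, 1), (4, 2), (5, NULL), (6, 5);
        INSERT INTO Component VALUES (2, 'c2'), (4, 'c4'), (6, 'c6');
        INSERT INTO Tag VALUES (3, 't3'), (6, 't6');
        """
    )
    return load_related(conn, 1)
```

For parent 1 the cached subtree is {2, 3, 4}, so only the components of entities 2 and 4 and the tag of entity 3 are returned. Entity 6 belongs to another tree and is left out.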

Cycle detection for recursive SQL WITH clause

Let us assume that we have a normal hierarchical table with a parent column pointing to each row's parent. I want to build a query that enumerates all ancestors with the SQL WITH clause.
with dta_ancestors (par, chi) AS (
SELECT d.parent, d.dat_id
FROM data d
WHERE d.parent IS NOT NULL
UNION ALL
SELECT p.parent, a.chi
FROM dta_ancestors a
JOIN data p
ON p.dat_id = a.par
WHERE p.parent IS NOT NULL
)
select * from dta_ancestors where par = 1 order by chi;
The problem here is that although the data should not contain cycles, this is not guaranteed. In such a faulty case I want the functionality to degrade gracefully (loops should be broken arbitrarily). However, Oracle ends with an error during execution on "wrong" data.
I know I can use a different, more Oracle-specific approach like:
select p.dat_id, a.dat_id from data p, data a where a.dat_id in (
select d.dat_id from data d start with d.dat_id = p.dat_id connect by nocycle prior d.dat_id = d.parent
);
or, as suggested in this question, do the cycle detection myself.
However, are there any other nice solutions (mainly for Oracle, but also for other DBs) that solve the recursion problem with the WITH clause?
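For the "other DBs" part of the question, one option can be sketched in SQLite, where a recursive CTE written with UNION instead of UNION ALL discards rows it has already produced, so a cycle stops generating new work. This is a sketch under stated assumptions, not an Oracle solution: it runs the ancestor query from above against made-up data in Python's `sqlite3` module, and rows 4 and 5 are deliberately set up as parents of each other. Oracle's recursive subquery factoring documents a CYCLE clause for this purpose, which is not shown or tested here.

```python
import sqlite3

ANCESTORS_SQL = """
    WITH RECURSIVE dta_ancestors(par, chi) AS (
        SELECT d.parent, d.dat_id
        FROM data AS d
        WHERE d.parent IS NOT NULL
        UNION            -- UNION, not UNION ALL: already-seen rows are dropped
        SELECT p.parent, a.chi
        FROM dta_ancestors AS a
        JOIN data AS p ON p.dat_id = a.par
        WHERE p.parent IS NOT NULL
    )
    SELECT par, chi FROM dta_ancestors ORDER BY chi, par
"""


def demo_cycles():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE data (dat_id INTEGER PRIMARY KEY, parent INTEGER);
        -- 1 <- 2 <- 3 is a proper chain; 4 and 5 point at each other (a cycle).
        INSERT INTO data VALUES (1, NULL), (2, 1), (3, 2), (4, 5), (5, 4);
        """
    )
    return conn.execute(ANCESTORS_SQL).fetchall()
```

The query terminates despite the cycle. Rows on the cycle end up listed as their own ancestors, for example (4, 4), which is one arbitrary way of breaking the loop and may need filtering depending on the use case.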

How to avoid nested SQL query in this case?

I have an SQL question, related to this and this question (but different). Basically I want to know how I can avoid a nested query.
Let's say I have a huge table of jobs (jobs) executed by a company in their history. These jobs are characterized by year, month, location and the code belonging to the tool used for the job. Additionally I have a table of tools (tools), translating tool codes to tool descriptions and further data about the tool. Now they want a website where they can select year, month, location and tool using a dropdown box, after which the matching jobs will be displayed. I want to fill the last dropdown with only the relevant tools matching the before selection of year, month and location, so I write the following nested query:
SELECT c.tool_code, t.tool_description
FROM (
SELECT DISTINCT j.tool_code
FROM jobs AS j
WHERE j.year = ....
AND j.month = ....
AND j.location = ....
) AS c
LEFT JOIN tools as t
ON c.tool_code = t.tool_code
ORDER BY c.tool_code ASC
I resorted to this nested query because it was much faster than performing a JOIN on the complete database and selecting from that. It got my query time down a lot. But as I have recently read that MySQL nested queries should be avoided at all cost, I am wondering whether I am wrong in this approach. Should I rewrite my query differently? And how?
No, you shouldn't, your query is fine.
Just create an index on jobs (year, month, location, tool_code) and tools (tool_code) so that the INDEX FOR GROUP-BY can be used.
The article you provided describes subquery predicates (IN (SELECT ...)), not nested queries (SELECT FROM (SELECT ...)).
Even with the subqueries, the article is wrong: while MySQL is not able to optimize all subqueries, it deals with IN (SELECT …) predicates just fine.
I don't know why the author chose to put DISTINCT here:
SELECT id, name, price
FROM widgets
WHERE id IN
(
SELECT DISTINCT widgetId
FROM widgetOrders
)
and why do they think this will help to improve performance, but given that widgetID is indexed, MySQL will just transform this query:
SELECT id, name, price
FROM widgets
WHERE id IN
(
SELECT widgetId
FROM widgetOrders
)
into an index_subquery
Essentially, this is just like EXISTS clause: the inner subquery will be executed once per widgets row with the additional predicate added:
SELECT NULL
FROM widgetOrders
WHERE widgetId = widgets.id
and stop on the first match in widgetOrders.
This query:
SELECT DISTINCT w.id,w.name,w.price
FROM widgets w
INNER JOIN
widgetOrders o
ON w.id = o.widgetId
will have to use temporary to get rid of the duplicates and will be much slower.
You could avoid the subquery by using GROUP BY, but if the subquery performs better, keep it.
Why do you use a LEFT JOIN instead of a JOIN to join tools?
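To see the nested query and the suggested index together, here is a small runnable sketch. It uses Python's `sqlite3` module as a stand-in for MySQL, so it says nothing about MySQL's actual execution plan. The sample rows, the index name `ix_jobs_filter` and the helper `relevant_tools` are made up for illustration. Tool code T9 deliberately has no row in `tools`, which shows what the LEFT JOIN changes compared to a plain JOIN.

```python
import sqlite3


def relevant_tools(conn, year, month, location):
    """Return (tool_code, tool_description) pairs for the tool dropdown."""
    return conn.execute(
        """
        SELECT c.tool_code, t.tool_description
        FROM (
            SELECT DISTINCT j.tool_code
            FROM jobs AS j
            WHERE j.year = ? AND j.month = ? AND j.location = ?
        ) AS c
        LEFT JOIN tools AS t ON c.tool_code = t.tool_code
        ORDER BY c.tool_code ASC
        """,
        (year, month, location),
    ).fetchall()


def demo_tools():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE jobs (year INTEGER, month INTEGER, location TEXT, tool_code TEXT);
        CREATE TABLE tools (tool_code TEXT PRIMARY KEY, tool_description TEXT);
        -- Composite index matching the filter columns plus the selected column.
        CREATE INDEX ix_jobs_filter ON jobs (year, month, location, tool_code);
        INSERT INTO jobs VALUES
            (2020, 1, 'NY', 'T1'), (2020, 1, 'NY', 'T1'), (2020, 1, 'NY', 'T2'),
            (2020, 2, 'NY', 'T3'), (2020, 1, 'LA', 'T4'), (2020, 1, 'NY', 'T9');
        INSERT INTO tools VALUES
            ('T1', 'Hammer'), ('T2', 'Drill'), ('T3', 'Saw'), ('T4', 'Wrench');
        """
    )
    return relevant_tools(conn, 2020, 1, "NY")
```

The duplicate T1 job collapses to one dropdown entry, and T9 is kept with an empty description. With an inner JOIN the T9 entry would disappear, so the choice between the two depends on whether every tool code in `jobs` is guaranteed to exist in `tools`.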

Why is inserting into and joining #temp tables faster?

I have a query that looks like
SELECT
P.Column1,
P.Column2,
P.Column3,
...
(
SELECT
A.ColumnX,
A.ColumnY,
...
FROM
dbo.TableReturningFunc1(@StaticParam1, @StaticParam2) AS A
WHERE
A.Key = P.Key
FOR XML AUTO, TYPE
),
(
SELECT
B.ColumnX,
B.ColumnY,
...
FROM
dbo.TableReturningFunc2(@StaticParam1, @StaticParam2) AS B
WHERE
B.Key = P.Key
FOR XML AUTO, TYPE
)
FROM
(
<joined tables here>
) AS P
FOR XML AUTO,ROOT('ROOT')
P has ~ 5000 rows
A and B ~ 4000 rows each
This query has a runtime performance of ~10+ minutes.
Changing it to this however:
SELECT
P.Column1,
P.Column2,
P.Column3,
...
INTO #P
SELECT
A.ColumnX,
A.ColumnY,
...
INTO #A
FROM
dbo.TableReturningFunc1(@StaticParam1, @StaticParam2) AS A
SELECT
B.ColumnX,
B.ColumnY,
...
INTO #B
FROM
dbo.TableReturningFunc2(@StaticParam1, @StaticParam2) AS B
SELECT
P.Column1,
P.Column2,
P.Column3,
...
(
SELECT
A.ColumnX,
A.ColumnY,
...
FROM
#A AS A
WHERE
A.Key = P.Key
FOR XML AUTO, TYPE
),
(
SELECT
B.ColumnX,
B.ColumnY,
...
FROM
#B AS B
WHERE
B.Key = P.Key
FOR XML AUTO, TYPE
)
FROM #P AS P
FOR XML AUTO,ROOT('ROOT')
Has a performance of ~4 seconds.
This makes not a lot of sense, as it would seem the cost to insert into a temp table and then do the join should be higher by default. My inclination is that SQL is doing the wrong type of "join" with the subquery, but maybe I've missed it, there's no way to specify the join type to use with correlated subqueries.
Is there a way to achieve this without using #temp tables/@table variables via indexes and/or hints?
EDIT: Note that dbo.TableReturningFunc1 and dbo.TableReturningFunc2 are inline TVF's, not multi-statement, or they are "parameterized" view statements.
Your procedures are being reevaluated for each row in P.
What you do with the temp tables is in fact caching the resultset generated by the stored procedures, thus removing the need to reevaluate.
Inserting into a temp table is fast because it does not generate redo / rollback.
Joins are also fast, since having a stable resultset makes it possible to create a temporary index with an Eager Spool or a Worktable.
You can reuse the procedures without temp tables by using CTEs, but for this to be efficient, SQL Server needs to materialize the results of the CTE.
You may try to force it to do this by using an ORDER BY inside a subquery:
WITH f1 AS
(
SELECT TOP 1000000000
A.ColumnX,
A.ColumnY
FROM dbo.TableReturningFunc1(@StaticParam1, @StaticParam2) AS A
ORDER BY
A.Key
),
f2 AS
(
SELECT TOP 1000000000
B.ColumnX,
B.ColumnY
FROM dbo.TableReturningFunc2(@StaticParam1, @StaticParam2) AS B
ORDER BY
B.Key
)
SELECT …
, which may result in Eager Spool generated by the optimizer.
However, this is far from being guaranteed.
The guaranteed way is to add an OPTION (USE PLAN) to your query and wrap the corresponding CTE into the Spool clause.
See this entry in my blog on how to do that:
Generating XML in subqueries
This is hard to maintain, since you will need to rewrite your plan each time you rewrite the query, but this works well and is quite efficient.
Using the temp tables will be much easier, though.
This answer needs to be read together with Quassnoi's article
http://explainextended.com/2009/05/28/generating-xml-in-subqueries/
With judicious application of CROSS APPLY, you can force the caching or shortcut evaluation of inline TVFs. This query returns instantaneously.
SELECT *
FROM (
SELECT (
SELECT f.num
FOR XML PATH('fo'), ELEMENTS ABSENT
) AS x
FROM [20090528_tvf].t_integer i
cross apply (
select num
from [20090528_tvf].fn_num(9990) f
where f.num = i.num
) f
) q
--WHERE x IS NOT NULL -- covered by using CROSS apply
FOR XML AUTO
You haven't provided real structures so it's hard to construct something meaningful, but the technique should apply as well.
If you change the multi-statement TVF in Quassnoi's article to an inline TVF, the plan becomes even faster (at least one order of magnitude) and the plan magically reduces to something I cannot understand (it's too simple!).
CREATE FUNCTION [20090528_tvf].fn_num(@maxval INT)
RETURNS TABLE
AS RETURN
SELECT num + @maxval num
FROM t_integer
Statistics
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
(10 row(s) affected)
Table 't_integer'. Scan count 2, logical reads 22, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 2 ms.
It is a problem with your sub-query referencing your outer query, meaning the sub query has to be compiled and executed for each row in the outer query.
Rather than using explicit temp tables, you can use a derived table.
To simplify your example:
SELECT P.Column1,
(SELECT [your XML transformation etc] FROM A where A.ID = P.ID) AS A
If P contains 10,000 records then SELECT A.ColumnX FROM A where A.ID = P.ID will be executed 10,000 times.
You can instead use a derived table as thus:
SELECT P.Column1, A2.Column FROM
P LEFT JOIN
(SELECT A.ID, [your XML transformation etc] FROM A) AS A2
ON P.ID = A2.ID
Okay, not that illustrative pseudo-code, but the basic idea is the same as the temp table, except that SQL Server does the whole thing in memory: It first selects all the data in "A2" and constructs a temp table in memory, then joins on it. This saves you having to select it to TEMP yourself.
Just to give you an example of the principle in another context where it may make more immediate sense. Consider employee and absence information where you want to show the number of days absence recorded for each employee.
Bad: (runs as many queries as there are employees in the DB)
SELECT EmpName,
(SELECT SUM(absdays) FROM Absence where Absence.PerID = Employee.PerID) AS Abstotal
FROM Employee
Good: (Runs only two queries)
SELECT EmpName, AbsSummary.Abstotal
FROM Employee LEFT JOIN
(SELECT PerID, SUM(absdays) As Abstotal
FROM Absence GROUP BY PerID) AS AbsSummary
ON AbsSummary.PerID = Employee.PerID
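The two forms can be compared directly on a toy dataset. The following is a minimal sketch, assuming Python's `sqlite3` module and invented sample rows. It only checks that the correlated subquery and the derived-table join return the same rows; it does not measure or claim anything about how SQL Server actually executes either form.

```python
import sqlite3

CORRELATED = """
    SELECT EmpName,
           (SELECT SUM(absdays) FROM Absence
            WHERE Absence.PerID = Employee.PerID) AS Abstotal
    FROM Employee
    ORDER BY EmpName
"""

DERIVED = """
    SELECT EmpName, AbsSummary.Abstotal
    FROM Employee
    LEFT JOIN (SELECT PerID, SUM(absdays) AS Abstotal
               FROM Absence GROUP BY PerID) AS AbsSummary
        ON AbsSummary.PerID = Employee.PerID
    ORDER BY EmpName
"""


def demo_absence():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Employee (PerID INTEGER PRIMARY KEY, EmpName TEXT);
        CREATE TABLE Absence (PerID INTEGER, absdays INTEGER);
        INSERT INTO Employee VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
        INSERT INTO Absence VALUES (1, 2), (1, 3), (2, 1);
        """
    )
    return conn.execute(CORRELATED).fetchall(), conn.execute(DERIVED).fetchall()
```

Both queries give Ann 5 days, Bob 1 day, and NULL for Cy, who has no absence rows. The LEFT JOIN is what keeps Cy in the derived-table version.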
There are several possible reasons why using intermediate temp tables might speed up a query, but the most likely in your case is that the functions being called (but not listed) are probably multi-statement TVFs and not in-line TVFs. Multi-statement TVFs are opaque to the optimization of their calling queries, so the optimizer cannot tell whether there are any opportunities for re-use of data, or for other logical/physical operator re-ordering optimizations. Thus, all it can do is re-execute the TVFs every time the containing query is supposed to produce another row with the XML columns.
In short, multi-statement TVF's frustrate the optimizer.
The usual solutions, in order of (typical) preference are:
1. Re-write the offending multi-statement TVF to be an in-line TVF,
2. In-line the function code into the calling query, or
3. Dump the offending TVF's data into a temp table, which is what you've done...
Consider using the WITH common_table_expression construct for what you now have as sub-selects or temporary tables, see http://msdn.microsoft.com/en-us/library/ms175972(SQL.90).aspx .
This makes not a lot of sense, as it
would seem the cost to insert into a
temp table and then do the join should
be higher by default.
With temporary tables, you explicitly instruct SQL Server which intermediate storage to use. But if you stash everything in one big query, SQL Server will decide for itself. The difference is not really that big; at the end of the day, temporary storage is used, whether you specify it as a temp table or not.
In your case, temporary tables work faster, so why not stick to them?
I agree that temp tables are a good concept. However, when the row count grows, for example to 40 million rows, and I want to update multiple columns of a table by joining it with other tables, I would always prefer a common table expression that computes the updated columns in a SELECT statement using CASE expressions, so that the SELECT result set already contains the updated rows. Inserting 40 million records into a temp table with such a SELECT took 21 minutes for me, and creating an index on it took another 10 minutes, so insert plus index creation took about 30 minutes. I then applied the update by joining the temp table's result set with the main table, which took 5 minutes to update 10 million of the 40 million records. The overall update time for those 10 million records was therefore almost 35 minutes, versus 5 minutes with the common table expression. My choice in that case is the common table expression.
If temp tables turn out to be faster in your particular instance, you should instead use a table variable.
There is a good article here on the differences and performance implications:
http://www.codeproject.com/KB/database/SQP_performance.aspx