Joining Onto CTE Performance - sql

I have a stored procedure where I use a Common Table Expression to build a hierarchical path up a menu (so it can display something like Parent Menu -> Sub Menu -> Sub Sub Menu -> ...)
It works well for that purpose; the issue comes when I combine the output of the recursive CTE with the data I actually want. I do an inner join from my data to the CTE to get the hierarchical path. For something that returns ~300 rows, the stored procedure takes 15-20 seconds on average.
When I insert the results from the CTE into a Temp Table and do the join based on that, the procedure takes less than a second.
I was just wondering why it takes so long to join using only the CTE, or if I am misusing CTEs in some way.
Edit: this is essentially the stored procedure:
With Hierarchical_Path (Menu_ID, Parent_ID, Path)
As
(
    Select
        EM.Menu_ID,
        EM.Parent_ID,
        Convert(varchar(max), EM.Description) as Path
    From
        Menu EM
    Where
        --EM.Topic_No is null
        EM.Parent_ID = 0 and EM.Disabled = 0
    Union All
    Select
        EM.Menu_ID,
        EM.Parent_ID,
        Convert(varchar(max), E.Path + ' -> ' + EM.Description) as Path
    From
        Menu EM
    Inner Join
        Hierarchical_Path E
    On
        EM.Parent_ID = E.Menu_ID
)
SELECT distinct
    EM.Description
    ,EMS.Path
FROM
    dbo.Menu em
INNER JOIN
    Hierarchical_Path EMS
ON
    EMS.Menu_ID = em.Menu_Id
-- 2 more INNER JOINs
-- 2 Left Joins
-- WHERE Clause
When I run the query like this (joining onto the CTE) the performance is around 20 seconds.
When I insert the CTE results into a temp table, and join onto that, the performance is instantaneous.
Taking apart my query a bit more, it seems to get hung up on the WHERE clause. I guess my question is really: when exactly does a CTE run, and is its result stored in memory? I was running under the assumption that it gets evaluated once and then sticks around in memory, but under some circumstances could it be evaluated multiple times?

The difference is that a CTE is not persisted and a temporary table is (at least for the session). A CTE is a named query expression that SQL Server expands into the outer query, so it is not cached and may be evaluated more than once, depending on the plan. Joining on a non-persisted column means SQL Server has no statistics on the data at all, compared to the same column in a temporary table, which is already evaluated. Basically, the temp table caches what you would use and SQL Server can better optimize for it. The same issues are run into when joining on the result of a function or a table variable.
My guess is that your CTE execution plan is doing the execution with a single thread while your temp table can use multiple threads. You can check this by including actual execution plan when you run the queries and looking for two horizontal arrows pointing in opposite directions on each operator. That indicates parallelism.
P.S. - Try running "set statistics io on" and "set statistics time on" to see whether the actual cost of running the queries is the same regardless of run duration.
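The temp table approach described in the question might look roughly like the sketch below. Table and column names are taken from the question; the temp table name #Hierarchical_Path and the index name are assumptions made for illustration, and the elided joins and WHERE clause are left out.

```sql
-- Sketch only: #Hierarchical_Path and IX_Hierarchical_Path_Menu_ID are
-- hypothetical names; Menu and its columns come from the question.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

IF OBJECT_ID('tempdb..#Hierarchical_Path') IS NOT NULL
    DROP TABLE #Hierarchical_Path;

-- Evaluate the recursive CTE exactly once and persist its result.
WITH Hierarchical_Path (Menu_ID, Parent_ID, Path) AS
(
    SELECT EM.Menu_ID,
           EM.Parent_ID,
           CONVERT(varchar(max), EM.Description) AS Path
    FROM dbo.Menu AS EM
    WHERE EM.Parent_ID = 0
      AND EM.Disabled = 0

    UNION ALL

    SELECT EM.Menu_ID,
           EM.Parent_ID,
           CONVERT(varchar(max), E.Path + ' -> ' + EM.Description) AS Path
    FROM dbo.Menu AS EM
    INNER JOIN Hierarchical_Path AS E
        ON EM.Parent_ID = E.Menu_ID
)
SELECT Menu_ID, Parent_ID, Path
INTO #Hierarchical_Path
FROM Hierarchical_Path;

-- Give the optimizer an index (and statistics) on the join column.
CREATE CLUSTERED INDEX IX_Hierarchical_Path_Menu_ID
    ON #Hierarchical_Path (Menu_ID);

-- Join against the persisted result instead of the CTE.
SELECT DISTINCT EM.Description, EMS.Path
FROM dbo.Menu AS EM
INNER JOIN #Hierarchical_Path AS EMS
    ON EMS.Menu_ID = EM.Menu_ID;
```

Comparing the STATISTICS IO and TIME output of this version with the CTE-only version should show whether the CTE is being re-evaluated.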

Related

Why changing where statement to a variable cause query to be 4 times slower

I am inserting data from the table "Tags" in the "Recovery" database into another table "Tags" in the "R3" database.
Both databases live on the same SQL Server instance on my laptop.
I have built the insert query and, because the Recovery..Tags table has around 180M records, I decided to break it into smaller subsets (1 million records at a time).
Here is my query (let's call it Query A):
insert into R3..Tags (iID,DT,RepID,Tag,xmiID,iBegin,iEnd,Confidence,Polarity,Uncertainty,Conditional,Generic,HistoryOf,CodingScheme,Code,CUI,TUI,PreferredText,ValueBegin,ValueEnd,Value,Deleted,sKey,RepType)
SELECT T.iID,T.DT,T.RepID,T.Tag,T.xmiID,T.iBegin,T.iEnd,T.Confidence,T.Polarity,T.Uncertainty,T.Conditional,T.Generic,T.HistoryOf,T.CodingScheme,T.Code,T.CUI,T.TUI,T.PreferredText,T.ValueBegin,T.ValueEnd,T.Value,T.Deleted,T.sKey,R.RepType
FROM Recovery..tags T inner join Recovery..Reps R on T.RepID = R.RepID
where T.iID between 13000001 and 14000000
It takes around 2 minutes.
That is OK.
To make things a bit easier for me,
I put the iID in the WHERE clause in a variable,
so my query looks like this (let's call it Query B):
declare @i int = 12
insert into R3..Tags (iID,DT,RepID,Tag,xmiID,iBegin,iEnd,Confidence,Polarity,Uncertainty,Conditional,Generic,HistoryOf,CodingScheme,Code,CUI,TUI,PreferredText,ValueBegin,ValueEnd,Value,Deleted,sKey,RepType)
SELECT T.iID,T.DT,T.RepID,T.Tag,T.xmiID,T.iBegin,T.iEnd,T.Confidence,T.Polarity,T.Uncertainty,T.Conditional,T.Generic,T.HistoryOf,T.CodingScheme,T.Code,T.CUI,T.TUI,T.PreferredText,T.ValueBegin,T.ValueEnd,T.Value,T.Deleted,T.sKey,R.RepType
FROM Recovery..tags T inner join Recovery..Reps R on T.RepID = R.RepID
where T.iID between (1000000 * @i) + 1 and (@i+1)*1000000
but that causes the insert to become very slow (around 10 minutes).
So I tried Query A again and it took around 2 minutes.
I tried Query B again and it took around 8 minutes!
I am attaching the execution plan for each one (at a site that shows an analysis of the query plan) - Query A Plan and Query B Plan
Any idea why this is happening,
and how to fix it?
The big difference in time is due to the very different plans that are being created to join Tags and Reps.
Fundamentally, in version A, it knows how much data is being extracted (a million rows) and it can design an efficient query for that. However, because you are using variables in B to define how much data is being imported, it has to define a more generic query - one that would work for 10 rows, a million rows, or a hundred million rows.
In the plans, here are the relevant sections of the query joining Tags and Reps...
... in A
... and B
Note that in A it takes just over a minute to do the join; in B it takes 6 and a half minutes.
The key thing that appears to take the time is that it does a table scan of the Tags table which takes 5:44 to complete. The plan has this as a table scan, as the next time you run the query you may want many more than 1 million rows.
A secondary issue is that the amount of data it reads (or expects to read) from Reps is also way out of whack. In A it expected to read 2 million rows and read 1421; in B it basically read them all (even though technically it probably only needed the same 1421).
I think you have two main approaches to fix this:
1. Look at indexing, to remove the table scan on Tags. Ensure the indexes match what is needed and allow the query to do a scan on that index (it appears that the index at the top of @MikePetri's answer is what you need, or similar). This way, instead of doing a table scan, it can do an index scan which can start 'in the middle' of the data set (a table scan must start at either the start or end of the data set).
2. Separate this into two processes. The first process gets the relevant million rows from Tags and saves them in a temporary table. The second process uses the data in the temporary table to join to Reps (also try using option (recompile) in the second query, so that it checks the temporary table's size before creating the plan).
   You can even put an index or two (and/or a primary key) on that temporary table to make it better for the next step.
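The second approach (stage the batch in a temporary table, then join) could be sketched as follows. The temp table name #TagsBatch, the index name, the variables @FromID and @ToID, and the assumption that iID is an int are all illustrative and not from the question; the OPTION (RECOMPILE) on the first statement is also an addition, since the answer only suggests it for the second one.

```sql
-- Sketch only: #TagsBatch, IX_TagsBatch_RepID, @FromID and @ToID are
-- hypothetical names; iID is assumed to be an int.
DECLARE @i int = 12;
DECLARE @FromID int = (1000000 * @i) + 1,
        @ToID   int = (@i + 1) * 1000000;

IF OBJECT_ID('tempdb..#TagsBatch') IS NOT NULL
    DROP TABLE #TagsBatch;

-- Step 1: copy only this batch of rows into a temporary table.
-- RECOMPILE lets the optimizer see the actual variable values.
SELECT T.*
INTO #TagsBatch
FROM Recovery..Tags AS T
WHERE T.iID BETWEEN @FromID AND @ToID
OPTION (RECOMPILE);

-- Optional: index the join column for the next step.
CREATE CLUSTERED INDEX IX_TagsBatch_RepID ON #TagsBatch (RepID);

-- Step 2: join the staged rows to Reps and insert.
-- RECOMPILE makes the plan reflect the temp table's real size.
INSERT INTO R3..Tags (iID,DT,RepID,Tag,xmiID,iBegin,iEnd,Confidence,Polarity,Uncertainty,Conditional,Generic,HistoryOf,CodingScheme,Code,CUI,TUI,PreferredText,ValueBegin,ValueEnd,Value,Deleted,sKey,RepType)
SELECT B.iID,B.DT,B.RepID,B.Tag,B.xmiID,B.iBegin,B.iEnd,B.Confidence,B.Polarity,B.Uncertainty,B.Conditional,B.Generic,B.HistoryOf,B.CodingScheme,B.Code,B.CUI,B.TUI,B.PreferredText,B.ValueBegin,B.ValueEnd,B.Value,B.Deleted,B.sKey,R.RepType
FROM #TagsBatch AS B
INNER JOIN Recovery..Reps AS R ON B.RepID = R.RepID
OPTION (RECOMPILE);
```

Whether this beats a well-indexed single statement depends on the data, so it is worth comparing the actual execution plans of both.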
The reason the first query is so much faster is it went parallel. This means the cardinality estimator knew enough about the data it had to handle, and the query was large enough to tip the threshold for parallel execution. Then, the engine passed chunks of data for different processors to handle individually, then report back and repartition the streams.
With the value as a variable, it effectively becomes a scalar function evaluation, and a query cannot go parallel with a scalar function, because the value has to be determined before the cardinality estimator can figure out what to do with it. Therefore, it runs in a single thread, and is slower.
Some sort of looping mechanism might help. Create the included indexes to assist the engine in handling this request. You can probably find a better looping mechanism, since you are familiar with the identity ranges you care about, but this should get you in the right direction. Adjust for your needs.
With a loop like this, it commits the changes with each loop, so you aren't locking the table indefinitely.
USE Recovery;
GO
CREATE INDEX NCI_iID
ON Tags (iID)
INCLUDE (
DT
,RepID
,tag
,xmiID
,iBegin
,iEnd
,Confidence
,Polarity
,Uncertainty
,Conditional
,Generic
,HistoryOf
,CodingScheme
,Code
,CUI
,TUI
,PreferredText
,ValueBegin
,ValueEnd
,value
,Deleted
,sKey
);
GO
CREATE INDEX NCI_RepID ON Reps (RepID) INCLUDE (RepType);
USE R3;
GO
CREATE INDEX NCI_iID ON Tags (iID);
GO
DECLARE @RowsToProcess BIGINT
       ,@StepIncrement INT = 1000000;
SELECT @RowsToProcess = (
    SELECT COUNT(1)
    FROM Recovery..tags AS T
    WHERE NOT EXISTS (
        SELECT 1
        FROM R3..Tags AS rt
        WHERE T.iID = rt.iID
    )
);
WHILE @RowsToProcess > 0
BEGIN
    INSERT INTO R3..Tags
    (
        iID
        ,DT
        ,RepID
        ,Tag
        ,xmiID
        ,iBegin
        ,iEnd
        ,Confidence
        ,Polarity
        ,Uncertainty
        ,Conditional
        ,Generic
        ,HistoryOf
        ,CodingScheme
        ,Code
        ,CUI
        ,TUI
        ,PreferredText
        ,ValueBegin
        ,ValueEnd
        ,Value
        ,Deleted
        ,sKey
        ,RepType
    )
    SELECT TOP (@StepIncrement)
        T.iID
        ,T.DT
        ,T.RepID
        ,T.Tag
        ,T.xmiID
        ,T.iBegin
        ,T.iEnd
        ,T.Confidence
        ,T.Polarity
        ,T.Uncertainty
        ,T.Conditional
        ,T.Generic
        ,T.HistoryOf
        ,T.CodingScheme
        ,T.Code
        ,T.CUI
        ,T.TUI
        ,T.PreferredText
        ,T.ValueBegin
        ,T.ValueEnd
        ,T.Value
        ,T.Deleted
        ,T.sKey
        ,R.RepType
    FROM Recovery..tags AS T
    INNER JOIN Recovery..Reps AS R ON T.RepID = R.RepID
    WHERE NOT EXISTS (
        SELECT 1
        FROM R3..Tags AS rt
        WHERE T.iID = rt.iID
    )
    ORDER BY
        T.iID;
    SET @RowsToProcess = @RowsToProcess - @StepIncrement;
END;

Why do nested select statements take longer to process than temporary tables?

Forgive me if this is a repeat and/or obvious question, but I can't find a satisfactory answer either on stackoverflow or elsewhere online.
Using Microsoft SQL Server, I have a nested select query that looks like this:
select *
into FinalTable
from
(select * from RawTable1 join RawTable2)
join
(select * from RawTable3 join RawTable4)
Instead of using nested selects, the query can be written using temporary tables, like this:
select *
into Temp1
from RawTable1 join RawTable2
select *
into Temp2
from RawTable3 join RawTable4
select *
into FinalTable
from Temp1 join Temp2
Although equivalent, the second (non-nested) query runs several orders of magnitude faster than the first (nested) query. This is true both on my development server and a client's server. Why?
The database engine holds subqueries in memory at execution time. Since they are virtual and not physical, the optimiser can't select the best route, or at least not until a sort in the plan. This also means the optimiser may do multiple full table scans on each operation rather than a possible index seek on a temporary table.
Consider each subquery to be a juggling ball. The more subqueries you give the db engine, the more things it's juggling at one time. If you simplify this in batches of code with a temp table, the optimiser finds a clear route, in most cases regardless of indexes too, at least for more recent versions of SQL Server.

Subquery v/s inner join in sql server

I have following queries
First one using inner join
SELECT item_ID,item_Code,item_Name
FROM [Pharmacy].[tblitemHdr] I
INNER JOIN EMR.tblFavourites F ON I.item_ID=F.itemID
WHERE F.doctorID = @doctorId AND F.favType = 'I'
second one using sub query like
SELECT item_ID,item_Code,item_Name from [Pharmacy].[tblitemHdr]
WHERE item_ID IN
(SELECT itemID FROM EMR.tblFavourites
WHERE doctorID = @doctorId AND favType = 'I'
)
The item table [Pharmacy].[tblitemHdr] contains 15 columns and 2000 records, and EMR.tblFavourites contains 5 columns and around 100 records. In this scenario, which query gives me better performance?
Usually joins will work faster than inner queries, but in reality it will depend on the execution plan generated by SQL Server. No matter how you write your query, SQL Server will always transform it into an execution plan. If it is "smart" enough to generate the same plan from both queries, you will get the same result.
In SQL Server Management Studio you can enable "Client Statistics" and also "Include Actual Execution Plan". This lets you see precisely the execution time and load of each request.
Also, between requests, clear the buffer cache to avoid caching side effects on performance:
USE <YOURDATABASENAME>;
GO
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO
I think it's always better to see with our own eyes than to rely on theory!
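A side-by-side measurement of the two queries from the question could be sketched like this. The type and value of @doctorId are assumptions for illustration; run it on a test server only, since clearing buffers affects every session.

```sql
-- Sketch only: @doctorId's type (int) and value (1) are assumed.
DECLARE @doctorId int = 1;

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Query 1: inner join
SELECT I.item_ID, I.item_Code, I.item_Name
FROM [Pharmacy].[tblitemHdr] AS I
INNER JOIN EMR.tblFavourites AS F ON I.item_ID = F.itemID
WHERE F.doctorID = @doctorId AND F.favType = 'I';

-- Query 2: IN subquery
SELECT I.item_ID, I.item_Code, I.item_Name
FROM [Pharmacy].[tblitemHdr] AS I
WHERE I.item_ID IN
      (SELECT F.itemID
       FROM EMR.tblFavourites AS F
       WHERE F.doctorID = @doctorId AND F.favType = 'I');

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The Messages tab then shows logical reads and CPU/elapsed time per statement. Note that the two forms only return the same rows if tblFavourites holds at most one matching row per item; with duplicates, the join repeats items and the IN version does not.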
Sub-query Vs Join
Table one 20 rows,2 cols
Table two 20 rows,2 cols
sub-query 20*20
join 20*2
logical, rectify
Detailed
The scan count indicates a multiplication effect, as the system has to go through the data again and again to fetch it. For your performance measure, just look at the time.
A join is faster than a subquery.
A subquery makes for busy disk access. Think of the hard disk's read-write head going back and forth as it accesses: User, SearchExpression, PageSize, DrilldownPageSize, User, SearchExpression, PageSize, DrilldownPageSize, User... and so on.
A join works by concentrating the operation on the result of the first two tables; any subsequent joins concentrate on the in-memory (or cached to disk) result of the first joined tables, and so on. Less read-write head movement, thus faster.
Source: Here
The first query is better than the second, because in the first query we are joining both tables.
Also check the execution plan for both queries.

SQL "WITH" Performance and Temp Table (possible "Query Hint" to simplify)

Given the example queries below (Simplified examples only)
DECLARE @DT int; SET @DT=20110717; -- yes this is an INT
WITH LargeData AS (
    SELECT * -- This is a MASSIVE table indexed on dt field
    FROM mydata
    WHERE dt=@DT
), Ordered AS (
    SELECT TOP 10 *
    , ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
    FROM LargeData
)
SELECT * FROM Ordered
and ...
DECLARE @DT int; SET @DT=20110717;
BEGIN TRY DROP TABLE #LargeData END TRY BEGIN CATCH END CATCH; -- dump any possible table.
SELECT * -- This is a MASSIVE table indexed on dt field
INTO #LargeData -- put smaller results into temp
FROM mydata
WHERE dt=@DT;
WITH Ordered AS (
    SELECT TOP 10 *
    , ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
    FROM #LargeData
)
SELECT * FROM Ordered
Both produce the same results, which is a limited and ranked list of values from a list based on a field's data.
When these queries get considerably more complicated (many more tables, lots of criteria, multiple levels of "with" table aliases, etc...) the bottom query executes MUCH faster than the top one. Sometimes in the order of 20x-100x faster.
The Question is...
Is there some kind of query HINT or other SQL option that would tell SQL Server to perform the same kind of optimization automatically, or another format that would involve a cleaner approach (trying to keep the format as much like query 1 as possible)?
Note that the "Ranking" or secondary queries is just fluff for this example, the actual operations performed really don't matter too much.
This is sort of what I was hoping for (or similar but the idea is clear I hope). Remember this query below does not actually work.
DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
    SELECT * -- This is a MASSIVE table indexed on dt field
    FROM mydata
    WHERE dt=@DT
    OPTION (USE_TEMP_OR_HARDENED_OR_SOMETHING) -- EXAMPLE ONLY
), Ordered AS (
    SELECT TOP 10 *
    , ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
    FROM LargeData
)
SELECT * FROM Ordered
EDIT: Important follow up information!
If in your sub query you add
TOP 999999999 -- improves speed dramatically
Your query will behave in a similar fashion to using a temp table in a previous query. I found the execution times improved in almost the exact same fashion, WHICH IS FAR SIMPLER than using a temp table and is basically what I was looking for.
However
TOP 100 PERCENT -- does NOT improve speed
does NOT perform in the same fashion (you must use the static number style, TOP 999999999).
Explanation:
From what I can tell from the actual execution plan of the query in both formats (the original one with normal CTEs and one with each sub query having TOP 999999999):
The normal query joins everything together as if all the tables are in one massive query, which is what is expected. The filtering criteria are applied almost at the join points in the plan, which means many more rows are being evaluated and joined together all at once.
In the version with TOP 999999999, the actual execution plan clearly separates the sub queries from the main query in order to apply the TOP statement's action, thus forcing creation of an in-memory "Bitmap" of the sub query that is then joined to the main query. This appears to do exactly what I wanted, and in fact it may even be more efficient, since servers with large amounts of RAM will be able to do the query execution entirely in MEMORY without any disk IO. In my case we have 280 GB of RAM, so well more than could ever really be used.
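Applied to the simplified example, the TOP 999999999 variant from the follow-up might look like the sketch below. The names mydata, dt and valuefield come from the question; the ORDER BY inside Ordered is an addition so that TOP 10 picks a deterministic set. The effect on the plan is what the asker observed, not documented behaviour, so it may differ by version and data.

```sql
-- Sketch only: this relies on behaviour observed by the asker,
-- not on a documented optimizer guarantee.
DECLARE @DT int; SET @DT = 20110717;

WITH LargeData AS (
    SELECT TOP 999999999 *      -- constant TOP, as in the follow-up
    FROM mydata
    WHERE dt = @DT
), Ordered AS (
    SELECT TOP 10 *
         , ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
    FROM LargeData
    ORDER BY valuefield DESC    -- added so TOP 10 is deterministic
)
SELECT * FROM Ordered;
```

Checking the actual execution plan before and after is the only reliable way to tell whether the sub query is really being evaluated separately.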
Not only can you use indexes on temp tables, but they also allow the use of statistics and hints. I can find no reference in the documentation on CTEs to being able to use statistics, and it says specifically that you can't use hints.
Temp tables are often the most performant way to go for a large data set when the choice is between temp tables and table variables, even when you don't use indexes (possibly because statistics are used to develop the plan), and I suspect the implementation of the CTE is more like the table variable than the temp table.
I think the best thing to do, though, is to see how the execution plans differ to determine whether it is something that can be fixed.
What exactly is your objection to using the temp table when you know it performs better?
The problem is that in the first query SQL Server query optimizer is able to generate a query plan. In the second query a good query plan isn't able to be generated because you're inserting the values into a new temporary table. My guess is there is a full table scan going on somewhere that you're not seeing.
What you may want to do in the second query is insert the values into the #LargeData temporary table like you already do and then create a non-clustered index on the "valuefield" column. This might help to improve your performance.
It is quite possible that SQL is optimizing for the wrong value of the parameters.
There are a couple of options
Try using option(RECOMPILE). There is a cost to this as it recompiles the query every time but if different plans are needed it might be worth it.
You could also try using OPTION (OPTIMIZE FOR (@DT = SomeRepresentativeValue)). The risk with this is that you pick the wrong value.
See I Smell a Parameter! from The SQL Server Query Optimization Team blog
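The two hints above could be applied to the simplified query roughly as follows. The value 20110717 used in OPTIMIZE FOR is only a stand-in for whatever value is representative of your data.

```sql
-- Sketch only: 20110717 is a placeholder for a representative value.
DECLARE @DT int; SET @DT = 20110717;

-- Option 1: compile a fresh plan on every execution, using the
-- current value of @DT.
WITH LargeData AS (
    SELECT *
    FROM mydata
    WHERE dt = @DT
), Ordered AS (
    SELECT TOP 10 *
         , ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
    FROM LargeData
)
SELECT * FROM Ordered
OPTION (RECOMPILE);

-- Option 2: always build the plan as if @DT had the given value.
WITH LargeData AS (
    SELECT *
    FROM mydata
    WHERE dt = @DT
), Ordered AS (
    SELECT TOP 10 *
         , ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
    FROM LargeData
)
SELECT * FROM Ordered
OPTION (OPTIMIZE FOR (@DT = 20110717));
```

The OPTION clause belongs to the whole statement, which is why it appears after the final SELECT rather than inside a CTE.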

VB6 SQL 2005 Database Index Question

I have a VB app that accesses a SQL database. I think it’s running slow, and I thought maybe I didn’t have the tables properly indexed. I was wondering how you would create the indexes? Here’s the situation.
My main loop is
Select * from Docrec
Order by YearFiled,DocNumb
Inside this loop I have two other database hits.
Select * from Names
Where YearFiled = DocRec.YearFiled
and Volume = DocRec.Volume and Page = DocRec.Page
Order by SeqNumb
Select * from MapRec
Where FiledYear = DocRec.YearFiled
and Volume = DocRec.Volume and Page = DocRec.Page
Order by SeqNumb
Hopefully I made sense.
Try in one query using INNER JOIN:
SELECT * FROM DocRec d
INNER JOIN Names n ON d.YearFiled = n.YearFiled AND d.Volume = n.Volume AND d.Page = n.Page
INNER JOIN MapRec m ON m.FiledYear = n.YearFiled AND m.Volume = n.Volume AND m.Page = n.Page
ORDER BY d.YearFiled, d.DocNumb
You will have only one query to the database. The problem may be that you hit the database many times and get only one (or a few) row(s) each time.
Off the top, one thing that would help would be determining if you really need all columns.
If you don't, instead of SELECT *, select just the columns you need - that way you're not pulling as much data.
If you do, then from SQL Server Management Studio (or whatever you use to manage the SQL Server) you'll need to look at what is indexed and what isn't. The columns you tend to search on the most would be your first candidates for an index.
Addendum
Now that I've seen your edit, it may help to look at why you're doing the queries the way you are, and see if there isn't a way to consolidate it down to one query. Without more context I'd just be guessing at more optimal queries.
In general, looping through records is a poor idea. Can you not do a set-based query that gives you everything you need in one pass?
As far as indexing goes, consider any fields that you use in ORDER BY or WHERE clauses and any fields that are in joins. Primary keys are indexed as part of the setup of a primary key, but foreign keys are not. Often people forget that they need to index them as well.
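For the queries in the question, indexes along those lines might be sketched as below. The index names are hypothetical, and the column choice simply follows the WHERE and ORDER BY clauses shown; existing keys or indexes on these tables may already cover some of this.

```sql
-- Sketch only: index names are made up for illustration.
-- Lookup columns first, then the ORDER BY column.
CREATE NONCLUSTERED INDEX IX_Names_YearFiled_Volume_Page
    ON Names (YearFiled, Volume, Page, SeqNumb);

CREATE NONCLUSTERED INDEX IX_MapRec_FiledYear_Volume_Page
    ON MapRec (FiledYear, Volume, Page, SeqNumb);

-- Supports the outer loop's ORDER BY YearFiled, DocNumb.
CREATE NONCLUSTERED INDEX IX_DocRec_YearFiled_DocNumb
    ON DocRec (YearFiled, DocNumb);
```

Because the queries use SELECT *, these indexes can find the rows quickly but SQL Server still has to look up the remaining columns in the base table.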
Never use select * in a production environment. It is a poor practice. Do not ever return more data than you need.
I don't know if you need the loop. If all you are doing is grabbing the records in maprec that match for docrec and then the same for the second table then you can do this without a loop using inner join syntax.
select columnlist from maprec m inner join docrec d on (m.filedyear = d.yearfiled and m.volume = d.volume and m.page = d.page)
and then again for the second table...
You could also trim up your queries to return only the columns needed instead of returning all if possible. This should help performance.
To create an index by yourself in SQL Server 2005, go to the design of the table and select the Manage Indexes & Keys toolbar item.
You can use the Database Engine Tuning Advisor. You can create a trace (using sql server profiler) of your queries and then the Advisor will tell you and create the indexes needed to optimize for your query executions.
UPDATE SINCE YOUR FIRST COMMENT TO ME:
You can still do this by running the first query then the second and third without a loop as I have shown above. Here's the trick. I am thinking you need to tie the first to the second and third one hence why you did a loop.
It's been a while since I have done VB6 recordsets BUT I do recall the ability to filter the recordset once returned from the DB. So, in this case, you could keep your loop but instead of calling SQL every time in the loop you would simply filter the resulting recordset data based on the first record. You would initialize / load the second & third query before this loop to load the data. Using the syntax above that I gave will load in each of those tables the matching to the parent table (docrec).
With this, you will still only hit the DB three times but still retain the loop you need to have the parent docrec table traversed so you can do work on it AND the child tables when you do have a match.
Here's a few links on ado recordset filtering....
http://www.devguru.com/technologies/ado/QuickRef/recordset_filter.html
http://msdn.microsoft.com/en-us/library/ee275540(BTS.10).aspx
http://www.w3schools.com/ado/prop_rs_filter.asp
With all this said.... I have this strange feeling that perhaps it could be solved with just a left join on your tables?
select * from docrec d
left join maprec m on (d.YearFiled= m.FiledYear and d.Volume = m.Volume and d.Page = m.Page)
left join names n on (d.YearFiled = n.YearFiled and d.Volume = n.Volume and d.Page = n.Page)
this will return all DocRec records AND add all the maprec values and name values where it matches OR NULL if not.
If this fits your need it will only hit the DB once.