Faster querying with temp table creation (SQL SERVER) [duplicate]

I am reiterating the question asked by Mongus Pong, "Why would using a temp table be faster than a nested query?", which doesn't have an answer that works for me.
Most of us at some point find that when a nested query reaches a certain complexity it needs to be broken into temp tables to keep it performant. It is absurd that this could ever be the most practical way forward, and it means these processes can no longer be made into a view. And often 3rd party BI apps will only play nicely with views, so this is crucial.
I am convinced there must be a simple query plan setting to make the engine just spool each subquery in turn, working from the inside out. No second-guessing how it can make the subquery more selective (which it sometimes does very successfully) and no possibility of correlated subqueries. Just the stack of data the programmer intended to be returned by the self-contained code between the brackets.
It is common for me to find that simply changing from a subquery to a #table takes the time from 120 seconds to 5. Essentially the optimiser is making a major mistake somewhere. Sure, there may be very time consuming ways I could coax the optimiser to look at tables in the right order but even this offers no guarantees. I'm not asking for the ideal 2 second execute time here, just the speed that temp tabling offers me within the flexibility of a view.
I've never posted on here before but I have been writing SQL for years and have read the comments of other experienced people who've also just come to accept this problem and now I would just like the appropriate genius to step forward and say the special hint is X...

There are a few possible explanations as to why you see this behavior. Some common ones are:
The subquery or CTE may be being repeatedly re-evaluated.
Materialising partial results into a #temp table may force a more optimum join order for that part of the plan by removing some possible options from the equation.
Materialising partial results into a #temp table may improve the rest of the plan by correcting poor cardinality estimates.
The most reliable method is simply to use a #temp table and materialize it yourself.
Failing that, regarding point 1, see Provide a hint to force intermediate materialization of CTEs or derived tables. The use of TOP (large_number) ... ORDER BY can often encourage the result to be spooled rather than repeatedly re-evaluated.
Even if that works, however, there are no statistics on the spool.
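As a rough sketch of that trick (the table names and join here are invented for illustration), the TOP ... ORDER BY goes inside the derived table you want spooled:
SELECT b.*
FROM (
    SELECT TOP (2147483647) f.Id
    FROM dbo.Foo f                -- hypothetical table
    WHERE f.Deleted = 0
    ORDER BY f.Id                 -- the ORDER BY makes the TOP meaningful and can encourage a spool
) AS f
JOIN dbo.Bar b ON b.FooId = f.Id; -- hypothetical join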
For points 2 and 3 you would need to analyse why you weren't getting the desired plan. Possibly rewriting the query to use sargable predicates, or updating statistics might get a better plan. Failing that you could try using query hints to get the desired plan.

I do not believe there is a query hint that instructs the engine to spool each subquery in turn.
There is the OPTION (FORCE ORDER) query hint, which forces the engine to perform the joins in the order specified and could potentially coax it into achieving that result in some instances. This hint will sometimes result in a more efficient plan for a complex query where the engine keeps insisting on a sub-optimal plan. Of course, the optimizer should usually be trusted to determine the best plan.
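For illustration (the table names are invented), the hint is simply appended to the query:
SELECT s.Name, l.Value
FROM dbo.SmallTable s                      -- hypothetical tables; listed first...
JOIN dbo.LargeTable l ON l.SmallId = s.Id
OPTION (FORCE ORDER);                      -- ...so it is joined first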
Ideally there would be a query hint that would allow you to designate a CTE or subquery as "materialized" or "anonymous temp table", but there is not.

Another option (for future readers of this article) is to use a user-defined function. Multi-statement functions (as described in How to Share Data between Stored Procedures) appear to force SQL Server to materialize the results of your subquery. In addition, they allow you to specify primary keys and indexes on the resulting table to help the query optimizer. This function can then be used in a select statement as part of your view. For example:
CREATE FUNCTION SalesByStore (@storeid varchar(30))
RETURNS @t TABLE (title varchar(80) NOT NULL PRIMARY KEY,
                  qty smallint NOT NULL) AS
BEGIN
    INSERT @t (title, qty)
    SELECT t.title, s.qty
    FROM sales s
    JOIN titles t ON t.title_id = s.title_id
    WHERE s.stor_id = @storeid
    RETURN
END
GO
CREATE VIEW SalesData AS
SELECT * FROM dbo.SalesByStore('6380')

Having run into this problem, I found out that (in my case) SQL Server was evaluating the conditions in the wrong order, because I had an index that could be used (IDX_CreatedOn on TableFoo).
SELECT bar.*
FROM
(SELECT * FROM TableFoo WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
I managed to work around it by forcing the subquery to use another index (i.e. one that would be used when the subquery was executed without the parent query). In my case I switched to PK, which was meaningless for the query, but allowed the conditions from the subquery to be evaluated first.
SELECT bar.*
FROM
(SELECT * FROM TableFoo WITH (INDEX([PK_Id])) WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
Filtering by the Deleted column was really simple and filtering the few results by CreatedOn afterwards was even easier. I was able to figure it out by comparing the Actual Execution Plan of the subquery and the parent query.
A more hacky solution (and not really recommended) is to force the subquery to be executed first by limiting the results using TOP; however, this could lead to weird problems in the future if the results of the subquery exceed the limit (you could always set the limit to something ridiculous). Unfortunately TOP 100 PERCENT can't be used for this purpose, since SQL Server just ignores it.

Related

SQL adding Order By clause causes the query to run significantly faster. Explanation needed

So I have this query:
SELECT *
FROM ViewTechnicianStatus
WHERE NotificationClass = 2
AND MachineID IN (SELECT ID FROM MachinesTable WHERE DepartmentID = 1 AND IsMachineActive <> 0)
--ORDER BY ResponseDate DESC
The view is huge and complex with a lot of joins and subqueries. When I run this query it takes forever to finish, however if I add the ORDER BY it finishes instantly and returns 20 rows as intended. I don't understand how adding the ORDER BY could have such a huge positive impact on the performance. Would love it if somebody could explain the phenomenon to me.
EDIT:
Here is the rundown with the SET STATISTICS TIME, IO ON; flags on. Sorry for the hidden table names but I don't think I can expose those.
Without ORDER BY: [statistics output screenshot omitted]
With ORDER BY: [statistics output screenshot omitted]
To answer your question: the reason your query runs faster when adding ORDER BY is the INDEXING. Probably all the clients you tested had indexes on those specific fields/tables, and using the ORDER BY let the optimizer take advantage of them, improving performance.
Summary
OK.. I've thought about this for a while as I think it's an interesting issue. I believe it's very much an edge case - which is part of what makes it interesting.
I'm taking an educated guess based on the info provided - obviously, without being able to see it/play with it, I cannot be certain. But I think this explanation matches the evidence based on the info you provide and the statistics.
I think the main issue is a poor query plan. In the version without the sort, it uses an inappropriate nested loop; in the version with the sort, it does (say) a hash match or merge join.
I have found that SQL Server often has issues with query plans within complex views that reference other views, and especially if those sub-views have group bys/sorts/etc.
To demonstrate what difference could occur, I'll simplify your complex view into 2 subgroups I'll call 'view groups' (each of which may be one or several views and tables - not a technical term, just a term to summarise them).
The first view group contains most tables,
The second view group contains the views getting data from tables 6 and 7.
For both approaches, how SQL uses the data in the view groups is probably the same (e.g., using the same indexes, etc). However, there's a difference in its approach to how it does the join between the two view groups.
Example - query planner underestimates cost of view group 2 and doesn't care which method is used
I'm guessing
The first view group, at the point of the join, is dealing with about 3000 rows (it hasn't filtered it down yet), and
The query builder thinks view group 2 is easy to run
In the version without the order by, the query plan is designed with a nested loop join. That is, it gets each value in view group 1, and then for each value it runs view group 2 to get the relevant data. This means the view group 2 is run 3000-ish times (once for each value in view group 1).
In the version with the order by, it decides to do (say) a hash match between view group 1 and view group 2. This means it has to only run view group 2 once, but spend a bit more time sorting it. However, because you asked for it to be sorted anyway, it chooses the hash match.
However, because the query planner underestimated the cost of view group 2, it turns out that the hash match is a much, much better query plan for the circumstances.
Example - query planner use of cached plans
I believe (but may be wrong!) that when you reference views within views, it can often just use cached plans for the sub-views rather than trying to get the best plan possible for your current situation.
It may be that one of your views uses the "cached plan" whereas the other one tries to optimise the query plan including the sub-views.
Ironically, it may be that the query version with the order by is more complex, and in this case it uses the cached plans for view group 2. However, as it knows it hasn't optimised the plan for view group 2, it simply gets the data once for view group 2, then keeps all results in memory and uses it in a hash match.
In contrast, in the version without the order by, it takes a shot at optimising the query plan (including optimising how it uses the views), and makes a mess of it.
Possible solutions
These are all possibilities - they may make it better or may make it worse! Note that SQL is a declarative language (you tell the computer what to do/what you want, but not how to do it).
This is not a comprehensive list of possibilities, but here are some things you can try:
Pre-calculate all or part(s) of the views (e.g., put the pre-calculated stuff from tables 6 and 7 into a temporary table, then use the temporary tables in the views)
Simplify the SQL and/or move all the SQL into a single view that doesn't call other views
Use join hints, e.g., instead of INNER JOIN, use INNER HASH JOIN in the appropriate position (see the sketch after this list)
Use OPTION(RECOMPILE)
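As a rough illustration of the last two suggestions, reusing the names from the question (rewriting the IN as a join assumes MachinesTable.ID is unique):
SELECT v.*
FROM ViewTechnicianStatus v
INNER HASH JOIN MachinesTable m      -- join hint: force a hash join here
    ON m.ID = v.MachineID
WHERE v.NotificationClass = 2
  AND m.DepartmentID = 1
  AND m.IsMachineActive <> 0
OPTION (RECOMPILE);                  -- compile a fresh plan instead of reusing a cached one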

Complexity comparison between temporary table + index creation vice multi-table group by without index

I have two potential roads to take on the following problem, the try it and see methodology won't pay off for this solution as the load on the server is constantly in flux. The two approaches I have are as follows:
select *
from
(
  select foo.a, bar.b, baz.c
  from foo, bar, baz
  -- updated for clarity's sake
  where foo.a = bar.b
  and bar.b = baz.c
)
group by a, b, c
vice
create table results as
select foo.a, bar.b, baz.c
from foo, bar, baz
where foo.a = bar.b
and bar.b = baz.c;
create index results_spanning on results(a,b,c);
select * from results group by a,b,c;
So in case it isn't clear: the top query performs the GROUP BY outright against the multi-table select, thus preventing me from using an index. The second query lets me create a new table that stores the results of the query, then create a spanning index, and finally run the GROUP BY query so it can utilize the index.
What is the complexity difference between these two approaches, i.e. how do they scale, and which is preferable in the case of large quantities of data? Also, the main issue is the performance of the overall select, so that is what I am attempting to fix here.
Comments
Are you really doing a CROSS JOIN on three tables? Are those three columns indexed in their own right? How often do you want to run the query which delivers the end result?
1) No.
2) Yes, where clause omitted for the sake of discussion as this is clearly a super trivial example
3) Doesn't matter.
2nd Update
This is a temporary table as it is only valid for a brief moment in time, so yes this table will only be queried against one time.
If your query is executed frequently and unacceptably slow, you could look into creating materialized views to pre-compute the results. This gives you the benefit of an indexable "table", without the overhead of creating a table every time.
You'll need to refresh the materialized view (preferably fast if the tables are large) either on commit or on demand. There are some restrictions on how you can create on-commit, fast-refreshable views, and they will add slightly to your commit-time processing, but they will always give the same result as running the base query. On-demand MVs will become stale as the underlying data changes, until they are refreshed. You'll need to determine whether this is acceptable or not.
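A minimal Oracle-style sketch of the on-demand variant, reusing the tables from the question (whether a fast refresh is even possible depends on the schema, so this uses a complete refresh):
CREATE MATERIALIZED VIEW results_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
AS
SELECT foo.a, bar.b, baz.c
FROM foo, bar, baz
WHERE foo.a = bar.b
AND bar.b = baz.c;

CREATE INDEX results_mv_idx ON results_mv (a, b, c);

-- refresh when the underlying data has changed:
-- EXEC DBMS_MVIEW.REFRESH('RESULTS_MV');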
So the question is, which is quicker?
Run a query once and sort the result set?
Run a query once to build a table, then build an index, then run the query again and sort the result set?
Hmmm. Tricky one.
The use cases for temporary tables are pretty rare in Oracle. They normally only apply when we need to freeze a result set which we are then going to query repeatedly. That is apparently not the case here.
So, take the first option and just tune the query if necessary.
The answer is, as is so often the case with tuning questions, it depends.
Why are you doing a GROUP BY in the first place? The query as you posted it doesn't do any aggregation, so the only reason for doing GROUP BY would be to eliminate duplicate rows, i.e. a DISTINCT operation. If this is actually the case then you are doing some form of cartesian join, and one way of tuning the query would be to fix the WHERE clause so that it only returns discrete records.

Using temp table for sorting data in SQL Server

Recently, I came across a pattern (not sure; it could be an anti-pattern) for sorting data in a SELECT query. The pattern is a verbose and non-declarative way of ordering data: dump the relevant data from the actual table into a temporary table and then apply ORDER BY on a field of the temporary table. I guess the only reason someone would do that is to improve performance (which I doubt); there is no other benefit.
For example, let's say there is a user table. The table might contain millions of rows. We want to retrieve all the users whose first name starts with 'G', sorted by first name. The natural and more declarative way to implement a SQL query for this scenario is:
More natural and declarative way
SELECT * FROM Users
WHERE NAME LIKE 'G%'
ORDER BY Name
Verbose way
SELECT * INTO TempTable
FROM Users
WHERE NAME LIKE 'G%'
SELECT * FROM TempTable
ORDER BY Name
With that context, I have a few questions:
Will there be any performance difference between the two ways if there is no index on the first name field? If yes, which one would be better?
Will there be any performance difference between the two ways if there is an index on the first name field? If yes, which one would be better?
Should the SQL Server optimizer not generate the same execution plan for both ways?
Is there any benefit to writing it the verbose way from any other perspective, like locking/blocking?
Thanks in advance.
Regularly: an anti-pattern used by people without an idea of what they are doing.
SOMETIMES: OK, because SQL Server has a problem that is not resolvable otherwise - not seen that one in years, though.
It makes things slower because it forces the tempdb table to be fully populated FIRST, while otherwise the query could POSSIBLY be resolved more efficiently.
The last time I saw that was about 3 years ago. We made it 3 times as fast by not being "smart" and not using a tempdb table ;)
Answers:
1: No, it still needs a table scan, obviously.
2: Possibly - it depends on the data volume, but an index seek would return the data in order already (as the index is ordered by content).
3: No. Obviously. Query plan optimization is statement by statement. By cutting the execution in two, the query optimizer CANNOT merge the work into the first statement.
4: Only if you run into a query optimizer issue or a limitation on how many tables you can join - not in that degenerate case (degenerate in a technical sense - i.e. very simplistic). But if you need to join MANY, MANY tables it may be better to go with an interim step.
If the field you want to do an order by on is not indexed, you could put everything into a temp table and index it and then do the ordering and it might be faster. You would have to test to make sure.
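A rough sketch of that suggestion, using the Users table from the question:
SELECT * INTO #TempTable
FROM Users
WHERE Name LIKE 'G%';

CREATE CLUSTERED INDEX IX_TempTable_Name ON #TempTable (Name);

SELECT * FROM #TempTable
ORDER BY Name;   -- the ordered clustered index can satisfy this without a full sort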
There is never any benefit of the second approach that I can think of.
It means that if the data is available pre-ordered, SQL Server can't take advantage of this, and it adds an unnecessary blocking operator and an additional sort to the plan.
In the case that the data is not available pre-ordered, SQL Server will sort it in a work table (either in memory or tempdb) anyway, and adding an explicit #temp table just adds an unnecessary additional step.
Edit
I suppose one case where the second approach could give an apparent benefit might be if the presence of the ORDER BY caused SQL Server to choose a different plan that turned out to be sub-optimal. In that case I would resolve it a different way, by either improving statistics or by using hints/query rewrites to avoid the undesired plan.

Is there a performance difference between CTE , Sub-Query, Temporary Table or Table Variable?

In this excellent SO question, differences between CTE and sub-queries were discussed.
I would like to specifically ask:
In what circumstance is each of the following more efficient/faster?
CTE
Sub-Query
Temporary Table
Table Variable
Traditionally, I've used lots of temp tables in developing stored procedures - as they seem more readable than lots of intertwined sub-queries.
Non-recursive CTEs encapsulate sets of data very well, and are very readable, but are there specific circumstances where one can say they will always perform better? or is it a case of having to always fiddle around with the different options to find the most efficient solution?
EDIT
I've recently been told that in terms of efficiency, temporary tables are a good first choice as they have an associated histogram i.e. statistics.
SQL is a declarative language, not a procedural language. That is, you construct a SQL statement to describe the results that you want. You are not telling the SQL engine how to do the work.
As a general rule, it is a good idea to let the SQL engine and SQL optimizer find the best query plan. There are many person-years of effort that go into developing a SQL engine, so let the engineers do what they know how to do.
Of course, there are situations where the query plan is not optimal. Then you want to use query hints, restructure the query, update statistics, use temporary tables, add indexes, and so on to get better performance.
As for your question: the performance of CTEs and subqueries should, in theory, be the same, since both provide the same information to the query optimizer. One difference is that a CTE used more than once could be easily identified and calculated once. The results could then be stored and read multiple times. Unfortunately, SQL Server does not seem to take advantage of this basic optimization method (you might call it common subquery elimination).
Temporary tables are a different matter, because you are providing more guidance on how the query should be run. One major difference is that the optimizer can use statistics from the temporary table to establish its query plan. This can result in performance gains. Also, if you have a complicated CTE (subquery) that is used more than once, then storing it in a temporary table will often give a performance boost. The query is executed only once.
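A hedged sketch of that pattern (the table and column names are invented): materialize the repeated subquery once, index it, then reference it multiple times:
SELECT CustomerID, SUM(Amount) AS Total
INTO #Totals
FROM dbo.Orders                    -- hypothetical source table
GROUP BY CustomerID;

CREATE UNIQUE CLUSTERED INDEX IX_Totals ON #Totals (CustomerID);

-- the expensive aggregate ran once; both references below just read #Totals
SELECT t1.CustomerID, t2.CustomerID
FROM #Totals t1
JOIN #Totals t2
  ON t2.Total = t1.Total
 AND t2.CustomerID < t1.CustomerID;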
The answer to your question is that you need to play around to get the performance you expect, particularly for complex queries that are run on a regular basis. In an ideal world, the query optimizer would find the perfect execution path. Although it often does, you may be able to find a way to get better performance.
There is no rule. I find CTEs more readable, and use them unless they exhibit some performance problem, in which case I investigate the actual problem rather than guess that the CTE is the problem and try to re-write it using a different approach. There is usually more to the issue than the way I chose to declaratively state my intentions with the query.
There are certainly cases when you can unravel CTEs or remove subqueries and replace them with a #temp table and reduce duration. This can be due to various things, such as stale stats, the inability to even get accurate stats (e.g. joining to a table-valued function), parallelism, or even the inability to generate an optimal plan because of the complexity of the query (in which case breaking it up may give the optimizer a fighting chance). But there are also cases where the I/O involved with creating a #temp table can outweigh the other performance aspects that may make a particular plan shape using a CTE less attractive.
Quite honestly, there are way too many variables to provide a "correct" answer to your question. There is no predictable way to know when a query may tip in favor of one approach or another - just know that, in theory, the same semantics for a CTE or a single subquery should execute the exact same. I think your question would be more valuable if you present some cases where this is not true - it may be that you have discovered a limitation in the optimizer (or discovered a known one), or it may be that your queries are not semantically equivalent or that one contains an element that thwarts optimization.
So I would suggest writing the query in a way that seems most natural to you, and only deviate when you discover an actual performance problem the optimizer is having. Personally I rank them CTE, then subquery, with #temp table being a last resort.
#temp is materialized and a CTE is not.
A CTE is just syntax, so in theory it is just a subquery; it is executed in place. #temp is materialized. So an expensive CTE in a join that is executed many times may be better in a #temp. On the other hand, if it is an easy evaluation that is only executed a few times, it is not worth the overhead of #temp.
There are some people on SO that don't like table variables, but I like them, as they are materialized and faster to create than #temp. There are times when the query optimizer does better with a #temp compared to a table variable.
The ability to create a PK on a #temp or table variable gives the query optimizer more information than a CTE (as you cannot declare a PK on a CTE).
Just 2 things I think make it ALWAYS preferable to use a #temp table rather than a CTE are:
You cannot put a primary key on a CTE, so the data being accessed by the CTE will have to traverse each one of the indexes in the CTE's tables rather than just accessing the PK or index on the temp table.
Because you cannot add constraints, indexes and primary keys to a CTE, they are more prone to bugs creeping in and bad data.
Here is an example where #temp table constraints can prevent bad data, which is not the case with CTEs:
DECLARE @BadData TABLE (
    ThisID int
  , ThatID int );

INSERT INTO @BadData
    ( ThisID
    , ThatID
    )
VALUES
    ( 1, 1 ),
    ( 1, 2 ),
    ( 2, 2 ),
    ( 1, 1 );

IF OBJECT_ID('tempdb..#This') IS NOT NULL
    DROP TABLE #This;

CREATE TABLE #This (
    ThisID int NOT NULL
  , ThatID int NOT NULL
  , UNIQUE(ThisID, ThatID) );

-- this INSERT fails: the UNIQUE constraint rejects the duplicate (1, 1) row
INSERT INTO #This
SELECT * FROM @BadData;

-- a CTE over the same data happily returns the duplicates
WITH This_CTE
AS (SELECT *
    FROM @BadData)
SELECT *
FROM This_CTE;

What are the advantages/disadvantages of using a CTE?

I'm looking at improving the performance of some SQL, currently CTEs are being used and referenced multiple times in the script. Would I get improvements using a table variable instead? (Can't use a temporary table as the code is within functions).
You'll really have to performance test - there is no yes/no answer. As the post Andy Living links to above explains, a CTE is just shorthand for a query or subquery.
If you are calling it twice or more in the same function, you might get better performance if you fill a table variable and then join to/select from that. However, as table variables take up space somewhere, and don't have indexes/statistics (With the exception of any declared primary key on the table variable) there's no way of saying which will be faster.
They both have costs and savings, and which is the best way depends on the data they pull in and what they do with it. I've been in your situation, and after testing for speed under various conditions - Some functions used CTEs, and others used table variables.
A CTE is not much more than syntactic sugar.
It enhances readability and allows you to avoid repetition.
Just think of it as a placeholder for the actual statement specified in the WITH()-clause. The engine will replace any occurrence of the CTE's name in your query with this statement (quite similar to a view). This is the meaning of inline.
Compared to a previously filled table (declared or created) you'll find advantages:
useable in ad-hoc-queries (functions, views)
no unexpected side effects (most narrow scope)
...and disadvantages:
You cannot use the CTE's result in different statements
You cannot use indexes, statistics to optimize your CTE's set (although it will implicitly use existing indexes and statistics of the targeted objects - if appropriate).
In terms of performance a persisted set (declared or created table) can be (much!) better in some cases, but it forces you into procedural code. You will have to race your horses to find out which is better...
Example: Various approaches to do the same
The following simple (rather useless) example describes a set of user tables together with their columns. I use various different approaches to tell SQL-Server what I want:
Try this with "include actual execution plan"
USE master; --in my case the master database has just 5 "user tables", you can use any other DB of course
GO
--simple join, first the small set joining to the large set
SELECT o.name AS TableName
,c.name AS ColumnName
FROM sys.objects o
INNER JOIN sys.columns c ON c.object_id=o.object_id
WHERE o.type='U';
GO
--simple join "the other way round" with the filter as part of the ON-clause
SELECT o.name AS TableName
,c.name AS ColumnName
FROM sys.columns c
INNER JOIN sys.objects o ON c.object_id=o.object_id AND o.type='U';
GO
--join from the large set with a sub-query to the small set
SELECT o.name AS TableName
,c.name AS ColumnName
FROM sys.columns c
INNER JOIN (
SELECT o.*
FROM sys.objects o
WHERE o.type='U' --user tables
) o ON c.object_id=o.object_id;
GO
--join for large to small with a row-wise APPLY
SELECT o.name AS TableName
,c.name AS ColumnName
FROM sys.columns c
CROSS APPLY (
SELECT o.*
FROM sys.objects o
WHERE o.type='U' --user tables
AND o.object_id=c.object_id
) o;
GO
--use a CTE to "pre-filter" the small set
WITH cte AS
(
SELECT o.*
FROM sys.objects o
WHERE o.type='U' --user tables
)
SELECT cte.name AS TableName
,c.name AS ColumnName
FROM sys.columns c
INNER JOIN cte ON c.object_id=cte.object_id;
GO
Now look at the result and at the execution plans:
All queries return the same result.
All queries produce the same execution plan
Important hint: This might differ on your machine!
Why is this?
T-SQL is a declarative language. Your statement is a description of WHAT you want to retrieve. It is not your job to tell the engine HOW this is done.
SQL-Server's extremely smart engine will find the best way to get the set you asked for. In the case above all result descriptions point to the same goal. The engine can deduce this from various statements and finds the same plan for all of them.
Well, is it just a matter of taste?
In a way...
There are some important things to keep in mind:
There is no reason for the engine to compute the CTE's result before the rest (although the statement might look so). Therefore it is wrong to describe a CTE as something like a temp table...
In other words: The visible order of your statement does not predict the actual order of execution!
The smart engine will reach its limits with complexity and nesting level. Imagine various VIEWs, all using CTEs and calling each other...
There are cases where the engine really f**s up. I remember a case where a CTE did not do much more than a TRY_CAST. The idea was to ensure valid values in the query below. But the engine thought "Oh, just a CAST, not expensive!" and moved the actual CAST to a higher position in the execution plan. I remember another case where the engine performed an expensive operation against millions of rows (unnecessarily, as the final result was filtered down to a tiny set), just because the actual order of execution was not as expected.
Okay... So when should I use a CTE?
The following points are good reasons to use a CTE:
A CTE can help you to avoid repeated sub queries.
A CTE can be used multiple times within your statement, e.g. within a JOIN with a dynamic behavior depending on the actual row-count.
You can use multiple CTEs within one statement and you can use the result of one CTE within a later CTE.
There are recursive (or better iterative) CTEs.
Sometimes I used single-row CTEs to define / pre-compute variables later used in the query - things you would do with declared variables in procedural T-SQL. You can use a CROSS JOIN to get them into your query easily (see the sketch after this list).
and also very nice: the updatable CTE allows for very easy-to-read statements, same applies for DELETE.
As above: Nothing one could not do without the CTE, but it is far better to read (I really like speaking names).
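A small sketch of the single-row-CTE trick mentioned above (the table name is invented):
WITH params AS
(
    SELECT DATEADD(DAY, -7, GETUTCDATE()) AS CutOff   -- a pre-computed "variable"
)
SELECT t.*
FROM params p
CROSS JOIN dbo.SomeTable t    -- hypothetical table
WHERE t.CreatedOn > p.CutOff;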
Final hints
Well, there are cases where ugly code performs better :-)
It is always good to have clean and readable code. A CTE will help you with this. So give it a try. If the performance is bad, get into depth, look at the execution plans and try to find a reason where the engine might decide wrong.
In most cases it is a bad idea to try to outsmart the engine with hints such as FORCE ORDER (but it can help)
UPDATE
I was asked to point to advantages and disadvantages specifically:
Uhm, technically there are no real advantages or disadvantages. Disregarding recursive CTEs there's nothing one couldn't solve without a CTE.
Advantages
The main advantage is readability and maintainability.
Sometimes a CTE can save hundreds of lines of code. Instead of repeating a huge sub-query, one can use just a name as a variable. Corrections to the sub-query can be made in just one place.
The CTE can serve in ad-hoc queries and make your life easier.
Disadvantages
One possible disadvantage is that it's very easy, even for experienced developers, to mistake a CTE for a temp table, assume that the visible order of steps will be the same as the actual order of execution, and stumble into unexpected results or even errors.
And - of course :-) - there's the strange syntax error you'll see when you write a CTE after another statement without a separating ;. That's why many people tend to use ;WITH.
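A tiny illustration of that pitfall:
DECLARE @x int = 1              -- previous statement not terminated with a semicolon
;WITH cte AS (SELECT @x AS v)   -- the leading ; avoids the "incorrect syntax near 'WITH'" error
SELECT v FROM cte;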
Probably not. CTEs are especially good at querying data for tree structures.
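For instance, a minimal recursive CTE over a hypothetical Employees table that walks an org tree:
WITH OrgChart AS
(
    SELECT EmployeeID, ManagerID, 0 AS Depth
    FROM dbo.Employees              -- hypothetical table
    WHERE ManagerID IS NULL         -- anchor member: the root(s) of the tree
    UNION ALL
    SELECT e.EmployeeID, e.ManagerID, oc.Depth + 1
    FROM dbo.Employees e
    JOIN OrgChart oc ON e.ManagerID = oc.EmployeeID   -- recursive member: one level down
)
SELECT EmployeeID, ManagerID, Depth FROM OrgChart;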
The information and quotes are from the following article on mssqltips.com "Choose Between SQL Server Subquery T-SQL Code" by Eric Blinn. https://www.mssqltips.com/sqlservertip/6618/sql-server-query-performance-cte-view-subquery-temp-table-table-variable/
SQL Server 2019 CTEs, subqueries, and views
The SQL Server [2019] engine optimizes every query that is given to it. When
it encounters a CTE, traditional subquery, or view, it sees them all
the same way and optimizes them the same way. This involves looking
at the underlying tables, considering their statistics, and choosing
the best way to proceed. In most cases they will return the same plan
and therefore perform exactly the same.
TempDB table
For the query that inserted rows into the temporary table, the
optimizer looked at the table statistics and chose the best way
forward. It actually made new table statistics for the temporary
table and then used them to run the second query. This brings about very
similar performance.
Table variable
The table variable has poor performance in the example given in the article due to lack of table statistics.
...the table variable does not have any table statistics generated for
it like the TempDB table did. This means the optimizer has to make a
wild guess as to how to proceed. In this example it made a very, very
poor decision.
This is not to write off table variables. They surely have their
place as will be discussed later in the tip.
Temp table vs Table variable
A temporary table will be stored on disk and have statistics
calculated on it and a table variable will not. Because of this
difference temporary tables are best when the expected row count
is >100 and the table variable for smaller expected row counts where the lack of statistics will be less likely to lead to a bad query plan.
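A minimal sketch contrasting the two declarations:
CREATE TABLE #BigSet (ID int PRIMARY KEY);     -- gets statistics; suits expected row counts > 100
DECLARE @SmallSet TABLE (ID int PRIMARY KEY);  -- never gets statistics; suits small row counts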
Advantages of CTE
A CTE can be thought of as a 'temporary view' and used as a good alternative to a view in some cases.
The main advantage over a view is memory usage. As a CTE's scope is limited to its batch, the memory allocated for it is freed as soon as its batch is crossed. But once a view is created, it is stored until the user drops it. If the view is not used after creation, then it's a mere waste of memory.
The CPU cost of CTE execution is lower than that of a view.
Unlike a view, a CTE doesn't store any metadata of its definition, and it provides better readability.
A CTE can be referred to multiple times in a query.
As the scope is limited to the batch, multiple CTEs can have the same name, which views cannot.
It can be made recursive.
Disadvantages of CTE
Though using a CTE is advantageous, it does have some limitations to keep in mind:
We know that it is a substitute for a view, but a CTE cannot be nested, while views can be nested.
A view, once declared, can be used any number of times, but a CTE cannot; it must be declared every time you want to use it. For this scenario, a CTE is not recommended, as it is a tiring job for the user to declare the batches again and again.
Between the Anchor Members there should be set operators like UNION, UNION ALL or EXCEPT.
In Recursive CTEs, you can define many Anchor Members and Recursive Members, but all the Anchor Members must be defined before the first Recursive Member. You cannot define an Anchor Member between two Recursive Members.
The number of columns and their data types in the Anchor and Recursive Members should be the same.
In a Recursive Member, aggregate functions, TOP, DISTINCT, the HAVING and GROUP BY clauses, sub-queries, and outer joins (LEFT, RIGHT or FULL) are not allowed; only INNER JOIN is allowed.
The recursion limit is 32,767 (controlled by the MAXRECURSION hint); exceeding the limit aborts the statement with an error, which guards against runaway infinite loops.
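For example, the limit can be raised (or lowered) per query with the MAXRECURSION hint:
WITH nums AS
(
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM nums WHERE n < 500
)
SELECT n FROM nums
OPTION (MAXRECURSION 500);   -- default limit is 100; 0 means unlimited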