Do CTEs improve performance? - sql

with ini as
(
select ...
)
select ...
from ini a
join ini b on ...
join ini c on ...
How many times does the SQL Server engine calculate the results from the ini table ?
My question which I'm trying to answer (with your help) is if the with statement (CTE) improves performance by aliasing the results.

The CTE ini is simply a macro that expands and this use is syntax/clarity only.
MSDN says:
Using a CTE offers the advantages of improved readability and ease in maintenance of complex queries
Nothing about performance.
It is evaluated per mention: so three times here which you can see from an execution plan.
For recursive CTEs it's somewhat different, as the CTE builds upon itself, but it will still be evaluated once per mention.

A CTE (common table expression, the part that is wrapped in the "with") is essentially a 1-time view. If you think of it in terms of a temporary view, perhaps the answer will become more clear. As far as I know, the interpreter will simply do the equivalent of copy/pasting whatever is within the CTE into the main query wherever it finds the reference.
I'm sure there are outside instances where it appears to help, but more often than not, I'd assume that the mere presence of a CTE itself is not going to improve the performance of a query. It'll help with readability and re-usability within that single select statement (i.e., you won't have to re-type the same sub-query multiple times), but I don't believe it will magically make things run faster (all things being equal). Of course, if your query is structured differently within the CTE than you would have done w/ sub-queries, then it's quite possible the CTE runs faster at that point, but you're now comparing apples to oranges.

I suppose it would also depend on whether you were using it to replace a derived table or a correlated subquery. Performance would be about the same in the first case and probably significantly better in the second if you joined to the CTE rather than just replacing the subquery code with a reference to the CTE. If you used it to replace a WHERE NOT EXISTS clause with a left join to a CTE (in order to find the records in one table but not the other), I'd expect performance to be worse, as WHERE NOT EXISTS is usually the fastest way to do that type of task. I guess what I'm saying is that performance will still depend on how you use the CTE, not just the fact that you created one.
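To make the NOT EXISTS comparison concrete, here is a minimal T-SQL sketch of the two forms. The tables dbo.Customers(CustomerId) and dbo.Orders(OrderId, CustomerId) are hypothetical and not from the answers above. Which form is faster depends on the data and indexes, so compare the execution plans rather than assuming.

```sql
-- Hypothetical tables: dbo.Customers(CustomerId), dbo.Orders(OrderId, CustomerId)

-- Variant 1: NOT EXISTS (customers that have no orders)
SELECT c.CustomerId
FROM dbo.Customers AS c
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.Orders AS o
    WHERE o.CustomerId = c.CustomerId
);

-- Variant 2: LEFT JOIN to a CTE, keeping only the unmatched rows
WITH ordered AS (
    SELECT DISTINCT CustomerId
    FROM dbo.Orders
)
SELECT c.CustomerId
FROM dbo.Customers AS c
LEFT JOIN ordered AS o
    ON o.CustomerId = c.CustomerId
WHERE o.CustomerId IS NULL;
```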

Related

Does the number of columns used for a CTE affect the performance of the query?

Does using more columns within a CTE query affect performance? I am currently trying to execute a query with the WITH clause, and it seems that if I use more columns, it takes more time to load the data. Am I correct?
The number of columns defined in a CTE should have no effect on the actual performance of the query (it might affect the compile time, which is generally minuscule).
Why? Because SQL Server "embeds" the code for the CTE in the query itself and then optimizes all the code together. Unused columns should be eliminated.
This might be an over generalization. There might be some cases where SQL Server doesn't eliminate the work for columns -- such as extra aggregation functions in an aggregation query or certain subqueries. But, in general, what is important is how the CTE is used, not how many columns are defined in it.
You can think of a CTE as a view, but it doesn't materialize to disk. A view expands its definition at run time, and the same goes for a CTE.
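One way to check the claim about unused columns is to compare the plan of a query that goes through a "wide" CTE with the plan of the equivalent direct query. The sketch below uses the sys.objects catalog view. The CTE name wide is only illustrative, and the plans are expected to match only in simple cases like this one.

```sql
-- The CTE defines five columns, but the outer query uses only two of them
WITH wide AS (
    SELECT o.object_id, o.name, o.type, o.create_date, o.modify_date
    FROM sys.objects AS o
)
SELECT name
FROM wide
WHERE type = 'U';

-- Equivalent direct query; compare its execution plan with the one above
SELECT o.name
FROM sys.objects AS o
WHERE o.type = 'U';
```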

What are the pros/cons of using SQL variables versus subqueries?

I'm wondering there is a difference between SQL variables and subqueries. Whether one uses more processing power, or one is quicker, or even if one merely is more readable.
For (a very basic) example, I like to use variables to hold polygon and transformations in PostGIS:
WITH region_polygon AS (
SELECT ST_Transform(wkb_geometry, %(fishnet_srid)d) geom
FROM regions
LIMIT 1
), raster_pixels AS (
SELECT (ST_PixelAsPolygons(rast)).*
FROM test_regions_raster
LIMIT 1
)
SELECT x, y
FROM raster_pixels a, region_polygon b
WHERE ST_Within(a.geom, b.geom)
But would it be better in any way to use subqueries?
SELECT x, y
FROM (
SELECT ST_Transform(wkb_geometry, %(fishnet_srid)d) geom
FROM regions
LIMIT 1
) a, (
SELECT (ST_PixelAsPolygons(rast)).*
FROM test_regions_raster
LIMIT 1
) b
WHERE ST_Within(b.geom, a.geom)
Note that I'm using PostgreSQL.
There's an important syntactic advantage of common table expressions over derived tables when it comes to reuse. Consider the following, equivalent examples using self-joins:
Using common table expressions
WITH a(v) AS (SELECT 1 UNION SELECT 2)
SELECT *
FROM a AS x, a AS y
Using derived tables
SELECT *
FROM (SELECT 1 UNION SELECT 2) x(v),
(SELECT 1 UNION SELECT 2) y(v)
As you can see, using common table expressions, the view (SELECT 1 UNION SELECT 2) can be reused multiple times in your query. With derived tables, you will have to repeat your view declaration. In my example, this is still OK. In your own example, this starts getting a bit more hairy.
It's all about scope
Views in SQL are all about scoping. There are essentially four levels of declaring views:
As derived tables. They can be consumed exactly once.
As common table expressions. They can be consumed several times, but only in one query.
As views. They can be consumed several times in several queries.
As materialized views. Same as views, but the data is pre-calculated.
Some databases (in particular PostgreSQL) also know table-valued functions. From a mere syntax perspective, they're just like views - parameterised views.
Performance
Note that these thoughts only focus on syntax, not query planning. The different approaches may have very different performance implications, depending on the database vendor.
Those aren't variables, they're common table expressions (cte). In your query above, the execution plans are likely identical, because the optimizer should recognize they are equivalent queries. I prefer to use cte's because I think they're easier to read than subqueries, but that's it.
Edit: Upon further reading it looks like PostgreSQL does treat common table expressions differently than other databases, you can't update a cte in PostgreSQL, for instance. I'll leave my answer here because I believe for your query there won't be a difference, but I'm not terribly familiar with PostgreSQL.
As pointed out this construct is called Common Table Expression, not a variable.
I prefer to use CTE, rather than subquery, because it is way easier to read and write for me, especially when you have several nested CTEs.
You can write CTE once and refer to it several times in the rest of the query. With subquery you'll have to repeat the code several times.
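An important difference between PostgreSQL and other databases (at least MS SQL Server) is that PostgreSQL evaluates each CTE only once. This was always the case before version 12; since version 12, a non-recursive CTE without side effects that is referenced only once may be folded into the parent query instead.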
A useful property of WITH queries is that they are evaluated only once
per execution of the parent query, even if they are referred to more
than once by the parent query or sibling WITH queries. Thus, expensive
calculations that are needed in multiple places can be placed within a
WITH query to avoid redundant work. Another possible application is to
prevent unwanted multiple evaluations of functions with side-effects.
However, the other side of this coin is that the optimizer is less
able to push restrictions from the parent query down into a WITH query
than an ordinary sub-query. The WITH query will generally be evaluated
as written, without suppression of rows that the parent query might
discard afterwards. (But, as mentioned above, evaluation might stop
early if the reference(s) to the query demand only a limited number of
rows.)
MS SQL Server inlines each reference to a CTE into the main query and optimizes the whole result, but PostgreSQL doesn't. In some sense PostgreSQL is more flexible here. If you want the subquery to be evaluated only once, put it in a CTE. If you don't, put it in a subquery and repeat the code. In SQL Server you'd have to use a temporary table explicitly.
Your example in the question is too simple and most likely both variants are equivalent - check the execution plan.
The official docs mention it, as I quoted above, but Nick Barnes gave a link to a good article explaining it in more detail, and I thought it was worth putting in an answer rather than a comment.
When optimising queries in PostgreSQL (true at least in 9.4 and
older), it’s worth keeping in mind that – unlike newer versions of
various other databases – PostgreSQL will always materialise a CTE
term in a query.
This can have quite surprising effects for those used to working with
DBs like MS SQL:
A query that should touch a small amount of data instead reads a whole
table and possibly spills it to a tempfile; and
You cannot UPDATE or DELETE FROM a CTE term, because it’s more like a
read-only temp table rather than a dynamic view.
So, there is no definite answer whether CTE is better than subquery in PostgreSQL. In some cases it can be faster, in some cases it can be slower. But, IMHO, in most cases CTE is easier to write, read and maintain.
And, obviously, there is a case when you have no other option, but to use so-called recursive CTE (recursive queries are typically used to deal with hierarchical or tree-structured data).
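The quotes above describe PostgreSQL 9.4 and older. In PostgreSQL 12 and later the choice can be stated explicitly with the MATERIALIZED and NOT MATERIALIZED keywords. The sketch below assumes a hypothetical table big_table(id, payload); compare the two with EXPLAIN on your own data.

```sql
-- Hypothetical table: big_table(id, payload). Requires PostgreSQL 12 or later.

-- Evaluate the CTE once and store its result (the only behavior before version 12)
WITH w AS MATERIALIZED (
    SELECT id, payload
    FROM big_table
)
SELECT *
FROM w
WHERE id = 42;

-- Let the planner fold the CTE into the outer query,
-- so the filter on id can be pushed down into the scan of big_table
WITH w AS NOT MATERIALIZED (
    SELECT id, payload
    FROM big_table
)
SELECT *
FROM w
WHERE id = 42;
```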

Is there a performance difference between CTE, Sub-Query, Temporary Table or Table Variable?

In this excellent SO question, differences between CTE and sub-queries were discussed.
I would like to specifically ask:
In what circumstance is each of the following more efficient/faster?
CTE
Sub-Query
Temporary Table
Table Variable
Traditionally, I've used lots of temp tables in developing stored procedures - as they seem more readable than lots of intertwined sub-queries.
Non-recursive CTEs encapsulate sets of data very well, and are very readable, but are there specific circumstances where one can say they will always perform better? or is it a case of having to always fiddle around with the different options to find the most efficient solution?
EDIT
I've recently been told that in terms of efficiency, temporary tables are a good first choice as they have an associated histogram i.e. statistics.
SQL is a declarative language, not a procedural language. That is, you construct a SQL statement to describe the results that you want. You are not telling the SQL engine how to do the work.
As a general rule, it is a good idea to let the SQL engine and SQL optimizer find the best query plan. There are many person-years of effort that go into developing a SQL engine, so let the engineers do what they know how to do.
Of course, there are situations where the query plan is not optimal. Then you want to use query hints, restructure the query, update statistics, use temporary tables, add indexes, and so on to get better performance.
As for your question. The performance of CTEs and subqueries should, in theory, be the same since both provide the same information to the query optimizer. One difference is that a CTE used more than once could be easily identified and calculated once. The results could then be stored and read multiple times. Unfortunately, SQL Server does not seem to take advantage of this basic optimization method (you might call this common subquery elimination).
Temporary tables are a different matter, because you are providing more guidance on how the query should be run. One major difference is that the optimizer can use statistics from the temporary table to establish its query plan. This can result in performance gains. Also, if you have a complicated CTE (subquery) that is used more than once, then storing it in a temporary table will often give a performance boost. The query is executed only once.
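As a rough illustration of storing a reused subquery in a temporary table, the sketch below computes an aggregate once and then joins the stored result to itself. The table dbo.Sales(Region, Amount) is hypothetical, and whether this beats the CTE form depends on the data.

```sql
-- Hypothetical table: dbo.Sales(Region, Amount)
IF OBJECT_ID('tempdb..#RegionTotals') IS NOT NULL
    DROP TABLE #RegionTotals;

-- The aggregate is computed exactly once and stored in tempdb
SELECT Region, SUM(Amount) AS Total
INTO #RegionTotals
FROM dbo.Sales
GROUP BY Region;

-- Both references read the stored rows, and the optimizer
-- can use statistics created on #RegionTotals
SELECT a.Region,
       b.Region AS OtherRegion,
       a.Total - b.Total AS Diff
FROM #RegionTotals AS a
JOIN #RegionTotals AS b
    ON a.Total > b.Total;
```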
The answer to your question is that you need to play around to get the performance you expect, particularly for complex queries that are run on a regular basis. In an ideal world, the query optimizer would find the perfect execution path. Although it often does, you may be able to find a way to get better performance.
There is no rule. I find CTEs more readable, and use them unless they exhibit some performance problem, in which case I investigate the actual problem rather than guess that the CTE is the problem and try to re-write it using a different approach. There is usually more to the issue than the way I chose to declaratively state my intentions with the query.
There are certainly cases when you can unravel CTEs or remove subqueries and replace them with a #temp table and reduce duration. This can be due to various things, such as stale stats, the inability to even get accurate stats (e.g. joining to a table-valued function), parallelism, or even the inability to generate an optimal plan because of the complexity of the query (in which case breaking it up may give the optimizer a fighting chance). But there are also cases where the I/O involved with creating a #temp table can outweigh the other performance aspects that may make a particular plan shape using a CTE less attractive.
Quite honestly, there are way too many variables to provide a "correct" answer to your question. There is no predictable way to know when a query may tip in favor of one approach or another - just know that, in theory, the same semantics for a CTE or a single subquery should execute the exact same. I think your question would be more valuable if you present some cases where this is not true - it may be that you have discovered a limitation in the optimizer (or discovered a known one), or it may be that your queries are not semantically equivalent or that one contains an element that thwarts optimization.
So I would suggest writing the query in a way that seems most natural to you, and only deviate when you discover an actual performance problem the optimizer is having. Personally I rank them CTE, then subquery, with #temp table being a last resort.
#temp is materialized and CTE is not.
CTE is just syntax, so in theory it is just a subquery. It is executed. #temp is materialized. So an expensive CTE in a join that is executed many times may be better in a #temp. On the other hand, if it is an easy evaluation that is executed only a few times, then it is not worth the overhead of #temp.
There are some people on SO that don't like table variables, but I like them as they are materialized and faster to create than #temp. There are times when the query optimizer does better with a #temp compared to a table variable.
The ability to create a PK on a #temp or table variable gives the query optimizer more information than a CTE (as you cannot declare a PK on a CTE).
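For reference, declaring a primary key looks like this on each kind of object. The names #Ids and @Ids are made up for this sketch.

```sql
-- Temporary table with a primary key (and therefore an index)
CREATE TABLE #Ids (
    Id int NOT NULL PRIMARY KEY
);

-- Table variable with a primary key declared inline
DECLARE @Ids TABLE (
    Id int NOT NULL PRIMARY KEY
);

-- A CTE has no equivalent: it is only a named query,
-- so there is nothing to attach a key or an index to
```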
Just 2 things I think make it ALWAYS preferable to use a #temp table rather than a CTE are:
You cannot put a primary key on a CTE, so the data being accessed by the CTE will have to traverse each one of the indexes in the CTE's tables rather than just accessing the PK or index on the temp table.
Because you cannot add constraints, indexes and primary keys to a CTE, they are more prone to bugs creeping in and bad data.
-onedaywhen yesterday
Here is an example where #table constraints can prevent bad data which is not the case in CTE's
DECLARE @BadData TABLE (
ThisID int
, ThatID int );
INSERT INTO @BadData
( ThisID
, ThatID
)
VALUES
( 1, 1 ),
( 1, 2 ),
( 2, 2 ),
( 1, 1 ); -- duplicate of the first row
IF OBJECT_ID('tempdb..#This') IS NOT NULL
DROP TABLE #This;
CREATE TABLE #This (
ThisID int NOT NULL
, ThatID int NOT NULL
, UNIQUE(ThisID, ThatID) );
-- Fails with a unique constraint violation because (1, 1) appears twice
INSERT INTO #This
SELECT * FROM @BadData;
-- The CTE returns the duplicate row without complaint
WITH This_CTE
AS (SELECT *
FROM @BadData)
SELECT *
FROM This_CTE;

Transact-SQL - sub query or left-join?

I have two tables containing Tasks and Notes, and want to retrieve a list of tasks with the number of associated notes for each one. These two queries do the job:
select t.TaskId,
(select count(n.TaskNoteId) from TaskNote n where n.TaskId = t.TaskId) 'Notes'
from Task t
-- or
select t.TaskId,
count(n.TaskNoteId) 'Notes'
from Task t
left join
TaskNote n
on t.TaskId = n.TaskId
group by t.TaskId
Is there a difference between them and should I be using one over the other, or are they just two ways of doing the same job? Thanks.
On small datasets they are a wash when it comes to performance. When indexed, the left outer join (LOJ) is a little better.
I've found on large datasets that a join (an inner join will work too) will outperform the subquery by a very large factor (sorry, no numbers).
In most cases, the optimizer will treat them the same.
I tend to prefer the second, because it has less nesting, which makes it easier to read and easier to maintain. I have started to use SQL Server's common table expressions to reduce nesting as well for the same reason.
In addition, the second syntax is more flexible if there are further aggregates which may be added in the future in addition to COUNT, like MIN(some_scalar), MAX(), AVG() etc.
The subquery will be slower as it is being executed for every row in the outer query. The join will be faster as it is done once. I believe that the query optimiser will not rewrite this query plan as it can't recognize the equivalence.
Normally you would do a join and group by for this sort of count. Correlated subqueries of the sort you show are mainly of interest if they have to do some grouping or more complex predicate on a table that is not participating in another join.
If you're using SQL Server Management Studio, you can enter both versions into the Query Editor and then right-click and choose Display Estimated Execution Plan. It will give you two percentage costs relative to the batch. If they're expected to take the same time, they'll both show as 50% - in which case, choose whichever you prefer for other reasons (easier to read, easier to maintain, better fit with your coding standards etc). Otherwise, you can pick the one with the lower percentage cost relative to the batch.
You can use the same technique to look at changing any query to improve performance by comparing two versions that do the same thing.
Of course, because it's a cost relative to the batch, it doesn't mean that either query is as fast as it could be - it just tells you how they compare to each other, not to some notional optimum query to get the same results.
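Estimated costs are only estimates, so it can also help to measure what each version actually does. A minimal sketch, using the two queries from the question: SET STATISTICS IO and SET STATISTICS TIME print logical reads and CPU/elapsed time for each statement in the Messages tab.

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Version 1: correlated subquery
SELECT t.TaskId,
       (SELECT COUNT(n.TaskNoteId)
        FROM TaskNote n
        WHERE n.TaskId = t.TaskId) AS Notes
FROM Task t;

-- Version 2: left join with GROUP BY
SELECT t.TaskId,
       COUNT(n.TaskNoteId) AS Notes
FROM Task t
LEFT JOIN TaskNote n
    ON t.TaskId = n.TaskId
GROUP BY t.TaskId;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```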
There's no clear-cut answer on this. You should view the SQL Plan. In terms of relational algebra, they are essentially equivalent.
I make it a point to avoid subqueries wherever possible. The join will generally be more efficient.
You can use either, and they are semantically identical. In general, the rule of thumb is to use whichever form is easier for you to read, unless performance is an issue.
If performance is an issue, then experiment with rewriting the query using the other form. Sometimes the optimizer will use an index for one form, and not the other.

What are the advantages/disadvantages of using a CTE?

I'm looking at improving the performance of some SQL, currently CTEs are being used and referenced multiple times in the script. Would I get improvements using a table variable instead? (Can't use a temporary table as the code is within functions).
You'll really have to performance test. There is no yes/no answer. As the post that Andy Living links to above explains, a CTE is just shorthand for a query or subquery.
If you are calling it twice or more in the same function, you might get better performance if you fill a table variable and then join to/select from that. However, as table variables take up space somewhere, and don't have indexes/statistics (With the exception of any declared primary key on the table variable) there's no way of saying which will be faster.
They both have costs and savings, and which is the best way depends on the data they pull in and what they do with it. I've been in your situation, and after testing for speed under various conditions - Some functions used CTEs, and others used table variables.
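A sketch of the table-variable approach inside a function might look like the following. The function name dbo.RegionPairs and the table dbo.Sales(Region, Amount) are hypothetical, and Region is assumed to be non-null so that it can serve as the primary key. Test it against the CTE version before adopting it.

```sql
-- Hypothetical multi-statement table-valued function
CREATE FUNCTION dbo.RegionPairs ()
RETURNS @result TABLE (Region varchar(50), OtherRegion varchar(50))
AS
BEGIN
    -- Fill the table variable once; the primary key gives it an index
    DECLARE @totals TABLE (
        Region varchar(50) NOT NULL PRIMARY KEY,
        Total  money NULL
    );

    INSERT INTO @totals (Region, Total)
    SELECT Region, SUM(Amount)
    FROM dbo.Sales
    GROUP BY Region;

    -- Both references read the stored rows instead of re-running the aggregate
    INSERT INTO @result (Region, OtherRegion)
    SELECT a.Region, b.Region
    FROM @totals AS a
    JOIN @totals AS b
        ON a.Total > b.Total;

    RETURN;
END;
```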
A CTE is not much more than syntactic sugar.
It enhances the readability and allows to avoid repetition.
Just think of it as a placeholder for the actual statement specified in the WITH clause. The engine will replace any occurrence of the CTE's name in your query with this statement (quite similar to a view). This is the meaning of inline.
Compared to a previously filled table (declared or created) you'll find advantages:
usable in ad-hoc queries (functions, views)
no unexpected side effects (most narrow scope)
...and disadvantages:
You cannot use the CTE's result in different statements
You cannot use indexes, statistics to optimize your CTE's set (although it will implicitly use existing indexes and statistics of the targeted objects - if appropriate).
In terms of performance a persisted set (declared or created table) can be (much!) better in some cases, but it forces you into procedural code. You will have to race your horses to find out which is better...
Example: Various approaches to do the same
The following simple (rather useless) example describes a set of user tables together with their columns. I use various different approaches to tell SQL-Server what I want:
Try this with "include actual execution plan"
USE master; --in my case the master database has just 5 "user tables", you can use any other DB of course
GO
--simple join, first the small set joining to the large set
SELECT o.name AS TableName
,c.name AS ColumnName
FROM sys.objects o
INNER JOIN sys.columns c ON c.object_id=o.object_id
WHERE o.type='U';
GO
--simple join "the other way round" with the filter as part of the ON-clause
SELECT o.name AS TableName
,c.name AS ColumnName
FROM sys.columns c
INNER JOIN sys.objects o ON c.object_id=o.object_id AND o.type='U';
GO
--join from the large set with a sub-query to the small set
SELECT o.name AS TableName
,c.name AS ColumnName
FROM sys.columns c
INNER JOIN (
SELECT o.*
FROM sys.objects o
WHERE o.type='U' --user tables
) o ON c.object_id=o.object_id;
GO
--join for large to small with a row-wise APPLY
SELECT o.name AS TableName
,c.name AS ColumnName
FROM sys.columns c
CROSS APPLY (
SELECT o.*
FROM sys.objects o
WHERE o.type='U' --user tables
AND o.object_id=c.object_id
) o;
GO
--use a CTE to "pre-filter" the small set
WITH cte AS
(
SELECT o.*
FROM sys.objects o
WHERE o.type='U' --user tables
)
SELECT cte.name AS TableName
,c.name AS ColumnName
FROM sys.columns c
INNER JOIN cte ON c.object_id=cte.object_id;
GO
Now look at the result and at the execution plans:
All queries return the same result.
All queries produce the same execution plan
Important hint: This might differ on your machine!
Why is this?
T-SQL is a declarative language. Your statement is a description of WHAT you want to retrieve. It is not your job to tell the engine HOW this is done.
SQL-Server's extremely smart engine will find the best way to get the set you asked for. In the case above all result descriptions point to the same goal. The engine can deduce this from various statements and finds the same plan for all of them.
Well, is it just a matter of taste?
In a way...
There are some important things to keep in mind:
There is no reason for the engine to compute the CTE's result before the rest (although the statement might look so). Therefore it is wrong to describe a CTE as something like a temp table...
In other words: The visible order of your statement does not predict the actual order of execution!
The smart engine will reach its limits with complexity and nest level. Imagine various VIEWs, all using CTEs and calling each-other...
There are cases where the engine really gets it wrong. I remember a case where a CTE did not do much more than a TRY_CAST. The idea was to ensure valid values in the query below. But the engine thought "Oh, just a CAST, not expensive!" and moved the actual CAST to a higher position in the execution plan. I remember another case where the engine performed an expensive operation against millions of rows (unnecessarily, since the final result was filtered to a tiny set), just because the actual order of execution was not as expected.
Okay... So when should I use a CTE?
The following points are good reasons to use a CTE:
A CTE can help you to avoid repeated sub queries.
A CTE can be used multiple times within your statement, e.g. within a JOIN with a dynamic behavior depending on the actual row-count.
You can use multiple CTEs within one statement and you can use the result of one CTE within a later CTE.
There are recursive (or better iterative) CTEs.
Sometimes I use single-row CTEs to define / pre-compute variables later used in the query, things you would do with declared variables in procedural T-SQL. You can use a CROSS JOIN to get them into your query easily.
and also very nice: the updatable CTE allows for very easy-to-read statements, same applies for DELETE.
As above: Nothing one could not do without the CTE, but it is far better to read (I really like speaking names).
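Two of the points above can be sketched briefly: an updatable CTE used to delete duplicate rows, and a single-row CTE acting as a "variable" via CROSS JOIN. The table dbo.Pairs(ThisID, ThatID) and the CTE names are hypothetical; this is one common way to write it, not the only one.

```sql
-- Hypothetical table: dbo.Pairs(ThisID int, ThatID int) containing duplicate rows

-- Updatable CTE: number the rows within each duplicate group,
-- then delete everything except the first row of each group
WITH numbered AS (
    SELECT ThisID,
           ThatID,
           ROW_NUMBER() OVER (PARTITION BY ThisID, ThatID
                              ORDER BY (SELECT NULL)) AS rn
    FROM dbo.Pairs
)
DELETE FROM numbered
WHERE rn > 1;

-- Single-row CTE used like a declared variable
WITH params AS (
    SELECT CAST('U' AS char(2)) AS ObjectType   -- 'U' = user table
)
SELECT o.name
FROM sys.objects AS o
CROSS JOIN params AS p
WHERE o.type = p.ObjectType;
```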
Final hints
Well, there are cases, where ugly code performs better :-)
It is always good to have clean and readable code. A CTE will help you with this. So give it a try. If the performance is bad, get into depth, look at the execution plans and try to find a reason where the engine might decide wrong.
In most cases it is a bad idea to try to outsmart the engine with hints such as FORCE ORDER (but it can help).
UPDATE
I was asked to point to advantages and disadvantages specifically:
Uhm, technically there are no real advantages or disadvantages. Disregarding recursive CTEs there's nothing one couldn't solve without a CTE.
Advantages
The main advantage is readability and maintainability.
Sometimes a CTE can save hundreds of lines of code. Instead of repeating a huge sub-query one can use just a name as a variable. Corrections to the sub-query need to be made in just one place.
The CTE can serve in ad-hoc queries and make your life easier.
Disadvantages
One possible disadvantage is that it's very easy, even for experienced developers, to mistake a CTE for a temp table, assume that the visible order of steps will be the same as the actual order of execution, and stumble into unexpected results or even errors.
And - of course :-) - the strange wrong syntax error you'll see when you write a CTE after another statement without a separating ;. That's why many people tend to use ;WITH.
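A small sketch of the semicolon issue, with a made-up variable @x. The failing form is left commented out so the batch runs as written.

```sql
-- Fails to parse, because the statement before WITH is not terminated:
-- DECLARE @x int = 1
-- WITH cte AS (SELECT @x AS v)
-- SELECT v FROM cte;

-- Works: the previous statement ends with a semicolon
DECLARE @x int = 1;
WITH cte AS (SELECT @x AS v)
SELECT v FROM cte;
```

Writing ;WITH has the same effect when you cannot be sure the preceding statement was terminated.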
Probably not. CTE's are especially good at querying data for tree structures.
The information and quotes are from the following article on mssqltips.com "Choose Between SQL Server Subquery T-SQL Code" by Eric Blinn. https://www.mssqltips.com/sqlservertip/6618/sql-server-query-performance-cte-view-subquery-temp-table-table-variable/
SQL Server 2019 CTEs, subqueries, and views
The SQL Server [2019] engine optimizes every query that is given to it. When
it encounters a CTE, traditional subquery, or view, it sees them all
the same way and optimizes them the same way. This involves looking
at the underlying tables, considering their statistics, and choosing
the best way to proceed. In most cases they will return the same plan
and therefore perform exactly the same.
TempDB table
For the query that inserted rows into the temporary table, the
optimizer looked at the table statistics and chose the best way
forward. It actually made new table statistics for the temporary
table and then used them to run the second. This brings about very
similar performance.
Table variable
The table variable has poor performance in the example given in the article due to lack of table statistics.
...the table variable does not have any table statistics generated for
it like the TempDB table did. This means the optimizer has to make a
wild guess as to how to proceed. In this example it made a very, very
poor decision.
This is not to write off table variables. They surely have their
place as will be discussed later in the tip.
Temp table vs Table variable
A temporary table will be stored on disk and have statistics
calculated on it and a table variable will not. Because of this
difference temporary tables are best when the expected row count
is >100 and the table variable for smaller expected row counts where the lack of statistics will be less likely to lead to a bad query plan.
Advantages of CTE
CTE can be termed as 'Temporary View' used as a good alternative for a View in some cases.
The main advantage over a view is usage of memory. As CTE's scope is limited only to its batch, the memory allocated for it is flushed as soon as its batch is crossed. But once a view is created, it is stored until user drops it. If the view is not used after creation then it's a mere waste of memory.
CPU cost for CTE execution is lesser when compared to that of View.
Unlike a view, a CTE doesn't store any metadata of its definition, and it provides better readability.
A CTE can be referred for multiple times in a query.
As the scope is limited to the batch, multiple CTEs can have the same name which a view cannot have.
It can be made recursive.
Disadvantages of CTE
Though using CTE is advantageous, it does have some limitations to be kept in mind,
A CTE is a substitute for a view, but a CTE definition cannot contain another WITH clause (although a later CTE can reference an earlier one), whereas views can be nested.
A view, once declared, can be used any number of times, but a CTE cannot: it must be declared again in every statement that uses it. For that scenario a CTE is not recommended, as it is tiring to repeat the declaration again and again.
Between the anchor members there should be operators like UNION, UNION ALL or EXCEPT etc.
In Recursive CTEs, you can define many Anchor Members and Recursive Members, but all the Anchor Members must be defined before the first Recursive Member. You cannot define an Anchor Member between two Recursive Members.
The number of columns, the data types used in Anchor and Recursive Members should be same.
In the Recursive Member, the following are not allowed: TOP, SELECT DISTINCT, GROUP BY, HAVING, scalar aggregation, sub-queries, and Left Outer, Right Outer or Full Outer joins. Regarding joins, only Inner Join is allowed in the Recursive Member.
The recursion limit defaults to 100 and can be raised up to 32767 with the MAXRECURSION query hint (0 removes the limit). Exceeding the limit stops the statement with an error; it does not crash the server.
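A minimal recursive CTE showing the anchor member, the recursive member, and the MAXRECURSION hint. The CTE name numbers is illustrative. Without the hint this query would stop with an error, because it needs more than the default 100 levels of recursion.

```sql
WITH numbers AS (
    SELECT 1 AS n              -- anchor member
    UNION ALL
    SELECT n + 1               -- recursive member (references the CTE itself)
    FROM numbers
    WHERE n < 500
)
SELECT n
FROM numbers
OPTION (MAXRECURSION 1000);    -- raise the limit from the default of 100
```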