Why are recursive CTEs unable to use grouping and other clauses? - sql

I recently learned about Recursive Common Table Expressions (CTEs) while looking for a way to build a certain view of some data. After taking a while to write out how the first iteration of my query would work, I turned it into a CTE to watch the whole thing play out. I was surprised to see that grouping didn't work, so I just replaced it with a "Select TOP 1, ORDER BY" equivalent. I was again surprised that "TOP" wasn't allowed, and came to find that all of these clauses aren't allowed in the recursive part of a CTE:
DISTINCT
GROUP BY
HAVING
TOP
LEFT
RIGHT
OUTER JOIN
So I suppose I have 2 questions:
In order to better understand my situation and SQL, why aren't these clauses allowed?
If I need to do some sort of recursion using some of these clauses, is my only alternative to write a recursive stored procedure?
Thanks.

Referring to :-
In order to better understand my situation and SQL, why aren't these clauses allowed?
Based on my understanding of CTE's, the whole idea behind creating a CTE is so that you can create a temporary result-set and use it in a named manner like a regular table in SELECT, INSERT, UPDATE, or DELETE statements.
Because a CTE is logically very much like a view, the SELECT statement in your CTE query must follow the same requirements as those used for creating a view. Refer **CTE query definitions* section in following link on MSDN
Also, because a CTE is basically a named resultset and it can be used like any table in SELECT, INSERT, UPDATE, or DELETE statements, you always have the option of using the various operators you mentioned when you use the CTE in any of those statements.
With regarding to
2.If I need to do some sort of recursion using some of these clauses, is my only alternative to write a recursive stored procedure?
Also, I am pretty sure that you can use at least some of the keywords that you have mentioned above in a view / CTE select statement. For Example: Refer to the use of GROUP BY in a CTE here in the Creating a simple common table expression example
Maybe, if you can provide a sample scenario of what you are trying to achieve, we can suggest some possible solution.

Related

Does the number of columns used for a CTE affects the performance of the query?

Using more columns within a CTE query affects the performance? I am currently trying to execute a query with the WITH sentence, and it seams that if I use more colum,s, it takes more time to load the data. Am I correct?
The number of columns defined in a CTE should have no effect on the actual performance of the query (it might affect the compile-time, which is generally miniscule).
Why? Because SQL Server "embeds" the code for the CTE in the query itself and then optimizes all the code together. Unused columns should be eliminated.
This might be an over generalization. There might be some cases where SQL Server doesn't eliminate the work for columns -- such as extra aggregation functions in an aggregation query or certain subqueries. But, in general, what is important is how the CTE is used, not how many columns are defined in it.
You can think of CTE as a View but it doesnt materialize to Disk.So A view expands it definition at run time ,same goes for CTE.

What are the pros/cons of using SQL variables versus subqueries?

I'm wondering there is a difference between SQL variables and subqueries. Whether one uses more processing power, or one is quicker, or even if one merely is more readable.
For (a very basic) example, I like to use variables to hold polygon and transformations in PostGIS:
WITH region_polygon AS (
SELECT ST_Transform(wkb_geometry, %(fishnet_srid)d) geom
FROM regions
LIMIT 1
), raster_pixels AS (
SELECT (ST_PixelAsPolygons(rast)).*
FROM test_regions_raster
LIMIT 1
)
SELECT x, y
FROM raster_pixels a, region_polygon b
WHERE ST_Within(a.geom, b.geom)
But would it be better in any way to use subqueries?
SELECT x, y
FROM (
SELECT ST_Transform(wkb_geometry, %(fishnet_srid)d) geom
FROM regions
LIMIT 1
) a, (
SELECT (ST_PixelAsPolygons(rast)).*
FROM test_regions_raster
LIMIT 1
) b
WHERE ST_Within(a.geom, b.geom)
Note that I'm using PostgreSQL.
There's an important syntactic advantage of common table expressions over derived tables when it comes to reuse. Consider the following, equivalent examples using self-joins:
Using common table expressions
WITH a(v) AS (SELECT 1 UNION SELECT 2)
SELECT *
FROM a AS x, a AS y
Using derived tables
SELECT *
FROM (SELECT 1 UNION SELECT 2) x(v),
(SELECT 1 UNION SELECT 2) y(v)
As you can see, using common table expressions, the view (SELECT 1 UNION SELECT 2) can be reused multiple times in your query. With derived tables, you will have to repeat your view declaration. In my example, this is still OK. In your own example, this starts getting a bit more hairy.
It's all about scope
Views in SQL are all about scoping. There are essentially four levels of declaring views:
As derived tables. They can be consumed exactly once.
As common table expressions. They can be consumed several times, but only in one query.
As views. They can be consumed several times in several queries.
As materialized views. Same as views, but the data is pre-calculated.
Some databases (in particular PostgreSQL) also know table-valued functions. From a mere syntax perspective, they're just like views - parameterised views.
Performance
Note that these thoughts only focus on syntax, not query planning. The different approaches may have very different performance implications, depending on the database vendor.
Those aren't variables, they're common table expressions (cte). In your query above, the execution plans are likely identical, because the optimizer should recognize they are equivalent queries. I prefer to use cte's because I think they're easier to read than subqueries, but that's it.
Edit: Upon further reading it looks like PostgreSQL does treat common table expressions differently than other databases, you can't update a cte in PostgreSQL, for instance. I'll leave my answer here because I believe for your query there won't be a difference, but I'm not terribly familiar with PostgreSQL.
As pointed out this construct is called Common Table Expression, not a variable.
I prefer to use CTE, rather than subquery, because it is way easier to read and write for me, especially when you have several nested CTEs.
You can write CTE once and refer to it several times in the rest of the query. With subquery you'll have to repeat the code several times.
Important difference of PostgreSQL from other databases (at least from MS SQL Server) is that PostgreSQL evaluates each CTE only once.
A useful property of WITH queries is that they are evaluated only once
per execution of the parent query, even if they are referred to more
than once by the parent query or sibling WITH queries. Thus, expensive
calculations that are needed in multiple places can be placed within a
WITH query to avoid redundant work. Another possible application is to
prevent unwanted multiple evaluations of functions with side-effects.
However, the other side of this coin is that the optimizer is less
able to push restrictions from the parent query down into a WITH query
than an ordinary sub-query. The WITH query will generally be evaluated
as written, without suppression of rows that the parent query might
discard afterwards. (But, as mentioned above, evaluation might stop
early if the reference(s) to the query demand only a limited number of
rows.)
MS SQL Server would inline each reference of CTE into the main query and optimize the whole result, but PostgreSQL doesn't. In some sense PostgreSQL is more flexible here. If you want the subquery to be evaluated only once, put it in CTE. If you don't want, put it in subquery and repeat the code. In SQL Server you'd have to use temporary table explicitly.
Your example in the question is too simple and most likely both variants are equivalent - check the execution plan.
Official docs mention it, as I quoted above, but Nick Barnes gave a link to a good article explaining it in more details and I thought it is worth putting it in an answer, rather that comment.
When optimising queries in PostgreSQL (true at least in 9.4 and
older), it’s worth keeping in mind that – unlike newer versions of
various other databases – PostgreSQL will always materialise a CTE
term in a query.
This can have quite surprising effects for those used to working with
DBs like MS SQL:
A query that should touch a small amount of data instead reads a whole
table and possibly spills it to a tempfile;
and You cannot UPDATE or
DELETE FROM a CTE term, because it’s more like a read-only temp table
rather than a dynamic view.
So, there is no definite answer whether CTE is better than subquery in PostgreSQL. In some cases it can be faster, in some cases it can be slower. But, IMHO, in most cases CTE is easier to write, read and maintain.
And, obviously, there is a case when you have no other option, but to use so-called recursive CTE (recursive queries are typically used to deal with hierarchical or tree-structured data).

Using Subquery Table in FROM on another subquery

Basically I want to do something like this.
SELECT *
FROM TABLE, (SELECT * FROM TABLE2) SUBQ
WHERE TABLE.SOMETHING IN (SELECT DISTINCT COL FROM SUBQ)
I want to know if it's even possible to call that subquery table in my FROM on another subquery, and if it is, how to do it.
This is a simplified example (my queries are too long), so I'm not looking into another way of rewriting it that doesn't use the subquery in the FROM.
What you're looking for is called common table expression, it uses the WITH SQL keyword.
Of course if this can be refactored to use a temporary table or an (indexed) view it would be much better / more performant and it would serve multiple execution streams (if they all need access to the same data).
If different execution streams do not need to have access to the same data then creating many many temporary tables or views (while also trying to avoid name collisions) is less optimal and the CTE is more useful.

Do CTEs improve performance?

with ini as
(
select ...
)
select ini.a
join ini.b
join ini.c
How many times does the SQL Server engine calculate the results from the ini table ?
My question which I'm trying to answer (with your help) is if the with statement (CTE) improves performance by aliasing the results.
The CTE ini is simply a macro that expands and this use is syntax/clarity only.
MSDN says:
Using a CTE offers the advantages of improved readability and ease in maintenance of complex queries
Nothing about performance.
It is evaluated per mention: so three times here which you can see from an execution plan.
For recursive CTEs it's somewhat different as the CTE builds upon itself but it will still be evaluated once per mention
A CTE (common table expression, the part that is wrapped in the "with") is essentially a 1-time view. If you think of it in terms of a temporary view, perhaps the answer will become more clear. As far as I know, the interpreter will simply do the equivalent of copy/pasting whatever is within the CTE into the main query wherever it finds the reference.
I'm sure there are outside instances where it appears to help, but more often than not, I'd assume that the mere presence of a CTE itself is not going to improve the performance of a query. It'll help with readability and re-usability within that single select statement (i.e., you won't have to re-type the same sub-query multiple times), but I don't believe it will magically make things run faster (all things being equal). Of course, if your query is structured differently within the CTE than you would have done w/ sub-queries, then it's quite possible the CTE runs faster at that point, but you're now comparing apples to oranges.
I suppose it would also depend on whther you were using it to replace a derived table or a correlated subquery. Performance would be about the same in the first case and probably significantly better in the second if you joined to the CTE rather than just replaced the suquery code with a reference to the CTE. If you used it to replace a where NOT EXISTS clause with a left join to a CTE (in order to find the records in one table but not the other), I'd expect performance to be worse as Where Exists is usually the fastets way to do that type of task. I guess what I'm saying is that performance will still depend on how you use the CTE not just the fact that you generated one.

SQL Code Smells

Could you please list some of the bad practices in SQL, that novice people do?
I have found the use of "WHILE loop" in scenarios which could be resolved using set operations.
Another example is inserting data only if it does not exist. This can be achieved using LEFT OUTER JOIN. Some people go for "IF"
Any other thoughts?
Edit: What I am looking for is specific scenarios (as mentioned in the question) that could be achieved using SQL without using procedural constructs
Thanks
Lijo
Here are some I have seen:
Using cursors instead of equivalent (and faster) set operations (joins etc).
Dynamic SQL for everything.
Code that is open to SQL Injection attacks.
Full outer joins even when they are not needed.
Huge stored procedures (hundreds/thousands of lines).
No comments.
Placing ODBC or dynamic SQL calls all over the code.
Often it is better to define a data abstraction layer that provides access
to the databases. All the SQL code can hide in that layer.
This often avoids replication of similar queries, and makes changing
data models easier to do.
Personally for me: anything that is not a plain INSERT, UPDATE, DELETE or SELECT statement
I don't like logic in SQL.
My biggest beef here is definitely repetitive SQL. As an example, multiple stored procedures that perform the exact same joins but different filters.
Using Views in such cases can make your database MUCH easier to look at and work with
Creating vendor-specific SQL, when generic SQL would do.
Creating tables dynamically at runtime (other than TEMPORARY tables).
Letting your application code have table create or super user privs.
The question asking for a list of SQL smells, no answer can be
exhaustive. I will be expanding my answer as time permits and memory
serves:
Redundant grouping
Redundant grouping is the application of the GROUP BY statement—and
consequently of aggregate functions—to more columns than required. It
occurs when the author starts by collecting most of or all the data that
he needs, to group it at the very end. Redundant grouping is,
therefore, late grouping, for the correct approach is to group early,
and only the data that needs grouping.
If a main entity (main) have a journal (jrnl) and refer to another
enity appendage (apnd), then the following query:
SELECT
main . Id ,
main . Name ,
MAX(jrnl.Entry) AS Entry ,
MAX(jrnl.Date ) AS Date ,
apnd . Reference,
apnd . Status
FROM main
JOIN jrnl ON jrnl.Parent = main.Id
JOIN apnd ON apnd.Id = main.ApndId
GROUP BY main.Id, main.Name, apnd.Reference, apnd.Status
has redundant grouping, because the sole purpose of the GROUP BY
clause is to obtain the latest journal entry. It should be rewritten in
a non-redundant mannger as follows:
SELECT
main.Id ,
main.Name ,
skel.Entry ,
jrnl.Date ,
apnd.Reference,
apnd.Status
FROM
( SELECT
jrnl.Parent AS MainId,
MAX(jrnl.Entry) AS MaxEntry
FROM main
GROUP BY jrnl.Parent
) skel -- eton
JOIN main ON main.Id = skel.MainId
JOIN jrnl ON jrnl.Entry = skel.MaxEntry
JOIN apnd ON apnd.Id = main.ApndId
That is—we group on the narrowest dataset possible, and join the rest
afterwards, even if it means referencing the same tables!