Do subqueries preserve indexes - sql

I would like to know if the value returned from subqueries in Oracle loses indexes or not.
select * from emp where empid = 1
-- empid is indexed
select t1.* from (select t2.* from emp t2) t1 where t1.empid = 1
-- t1.empid is still indexed?

Yes, the second query uses the index. In fact, both compile down to the exact same execution plan. Check it on SQLFiddle.
You should keep in mind that SQL processing is lazy: sub-queries are not necessarily fully executed to get input data to the top-level query. Instead, they should be regarded as code that can be invoked, as needed, to get that data.
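A rough way to check this kind of claim yourself is sketched below. It uses Python's built-in sqlite3 module rather than Oracle, so it only illustrates the idea and says nothing about Oracle's actual plans; the index name emp_empid_idx and the sample rows are made up for the example.

```python
import sqlite3

# Assumption: SQLite stands in for Oracle here; plans are SQLite's, not Oracle's.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (empid INTEGER, name TEXT);
    CREATE INDEX emp_empid_idx ON emp (empid);  -- hypothetical index name
    INSERT INTO emp VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
""")

direct = "SELECT * FROM emp WHERE empid = 1"
nested = "SELECT t1.* FROM (SELECT t2.* FROM emp t2) t1 WHERE t1.empid = 1"

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail text.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

direct_plan = plan(direct)
nested_plan = plan(nested)
direct_rows = conn.execute(direct).fetchall()
nested_rows = conn.execute(nested).fetchall()

print(direct_plan)
print(nested_plan)
```

If the index name shows up in both plans, the inline view did not hide the index from the optimizer. On Oracle the equivalent check would be to compare the execution plans of the two statements.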

Part of the query optimisation process is to simplify the structure of the supplied query. This generally can mean replacing IN and EXISTS with joins, pushing predicates into inline views and eliminating subqueries entirely.
In fact, because this behaviour is so prevalent, there are a few techniques (optimiser hints, for example, or common table expressions, or certain logically redundant clauses) specifically designed to prevent predicate pushing, subquery merging and other query transformations where they are unintentionally disadvantageous.
By default you should expect that subqueries and inline views will be merged into the parent query where logically possible, and as others have mentioned this is almost certainly the case in your example.
It follows from all of this, of course, that using subqueries or inline views generally doesn't impair the optimiser's ability to use indexes or query rewrite or various other performance enhancing techniques.

There is no simple answer to this question: sometimes yes and sometimes no. It depends on many factors. In general, without going deeply into theory, try to avoid this style of query writing to be on the safe side; in other words, views are bad in most cases. Please refer to http://www.orafaq.com/tuningguide/push%20predicates.html for more details.

Related

SQL Execution Order: does it exist or not?

I am really confused about the execution order in SQL. Basically, given any query (assume it's a complex one that has multiple JOINS, WHERE clauses, etc), is the query executed sequentially or not?
From the top answer at Order Of Execution of the SQL query, it seems like "SQL has no order of execution. ... The optimizer is free to choose any order it feels appropriate to produce the best execution time."
From the top answer at What's the execute order of the different parts of a SQL select statement?, in contrast, we see a clear execution order in the form
"
FROM
ON
OUTER
WHERE
...
"
I feel like I am missing something, but it seems as though the two posts are contradicting each other, and different articles online seem to support either one or the other.
But more fundamentally, what I wanted to know initially is this: Suppose we have a complex SQL query with multiple joins, INNER JOINs and LEFT JOINS, in a specific order. Is there going to be an order to the query, such that a later JOIN will apply to the result of an earlier join rather than to the initial table specified in the top FROM clause?
It is tricky. The short answer is: the DBMS will decide what order is best such that it produces the result that you have declared (remember, SQL is declarative, it does not prescribe how the query is to be computed).
But we can think of a "conceptual" order of execution that the DBMS will use to create the result. This conceptual order might be totally ignored by the DBMS, but if we (humans) follow it, we will get the same results as the DBMS. I see this as one of the benefits of a DBMS. Even if we suck and write an inefficient query, the DBMS might say, "no, no, this query you gave me sucks in terms of performance, I know how to do better" and most of the time, the DBMS is right. Sometimes it is not, and rewriting a query helps the DBMS find the "best" approach. This is very dependent on the DBMS, of course...
This conceptual order helps us (humans) understand how the DBMS executes a query. The steps are listed below.
First the order for non-aggregation:
Do the FROM section. Includes any joins, cross products, subqueries.
Do the WHERE clause (remove tuples, this is called selection)
Do the SELECT portion (report results, this is called projection).
If you use an aggregation function, without a group by then:
Do the FROM section. Includes any joins, subqueries.
Do the WHERE clause (remove tuples, this is called selection)
Do the aggregation function in the SELECT portion (converting all tuples of the result into one tuple). There is an implicit group by in this query.
If you use a group by:
Do the FROM section. Includes any joins, cross products, subqueries.
Do the WHERE clause (remove tuples, this is called selection)
Cluster subsets of the tuples according to the GROUP BY.
For each cluster of these tuples:
if there is a HAVING, do this predicate (similar to selection of the WHERE). Note that you can have access to aggregation functions.
For each cluster of these tuples output exactly one tuple such that:
Do the SELECT part of the query (similar to select in above aggregation, i.e. you can use aggregation functions).
Window functions happen during the SELECT stage (they take into consideration the set of tuples that would be output by the select at that stage).
There is one more kink:
if you have
select distinct ...
then, after everything else is done, remove duplicated tuples from the results (i.e. return a set of tuples, not a list).
Finally, do the ORDER BY. The ORDER BY happens in all cases at the end, once the SELECT part has been done.
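The conceptual order above can be replayed by hand and compared with what a real engine returns. This is only a sketch: the table t, its columns and the sample rows are invented for illustration, and SQLite (via Python's sqlite3) stands in for whatever DBMS you use.

```python
import sqlite3
from collections import defaultdict

# Made-up sample data: (grp, val)
rows = [("a", 5), ("a", 7), ("b", 1), ("b", 2), ("c", 9), ("c", 0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (grp TEXT, val INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

sql_result = conn.execute("""
    SELECT grp, SUM(val) AS total
    FROM t
    WHERE val > 0
    GROUP BY grp
    HAVING SUM(val) > 3
    ORDER BY total DESC
""").fetchall()

# 1. FROM: start with every tuple of the table.
step = list(rows)
# 2. WHERE: selection, remove tuples that fail the predicate.
step = [(g, v) for g, v in step if v > 0]
# 3. GROUP BY: cluster the remaining tuples.
clusters = defaultdict(list)
for g, v in step:
    clusters[g].append(v)
# 4. HAVING: keep only clusters that satisfy the predicate.
kept = {g: vs for g, vs in clusters.items() if sum(vs) > 3}
# 5. SELECT: output exactly one tuple per surviving cluster.
projected = [(g, sum(vs)) for g, vs in kept.items()]
# 6. ORDER BY: sort at the very end.
manual_result = sorted(projected, key=lambda r: r[1], reverse=True)

print(sql_result)
print(manual_result)
```

The engine is free to get there by a different route; the point is only that following the conceptual steps yields the same rows.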
With respect to JOINS. As I mentioned above, they happen at the "FROM" part of the conceptual execution. The WHERE, GROUP BY, SELECT apply on the results of these operations. So you can think of these being the first phase of the execution of the query. If it contains a subquery, the process is recursive.
By the way, you can refer in an inner query to a relation in the outside context of the inner query, but not the other way around.
All of this is conceptual. In reality the DBMS might rewrite your query for the purpose of efficiency.
For example, assume R(a,b) and S(a,c), where S(a) is a foreign key that references R(a).
The query:
select b from R JOIN S using (a) where a > 10
can be rewritten by the DBMS to something similar to this:
select b FROM R JOIN (select a from s where a > 10) as T using (a);
or:
select b FROM (select * from R where a > 10) as T JOIN S using (a);
In fact, the DBMS does this all the time. It takes your query and creates alternative queries, then estimates the execution time of each one and decides which is most likely to be the fastest. And then it executes it.
This is a fundamental process of query evaluation. Note that the 3 queries are identical in terms of results. But depending on the sizes of the relations, they might have very different execution times. For example, if R and S are huge, but very few tuples have a > 10, then joining first wastes time. Each query with a subselect might perform fast if that subselect matches very few tuples, but badly if it matches a lot of tuples. This is the type of "magic" that happens inside the query evaluation engine of the DBMS.
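To see that the three formulations really return the same rows, here is a small sketch using Python's sqlite3. The sample data is invented, and SQLite is just a convenient engine to run it on; it does not show which form any particular optimizer would pick.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R (a INTEGER PRIMARY KEY, b TEXT);
    CREATE TABLE S (a INTEGER REFERENCES R(a), c TEXT);
    INSERT INTO R VALUES (5, 'five'), (11, 'eleven'), (20, 'twenty');
    INSERT INTO S VALUES (5, 'x'), (11, 'y'), (11, 'z'), (20, 'w');
""")

queries = [
    # Filter after the join.
    "SELECT b FROM R JOIN S USING (a) WHERE a > 10",
    # Filter S before the join.
    "SELECT b FROM R JOIN (SELECT a FROM S WHERE a > 10) AS T USING (a)",
    # Filter R before the join.
    "SELECT b FROM (SELECT * FROM R WHERE a > 10) AS T JOIN S USING (a)",
]

# Row order is not guaranteed without ORDER BY, so compare sorted results.
results = [sorted(conn.execute(q).fetchall()) for q in queries]
for r in results:
    print(r)
```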
You are confusing Order of execution with Logical Query Processing.
I did a quick google search and found a bunch of articles referring to Logical Query Processing as "order of execution". Let's clear this up.
Logical Query Processing
Logical Query Processing details the under-the-hood processing phases of a SQL Query... First the FROM clause is evaluated for the optimizer to know where to get data from, then table operators, etc.
Understanding this will help you better design and tune queries. Logical query processing order will help you understand why you can reference a column by its alias in an ORDER BY clause but not anywhere else.
Order of Execution
Consider this WHERE clause:
WHERE t1.Col1 = 'X'
AND t2.Col2 = 1
AND t3.Col3 > t2.Col4
The optimizer is not required to evaluate these predicates in any order; it can evaluate t2.Col2 = 1 first, then t1.Col1 = 'X'.... In some cases the optimizer can evaluate joins in a different order than you have presented in your query. When predicate logic dictates that the result will be the same, it is free to make (what it considers) the best choices for optimal performance.
Sadly there is not a lot about this topic out there. I do discuss this a little more here.
First there's the SQL query and the rules of SQL that apply to it. That's what in the other answers is referred to as "Logical query processing". With SQL you specify a result. The SQL standard does not allow you to specify how this result is reached.
Then there's the query optimizer. Based on statistics, heuristics, amount of available CPU, memory and other factors, it will determine the execution plan. It will evaluate how long the execution is expected to take. It will evaluate different execution plans to find the one that executes fastest. In that process, it can evaluate execution plans that use different indexes, and/or rearranges the join order, and/or leave out (outer) joins, etc. The optimizer has many tricks. The more expensive the best execution plan is expected to be, the more (advanced) execution plans will be evaluated. The end result is one (serial) execution plan and potentially a parallel execution plan.
All the evaluated execution plans will guarantee the correct result; the result that matches execution according to the "Logical query processing".
Finally, there's the SQL Server engine. After picking either the serial or parallel execution plan, it will execute it.
The other answers, whilst containing useful and interesting information, risk causing confusion in my view.
They all seem to introduce the notion of a "logical" order of execution, which differs from the actual order of execution, as if this is something special about SQL.
If someone asked about the order of execution of any ordinary language besides SQL, the answer would be "strictly sequential" or (for expressions) "in accordance with the rules of that language". I feel as though we wouldn't be having a long-winded exploration about how the compiler has total freedom to rearrange and rework any algorithm that the programmer writes, and distinguishing this from the merely "logical" representation in the source code.
Ultimately, SQL has a defined order of evaluation. It is the "logical" order referred to in other answers. What is most confusing to novices is that this order does not correspond with the syntactic order of the clauses in an SQL statement.
That is, a simple SELECT...FROM...WHERE...ORDER BY query would actually be evaluated by taking the table referred to in the from-clause, filtering rows according to the where-clause, then manipulating the columns (including filtering, renaming, or generating columns) according to the select-clause, and finally ordering the rows according to the order-by-clause. So clauses here are evaluated second, third, first, fourth, which is a disorderly pattern to any sensible programmer - the designers of SQL preferred to make it correspond more in their view to the structure of something spoken in ordinary English ("tell me the surnames from the register!").
Nevertheless, when the programmer writes SQL, they are specifying the canonical method by which the results are produced, same as if they write source code in any other language.
The query simplification and optimisation that database engines perform (like that which ordinary compilers perform) would be a completely separate topic of discussion, if it hadn't already been conflated. The essence of the situation on this front, is that the database engine can do whatever it damn well likes with the SQL you submit, provided that the data it returns to you is the same as if it had followed the evaluation order defined in SQL.
For example, it could sort the results first, and then filter them, despite this order of operations being clearly different to the order in which the relevant clauses are evaluated in SQL. It can do this because if you (say) have a deck of cards in random order, and go through the deck and throw away all the aces, and then sort the deck into standard order, the outcome (in terms of the final content and order of the deck) is no different than if you sort the deck into standard order first, and then go through and throw away all the aces. But the full details and rationale of this behaviour would be for a separate question entirely.
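The card-deck argument is easy to check mechanically. The sketch below is plain Python with no database involved; the rank and suit ordering chosen as "standard order" is an assumption made for the example.

```python
import random

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["C", "D", "H", "S"]

deck = [(rank, suit) for suit in suits for rank in ranks]
random.Random(0).shuffle(deck)  # fixed seed so the run is repeatable

def standard_order(card):
    # Assumed "standard order": by suit, then by rank.
    rank, suit = card
    return (suits.index(suit), ranks.index(rank))

# Throw away the aces, then sort.
filter_then_sort = sorted((c for c in deck if c[0] != "A"), key=standard_order)
# Sort first, then throw away the aces.
sort_then_filter = [c for c in sorted(deck, key=standard_order) if c[0] != "A"]

print(filter_then_sort == sort_then_filter)
```

Because the two orders of operations cannot be told apart from the outcome, an engine may pick whichever is cheaper.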

Will SQL Server be smart enough to not execute expensive queries if it is not needed ? (short-circuiting)

So SQL Server does not have short-circuiting in the explicit manner as with for example if-statements in general-purpose programming languages.
So consider the following mock-up query:
SELECT * FROM someTable
WHERE name = 'someValue' OR name in (*some extremely expensive nested sub-query only needed to cover 0.01% of cases*)
Let's say there are only 3 rows in the table and all of them match name = 'someValue'. Will the expensive sub-query ever run?
Let's say there are 3 million rows and all but 1 can be fetched with name = 'someValue'; the remaining row needs to be fetched with the sub-query. Will the sub-query ever be evaluated when it is not needed?
If one has a similar real case, one might be ok with letting the 0.01% wait for the expensive sub-query to run before getting the results as long as the results are fetched quickly without the sub-query for the 99.99% of cases.
(I know that my specific example above could be handled explicitly with IF-statements in an SP, as suggested in this related thread:
Sql short circuit OR or conditional exists in where clause
but let's assume that is not an option.)
As the comments point out, the optimizer in SQL Server is pretty smart.
You could attempt the short-circuiting by using case. As the documentation states:
The CASE expression evaluates its conditions sequentially and stops with the first condition whose condition is satisfied.
Note that there are some exceptions involving aggregation. So, you could do:
SELECT t.*
FROM someTable t
WHERE 'true' = (CASE WHEN t.name = 'someValue' THEN 'true'
WHEN t.name in (*some extremely expensive nested sub-query only needed to cover 0.01% of cases*)
THEN 'true'
END)
This type of enforced ordering is generally considered a bad idea. One exception is when one of the paths might involve an error (such as a type conversion error); however, that is generally fixed nowadays with the try_ functions.
In your case, I suspect that replacing the IN with EXISTS and using appropriate indexes might eliminate almost all the performance penalty of the subquery. However, that is a different matter.
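The CASE approach can be observed in a small experiment. Note the assumptions: this runs on SQLite through Python's sqlite3, not on SQL Server, so it only shows how SQLite evaluates CASE; the function expensive is a hypothetical stand-in for the costly sub-query and simply records which rows it was called for.

```python
import sqlite3

calls = []

def expensive(name):
    # Hypothetical stand-in for the expensive sub-query; logs every invocation.
    calls.append(name)
    return 1 if name == "rare" else 0

conn = sqlite3.connect(":memory:")
conn.create_function("expensive", 1, expensive)
conn.execute("CREATE TABLE someTable (name TEXT)")
conn.executemany(
    "INSERT INTO someTable VALUES (?)",
    [("someValue",), ("someValue",), ("someValue",), ("rare",), ("other",)],
)

matched = conn.execute("""
    SELECT t.name
    FROM someTable t
    WHERE 'true' = (CASE WHEN t.name = 'someValue' THEN 'true'
                         WHEN expensive(t.name) THEN 'true'
                    END)
""").fetchall()

print(sorted(matched))
print(sorted(calls))  # rows for which the second branch was actually evaluated
```

In this run the second branch is only reached for rows that fail the cheap test. Whether SQL Server behaves the same way for a given query is something to confirm from its own execution plan.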

What are the pros/cons of using SQL variables versus subqueries?

I'm wondering whether there is a difference between SQL variables and subqueries: whether one uses more processing power, or one is quicker, or even if one is merely more readable.
For (a very basic) example, I like to use variables to hold polygon and transformations in PostGIS:
WITH region_polygon AS (
SELECT ST_Transform(wkb_geometry, %(fishnet_srid)d) geom
FROM regions
LIMIT 1
), raster_pixels AS (
SELECT (ST_PixelAsPolygons(rast)).*
FROM test_regions_raster
LIMIT 1
)
SELECT x, y
FROM raster_pixels a, region_polygon b
WHERE ST_Within(a.geom, b.geom)
But would it be better in any way to use subqueries?
SELECT x, y
FROM (
SELECT ST_Transform(wkb_geometry, %(fishnet_srid)d) geom
FROM regions
LIMIT 1
) a, (
SELECT (ST_PixelAsPolygons(rast)).*
FROM test_regions_raster
LIMIT 1
) b
WHERE ST_Within(a.geom, b.geom)
Note that I'm using PostgreSQL.
There's an important syntactic advantage of common table expressions over derived tables when it comes to reuse. Consider the following, equivalent examples using self-joins:
Using common table expressions
WITH a(v) AS (SELECT 1 UNION SELECT 2)
SELECT *
FROM a AS x, a AS y
Using derived tables
SELECT *
FROM (SELECT 1 UNION SELECT 2) x(v),
(SELECT 1 UNION SELECT 2) y(v)
As you can see, using common table expressions, the view (SELECT 1 UNION SELECT 2) can be reused multiple times in your query. With derived tables, you will have to repeat your view declaration. In my example, this is still OK. In your own example, this starts getting a bit more hairy.
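As a quick sanity check that the two spellings are equivalent, here is a sketch run on SQLite via Python's sqlite3. One adaptation is an assumption of this example: SQLite does not accept a column list after a derived-table alias, so x(v) is written as a column alias inside the subquery instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Common table expression: declared once, referenced twice.
with_cte = conn.execute("""
    WITH a(v) AS (SELECT 1 UNION SELECT 2)
    SELECT x.v, y.v
    FROM a AS x, a AS y
""").fetchall()

# Derived tables: the same declaration has to be repeated.
with_derived = conn.execute("""
    SELECT x.v, y.v
    FROM (SELECT 1 AS v UNION SELECT 2) AS x,
         (SELECT 1 AS v UNION SELECT 2) AS y
""").fetchall()

print(sorted(with_cte))
print(sorted(with_derived))
```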
It's all about scope
Views in SQL are all about scoping. There are essentially four levels of declaring views:
As derived tables. They can be consumed exactly once.
As common table expressions. They can be consumed several times, but only in one query.
As views. They can be consumed several times in several queries.
As materialized views. Same as views, but the data is pre-calculated.
Some databases (in particular PostgreSQL) also know table-valued functions. From a mere syntax perspective, they're just like views - parameterised views.
Performance
Note that these thoughts only focus on syntax, not query planning. The different approaches may have very different performance implications, depending on the database vendor.
Those aren't variables, they're common table expressions (cte). In your query above, the execution plans are likely identical, because the optimizer should recognize they are equivalent queries. I prefer to use cte's because I think they're easier to read than subqueries, but that's it.
Edit: Upon further reading it looks like PostgreSQL does treat common table expressions differently than other databases, you can't update a cte in PostgreSQL, for instance. I'll leave my answer here because I believe for your query there won't be a difference, but I'm not terribly familiar with PostgreSQL.
As pointed out this construct is called Common Table Expression, not a variable.
I prefer to use CTE, rather than subquery, because it is way easier to read and write for me, especially when you have several nested CTEs.
You can write CTE once and refer to it several times in the rest of the query. With subquery you'll have to repeat the code several times.
An important difference between PostgreSQL and other databases (at least MS SQL Server) is that PostgreSQL evaluates each CTE only once.
A useful property of WITH queries is that they are evaluated only once
per execution of the parent query, even if they are referred to more
than once by the parent query or sibling WITH queries. Thus, expensive
calculations that are needed in multiple places can be placed within a
WITH query to avoid redundant work. Another possible application is to
prevent unwanted multiple evaluations of functions with side-effects.
However, the other side of this coin is that the optimizer is less
able to push restrictions from the parent query down into a WITH query
than an ordinary sub-query. The WITH query will generally be evaluated
as written, without suppression of rows that the parent query might
discard afterwards. (But, as mentioned above, evaluation might stop
early if the reference(s) to the query demand only a limited number of
rows.)
MS SQL Server would inline each reference to the CTE into the main query and optimize the whole result, but PostgreSQL doesn't. In some sense PostgreSQL is more flexible here. If you want the subquery to be evaluated only once, put it in a CTE. If you don't, put it in a subquery and repeat the code. In SQL Server you'd have to use a temporary table explicitly.
Your example in the question is too simple and most likely both variants are equivalent - check the execution plan.
Official docs mention it, as I quoted above, but Nick Barnes gave a link to a good article explaining it in more detail, and I thought it was worth putting in an answer rather than a comment.
When optimising queries in PostgreSQL (true at least in 9.4 and
older), it’s worth keeping in mind that – unlike newer versions of
various other databases – PostgreSQL will always materialise a CTE
term in a query.
This can have quite surprising effects for those used to working with
DBs like MS SQL:
A query that should touch a small amount of data instead reads a whole
table and possibly spills it to a tempfile; and
You cannot UPDATE or DELETE FROM a CTE term, because it’s more like a
read-only temp table rather than a dynamic view.
So, there is no definite answer whether CTE is better than subquery in PostgreSQL. In some cases it can be faster, in some cases it can be slower. But, IMHO, in most cases CTE is easier to write, read and maintain.
And, obviously, there is a case when you have no other option, but to use so-called recursive CTE (recursive queries are typically used to deal with hierarchical or tree-structured data).

SQL: IN vs EXISTS

I read that normally you should use EXISTS when the results of the subquery are large, and IN when the subquery results are small.
But it would seem to me that it's also relevant if a subquery has to be re-evaluated for each row, or if it can be evaluated once for the entire query.
Consider the following example of two equivalent queries:
SELECT * FROM t1
WHERE attr IN
(SELECT attr FROM t2
WHERE attr2 = ?);
SELECT * FROM t1
WHERE EXISTS
(SELECT * FROM t2
WHERE t1.attr = t2.attr
AND attr2 = ?);
The former subquery can be evaluated once for the entire query, the latter has to be evaluated for each row.
Assume that the results of the subquery are very large. Which would be the best way to write this?
This is a good question, especially as in Oracle you can convert every EXISTS clause into an IN clause and vice versa, because Oracle's IN clause can deal with tuples (where (a,b,c) in (select x,y,z from ...)), which most other dbms cannot.
And your reasoning is good. Yes, with the IN clause you suggest loading all the subquery's data once instead of looking up the records in a loop. However, this is only partly true, because:
As good as it seems to get all subquery data selected just once, the outer query must loop through the resulting array for every record. This can be quite slow, because it's just an array. If Oracle looks up data in a table instead there are often indexes to help it, so the nested loop with repeated table lookups is eventually faster.
Oracle's optimizer re-writes queries. So it can come to the same execution plan for the two statements or even get to quite unexpected plans. You never know ;-)
Oracle might decide not to loop at all. It may decide on a hash join instead, which works completely differently and is usually very effective.
Having said this, Oracle's optimizer should notice that the two statements are exactly the same actually and should generate the same execution plan. But experience shows that the optimizer sometimes doesn't notice, and quite often the optimizer does better with the EXISTS clause for whatever reason. (Not as much difference as in MySQL, but still, EXISTS seems preferable over IN in Oracle, too.)
So as to your question "Assume that the results of the subquery are very large. Which would be the best way to write this?", it is unlikely for the IN clause to be faster than the EXISTS clause.
I often like the IN clause better for its simplicity and mostly find it a bit more readable. But when it comes to performance, it is sometimes better to use EXISTS (or even outer joins for that matter).
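For reference, the two forms from the question can be run side by side to confirm that they select the same rows. This sketch uses SQLite through Python's sqlite3 with invented sample data; it demonstrates the equivalence of the results only and says nothing about which form Oracle executes faster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (attr INTEGER);
    CREATE TABLE t2 (attr INTEGER, attr2 TEXT);
    INSERT INTO t1 VALUES (1), (2), (3), (4);
    INSERT INTO t2 VALUES (1, 'k'), (1, 'k'), (3, 'k'), (4, 'other');
""")

in_rows = conn.execute("""
    SELECT * FROM t1
    WHERE attr IN
      (SELECT attr FROM t2
       WHERE attr2 = ?)
""", ("k",)).fetchall()

exists_rows = conn.execute("""
    SELECT * FROM t1
    WHERE EXISTS
      (SELECT * FROM t2
       WHERE t1.attr = t2.attr
       AND attr2 = ?)
""", ("k",)).fetchall()

print(sorted(in_rows))
print(sorted(exists_rows))
```

Note that the duplicate match in t2 does not duplicate rows of t1 in either form, which is what distinguishes both from a plain inner join.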

Do CTEs improve performance?

with ini as
(
select ...
)
select ini.a
join ini.b
join ini.c
How many times does the SQL Server engine calculate the results from the ini table ?
My question which I'm trying to answer (with your help) is if the with statement (CTE) improves performance by aliasing the results.
The CTE ini is simply a macro that expands and this use is syntax/clarity only.
MSDN says:
Using a CTE offers the advantages of improved readability and ease in maintenance of complex queries
Nothing about performance.
It is evaluated per mention: so three times here which you can see from an execution plan.
For recursive CTEs it's somewhat different as the CTE builds upon itself but it will still be evaluated once per mention
A CTE (common table expression, the part that is wrapped in the "with") is essentially a 1-time view. If you think of it in terms of a temporary view, perhaps the answer will become more clear. As far as I know, the interpreter will simply do the equivalent of copy/pasting whatever is within the CTE into the main query wherever it finds the reference.
I'm sure there are outside instances where it appears to help, but more often than not, I'd assume that the mere presence of a CTE itself is not going to improve the performance of a query. It'll help with readability and re-usability within that single select statement (i.e., you won't have to re-type the same sub-query multiple times), but I don't believe it will magically make things run faster (all things being equal). Of course, if your query is structured differently within the CTE than you would have done w/ sub-queries, then it's quite possible the CTE runs faster at that point, but you're now comparing apples to oranges.
I suppose it would also depend on whether you were using it to replace a derived table or a correlated subquery. Performance would be about the same in the first case and probably significantly better in the second if you joined to the CTE rather than just replaced the subquery code with a reference to the CTE. If you used it to replace a WHERE NOT EXISTS clause with a left join to a CTE (in order to find the records in one table but not the other), I'd expect performance to be worse, as WHERE NOT EXISTS is usually the fastest way to do that type of task. I guess what I'm saying is that performance will still depend on how you use the CTE, not just the fact that you generated one.
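The two ways of finding records in one table but not the other look like this. The sketch runs on SQLite through Python's sqlite3 rather than SQL Server, and the tables orders and shipped are hypothetical; it only shows that both forms return the same rows, so any performance comparison has to come from the execution plans on your own server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER);          -- hypothetical tables
    CREATE TABLE shipped (order_id INTEGER);
    INSERT INTO orders VALUES (1), (2), (3), (4);
    INSERT INTO shipped VALUES (2), (4), (4);
""")

# Anti-join written with NOT EXISTS.
not_exists = conn.execute("""
    SELECT o.id
    FROM orders o
    WHERE NOT EXISTS (SELECT 1 FROM shipped s WHERE s.order_id = o.id)
""").fetchall()

# Same question asked with a LEFT JOIN to a CTE and a NULL check.
left_join_cte = conn.execute("""
    WITH done AS (SELECT order_id FROM shipped)
    SELECT o.id
    FROM orders o
    LEFT JOIN done d ON d.order_id = o.id
    WHERE d.order_id IS NULL
""").fetchall()

print(sorted(not_exists))
print(sorted(left_join_cte))
```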