Look at the following query:
SELECT /*+ parallel(c, 20) */
       1, (SELECT 2 FROM DUAL)
FROM DUAL c;
If I comment out the scalar subquery it uses parallel execution; otherwise it doesn't.
You could have found the answer in the documentation:
A SELECT statement can be parallelized only if the following
conditions are satisfied:
The query includes a parallel hint specification (PARALLEL or
PARALLEL_INDEX) or the schema objects referred to in the query have a
PARALLEL declaration associated with them.
At least one of the tables specified in the query requires one of
the following:
A full table scan
An index range scan spanning multiple partitions
No scalar subqueries are in the SELECT list.
Your query falls at the final hurdle: it has a scalar subquery in its projection. If you want to parallelize the query you need to find another way to write it.
One idea would be to avoid the scalar subquery and use a join instead. Your subquery seems fairly simple (no grouping, etc.), so translating it into a join should not be an issue.
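As a minimal sketch of that rewrite, here is a runnable illustration using SQLite via Python. The customers/orders tables are hypothetical stand-ins, not the original Oracle schema; the point is only the transformation from a scalar subquery in the SELECT list to an equivalent join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.5), (12, 2, 3.0);
""")

# Scalar subquery in the SELECT list -- the pattern that blocks
# parallel execution in Oracle:
scalar = conn.execute("""
    SELECT c.name,
           (SELECT SUM(o.amount) FROM orders o WHERE o.customer_id = c.id) AS total
    FROM customers c
    ORDER BY c.id
""").fetchall()

# Equivalent join + GROUP BY, with no scalar subquery in the projection.
# Note: an inner join drops customers with no orders (where the scalar
# form would return NULL); use a LEFT JOIN if those rows must survive.
joined = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY c.id
""").fetchall()

print(scalar)   # [('Ann', 12.5), ('Bob', 3.0)]
print(joined)   # same rows
```

Both queries return the same result on this data, but only the join form keeps the SELECT list free of scalar subqueries.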
Maybe the optimizer is not capable of parallel execution when there are subqueries.
I came across a SQL practice question. The revealed answer is
SELECT ROUND(ABS(a - c) + ABS(b - d), 4) FROM (
SELECT MIN(lat_n) AS a, MIN(long_w) AS b, MAX(lat_n) AS c, MAX(long_w) AS d
FROM station);
Normally, I would encounter
select [] from [] where [] (select ...)
which I take to imply that the value selected by the inner query in the WHERE clause determines what is queried in the outer query. As mentioned at the beginning, this time the SELECT comes after FROM, and I'm curious about its functionality. Is it creating an imaginary table?
The piece in parentheses:
(SELECT MIN(lat_n) AS a, MIN(long_w) AS b, MAX(lat_n) AS c, MAX(long_w) AS d FROM station)
is a subquery.
What's important here is that the result of a subquery looks like a regular table to the outer query. In some SQL flavors, an alias is necessary immediately following the closing parenthesis (i.e. a name by which to refer to the table-like result).
Whether this is technically a "temporary table" is a bit of a detail, as its result isn't stored outside the scope of the query; there is also a thing called a temporary table, which is stored.
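To make that "table-like result" concrete, here is a small runnable sketch using SQLite via Python. The station data is made up purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE station (lat_n REAL, long_w REAL);
    INSERT INTO station VALUES (10.0, 20.0), (30.5, 25.25), (15.0, 40.0);
""")

# The outer query reads from the subquery's result exactly as if it were
# a table. SQLite does not require an alias here, but many engines
# (e.g. MySQL) insist on one, hence "AS extremes".
row = conn.execute("""
    SELECT ROUND(ABS(a - c) + ABS(b - d), 4)
    FROM (SELECT MIN(lat_n) AS a, MIN(long_w) AS b,
                 MAX(lat_n) AS c, MAX(long_w) AS d
          FROM station) AS extremes
""").fetchone()

print(row[0])  # -> 40.5  (|10.0 - 30.5| + |20.0 - 40.0|)
```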
Additionally (and this might be the source of confusion), subqueries can also be used in the WHERE clause with an operator (e.g. IN) like this:
SELECT student_name
FROM students
WHERE student_school IN (SELECT school_name FROM schools WHERE location = 'Springfield')
This is, as discussed in the comments and the other answer, also a subquery.
Logically, such a subquery (when it appears in the FROM clause) is executed "first", and then the results are treated as a table[1]. Importantly, though, that is not required by the SQL language[2]. The entire query (including any subqueries) is optimized as a whole.
This can include the optimizer doing things like pushing a predicate from the outer WHERE clause (which, admittedly, your query doesn't have) down into the subquery, if it's better to evaluate that predicate earlier rather than later.
Similarly, if you had two subqueries in your query that both access the same base table, that does not necessarily mean that the database system will actually query that base table exactly twice.
In any case, whether the database system chooses to materialize the results (store them somewhere) is also decided during the optimization phase. So without knowing your exact RDBMS and the decisions that the optimizer takes to optimize this particular query, it's impossible to say whether it will result in something actually being stored.
[1] Note that there is no standard terminology for this "result set as a table" produced by a subquery. Some people have mentioned "temporary tables", but since that is a term with a specific meaning in SQL, I shall not be using it here. I generally use the term "result set" to describe any set of data consisting of both columns and rows. This can be used both as a description of the result of the overall query and to describe smaller sections within a query.
[2] Provided that the final results are the same "as if" the query had been executed in its logical processing order, implementations are free to perform processing in any order they choose.
As there are so many terms involved, I just thought I'd throw in another answer ...
In a relational database we deal with tables. A query reads from tables and its result again is a table (albeit not a stored one).
So in the FROM clause we can access query results just like any stored table:
select * from (select * from t) x;
This makes the inner query a subquery of our main query. We could also call it an ad-hoc view, because "view" is the word we use for queries we access data from. We can move it to the beginning of our main query to enhance readability, and possibly use it multiple times there:
with x as (select * from t) select * from x;
We can even store such queries for later access:
create view v as select * from t;
select * from v;
In the SQL standard these terms are used:
BASE TABLE is a stored table we create with CREATE TABLE .... t in the above examples is supposed to be a base table.
VIEWED TABLE is a view we create with CREATE VIEW .... v in the above examples is a viewed table.
DERIVED TABLE is an ad-hoc view, such as x in the examples above.
When using subqueries in clauses other than FROM (e.g. in the SELECT clause or the WHERE clause), we don't use the term "derived table". This is because in those clauses we don't access tables (i.e. something like WHERE mytable = ... does not exist), but columns and expression results. So the term "subquery" is more general than the term "derived table". We still use various terms for subqueries in those clauses, though: there are correlated and non-correlated subqueries, and scalar and non-scalar ones.
And to make things even more complicated we can use correlated subqueries in the FROM clause in modern DBMS that feature lateral joins (sometimes implemented as CROSS APPLY and OUTER APPLY). The standard calls these LATERAL DERIVED TABLES.
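The three kinds of table above can be sketched in a few runnable lines (SQLite via Python; t is a stand-in base table, and LATERAL is omitted since SQLite doesn't support it):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# BASE TABLE: a stored table created with CREATE TABLE.
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

# VIEWED TABLE: a stored query created with CREATE VIEW.
conn.execute("CREATE VIEW v AS SELECT * FROM t")

# DERIVED TABLE: an ad-hoc view in the FROM clause.
derived = conn.execute(
    "SELECT * FROM (SELECT * FROM t) x ORDER BY n").fetchall()

# CTE: the same derived table moved to the front for readability/reuse.
cte = conn.execute(
    "WITH x AS (SELECT * FROM t) SELECT * FROM x ORDER BY n").fetchall()

viewed = conn.execute("SELECT * FROM v ORDER BY n").fetchall()

print(derived, cte, viewed)  # all three: [(1,), (2,), (3,)]
```

All three forms yield the same rows; they differ only in where the query text lives.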
I have a long query (~200 lines) that I have embedded in a function:
CREATE FUNCTION spot_rate(base_currency character(3),
contra_currency character(3),
pricing_date date) RETURNS numeric(20,8)
Whether I run the query directly or the function I get similar results and similar performance. So far so good.
Now I have another long query that looks like:
SELECT x, sum(y * spot_rates.spot)
FROM (SELECT a, b, sum(c) FROM t1 JOIN t2 etc. (6 joins here)) AS table_1,
(SELECT
currency,
spot_rate(currency, 'USD', current_date) AS "spot"
FROM (SELECT DISTINCT currency FROM table_2) AS "currencies"
) AS "spot_rates"
WHERE
table_1.currency = spot_rates.currency
GROUP BY / ORDER BY
This query runs in 300 ms, which is slowish but fast enough at this stage (and probably makes sense given the number of rows and aggregation operations).
If however I replace spot_rate(currency, 'USD', current_date) by its equivalent query, it runs in 5+ seconds.
Running the subquery alone returns in ~200ms whether I use the function or the equivalent query.
Why would the query run more slowly than the function when used as a subquery?
ps: I hope there is a generic answer to this generic problem - if not I'll post more details but creating a contrived example is not straightforward.
EDIT: EXPLAIN ANALYZE run on the 2 subqueries and whole queries
subquery with function: http://explain.depesz.com/s/UHCF
subquery with direct query: http://explain.depesz.com/s/q5Q
whole query with function: http://explain.depesz.com/s/ZDt
whole query with direct query: http://explain.depesz.com/s/R2f
just the function body, using one set of arguments: http://explain.depesz.com/s/mEp
Just a wild guess: your query's range-table is exceeding the join_collapse_limit, causing a suboptimal plan to be used.
Try moving the subquery body (the equivalent of the function) into a CTE, to keep it intact. (CTEs are always executed as written and never broken up by the planner; note, though, that since PostgreSQL 12 they may be inlined unless marked MATERIALIZED.)
Pre-calculating parts of the query into (TEMP) tables or materialized views can also help to reduce the number of RTEs.
You could (temporarily) increase join_collapse_limit, but this will cost more planning time, and there certainly is a limit to this (the number of possible plans grows exponentially with the size of the range table.)
Normally, you can detect this behaviour from the bad query plan (like here: fewer index scans), but you'll need knowledge of the schema, and there must be some kind of reasonable plan possible (read: PK/FK relationships and indexes must be correct, too).
I am new to DB2 and I have a question about the with clause.
For example in the following query:
WITH values AS
(
SELECT user_id, user_data FROM USER WHERE user_age < 20
)
SELECT avg(values.user_data) FROM values
UNION
SELECT sum(values.user_data) FROM values
How many times will the common table expression be executed? Will the result of the WITH clause be stored in a temporary table, or will it do the sub-select twice?
(I use WITH and UNION here just to give an example; sorry for my poor English.)
As @Vladimir Oselsky has mentioned, only looking at the execution plan will give you a definite answer. In this contrived example the CTE subselect will likely run twice.
In DB2, common table expressions should create the Common Table Expression Node in the execution plan (see the documentation here). This node explicitly says:
They serve as intermediate tables. Traditionally, a nested table
expression also serves this purpose. However, a common table
expression can be referenced multiple times after it is instantiated;
nested table expressions cannot.
I read this as saying that the CTE is only evaluated once, instantiated, and then used multiple times. Also, if the CTE is referenced only one time, the "instantiation" is optimized away.
Note that this is the way that Postgres handles CTEs (materialized subqueries) and not the way SQL Server handles them.
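As a runnable illustration of a CTE referenced twice, here is the question's query reproduced in SQLite via Python (the table is renamed user_t since USER is a reserved word in many engines; whether the engine materializes the CTE once or inlines it twice is engine- and version-specific, as discussed above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_t (user_id INTEGER, user_data REAL, user_age INTEGER);
    INSERT INTO user_t VALUES (1, 10.0, 18), (2, 20.0, 19), (3, 99.0, 30);
""")

# Same shape as the question's query: the CTE feeds both branches
# of the UNION.
rows = conn.execute("""
    WITH vals AS (
        SELECT user_id, user_data FROM user_t WHERE user_age < 20
    )
    SELECT AVG(user_data) FROM vals
    UNION
    SELECT SUM(user_data) FROM vals
""").fetchall()

print(sorted(rows))  # AVG = 15.0, SUM = 30.0 over the two under-20 rows
```

The result is the same either way; only the execution plan (one instantiation vs. two sub-selects) differs.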
If I have the following toy query
SELECT *
FROM my_tables
WHERE my_id in (
SELECT my_other_id
FROM my_other_tables
) AND some_slow_func(arg) BETWEEN 1 AND 2;
Would the first condition in the WHERE clause short circuit the second condition which would have a complex run time?
I'm working on some SQL that is actually part of a FOR LOOP in plpgsql, and I could iterate over all records that exist in my_other_tables and then test with some_slow_func() inside the FOR LOOP. But I'm curious whether SQL, or plpgsql, supports short-circuiting.
Some Research:
I looked in the Postgres mailing lists and found this saying SQL in general doesn't support short circuiting:
http://www.postgresql.org/message-id/171423D4-9229-4D56-B06B-58D29BB50A77#yahoo.com
But one of the responses says that order can be enforced through subselects. I'm not exactly sure what he's speaking about. I know what a subselect is, but I'm not sure how order would be enforced? Could some one clarify this for me?
As documented, the evaluation order in a WHERE clause is supposed to be unpredictable.
It's different with subqueries. With PostgreSQL older than version 12, the simplest and common technique to drive the evaluation order is to write a subquery in a CTE. To make sure that the IN(...) is evaluated first, your code could be written as:
WITH subquery AS
(select * from my_tables
WHERE my_id in (SELECT my_other_id FROM my_other_tables)
)
SELECT * FROM subquery
WHERE some_slow_func(arg) BETWEEN 1 AND 2;
Starting with PostgreSQL version 12, WITH subqueries may be inlined by the optimizer (see the doc page on WITH queries for all the details), and the non-inlining is only guaranteed when adding the MATERIALIZED clause:
WITH subquery AS MATERIALIZED
(select * ... the rest is similar as above)
Something else that you may tweak is the cost of your function to signal to the optimizer that it's slow. The default cost for a function is 100, and it can be altered with a statement like:
ALTER FUNCTION funcname(argument types) cost N;
where N is the estimated per-call cost, expressed in an arbitrary unit that should be compared to the Planner Cost Constants.
I know this is an old question, but I recently ran into a similar issue and found that using a CASE predicate in the WHERE clause worked better for me. In the context of the answer above:
SELECT *
FROM my_tables
WHERE CASE WHEN my_id in (SELECT my_other_id
FROM my_other_tables)
AND some_slow_func(arg) BETWEEN 1 AND 2
THEN 1
ELSE 0
END = 1;
This makes for SQL that is slightly more DB-agnostic. Of course, it may not make use of any indexes you have on my_id, but depending on the context you're in, this could be a good option.
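A runnable sketch of the CASE-predicate form, using SQLite via Python. Here some_slow_func is a hypothetical stand-in registered as a user-defined function, and the table contents are invented; note also that whether CASE actually forces evaluation order remains engine-specific:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for the expensive function from the question.
conn.create_function("some_slow_func", 1, lambda arg: arg * 1.5)

conn.executescript("""
    CREATE TABLE my_tables (my_id INTEGER, arg REAL);
    CREATE TABLE my_other_tables (my_other_id INTEGER);
    INSERT INTO my_tables VALUES (1, 1.0), (2, 0.2), (3, 1.0);
    INSERT INTO my_other_tables VALUES (1), (2);
""")

# The whole WHERE condition is folded into a single CASE expression.
rows = conn.execute("""
    SELECT my_id
    FROM my_tables
    WHERE CASE WHEN my_id IN (SELECT my_other_id FROM my_other_tables)
                    AND some_slow_func(arg) BETWEEN 1 AND 2
               THEN 1 ELSE 0
          END = 1
""").fetchall()

print(rows)  # only id 1 passes both conditions
```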
According to the Postgresql docs and this answer by Tom Lane, the order of execution of WHERE constraints is not reliable.
I think your best bet here may be to add that other part of your WHERE clause into the top of your function and "fail fast"; i.e., run my_id IN (SELECT my_other_id FROM my_other_tables) in your function, and if it doesn't pass, return right there before doing your intensive processing. That should get you about the same effect.
Hello, I took a SQL test and am dubious/curious about one question:
In which sequence are queries and sub-queries executed by the SQL engine?
The answers were:
1. primary query -> sub query -> sub sub query and so on
2. sub sub query -> sub query -> primary query
3. the whole query is interpreted at one time
4. there is no fixed sequence of interpretation; the query parser takes a decision on the fly
I chose the last answer (just supposing that it is the most reliable with respect to the others).
Now the curiosity: where can I read about this and, briefly, what is the mechanism behind all of that?
Thank you.
I think answer 4 is correct. There are a few considerations:
Type of subquery - is it correlated or not. Consider:
SELECT *
FROM t1
WHERE id IN (
SELECT id
FROM t2
)
Here, the subquery is not correlated to the outer query. If the number of values in t2.id is small in comparison to t1.id, it is probably most efficient to execute the subquery first, keep the result in memory, and then scan t1 or an index on t1.id, matching against the cached values.
But if the query is:
SELECT *
FROM t1
WHERE id IN (
SELECT id
FROM t2
WHERE t2.type = t1.type
)
here the subquery is correlated - there is no way to compute the subquery unless t1.type is known. Since the value for t1.type may vary for each row of the outer query, this subquery could be executed once for each row of the outer query.
Then again, the RDBMS may be really smart and realize there are only a few possible values for t2.type. In that case, it may still use the approach used for the uncorrelated subquery if it can guess that the cost of executing the subquery once will be cheaper than doing it for each row.
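Both forms can be sketched quickly in a runnable example (SQLite via Python; the t1/t2 contents are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, type TEXT);
    CREATE TABLE t2 (id INTEGER, type TEXT);
    INSERT INTO t1 VALUES (1, 'a'), (2, 'b'), (3, 'a');
    INSERT INTO t2 VALUES (1, 'a'), (3, 'b');
""")

# Uncorrelated: the subquery references nothing from the outer query,
# so it can be computed once, up front.
uncorrelated = conn.execute("""
    SELECT * FROM t1
    WHERE id IN (SELECT id FROM t2)
    ORDER BY id
""").fetchall()

# Correlated: the subquery references t1.type, so logically it is
# re-evaluated per outer row (though the optimizer may rewrite it).
correlated = conn.execute("""
    SELECT * FROM t1
    WHERE id IN (SELECT id FROM t2 WHERE t2.type = t1.type)
    ORDER BY id
""").fetchall()

print(uncorrelated)  # ids 1 and 3 exist in t2
print(correlated)    # only id 1 also matches on type
```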
Option 4 is close.
SQL is declarative: you tell the query optimiser what you want and it works out the best (subject to time/"cost" etc) way of doing it. This may vary for outwardly identical queries and tables depending on statistics, data distribution, row counts, parallelism and god knows what else.
This means there is no fixed order, but it's not quite "on the fly" either: even with identical servers, schema, queries, and data, I've seen execution plans differ.
The SQL engine tries to optimise the order in which (sub)queries are executed. The part deciding about that is called a query optimizer. The query optimizer knows how many rows are in each table, which tables have indexes and on what fields. It uses that information to decide what part to execute first.
If you want something to read up on these topics, get a copy of Inside SQL Server 2008: T-SQL Querying. It has two dedicated chapters on how queries are processed logically and physically in SQL Server.
It usually depends on your DBMS, but ... I think the second answer is more plausible.
The primary query usually can't be calculated without the subquery's results.