Does Postgresql plpgsql/sql support short circuiting in the where clause?

If I have the following toy query
SELECT *
FROM my_tables
WHERE my_id IN (
    SELECT my_other_id
    FROM my_other_tables
) AND some_slow_func(arg) BETWEEN 1 AND 2;
Would the first condition in the WHERE clause short-circuit the second condition, which has an expensive run time?
I'm working on some SQL that is actually part of a FOR LOOP in plpgsql: I could iterate over all records that exist in my_other_tables, and then test with some_slow_func() within the scope of the FOR LOOP. But I'm curious whether SQL, or plpgsql, supports short circuiting.
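For reference, a minimal sketch of that FOR LOOP approach (table, column, and function names are assumptions taken from the toy query above):

DO $$
DECLARE
    rec record;
BEGIN
    -- Iterate only over rows that pass the cheap membership test,
    -- then apply the slow function inside the loop body.
    FOR rec IN
        SELECT t.*
        FROM my_tables t
        WHERE t.my_id IN (SELECT my_other_id FROM my_other_tables)
    LOOP
        IF some_slow_func(rec.arg) BETWEEN 1 AND 2 THEN
            RAISE NOTICE 'matched id %', rec.my_id;  -- process the row here
        END IF;
    END LOOP;
END $$;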
Some Research:
I looked in the Postgres mailing lists and found this saying SQL in general doesn't support short circuiting:
http://www.postgresql.org/message-id/171423D4-9229-4D56-B06B-58D29BB50A77@yahoo.com
But one of the responses says that order can be enforced through subselects. I'm not exactly sure what he's talking about. I know what a subselect is, but I'm not sure how order would be enforced. Could someone clarify this for me?

As documented, the evaluation order in a WHERE clause is supposed to be unpredictable.
It's different with subqueries. With PostgreSQL older than version 12, the simplest and most common technique to control the evaluation order is to write the subquery in a CTE. To make sure that the IN(...) is evaluated first, your code could be written as:
WITH subquery AS (
    SELECT * FROM my_tables
    WHERE my_id IN (SELECT my_other_id FROM my_other_tables)
)
SELECT * FROM subquery
WHERE some_slow_func(arg) BETWEEN 1 AND 2;
Starting with PostgreSQL version 12, WITH subqueries may be inlined by the optimizer (see the doc page on WITH queries for all the details), and the non-inlining is only guaranteed when adding the MATERIALIZED clause:
WITH subquery AS MATERIALIZED
(select * ... the rest is the same as above)
Something else that you may tweak is the cost of your function to signal to the optimizer that it's slow. The default cost for a function is 100, and it can be altered with a statement like:
ALTER FUNCTION funcname(argument types) cost N;
where N is the estimated per-call cost, expressed in an arbitrary unit that should be compared to the Planner Cost Constants.
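For example, a hedged illustration (the signature and the figure 10000 are assumptions for illustration, not measured values):

ALTER FUNCTION some_slow_func(integer) COST 10000;  -- roughly 100x the default cost of 100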

I know this is an old question, but I recently ran into a similar issue and found that a CASE predicate in the WHERE clause worked better for me. In the context of the answer above:
SELECT *
FROM my_tables
WHERE CASE WHEN my_id IN (SELECT my_other_id
                          FROM my_other_tables)
            AND some_slow_func(arg) BETWEEN 1 AND 2
           THEN 1
           ELSE 0
      END = 1;
This makes for SQL that is slightly more DB agnostic. Of course, it may not make use of indexes if you have any on my_id, but depending on the context you're in, this could be a good option.

According to the PostgreSQL docs and this answer by Tom Lane, the order of execution of WHERE constraints is not reliable.
I think your best bet here may be to move that other part of your WHERE clause into the top of your function and "fail fast"; i.e., run my_id IN (SELECT my_other_id FROM my_other_tables) in your function, and if it doesn't pass, return right there before doing your intensive processing. That should get you about the same effect.
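A minimal sketch of that "fail fast" idea, assuming a hypothetical function signature and that the id is passed in:

CREATE OR REPLACE FUNCTION my_intensive_func(p_id integer, p_arg integer)
RETURNS numeric AS $$
BEGIN
    -- Cheap membership test first: bail out before the expensive work.
    IF NOT EXISTS (SELECT 1 FROM my_other_tables WHERE my_other_id = p_id) THEN
        RETURN NULL;
    END IF;
    -- ... intensive processing only for rows that passed, using p_arg ...
    RETURN 1.0;
END;
$$ LANGUAGE plpgsql;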

Related

Postgres all subqueries in coalesce executed

COALESCE in Postgres is a function that returns the first non-null parameter.
So I used coalesce in subqueries like:
SELECT COALESCE (
( SELECT * FROM users WHERE... ORDER BY ...),
( SELECT * FROM users WHERE... ORDER BY ...),
( SELECT * FROM users WHERE... ORDER BY ...),
( SELECT * FROM users WHERE... ORDER BY ...)
);
The WHERE clause differs in each subquery, and they contain lots of parameters and CASE expressions, as well as different ORDER BY clauses.
This is because I always want to return something, but with priorities.
What I noticed while issuing EXPLAIN ANALYZE is that every subquery is executed, even though the first one actually returns a non-null value.
I would expect the engine to run only the first query and skip the following ones when it returns a non-null value.
Otherwise performance could suffer.
So am I following a bad practice, and would it be better to run the queries separately for performance reasons?
EDIT:
Sorry, you were right: I don't select *, I select only one column. I didn't post my code because I'm not interested in my specific query; it's a generic question to understand how the engine works. So I reproduced a very simple fiddle here http://sqlfiddle.com/#!17/a8aa7/4
I may be wrong, but I think it behaves as I described: it runs all the subqueries even though the first one already returns a non-null value.
EDIT 2: OK, I only now read that it says "never executed". So the other two queries aren't getting executed. What confused me was the fact that they were included in the query plan.
Anyway, it's still important for my question: is it better to run all the queries separately for performance reasons? Because it seemed that even if the first one returns a non-null value, the other two subqueries could slow down performance.
For separate SELECT queries, I suggest using UNION ALL / LIMIT 1 instead. Based on your fiddle:
(select user_id from users order by age limit 1) -- misleading example, see below
UNION ALL
(select user_id from users where user_id=1)
UNION ALL
(select user_id from users order by user_id DESC limit 1)
LIMIT 1;
db<>fiddle here
For three reasons:
Works for any SELECT list: single expressions (your fiddle), multiple or whole row (your example in the question).
You can distinguish actual NULL values from "no row". Since user_id is the PK in the example (and hence NOT NULL), the problem cannot surface there. But with an expression that can be NULL, COALESCE cannot distinguish between the two: "no row" is coerced to NULL for the purpose of the query. See:
Return a value if no record is found
Faster.
As an aside, your first SELECT in the example makes this a wild-goose chase: it returns a row whenever there is at least one, so the rest is noise in this case.
Related:
PostgreSQL combine multiple select statements
SQL - does order of OR conditions matter?
Way to try multiple SELECTs till a result is available?

Can Oracle hints be used to defer a (PL/SQL) condition till last?

I'm trying to optimise a select (a cursor in PL/SQL code, actually) that includes a PL/SQL function, e.g.:
select * from mytable t,mytable2 t2...
where t.thing = 'XXX'
... lots more joins and sql predicate on various columns
and myplsqlfunction(t.val) = 'X'
The myplsqlfunction() is very expensive, but it only needs to be applied to the manageably small subset of rows that satisfy the other conditions.
The problem is that Oracle appears to be evaluating myplsqlfunction() on more rows than is ideal.
My evidence for this is that if I recast the above as either
select * from (
    select * from mytable t, mytable2 t2 ...
    where t.thing = 'XXX'
    ... lots more joins and sql predicates on various columns
) where myplsqlfunction(val) = 'X'
or pl/sql as:
begin
    for t in ( select * from mytable t, mytable2 t2 ...
               where t.thing = 'XXX'
               ... lots more joins and sql predicates on various columns ) loop
        if myplsqlfunction(t.val) = 'X' then
            null;  -- process the desired subset here
        end if;
    end loop;
end;
performance is an order of magnitude better.
I am resigned to restructuring the offending code to use either of the two idioms above, but I would be delighted if there were a simpler way to get the Oracle optimizer to do this for me.
You could specify a bunch of hints to force a particular plan. But that would almost assuredly be more of a pain than restructuring the code.
I would expect that what you really want to do is to associate non-default statistics with the function. If you tell Oracle that the function is less selective than the optimizer is guessing or (more likely) if you provide high values for the CPU or I/O cost of the function, you'll cause the optimizer to try to call the function as few times as possible. The oracle-developer.net article walks through how to pick reasonably correct values for the cost (or going a step beyond that how to make those statistics change over time as the cost of the function call changes). You can probably fix your immediate problem by setting crazy-high costs but you probably want to go to the effort of setting accurate values so that you're giving the optimizer the most accurate information possible. Setting costs way too high or way too low tends to cause some set of queries to do something stupid.
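For example, something along these lines (the cost and selectivity figures below are purely illustrative, not measured):

ASSOCIATE STATISTICS WITH FUNCTIONS myplsqlfunction
    DEFAULT COST (10000, 10, 0)   -- CPU, I/O, network cost per call
    DEFAULT SELECTIVITY 5;        -- estimated percentage of rows that match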
You can use a WITH clause to first evaluate all your join conditions and get a manageable subset of data, then apply the PL/SQL function to that subset. It all depends on the volume, but you can try this. Let me know if you run into any issues.
You can use CTE like:
WITH X AS
( select /*+ MATERIALIZE */ * from mytable t, mytable2 t2 ...
  where t.thing = 'XXX'
  ... lots more joins and sql predicates on various columns
)
SELECT * FROM X
WHERE myplsqlfunction(X.val) = 'X';
Note the MATERIALIZE hint. CTEs can be either inlined or materialized (into the TEMP tablespace).
Another option would be to use the NO_PUSH_PRED hint. This is generally a better solution (it avoids materializing the subquery), but it requires some tweaking.
PS: you should not run another SQL statement inside myplsqlfunction. That SQL might see data added after your query started, and you might get surprising results.
You can also declare your function as RESULT_CACHE, to make Oracle remember return values of the function - applicable if the number of possible parameter values is reasonably small.
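A sketch of what that declaration looks like (the signature and body are assumptions):

CREATE OR REPLACE FUNCTION myplsqlfunction(p_val IN VARCHAR2)
    RETURN VARCHAR2
    RESULT_CACHE   -- Oracle caches the result per distinct p_val
IS
BEGIN
    -- expensive computation here; repeated calls with the same argument hit the cache
    RETURN 'X';
END;
/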
Probably the best solution is to associate the stats, as Justin describes.

Using Order By in IN condition

Let's say we have this silly query:
Select * From Emp Where Id In (Select Id From Emp)
Let's make a small modification inside the IN condition by adding an ORDER BY clause.
Select * From Emp Where Id In (Select Id From Emp Order By Id)
Then we are getting the following error:
ORA-00907: missing right parenthesis
Oracle assumes that the IN condition will end after the FROM table is declared. As a result, it waits for a right parenthesis, but we give it an ORDER BY instead.
Why can't we add an Order By inside IN condition?
FYI: I'm not asking for an equivalent query. This is an example, after all. I just can't understand why this error occurs.
Consider the condition x in (select something from somewhere). It returns true if x is equal to any of the somethings returned from the query - regardless of whether it's the first, second, last, or anything in the middle. The order in which the somethings are returned is inconsequential. Adding an order by clause to a query often comes with a hefty performance hit, so I'm guessing this limitation was introduced to prevent adding a clause that has no impact on the query's correctness on the one hand and may be quite expensive on the other.
It would not make sense to sort the values inside the IN clause. Think of it this way:
IN (a, b, c)
is the same as
IN (c, b, a)
is the same as
IN (b, c, a)
Internally, Oracle applies an OR condition, so it is equivalent to:
WHERE id = a OR id = b OR id = c
Would it make any sense to order the conditions?
Ordering comes at its own expense. So, when there is no need to sort, just don't do it. And in this case, Oracle applies the same rule.
When it comes to query performance, the optimizer needs to choose the best possible execution plan, i.e. the one with the least cost to achieve the desired output. ORDER BY is useless in this case, and Oracle did a good job by preventing its use altogether.
From the documentation:
Type of Condition Operation
----------------- ----------
IN Equal-to-any-member-of test. Equivalent to =ANY.
So, when you need to test a value for membership in a list of values, there is no need for an ordered list; a simple membership match does the job.
If you look at the Oracle SQL reference (syntax diagrams) you will find the reason. ORDER BY is part of the "select" statement, while the IN clause uses the lower-level "subquery" construct. Your problem relates to the nature of Oracle's SQL grammar.
PS: there are more gotchas like this, e.g. with multiple UNION or MINUS operations on "subqueries" you can use only ONE ORDER BY clause, as it applies to the result of the whole UNION operation.
This will fail too:
select * from dual order by 1
union all
select * from dual order by 1;
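The working form has a single ORDER BY at the end, applying to the result of the whole UNION:

select * from dual
union all
select * from dual
order by 1;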

Subquery is faster using a function

I have a long query (~200 lines) that I have embedded in a function:
CREATE FUNCTION spot_rate(base_currency character(3),
contra_currency character(3),
pricing_date date) RETURNS numeric(20,8)
Whether I run the query directly or the function I get similar results and similar performance. So far so good.
Now I have another long query that looks like:
SELECT x, sum(y * spot_rates.spot)
FROM (SELECT a, b, sum(c) FROM t1 JOIN t2 etc. (6 joins here)) AS table_1,
(SELECT
currency,
spot_rate(currency, 'USD', current_date) AS "spot"
FROM (SELECT DISTINCT currency FROM table_2) AS "currencies"
) AS "spot_rates"
WHERE
table_1.currency = spot_rates.currency
GROUP BY / ORDER BY
This query runs in 300 ms, which is slowish but fast enough at this stage (and probably makes sense given the number of rows and aggregation operations).
If however I replace spot_rate(currency, 'USD', current_date) by its equivalent query, it runs in 5+ seconds.
Running the subquery alone returns in ~200ms whether I use the function or the equivalent query.
Why would the query run more slowly than the function when used as a subquery?
ps: I hope there is a generic answer to this generic problem - if not I'll post more details but creating a contrived example is not straightforward.
EDIT: EXPLAIN ANALYZE run on the 2 subqueries and whole queries
subquery with function: http://explain.depesz.com/s/UHCF
subquery with direct query: http://explain.depesz.com/s/q5Q
whole query with function: http://explain.depesz.com/s/ZDt
whole query with direct query: http://explain.depesz.com/s/R2f
just the function body, using one set of arguments: http://explain.depesz.com/s/mEp
Just a wild guess: your query's range-table is exceeding the join_collapse_limit, causing a suboptimal plan to be used.
Try moving the subquery body (the equivalent of the function) into a CTE, to keep it intact. (Prior to PostgreSQL 12, CTEs are always executed as written, and never broken up by the query planner.)
Pre-calculating parts of the query into (TEMP) tables or materialised views can also help to reduce the number of RTEs (range-table entries).
You could (temporarily) increase join_collapse_limit, but this will cost more planning time, and there certainly is a limit to it (the number of possible plans grows exponentially with the size of the range table).
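A hedged example of that experiment (8 is the default for both settings; 12 is an arbitrary bump):

SET join_collapse_limit = 12;
SET from_collapse_limit = 12;  -- often raised together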
Normally, you can detect this behaviour from the bad query plan (like here: fewer index scans), but you'll need knowledge of the schema, and some reasonable plan must be possible (read: PK/FK constraints and indexes must be correct, too).

WHERE-CASE clause Subquery Performance

This question may be specific to SQL Server.
When I write a query such as:
SELECT * FROM IndustryData WHERE Date = '20131231'
AND ReportTypeID = CASE WHEN (fnQuarterDate('20131231') = '20131231') THEN 1
                        WHEN (fnQuarterDate('20131231') != '20131231') THEN 4
                   END;
Is the function call fnQuarterDate (or any subquery) within the CASE inside the WHERE clause executed for EACH row of the table?
Would it be better if I got the function's (or subquery's) value beforehand into a variable, like:
DECLARE @X INT
IF fnQuarterDate('20131231') = '20131231'
    SET @X = 1
ELSE
    SET @X = 0
SELECT * FROM IndustryData WHERE Date = '20131231'
AND ReportTypeID = CASE WHEN (@X = 1) THEN 1
                        WHEN (@X = 0) THEN 4
                   END;
I know that in MySQL, if there is a subquery inside IN(...) within a WHERE clause, it is executed for each row. I just wanted to find out whether the same holds for SQL Server.
...
Just populated the table with about 30K rows and found out the time difference:
Query 1 = 70 ms; Query 2 = 6 ms. I think that explains it, but I still don't know the actual facts behind it.
Also, would there be any difference if instead of a UDF there was a simple subquery?
I think the solution may in theory increase performance, but it also depends on what the scalar function actually does. I think that in this case (my guess is it formats the date to the last day of the quarter) the difference would really be negligible.
You may want to read this page with suggested workarounds:
http://connect.microsoft.com/SQLServer/feedback/details/273443/the-scalar-expression-function-would-speed-performance-while-keeping-the-benefits-of-functions#
Because SQL Server must execute each function on every row, using any function incurs a cursor-like performance penalty.
And under Workarounds, there is a comment that:
I had the same problem when I used a scalar UDF in a join column; the performance was horrible. After I replaced the UDF with a temp table that contains the results of the UDF and used it in the join clause, the performance was orders of magnitude better. The MS team should fix UDFs to be more reliable.
So it appears that yes, this may increase the performance.
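A rough T-SQL sketch of that temp-table workaround, adapted to this question (the schema prefix and column names are assumptions):

-- Compute the UDF once per distinct date instead of once per row.
SELECT DISTINCT [Date], dbo.fnQuarterDate([Date]) AS QuarterEnd
INTO #Quarters
FROM IndustryData;

-- Join on the precomputed value instead of calling the UDF per row.
SELECT i.*
FROM IndustryData i
JOIN #Quarters q ON q.[Date] = i.[Date]
WHERE i.[Date] = '20131231'
  AND i.ReportTypeID = CASE WHEN q.QuarterEnd = i.[Date] THEN 1 ELSE 4 END;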
Your solution is correct, but I would recommend improving the SQL to use ELSE instead; it looks cleaner to me:
AND ReportTypeID = CASE WHEN (@X = 1) THEN 1
                        ELSE 4
                   END;
It depends. See User-Defined Functions:
The number of times that a function specified in a query is actually executed can vary between execution plans built by the optimizer. An example is a function invoked by a subquery in a WHERE clause. The number of times the subquery and its function is executed can vary with different access paths chosen by the optimizer.
This approach uses inline MySQL variables. The query alias "sqlvars" prepares @dateBasis first with the date in question, then a second variable @qtrReportType based on the function call, done ONCE for the entire query. Then, by cross join (there is no WHERE clause between the tables, since sqlvars is a single row anyway), it uses those values to get data from your IndustryData table.
select
      ID.*
   from
      ( select
              @dateBasis := '20131231',
              @qtrReportType := case when fnQuarterDate(@dateBasis) = @dateBasis
                                     then 1
                                     else 4 end ) sqlvars,
      IndustryData ID
   where
          ID.Date = @dateBasis
      AND ID.ReportTypeID = @qtrReportType