Why does the Oracle optimiser treat a join written with JOIN and one written in the WHERE clause differently? - sql

I have a query on which I used query optimiser:
SELECT res.studentid,
       res.examid,
       r.percentcorrect,
       MAX(attempt) AS attempt
FROM   tbl res
JOIN   (SELECT studentid,
               examid,
               MAX(percentcorrect) AS percentcorrect
        FROM   tbl
        GROUP BY studentid, examid) r
  ON   r.studentid = res.studentid
 AND   r.examid = res.examid
 AND   r.percentcorrect = res.percentcorrect
GROUP BY res.studentid, res.examid, r.percentcorrect
ORDER BY res.examid
What surprised me was that the optimiser returned the following as over 40% faster:
SELECT /*+ NO_CPU_COSTING */ res.studentid,
       res.examid,
       r.percentcorrect,
       MAX(attempt) AS attempt
FROM   tbl res,
       (SELECT studentid,
               examid,
               MAX(percentcorrect) AS percentcorrect
        FROM   tbl
        GROUP BY studentid, examid) r
WHERE  r.studentid = res.studentid
  AND  r.examid = res.examid
  AND  r.percentcorrect = res.percentcorrect
GROUP BY res.studentid, res.examid, r.percentcorrect
ORDER BY res.examid
Here are the execution plans for both:
How is that possible? I always thought the optimiser treats an explicit JOIN exactly the same as a join written in the WHERE clause, as in the optimised query...

From here:
In general you should find that the cost of a table scan will increase
when you enable CPU Costing (also known as "System Statistics"). This
means that your improved run time is likely to be due to changes in
execution path that have started to favour execution plans. There are
a few articles about system statistics on my blog that might give you
more background, and a couple of links from there to other relevant
articles:
http://jonathanlewis.wordpress.com/category/oracle/statistics/system-stats/
In other words, your statistics might be stale, but since you have "turned them off" for this query, you avoid using an inefficient path: hence the (temporary?) improvement.
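If you want to verify that before deciding, here is a minimal sketch of how to look at (and refresh) system statistics; it assumes you can query SYS.AUX_STATS$ and have EXECUTE on DBMS_STATS:
-- Inspect the system statistics the CPU-costing model is working from
SELECT sname, pname, pval1
FROM   sys.aux_stats$;
-- Re-gather them (NOWORKLOAD is the simplest mode; a workload capture is also possible)
BEGIN
  DBMS_STATS.GATHER_SYSTEM_STATS(gathering_mode => 'NOWORKLOAD');
END;
/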

Related

Very slow Clustered Index Seek when there's a WHERE clause

I have an important SQL query that is performing too slowly. I pinpointed its performance issues to a view. Here is (roughly) what the view looks like:
Without WHERE clause
-- the 'top 100' isn't part of the view, but I've added it for testing purposes
SELECT top 100
       fs.*,
       fss.Status,
       fss.CreateDateTimeUtc StatusDateTimeUtc,
       fss.IsError,
       fss.CorrelationId
FROM   dbo.FormSubmission fs WITH (NOLOCK)
CROSS APPLY (
       SELECT TOP 1
              FormId,
              SubmissionId,
              Status,
              CreateDateTimeUtc,
              IsError,
              CorrelationId
       FROM   dbo.FormSubmissionStatus x WITH (NOLOCK)
       WHERE  x.FormId = fs.FormId AND x.SubmissionId = fs.SubmissionId
       ORDER BY CreateDateTimeUtc DESC
) fss
If I run this, it's pretty quick. Here are some metrics and the execution plan:
00:00:00.441
Table 'FormSubmissionStatus'. Scan count 102, logical reads 468
Table 'FormSubmission'. Scan count 1, logical reads 4
With WHERE clause
However, as soon as I add this WHERE clause, it gets much slower.
where status in ('Transmitted', 'Acknowledging')
Metrics and execution plan:
00:00:15.1984
Table 'FormSubmissionStatus'. Scan count 4145754, logical reads 17619490
Table 'FormSubmission'. Scan count 1, logical reads 101978
Index attempt
I tried various types of new indexes and I haven't seen any real improvements. Here is an example of one:
create index ix_fss_datetime_formId_submissionId_status
on FormSubmissionStatus (CreateDateTimeUtc) include (formId, submissionId, status)
where status in ('Transmitted', 'Acknowledging')
What else can I try to speed this up?
If it helps to know, the PK for this table is a composite of FormId (uniqueidentifier), SubmissionId (varchar(50)), Status (varchar(50)), and CreateDateTimeUtc (datetime2)
Update
Per @J.Salas's suggestion in the comments, I tried putting the WHERE clause in the subquery and saw a massive improvement (~700ms execution time vs the ~15s).
This isn't a solution, since I can't have that where clause in my view (the query that uses this view adds the WHERE clause). However, it does point to the subquery being a problem. Is there a way I could restructure it? Maybe do the subquery as a temp table and join on fs?
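Roughly what I have in mind is something like the sketch below (untested; #LatestStatus is just an illustrative name, and I keep the status filter outside so the "latest row per submission" semantics of the view are unchanged):
-- Materialise the latest status row per (FormId, SubmissionId) once
SELECT FormId,
       SubmissionId,
       Status,
       CreateDateTimeUtc,
       IsError,
       CorrelationId,
       ROW_NUMBER() OVER (PARTITION BY FormId, SubmissionId
                          ORDER BY CreateDateTimeUtc DESC) AS rn
INTO   #LatestStatus
FROM   dbo.FormSubmissionStatus;
-- Then join back to FormSubmission and apply the caller's filter
SELECT fs.*,
       ls.Status,
       ls.CreateDateTimeUtc AS StatusDateTimeUtc,
       ls.IsError,
       ls.CorrelationId
FROM   dbo.FormSubmission fs
JOIN   #LatestStatus ls
  ON   ls.FormId = fs.FormId
 AND   ls.SubmissionId = fs.SubmissionId
WHERE  ls.rn = 1
  AND  ls.Status IN ('Transmitted', 'Acknowledging');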
Looking at the query plan I do not hold much hope that the following could help, but your view query could be reformulated to use a CTE and ROW_NUMBER() instead of CROSS APPLY. I believe the following is equivalent in meaning:
WITH fss AS (
    SELECT FormId,
           SubmissionId,
           Status,
           CreateDateTimeUtc,
           IsError,
           CorrelationId,
           ROW_NUMBER() OVER (PARTITION BY FormId, SubmissionId ORDER BY CreateDateTimeUtc DESC) AS RN
    FROM   dbo.FormSubmissionStatus
)
SELECT fs.*,
       fss.Status,
       fss.CreateDateTimeUtc StatusDateTimeUtc,
       fss.IsError,
       fss.CorrelationId
FROM   dbo.FormSubmission fs
       INNER JOIN fss
               ON fss.FormId = fs.FormId
              AND fss.SubmissionId = fs.SubmissionId
WHERE  fss.RN = 1;
The APPLY operator in your original query says: for every row in fs, run this subquery. Taken literally, that would cause the subquery to run many, many times. However, SQL Server is free to optimize the plan so long as the results are as if the subquery fss were run once per row of fs, so it may not be able to optimize the above any better.
For indexes I would try on (FormId, SubmissionId, CreateDateTimeUtc DESC) maybe with INCLUDE (Status). But really anything besides the FormId, SubmissionId, and CreateDateTimeUtc would depend on how the view is used.
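For what it's worth, a sketch of that suggested index (the name is just a placeholder):
CREATE INDEX ix_fss_formId_submissionId_createDateTimeUtc
    ON dbo.FormSubmissionStatus (FormId, SubmissionId, CreateDateTimeUtc DESC)
    INCLUDE (Status);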
Query tuning is a matter of educated guesses combined with trial and error. To get better information for making those guesses, something like Brent Ozar's SQL Server First Responder Kit can help show what is actually happening in production. How to use it is beyond the scope of a single StackOverflow answer.

SQL Server choosing inefficient execution plan

I've got a query that gets run in certain circumstances with an 'over-simplified' execution plan that actually turns out to be quite slow (3-5 seconds). The query is:
SELECT DISTINCT Salesperson.*
FROM Salesperson
INNER JOIN SalesOrder on Salesperson.Id = SalesOrder.SalespersonId
INNER JOIN PrelimOrder on SalesOrder.Id = PrelimOrder.OrderId
INNER JOIN PrelimOrderStatus on PrelimOrder.CurrentStatusId = PrelimOrderStatus.Id
INNER JOIN PrelimOrderStatusType on PrelimOrderStatus.StatusTypeId = PrelimOrderStatusType.Id
WHERE
PrelimOrderStatusType.StatusTypeCode = 'Draft'
AND Salesperson.EndDate IS NULL
and the slow execution plan looks like:
The thing that stands out straight away is that the actual number of rows/executions is significantly higher than the respective estimates:
If I remove the Salesperson.EndDate IS NULL clause, then a faster, parallelized execution plan is run:
A similar execution plan also runs quite fast if I remove the DISTINCT keyword.
From what I can gather, it seems that the optimiser decides, based on its incorrect estimates, that the query won't be costly to run and therefore doesn't choose the parallelized plan. But I can't for the life of me figure out why it is choosing the incorrect plan. I have checked my statistics and they are all as they should be. I have tested on both SQL Server 2008 and 2016 with identical results.
SELECT DISTINCT is expensive. So, it is best to avoid it. Something like this:
SELECT sp.*
FROM Salesperson sp
WHERE EXISTS (SELECT 1
              FROM SalesOrder so INNER JOIN
                   PrelimOrder po
                   ON so.Id = po.OrderId INNER JOIN
                   PrelimOrderStatus pos
                   ON po.CurrentStatusId = pos.Id INNER JOIN
                   PrelimOrderStatusType post
                   ON pos.StatusTypeId = post.Id
              WHERE sp.Id = so.SalespersonId AND
                    post.StatusTypeCode = 'Draft'
             ) AND
      sp.EndDate IS NULL;
Note: An index on SalesPerson(EndDate, Id) would be helpful.
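For illustration, that could look something like this (the index name is just a placeholder):
CREATE INDEX ix_salesperson_enddate_id
    ON Salesperson (EndDate, Id);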
As @Gordon Linoff already said, DISTINCT usually is bad news for performance. Often it means you're amassing way too much data and then squeezing it back together in a more compact set. Better to keep it small all throughout the process, if possible.
Also, it's kind of counter-intuitive that the query plan with index scans turns out to be faster than the one with index seeks; it seems (in this case) parallelism makes up for it. You could try playing around with the
Cost Threshold For Parallelism Option but beware that this is a server-wide setting! (then again, in my opinion the default of 5 is rather high for most use-cases I've run into personally; CPU's are aplenty these days, time still isn't =).
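If you do decide to experiment with it, the knob lives in sp_configure; this is only a sketch (25 is an arbitrary value) and it affects the whole instance, so test carefully first:
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 25;
RECONFIGURE;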
Bit of a long reach, but I was wondering if you could 'split' the query in 2, thus eliminating (a small) part of the guesswork of the server. I'm assuming here that StatusTypeCode is unique. (verify the datatype of the variable too!)
DECLARE @StatusTypeId int

SELECT @StatusTypeId = Id
FROM PrelimOrderStatusType
WHERE StatusTypeCode = 'Draft'

SELECT Salesperson.*
FROM Salesperson
WHERE Salesperson.EndDate IS NULL
  AND EXISTS ( SELECT *
               FROM SalesOrder
               JOIN PrelimOrder
                 ON PrelimOrder.OrderId = SalesOrder.Id
               JOIN PrelimOrderStatus
                 ON PrelimOrderStatus.Id = PrelimOrder.CurrentStatusId
                AND PrelimOrderStatus.StatusTypeId = @StatusTypeId
               WHERE SalesOrder.SalespersonId = Salesperson.Id )
If it doesn't help, could you give the definition of the indexes that are being used?

How can I rewrite this query (it includes UNION, WITH and self-join applied for filtering)?

I wrote this view while a deadline was looming.
WITH AllCategories
AS (SELECT CaseTable.CaseID,
           CT.Category,
           CT.CategoryType,
           Q.Note AS CategoryCaseNote,
           Q.CategoryID,
           Q.CategoryIsDefaultValue
    FROM CaseTable
         INNER JOIN
         ((SELECT CaseID,                -- Filled categories in table
                  CategoryCaseNote AS Note,
                  CategoryID,
                  -1 AS QuestionID,
                  0 AS CategoryIsDefaultValue
           FROM CaseCategory)
          UNION ALL
          (SELECT -1 AS CaseID,          -- possible categories
                  NULL AS Note,
                  CategoryID AS CategoryID,
                  QuestionID,
                  1 AS CategoryIsDefaultValue
           FROM SHOW_QuestionCategory)) AS Q
         ON (Q.QuestionID = -1
             OR Q.QuestionID = CaseTransactionTable.QuestionID)
            AND (Q.CaseID = -1
                 OR Q.CaseID = CaseTable.CaseTransactionID)
         LEFT OUTER JOIN
         CategoryTable AS CT
         ON Q.CategoryID = CT.CategoryID)
SELECT A.*
FROM AllCategories AS A
     INNER JOIN
     (SELECT CaseID,
             CategoryID,
             MIN(CategoryIsDefaultValue) AS CategoryIsDefaultValue
      FROM AllCategories
      GROUP BY CaseID, CategoryID) AS B
     ON A.CaseID = B.CaseID
        AND A.CategoryID = B.CategoryID
        AND A.CategoryIsDefaultValue = B.CategoryIsDefaultValue
Now it's becoming a bottleneck because of the very expensive join between CaseTable and the subquery with the UNION (it accounts for over 30% of the cost of a frequently used procedure; in the execution plan it shows up as a nested loops node with ~70% of the cost of the select).
I have tried to rewrite it multiple times, but those attempts only resulted in worse performance.
Table CaseCategory has a unique index on the tuple (CaseID, CategoryID).
It's probably a combination of problems with bad cardinality estimates and the use of a CTE. With what you've told us, I'll try to give some general guidance. The info you provided on the index means little without knowing the cardinality and distribution of the data. BTW, not sure if it would qualify as an answer, but it's too long for a comment. Feel free to downvote :)
There is a stored procedure selecting from the view, am I correct? I also presume you have some WHERE clause somewhere, right?
In that case, get rid of the view altogether and move the code into the procedure. This will allow you to get rid of the CTE (which is most likely executed twice) and to save the intermediate results from what is now the CTE into a #temp table. It could also be beneficial to apply the same #temp-table strategy to the UNION ALL subquery.
Make sure to apply the WHERE predicates as soon as possible (SQL Server is usually good at pushing predicates down, but this combination of proc, view and CTE might confuse it).
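To make the #temp-table idea concrete, here is a rough sketch only; it reuses the names from the question, the temp-table name is made up, and whether it helps depends entirely on the procedure's predicates and data volumes:
-- Materialise what is now the UNION ALL subquery once
SELECT CaseID,
       CategoryCaseNote AS Note,
       CategoryID,
       -1 AS QuestionID,
       0 AS CategoryIsDefaultValue
INTO   #Q
FROM   CaseCategory
UNION ALL
SELECT -1, NULL, CategoryID, QuestionID, 1
FROM   SHOW_QuestionCategory;
-- Then join #Q to CaseTable/CategoryTable (applying the proc's WHERE predicates here),
-- save that result into a second #temp table, and run the final
-- MIN(CategoryIsDefaultValue) grouping against it instead of re-evaluating the CTE.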

Oracle performance issue in getting first row in sub query

I have a performance issue on the following (example) select statement that returns the first row using a sub query:
SELECT ITEM_NUMBER,
       PROJECT_NUMBER,
       NVL((SELECT DISTINCT
                   FIRST_VALUE(L.LOCATION) OVER (ORDER BY L.SORT1, L.SORT2 DESC) LOCATION
            FROM LOCATIONS L
            WHERE L.ITEM_NUMBER = P.ITEM_NUMBER
              AND L.PROJECT_NUMBER = P.PROJECT_NUMBER
           ),
           P.PROJECT_NUMBER) LOCATION
FROM PROJECT P
The DISTINCT is causing the performance issue by performing a SORT and UNIQUE but I can't figure out an alternative.
I would, however, prefer something akin to the following, but referencing the outer table from two levels of nested SELECTs doesn't work:
SELECT ITEM_NUMBER,
       PROJECT_NUMBER,
       NVL((SELECT LOCATION
            FROM (SELECT L.LOCATION LOCATION,
                         ROWNUM RN
                  FROM LOCATIONS L
                  WHERE L.ITEM_NUMBER = P.ITEM_NUMBER
                    AND L.PROJECT_NUMBER = P.PROJECT_NUMBER
                  ORDER BY L.SORT1, L.SORT2 DESC
                 ) R
            WHERE RN <= 1
           ), P.PROJECT_NUMBER) LOCATION
FROM PROJECT P
Additionally:
- My permissions do not allow me to create a function.
- I am cycling through 10k to 100k records in the main query.
- The sub query could return 3 to 7 rows before limiting to 1 row.
Any assistance in improving the performance is appreciated.
It's difficult to understand without sample data and cardinalities, but does this get you what you want? A unique list of projects and items, with the first occurrence of a location?
SELECT P.ITEM_NUMBER,
       P.PROJECT_NUMBER,
       MIN(L.LOCATION) KEEP (DENSE_RANK FIRST ORDER BY L.SORT1, L.SORT2 DESC) LOCATION
FROM LOCATIONS L
     INNER JOIN PROJECT P
        ON L.ITEM_NUMBER = P.ITEM_NUMBER
       AND L.PROJECT_NUMBER = P.PROJECT_NUMBER
GROUP BY P.ITEM_NUMBER,
         P.PROJECT_NUMBER
I encountered a similar problem in the past, and while this is not the ultimate solution (in fact it might just be cutting corners), the Oracle query optimizer can be adjusted with the OPTIMIZER_MODE init param.
Have a look at chapter 11.2.1 on http://docs.oracle.com/cd/B28359_01/server.111/b28274/optimops.htm#i38318
FIRST_ROWS
The optimizer uses a mix of cost and heuristics to find a best plan
for fast delivery of the first few rows. Note: Using heuristics
sometimes leads the query optimizer to generate a plan with a cost
that is significantly larger than the cost of a plan without applying
the heuristic. FIRST_ROWS is available for backward compatibility and
plan stability; use FIRST_ROWS_n instead.
Of course there are tons of other factors you should analyse, like your indexes, join efficiency, query plan, etc.
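For completeness, a sketch of how the mode can be tried out at session or statement level (FIRST_ROWS_10 and the hint value 10 are just illustrative choices):
-- Session level
ALTER SESSION SET OPTIMIZER_MODE = FIRST_ROWS_10;
-- Statement level, via a hint
SELECT /*+ FIRST_ROWS(10) */ ITEM_NUMBER, PROJECT_NUMBER
FROM PROJECT P;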

Using analytics with left join and partition by

I have two different queries which produce the same results. I wonder which one is more efficient. The second one uses one fewer SELECT, but it moves the WHERE to the outer select. Which one is executed first, the left join or the WHERE clause?
Using 3 "selects":
select * from
(
    select * from
    (
        select
            max(t.PRICE_DATETIME) over (partition by t.PRODUCT_ID) as LATEST_SNAPSHOT,
            t.*
        from
            PRICE_TABLE t
    ) a
    where
        a.PRICE_DATETIME = a.LATEST_SNAPSHOT
) r
left join
    PRODUCT_TABLE l on (r.PRODUCT_ID = l.PRODUCT_ID and r.PRICE_DATETIME = l.PRICE_DATETIME)
Using 2 selects:
select * from
(
    select
        max(t.PRICE_DATETIME) over (partition by t.PRODUCT_ID) as LATEST_SNAPSHOT,
        t.*
    from
        PRICE_TABLE t
) r
left join
    PRODUCT_TABLE l on (r.PRODUCT_ID = l.PRODUCT_ID and r.PRICE_DATETIME = l.PRICE_DATETIME)
where
    r.PRICE_DATETIME = r.LATEST_SNAPSHOT;
ps: I know, I know, "select star" is evil, but I'm writing it this way only here to make it smaller.
"I wonder which one is more efficent"
You can answer this question yourself pretty easily by turning on statistics.
set statistics io on
set statistics time on
-- query goes here
set statistics io off
set statistics time off
Do this for each of your two queries and compare the results. You'll get some useful output about how many reads SQL Server is doing, how many milliseconds each takes to complete, etc.
You can also see the execution plan SQL Server generates by viewing the estimated execution plan (ctrl+L, or right-click and choose that option) or by enabling "Display Actual Execution Plan" (ctrl+M) and running the queries. That could help answer the question about order of execution; I couldn't tell you off the top of my head.