Very slow Clustered Index Seek when there's a WHERE clause - sql

I have an important SQL query that is performing too slowly. I pinpointed its performance issues to a view. Here is (roughly) what the view looks like:
Without WHERE clause
-- the 'top 100' isn't part of the view, but I've added it for testing purposes
SELECT top 100
fs.*,
fss.Status,
fss.CreateDateTimeUtc StatusDateTimeUtc,
fss.IsError,
fss.CorrelationId
FROM dbo.FormSubmission fs WITH (NOLOCK)
CROSS APPLY (
SELECT TOP 1
FormId,
SubmissionId,
Status,
CreateDateTimeUtc,
IsError,
CorrelationId
FROM dbo.FormSubmissionStatus x WITH (NOLOCK)
WHERE x.FormId = fs.FormId AND x.SubmissionId = fs.SubmissionId
ORDER BY CreateDateTimeUtc DESC
) fss
If I run this, it's pretty quick. Here are some metrics and the execution plan:
00:00:00.441
Table 'FormSubmissionStatus'. Scan count 102, logical reads 468
Table 'FormSubmission'. Scan count 1, logical reads 4
With WHERE clause
However, as soon as I add this WHERE clause, it gets much slower.
where status in ('Transmitted', 'Acknowledging')
Metrics and execution plan:
00:00:15.1984
Table 'FormSubmissionStatus'. Scan count 4145754, logical reads 17619490
Table 'FormSubmission'. Scan count 1, logical reads 101978
Index attempt
I tried various types of new indexes and I haven't seen any real improvements. Here is an example of one:
create index ix_fss_datetime_formId_submissionId_status
on FormSubmissionStatus (CreateDateTimeUtc) include (formId, submissionId, status)
where status in ('Transmitted', 'Acknowledging')
What else can I try to speed this up?
If it helps to know, the PK for this table is a composite of FormId (uniqueidentifier), SubmissionId (varchar(50)), Status (varchar(50)), and CreateDateTimeUtc (datetime2)
Update
Per @J.Salas's suggestion in the comments, I tried putting the WHERE clause in the subquery and saw a massive improvement (~700ms execution time vs the ~15s).
This isn't a solution, since I can't have that where clause in my view (the query that uses this view adds the WHERE clause). However, it does point to the subquery being a problem. Is there a way I could restructure it? Maybe do the subquery as a temp table and join on fs?
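For reference, this is roughly what that faster test looked like -- the same view query with the filter moved inside the APPLY. Note that it isn't strictly equivalent: it returns the latest row that has one of those statuses, rather than rows whose latest status is one of them.
SELECT top 100
fs.*,
fss.Status,
fss.CreateDateTimeUtc StatusDateTimeUtc,
fss.IsError,
fss.CorrelationId
FROM dbo.FormSubmission fs WITH (NOLOCK)
CROSS APPLY (
SELECT TOP 1
FormId,
SubmissionId,
Status,
CreateDateTimeUtc,
IsError,
CorrelationId
FROM dbo.FormSubmissionStatus x WITH (NOLOCK)
WHERE x.FormId = fs.FormId
AND x.SubmissionId = fs.SubmissionId
AND x.Status IN ('Transmitted', 'Acknowledging') -- filter pushed into the subquery
ORDER BY CreateDateTimeUtc DESC
) fss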

Looking at the query plan, I do not hold much hope that the following will help, but your view query could be reformulated to use a CTE and ROW_NUMBER() instead of CROSS APPLY. I believe the following is equivalent in meaning:
WITH fss AS (SELECT
FormId,
SubmissionId,
Status,
CreateDateTimeUtc,
IsError,
CorrelationId,
ROW_NUMBER() OVER (PARTITION BY FormId, SubmissionId ORDER BY CreateDateTimeUtc DESC) AS RN
FROM dbo.FormSubmissionStatus)
SELECT
fs.*,
fss.Status,
fss.CreateDateTimeUtc StatusDateTimeUtc,
fss.IsError,
fss.CorrelationId
FROM dbo.FormSubmission fs
INNER JOIN fss
ON fss.FormId = fs.FormId
AND fss.SubmissionId = fs.SubmissionId
WHERE fss.RN = 1;
The APPLY operator in your original query says: for every row in fs, run this subquery. Taken literally, that would cause the inner query to run many, many times. However, SQL Server is free to optimize the plan, as long as the results are the same as if the subquery fss were run once per row of fs. So it may not be able to optimize the above any better.
For indexes I would try on (FormId, SubmissionId, CreateDateTimeUtc DESC) maybe with INCLUDE (Status). But really anything besides the FormId, SubmissionId, and CreateDateTimeUtc would depend on how the view is used.
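As a sketch of that index (the name is just an example):
create index ix_fss_formId_submissionId_createDateTime
on dbo.FormSubmissionStatus (FormId, SubmissionId, CreateDateTimeUtc DESC)
include (Status) -- IsError and CorrelationId could be added here too, since the view selects them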
Query tuning is a matter of educated guesses combined with trial and error. To make those guesses better informed, something like Brent Ozar's SQL Server First Responder Kit can help show what is actually happening in production. How to use it is beyond the scope of a single StackOverflow answer.

Related

Avoid full table scan under window function run with use of join condition

Given a database table events with columns: event_id, correlation_id, username, create_timestamp. It contains more than 1M records.
There is a problem I am trying to solve: for each event of a particular user, display its latest sibling event. Sibling events are events which have the same correlation_id. The query I use for that is the following:
SELECT
"events"."event_id" AS "event_id",
"latest"."event_id" AS "latest_event_id"
FROM
events "events"
JOIN (
SELECT
"latest"."correlation_id" AS "correlation_id",
"latest"."event_id" AS "event_id",
ROW_NUMBER () OVER (
PARTITION BY "latest"."correlation_id"
ORDER BY
"latest"."create_timestamp" ASC
) AS "rn"
FROM
events "latest"
) "latest" ON (
"latest"."correlation_id" = "events"."correlation_id"
AND "latest"."rn" = 1
)
WHERE
"events"."username" = 'user1'
It gets the correct list of results but causes performance problems which must be fixed. Here is the execution plan of the query:
Hash Right Join (cost=13538.03..15522.72 rows=1612 width=64)
Hash Cond: (("latest".correlation_id)::text = ("events".correlation_id)::text)
-> Subquery Scan on "latest" (cost=12031.35..13981.87 rows=300 width=70)
Filter: ("latest".rn = 1)
-> WindowAgg (cost=12031.35..13231.67 rows=60016 width=86)
-> Sort (cost=12031.35..12181.39 rows=60016 width=78)
Sort Key: "latest_1".correlation_id, "latest_1".create_timestamp
-> Seq Scan on events "latest_1" (cost=0.00..7268.16 rows=60016 width=78)
-> Hash (cost=1486.53..1486.53 rows=1612 width=70)
-> Index Scan using events_username on events "events" (cost=0.41..1486.53 rows=1612 width=70)
Index Cond: ((username)::text = 'user1'::text)
From the plan, I can conclude that the performance problem is mainly caused by calculating the latest events for ALL events in the table, which takes ~80% of the cost. It also performs that calculation even if there are no events for the user at all. Ideally, I would like the query to do these steps, which seem more efficient to me:
find all events by a user
for each event from Step 1, find all siblings, sort them and get the 1st
To simplify the discussion, let's consider all required indexes as already created for needed columns. It doesn't seem to me that the problem can be solved purely by index creation.
Any ideas what can be done to improve the performance? Probably, there are options to rewrite the query or adjust a configuration of the table.
Note that this question is significantly obfuscated in terms of business meaning to clearly demonstrate the technical problem I face.
The window function has to scan the whole table. It has no idea that really you are only interested in the first value. A lateral join could perform better and is more readable anyway:
SELECT
e.event_id,
latest.latest_event_id
FROM
events AS e
CROSS JOIN LATERAL
(SELECT
l.event_id AS latest_event_id
FROM
events AS l
WHERE
l.correlation_id = e.correlation_id
ORDER BY l.create_timestamp
FETCH FIRST 1 ROWS ONLY
) AS latest
WHERE e.username = 'user1';
The perfect index to support that would be
CREATE INDEX ON event (correlation_id, create_timestamp);
All those needless double quotes are making my eyes bleed.
This should be very fast with a lateral join, provided the number of returned rows is rather low, i.e. 'user1' is rather specific.
explain analyze SELECT
events.event_id AS event_id,
latest.event_id AS latest_event_id
FROM
events "events"
cross JOIN lateral (
SELECT
latest.event_id AS event_id
FROM events latest
WHERE latest.correlation_id=events.correlation_id
ORDER by create_timestamp ASC limit 1
) latest
WHERE
events.username = 'user1';
You will want an index on username, and one on (correlation_id, create_timestamp)
If the number of rows returned is large, then your current query, which precomputes in bulk, is probably better. But it would be faster if you used DISTINCT ON rather than the window function to pull out just the latest row for each correlation_id. Unfortunately the planner does not understand these queries to be equivalent, and so will not interconvert between them based on what it thinks will be faster.
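A sketch of that DISTINCT ON variant of the bulk precompute, using the same tables and ordering as your current query:
SELECT
e.event_id,
latest.event_id AS latest_event_id
FROM events e
JOIN (
SELECT DISTINCT ON (correlation_id)
correlation_id,
event_id
FROM events
ORDER BY correlation_id, create_timestamp -- one row per correlation_id, same ordering as rn = 1
) latest ON latest.correlation_id = e.correlation_id
WHERE e.username = 'user1';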
One option that may improve efficiency is to rewrite the query filtering "rn" = 1 beforehand to reduce resulting rows when joining tables:
WITH "latestCte"("correlation_id", "event_id") as (SELECT
"correlation_id",
"event_id",
ROW_NUMBER () OVER (
PARTITION BY "correlation_id"
ORDER BY
"create_timestamp" ASC
) AS "rn"
FROM
events)
SELECT
"events"."event_id" AS "event_id",
"latest"."event_id" AS "latest_event_id"
FROM
events "events"
JOIN (
SELECT "correlation_id", "event_id" FROM "latestCte" WHERE "rn" = 1
) "latest" ON (
"latest"."correlation_id" = "events"."correlation_id"
)
WHERE
"events"."username" = 'user1'
Hope it helps, also I am curious to see the resulting execution plan of this query. Best regards.
Without having access to the data, I'm really just throwing out ideas...
Instead of a subquery, it's worth trying a materialized CTE
Rather than the row_number analytic, you can try a distinct on. Honestly, I don't predict any gains. It's basically the same thing at the database level
Sample of both:
with latest as materialized (
SELECT distinct on ("correlation_id")
"correlation_id", "event_id"
FROM events
order by
"correlation_id", "create_timestamp" desc
)
SELECT
e."event_id",
l."event_id" AS "latest_event_id"
FROM
events "events" e
join latest l ON
l."correlation_id" = e."correlation_id"
WHERE
e."username" = 'user1'
Additional suggestion -- if you are doing this over and over, I'd consider creating a temp table or materialized view for "latest," including an index on correlation_id, rather than re-running the subquery (or CTE) every single time. This will be a one-time pain followed by repeated gain.
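A minimal sketch of the materialized view variant, assuming the columns above (the view name is made up):
CREATE MATERIALIZED VIEW latest_events AS
SELECT DISTINCT ON (correlation_id)
correlation_id,
event_id AS latest_event_id
FROM events
ORDER BY correlation_id, create_timestamp DESC;

CREATE INDEX ON latest_events (correlation_id);

-- refresh whenever new events arrive and staleness matters
REFRESH MATERIALIZED VIEW latest_events;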
Yet one more suggestion -- if at all possible, drop the double quotes from your object names. Maybe it's just me, but I find them brutal. Unless you have spaces, reserved words or mandatory uppercase in your field names (please don't do that), then these create more problems than they solve. I kept them in the query I listed above, but it pained me.
And this last comment goes back to knowing your data... since row_number and distinct on are relatively expensive operations, it may make sense to make your subquery/cte more selective by introducing the "user1" constraint. This is completely untested, but something like this:
SELECT distinct on (e1.correlation_id)
e1.correlation_id, e1.event_id
FROM events e1
join events e2 on
e1.correlation_id = e2.correlation_id and
e2.username = 'user1'
order by
e1.correlation_id, e1.create_timestamp desc
Although I like the LATERAL JOIN approach suggested by others, when it comes to fetching just one field I'm 50/50 between using that and using a subquery as below. (If you need to fetch multiple fields using the same logic then by all means LATERAL is the way to go!)
I wonder if either would perform better, presumably they are executed in a very similar way by the SQL engine.
SELECT e.event_id,
(SELECT l.event_id
FROM events AS l
WHERE l.correlation_id = e.correlation_id
ORDER BY l.create_timestamp ASC -- shouldn't this be DESC?
FETCH FIRST 1 ROWS ONLY) as latest_event_id
FROM events AS e
WHERE e.username = 'user1';
Note: You're currently asking for the OLDEST correlated record. In your post you say you're looking for the "latest sibling event". "Latest" IMHO implies the most recent one, so it would have the biggest create_timestamp, meaning you need to ORDER BY that field from high to low and then take the first one.
Edit: identical to what was suggested above -- for this approach you also want the index on correlation_id and create_timestamp
CREATE INDEX ON event (correlation_id, create_timestamp);
You might even want to include the event_id to avoid a bookmark lookup, although these pages are likely to be in cache anyway, so I'm not sure it will really help all that much.
CREATE INDEX ON event (correlation_id, create_timestamp, event_id);
PS: the same is true about adding correlation_id to your events_username index... but that's all quite geared towards this (probably simplified) query, and do keep in mind that more (and bigger) indexes bring overhead on writes and storage even when they bring big benefits to reads... it's always a compromise.
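For example, something along these lines (the new index name is made up; events_username is the one from the plan above):
-- widens the username index so the join column is available without extra heap fetches
CREATE INDEX events_username_correlation_id ON events (username, correlation_id);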

Query not using indexes when using Table function

I have following query:
select i.pkey as instrument_pkey,
p.asof,
p.price,
p.lastprice as lastprice,
p.settlementprice as settlement_price,
p.snaptime,
p.owner as source_id,
i.type as instrument_type
from quotes_mxsequities p,
instruments i,
(select instrument, maxbackdays
from TABLE(cast (:H_ARRAY as R_TAB))) lbd
where p.asof between :ASOF - lbd.maxbackdays and :ASOF
and p.instrument = lbd.instrument
and p.owner = :SOURCE_ID
and p.instrument = i.pkey
Since I started using the table function, the query has started doing a full table scan on table quotes_mxsequities, which is a large table.
Earlier, when I used an IN clause instead of the table function, the index was being used.
Any suggestion on how to enforce index usage?
EDIT:
I will try to get the explain plan, but just to add: H_ARRAY is expected to have around 10k entries. quotes_mxsequities is a large table with millions of rows. Instruments is again a large table but has fewer rows than quotes_mxsequities.
The full table scan is happening on quotes_mxsequities while instruments is using an index.
It is quite difficult to answer with no explain plan and no information about table structure, number of rows, etc.
As a general, simplified approach, you could try to force the use of an index with the INDEX hint.
Your problem could even be due to a wrong order in table processing; you can try to make Oracle follow the right order (I suppose LBD first) with the LEADING hint.
Another point could be the full table access where you probably need a NESTED LOOP; in that case you can try the USE_NL hint.
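A rough sketch of how those hints might be combined on the original query -- which hints actually help depends on the plan, so treat this only as a starting point:
select /*+ LEADING(lbd) USE_NL(p) INDEX(p) */
       i.pkey as instrument_pkey,
       p.asof,
       p.price,
       p.lastprice as lastprice,
       p.settlementprice as settlement_price,
       p.snaptime,
       p.owner as source_id,
       i.type as instrument_type
from   (select instrument, maxbackdays
        from TABLE(cast (:H_ARRAY as R_TAB))) lbd,
       quotes_mxsequities p,
       instruments i
where  p.asof between :ASOF - lbd.maxbackdays and :ASOF
and    p.instrument = lbd.instrument
and    p.owner = :SOURCE_ID
and    p.instrument = i.pkey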
It's hard to be sure from the limited information provided, but it looks like this is an issue with the optimiser not being able to establish the cardinality of the table collection expression, since its contents aren't known at parse time. With a stored nested table the statistics would be available, but here there are none for it to use.
Without that information the optimiser defaults to guessing your table collection will have 8K entries, and uses that as the cardinality estimate; if that is a significant proportion of the number of rows in quotes_mxsequities then it will decide the index isn't going to be efficient, and will use a full table scan.
You can use the undocumented cardinality hint to tell the optimiser roughly how many elements you actually expect in the collection; you presumably won't know exactly, but you might know you usually expect around 10. So you could add a hint:
select /*+ CARDINALITY(lbd, 10) */ i.pkey as instrument_pkey,
You may also find the dynamic sampling hint useful here, but without your real data to look at it's hard to say; the cardinality hint affects the basic execution plan directly, so its effect is easy to see.
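For reference, a dynamic sampling hint would go in the same place (the level 2 here is just an example value):
select /*+ DYNAMIC_SAMPLING(lbd 2) */ i.pkey as instrument_pkey,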
Incidentally, you don't need the subquery on the table expression, you can simplify slightly to:
from TABLE(cast (:H_ARRAY as R_TAB)) lbd,
quotes_mxsequities p,
instruments i
or even better use modern join syntax:
select /*+ CARDINALITY(lbd, 10) */ i.pkey as instrument_pkey,
p.asof,
p.price,
p.lastprice as lastprice,
p.settlementprice as settlement_price,
p.snaptime,
p.owner as source_id,
i.type as instrument_type
from TABLE(cast (:H_ARRAY as R_TAB)) lbd
join quotes_mxsequities p
on p.asof between :ASOF - lbd.maxbackdays and :ASOF
and p.instrument = lbd.instrument
join instruments i
on i.pkey = p.instrument
where p.owner = :SOURCE_ID;

dense_rank filling up tempdb on SQL server?

I've got this query here which uses dense_rank to number groups in order to select the first group only. It is working but it's slow, and tempdb (SQL Server) becomes so big that the disk fills up. Is it normal for dense_rank to be such a heavy operation? And how else should this be done without resorting to coding?
select
a,b,c,d
from
(select a,b,c,d,
dense_rank() over (order by s.[time] desc) as gn
from [Order] o
JOIN Scan s ON s.OrderId = o.OrderId
JOIN PriceDetail p ON p.ScanId = s.ScanId) as p
where p.OrderNumber = @OrderNumber
and p.Number = @Number
and p.Time > getdate() - 20
and p.gn = 1
group by a,b,c,d,p.gn
Any operation that has to sort a large dataset may fill tempdb. dense_rank is no exception, just like rank, row_number, ntile etc etc.
You are asking for what appears to be a global, complete sort of every scan entry in the database. The way you expressed the query, the join must occur before the sort, so the sort will be both big and wide. After all is said and done -- consuming a lot of IO, CPU and tempdb space -- you restrict the result to a small subset for only a specified order and some conditions (which mention columns not present in the projection, so this must be a made-up example, not the real code).
You have a filter on WHERE gn=1 followed by a GROUP BY gn. This is unnecessary: gn is already fixed to a single value by the predicate, so it adds nothing to the group by.
You compute the dense_rank over every order scan and then you filter by p.OrderNumber = @OrderNumber AND p.gn = 1. This makes even less sense. This query will only return results if the @OrderNumber happens to contain the scan with rank 1 over all orders! It cannot possibly be correct.
Your query makes no sense. The fact that it is slow is just a bonus. Post your actual requirements.
If you want to learn about performance investigation, read How to analyse SQL Server performance.
PS. As a rule, computing ranks and selecting =1 can always be expressed as a TOP(1) correlated subquery, with usually much better results. Indexes help, obviously.
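As a very rough illustration of that pattern on the (made-up) tables from your query -- the real shape obviously depends on your actual requirements -- the latest scan per order could be fetched like this:
-- @OrderNumber and OrderNumber are the hypothetical parameter/column from the question
select o.OrderId, latest.ScanId, latest.[time]
from [Order] o
cross apply (
    select top (1) s.ScanId, s.[time]
    from Scan s
    where s.OrderId = o.OrderId
    order by s.[time] desc
) latest
where o.OrderNumber = @OrderNumber;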
PPS. Use of group by without any aggregate function is yet another serious code smell.

Oracle performance issue in getting first row in sub query

I have a performance issue on the following (example) select statement that returns the first row using a sub query:
SELECT ITEM_NUMBER,
PROJECT_NUMBER,
NVL((SELECT DISTINCT
FIRST_VALUE(L.LOCATION) OVER (ORDER BY L.SORT1, L.SORT2 DESC) LOCATION
FROM LOCATIONS L
WHERE L.ITEM_NUMBER=P.ITEM_NUMBER
AND L.PROJECT_NUMBER=P.PROJECT_NUMBER
),
P.PROJECT_NUMBER) LOCATION
FROM PROJECT P
The DISTINCT is causing the performance issue by performing a SORT and UNIQUE but I can't figure out an alternative.
I would however prefer something akin to the following, but referencing P from two levels of nested select statements doesn't work:
SELECT ITEM_NUMBER,
PROJECT_NUMBER,
NVL((SELECT LOCATION
FROM (SELECT L.LOCATION LOCATION,
ROWNUM RN
FROM LOCATIONS L
WHERE L.ITEM_NUMBER=P.ITEM_NUMBER
AND L.PROJECT_NUMBER=P.PROJECT_NUMBER
ORDER BY L.SORT1, L.SORT2 DESC
) R
WHERE RN <=1
), P.PROJECT_NUMBER) LOCATION
FROM PROJECT P
Additionally:
- My permissions do not allow me to create a function.
- I am cycling through 10k to 100k records in the main query.
- The sub query could return 3 to 7 rows before limiting to 1 row.
Any assistance in improving the performance is appreciated.
It's difficult to understand without sample data and cardinalities, but does this get you what you want? A unique list of projects and items, with the first occurrence of a location?
SELECT
P.ITEM_NUMBER,
P.PROJECT_NUMBER,
MIN(L.LOCATION) KEEP (DENSE_RANK FIRST ORDER BY L.SORT1, L.SORT2 DESC) LOCATION
FROM
LOCATIONS L
INNER JOIN
PROJECT P
ON L.ITEM_NUMBER=P.ITEM_NUMBER
AND L.PROJECT_NUMBER=P.PROJECT_NUMBER
GROUP BY
P.ITEM_NUMBER,
P.PROJECT_NUMBER
I encountered a similar problem in the past -- and while this is not the ultimate solution (in fact it might just be a corner cut) -- the Oracle query optimizer can be adjusted with the OPTIMIZER_MODE init param.
Have a look at chapter 11.2.1 on http://docs.oracle.com/cd/B28359_01/server.111/b28274/optimops.htm#i38318
FIRST_ROWS
The optimizer uses a mix of cost and heuristics to find a best plan
for fast delivery of the first few rows. Note: Using heuristics
sometimes leads the query optimizer to generate a plan with a cost
that is significantly larger than the cost of a plan without applying
the heuristic. FIRST_ROWS is available for backward compatibility and
plan stability; use FIRST_ROWS_n instead.
Of course there are tons of other factors you should analyse, like your indexes, join efficiency, query plan, etc.
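For example, it can be tried at session level, or per statement with a hint (the value 10 is just an example):
ALTER SESSION SET OPTIMIZER_MODE = FIRST_ROWS_10;
-- or per statement, by adding a /*+ FIRST_ROWS(10) */ hint to the main SELECT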

Avoiding a nested subquery in SQL

I have a SQL table that contains data of the form:
Id int
EventTime dateTime
CurrentValue int
The table may have multiple rows for a given id that represent changes to the value over time (the EventTime identifying the time at which the value changed).
Given a specific point in time, I would like to be able to calculate the count of distinct Ids for each given Value.
Right now, I am using a nested subquery and a temporary table, but it seems it could be much more efficient.
SELECT [Id],
(
SELECT
TOP 1 [CurrentValue]
FROM [ValueHistory]
WHERE [Ids].[Id]=[ValueHistory].[Id] AND
[EventTime] < @StartTime
ORDER BY [EventTime] DESC
) as [LastValue]
INTO #temp
FROM [Ids]
SELECT [LastValue], COUNT([LastValue])
FROM #temp
GROUP BY [LastValue]
DROP TABLE #temp
Here is my first go:
select ids.Id, count( distinct currentvalue)
from ids
join valuehistory vh on ids.id = vh.id
where vh.eventtime < @StartTime
group by ids.id
However, I am not sure I understand your table model very clearly, or the specific question you are trying to solve.
This would be: the distinct 'currentvalues' from valuehistory before a certain date, for each Id.
Is that what you are looking for?
I think I understand your question.
You want to get the most recent value for each id, group by that value, and then see how many ids have that same value? Is this correct?
If so, here's my first shot:
declare @StartTime datetime
set @StartTime = '20090513'
select ValueHistory.CurrentValue, count(ValueHistory.id)
from
(
select id, max(EventTime) as LatestUpdateTime
from ValueHistory
where EventTime < @StartTime
group by id
) CurrentValues
inner join ValueHistory on CurrentValues.id = ValueHistory.id
and CurrentValues.LatestUpdateTime = ValueHistory.EventTime
group by ValueHistory.CurrentValue
No guarantee that this is actually faster though - for this to work with any decent speed you'll need an index on EventTime.
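Something along these lines, for example (the name and the INCLUDE list are just suggestions):
create index IX_ValueHistory_EventTime
on ValueHistory (EventTime)
include (Id, CurrentValue);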
Let us keep in mind that, because the SQL language describes what you want and not how to get it, there are many ways of expressing a query that will eventually be turned into the same query execution plan by a good query optimizer. Of course, the level of "good" depends on the database you're using.
In general, subqueries are just a syntactically different way of describing joins. The query optimizer is going to recognize this and determine the most optimal way, to the best of its knowledge, to execute the query. Temporary tables may be created as needed. So in many cases, re-working the query is going to do nothing for your actual execution time -- it may come out to the same query execution plan in the end.
If you're going to attempt to optimize, you need to examine the query plan by doing a describe on that query. Make sure it's not doing full-table scans against large tables, and is picking the appropriate indices where possible. If, and only if, it is making sub-optimal choices here, should you attempt to manually optimize the query.
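For example, since this looks like SQL Server (given the #temp table and @StartTime variable), a minimal way to see the IO and timing alongside the actual plan is:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run the query here, then check the Messages tab and the actual execution plan
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;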
Now, having said all that, the query you pasted isn't entirely compatible with your stated goal of "calculat[ing] the count of distinct Ids for each given Value". So forgive me if I don't quite answer your need, but here's something to perf-test against your current query. (Syntax is approximate, sorry -- away from my desk).
SELECT ids.[Id], vh1.[CurrentValue], COUNT(vh2.[CurrentValue]) FROM
[IDs] as ids JOIN [ValueHistory] AS vh1 ON ids.[Id]=vh1.[Id]
JOIN [ValueHistory] AS vh2 ON vh1.[CurrentValue]=vh2.[CurrentValue]
GROUP BY ids.[Id], vh1.[CurrentValue];
Note that you'll probably see better performance increases by adding indices to make those joins optimal than by re-working the query, assuming you're willing to take the performance hit to update operations.