T-SQL Query Optimization

I'm working on some upgrades to an internal web analytics system we provide for our clients (in the absence of a preferred vendor or Google Analytics), and I'm working on the following query:
select
path as EntryPage,
count(Path) as [Count]
from
(
/* Sub-query 1 */
select
pv2.path
from
pageviews pv2
inner join
(
/* Sub-query 2 */
select
pv1.sessionid,
min(pv1.created) as created
from
pageviews pv1
inner join Sessions s1 on pv1.SessionID = s1.SessionID
inner join Visitors v1 on s1.VisitorID = v1.VisitorID
where
pv1.Domain = isnull(@Domain, pv1.Domain) and
v1.Campaign = @Campaign
group by
pv1.sessionid
) t1 on pv2.sessionid = t1.sessionid and pv2.created = t1.created
) t2
group by
Path;
I've tested this query with 2 million rows in the PageViews table and it takes about 20 seconds to run. I'm noticing a clustered index scan twice in the execution plan, both times it hits the PageViews table. There is a clustered index on the Created column in that table.
The problem is that in both cases it appears to iterate over all 2 million rows, which I believe is the performance bottleneck. Is there anything I can do to prevent this, or am I pretty much maxed out as far as optimization goes?
For reference, the purpose of the query is to find the first page view for each session.
EDIT: After much frustration, despite the help received here, I could not make this query work. Therefore, I decided to simply store a reference to the entry page (and now exit page) in the sessions table, which allows me to do the following:
select
pv.Path,
count(*)
from
PageViews pv
inner join Sessions s on pv.SessionID = s.SessionID
and pv.PageViewID = s.ExitPage
inner join Visitors v on s.VisitorID = v.VisitorID
where
(
@Domain is null or
pv.Domain = @Domain
) and
v.Campaign = @Campaign
group by pv.Path;
This query runs in 3 seconds or less. Now I either have to update the entry/exit page in real time as the page views are recorded (the optimal solution) or run a batch update at some interval. Either way, it solves the problem, but not like I'd intended.
Edit Edit: Adding a missing index (after cleaning up from last night) reduced the query to mere milliseconds. Woo hoo!

For starters,
where pv1.Domain = isnull(@Domain, pv1.Domain)
won't SARG. You can't optimize a match on a function, as I remember.

I'm back. To answer your first question, you could probably just do a union on the two conditions, since they are obviously disjoint.
Actually, you're trying to cover both the case where you provide a domain, and where you don't. You want two queries. They may optimize entirely differently.
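For illustration, here's a minimal sketch of that two-query idea against the tables from the question, branching on @Domain so each case can get its own plan (adding OPTION (RECOMPILE) to the single combined query is another common way to get a similar effect):
IF @Domain IS NULL
    SELECT pv1.SessionID, MIN(pv1.Created) AS Created
    FROM PageViews pv1
    INNER JOIN Sessions s1 ON pv1.SessionID = s1.SessionID
    INNER JOIN Visitors v1 ON s1.VisitorID = v1.VisitorID
    WHERE v1.Campaign = @Campaign
    GROUP BY pv1.SessionID;
ELSE
    SELECT pv1.SessionID, MIN(pv1.Created) AS Created
    FROM PageViews pv1
    INNER JOIN Sessions s1 ON pv1.SessionID = s1.SessionID
    INNER JOIN Visitors v1 ON s1.VisitorID = v1.VisitorID
    WHERE pv1.Domain = @Domain AND v1.Campaign = @Campaign
    GROUP BY pv1.SessionID;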

What's the nature of the data in these tables? Do you find most of the data is inserted/deleted regularly?
Is that the full schema for the tables? The query plan shows different indexing..
Edit: Sorry, just read the last line of text. I'd suggest that if the tables are routinely cleared/inserted, you could think about ditching the clustered index and using the tables as heap tables... just a thought.
Definitely should put non-clustered index(es) on Campaign, Domain as John suggested

Your inner query (pv1) will require a nonclustered index on (Domain).
The second query (pv2) can already find the rows it needs due to the clustered index on Created, but pv1 might be returning so many rows that SQL Server decides that a table scan is quicker than all the locks it would need to take. As pv1 groups on SessionID (and hence has to order by SessionID), a nonclustered index on SessionID, Created, and including path should permit a MERGE join to occur. If not, you can force a merge join with "SELECT .. FROM pageviews pv2 INNER MERGE JOIN ..."
The two indexes listed above will be:
CREATE NONCLUSTERED INDEX ncixcampaigndomain ON PageViews (Domain)
CREATE NONCLUSTERED INDEX ncixsessionidcreated ON PageViews(SessionID, Created) INCLUDE (path)
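To make the INNER MERGE JOIN suggestion concrete, here is a hedged sketch of the inner part of the original query with the hint applied (same tables and columns as the question; whether the optimizer honours it will depend on the indexes above being in place):
SELECT pv2.Path
FROM PageViews pv2
INNER MERGE JOIN
(
    SELECT pv1.SessionID, MIN(pv1.Created) AS Created
    FROM PageViews pv1
    INNER JOIN Sessions s1 ON pv1.SessionID = s1.SessionID
    INNER JOIN Visitors v1 ON s1.VisitorID = v1.VisitorID
    WHERE pv1.Domain = ISNULL(@Domain, pv1.Domain)
      AND v1.Campaign = @Campaign
    GROUP BY pv1.SessionID
) t1 ON pv2.SessionID = t1.SessionID AND pv2.Created = t1.Created;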

SELECT
sessionid,
MIN(created) AS created
FROM
pageviews pv
JOIN
visitors v ON pv.visitorid = v.visitorid
WHERE
v.campaign = @Campaign
GROUP BY
sessionid
so that gives you the sessions for a campaign. Now let's see what you're doing with that.
OK, this gets rid of your grouping:
SELECT
campaignid,
sessionid,
pv.path
FROM
pageviews pv
JOIN
visitors v ON pv.visitorid = v.visitorid
WHERE
v.campaign = @Campaign
AND NOT EXISTS (
SELECT 1 FROM pageviews
WHERE sessionid = pv.sessionid
AND created < pv.created
)
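Putting the NOT EXISTS idea back into the OP's schema (where Campaign is reached via Sessions and Visitors) and adding the outer aggregation, the full entry-page count might look roughly like this -- a sketch, not a tested replacement:
SELECT pv.Path AS EntryPage, COUNT(*) AS [Count]
FROM PageViews pv
INNER JOIN Sessions s ON pv.SessionID = s.SessionID
INNER JOIN Visitors v ON s.VisitorID = v.VisitorID
WHERE v.Campaign = @Campaign
AND (@Domain IS NULL OR pv.Domain = @Domain)
AND NOT EXISTS (
    SELECT 1 FROM PageViews earlier
    WHERE earlier.SessionID = pv.SessionID
    AND earlier.Created < pv.Created
)
GROUP BY pv.Path;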

To continue from doofledorf.
Try this:
where
(@Domain is null or pv1.Domain = @Domain) and
v1.Campaign = @Campaign
Ok, I have a couple of suggestions
Create this covered index:
create index idx2 on [PageViews]([SessionID], Domain, Created, Path)
If you can amend the Sessions table so that it stores the entry page, e.g. EntryPageViewID, you will be able to heavily optimise this.
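A rough sketch of that change, using the EntryPageViewID name from the suggestion; the ALTER and backfill statements below are illustrative only and assume PageViews has a PageViewID key, as in the OP's later query:
ALTER TABLE Sessions ADD EntryPageViewID int NULL;

-- one-off (or periodic) backfill of the entry page for each session
UPDATE s
SET s.EntryPageViewID = pv.PageViewID
FROM Sessions s
INNER JOIN PageViews pv ON pv.SessionID = s.SessionID
WHERE pv.Created = (SELECT MIN(pv2.Created)
                    FROM PageViews pv2
                    WHERE pv2.SessionID = s.SessionID);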


Avoid full table scan under window function run with use of join condition

Given a database table events with columns: event_id, correlation_id, username, create_timestamp. It contains more than 1M records.
There is a problem I am trying to solve: for each event of a particular user, display its latest sibling event. Sibling events are events which have the same correlation_id. The query I use for that is the following:
SELECT
"events"."event_id" AS "event_id",
"latest"."event_id" AS "latest_event_id"
FROM
events "events"
JOIN (
SELECT
"latest"."correlation_id" AS "correlation_id",
"latest"."event_id" AS "event_id",
ROW_NUMBER () OVER (
PARTITION BY "latest"."correlation_id"
ORDER BY
"latest"."create_timestamp" ASC
) AS "rn"
FROM
events "latest"
) "latest" ON (
"latest"."correlation_id" = "events"."correlation_id"
AND "latest"."rn" = 1
)
WHERE
"events"."username" = 'user1'
It gets the correct list of results but causes performance problems which must be fixed. Here is the execution plan of the query:
Hash Right Join (cost=13538.03..15522.72 rows=1612 width=64)
Hash Cond: (("latest".correlation_id)::text = ("events".correlation_id)::text)
-> Subquery Scan on "latest" (cost=12031.35..13981.87 rows=300 width=70)
Filter: ("latest".rn = 1)
-> WindowAgg (cost=12031.35..13231.67 rows=60016 width=86)
-> Sort (cost=12031.35..12181.39 rows=60016 width=78)
Sort Key: "latest_1".correlation_id, "latest_1".create_timestamp
-> Seq Scan on events "latest_1" (cost=0.00..7268.16 rows=60016 width=78)
-> Hash (cost=1486.53..1486.53 rows=1612 width=70)
-> Index Scan using events_username on events "events" (cost=0.41..1486.53 rows=1612 width=70)
Index Cond: ((username)::text = 'user1'::text)
From the plan, I can conclude that the performance problem is mainly caused by the calculation of latest events for ALL events in the table, which takes ~80% of the cost. It also performs the calculations even if there are no events for the user at all. Ideally, I would like the query to do these steps, which seem more efficient to me:
find all events by a user
for each event from Step 1, find all siblings, sort them and get the 1st
To simplify the discussion, let's consider all required indexes as already created for needed columns. It doesn't seem to me that the problem can be solved purely by index creation.
Any ideas what can be done to improve the performance? Probably, there are options to rewrite the query or adjust a configuration of the table.
Note that this question is significantly obfuscated in terms of business meaning to clearly demonstrate the technical problem I face.
The window function has to scan the whole table. It has no idea that really you are only interested in the first value. A lateral join could perform better and is more readable anyway:
SELECT
e.event_id,
latest.latest_event_id
FROM
events AS e
CROSS JOIN LATERAL
(SELECT
l.event_id AS latest_event_id
FROM
events AS l
WHERE
l.correlation_id = e.correlation_id
ORDER BY l.create_timestamp
FETCH FIRST 1 ROWS ONLY
) AS latest
WHERE e.username = 'user1';
The perfect index to support that would be
CREATE INDEX ON events (correlation_id, create_timestamp);
All those needless double quotes are making my eyes bleed.
This should be very fast with a lateral join, provided the number of returned rows is rather low, i.e. 'user1' is rather specific.
explain analyze SELECT
events.event_id AS event_id,
latest.event_id AS latest_event_id
FROM
events "events"
cross JOIN lateral (
SELECT
latest.event_id AS event_id
FROM events latest
WHERE latest.correlation_id=events.correlation_id
ORDER by create_timestamp ASC limit 1
) latest
WHERE
events.username = 'user1';
You will want an index on username, and one on (correlation_id, create_timestamp)
If the number of rows returned is large, then your current query, which precomputes in bulk, is probably better. But it would be faster if you used DISTINCT ON rather than the window function to pull out just the latest row for each correlation_id. Unfortunately the planner does not understand these queries to be equivalent, and so will not interconvert between them based on what it thinks will be faster.
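For reference, a sketch of what that DISTINCT ON variant of the bulk precompute could look like (same tables as the question; ASC ordering to match the original rn = 1 semantics):
SELECT e.event_id,
       latest.event_id AS latest_event_id
FROM events e
JOIN (
    SELECT DISTINCT ON (correlation_id) correlation_id, event_id
    FROM events
    ORDER BY correlation_id, create_timestamp ASC
) latest ON latest.correlation_id = e.correlation_id
WHERE e.username = 'user1';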
One option that may improve efficiency is to rewrite the query to filter "rn" = 1 beforehand, reducing the resulting rows before joining the tables:
WITH "latestCte"("correlation_id", "event_id") as (SELECT
"correlation_id",
"event_id",
ROW_NUMBER () OVER (
PARTITION BY "correlation_id"
ORDER BY
"create_timestamp" ASC
) AS "rn"
FROM
events)
SELECT
"events"."event_id" AS "event_id",
"latest"."event_id" AS "latest_event_id"
FROM
events "events"
JOIN (
SELECT "correlation_id", "event_id" FROM "latestCte" WHERE "rn" = 1
) "latest" ON (
"latest"."correlation_id" = "events"."correlation_id"
)
WHERE
"events"."username" = 'user1'
Hope it helps, also I am curious to see the resulting execution plan of this query. Best regards.
Without having access to the data, I'm really just throwing out ideas...
Instead of a subquery, it's worth trying a materialized CTE
Rather than the row_number analytic, you can try a distinct on. Honestly, I don't predict any gains. It's basically the same thing at the database level
Sample of both:
with latest as materialized (
SELECT distinct on ("correlation_id")
"correlation_id", "event_id"
FROM events
order by
"correlation_id", "create_timestamp" desc
)
SELECT
e."event_id",
l."event_id" AS "latest_event_id"
FROM
events "events" e
join latest l ON
l."correlation_id" = e."correlation_id"
WHERE
e."username" = 'user1'
Additional suggestion -- if you are doing this over and over, I'd consider creating a temp table or materialized view for "latest," including an index on correlation_id, rather than re-running the subquery (or CTE) every single time. This will be a one-time pain followed by repeated gain.
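A hedged sketch of that materialized-view idea (the object name below is made up; refresh it on whatever schedule suits your data):
CREATE MATERIALIZED VIEW latest_event_per_correlation AS
SELECT DISTINCT ON (correlation_id) correlation_id, event_id
FROM events
ORDER BY correlation_id, create_timestamp DESC;

CREATE INDEX ON latest_event_per_correlation (correlation_id);

-- bring it up to date whenever needed
REFRESH MATERIALIZED VIEW latest_event_per_correlation;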
Yet one more suggestion -- if at all possible, drop the double quotes from your object names. Maybe it's just me, but I find them brutal. Unless you have spaces, reserved words or mandatory uppercase in your field names (please don't do that), then these create more problems than they solve. I kept them in the query I listed above, but it pained me.
And this last comment goes back to knowing your data... since row_number and distinct on are relatively expensive operations, it may make sense to make your subquery/cte more selective by introducing the "user1" constraint. This is completely untested, but something like this:
SELECT distinct on (e1.correlation_id)
e1.correlation_id, e1.event_id
FROM events e1
join events e2 on
e1.correlation_id = e2.correlation_id and
e2.username = 'user1'
order by
e1.correlation_id, e1.create_timestamp desc
Although I like the LATERAL JOIN approach suggested by others, when it comes to fetching just 1 field I'm 50/50 about using that versus using a subquery as below. (If you need to fetch multiple fields using the same logic then by all means LATERAL is the way to go!)
I wonder if either would perform better, presumably they are executed in a very similar way by the SQL engine.
SELECT e.event_id,
(SELECT l.event_id
FROM events AS l
WHERE l.correlation_id = e.correlation_id
ORDER BY l.create_timestamp ASC -- shouldn't this be DESC?
FETCH FIRST 1 ROWS ONLY) as latest_event_id
FROM events AS e
WHERE e.username = 'user1';
Note: You're currently asking for the OLDEST correlated record. In your post you say you're looking for the "latest sibling event". "Latest" IMHO implies the most recent one, so it would have the biggest create_timestamp, meaning you need to ORDER BY that field from high to low and then take the first one.
Edit: identical to what was suggested above; for this approach you also want the index on correlation_id and create_timestamp
CREATE INDEX ON events (correlation_id, create_timestamp);
You might even want to include the event_id to avoid a bookmark lookup, although these pages are likely to be in cache anyway, so I'm not sure it will really help all that much.
CREATE INDEX ON events (correlation_id, create_timestamp, event_id);
PS: the same is true about adding correlation_id to your events_username index... but that's all quite geared towards this (probably simplified) query, and do keep in mind that more (and bigger) indexes will bring some overhead elsewhere even when they bring big benefits here... it's always a compromise.

Bad performance when not selecting a specific column from a view

Using SQL Server 2016 SP1. I have a view Users that goes like
SELECT
ROW_NUMBER() OVER (ORDER BY ID) AS DataModelID, *
FROM
(Some query) AS tbl
I then select from it
SELECT
U1.ID UserId, U1.IdentityNumber IdentityNumber,
U1.ArabicFirstName, U1.ArabicSecondName
FROM
USERS U1
LEFT JOIN
USERS U2 ON U1.IdentityNumber = U2.IdentityNumber
AND U1.ID <> U2.ID
AND U1.RoleId = 2
WHERE
U2.ID IS NOT NULL
AND U1.IdentityNumber <> ''
AND PATINDEX('[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]', U1.IdentityNumber) = 1
The thing here is that, with the above query, when I select * or include the DataModelID column it runs in 3 seconds, but when I select any columns without that one it runs in more than 2 minutes.
Why is this happening, running faster when including a column?
I tried everything to clear the cache and ran the query multiple times, and it has the same results.
Without seeing the actual execution plan there is no way to say for sure, but as @mvisser mentioned, the likely cause is that the optimizer is choosing a better index when you do a
SELECT * or include column DataModelID than when you don't. There are a number of solutions here; one suggestion would be to look at the execution plan for the queries that run in 3 seconds, note what index is being used, and use an index hint (see section G) to force the optimizer to use that index in your queries that don't reference those columns. I would not suggest this though - there are too many unanswered variables to consider this a viable option.
Here's what I recommend:
First, as @Lukasz Szozda mentioned, this is not SARGable:
AND PATINDEX( '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]',U1.IdentityNumber) = 1
But this is:
U1.IdentityNumber LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'
So I'd fix that first. Next, the fastest, most sure-fire way to resolve this is to simply include DataModelID in your queries even if you don't need it. You can either filter that column out at the application level, or create a stored proc that populates a temp table and then, for the final result set, retrieve your results from that temp table excluding DataModelID.
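A rough sketch of that temp-table workaround, reusing the query from the question (the procedure and temp table names are made up, not part of the original code):
CREATE PROCEDURE dbo.GetUsersWithDuplicateIdentity
AS
BEGIN
    -- select DataModelID so the fast plan is used, but park the results in a temp table
    SELECT U1.ID UserId, U1.IdentityNumber, U1.ArabicFirstName,
           U1.ArabicSecondName, U1.DataModelID
    INTO #Result
    FROM USERS U1
    LEFT JOIN USERS U2 ON U1.IdentityNumber = U2.IdentityNumber
        AND U1.ID <> U2.ID
        AND U1.RoleId = 2
    WHERE U2.ID IS NOT NULL
        AND U1.IdentityNumber <> ''
        AND U1.IdentityNumber LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]';

    -- return everything except DataModelID
    SELECT UserId, IdentityNumber, ArabicFirstName, ArabicSecondName
    FROM #Result;
END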
OPTION #2
You can create an Indexed View on your USERS table that looks something like this:
CREATE VIEW dbo.vwUSERS_clean
WITH SCHEMABINDING AS
SELECT U1.ID UserId, U1.IdentityNumber IdentityNumber,
U1.ArabicFirstName, U1.ArabicSecondName, U1.DataModelID,
U2.IdentityNumber MatchedIdentityNumber -- aliased to avoid a duplicate column name
FROM dbo.USERS U1
LEFT JOIN dbo.USERS U2 ON U1.IdentityNumber = U2.IdentityNumber AND U1.ID <> U2.ID
WHERE U2.ID IS NOT NULL
AND U1.IdentityNumber <> ''
AND U1.IdentityNumber LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]';
GO
Then create a unique, clustered index on it. Next you would change the query that you posted to reference your indexed view (e.g. change both references to USERS to dbo.vwUSERS_clean WITH (NOEXPAND)).
Note that ROW_NUMBER is not allowed in indexed views but, if you make ID your clustered index (or the first column in a composite clustered index), then there will be no cost to adding ROW_NUMBER() OVER (ORDER BY ID) to queries that reference that indexed view.

Proving that a query is a blocking issue

When running this query, so far there are 4 different results that can occur. Most frequently it runs for over 10 minutes before I kill it, but around 1% of the time it will return either 2 records, 3 records, or no records:
select *
from [master] as [p]( nolock )
inner join eddd(nolock) as [e]
on [p].[guid] = [e].pggg
inner join stppp as [s]( nolock )
on [e].[guid] = [s].evggg
inner join [edddex] as [x]( nolock )
on [e].[guid] = [x].[guid]
where [p].[guid] = '370ECB0F-6222-4D02-86DD-336BAAA49B81'
and ( CAST([s].rddd as date) = '20150821'
or CAST([s].rddd as date) = '20150821' )
and [s].[st111num] = 4
and [e].[ev111type] = 6
and [x].[l11105] = 6;
My friend argues that because removing this filter:
and ( CAST([s].rddd as date) = '20150821'
or CAST([s].rddd as date) = '20150821' )
causes the query to respond immediately, that this proves this is not a blocking/locking issue.
When having only SELECT access against a specific database and not having the ability to audit or view system tables, how does one find out whether a specific query is returning different results due to locking?
The CAST([s].rddd AS date) is the performance problem. When you put a function (like CAST(), MAX(), etc.) around a criteria field, you disable the use of the index that would otherwise aid in the execution of your query, and this causes a full table scan. Modify your query so that you can avoid such functions, AND be sure your filters/matches involve the LEADING index fields. Not employing the leading index fields ALSO causes the index(es) to not be used.
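A hedged sketch of a sargable rewrite of that filter, assuming rddd is a datetime column: compare the raw column against a half-open date range instead of wrapping it in CAST(), so the WHERE clause from the question becomes something like:
where [p].[guid] = '370ECB0F-6222-4D02-86DD-336BAAA49B81'
and [s].rddd >= '20150821'
and [s].rddd < '20150822'
and [s].[st111num] = 4
and [e].[ev111type] = 6
and [x].[l11105] = 6;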
For more insight, run an EXPLAIN PLAN (Showplan, or whichever your system utilizes) and it will tell you which indexes are used, and much more.
Also, SELECT * is a performance killer, as it has to return ALL FIELDS, so even if the indexes can assist with the filtering, every block containing any of the needed records must be fully read. SELECT * is very rarely needed, so visit that as well and trim it down to only the needed fields.

SQL Server search filter and order by performance issues

We have a table value function that returns a list of people you may access, and we have a relation between a search and a person called search result.
What we want to do is select all people from the search and present them.
The query looks like this
SELECT qm.PersonID, p.FullName
FROM QueryMembership qm
INNER JOIN dbo.GetPersonAccess(1) ON GetPersonAccess.PersonID = qm.PersonID
INNER JOIN Person p ON p.PersonID = qm.PersonID
WHERE qm.QueryID = 1234
There are only 25 rows with QueryID=1234 but there are almost 5 million rows total in the QueryMembership table. The person table has about 40K people in it.
QueryID is not a PK, but it is an index. The query plan tells me 97% of the total cost is spent doing "Key Lookup" with the seek predicate.
QueryMembershipID = Scalar Operator (QueryMembership.QueryMembershipID as QM.QueryMembershipID)
Why is the PK in there when it's not used in the query at all? And why is it taking so long?
There are only 25 people in total; with the index, this should be a scan over just the QueryMembership rows that have QueryID=1234 and then a JOIN on the 25 people that exist in the table-valued function, which by the way only has to be evaluated once and completes in less than 1 second.
If you want to avoid the "key lookup", use a covering index:
create index ix_QueryMembership_NameHere on QueryMembership (QueryID)
include (PersonID);
Add any other columns that you are going to select to the INCLUDE list.
As for why the PK's "key lookup" is working so slowly, try DBCC FREEPROCCACHE, ALTER INDEX ALL ON QueryMembership REBUILD, ALTER INDEX ALL ON QueryMembership REORGANIZE.
This may help if your PK's index is disabled, or the cache keeps a wrong plan.
You should define indexes on the tables you query. In particular on columns referenced in the WHERE and ORDER BY clauses.
Use the Database Tuning Advisor to see what SQL Server recommends.
For specifics, of course you would need to post your query and table design.
But I have to make a couple of points here:
You've already jumped to the conclusion that the slowness is a result of the ORDER BY clause. I doubt it. The real test is whether or not removing the ORDER BY speeds up the query, which you haven't done. Dollars to donuts, it won't make a difference.
You only get the "log n" in your big-O claim when the optimizer actually chooses to use the index you defined. That may not be happening because your index may not be selective enough. The thing that makes your temp table solution faster than the optimizer's solution is that you know something about the subset of data being returned that the optimizer does not (specifically, that it is a really small subset of data). If your indexes are not selective enough for your query, the optimizer can't always reasonably assume this, and it will choose a plan that avoids what it thinks could be a worst-case scenario of tons of index lookups, followed by tons of seeks and then a big sort. Oftentimes, it chooses to scan and hash instead. So what you did with the temp table is often a way to solve this problem. Often you can narrow down your indexes or create an indexed view on the subset of data you want to work against. It all depends on the specifics of your query.
You need indexes on your WHERE and ORDER BY clauses. I am not an expert, but I would bet it is doing a table scan for each row. Since your speed issue is resolved by removing the INNER JOIN or the ORDER BY, I bet the issue is specifically with the join. I bet it is doing the table scan on your joined table because of the sort. By putting an index on the columns in your WHERE clause first, you will be able to see if that is in fact the case.
Have you tried restructuring the query into a CTE to separate the TVF call? So, something like:
With QueryMembershipPerson As
(
Select QM.PersonId, P.Fullname
From QueryMembership As qm
Join Person As P
On P.PersonId = QM.PersonId
Where QM.QueryId = 1234
)
Select PersonId, Fullname
From QueryMembershipPerson As QMP
Join dbo.GetPersonAccess(1) As PA
On PA.PersonId = QMP.PersonId
EDIT: Btw, I'm assuming that there is an index on PersonId in both the QueryMembership and the Person table.
EDIT: What about two common table expressions, like so:
With
QueryMembershipPerson As
(
Select QM.PersonId, P.Fullname
From QueryMembership As qm
Join Person As P
On P.PersonId = QM.PersonId
Where QM.QueryId = 1234
)
, PersonAccess As
(
Select PersonId
From dbo.GetPersonAccess(1)
)
Select PersonId, Fullname
From QueryMembershipPerson As QMP
Join PersonAccess As PA
On PA.PersonId = QMP.PersonId
Yet another solution would be a derived table like so:
Select ...
From (
Select QM.PersonId, P.Fullname
From QueryMembership As qm
Join Person As P
On P.PersonId = QM.PersonId
Where QM.QueryId = 1234
) As QueryMembershipPerson
Join dbo.GetPersonAccess(1) As PA
On PA.PersonId = QueryMembershipPerson.PersonId
If pushing some of the query into a temp table and then joining on that works, I'd be surprised that you couldn't combine that concept into a CTE or a query with a derived table.

OR query performance and strategies with Postgresql

In my application I have a table of application events that are used to generate a user-specific feed of application events. Because it is generated using an OR query, I'm concerned about performance of this heavily used query and am wondering if I'm approaching this wrong.
In the application, users can follow both other users and groups. When an action is performed (e.g., a new post is created), a feed_item record is created with the actor_id set to the user's id and the subject_id set to the group id in which the action was performed, and actor_type and subject_type are set to the class names of the models. Since users can follow both groups and users, I need to generate a query that checks both the actor_id and subject_id, and it needs to select distinct records to avoid duplicates. Since it's an OR query, I can't use a normal index. And since a record is created every time an action is performed, I expect this table to have a lot of records rather quickly.
Here's the current query (the followings table joins users to feeders, i.e., users and groups):
SELECT DISTINCT feed_items.* FROM "feed_items"
INNER JOIN "followings"
ON (
(followings.feeder_id = feed_items.subject_id
AND followings.feeder_type = feed_items.subject_type)
OR
(followings.feeder_id = feed_items.actor_id
AND followings.feeder_type = feed_items.actor_type)
)
WHERE (followings.follower_id = 42) ORDER BY feed_items.created_at DESC LIMIT 30 OFFSET 0
So my questions:
Since this is a heavily used query, is there a performance problem here?
Is there any obvious way to simplify or optimize this that I'm missing?
What you have is called an exclusive arc and you're seeing exactly why it's a bad idea. The best approach for this kind of problem is to make the feed item type dynamic:
Feed Items: id, type (A or S for Actor or Subject), subtype (replaces actor_type and subject_type)
and then your query becomes
SELECT DISTINCT fi.*
FROM feed_items fi
JOIN followings f ON f.feeder_id = fi.id AND f.feeder_type = fi.type AND f.feeder_subtype = fi.subtype
or similar.
This may not completely or exactly represent what you need to do but the principle is sound: you need to eliminate the reason for the OR condition by changing your data model in such a way to lend itself to having performant queries being written against it.
Explain analyze and time query to see if there is a problem.
Also you could try expressing the query as a union:
SELECT x.* FROM
(
SELECT feed_items.* FROM feed_items
INNER JOIN followings
ON followings.feeder_id = feed_items.subject_id
AND followings.feeder_type = feed_items.subject_type
WHERE (followings.follower_id = 42)
UNION
SELECT feed_items.* FROM feed_items
INNER JOIN followings
ON followings.feeder_id = feed_items.actor_id
AND followings.feeder_type = feed_items.actor_type
WHERE (followings.follower_id = 42)
) AS x
ORDER BY x.created_at DESC
LIMIT 30
But again explain analyze and benchmark.
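If you do go with the UNION shape, each branch can be supported by its own index; a sketch of the kind of indexes that might help (the column order here is an assumption, so verify with EXPLAIN ANALYZE):
CREATE INDEX ON followings (follower_id, feeder_type, feeder_id);
CREATE INDEX ON feed_items (subject_type, subject_id, created_at);
CREATE INDEX ON feed_items (actor_type, actor_id, created_at);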
To find out if there is a performance problem measure it. PostgreSQL can explain it for you.
I don't think that the query needs simplifying, if you identify a performance problem then you may need to revise your indexes.