Speed up LEFT OUTER JOIN query in Firebird - sql

The question is for Firebird 2.5. Let's assume we have the following query:
SELECT
EVENTS.ID,
EVENTS.TS,
EVENTS.DEV_TS,
EVENTS.COMPLETE_TS,
EVENTS.OBJ_ID,
EVENTS.OBJ_CODE,
EVENTS.SIGNAL_CODE,
EVENTS.SIGNAL_EVENT,
EVENTS.REACTION,
EVENTS.PROT_TYPE,
EVENTS.GROUP_CODE,
EVENTS.DEV_TYPE,
EVENTS.DEV_CODE,
EVENTS.SIGNAL_LEVEL,
EVENTS.SIGNAL_INFO,
EVENTS.USER_ID,
EVENTS.MEDIA_ID,
SIGNALS.ID AS SIGNAL_ID,
SIGNALS.SIGNAL_TYPE,
SIGNALS.IMAGE AS SIGNAL_IMAGE,
SIGNALS.NAME AS SIGNAL_NAME,
REACTION.INFO,
USERS.NAME AS USER_NAME
FROM EVENTS
LEFT OUTER JOIN SIGNALS ON (EVENTS.SIGNAL_ID = SIGNALS.ID)
LEFT OUTER JOIN REACTION ON (EVENTS.ID = REACTION.EVENTS_ID)
LEFT OUTER JOIN USERS ON (EVENTS.USER_ID = USERS.ID)
WHERE (TS BETWEEN '27.07.2021 00:00:00' AND '28.07.2021 10:34:08')
AND (OBJ_ID = 8973)
AND (DEV_CODE IN (0, 1234))
AND (DEV_TYPE = 79)
AND (PROT_TYPE = 8)
ORDER BY TS;
EVENTS has about 190 million records by now and this query takes too much time to complete. As I read here, the tables have to have indexes on all the columns that are used.
Here are the CREATE INDEX statements for the EVENTS table:
CREATE INDEX FK_EVENTS_OBJ ON EVENTS (OBJ_ID);
CREATE INDEX FK_EVENTS_SIGNALS ON EVENTS (SIGNAL_ID);
CREATE INDEX IDX_EVENTS_COMPLETE_TS ON EVENTS (COMPLETE_TS);
CREATE INDEX IDX_EVENTS_OBJ_SIGNAL_TS ON EVENTS (OBJ_ID,SIGNAL_ID,TS);
CREATE INDEX IDX_EVENTS_TS ON EVENTS (TS);
Here is the data from the PLAN analyzer:
PLAN JOIN (JOIN (JOIN (EVENTS ORDER IDX_EVENTS_TS INDEX (FK_EVENTS_OBJ, IDX_EVENTS_TS), SIGNALS INDEX (PK_SIGNALS)), REACTION INDEX (IDX_REACTION_EVENTS)), USERS INDEX (PK_USERS))
As requested the speed of the execution:
without LEFT JOIN -> 138ms
with LEFT JOIN -> 338ms
Is there another way to speed up the execution of the query besides indexing the columns or maybe add another index?
If I add another index will the optimizer choose to use it?

You can only optimize the joins themselves by being sure that the keys are indexed in the second tables. These all look like primary keys, so they should have appropriate indexes.
For this WHERE clause:
WHERE TS BETWEEN '27.07.2021 00:00:00' AND '28.07.2021 10:34:08')
OBJ_ID = 8973 AND
DEV_CODE IN (0, 1234) AND
DEV_TYPE = 79 AND
PROT_TYPE = 8
You probably want an index on (OBJ_ID, DEV_TYPE, PROT_TYPE, TS, DEV_CODE). The order of the first three keys is not particularly important because they are all equality comparisons. I am guessing that one day of data is fewer rows than two device codes.

First of all you want to find the table1 rows quickly. You are using several columns in your WHERE clause to get them. Provide an index on these columns. Which column is the most selective? I.e. which criteria narrows the result rows most? Let's say it's dt, so we put this first:
create index idx1 on table1 (dt, oid, pt, ts, dc);
I have put ts and dt last, because we are looking for more than one value in these columns. It may still be that putting ts or dsas the first column is a good choice. Sometimes we have to play around with this. I.e. provide several indexes with the column order changed and then see which one gets used by the DBMS.
Tables table2 and tabe4 get accessed by the primary key for which exists an index. But table3 gets accessed by t1id. So provide an index on that, too:
create index idx2 on table3 (t1id);

Related

What is index do i need?

I have problems with the performance of this query. If I remove Order by section all work well. But I really want it. I tried to use many indexes but have not any results. Can you help me pls?
SELECT *
FROM "refuel_request" AS "refuel_request"
LEFT OUTER JOIN "user" AS "user" ON "refuel_request"."user_id" = "user"."user_id"
LEFT OUTER JOIN "bill_qr" AS "bill_qr" ON "refuel_request"."bill_qr_id" = "bill_qr"."bill_qr_id"
LEFT OUTER JOIN "car" AS "order.car" ON "refuel_request"."car_id" = "order.car"."car_id"
LEFT OUTER JOIN "refuel_request_status" AS "refuel_request_status" ON "refuel_request"."refuel_request_status_id" = "refuel_request_status"."refuel_request_status_id"
WHERE
refuel_request."refuel_request_status_id" IN ( '1', '2', '3')
ORDER BY "refuel_request".created_at desc
LIMIT 10
There is explain of this query
EXPLAIN (ANALYZE, BUFFERS)
Primary Keys and/or Foreign Keys
pk_refuel_request_id
refuel_request_bill_qr_id_fkey
refuel_request_user_id_fkey
All outer joind tables are 1:n related to refuel_request. This means your query is looking for the last ten created refuel requests with status 1 to 3.
You are outer joining the tables, because not every reful_request is related to a user, a bill_qr, a car, and a status. Or you outer join mistakenly. Anyway, none of the joins changes the number of retrieved rows; it's still one row per refuel request. In order to join the other tables' rows the DBMS just needs their primary key indexes. Nothing to worry about.
The only thing we must care about is finding the top reful_request rows for the statuses you are interested in as quickly as possible.
Use a partial index that only contains data for the statuses in question. The column you index is the created_at column, so as to get the top 10 immediately.
CREATE INDEX idx ON refuel_request (created_at DESC)
WHERE refuel_request_status_id IN (1, 2, 3);
Partial indexes are explained here: https://www.postgresql.org/docs/current/indexes-partial.html
You cannot have an index that supports both the WHERE condition and the ORDER BY, because you are using IN and not =.
The fastest option is to split the query into three parts, so that each part compares refuel_request.refuel_request_status_id with =. Combine these three queries with UNION ALL. Each of the queries has ORDER BY and LIMIT 10, and you wrap the whole thing in an outer query that has another ORDER BY and LIMIT 10.
Then you need these indexes:
CREATE INDEX ON refuel_request (refuel_request_status_id, created_at);
CREATE INDEX ON "user" (user_id);
CREATE INDEX ON bill_qr (bill_qr_id);
CREATE INDEX ON car (car_id);
CREATE INDEX ON refuel_request_status (refuel_request_status_id);
You need at least the indexes for the joins (do you really need LEFT joins?)
LEFT OUTER JOIN "user" AS "user" ON "refuel_request"."user_id" = "user"."user_id"
So, refuel_request.user_id must be in the index
LEFT OUTER JOIN "bill_qr" AS "bill_qr" ON "refuel_request"."bill_qr_id" =
LEFT OUTER JOIN "car" AS "order.car" ON "refuel_request"."car_id" =
bill_qr_id and car_id too
LEFT OUTER JOIN "refuel_request_status" AS "refuel_request_status" ON "refuel_request"."refuel_request_status_id" =
and refuel_request_status_id
WHERE
refuel_request."refuel_request_status_id" IN ( '1', '2', '3')
refuel_request_status_id must be the first key in the index as we need it in the WHERE
ORDER BY "refuel_request".created_at desc
and then created_at since it's in the ORDER clause. This will not improve performances per se, but will allow to run the ORDER BY without requiring access to the table data, the same reason why we put the other non-WHERE columns in there. Of course a partial index is even better, we shift the WHERE in the partiality clause and use created_at for the rest (the LIMIT 10 now means that we can do without the extra columns in the index, since retrieving three 1:N rows costs very little; in a different situation we might find it useful to keep those extra columns).
So one index that contains, in this order:
refuel_request_status_id, created_at, bill_qr_id, car_id too, user_id
^ WHERE ^ ORDER ^ used by the JOINS
However, do you really need a SELECT *? I believe you'd get better performances if you only included the fields you're really going to use.
The most effective index for this query would be on refuel_request (refuel_request_status_id, created_at DESC) so that both the main filtering and the ordering can be done using the index. You also want indexes on the columns you're joining, but those tables are small and inconsequential at the moment. In any case, the index I suggest isn't actually going to help much with the performance pain points you're having right now. Here are some suggestions:
Don't use SELECT * unless you really need all of the columns from all of these tables you're joining. Specifying only the necessary columns means postgres can load less data into memory, and work over it faster.
Postgres is spending a lot of time on the joins, joining about a million rows each time, when you're really only interested in ten of those rows. We can encourage it do the order/limit first by rearranging the query somewhat:
WITH refuel_request_subset AS MATERIALIZED (
SELECT *
FROM refuel_request
WHERE refuel_request_status_id IN ('1', '2', '3')
ORDER BY created_at DESC
LIMIT 10
)
SELECT *
FROM refuel_request_subset AS refuel_request
LEFT OUTER JOIN user ON refuel_request.user_id = user.user_id
LEFT OUTER JOIN bill_qr ON refuel_request.bill_qr_id = bill_qr.bill_qr_id
LEFT OUTER JOIN car AS "order.car" ON refuel_request.car_id = "order.car".car_id
LEFT OUTER JOIN refuel_request_status ON refuel_request.refuel_request_status_id = refuel_request_status.refuel_request_status_id;
Note: This assumes that the LEFT JOINS will not add rows to the result set, as is the case with your current dataset.
This trick only really works if you have a fixed number of IDs, but you can do the refuel_request_subset query separately for each ID and then UNION the results, as opposed to using the IN operator. That would allow postgres to fully use the index mentioned above.

Best way to get distinct count from a query joining two tables

I have 2 tables, table A & table B.
Table A (has thousands of rows)
id
uuid
name
type
created_by
org_id
Table B (has a max of hundred rows)
org_id
org_name
I am trying to get the best join query to obtain a count with a WHERE clause. I need the count of distinct created_bys from table A with an org_name in Table B that contains 'myorg'. I currently have the below query (producing expected results) and wonder if this can be optimized further?
select count(distinct a.created_by)
from a left join
b
on a.org_id = b.org_id
where b.org_name like '%myorg%';
You don't need a left join:
select count(distinct a.created_by)
from a join
b
on a.org_id = b.org_id
where b.org_name like '%myorg%'
For this query, you want an index on b.org_id, which I assume that you have.
I would use exists for this:
select count(distinct a.created_by)
from a
where exists (select 1 from b where b.org_id = a.org_id and b.org_name like '%myorg%')
An index on b(org_id) would help. But in terms of performance, key points are:
searching using like with a wildcard on both sides is not good for performance (this cannot take advantage of an index); it would be far better to search for an exact match, or at least to not have a wildcard on the left side of the string.
count(distinct ...) is more expensive than a regular count(); if you don't really need distinct, then don't use it.
Your query looks good already. Use a plain [INNER] JOIN instead or LEFT [OUTER] JOIN, like Gordon suggested. But that won't change much.
You mention that table B has only ...
a max of hundred rows
while table A has ...
thousands of rows
If there are many rows per created_by (which I'd expect), then there is potential for an emulated index skip scan.
(The need to emulate it might go away in one of the coming Postgres versions.)
Essential ingredient is this multicolumn index:
CREATE INDEX ON a (org_id, created_by);
It can replace a simple index on just (org_id) and works for your simple query as well. See:
Is a composite index also good for queries on the first field?
There are two complications for your case:
DISTINCT
0-n org_id resulting from org_name like '%myorg%'
So the optimization is harder to implement. But still possible with some fancy SQL:
SELECT count(DISTINCT created_by) -- does not count NULL (as desired)
FROM b
CROSS JOIN LATERAL (
WITH RECURSIVE t AS (
( -- parentheses required
SELECT created_by
FROM a
WHERE org_id = b.org_id
ORDER BY created_by
LIMIT 1
)
UNION ALL
SELECT (SELECT created_by
FROM a
WHERE org_id = b.org_id
AND created_by > t.created_by
ORDER BY created_by
LIMIT 1)
FROM t
WHERE t.created_by IS NOT NULL -- stop recursion
)
TABLE t
) a
WHERE b.org_name LIKE '%myorg%';
db<>fiddle here (Postgres 12, but works in Postgres 9.6 as well.)
That's a recursive CTE in a LATERAL subquery, using a correlated subquery.
It utilizes the multicolumn index from above to only retrieve a single row for every (org_id, created_by). With an index-only scans if the table is vacuumed enough.
The main objective of the sophisticated SQL is to completely avoid a sequential scan (or even a bitmap index scan) on the big table and only read very few fast index tuples.
Due to the added overhead it can be a bit slower for an unfavorable data distribution (many org_id and/or only few rows per created_by) But it's much faster for favorable conditions and is scales excellently, even for millions of rows. You'll have to test to find the sweet spot.
Related:
Optimize GROUP BY query to retrieve latest row per user
What is the difference between LATERAL and a subquery in PostgreSQL?
Is there a shortcut for SELECT * FROM?

Join on columns with different indexes

I am using a query the join columns one has a clustered index and the other has a non-clustered index. The query is taking a long time. Is this the reason that I am using different type of indexes ?
SELECT #NoOfOldBills = COUNT(*)
FROM Billing_Detail D, Meter_Info m, Meter_Reading mr
WHERE D.Meter_Reading_ID = mr.id
AND m.id = mr.Meter_Info_ID
AND m.id = #Meter_Info_ID
AND BillType = 'Meter'
IF (#NoOfOldBills > 0) BEGIN
SELECT TOP 1 #PReadingDate = Bill_Date
FROM Billing_Detail D, Meter_Info m, Meter_Reading mr
WHERE D.Meter_Reading_ID = mr.id
AND m.id = mr.Meter_Info_ID
AND m.id = #Meter_Info_ID
AND billtype = 'Meter'
ORDER BY Bill_Date DESC
END
Without knowing more details and the context, it's tricky to advise - but it looks like you are trying to find out the date of the oldest bill. You could probably rewrite those two queries as one, which would improve the performance significantly (assuming that there are some old bills).
I would suggest something like this - which in addition to probably performing better, is a little easier to read!
SELECT count(d.*) NoOfOldBills, MAX(d.Billing_Date) OldestBillingDate FROM Billing_Detail d
INNER JOIN Meter_Reading mr ON mr.id=d.Meter_Reading_ID
INNER JOIN Meter_Info m ON m.Id=mr.Meter_Info_ID
WHERE
m.id = #Meter_Info_ID AND billtype = 'Meter'
The reason is not because you have different types of indices. Since you said you have clustered indices on all primary keys, you should be fine there. To support this query, you would also need an index with two columns on BillType and Bill_Date to cut down the time.
Hopefully they are in the same table. Otherwise, you may need a few indices to finally create one that does have two columns.
OK there are a number of things here. First, are the indexes you have relevant (or as relevant as they can be). There should be a clustered index on each table - a non clustered index does not work effectively if the table does not have a clustered index which non clustered indexes use to identify rows.
Does your index cover only one column? SQL Server uses indexes with the order of the columns for that index being very important. The left most column of the index should (in general) be the column that has the greatest ordinality (divides the data into the smallest amounts) Do either of the indexes cover all the columns referred to in the query (this is known as a covering index (Google this for more info).
In general indexes should be wide, if SQL Server has an index on col1, col2, col3, col4 and another on col1, col2 the later is redundant, as the information from the second index is fully contained in the first and SQL Server understands this.
Are you statistics up to date? SQL Server can / will choose a bad execution plan if the statistics are not up to date. What does query analyser show for plan execution (SSMS Query | show execution plan)?

Optimize self join on millions of rows

I have a table which is a link table from objects in my SQL Server 2012 database, (annonsid, annonsid2). This table is used to create chains of triangle or even rectangles to see who can swap with who.
This is the query I use on the table Matching_IDs which has 1,5 million rows in it, producing 14 million possible chains using this query:
SELECT COUNT(*)
FROM Matching_IDs AS m
INNER JOIN Matching_IDs AS m2
ON m.annonsid2 = m2.annonsid
INNER JOIN Matching_IDs AS m3
ON m2.annonsid2 = m3.annonsid
AND m.annonsid = m3.annonsid2
I must improve performance to take maybe 1 second or less, Is there a faster way to do this? The query takes about 1 minute on my computer. I normally use a WHERE m.annonsid=x, but it takes just the same amount of time, cause it has to go through all possible combinations anyway.
Update: the latest query plan
|--Compute Scalar(DEFINE:([Expr1006]=CONVERT_IMPLICIT(int,[globalagg1011],0)))
|--Stream Aggregate(DEFINE:([globalagg1011]=SUM([partialagg1010])))
|--Parallelism(Gather Streams)
|--Stream Aggregate(DEFINE:([partialagg1010]=Count(*)))
|--Hash Match(Inner Join, HASH:([m2].[annonsid2], [m2].[annonsid])=([m3].[annonsid], [m].[annonsid2]), RESIDUAL:([MyDatabase].[dbo].[Matching_IDs].[annonsid2] as [m2].[annonsid2]=[MyDatabase].[dbo].[Matching_IDs].[annonsid] as [m3].[annonsid] AND [MyDatabase].[dbo].[Matching_IDs].[annonsid2] as [m].[annonsid2]=[MyDatabase].[dbo].[Matching_IDs].[annonsid] as [m2].[annonsid]))
|--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([m2].[annonsid2], [m2].[annonsid]))
| |--Index Scan(OBJECT:([MyDatabase].[dbo].[Matching_IDs].[NonClusteredIndex-20121229-133207] AS [m2]))
|--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([m3].[annonsid], [m].[annonsid2]))
|--Merge Join(Inner Join, MANY-TO-MANY MERGE:([m].[annonsid])=([m3].[annonsid2]), RESIDUAL:([MyDatabase].[dbo].[Matching_IDs].[annonsid] as [m].[annonsid]=[MyDatabase].[dbo].[Matching_IDs].[annonsid2] as [m3].[annonsid2]))
|--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([m].[annonsid]), ORDER BY:([m].[annonsid] ASC))
| |--Index Scan(OBJECT:([MyDatabase].[dbo].[Matching_IDs].[NonClusteredIndex-20121229-133152] AS [m]), ORDERED FORWARD)
|--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([m3].[annonsid2]), ORDER BY:([m3].[annonsid2] ASC))
|--Index Scan(OBJECT:([MyDatabase].[dbo].[Matching_IDs].[NonClusteredIndex-20121229-133207] AS [m3]), ORDERED FORWARD)
Some ideas:
Try two indexes (annonsid,annonsid2) and (annonsid2,annonsid)
Have you tried a column store index? It makes the table read only but it might improve performance.
Also, some variations of the query could help. Examples:
SELECT COUNT(*)
FROM Matching_IDs AS m
INNER JOIN Matching_IDs AS m2
ON m.annonsid2 = m2.annonsid
INNER JOIN Matching_IDs AS m3
ON m2.annonsid2 = m3.annonsid
where m.annonsid = m3.annonsid2
or
SELECT COUNT(*)
FROM Matching_IDs AS m, Matching_IDs AS m2, Matching_IDs AS m3
where m2.annonsid2 = m3.annonsid
and m.annonsid2 = m2.annonsid
and m.annonsid = m3.annonsid2
Did you check the CPU/IO-Load? If IO-Load is high, then the server is not crunching numbers but swapping => more RAM solves the problem.
How fast is this query?
SELECT COUNT(*)
FROM Matching_IDs AS m
INNER JOIN Matching_IDs AS m2
ON m.annonsid2 = m2.annonsid
If this is very fast but adding the next join slows thing down then you propably need more RAM.
It seems like you already indexed this quite well. You can try converting the hash to a merge join by adding the right multi-column index, but it won't give you the desired speedup of 60x.
I think this index would be on annonsid, annonsid2 although I might have made a mistake here.
It would be nice to materialize all of this but indexed views do not support self-joins. You can try to materialize this query (unaggregated) into a new table. Whenever you execute DML against the base table, also update the second table (using either application logic or triggers). That would allow you to query blazingly fast.
You should make this query a bit more separated. I think first you should create a table, where you can store the primary key + annonsid, annonsid2 -if annosid is not the primary key itself.
DECLARE #AnnonsIds TABLE
(
primaryKey int,
-- if you need later more info from the original rows like captions
-- AND it is not (just) the annonsid
annonsid int,
annonsid2 int
)
If you declare a table, and you have index on this column, it is quite fast to get the specified rows by the WHERE annonsid = #annonsid OR annonsid2 = #annosid
After the first step you have a much smaller (I guess), and "thin" table to work with. Then you just have to use the joins here or make a temp table and a CTE on it.
I think it should be faster, depending on the selectivity of the condition in the WHERE, of yourse if 1.1 million rows fits in it, then it does not make sense, but if just a couple hundred or tousend, then you should give it a try!
1 - change select Count(*) to Count(1) or Count(id)
2 - Write set Nocount on at the first of your Stored procedure or at the first of your Query
3 - Use index on annonsid , annonsid2
4 - Having your Indexes after the primary key in your Table
You could denormalize the data by adding a table RelatedIds with AnnonsId, RelatedAnnonId and Distance. For every value of AnnonsId the table would contains rows for each RelatedAnnonId and the number of relations that need to be traversed to reach it, aka Distance. Triggers on the existing MatchingIds table would maintain the new table with some configured maximum value for Distance, e.g. 3 to handle rectangular shares. Index the table on (AnnonsId, Distance).
Edit: An index on (Distance, AnnonsId) will allow you to quickly find rows that have enough related entries to form a particular shape. Adding a column for MaxDistance may be useful if you want to be able to exclude rows based on, for example, having a triangular but not rectangular relationship.
The new query would inner join RelatedIds as RI on RI.AnnonsId = m.AnnonsId and RI.Distance <= #MaxDistance with the desired "shape" dictating the value of #MaxDistance.
It should provide much better performance on the select. Downsides are another table with a large number of rows and overhead of the triggers when altering the MatchingIds table.
Example: There are two entries in Matching_IDs: (1,2) and (2,3).
The new table would contain 3 entries:
1-> 2: distance = 1
1-> 3: distance = 2 (it takes one intermediate 'node' to go from 1 to 3)
2-> 3: distance = 1
adding one more entry to the matching ids (3,1) would result in another entry:
1-> 1: distance = 3
And voilá: You found a triangle (distance=3).
Now, to find all triangles, simply do this:
select *
from RelatedIds
where AnnonsId=RelatedAnnonId
and Distance=3

T-SQL Query Optimization

I'm working on some upgrades to an internal web analytics system we provide for our clients (in the absence of a preferred vendor or Google Analytics), and I'm working on the following query:
select
path as EntryPage,
count(Path) as [Count]
from
(
/* Sub-query 1 */
select
pv2.path
from
pageviews pv2
inner join
(
/* Sub-query 2 */
select
pv1.sessionid,
min(pv1.created) as created
from
pageviews pv1
inner join Sessions s1 on pv1.SessionID = s1.SessionID
inner join Visitors v1 on s1.VisitorID = v1.VisitorID
where
pv1.Domain = isnull(#Domain, pv1.Domain) and
v1.Campaign = #Campaign
group by
pv1.sessionid
) t1 on pv2.sessionid = t1.sessionid and pv2.created = t1.created
) t2
group by
Path;
I've tested this query with 2 million rows in the PageViews table and it takes about 20 seconds to run. I'm noticing a clustered index scan twice in the execution plan, both times it hits the PageViews table. There is a clustered index on the Created column in that table.
The problem is that in both cases it appears to iterate over all 2 million rows, which I believe is the performance bottleneck. Is there anything I can do to prevent this, or am I pretty much maxed out as far as optimization goes?
For reference, the purpose of the query is to find the first page view for each session.
EDIT: After much frustration, despite the help received here, I could not make this query work. Therefore, I decided to simply store a reference to the entry page (and now exit page) in the sessions table, which allows me to do the following:
select
pv.Path,
count(*)
from
PageViews pv
inner join Sessions s on pv.SessionID = s.SessionID
and pv.PageViewID = s.ExitPage
inner join Visitors v on s.VisitorID = v.VisitorID
where
(
#Domain is null or
pv.Domain = #Domain
) and
v.Campaign = #Campaign
group by pv.Path;
This query runs in 3 seconds or less. Now I either have to update the entry/exit page in real time as the page views are recorded (the optimal solution) or run a batch update at some interval. Either way, it solves the problem, but not like I'd intended.
Edit Edit: Adding a missing index (after cleaning up from last night) reduced the query to mere milliseconds). Woo hoo!
For starters,
where pv1.Domain = isnull(#Domain, pv1.Domain)
won't SARG. You can't optimize a match on a function, as I remember.
I'm back. To answer your first question, you could probably just do a union on the two conditions, since they are obviously disjoint.
Actually, you're trying to cover both the case where you provide a domain, and where you don't. You want two queries. They may optimize entirely differently.
What's the nature of the data in these tables? Do you find most of the data is inserted/deleted regularly?
Is that the full schema for the tables? The query plan shows different indexing..
Edit: Sorry, just read the last line of text. I'd suggest if the tables are routinely cleared/insertsed, you could think about ditching the clustered index and using the tables as heap tables.. just a thought
Definately should put non-clustered index(es) on Campaign, Domain as John suggested
Your inner query (pv1) will require a nonclustered index on (Domain).
The second query (pv2) can already find the rows it needs due to the clustered index on Created, but pv1 might be returning so many rows that SQL Server decides that a table scan is quicker than all the locks it would need to take. As pv1 groups on SessionID (and hence has to order by SessionID), a nonclustered index on SessionID, Created, and including path should permit a MERGE join to occur. If not, you can force a merge join with "SELECT .. FROM pageviews pv2 INNER MERGE JOIN ..."
The two indexes listed above will be:
CREATE NONCLUSTERED INDEX ncixcampaigndomain ON PageViews (Domain)
CREATE NONCLUSTERED INDEX ncixsessionidcreated ON PageViews(SessionID, Created) INCLUDE (path)
SELECT
sessionid,
MIN(created) AS created
FROM
pageviews pv
JOIN
visitors v ON pv.visitorid = v.visitorid
WHERE
v.campaign = #Campaign
GROUP BY
sessionid
so that gives you the sessions for a campaign. Now let's see what you're doing with that.
OK, this gets rid of your grouping:
SELECT
campaignid,
sessionid,
pv.path
FROM
pageviews pv
JOIN
visitors v ON pv.visitorid = v.visitorid
WHERE
v.campaign = #Campaign
AND NOT EXISTS (
SELECT 1 FROM pageviews
WHERE sessionid = pv.sessionid
AND created < pv.created
)
To continue from doofledorf.
Try this:
where
(#Domain is null or pv1.Domain = #Domain) and
v1.Campaign = #Campaign
Ok, I have a couple of suggestions
Create this covered index:
create index idx2 on [PageViews]([SessionID], Domain, Created, Path)
If you can amend the Sessions table so that it stores the entry page, eg. EntryPageViewID you will be able to heavily optimise this.