Optimize SQL query with pagination - sql

I have a query running against a SQL Server database that is taking over 10 seconds to execute. The table being queried has over 14 million rows.
I want to display the Text column from a Notes table by a given ServiceUserId in date order. There could be thousands of entries so I want to limit the returned values to a manageable level.
SELECT Text
FROM
(SELECT
ROW_NUMBER() OVER (ORDER BY [DateDone]) AS RowNum, Text
FROM
Notes
WHERE
ServiceUserId = '6D33B91A-1C1D-4C99-998A-4A6B0CC0A6C2') AS RowConstrainedResult
WHERE
RowNum >= 40 AND RowNum < 60
ORDER BY
RowNum
Below is the execution plan for the above query.
Nonclustered Index - nonclustered index on the ServiceUserId and DateDone columns in ascending order.
Key lookup - the primary key for the table, which is NoteId
If I run the same query a second time but with different row numbers, then I get a response in milliseconds, I assume from a cached execution plan. The same query run for a different ServiceUserId will still take ~10 seconds, though.
Any suggestions for how to speed up this query?

You should look into Keyset Pagination.
It is far more performant than Rowset Pagination.
It differs fundamentally in that, instead of referencing a particular block of row numbers, you reference a starting point from which to seek the index key.
The reason it is much faster is that you don't care how many rows come before a particular key; you just seek to that key and move forward (or backward).
Say you are filtering by a single ServiceUserId, ordering by DateDone. You need an index as follows (you could leave out the INCLUDE if it's too big, it doesn't change the maths very much):
create index IX_DateDone on Notes (ServiceUserId, DateDone) INCLUDE (TEXT);
Now, when you select some rows, instead of giving the start and end row numbers, give the starting key:
SELECT TOP (20)
Text,
DateDone
FROM
Notes
WHERE
ServiceUserId = '6D33B91A-1C1D-4C99-998A-4A6B0CC0A6C2'
AND DateDone > @startingDate
ORDER BY
DateDone;
On the next run, you pass the last DateDone value you received. This gets you the next batch.
The one small downside is that you cannot jump pages. However, it is much rarer than some may think (from a UI perspective) for a user to want to jump to page 327. So that doesn't really matter.
The key must be unique. If it is not unique you can't seek to exactly the next row. If you need to use an extra column to guarantee uniqueness, it gets a little more complicated:
WITH NotesFiltered AS
(
SELECT * FROM Notes
WHERE
ServiceUserId = '6D33B91A-1C1D-4C99-998A-4A6B0CC0A6C2'
)
SELECT TOP (20)
Text,
DateDone
FROM (
SELECT
Text,
DateDone,
0 AS ordering
FROM NotesFiltered
WHERE
DateDone = @startingDate AND NoteId > @startingNoteId
UNION ALL
SELECT
Text,
DateDone,
1 AS ordering
FROM NotesFiltered
WHERE
DateDone > @startingDate
) n
ORDER BY
ordering, DateDone, NoteId;
Side Note
In RDBMSs that support row-value comparisons, the multi-column example could be simplified back to the original code by writing:
WHERE (DateDone, NoteId) > (@startingDate, @startingNoteId)
Unfortunately SQL Server does not support this currently.
Please vote for the Azure Feedback request for this

I would suggest using ORDER BY with OFFSET ... FETCH:
It starts from row number x and fetches the next z rows, and both values can be parameterized.
SELECT
Text
FROM
Notes
WHERE
ServiceUserId = '6D33B91A-1C1D-4C99-998A-4A6B0CC0A6C2'
Order by DateDone
OFFSET 40 ROWS FETCH NEXT 20 ROWS ONLY
Also make sure you have a proper index for "DateDone"; maybe add it to the index you already have on "Notes" if you have not done so yet.
You may need to include the Text column in your index:
create index IX_DateDone on Notes(DateDone) INCLUDE (TEXT,ServiceUserId)
However, be aware that adding such a huge column to the index will affect your insert/update efficiency, and of course it will need disk space.
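Since the offset and fetch values can be parameterized, here is a minimal sketch of the parameterized form (@PageNumber and @PageSize are assumed names, not from the original answer):
DECLARE @PageNumber int = 3, @PageSize int = 20;

SELECT Text
FROM Notes
WHERE ServiceUserId = '6D33B91A-1C1D-4C99-998A-4A6B0CC0A6C2'
ORDER BY DateDone
OFFSET (@PageNumber - 1) * @PageSize ROWS FETCH NEXT @PageSize ROWS ONLY;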

Related

How to speed up this TSQL query?

I have a TSQL select query that is running "slow"
SELECT
CustomerKey
,ProductKey
,RepresentativeKey
,ReportingDateKey
,SUM(i.InvoiceQuantity) AS InvoiceQuantity
,SUM(i.InvoiceQuantityKg) AS InvoiceQuantityKg
,SUM(i.BrutoInvoiceLineAmount) AS BrutoInvoiceLineAmount
,SUM(i.EndOfYearDiscount) AS EndOfYearDiscount
,SUM(i.NettoInvoiceLineAmount) AS NettoInvoiceLineAmount
,SUM(i.TotalLineCostPrice) AS CostPrice
,SUM(i.MarginAmount) AS MarginAmount
FROM FactInvoices i
WHERE
i.DossierKey =2
AND i.ReportingDate BETWEEN '2016-01-01' AND '2017-12-31'
GROUP BY
CustomerKey
,ProductKey
,RepresentativeKey
,ReportingDateKey
I'm running the query in SSMS 32bit.
Execution time is 17-21s. I have tested adding nonclustered indexes on DossierKey and ReportingDate, but this only slowed the query down.
The table has about 6.04M records, and this result set returns about 1M records.
It's running on SQL Server 2016 Developer edition.
Server specs: 8 cores, 16 GB RAM, and HDD storage => virtual server.
Looking at the execution plan, I can't find any improvements.
How do I speed this up? More hardware? I don't think that will help, because the server is not fully utilized when running this query.
Edit:
Execution Plan:
Index:
CREATE NONCLUSTERED INDEX [_dx1]
ON [dbo].[FactInvoices] ([DossierKey],[ReportingDate])
INCLUDE ([CustomerKey],[ProductKey],[ReportingDateKey],[RepresentativeKey],[InvoiceQuantity],[InvoiceQuantityKg],[BrutoInvoiceLineAmount],[NettoInvoiceLineAmount],[MarginAmount],[EndOfYearDiscount],[TotalLineCostPrice])
Thanks.
For this query:
SELECT CustomerKey, ProductKey, RepresentativeKey, ReportingDateKey,
. . .
FROM FactInvoices i
WHERE i.DossierKey = 2 AND
i.ReportingDate BETWEEN '2016-01-01' AND '2017-12-31'
GROUP BY CustomerKey, ProductKey, RepresentativeKey, ReportingDateKey;
I would recommend an index on FactInvoices(DossierKey, ReportingDate, CustomerKey, ProductKey, RepresentativeKey). The first two are the primary elements of the index used for the WHERE clause. The remaining three columns may be useful for the aggregation. You could also include all the additional columns used in the query.
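A minimal sketch of that recommendation as a CREATE INDEX statement (the index name is assumed; the INCLUDE list covers the remaining columns used in the query, per the last sentence above):
CREATE NONCLUSTERED INDEX IX_FactInvoices_Dossier_ReportingDate
ON dbo.FactInvoices (DossierKey, ReportingDate, CustomerKey, ProductKey, RepresentativeKey)
INCLUDE (ReportingDateKey, InvoiceQuantity, InvoiceQuantityKg, BrutoInvoiceLineAmount,
         EndOfYearDiscount, NettoInvoiceLineAmount, TotalLineCostPrice, MarginAmount);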
This is an article I wrote on speeding up a query.
If your query is slow, you can check the execution plan for possible areas of improvement. Well, I have done that and found that it does not always help. The same execution plan can take seconds to run, or go off into never-never land and be killed after 7 minutes.
I solved this recently using a variety of techniques that I haven't seen mentioned in one place before and wanted to help anyone else in the same situation. The solution usually returned within 2 seconds.
Here is what I did.
Starting Query
This is a fairly basic query. It reports sales orders and allows the user to specify up to 6 optional where criteria.
• If the user does not enter a criterion for a value, for example Country, its criteria string is set to '' and Country is not checked.
• If the user does enter a criterion for a value, its criteria string is bracketed by '%..%'. For example, if the user enters 'Tin', strCountry is set to '%Tin%' and all countries with 'Tin' in the name are selected (Argentina and Martinique, for example).
SELECT Top 1000
SalesHeader.InvoiceNumber
,SalesHeader.CompanyName
,SalesHeader.Street
,SalesHeader.City
,SalesHeader.Region
,SalesHeader.Country
,SalesHeader.SalesDate
,SalesHeader.InvoiceTotal
,SalesLineItem.LineItemNbr
,SalesLineItem.PartNumber
,SalesLineItem.Quantity
,SalesLineItem.UnitPrice
,SalesLineItem.Quantity * SalesLineItem.UnitPrice as ExtPrice
,PartMaster.UnitWeight
,SalesLineItem.Quantity * PartMaster.UnitWeight as ExtWeight
FROM dbo.SalesHeader
left join dbo.SalesLineItem on SalesHeader.InvoiceNumber = SalesLineItem.InvoiceNumber
left join dbo.PartMaster on SalesLineItem.PartNumber = PartMaster.PartNumber
where
(@strCountry = '' or Country like @strCountry)
and
(@strCompanyName = '' or CompanyName like @strCompanyName)
and
(@strPartNumber = '' or SalesLineItem.PartNumber like @strPartNumber)
and
(@strInvoiceNumber = '' or SalesHeader.InvoiceNumber like @strInvoiceNumber)
and
(@strRegion = '' or Region like @strRegion)
and
(@mnyExtPrice = 0 or (SalesLineItem.Quantity * SalesLineItem.UnitPrice) > @mnyExtPrice)
Order By
InvoiceNumber,
Region,
ExtPrice
I am taking this from a data warehouse I worked on. There were 260,000 records in the full query. We limited the return to 1,000 records as a user would never want more than that.
Sometimes the query would take 10 seconds or less and sometimes we would have to kill it after over 7 minutes had gone by. A user is not going to wait 7 minutes.
What We Came Up With
There are different techniques to speed up a query. The following is our resulting query. I go over each of the techniques used below.
This new query generally returned results in 2 seconds or less.
SELECT
InvoiceNumber
,Company
,Street
,City
,Region
,Country
,SalesDate
,InvoiceTotal
,LineItemNbr
,PartNumber
,Quantity
,UnitPrice
,ExtPrice
,UnitWeight
,ExtWeight
FROM
(
SELECT top 1000
IdentityID,
ROW_NUMBER() OVER (ORDER BY [SalesDate], [Country], [Company], [PartNumber]) as RowNbr
FROM dbo.SalesCombined with(index(NCI_InvcNbr))
where
(@strCountry = '' or Country like @strCountry)
and
(@strCompany = '' or Company like @strCompany)
and
(@strPartNumber = '' or PartNumber like @strPartNumber)
and
(@strInvoiceNumber = '' or InvoiceNumber like @strInvoiceNumber)
and
(@strRegion = '' or Region like @strRegion)
and
(@mnyExtPrice = 0 or ExtPrice > @mnyExtPrice)
) SubSelect
Inner Join dbo.SalesCombined on SubSelect.IdentityID = SalesCombined.IdentityID
Order By
RowNbr
Technique 1 - Denormalize the data.
I was fortunate in two ways:
• The data was small enough to create a second copy of it.
• The data did not change very often. This meant I could structure the second copy optimized for querying and allow updating to take a while.
The SalesHeader, SalesLineItem and PartMaster tables were merged into the single SalesCombined table.
The calculated values were stored in the SalesCombined table as well.
Note that I left the original tables in place. All code to update those tables was still valid. I had to create additional code to then propagate the changes to the SalesCombined table.
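A hedged sketch of how the SalesCombined copy might be populated, assuming a SELECT INTO with the IDENTITY() function (illustrative only, not the exact code used):
SELECT
    IDENTITY(INT, 1, 1) AS IdentityID,                        -- see Technique 2 below
    h.InvoiceNumber, h.CompanyName AS Company, h.Street, h.City, h.Region, h.Country,
    h.SalesDate, h.InvoiceTotal,
    l.LineItemNbr, l.PartNumber, l.Quantity, l.UnitPrice,
    l.Quantity * l.UnitPrice  AS ExtPrice,                    -- calculated values stored in the table
    p.UnitWeight,
    l.Quantity * p.UnitWeight AS ExtWeight
INTO dbo.SalesCombined
FROM dbo.SalesHeader h
LEFT JOIN dbo.SalesLineItem l ON h.InvoiceNumber = l.InvoiceNumber
LEFT JOIN dbo.PartMaster p ON l.PartNumber = p.PartNumber;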
Technique 2 - Created An Integer Identity Value
The first field of this denormalized table is an integer identity value. This was called IdentityID.
Even if we had not denormalized the data, an integer identity value in SalesHeader could have been used for the join between it and SalesLineItem and speeded the original query up a bit.
Technique 3 - Created A Clustered Index On This Integer Identity Value
I created a clustered index on this IdentityID value. This is the fastest way to find a record.
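A minimal sketch of that clustered index (the index name is assumed; declaring IdentityID as a clustered primary key would achieve the same thing):
CREATE UNIQUE CLUSTERED INDEX CIX_SalesCombined_IdentityID
ON dbo.SalesCombined (IdentityID);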
Technique 4 - Created A Unique, Non-Clustered Index On The Sort Fields
The query's output is sorted on four fields, SalesDate, Country, Company, PartNumber. So I created an index on these fields SalesDate, Country, Company and PartNumber.
Then I added the IdentityID to this index. This index was noted as Unique. This allowed SQL Server to go from the sort fields to the address, essentially, of the actual record as quickly as possible.
Technique 5: Include In the Non-Clustered Index All 'Where Clause' Fields
A SQL Server index can include fields that are not part of the sort. (Who thought of this? It's a great idea.) If you include all where clause fields in the index, SQL Server does not have to look up the actual record to obtain this data.
This is the normal look up process:
1) Read the index from disk.
2) Go to the first entry on the index.
3) Find the address of the first record from that entry.
4) Read that record from disk.
5) Find any fields that are part of the where clause and apply the criteria.
6) Decide if that record is included in the query.
If you include the where clause fields in the index:
1) Read the index from disk.
2) Go to the first entry on the index.
3) Find any fields that are part of the where clause (stored in the index) and apply the criteria.
4) Decide if that record is included in the query.
CREATE UNIQUE NONCLUSTERED INDEX [NCI_InvcNbr] ON [dbo].[SalesCombined]
(
[SalesDate] ASC,
[Country] ASC,
[Company] ASC,
[PartNumber] ASC,
[IdentityID] ASC
)
INCLUDE ([InvoiceNumber],
[City],
[Region],
[ExtPrice]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
ON [PRIMARY]
The execution plan for the original query was considerably more complex.
The execution plan for our final query is much simpler - to start, it just reads the index.
Technique 6: Created A Sub-Query To Find The IdentityID Of Each Record To Output And Their Sort Order
I created a sub-query to find the records to output and the order in which to output them. Note the following:
• Technique 7 - It explicitly says to use the NCI_InvcNbr index that has all of the fields needed in it.
• Technique 8 - It uses the Row_Number function to generate an integer for each row that will be output. These values are generated 1, 2, 3, ... in the order given by the fields in the ORDER BY clause of that line.
Technique 9: Create An Enclosing Query With All Of The Values
This query specifies the values to print. It uses the Row_Number values to know the order in which to print. Note that the inner join is done on the IdentityID field which uses the Clustered index to find each record to print.
Techniques That Did Not Help
There were two techniques that we tried that did not speed up the query. These statements are both added to the end of a query.
• OPTION (MAXDOP 1) limits the number of processors to one. This will prevent any parallelism from being done. We tried this when we were experimenting with the query and had parallelism in the execution plan.
• OPTION (RECOMPILE) causes the execution plan to be recreated every time the query is run. This can be useful when different user selections can vary the query results.
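For reference, this is how the two options would be appended to a query; a hedged sketch against the SalesCombined table, not the exact statement we tested:
SELECT TOP 1000 InvoiceNumber, Company, ExtPrice
FROM dbo.SalesCombined
WHERE Country LIKE '%Tin%'
ORDER BY SalesDate, Country, Company, PartNumber
OPTION (MAXDOP 1, RECOMPILE);   -- both options go at the very end, after the ORDER BY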
Hope this can be of use.
If you have already indexed this query and it still performs badly, you can try partitioning your table by DossierKey,
and change
WHERE i.DossierKey = 2
to
WHERE $PARTITION.partition_function_name(DossierKey) = 2
https://www.cathrinewilhelmsen.net/2015/04/12/table-partitioning-in-sql-server/
https://learn.microsoft.com/en-us/sql/t-sql/functions/partition-transact-sql
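A hedged sketch of what that could look like (the partition function and scheme names, and the boundary values, are assumptions):
CREATE PARTITION FUNCTION pf_DossierKey (int)
AS RANGE LEFT FOR VALUES (1, 2, 3, 4);

CREATE PARTITION SCHEME ps_DossierKey
AS PARTITION pf_DossierKey ALL TO ([PRIMARY]);

-- After rebuilding the table (or its clustered index) on ps_DossierKey(DossierKey),
-- the filter can target a single partition:
SELECT COUNT(*)
FROM FactInvoices
WHERE $PARTITION.pf_DossierKey(DossierKey) = $PARTITION.pf_DossierKey(2);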

Does a clustered index on a column GUARANTEE returning rows sorted by that column? [duplicate]

This question already has an answer here:
Does a SELECT query always return rows in the same order? Table with clustered index
(1 answer)
Closed 8 years ago.
I am unable to get clear-cut answers on this contentious question.
MSDN documentation mentions
Clustered
Clustered indexes sort and store the data rows in the table or view
based on their key values. These are the columns included in the
index definition. There can be only one clustered index per table,
because the data rows themselves can be sorted in only one order.
The only time the data rows in a table are stored in sorted order is
when the table contains a clustered index. When a table has a
clustered index, the table is called a clustered table. If a table
has no clustered index, its data rows are stored in an unordered
structure called a heap.
However, most of the answers I see, such as:
Does a SELECT query always return rows in the same order? Table with clustered index
http://sqlwithmanoj.com/2013/06/02/clustered-index-do-not-guarantee-physically-ordering-or-sorting-of-rows/
answer in the negative.
Which is it?
Just to be clear. Presumably, you are talking about a simple query such as:
select *
from table t;
First, if all the data on the table fits on a single page and there are no other indexes on the table, it is hard for me to imagine a scenario where the result set is not ordered by the primary key. However, this is because I think the most reasonable query plan would require a full-table scan, not because of any requirement -- documented or otherwise -- in SQL or SQL Server. Without an explicit order by, the ordering in the result set is a consequence of the query plan.
That gets to the heart of the issue. When you are talking about the ordering of the result sets, you are really talking about the query plan. And, the assumption of ordering by the primary key really means that you are assuming that the query uses full-table scan. What is ironic is that people make the assumption, without actually understanding the "why". Furthermore, people have a tendency to generalize from small examples (okay, this is part of the basis of human intelligence). Unfortunately, they see consistently that results sets from simple queries on small tables are always in primary key order and generalize to larger tables. The induction step is incorrect in this example.
What can change this? Off-hand, I think that a full table scan would return the data in primary key order if the following conditions are met:
Single threaded server.
Single file filegroup
No competing indexes
No table partitions
I'm not saying this is always true. It just seems reasonable that under these circumstances such a query would use a full table scan starting at the beginning of the table.
Even on a small table, you can get surprises. Consider:
select NonPrimaryKeyColumn
from table
The query plan would probably decide to use an index on table(NonPrimaryKeyColumn) rather than doing a full table scan. The results would not be ordered by the primary key (unless by accident). I show this example because indexes can be used for a variety of purposes, not just order by or where filtering.
If you use a multi-threaded instance of the database and you have reasonably sized tables, you will quickly learn that results without an order by have no explicit ordering.
And finally, SQL Server has a pretty smart optimizer. I think there is some reluctance to use order by in a query because users think it will automatically do a sort. SQL Server works hard to find the best execution plan for the query. If it recognizes that the order by is redundant because of the rest of the plan, then the order by will not result in a sort.
And, of course, if you want to guarantee the ordering of results, you need an order by in the outermost query. Even a query like this:
select *
from (select top 100 t.* from t order by col1) t
Does not guarantee that the results are ordered in the final result set. You really need to do:
select *
from (select top 100 t.* from t order by col1) t
order by col1;
to guarantee the results in a particular order. This behavior is documented here.
Without ORDER BY, there is no default sort order, even if you have a clustered index.
In this link there is a good example:
CREATE SCHEMA Data AUTHORIZATION dbo
GO
CREATE TABLE Data.Numbers(Number INT NOT NULL PRIMARY KEY)
GO
DECLARE @ID INT;
SET NOCOUNT ON;
SET @ID = 1;
WHILE @ID < 100000 BEGIN
INSERT INTO Data.Numbers(Number)
SELECT @ID;
SET @ID = @ID+1;
END
CREATE TABLE Data.WideTable(ID INT NOT NULL
CONSTRAINT PK_WideTable PRIMARY KEY,
RandomInt INT NOT NULL,
CHARFiller CHAR(1000))
GO
CREATE VIEW dbo.WrappedRand
AS
SELECT RAND() AS random_value
GO
CREATE FUNCTION dbo.RandomInt()
RETURNS INT
AS
BEGIN
DECLARE @ret INT;
SET @ret = (SELECT random_value*1000000 FROM dbo.WrappedRand);
RETURN @ret;
END
GO
INSERT INTO Data.WideTable(ID,RandomInt,CHARFiller)
SELECT Number, dbo.RandomInt(), 'asdf'
FROM Data.Numbers
GO
CREATE INDEX WideTable_RandomInt ON Data.WideTable(RandomInt)
GO
SELECT TOP 100 ID FROM Data.WideTable
OUTPUT:
1407
253
9175
6568
4506
1623
581
As you have seen, the optimizer has chosen to use a non-clustered
index to satisfy this SELECT TOP query.
Clearly you cannot assume that your results are ordered unless you
explicitly use ORDER BY clause.
One must specify ORDER BY in the outermost query in order to guarantee rows are returned in a particular order. The SQL Server optimizer will optimize the query and data access to improve performance which may result in rows being returned in a different order. Examples of this are allocation order scans and parallelism. A relational table should always be viewed as an unordered set of rows.
I wish the MSDN documentation were clearer about this "sorting". It is more correct to say that SQL Server b-tree indexes provide ordering by 1) storing adjacent keys in the same page and 2) linking index pages in key order.

SQL Query Slow When Ordering by a PK Column

Why does this query take a long time (30+ seconds) to run?
A:
SELECT TOP 10 * FROM [Workflow] ORDER BY ID DESC
But this query is fast (0 seconds):
B:
SELECT TOP 10 * FROM [Workflow] ORDER BY ReadTime DESC
And this is fast (0 seconds), too:
C:
SELECT TOP 10 * FROM [Workflow] WHERE SubId = '120611250634'
I get why B and C are fast (the execution plans for each were attached as screenshots).
But I don't get why A takes so long when we have the key shown in the third screenshot.
Edit: The estimated execution plans using ID and using ReadTime were also attached as screenshots.
Well, your primary key is on both ID (ASC) and ReadTime (ASC). The order is not important when you have only a single-column index, but it does matter when you have more columns in the index (a composite key).
Composite clustered keys are not really made for ordering. I'd expect that using
SELECT TOP 10 * FROM [Workflow] ORDER BY ID ASC
Will be rather fast, and the best would be
SELECT TOP 10 * FROM [Workflow] ORDER BY ID, ReadTime
Reversing the order is a tricky operation on a composite key.
So in effect, when you order by ReadTime, you have an index ready for that, and that index also knows the exact key of the row involved (both its Id and ReadTime - another good reason to keep the clustered index very narrow). It can look up all the columns rather easily. However, when you order by Id, you don't have an exact fit of an index. The server doesn't trivially know how many rows there are for a given Id, which means the top gets a bit trickier than you'd guess. In effect, your clustered index turns into a waste of space and performance (as far as those sample queries are concerned).
Seeing just the tiny part of your database, I'd say having a clustered index on Id and ReadTime is a bad idea. Why do you do that?
It looks like ID isn't a PK by itself, but along with ReadTime (based on your 3rd picture).
Therefore the index is built on the (ID,ReadTime) pair, and this index isn't used by your query.
Try adding an index on ID only.
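A minimal sketch of such an index (the name is assumed):
CREATE NONCLUSTERED INDEX IX_Workflow_ID
ON dbo.[Workflow] (ID);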

processing large table - how do i select the records page by page?

I need to do a process on all the records in a table. The table could be very big, so I'd rather process the records page by page. I need to remember the records that have already been processed so they are not included in my second SELECT result.
Like this:
For first run,
[SELECT 100 records FROM MyTable]
For second run,
[SELECT another 100 records FROM MyTable]
and so on..
I hope you get the picture. My question is: how do I write such a select statement?
I'm using Oracle, by the way, but it would be nice if it could run on any other DB too.
I also don't want to use a stored procedure.
Thank you very much!
Any solution you come up with to break the table into smaller chunks will end up taking more time than just processing everything in one go, unless the table is partitioned and you can process exactly one partition at a time.
If a full table scan takes 1 minute, it will take you 10 minutes to break the table up into 10 pieces, because each piece requires its own scan. If the table rows are physically ordered by the values of an indexed column that you can use, this changes a bit due to the clustering factor, but it will still take longer than processing the table in one go.
This all depends on how long it takes to process one row from the table, of course. You could choose to reduce the load on the server by processing chunks of data, but from a performance perspective, you cannot beat a full table scan.
You are most likely going to want to take advantage of Oracle's stopkey optimization, so you don't end up with a full table scan when you don't want one. There are a couple of ways to do this. The first way is a little longer to write, but lets Oracle automatically figure out the number of rows involved:
select *
from
(
select rownum rn, v1.*
from (
select *
from table t
where filter_columns = 'where clause'
order by columns_to_order_by
) v1
where rownum <= 200
)
where rn >= 101;
You could also achieve the same thing with the FIRST_ROWS hint:
select /*+ FIRST_ROWS(200) */ *
from (
select rownum rn, v1.*
from (
select *
from table t
where filter_columns = 'where clause'
order by columns_to_order_by
) v1
) v2
where rn between 101 and 200;
I much prefer the rownum method, so you don't have to keep changing the value in the hint (which would need to represent the end value and not the number of rows actually returned to the page to be accurate). You can set up the start and end values as bind variables that way, so you avoid hard parsing.
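A sketch of the rownum version with bind variables for the page boundaries (keeping the placeholder names from the query above; :start_row and :end_row are assumed bind names):
select *
from
(
    select rownum rn, v1.*
    from (
        select *
        from table t
        where filter_columns = 'where clause'
        order by columns_to_order_by
    ) v1
    where rownum <= :end_row
)
where rn >= :start_row;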
For more details, you can check out this post

SQL Server slow select from large table

I have a table with about 20+ million records.
Structure is like:
EventId UNIQUEIDENTIFIER
SourceUserId UNIQUEIDENTIFIER
DestinationUserId UNIQUEIDENTIFIER
CreatedAt DATETIME
TypeId INT
MetaId INT
Table is receiving about 100k+ records each day.
I have indexes on each column except MetaId, as it is not used in 'where' clauses
The problem is when I want to pick up, e.g., the latest 100 records for a desired SourceUserId.
Query sometimes takes up to 4 minutes to execute, which is not acceptable.
Eg.
SELECT TOP 100 * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND
(
TypeId IN (2, 3, 4)
OR
(TypeId = 60 AND SrcMemberId != DstMemberId)
)
ORDER BY CreatedAt DESC
I can't do partitioning etc as I am using Standard version of SQL Server and Enterprise is too expensive.
I also think that the table is quite small to be this slow.
I think the problem is with the ORDER BY clause, as the DB must go through a much bigger set of data.
Any ideas how to make it quicker?
Perhaps a relational database is not a good idea for this kind of data.
Data is always being picked up ordered by CreatedAt DESC
Thank you for reading.
PabloX
You'll likely want to create a composite index for this type of query. When the query runs slowly, it is most likely choosing to scan down an index on the CreatedAt column and perform a residual filter on the SourceUserId value, when in reality what you want is to jump directly to all records for a given SourceUserId, ordered properly. To achieve this, create a composite index primarily on SourceUserId (performing an equality check) and secondarily on CreatedAt (to preserve the order within a given SourceUserId value). You may want to try adding TypeId as well, depending on the selectivity of that column.
So, the 2 that will most likely give the best repeatable performance (try them out and compare) would be:
Index on (SourceUserId, CreatedAt)
Index on (SourceUserId, TypeId, CreatedAt)
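A minimal sketch of those two candidates as CREATE INDEX statements (index names are assumed):
CREATE NONCLUSTERED INDEX IX_Events_SourceUser_Created
ON dbo.Events (SourceUserId, CreatedAt);

CREATE NONCLUSTERED INDEX IX_Events_SourceUser_Type_Created
ON dbo.Events (SourceUserId, TypeId, CreatedAt);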
As always, there are also many other considerations to take into account when determining how/what/where to index; as Remus discusses in a separate answer, one big consideration is covering the query vs. keeping lookups. Additionally, you'll need to consider write volumes, possible fragmentation impact (if any), singleton lookups vs. large sequential scans, etc.
I have indexes on each column except
MetaId
Non-covering indexes will likely hit the 'tipping point' and the query will revert to a table scan. Just adding an index on every column used in a where clause does not equate to good index design. To take your query as an example, a good 100% covering index would be:
INDEX ON (SourceUserId , CreatedAt) INCLUDE (TypeId, SrcMemberId, DstMemberId)
The following index is also useful, although it is still going to cause lookups:
INDEX ON (SourceUserId , CreatedAt) INCLUDE (TypeId)
and finally, an index without any included columns may help, but it is just as likely to be ignored (it depends on the column statistics and cardinality estimates):
INDEX ON (SourceUserId , CreatedAt)
But a separate index on SourceUserId and one on CreatedAt is basically useless for your query.
See Index Design Basics.
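For reference, the 100% covering index above written out in full CREATE INDEX syntax (the index name is an assumption):
CREATE NONCLUSTERED INDEX IX_Events_SourceUserId_CreatedAt
ON dbo.Events (SourceUserId, CreatedAt)
INCLUDE (TypeId, SrcMemberId, DstMemberId);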
The fact that the table has indexes built on GUID values, indicates a possible series of problems that would affect performance:
High index fragmentation: since new GUIDs are generated randomly, the index cannot organize them in a sequential order and the nodes are spread unevenly.
High number of page splits: the size of a GUID (16 bytes) causes many page splits in the index, since there's a greater chance that a new value won't fit in the remaining space available in a page.
Slow value comparison: comparing two GUIDs is a relatively slow operation because all 16 bytes must be compared.
Here a couple of resources on how to investigate and resolve these problems:
How to Detect Index Fragmentation in SQL Server 2000 and 2005
Reorganizing and Rebuilding Indexes
How Using GUIDs in SQL Server Affect Index Performance
I would recommend getting the data into 2 separate temp tables:
SELECT * INTO #Table1
FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND
(
TypeId IN (2, 3, 4)
)
SELECT * INTO #Table2
FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND
(
(TypeId = 60 AND SrcMemberId != DstMemberId)
)
Then apply a union over the two selects, ordered, with a TOP (see the sketch below). Limit the data from the get-go.
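A sketch of that final step, assuming the two temp tables above:
SELECT TOP 100 *
FROM
(
    SELECT * FROM #Table1
    UNION ALL
    SELECT * FROM #Table2
) x
ORDER BY CreatedAt DESC;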
I suggest using a UNION:
SELECT TOP 100 x.*
FROM (SELECT a.*
FROM EVENTS a
WHERE a.typeid IN (2, 3, 4)
UNION ALL
SELECT b.*
FROM EVENTS b
WHERE b.typeid = 60
AND b.srcmemberid != b.dstmemberid) x
WHERE x.sourceuserid = '15b534b17-5a5a-415a-9fc0-7565199c3461'
ORDER BY x.createdat DESC
We've realised a minor gain by moving to a BIGINT IDENTITY key for our event table; by using that as a clustered primary key, we can cheat and use that for date ordering.
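A hedged sketch of that trick (assuming EventId is the new BIGINT IDENTITY clustered primary key):
SELECT TOP 100 *
FROM Events
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
ORDER BY EventId DESC;   -- identity order approximates insertion (CreatedAt) order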
I would make sure CreatedAt is indexed properly.
You could split the query in two with a UNION to avoid the OR (which can cause your index not to be used), something like:
SELECT TOP 100 * FROM (
SELECT * FROM (
SELECT TOP 100 * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND TypeId IN (2, 3, 4)
ORDER BY CreatedAt DESC
) a
UNION ALL
SELECT * FROM (
SELECT TOP 100 * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND TypeId = 60 AND SrcMemberId != DstMemberId
ORDER BY CreatedAt DESC
) b
) u
ORDER BY CreatedAt DESC
Also, check that the uniqueidentifier indexes are not CLUSTERED.
If there are 100K records added each day, you should check your index fragmentation.
And rebuild or reorganize it accordingly.
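A hedged sketch for checking and fixing fragmentation (object and index names assumed; the reorganize/rebuild split is a common rule of thumb, not a hard rule):
SELECT i.name AS index_name,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Events'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id = ps.index_id;

ALTER INDEX ALL ON dbo.Events REORGANIZE;   -- for moderate fragmentation
-- ALTER INDEX ALL ON dbo.Events REBUILD;   -- for heavy fragmentation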
More info :
SQLauthority