Indexing for tables used in a temporal query - SQL

I have two tables with these structures:
CREATE TABLE [dbo].[rx](
[clmid] [int] NOT NULL, -- key column referenced by the primary key and index below; type assumed
[pat_id] [int] NOT NULL,
[fill_Date] [date] NOT NULL,
[script_End_Date] AS (dateadd(day,[days_Sup],[fill_Date])),
[drug_Name] [varchar](50) NULL,
[days_Sup] [int] NOT NULL,
[quantity] [float] NOT NULL,
[drug_Class] [char](3) NOT NULL,
[ofInterest] bit,
CHECK (fill_Date <= script_End_Date),
PRIMARY KEY CLUSTERED
(
[clmid] ASC
)
)
CREATE TABLE [dbo].[Calendar](
[cal_date] [date] PRIMARY KEY,
[Year] AS YEAR(cal_date) PERSISTED,
[Month] AS MONTH(cal_date) PERSISTED,
[Day] AS DAY(cal_date) PERSISTED,
[julian_seq] AS 1+DATEDIFF(DD, CONVERT(DATE, CONVERT(varchar,YEAR(cal_date))+'0101'),cal_date),
id int identity);
I used these tables with this query:
;WITH x
AS (SELECT rx.pat_id,
c.cal_date,
Count(DISTINCT rx.drug_name) AS distinctDrugs
FROM rx,
calendar AS c
WHERE c.cal_date BETWEEN rx.fill_date AND rx.script_end_date
AND rx.ofinterest = 1
GROUP BY rx.pat_id,
c.cal_date
--the example query I used had HAVING COUNT(1) = 2; to illustrate the non-contiguous intervals, in practice I need the HAVING statement below
HAVING Count(*) > 1),
y
AS (SELECT x.pat_id,
x.cal_date,
--c2.id is the row number in the calendar table.
c2.id - Row_number()
OVER(
partition BY x.pat_id
ORDER BY x.cal_date) AS grp_nbr,
distinctdrugs
FROM x,
calendar AS c2
WHERE c2.cal_date = x.cal_date)
SELECT *,
Rank()
OVER(
partition BY pat_id, grp_nbr
ORDER BY distinctdrugs) AS [ranking]
FROM y
The calendar table runs for three years and the rx table has about 800k rows in it. After the preceding query ran for a few minutes, I decided to add an index to speed things up. The index that I added was:
create index ix_rx
on rx (clmid)
include (pat_id,fill_date,script_end_date,ofinterest)
This index had zero effect on the run time of the query. Can anyone help explain why the aforementioned index is not being used? This is a retrospective database and no more data will be added to it. I can add the execution plan if needed.

The clmid field is not used at all in the query. As such, I would be surprised if the optimizer would consider it, just for the include columns.
If you want to speed up the query with indexes, start with the query where the table is used. The fields used are pat_id, drug_name, ofinterest, fill_date, and script_end_date. The last two are challenging because of the BETWEEN. You might try this index: rx(pat_id, drug_name, ofinterest, fill_date, script_end_date).
Having all the referenced fields in the index makes it possible to access the data without loading data pages.
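As a concrete sketch, that suggestion would be created like this (the index name is made up):
CREATE INDEX ix_rx_pat_drug
ON rx (pat_id, drug_name, ofinterest, fill_date, script_end_date);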

Because it is not an appropriate index. Create two indexes: one on [pat_id] and the other on [drug_name].

Related

Missed row when running SELECT with READCOMMITTEDLOCK

I have T-SQL code that delta-copies data from the source table (SrcTable) to the destination table (DestTable). The data is inserted into the source table by multiple sessions and copied to the destination table by a SQL Server Agent job.
Here's the snippet which inserts the batch into the destination table:
...
WITH cte
AS (SELECT st.SrcTable_ID,
st.SrcTable_CreatedDateTime
FROM SrcTable st WITH (READCOMMITTEDLOCK, INDEX(PK_SrcTable))
WHERE st.SrcTable_ID
BETWEEN @FromID AND @ToID)
INSERT DestTable
(
DestTable_SrcTableID
)
SELECT cte.SrcTable_ID
FROM cte;
...
Both tables are partitioned on their CreatedDateTime column, which defaults to SYSUTCDATETIME().
CREATE TABLE [dbo].[SrcTable](
[SrcTable_ID] [BIGINT] IDENTITY(1,1) NOT NULL,
[SrcTable_CreatedDateTime] [DATETIME2](3) NOT NULL,
CONSTRAINT [PK_SrcTable] PRIMARY KEY CLUSTERED
(
[SrcTable_ID] ASC,
[SrcTable_CreatedDateTime] ASC
) ON [ps_Daily]([SrcTable_CreatedDateTime])
) ON [ps_Daily]([SrcTable_CreatedDateTime])
GO
CREATE TABLE [dbo].[DestTable](
[DestTable_ID] [BIGINT] IDENTITY(1,1) NOT NULL,
[DestTable_CreatedDateTime] [DATETIME2](3) NOT NULL,
[DestTable_SrcTableID] [BIGINT] NOT NULL,
CONSTRAINT [PK_DestTable] PRIMARY KEY CLUSTERED
(
[DestTable_ID] ASC,
[DestTable_CreatedDateTime] ASC
) ON [ps_Daily]([DestTable_CreatedDateTime])
) ON [ps_Daily]([DestTable_CreatedDateTime])
GO
This code has been running for years copying millions of records a day with no issues.
Recently it started missing a single row every couple of weeks.
Here's an example of such a batch, with @FromID = 2140, @ToID = 2566, and one missing row (2140):
SELECT * FROM dbo.SrcTable st
LEFT JOIN dbo.DestTable dt ON st.SrcTable_ID=dt.DestTable_SrcTableID
WHERE st.SrcTable_ID BETWEEN 2140 AND 2566
ORDER BY st.SrcTable_ID ASC
The only plausible explanation I can think of is that the allocation of identity values (SrcTable_ID) happens outside of the transaction that inserts into the source table (which I learned from an excellent answer by Paul White on a related question), but judging by the timestamps in both tables this scenario seems highly unlikely.
The question is:
How likely is it that the missing row was invisible to the SELECT statement because its identity value was allocated outside of the inserting transaction and before the lock was acquired, given that the next row in the batch (2141) was inserted into the source table a couple of seconds later but was successfully picked up?
We're running on Microsoft SQL Server 2019 (RTM-CU16) (KB5011644) - 15.0.4223.1 (X64)

Is it good to have multiple inner joins in SQL select statement?

I have a table which looks like this:
CREATE TABLE [dbo].[Devices]
(
[Device_ID] [nvarchar](10) NOT NULL,
[Series_ID] [int] NOT NULL,
[Start_Date] [date] NULL,
[Room_ID] [int] NOT NULL,
[No_Of_Ports] [int] NULL,
[Description] [text] NULL
);
I want to show this table in a gridview, but instead of showing the [Series_ID] column I want to show three columns (Series_Name, Brand_Name, and Type_Name) from other tables, and instead of showing the [Room_ID] column I want to show three columns (Site_Name, Floor_Name, Room_Name) from other tables.
I can do that with more than 6 inner joins. I am a beginner in SQL, and I want to know: is it all right to have a lot of inner joins in one statement, from a performance point of view?
Based on your explanation, I assume it will be 2 inner joins instead of 6.
If Series_Name, Brand_Name, and Type_Name are in one table with Series_ID as the foreign key, then you would need one join.
Similarly, if Site_Name, Floor_Name, and Room_Name are in one table with Room_ID as the foreign key, then you would need another inner join.
Again, it is difficult to tell the exact number of joins without knowing the table structure of the other referenced tables.
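As a rough sketch, assuming hypothetical dbo.Series and dbo.Rooms lookup tables holding those columns, the query might look like this:
-- Hypothetical lookup tables: dbo.Series (Series_ID, Series_Name, Brand_Name, Type_Name)
-- and dbo.Rooms (Room_ID, Site_Name, Floor_Name, Room_Name).
SELECT d.Device_ID,
       s.Series_Name, s.Brand_Name, s.Type_Name,
       r.Site_Name, r.Floor_Name, r.Room_Name,
       d.Start_Date, d.No_Of_Ports, d.[Description]
FROM dbo.Devices d
INNER JOIN dbo.Series s ON s.Series_ID = d.Series_ID
INNER JOIN dbo.Rooms r ON r.Room_ID = d.Room_ID;
With proper foreign keys and indexes, two joins like this are not a performance concern.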

SQL Server query plan

I have 3 tables as listed below
CREATE TABLE dbo.RootTransaction
(
TransactionID int NOT NULL,
CONSTRAINT [PK_RootTransaction] PRIMARY KEY NONCLUSTERED (TransactionID ASC)
)
GO
----------------------------------------------------------------------------------------------------
CREATE TABLE [dbo].[OrderDetails](
[OrderID] int identity(1,1) not null,
TransactionID int,
OrderDate datetime,
[Status] varchar(50),
CONSTRAINT [PK_OrderDetails] PRIMARY KEY CLUSTERED ([OrderID] ASC),
CONSTRAINT [FK_TransactionID] FOREIGN KEY ([TransactionID]) REFERENCES [dbo].[RootTransaction] ([TransactionID])
) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [ix_OrderDetails_TransactionID]
ON [dbo].[OrderDetails](TransactionID ASC, [OrderID] ASC);
GO
----------------------------------------------------------------------------------------------------
CREATE TABLE dbo.OrderItems
(
ItemID int identity(1,1) not null,
[OrderID] int,
[Name] VARCHAR (50) NOT NULL,
[Code] VARCHAR (9) NULL,
CONSTRAINT [PK_OrderItems] PRIMARY KEY NONCLUSTERED ([ItemID] ASC),
CONSTRAINT [FK_OrderID] FOREIGN KEY ([OrderID]) REFERENCES [dbo].[OrderDetails] ([OrderID])
)
Go
CREATE CLUSTERED INDEX OrderItems
ON [dbo].OrderItems([OrderID] ASC, ItemID ASC) WITH (FILLFACTOR = 90);
GO
CREATE NONCLUSTERED INDEX [IX_Code]
ON [dbo].[OrderItems]([Code] ASC) WITH (FILLFACTOR = 90)
----------------------------------------------------------------------------------------------------
Populated sample data in each table
select COUNT(*) from RootTransaction -- 45851
select COUNT(*) from [OrderDetails] -- 50201
select COUNT(*) from OrderItems --63850
-- Query 1
SELECT o.TransactionID
FROM [OrderDetails] o
JOIN dbo.OrderItems i ON o.OrderID = i.OrderID
WHERE i.Code like '1067461841%'
declare @SearchKeyword varchar(200) = '1067461841'
-- Query 2
SELECT o.TransactionID
FROM [OrderDetails] o
JOIN dbo.OrderItems i ON o.OrderID = i.OrderID
WHERE i.Code like @SearchKeyword + '%'
When running the above two queries, I can see that Query 1 uses an index seek on OrderDetails and OrderItems, which is expected.
However, for Query 2 the query plan uses an index seek on OrderItems but an index scan on OrderDetails.
The only difference between the two queries is using a direct value vs. a variable in LIKE, and both return the same result.
Why does the query execution plan change between using a direct value and a variable?
I believe the issue is most likely explained through parameter sniffing. SQL Server often identifies and caches query plans for commonly used queries. As part of this caching, it "sniffs" the parameters you use on the most common queries to optimize the creation of the plan.
Query 1 contains a direct string, so SQL Server creates a specific plan for it. Query 2 uses an intermediate variable, which is one of the techniques that actually prevents parameter sniffing (often used to provide more predictable performance for stored procs or queries where the parameter values vary significantly). These are considered two completely different queries by SQL Server despite the obvious similarities, and the observed differences are essentially just optimization.
Furthermore, if your tables had different row-count distributions, you'd likely see differences between those two scenarios based on the existing indexes and the optimizations available. On my server, with no sample data loaded, Query 1 and Query 2 had the same execution plans, since the optimizer couldn't find any better path for the parameters.
For more info: http://blogs.technet.com/b/mdegre/archive/2012/03/19/what-is-parameter-sniffing.aspx
The queries below show a similar plan even though the WHERE clauses differ.
select Code from OrderItems WHERE Code like '6662225%'
declare @SearchKeyword varchar(200) = '6662225'
select Code from OrderItems WHERE Code like @SearchKeyword + '%'
The following post/answers offer a good explanation as to why performance is better with hard coded constants than variables, along with a few suggestions you could possibly try out:
Alternative to using local variables in a where clause
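One commonly suggested workaround along these lines is a statement-level recompile, so the optimizer sees the variable's actual value at execution time; a sketch using the query above:
declare @SearchKeyword varchar(200) = '1067461841'
SELECT o.TransactionID
FROM [OrderDetails] o
JOIN dbo.OrderItems i ON o.OrderID = i.OrderID
WHERE i.Code like @SearchKeyword + '%'
OPTION (RECOMPILE); -- compiles this statement using the current value of @SearchKeyword
The trade-off is a compilation on every execution, so it suits infrequently run or highly variable queries.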

Type II dimension joins

I have the following lookup table in my OLTP database:
CREATE TABLE TransactionState
(
TransactionStateId INT IDENTITY (1, 1) NOT NULL,
TransactionStateName VarChar (100)
)
When this comes into my OLAP, I change the structure as follows:
CREATE TABLE TransactionState
(
TransactionStateId INT NOT NULL, /* not an IDENTITY column in OLAP */
TransactionStateName VarChar (100) NOT NULL,
StartDateTime DateTime NOT NULL,
EndDateTime DateTime NULL
)
My question is regarding the TransactionStateId column. Over time, I may have duplicate TransactionStateId values in my OLAP, but with the combination of StartDateTime and EndDateTime, they would be unique.
I have seen samples of Type-2 Dimensions where an OriginalTransactionStateId is added and the incoming TransactionStateId is mapped to it, plus a new TransactionStateId IDENTITY field becomes the PK and is used for the joins.
CREATE TABLE TransactionState
(
TransactionStateId INT IDENTITY (1, 1) NOT NULL,
OriginalTransactionStateId INT NOT NULL, /* not an IDENTITY column in OLAP */
TransactionStateName VarChar (100) NOT NULL,
StartDateTime DateTime NOT NULL,
EndDateTime DateTime NULL
)
Should I go with bachelorette #2 or bachelorette #3?
By this phrase:
With the combination of StartDateTime and EndDateTime, they would be unique.
do you mean that they never overlap, or that they satisfy a database UNIQUE constraint?
If the former, then you can use StartDateTime in joins, but note that it may be inefficient, since it will use a "<=" condition instead of "=".
If the latter, then just use a fake identity.
Databases in general do not support an efficient algorithm for this query:
SELECT *
FROM TransactionState
WHERE @value BETWEEN StartDateTime AND EndDateTime
, unless you do arcane tricks with SPATIAL data.
That's why you'll have to use this condition in a JOIN:
SELECT *
FROM factTable
CROSS APPLY
(
SELECT TOP 1 *
FROM TransactionState
WHERE StartDateTime <= factDateTime
ORDER BY
StartDateTime DESC
) ts
, which deprives the optimizer of the possibility of using a HASH JOIN, which in many cases is the most efficient join for such queries.
See this article for more details on this approach:
Converting currencies
Rewriting the query so that it can use a HASH JOIN resulted in a 600% performance gain, though that's only possible if your datetimes have an accuracy of a day or coarser (otherwise the hash table would grow very large).
Since the time component is stripped from your StartDateTime and EndDateTime, you can create a CTE like this:
WITH cal AS
(
SELECT CAST('2009-01-01' AS DATE) AS cdate
UNION ALL
SELECT DATEADD(day, 1, cdate)
FROM cal
WHERE cdate <= '2009-03-01'
),
state AS
(
SELECT cdate, ts.*
FROM cal
CROSS APPLY
(
SELECT TOP 1 *
FROM TransactionState
WHERE StartDateTime <= cdate
ORDER BY
StartDateTime DESC
) ts
WHERE ts.EndDateTime >= cdate
)
SELECT *
FROM factTable
JOIN state
ON cdate = CAST(factDate AS DATE)
If your date ranges span more than 100 dates, adjust the MAXRECURSION option on the CTE.
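For example, a minimal self-contained sketch of the hint (the date range here is made up):
-- Generates one row per day for 2009; the default recursion limit of 100 would be exceeded,
-- so the limit is raised explicitly (MAXRECURSION 0 removes it entirely).
WITH cal AS
(
SELECT CAST('2009-01-01' AS DATE) AS cdate
UNION ALL
SELECT DATEADD(day, 1, cdate)
FROM cal
WHERE cdate < '2009-12-31'
)
SELECT cdate
FROM cal
OPTION (MAXRECURSION 366);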
Please be aware that IDENTITY(1,1) is a declaration for auto-generating values in that column. It is different from PRIMARY KEY, which is a declaration that makes a column the primary key, backed by a clustered index by default. These two declarations mean different things, and there are performance implications if you don't say PRIMARY KEY.
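A minimal illustration of the two separate declarations (the table and constraint names here are hypothetical):
CREATE TABLE dbo.TransactionStateDim
(
TransactionStateId INT IDENTITY (1, 1) NOT NULL -- IDENTITY only auto-generates values
    CONSTRAINT PK_TransactionStateDim PRIMARY KEY, -- PRIMARY KEY makes it the key (clustered by default)
TransactionStateName VARCHAR (100) NOT NULL
);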
You could also use SSIS to load the DW. In the slowly changing dimension (SCD) transformation, you can set how to treat each attribute. If a historical attribute is selected, the type 2 SCD is applied to the whole row, and the transformation takes care of details. You also get to configure if you prefer start_date, end_date or a current/expired column.
The thing to differentiate here is the difference between the primary key and the business (natural) key. The primary key uniquely identifies a row in the table. The business key uniquely identifies a business object/entity, and it can be repeated in a dimension table. Each time SCD 2 is applied, a new row is inserted with a new primary key but the same business key; the old row is then marked as expired, while the new one is marked as current -- or the start date and end date fields are populated appropriately.
The DW should not expose primary keys, so incoming data from OLTP contains business keys, while the assignment of primary keys is under the control of the DW; an IDENTITY int is good for PKs in dimension tables.
The cool thing is that SCD transformation in SSIS takes care of this.
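For reference, a rough T-SQL sketch of what that amounts to for one changed business key, using the third table definition above and assuming EndDateTime IS NULL marks the current row (the variable values are hypothetical):
DECLARE @BusinessKey INT = 42,             -- incoming OriginalTransactionStateId
        @NewName VARCHAR(100) = 'Settled', -- changed attribute value
        @Now DATETIME = GETDATE();
-- Expire the currently active row for this business key.
UPDATE TransactionState
SET EndDateTime = @Now
WHERE OriginalTransactionStateId = @BusinessKey
  AND EndDateTime IS NULL;
-- Insert the new current row; TransactionStateId comes from the IDENTITY column.
INSERT INTO TransactionState
    (OriginalTransactionStateId, TransactionStateName, StartDateTime, EndDateTime)
VALUES
    (@BusinessKey, @NewName, @Now, NULL);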

Query against 250k rows taking 53 seconds

The box this query is running on is a dedicated server running in a datacenter.
AMD Opteron 1354 Quad-Core 2.20GHz
2GB of RAM
Windows Server 2008 x64 (Yes I know I only have 2GB of RAM, I'm upgrading to 8GB when the project goes live).
So I went through and created 250,000 dummy rows in a table to really stress test some queries that LINQ to SQL generates and make sure they're not too terrible, and I noticed one of them was taking an absurd amount of time.
I had this query down to 17 seconds with indexes, but I removed them for the sake of this post to go from start to finish. The only indexes are the primary keys.
Stories table --
[ID] [int] IDENTITY(1,1) NOT NULL,
[UserID] [int] NOT NULL,
[CategoryID] [int] NOT NULL,
[VoteCount] [int] NOT NULL,
[CommentCount] [int] NOT NULL,
[Title] [nvarchar](96) NOT NULL,
[Description] [nvarchar](1024) NOT NULL,
[CreatedAt] [datetime] NOT NULL,
[UniqueName] [nvarchar](96) NOT NULL,
[Url] [nvarchar](512) NOT NULL,
[LastActivityAt] [datetime] NOT NULL,
Categories table --
[ID] [int] IDENTITY(1,1) NOT NULL,
[ShortName] [nvarchar](8) NOT NULL,
[Name] [nvarchar](64) NOT NULL,
Users table --
[ID] [int] IDENTITY(1,1) NOT NULL,
[Username] [nvarchar](32) NOT NULL,
[Password] [nvarchar](64) NOT NULL,
[Email] [nvarchar](320) NOT NULL,
[CreatedAt] [datetime] NOT NULL,
[LastActivityAt] [datetime] NOT NULL,
Currently the database holds 1 user, 1 category, and 250,000 stories, and I tried to run this query:
SELECT TOP(10) *
FROM Stories
INNER JOIN Categories ON Categories.ID = Stories.CategoryID
INNER JOIN Users ON Users.ID = Stories.UserID
ORDER BY Stories.LastActivityAt
The query takes 52 seconds to run, CPU usage hovers at 2-3%, memory usage is 1.1GB with 900MB free, but the disk usage seems out of control. It's at 100MB/sec, with 2/3 of that being writes to tempdb.mdf and the rest reading from tempdb.mdf.
Now for the interesting part...
SELECT TOP(10) *
FROM Stories
INNER JOIN Categories ON Categories.ID = Stories.CategoryID
INNER JOIN Users ON Users.ID = Stories.UserID
SELECT TOP(10) *
FROM Stories
INNER JOIN Users ON Users.ID = Stories.UserID
ORDER BY Stories.LastActivityAt
SELECT TOP(10) *
FROM Stories
INNER JOIN Categories ON Categories.ID = Stories.CategoryID
ORDER BY Stories.LastActivityAt
All 3 of these queries are pretty much instant.
Exec plan for first query.
http://i43.tinypic.com/xp6gi1.png
Exec plans for other 3 queries (in order).
http://i43.tinypic.com/30124bp.png
http://i44.tinypic.com/13yjml1.png
http://i43.tinypic.com/33ue7fb.png
Any help would be much appreciated.
Exec plan after adding indexes (down to 17 seconds again).
http://i39.tinypic.com/2008ytx.png
I've gotten a lot of helpful feedback from everyone, thank you, and I tried a new angle on this. I query the stories I need, then in separate queries get the Categories and Users, and with 3 queries it only took 250 ms... I don't understand the issue, but if it works, and at 250 ms no less, I'll stick with that for the time being. Here's the code I used to test this:
DBDataContext db = new DBDataContext();
Console.ReadLine();
Stopwatch sw = Stopwatch.StartNew();
var stories = db.Stories.OrderBy(s => s.LastActivityAt).Take(10).ToList();
var categoryIDs = stories.Select(s => s.CategoryID).ToList();
var categories = db.Categories.Where(c => categoryIDs.Contains(c.ID)).ToList();
var userIDs = stories.Select(s => s.UserID).ToList();
var users = db.Users.Where(u => userIDs.Contains(u.ID)).ToList();
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
Try adding an index on Stories.LastActivityAt. I think the clustered index scan in the execution plan may be due to the sorting.
Edit:
Since my query returned in an instant with rows just a few bytes long, but has been running for 5 minutes already and is still going after I added a 2K varchar, I think Mitch has a point. It is the volume of data that is shuffled around for nothing, but this can be fixed in the query.
Try putting the join, sort and top(10) in a view or in a nested query, and then join back against the story table to get the rest of the data just for the 10 rows that you need.
Like this:
select * from
(
SELECT TOP(10) id, categoryID, userID
FROM Stories
ORDER BY Stories.LastActivityAt
) s
INNER JOIN Stories ON Stories.ID = s.id
INNER JOIN Categories ON Categories.ID = s.CategoryID
INNER JOIN Users ON Users.ID = s.UserID
If you have an index on LastActivityAt, this should run very fast.
So if I read the first part correctly, it responds in 17 seconds with an index, which is still a while to chug out 10 records. I think the time is in the ORDER BY clause. I would want an index on LastActivityAt, UserID, CategoryID. Just for fun, remove the ORDER BY and see if it returns the 10 records quickly. If it does, then you know the problem is not in the joins to the other tables. It would also be helpful to replace the * with only the columns you need, since all 3 tables' columns end up in tempdb while you sort, as Neil mentioned.
Looking at the execution plans, you'll notice the extra sort; I believe that is the ORDER BY, which is going to take some time. I'm assuming you had an index with those 3 columns when it was 17 seconds... so you may want one index for the join criteria (UserID, CategoryID) and another for LastActivityAt, as sketched below; see if that performs better. It would also be good to run the query through the Index Tuning Wizard.
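As a sketch, the two variants suggested above would look like this (the index names are made up):
-- (a) one compound index covering the sort column plus the join columns
CREATE INDEX IX_Stories_LastActivity ON Stories (LastActivityAt, UserID, CategoryID);
-- (b) or one index for the join criteria and a separate one for the sort
CREATE INDEX IX_Stories_JoinCols ON Stories (UserID, CategoryID);
CREATE INDEX IX_Stories_Sort ON Stories (LastActivityAt);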
My first suggestion is to remove the *, and replace it with the minimum columns you need.
Second, is there a trigger involved? Something that would update the LastActivityAt field?
Based on your problem query, try adding a combination index on the Stories table (CategoryID, UserID, LastActivityAt).
You are maxing out the Disks in your hardware setup.
Given your comments about your data/log/tempdb file placement, I think any amount of tuning is going to be a band-aid.
250,000 rows is small. Imagine how bad your problems are going to be with 10 million rows.
I suggest you move tempdb onto its own physical drive (preferably a RAID 0).
OK, so my test machine isn't fast. Actually, it's really slow: 1.6 GHz, 1 GB of RAM, no multiple disks, just a single (read: slow) disk for SQL Server, the OS, and extras.
I created your tables with primary and foreign keys defined.
Inserted 2 categories, 500 random users, and 250000 random stories.
Running the first query above takes 16 seconds (no plan cache either).
If I index the LastActivityAt column I get results in under a second (no plan cache here either).
Here's the script I used to do all of this.
--Categories table --
Create table Categories (
[ID] [int] IDENTITY(1,1) primary key NOT NULL,
[ShortName] [nvarchar](8) NOT NULL,
[Name] [nvarchar](64) NOT NULL)
--Users table --
Create table Users(
[ID] [int] IDENTITY(1,1) primary key NOT NULL,
[Username] [nvarchar](32) NOT NULL,
[Password] [nvarchar](64) NOT NULL,
[Email] [nvarchar](320) NOT NULL,
[CreatedAt] [datetime] NOT NULL,
[LastActivityAt] [datetime] NOT NULL
)
go
-- Stories table --
Create table Stories(
[ID] [int] IDENTITY(1,1) primary key NOT NULL,
[UserID] [int] NOT NULL references Users ,
[CategoryID] [int] NOT NULL references Categories,
[VoteCount] [int] NOT NULL,
[CommentCount] [int] NOT NULL,
[Title] [nvarchar](96) NOT NULL,
[Description] [nvarchar](1024) NOT NULL,
[CreatedAt] [datetime] NOT NULL,
[UniqueName] [nvarchar](96) NOT NULL,
[Url] [nvarchar](512) NOT NULL,
[LastActivityAt] [datetime] NOT NULL)
Insert into Categories (ShortName, Name)
Values ('cat1', 'Test Category One')
Insert into Categories (ShortName, Name)
Values ('cat2', 'Test Category Two')
--Dummy Users
Insert into Users
Select top 500
UserName=left(SO.name+SC.name, 32)
, Password=left(reverse(SC.name+SO.name), 64)
, Email=Left(SO.name, 128)+'@'+left(SC.name, 123)+'.com'
, CreatedAt='1899-12-31'
, LastActivityAt=GETDATE()
from sysobjects SO
Inner Join syscolumns SC on SO.id=SC.id
go
--dummy stories!
-- A Count is given every 10000 record inserts (could be faster)
-- RBAR method!
set nocount on
Declare @count as bigint
Set @count = 0
begin transaction
while @count<=250000
begin
Insert into Stories
Select
USERID=floor(((500 + 1) - 1) * RAND() + 1)
, CategoryID=floor(((2 + 1) - 1) * RAND() + 1)
, votecount=floor(((10 + 1) - 1) * RAND() + 1)
, commentcount=floor(((8 + 1) - 1) * RAND() + 1)
, Title=Cast(NEWID() as VARCHAR(36))+Cast(NEWID() as VARCHAR(36))
, Description=Cast(NEWID() as VARCHAR(36))+Cast(NEWID() as VARCHAR(36))+Cast(NEWID() as VARCHAR(36))
, CreatedAt='1899-12-31'
, UniqueName=Cast(NEWID() as VARCHAR(36))+Cast(NEWID() as VARCHAR(36))
, Url=Cast(NEWID() as VARCHAR(36))+Cast(NEWID() as VARCHAR(36))
, LastActivityAt=Dateadd(day, -floor(((600 + 1) - 1) * RAND() + 1), GETDATE())
If @count % 10000=0
Begin
Print @count
Commit
begin transaction
End
Set @count=@count+1
end
commit -- commit the final (possibly empty) transaction left open by the loop
set nocount off
go
--returns in 16 seconds
DBCC DROPCLEANBUFFERS
SELECT TOP(10) *
FROM Stories
INNER JOIN Categories ON Categories.ID = Stories.CategoryID
INNER JOIN Users ON Users.ID = Stories.UserID
ORDER BY Stories.LastActivityAt
go
--Now create an index
Create index IX_LastADate on Stories (LastActivityAt asc)
go
--With an index returns in less than a second
DBCC DROPCLEANBUFFERS
SELECT TOP(10) *
FROM Stories
INNER JOIN Categories ON Categories.ID = Stories.CategoryID
INNER JOIN Users ON Users.ID = Stories.UserID
ORDER BY Stories.LastActivityAt
go
The sort is definitely where your slowdown is occurring.
Sorting mainly gets done in tempdb, and a large table will cause LOTS to be added there.
Having an index on this column will definitely improve the performance of an ORDER BY.
Also, defining your primary and foreign keys helps SQL Server immensely.
The method listed in your code is elegant, and basically the same response that cdonner wrote, except in C# and not SQL. Tuning the DB will probably give even better results!
--Kris
Have you cleared the SQL Server cache before running each of the queries?
In SQL 2000, it's something like DBCC DROPCLEANBUFFERS. Google the command for more info.
Looking at the query, I would have an index for
Categories.ID
Stories.CategoryID
Users.ID
Stories.UserID
and possibly
Stories.LastActivityAt
But yeah, sounds like the result could be bogus 'cos of caching.
When you have worked with SQL Server for some time, you will discover that even the smallest changes to a query can cause wildly different response times. From what I have read in the initial question, and looking at the query plan, I suspect that the optimizer has decided that the best approach is to form a partial result and then sort that as a separate step. The partial result is a composite of the Users and Stories tables. This is formed in tempdb. So the excessive disk access is due to the forming and then sorting of this temporary table.
I concur that the solution should be to create a compound index on Stories.LastActivityAt, Stories.UserId, Stories.CategoryId. The order is VERY important; the field LastActivityAt must be first.