SQL Server index and poor execution plan

I have an existing SQL Server database where I cannot modify the structure or the queries that are run, and I am facing a poor execution plan that hurts performance and ultimately drives up cloud database cost.
Please note that my experience with SQL is quite limited; after a lot of googling and trial and error, I still have not achieved an acceptable result. Any tips or help are much appreciated, thank you all in advance. If you would like me to provide more information, feel free to comment and I will update the post accordingly.
The issue
I have two tables: Table1 and Table2. Table2 references Table1 via the TABLE1_ID field, and we run a SQL query that extracts data from Table2 while filtering on Table1 (an INNER JOIN, I believe).
Using the following query:
DECLARE @P1 datetime
DECLARE @P2 datetime
SELECT
dbo.Table2.VALUE
FROM
dbo.Table2,
dbo.Table1
WHERE
-- joins Table1/Table2
dbo.Table1.ID = dbo.Table2.TABLE1_ID
-- filters on Table1
AND dbo.Table1.TIMESTAMP between @P1 and @P2
My understanding was that the database engine would first filter on Table1 and then join with Table2; however, the execution plan I am seeing uses a Merge Join, implying Table2 is fully scanned and then joined with the filtered results from Table1.
What I have tried
I have tried the following, attempting to identify the problem or optimize performance:
- Optimization attempt: creating an FK constraint (a sketch of this follows the list)
- Optimization attempt: creating other indexes, with and without INCLUDE columns
- Issue identification: changing the query to select a value from both Table1 and Table2 and comparing the difference
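For reference, the FK attempt looked roughly like the statement below. This is only a sketch reconstructed from the repro script further down (the constraint name FK_Table2_Table1 is made up); it relies on SQL Server accepting a foreign key that references the unique index IX_Table1_ID on Table1.ID.
ALTER TABLE dbo.Table2 WITH CHECK
ADD CONSTRAINT FK_Table2_Table1 FOREIGN KEY (TABLE1_ID) REFERENCES dbo.Table1 (ID)
GO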
Re-creating the issue
The following script could allow you to re-create the database structure (please note it will insert 1M records into both tables):
CREATE TABLE [dbo].[Table1] (
[ID] [decimal](10, 0) IDENTITY(1,1) NOT NULL,
[VALUE] [nchar](10) NULL,
[TIMESTAMP] [datetime] NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Table1] ADD CONSTRAINT [DF_Table1_TIMESTAMP] DEFAULT (sysdatetime()) FOR [TIMESTAMP]
GO
CREATE UNIQUE CLUSTERED INDEX [IX_Table1_ID] ON [dbo].[Table1]
(
[ID] ASC
)
GO
CREATE NONCLUSTERED INDEX [IX_Table1_TIMESTAMP] ON [dbo].[Table1]
(
[TIMESTAMP] ASC
)
INCLUDE ([ID])
GO
CREATE TABLE [dbo].[Table2] (
[ID] [int] IDENTITY(1,1) NOT NULL,
[TABLE1_ID] [decimal](10, 0) NOT NULL,
[VALUE] [nchar](10) NULL
) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [IX_Table2_TABLE1_ID] ON [dbo].[Table2]
(
[TABLE1_ID] ASC
) INCLUDE ([VALUE])
GO
Declare @Id decimal(10,0) = 1
DECLARE @Now datetime = SYSDATETIME()
While @Id <= 1000000
Begin
Insert Into dbo.Table1 values ('T1_' + CAST(@Id as nvarchar(10)), DATEADD (ss, @Id, @Now))
Insert Into dbo.Table2 values (@Id, 'T2_' + CAST(@Id as nvarchar(10)))
Print @Id
Set @Id = @Id + 1
End
GO
Then you can try to run the following query:
DECLARE @P1 datetime
DECLARE @P2 datetime
SELECT
dbo.Table2.VALUE
FROM
dbo.Table2,
dbo.Table1
WHERE
dbo.Table1.ID = dbo.Table2.TABLE1_ID
AND dbo.Table1.TIMESTAMP between @P1 and @P2

My understanding was that the database engine would first filter on Table1 and then join with Table2,
Wrong. SQL is a descriptive language, not a procedural language. A SQL query describes the result set, not the methods used for creating it.
The SQL parser and optimizer are responsible for generating the execution plan. The only requirement is that the results from the execution plan match the results described by the query.
If you want to control the execution plan, then SQL Server offers hints, so you can require a nested loop join. In general, such hints are used to avoid nested loop joins.
Actually, your query is reading the index. This is a more efficient way of "filtering" the data than actually reading the data and filtering. This looks like an optimal execution plan.
Further, don't use commas in the FROM clause. Use proper, explicit, standard, readable JOIN syntax.
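For illustration, here is the same query rewritten with explicit JOIN syntax, plus the hint mechanism mentioned above shown as a commented-out option. This is only a sketch; whether forcing a nested loop join actually helps should be tested, not assumed.
DECLARE @P1 datetime
DECLARE @P2 datetime
SELECT t2.VALUE
FROM dbo.Table2 AS t2
INNER JOIN dbo.Table1 AS t1 ON t1.ID = t2.TABLE1_ID
WHERE t1.TIMESTAMP BETWEEN @P1 AND @P2
-- OPTION (LOOP JOIN) -- uncomment to request nested loops instead of the merge join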

Related

SQL MOVE Records to another table

I have a number of functions which MOVE records from one table to another (generally for a form of archiving the data) and wondered if there was a "best practice" for doing this, or a more efficient method than I am currently using.
At the moment, I am running something like:
INSERT INTO archive_table
SELECT [ROWID], [COL1], [COL2]
FROM live_table
WHERE <criteria>
DELETE FROM live_table
WHERE [ROWID] IN
(
SELECT [ROWID] FROM archive_table
)
This is also throwing up a warning in the SQL performance software that the query may cause index suppression and performance degradation, due to a SCAN being performed rather than a SEEK.
Worth adding that the archive_table is an exact copy of the live_table, with the exception that we have removed the identity and primary key off of the [ROWID] column and that this table is not used within the 'live' environment, other than having the old data inserted, as described.
[edit]
Would seem that the answer from Alex provides a really simple resolution to this; the comment about using a trigger doesn't resolve the issue in this instance because the event happens a number of days later and the criteria are dependent on events during that period.
DELETE
FROM live_table
OUTPUT DELETED.* INTO archive_table
WHERE <criteria>
If you have to move a large number of records from one table to another, I suggest you look into partitioning your "active table". Each time, you copy data from one (or more) partitions to the "archive table" and drop those partitions. It will be much faster than deleting records from an "online" table.
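As a rough illustration of that idea, partition switching is a metadata-only operation. The sketch below assumes live_table and archive_table are built on the same partition scheme with identical columns and indexes, and the partition number 1 is just a placeholder.
-- moves every row in partition 1 of live_table into the matching partition of archive_table
ALTER TABLE live_table SWITCH PARTITION 1 TO archive_table PARTITION 1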
Worth adding that the archive_table is an exact copy of the live_table, with the exception that we have removed the identity and primary key off of the [ROWID] column and that this table is not used within the 'live' environment, other than having the old data inserted, as described.
I can't tell whether the reason you are removing the primary key from the archive_table is that you expect the ROWIDs to be re-used in the live_table or not.
If I'm understanding the context of your data correctly, and you want to archive days after the data is completed, you can improve the performance of the query by reducing or eliminating the comparison of rows that no longer exist in the live_table. Basically, once a ROWID has migrated from live_table to archive_table, there is no reason to look for it again.
Note: This assumes that ROWIDs are not re-used in the live_table and are always increasing numbers.
INSERT INTO archive_table
SELECT [ROWID], [COL1], [COL2]
FROM live_table
WHERE <criteria>
DELETE FROM live_table
WHERE [ROWID] IN
(
SELECT [ROWID] FROM archive_table WHERE [ROWID] >= (SELECT MIN(ROWID) FROM live_table)
)
If ROWIDs are re-used: if you have a datetime field in your data set that is close to when the record was live or archived, it can be used as an alternative to the ROWID. This would mean you are only looking for recently archived rows to delete from the live_table, instead of the entire set. Also, making [somedate] the clustered index on the archive_table could improve performance, as the data would be physically ordered so that you are only looking at the tail of the table (see the index sketch after the code below).
INSERT INTO archive_table
SELECT [ROWID], [COL1], [COL2]
FROM live_table
WHERE <criteria>
DELETE FROM live_table
WHERE [ROWID] IN
(
SELECT [ROWID] FROM archive_table WHERE [somedate] >= DATEADD(dy,-30,GETDATE())
)
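A minimal sketch of the clustered index suggestion from above, assuming archive_table has the [somedate] column described and no existing clustered index (the index name is made up):
CREATE CLUSTERED INDEX IX_archive_somedate ON archive_table ([somedate])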
Your code snippet does not include a named transaction, which MUST be the first consideration. Second, design a table variable, temp table, or hard table to use for staging. The staging table should include a column identical in data type to the identity column from your source table, and that column should be indexed. Third, design your T-SQL to populate the staging table, copy rows from the source table to the destination table based on a join between source and staging, then remove rows from the source table based on the same join that moved data to the destination table. Below is a working sample.
--test setup below
DECLARE @live_table table (rowid int identity (1,1) primary key clustered, col1 varchar(1), col2 varchar(2))
DECLARE @archive_table table (rowid int, col1 varchar(1), col2 varchar(2))
Insert @live_table (col1, col2)
Values
('a','a'),
('a','a'),
('a','a'),
('a','a'),
('b','b')
--test setup above
BEGIN Transaction MoveData
DECLARE @Staging table (ROWID int primary Key)
Insert @Staging
SELECT lt.rowid
FROM @live_table as lt
WHERE lt.col1 = 'a'
INSERT INTO @archive_table
select lt.rowid, lt.col1, lt.col2
FROM @live_table as lt
inner join @Staging as s on lt.rowid = s.ROWID
DELETE lt
FROM @live_table as lt
inner join @Staging as s on lt.rowid = s.ROWID
COMMIT Transaction MoveData
select * from @live_table
select * from @archive_table
select * from @Staging
You can use triggers to replace any CRUD command.
Add a trigger to your table for the after-delete operation:
CREATE TRIGGER TRG_MoveToDeletedStaff ON Staff
after delete
as
DECLARE
@staffID int,
@staffTC varchar(11),
@staffName varchar(50),
@staffSurname varchar(50),
@staffbirthDate date,
@staffGSM varchar(11),
@staffstartDate date,
@staffEndDate date,
@staffCategory int,
@staffUsername varchar(20),
@staffPassword varchar(20),
@staffState int,
@staffEmail varchar(50)
set @staffEndDate=GETDATE()
SET @staffState=0
SELECT @staffID=staffID,
@staffTC=staffTC,
@staffName=staffName,
@staffSurname=staffSurname,
@staffbirthDate=staffbirthDate,
@staffGSM=staffGSM,
@staffstartDate=staffstartDate,
@staffCategory=staffCategory,
@staffUsername=staffuserName,
@staffPassword=staffPassword,
@staffEmail=staffeMail
from deleted
INSERT INTO DeletedStaff values
(
@staffID,
@staffTC,
@staffName,
@staffSurname,
@staffbirthDate,
@staffGSM,
@staffstartDate,
@staffEndDate,
@staffCategory,
@staffUsername,
@staffPassword,
@staffState,
@staffEmail
)

Optimize a stored procedure that accepts table parameters as filters against a view

I'm looking for an efficient way to filter a view with optional table parameters.
Examples are best so here is a sample situation:
-- database would contain a view that I want to be able to filter
CREATE VIEW [dbo].[MyView]
AS
-- maybe 20-40 columns
SELECT Column1, Column2, Column3, ...
I have user-defined table types like so:
-- single id table for joining purposes (passed from code)
CREATE TYPE [dbo].[SingleIdTable] AS TABLE (
[Id] INT NOT NULL,
PRIMARY KEY CLUSTERED ([Id] ASC) WITH (IGNORE_DUP_KEY = OFF));
-- double id table for joining purposes (passed from code)
CREATE TYPE [dbo].[DoubleIdTable] AS TABLE (
[Id1] INT NOT NULL,
[Id2] INT NOT NULL,
PRIMARY KEY CLUSTERED ([Id1] ASC, [Id2] ASC) WITH (IGNORE_DUP_KEY = OFF));
And I want to create a stored procedure that basically looks like this:
CREATE PROCEDURE [dbo].[FilterMyView]
@Parameter1 dbo.SingleIdTable READONLY,
@Parameter2 dbo.DoubleIdTable READONLY,
@Parameter3 dbo.SingleIdTable READONLY
AS
BEGIN
SELECT *
FROM MyView
INNER-JOIN-IF-NOT-EMPTY @Parameter1 p1 ON p1.Id = MyView.Column1 AND
INNER-JOIN-IF-NOT-EMPTY @Parameter2 p2 ON p2.Id1 = MyView.Column5 AND
p2.Id2 = MyView.Column6 AND
INNER-JOIN-IF-NOT-EMPTY @Parameter3 p3 ON p3.Id = MyView.Column8
END
Now I believe I can do this with WHERE EXISTS but I want to make sure that I am doing this in the most efficient way for the SQL engine. I've always personally felt that the INNER JOIN semantic creates the most optimized execution plans, but I don't actually know.
I also know that I can do this using dynamic SQL, but I always leave this as a last option.
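For illustration, the WHERE EXISTS approach mentioned above might look roughly like the sketch below. It is untested and assumes that an empty table parameter should simply disable its filter.
SELECT v.*
FROM MyView AS v
WHERE (NOT EXISTS (SELECT 1 FROM @Parameter1)
OR EXISTS (SELECT 1 FROM @Parameter1 p1 WHERE p1.Id = v.Column1))
AND (NOT EXISTS (SELECT 1 FROM @Parameter2)
OR EXISTS (SELECT 1 FROM @Parameter2 p2 WHERE p2.Id1 = v.Column5 AND p2.Id2 = v.Column6))
AND (NOT EXISTS (SELECT 1 FROM @Parameter3)
OR EXISTS (SELECT 1 FROM @Parameter3 p3 WHERE p3.Id = v.Column8))
-- OPTION (RECOMPILE) is often worth adding to catch-all queries like this so the plan reflects the actual parameter contents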

SQL server query plan

I have 3 tables as listed below
CREATE TABLE dbo.RootTransaction
(
TransactionID int CONSTRAINT [PK_RootTransaction] PRIMARY KEY NONCLUSTERED (TransactionID ASC)
)
GO
----------------------------------------------------------------------------------------------------
CREATE TABLE [dbo].[OrderDetails](
[OrderID] int identity(1,1) not null,
TransactionID int,
OrderDate datetime,
[Status] varchar(50)
CONSTRAINT [PK_OrderDetails] PRIMARY KEY CLUSTERED ([OrderID] ASC),
CONSTRAINT [FK_TransactionID] FOREIGN KEY ([TransactionID]) REFERENCES [dbo].[RootTransaction] ([TransactionID]),
) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [ix_OrderDetails_TransactionID]
ON [dbo].[OrderDetails](TransactionID ASC, [OrderID] ASC);
GO
----------------------------------------------------------------------------------------------------
CREATE TABLE dbo.OrderItems
(
ItemID int identity(1,1) not null,
[OrderID] int,
[Name] VARCHAR (50) NOT NULL,
[Code] VARCHAR (9) NULL,
CONSTRAINT [PK_OrderItems] PRIMARY KEY NONCLUSTERED ([ItemID] ASC),
CONSTRAINT [FK_OrderID] FOREIGN KEY ([OrderID]) REFERENCES [dbo].[OrderDetails] ([OrderID])
)
Go
CREATE CLUSTERED INDEX OrderItems
ON [dbo].OrderItems([OrderID] ASC, ItemID ASC) WITH (FILLFACTOR = 90);
GO
CREATE NONCLUSTERED INDEX [IX_Code]
ON [dbo].[OrderItems]([Code] ASC) WITH (FILLFACTOR = 90)
----------------------------------------------------------------------------------------------------
Populated sample data in each table
select COUNT(*) from RootTransaction -- 45851
select COUNT(*) from [OrderDetails] -- 50201
select COUNT(*) from OrderItems --63850
-- Query 1
SELECT o.TransactionID
FROM [OrderDetails] o
JOIN dbo.OrderItems i ON o.OrderID = i.OrderID
WHERE i.Code like '1067461841%'
declare @SearchKeyword varchar(200) = '1067461841'
-- Query 2
SELECT o.TransactionID
FROM [OrderDetails] o
JOIN dbo.OrderItems i ON o.OrderID = i.OrderID
WHERE i.Code like @SearchKeyword + '%'
When running the above 2 queries, I could see Query 1 use an index seek on OrderDetails and OrderItems, which is expected.
However, in Query 2 the query plan uses an index seek on OrderItems but an index scan on OrderDetails.
The only difference between the two queries is using a direct value vs. a variable in LIKE, and both return the same result.
Why does the query execution plan change between using a direct value and a variable?
I believe the issue is most likely explained through parameter sniffing. SQL Server often identifies and caches query plans for commonly used queries. As part of this caching, it "sniffs" the parameters you use on the most common queries to optimize the creation of the plan.
Query 1 shows a direct string, so SQL creates a specific plan. Query 2 uses an intermediate variable, which is one of the techniques that actually prevents parameter sniffing (it is often used to provide more predictable performance for stored procs or queries where the parameters have significant variance). These are considered 2 completely different queries by SQL despite the obvious similarities. The observed differences are essentially just optimization.
Furthermore, if your tables had different distributions of row counts, you'd likely see differences between those 2 scenarios based on existing indexes and potential optimizations. On my server with no sample data loaded, Query 1 and Query 2 had the same execution plans, since the optimizer couldn't find any better paths for the parameters.
For more info: http://blogs.technet.com/b/mdegre/archive/2012/03/19/what-is-parameter-sniffing.aspx
The queries below show a similar plan even though the WHERE clause is different.
select Code from OrderItems WHERE Code like '6662225%'
declare @SearchKeyword varchar(200) = '6662225'
select Code from OrderItems WHERE Code like @SearchKeyword + '%'
The following post/answers offer a good explanation as to why performance is better with hard coded constants than variables, along with a few suggestions you could possibly try out:
Alternative to using local variables in a where clause
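One option commonly suggested for this pattern is OPTION (RECOMPILE), which lets the optimizer see the variable's actual value at compile time. Here is a minimal sketch against the tables above; recompiling on every execution has its own cost, so treat it as something to test rather than a definitive fix.
declare @SearchKeyword varchar(200) = '1067461841'
SELECT o.TransactionID
FROM [OrderDetails] o
JOIN dbo.OrderItems i ON o.OrderID = i.OrderID
WHERE i.Code like @SearchKeyword + '%'
OPTION (RECOMPILE) -- plan is compiled using the sniffed value of @SearchKeyword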

Derived table with an index

Please see the TSQL below:
DECLARE @TestTable table (reference int identity,
TestField varchar(10),
primary key (reference))
INSERT INTO @TestTable VALUES ('Ian')
select * from @TestTable as TestTable
INNER JOIN LiveTable on LiveTable.Reference=TestTable.Reference
Is it possible to create an index on @TestTable.TestField? The following webpage suggests it is not. However, I read on another webpage that it is possible.
I know I could create a physical table instead (for #TestTable). However, I want to see if I can do this with a derived table first.
You can create an index on a table variable as described in the top voted answer on this question:
SQL Server : Creating an index on a table variable
Sample syntax from that post:
DECLARE @TEMPTABLE TABLE (
[ID] [INT] NOT NULL PRIMARY KEY,
[Name] [NVARCHAR] (255) COLLATE DATABASE_DEFAULT NULL,
UNIQUE NONCLUSTERED ([Name], [ID])
)
Alternately, you may want to consider using a temp table, which persists for the scope of the current operation (i.e., during execution of a stored procedure), exactly like a table variable. Temp tables are structured and optimized just like regular tables, but they are stored in tempdb, so they can be indexed in the same way as a regular table.
Temp tables will generally offer better performance than table variables, but it's worth testing with your dataset.
More in depth details can be found here:
When should I use a table variable vs temporary table in sql server?
You can see a sample of creating a temp table with an index from:
SQL Server Planet - Create Index on Temp Table
One of the most valuable assets of a temp table (#temp) is the ability to add either a clustered or non-clustered index. Additionally, #temp tables allow for auto-generated statistics to be created against them. This can help the optimizer when determining cardinality. Below is an example of creating both a clustered and a non-clustered index on a temp table.
Sample code from site:
CREATE TABLE #Users
(
ID INT IDENTITY(1,1),
UserID INT,
UserName VARCHAR(50)
)
INSERT INTO #Users
(
UserID,
UserName
)
SELECT
UserID = u.UserID
,UserName = u.UserName
FROM dbo.Users u
CREATE CLUSTERED INDEX IDX_C_Users_UserID ON #Users(UserID)
CREATE INDEX IDX_Users_UserName ON #Users(UserName)

Query against 250k rows taking 53 seconds

The box this query is running on is a dedicated server running in a datacenter.
AMD Opteron 1354 Quad-Core 2.20GHz
2GB of RAM
Windows Server 2008 x64 (Yes I know I only have 2GB of RAM, I'm upgrading to 8GB when the project goes live).
So I went through and created 250,000 dummy rows in a table to really stress test some queries that LINQ to SQL generates and make sure they're not too terrible, and I noticed one of them was taking an absurd amount of time.
I had this query down to 17 seconds with indexes, but I removed them for the sake of this post to go from start to finish. The only indexes are the primary keys.
Stories table --
[ID] [int] IDENTITY(1,1) NOT NULL,
[UserID] [int] NOT NULL,
[CategoryID] [int] NOT NULL,
[VoteCount] [int] NOT NULL,
[CommentCount] [int] NOT NULL,
[Title] [nvarchar](96) NOT NULL,
[Description] [nvarchar](1024) NOT NULL,
[CreatedAt] [datetime] NOT NULL,
[UniqueName] [nvarchar](96) NOT NULL,
[Url] [nvarchar](512) NOT NULL,
[LastActivityAt] [datetime] NOT NULL,
Categories table --
[ID] [int] IDENTITY(1,1) NOT NULL,
[ShortName] [nvarchar](8) NOT NULL,
[Name] [nvarchar](64) NOT NULL,
Users table --
[ID] [int] IDENTITY(1,1) NOT NULL,
[Username] [nvarchar](32) NOT NULL,
[Password] [nvarchar](64) NOT NULL,
[Email] [nvarchar](320) NOT NULL,
[CreatedAt] [datetime] NOT NULL,
[LastActivityAt] [datetime] NOT NULL,
Currently the database contains 1 user, 1 category and 250,000 stories, and I tried to run this query.
SELECT TOP(10) *
FROM Stories
INNER JOIN Categories ON Categories.ID = Stories.CategoryID
INNER JOIN Users ON Users.ID = Stories.UserID
ORDER BY Stories.LastActivityAt
The query takes 52 seconds to run, CPU usage hovers at 2-3%, memory is at 1.1GB with 900MB free, but the disk usage seems out of control. It's at 100MB/sec, with 2/3 of that being writes to tempdb.mdf and the rest being reads from tempdb.mdf.
Now for the interesting part...
SELECT TOP(10) *
FROM Stories
INNER JOIN Categories ON Categories.ID = Stories.CategoryID
INNER JOIN Users ON Users.ID = Stories.UserID
SELECT TOP(10) *
FROM Stories
INNER JOIN Users ON Users.ID = Stories.UserID
ORDER BY Stories.LastActivityAt
SELECT TOP(10) *
FROM Stories
INNER JOIN Categories ON Categories.ID = Stories.CategoryID
ORDER BY Stories.LastActivityAt
All 3 of these queries are pretty much instant.
Exec plan for first query.
http://i43.tinypic.com/xp6gi1.png
Exec plans for other 3 queries (in order).
http://i43.tinypic.com/30124bp.png
http://i44.tinypic.com/13yjml1.png
http://i43.tinypic.com/33ue7fb.png
Any help would be much appreciated.
Exec plan after adding indexes (down to 17 seconds again).
http://i39.tinypic.com/2008ytx.png
I've gotten a lot of helpful feedback from everyone, and I thank you. I tried a new angle on this: I query the stories I need, then in separate queries get the Categories and Users, and with 3 queries it only took 250ms... I don't understand the issue, but if it works, and at 250ms no less, I'll stick with that for the time being. Here's the code I used to test this.
DBDataContext db = new DBDataContext();
Console.ReadLine();
Stopwatch sw = Stopwatch.StartNew();
var stories = db.Stories.OrderBy(s => s.LastActivityAt).Take(10).ToList();
var storyIDs = stories.Select(c => c.ID);
var categories = db.Categories.Where(c => storyIDs.Contains(c.ID)).ToList();
var users = db.Users.Where(u => storyIDs.Contains(u.ID)).ToList();
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
Try adding an index on Stories.LastActivityAt. I think the clustered index scan in the execution plan may be due to the sorting.
Edit:
Since my query returned in an instant with rows just a few bytes long, but has been running for 5 minutes already and is still going after I added a 2K varchar, I think Mitch has a point. It is the volume of that data that is shuffled around for nothing, but this can be fixed in the query.
Try putting the join, sort and top(10) in a view or in a nested query, and then join back against the story table to get the rest of the data just for the 10 rows that you need.
Like this:
select * from
(
SELECT TOP(10) id, categoryID, userID
FROM Stories
ORDER BY Stories.LastActivityAt
) s
INNER JOIN Stories ON Stories.ID = s.id
INNER JOIN Categories ON Categories.ID = s.CategoryID
INNER JOIN Users ON Users.ID = s.UserID
If you have an index on LastActivityAt, this should run very fast.
So if I read the first part correctly, it responds in 17 seconds with an index, which is still a while to chug out 10 records. I'm thinking the time is in the ORDER BY clause. I would want an index on LastActivityAt, UserID, CategoryID. Just for fun, remove the ORDER BY and see if it returns the 10 records quickly. If it does, then you know the problem is not in the joins to the other tables. Also, it would be helpful to replace the * with the columns needed, as the columns from all 3 tables end up in tempdb while you are sorting, as Neil mentioned.
Looking at the execution plans you'll notice the extra sort; I believe that is the ORDER BY, which is going to take some time. I'm assuming you had an index on those 3 columns when it was 17 seconds... so you may want one index for the join criteria (UserID, CategoryID) and another for LastActivityAt; see if that performs better. Also, it would be good to run the query through the index tuning wizard.
My first suggestion is to remove the *, and replace it with the minimum columns you need.
second, is there a trigger involved? Something that would update the LastActivityAt field?
Based on your problem query, try adding a combination index on the Stories table (CategoryID, UserID, LastActivityAt), as sketched below.
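A minimal sketch of that suggestion (the index name is just a placeholder):
CREATE NONCLUSTERED INDEX IX_Stories_CategoryID_UserID_LastActivityAt
ON Stories (CategoryID, UserID, LastActivityAt)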
You are maxing out the Disks in your hardware setup.
Given your comments about your Data/Log/tempDB File placement, I think any amount of tuning is going to be a bandaid.
250,000 Rows is small. Imagine how bad your problems are going to be with 10 million rows.
I suggest you move tempDB onto its own physical drive (preferably a RAID 0).
Ok, so my test machine isn't fast. Actually it's really slow: 1.6 GHz, 1 GB of RAM, no multiple disks, just a single (read: slow) disk for SQL Server, the OS, and extras.
I created your tables with primary and foreign keys defined.
Inserted 2 categories, 500 random users, and 250000 random stories.
Running the first query above takes 16 seconds (no plan cache either).
If I index the LastActivityAt column I get results in under a second (no plan cache here either).
Here's the script I used to do all of this.
--Categories table --
Create table Categories (
[ID] [int] IDENTITY(1,1) primary key NOT NULL,
[ShortName] [nvarchar](8) NOT NULL,
[Name] [nvarchar](64) NOT NULL)
--Users table --
Create table Users(
[ID] [int] IDENTITY(1,1) primary key NOT NULL,
[Username] [nvarchar](32) NOT NULL,
[Password] [nvarchar](64) NOT NULL,
[Email] [nvarchar](320) NOT NULL,
[CreatedAt] [datetime] NOT NULL,
[LastActivityAt] [datetime] NOT NULL
)
go
-- Stories table --
Create table Stories(
[ID] [int] IDENTITY(1,1) primary key NOT NULL,
[UserID] [int] NOT NULL references Users ,
[CategoryID] [int] NOT NULL references Categories,
[VoteCount] [int] NOT NULL,
[CommentCount] [int] NOT NULL,
[Title] [nvarchar](96) NOT NULL,
[Description] [nvarchar](1024) NOT NULL,
[CreatedAt] [datetime] NOT NULL,
[UniqueName] [nvarchar](96) NOT NULL,
[Url] [nvarchar](512) NOT NULL,
[LastActivityAt] [datetime] NOT NULL)
Insert into Categories (ShortName, Name)
Values ('cat1', 'Test Category One')
Insert into Categories (ShortName, Name)
Values ('cat2', 'Test Category Two')
--Dummy Users
Insert into Users
Select top 500
UserName=left(SO.name+SC.name, 32)
, Password=left(reverse(SC.name+SO.name), 64)
, Email=Left(SO.name, 128)+'@'+left(SC.name, 123)+'.com'
, CreatedAt='1899-12-31'
, LastActivityAt=GETDATE()
from sysobjects SO
Inner Join syscolumns SC on SO.id=SC.id
go
--dummy stories!
-- A Count is given every 10000 record inserts (could be faster)
-- RBAR method!
set nocount on
Declare @count as bigint
Set @count = 0
begin transaction
while @count<=250000
begin
Insert into Stories
Select
USERID=floor(((500 + 1) - 1) * RAND() + 1)
, CategoryID=floor(((2 + 1) - 1) * RAND() + 1)
, votecount=floor(((10 + 1) - 1) * RAND() + 1)
, commentcount=floor(((8 + 1) - 1) * RAND() + 1)
, Title=Cast(NEWID() as VARCHAR(36))+Cast(NEWID() as VARCHAR(36))
, Description=Cast(NEWID() as VARCHAR(36))+Cast(NEWID() as VARCHAR(36))+Cast(NEWID() as VARCHAR(36))
, CreatedAt='1899-12-31'
, UniqueName=Cast(NEWID() as VARCHAR(36))+Cast(NEWID() as VARCHAR(36))
, Url=Cast(NEWID() as VARCHAR(36))+Cast(NEWID() as VARCHAR(36))
, LastActivityAt=Dateadd(day, -floor(((600 + 1) - 1) * RAND() + 1), GETDATE())
If @count % 10000=0
Begin
Print @count
Commit
begin transaction
End
Set @count=@count+1
end
set nocount off
go
--returns in 16 seconds
DBCC DROPCLEANBUFFERS
SELECT TOP(10) *
FROM Stories
INNER JOIN Categories ON Categories.ID = Stories.CategoryID
INNER JOIN Users ON Users.ID = Stories.UserID
ORDER BY Stories.LastActivityAt
go
--Now create an index
Create index IX_LastADate on Stories (LastActivityAt asc)
go
--With an index returns in less than a second
DBCC DROPCLEANBUFFERS
SELECT TOP(10) *
FROM Stories
INNER JOIN Categories ON Categories.ID = Stories.CategoryID
INNER JOIN Users ON Users.ID = Stories.UserID
ORDER BY Stories.LastActivityAt
go
The sort is definitely where your slowdown is occurring.
Sorting mainly gets done in tempdb, and a large table will cause LOTS to be added.
Having an index on this column will definitely improve performance on an ORDER BY.
Also, defining your primary and foreign keys helps SQL Server immensely.
The method listed in your code is elegant, and basically the same response that cdonner wrote, except in C# rather than SQL. Tuning the DB will probably give even better results!
--Kris
Have you cleared the SQL Server cache before running each of the queries?
In SQL 2000, it's something like DBCC DROPCLEANBUFFERS. Google the command for more info.
Looking at the query, I would have an index for
Categories.ID
Stories.CategoryID
Users.ID
Stories.UserID
and possibly
Stories.LastActivityAt
But yeah, sounds like the result could be bogus 'cos of caching.
When you have worked with SQL Server for some time, you will discover that even the smallest changes to a query can cause wildly different response times. From what I have read in the initial question, and looking at the query plan, I suspect that the optimizer has decided that the best approach is to form a partial result and then sort that as a separate step. The partial result is a composite of the Users and Stories tables. This is formed in tempdb. So the excessive disk access is due to the forming and then sorting of this temporary table.
I concur that the solution should be to create a compound index on Stories.LastActivityAt, Stories.UserId, Stories.CategoryId. The order is VERY important; the field LastActivityAt must be first.
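A minimal sketch of that compound index, with LastActivityAt leading as recommended (the index name is a placeholder):
CREATE NONCLUSTERED INDEX IX_Stories_LastActivityAt_UserID_CategoryID
ON Stories (LastActivityAt, UserID, CategoryID)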