Query against 250k rows taking 53 seconds - sql

The box this query is running on is a dedicated server running in a datacenter.
AMD Opteron 1354 Quad-Core 2.20GHz
2GB of RAM
Windows Server 2008 x64 (Yes I know I only have 2GB of RAM, I'm upgrading to 8GB when the project goes live).
So I went through and created 250,000 dummy rows in a table to really stress test some queries that LINQ to SQL generates and make sure they're not to terrible and I noticed one of them was taking an absurd amount of time.
I had this query down to 17 seconds with indexes but I removed them for the sake of this answer to go from start to finish. Only indexes are Primary Keys.
Stories table --
[ID] [int] IDENTITY(1,1) NOT NULL,
[UserID] [int] NOT NULL,
[CategoryID] [int] NOT NULL,
[VoteCount] [int] NOT NULL,
[CommentCount] [int] NOT NULL,
[Title] [nvarchar](96) NOT NULL,
[Description] [nvarchar](1024) NOT NULL,
[CreatedAt] [datetime] NOT NULL,
[UniqueName] [nvarchar](96) NOT NULL,
[Url] [nvarchar](512) NOT NULL,
[LastActivityAt] [datetime] NOT NULL,
Categories table --
[ID] [int] IDENTITY(1,1) NOT NULL,
[ShortName] [nvarchar](8) NOT NULL,
[Name] [nvarchar](64) NOT NULL,
Users table --
[ID] [int] IDENTITY(1,1) NOT NULL,
[Username] [nvarchar](32) NOT NULL,
[Password] [nvarchar](64) NOT NULL,
[Email] [nvarchar](320) NOT NULL,
[CreatedAt] [datetime] NOT NULL,
[LastActivityAt] [datetime] NOT NULL,
Currently in the database there is 1 user, 1 category and 250,000 stories and I tried to run this query.
SELECT TOP(10) *
FROM Stories
INNER JOIN Categories ON Categories.ID = Stories.CategoryID
INNER JOIN Users ON Users.ID = Stories.UserID
ORDER BY Stories.LastActivityAt
Query takes 52 seconds to run, CPU usage hovers at 2-3%, Membery is 1.1GB, 900MB free but the Disk usage seems out of control. It's # 100MB/sec with 2/3 of that being writes to tempdb.mdf and the rest is reading from tempdb.mdf.
Now for the interesting part...
SELECT TOP(10) *
FROM Stories
INNER JOIN Categories ON Categories.ID = Stories.CategoryID
INNER JOIN Users ON Users.ID = Stories.UserID
SELECT TOP(10) *
FROM Stories
INNER JOIN Users ON Users.ID = Stories.UserID
ORDER BY Stories.LastActivityAt
SELECT TOP(10) *
FROM Stories
INNER JOIN Categories ON Categories.ID = Stories.CategoryID
ORDER BY Stories.LastActivityAt
All 3 of these queries are pretty much instant.
Exec plan for first query.
http://i43.tinypic.com/xp6gi1.png
Exec plans for other 3 queries (in order).
http://i43.tinypic.com/30124bp.png
http://i44.tinypic.com/13yjml1.png
http://i43.tinypic.com/33ue7fb.png
Any help would be much appreciated.
Exec plan after adding indexes (down to 17 seconds again).
http://i39.tinypic.com/2008ytx.png
I've gotten a lot of helpful feedback from everyone and I thank you, I tried a new angle at this. I query the stories I need, then in separate queries get the Categories and Users and with 3 queries it only took 250ms... I don't understand the issue but if it works and at 250ms no less for the time being I'll stick with that. Here's the code I used to test this.
DBDataContext db = new DBDataContext();
Console.ReadLine();
Stopwatch sw = Stopwatch.StartNew();
var stories = db.Stories.OrderBy(s => s.LastActivityAt).Take(10).ToList();
var storyIDs = stories.Select(c => c.ID);
var categories = db.Categories.Where(c => storyIDs.Contains(c.ID)).ToList();
var users = db.Users.Where(u => storyIDs.Contains(u.ID)).ToList();
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);

Try adding an index on Stories.LastActivityAt. I think the clustered index scan in the execution plan may be due to the sorting.
Edit:
Since my query returned in an instant with rows just a few bytes long, but has been running for 5 minutes already and is still going after I added a 2K varchar, I think Mitch has a point. It is the volume of that data that is shuffled around for nothing, but this can be fixed in the query.
Try putting the join, sort and top(10) in a view or in a nested query, and then join back against the story table to get the rest of the data just for the 10 rows that you need.
Like this:
select * from
(
SELECT TOP(10) id, categoryID, userID
FROM Stories
ORDER BY Stories.LastActivityAt
) s
INNER JOIN Stories ON Stories.ID = s.id
INNER JOIN Categories ON Categories.ID = s.CategoryID
INNER JOIN Users ON Users.ID = s.UserID
If you have an index on LastActivityAt, this should run very fast.

So if I read the first part correctly, it responds in 17 seconds with an index. Which is still a while to chug out 10 records. I'm thinking the time is in the order by clause. I would want an index on LastActivityAt, UserID, CategoryID. Just for fun, remove the order by and see if it returns the 10 records quickly. If it does, then you know it is not in the joins to the other tables. Also it would be helpful to replace the * with the columns needed as all 3 table columns are in the tempdb as you are sorting - as Neil mentioned.
Looking at the execution plans you'll notice the extra sort - I believe that is the order by which is going to take some time. I'm assuming you had an index with the 3 and it was 17 seconds... so you may want one index for the join criteria (userid, categoryID) and another for lastactivityat - see if that performs better. Also it would be good to run the query thru the index tuning wizard.

My first suggestion is to remove the *, and replace it with the minimum columns you need.
second, is there a trigger involved? Something that would update the LastActivityAt field?

Based on your problem query, try add a combination index on table Stories (CategoryID, UserID, LastActivityAt)

You are maxing out the Disks in your hardware setup.
Given your comments about your Data/Log/tempDB File placement, I think any amount of tuning is going to be a bandaid.
250,000 Rows is small. Imagine how bad your problems are going to be with 10 million rows.
I suggest you move tempDB onto its own physical drive (preferable a RAID 0).

Ok, so my test machine isn't fast. Actually it's really slow. It 1.6 ghz,n 1 gb of ram, No multiple disks, just a single (read slow) disk for sql server, os, and extras.
I created your tables with primary and foreign keys defined.
Inserted 2 categories, 500 random users, and 250000 random stories.
Running the first query above takes 16 seconds (no plan cache either).
If I index the LastActivityAt column I get results in under a second (no plan cache here either).
Here's the script I used to do all of this.
--Categories table --
Create table Categories (
[ID] [int] IDENTITY(1,1) primary key NOT NULL,
[ShortName] [nvarchar](8) NOT NULL,
[Name] [nvarchar](64) NOT NULL)
--Users table --
Create table Users(
[ID] [int] IDENTITY(1,1) primary key NOT NULL,
[Username] [nvarchar](32) NOT NULL,
[Password] [nvarchar](64) NOT NULL,
[Email] [nvarchar](320) NOT NULL,
[CreatedAt] [datetime] NOT NULL,
[LastActivityAt] [datetime] NOT NULL
)
go
-- Stories table --
Create table Stories(
[ID] [int] IDENTITY(1,1) primary key NOT NULL,
[UserID] [int] NOT NULL references Users ,
[CategoryID] [int] NOT NULL references Categories,
[VoteCount] [int] NOT NULL,
[CommentCount] [int] NOT NULL,
[Title] [nvarchar](96) NOT NULL,
[Description] [nvarchar](1024) NOT NULL,
[CreatedAt] [datetime] NOT NULL,
[UniqueName] [nvarchar](96) NOT NULL,
[Url] [nvarchar](512) NOT NULL,
[LastActivityAt] [datetime] NOT NULL)
Insert into Categories (ShortName, Name)
Values ('cat1', 'Test Category One')
Insert into Categories (ShortName, Name)
Values ('cat2', 'Test Category Two')
--Dummy Users
Insert into Users
Select top 500
UserName=left(SO.name+SC.name, 32)
, Password=left(reverse(SC.name+SO.name), 64)
, Email=Left(SO.name, 128)+'#'+left(SC.name, 123)+'.com'
, CreatedAt='1899-12-31'
, LastActivityAt=GETDATE()
from sysobjects SO
Inner Join syscolumns SC on SO.id=SC.id
go
--dummy stories!
-- A Count is given every 10000 record inserts (could be faster)
-- RBAR method!
set nocount on
Declare #count as bigint
Set #count = 0
begin transaction
while #count<=250000
begin
Insert into Stories
Select
USERID=floor(((500 + 1) - 1) * RAND() + 1)
, CategoryID=floor(((2 + 1) - 1) * RAND() + 1)
, votecount=floor(((10 + 1) - 1) * RAND() + 1)
, commentcount=floor(((8 + 1) - 1) * RAND() + 1)
, Title=Cast(NEWID() as VARCHAR(36))+Cast(NEWID() as VARCHAR(36))
, Description=Cast(NEWID() as VARCHAR(36))+Cast(NEWID() as VARCHAR(36))+Cast(NEWID() as VARCHAR(36))
, CreatedAt='1899-12-31'
, UniqueName=Cast(NEWID() as VARCHAR(36))+Cast(NEWID() as VARCHAR(36))
, Url=Cast(NEWID() as VARCHAR(36))+Cast(NEWID() as VARCHAR(36))
, LastActivityAt=Dateadd(day, -floor(((600 + 1) - 1) * RAND() + 1), GETDATE())
If #count % 10000=0
Begin
Print #count
Commit
begin transaction
End
Set #count=#count+1
end
set nocount off
go
--returns in 16 seconds
DBCC DROPCLEANBUFFERS
SELECT TOP(10) *
FROM Stories
INNER JOIN Categories ON Categories.ID = Stories.CategoryID
INNER JOIN Users ON Users.ID = Stories.UserID
ORDER BY Stories.LastActivityAt
go
--Now create an index
Create index IX_LastADate on Stories (LastActivityAt asc)
go
--With an index returns in less than a second
DBCC DROPCLEANBUFFERS
SELECT TOP(10) *
FROM Stories
INNER JOIN Categories ON Categories.ID = Stories.CategoryID
INNER JOIN Users ON Users.ID = Stories.UserID
ORDER BY Stories.LastActivityAt
go
The sort is definitely where your slow down is occuring.
Sorting mainly gets done in the tempdb and a large table will cause LOTS to be added.
Having an index on this column will definitely improve performance on an order by.
Also, defining your Primary and Foreign Keys helps SQL Server immensly
Your method that is listed in your code is elegant, and basically the same response that cdonner wrote except in c# and not sql. Tuning the db will probably give even better results!
--Kris

Have you cleared the SQL Server cache before running each of the query?
In SQL 2000, it's something like DBCC DROPCLEANBUFFERS. Google the command for more info.
Looking at the query, I would have an index for
Categories.ID
Stories.CategoryID
Users.ID
Stories.UserID
and possibly
Stories.LastActivityAt
But yeah, sounds like the result could be bogus 'cos of caching.

When you have worked with SQL Server for some time, you will discover that even the smallest changes to a query can cause wildly different response times. From what I have read in the initial question, and looking at the query plan, I suspect that the optimizer has decided that the best approach is to form a partial result and then sort that as a separate step. The partial result is a composite of the Users and Stories tables. This is formed in tempdb. So the excessive disk access is due to the forming and then sorting of this temporary table.
I concur that the solution should be to create a compound index on Stories.LastActivityAt, Stories.UserId, Stories.CategoryId. The order is VERY important, the field LastActivityAt must be first.

Related

Adding primary constraint slows join performance on MS SQL Server 2016?

Main issue: adding primary key constraint to two tables leads to query times increasing from 1 min 30 to 40 minutes.
I have two tables with int identity keys:
Addresses
(
[ID] [int] NOT NULL,
...
)
Routes
(
[ID] [int] NOT NULL,
[OriginAddressID] [int] NOT NULL,
[DestinationAddressID] [int] NOT NULL,
...
)
I have a view which joins addresses and routes on their IDs:
select *
from Routes
inner join Addresses Origin on Routes.OriginAddressID = Origin.ID
inner join Addresses Destination on Routes.DestinationAddressID = Destination.ID
When I query all rows from this joined view, the execution takes 1 minutes 30 seconds with the primary key constraint removed from both tables.
When I add primary key on column ID, the execution time jumps up to 40 minutes.
Both tables contain 400.000 rows. I tested this multiple times, the results are always the same. I'm at loss on how to proceed, as the 40 minutes query time is unacceptable. Primary keys should not slow query performance, as far as I know?

Dealing with huge table - 100M+ rows

I have table with around 100 million rows and it is only getting larger, as table is queried pretty frequently I have to come up with some solution to optimise this.
Firstly here is the model:
CREATE TABLE [dbo].[TreningExercises](
[TreningExerciseId] [uniqueidentifier] NOT NULL,
[NumberOfRepsForExercise] [int] NOT NULL,
[CycleNumber] [int] NOT NULL,
[TreningId] [uniqueidentifier] NOT NULL,
[ExerciseId] [int] NOT NULL,
[RoutineExerciseId] [uniqueidentifier] NULL)
Here is Trening table:
CREATE TABLE [dbo].[Trenings](
[TreningId] [uniqueidentifier] NOT NULL,
[DateTimeWhenTreningCreated] [datetime] NOT NULL,
[Score] [int] NOT NULL,
[NumberOfFinishedCycles] [int] NOT NULL,
[PercentageOfCompleteness] [int] NOT NULL,
[IsFake] [bit] NOT NULL,
[IsPrivate] [bit] NOT NULL,
[UserId] [nvarchar](128) NOT NULL,
[AllRoutinesId] [bigint] NOT NULL,
[Name] [nvarchar](max) NULL,
)
Indexes (other than PK which are clustered):
TreningExercises:
TreningId (also FK)
ExerciseId (also FK)
Trenings:
UserId (also FK)
AllRoutinesId (also FK)
Score
DateTimeWhenTreningCreated (ordered by DateTimeWhenTreningCreated DESC)
And here is the example of the most commonly executed query:
DECLARE #userId VARCHAR(40)
,#exerciseId INT;
SELECT TOP (1) R.[TreningExerciseId] AS [TreningExerciseId]
,R.[NumberOfRepsForExercise] AS [NumberOfRepsForExercise]
,R.[TreningId] AS [TreningId]
,R.[ExerciseId] AS [ExerciseId]
,R.[RoutineExerciseId] AS [RoutineExerciseId]
,R.[DateTimeWhenTreningCreated] AS [DateTimeWhenTreningCreated]
FROM (
SELECT TE.[TreningExerciseId] AS [TreningExerciseId]
,TE.[NumberOfRepsForExercise] AS [NumberOfRepsForExercise]
,TE.[TreningId] AS [TreningId]
,TE.[ExerciseId] AS [ExerciseId]
,TE.[RoutineExerciseId] AS [RoutineExerciseId]
,T.[DateTimeWhenTreningCreated] AS [DateTimeWhenTreningCreated]
FROM [dbo].[TreningExercises] AS TE
INNER JOIN [dbo].[Trenings] AS T ON TE.[TreningId] = T.[TreningId]
WHERE (T.[UserId] = #userId)
AND (TE.[ExerciseId] = #exerciseId)
) AS R
ORDER BY R.[DateTimeWhenTreningCreated] DESC
Execution plan: link
Please accept my apologies if it is bit unreadable or unoptimised, it was generated by ORM (Entity Framework), I just edited it a bit.
According to Azure's SQL Analytics tool this query has the most impact on my DB and even though it usually doesn't take too long to execute, from time to time there are spikes in DB I/O due to it.
Also there is a bit business logic involved in this, to simplify it: 99% of the time I need data which is less then a year old.
What are my best options regarding querying and table size?
My thoughts on querying, either:
Create indexed view OR
Add Date and UserId fields to the TreningExerciseId table OR
Some option that I haven't thought of :)
Regarding table size, either:
Partition table (probably by date) OR
Move most of the data (or all of it) to some NoSQL key-value store OR
Some option that I haven't thought of :)
What are your thoughts about these problems, how should I approach solving them?
If you add the following columns to the index "ix_TreninID":
NoOfRepsForExecercise
ExerciseID
RoutineExerciseID
That will make the index a "covering index" and eliminate the need for the lookup which is taking 95% of the plan.
Give it a go, and post back.

Table cell to count rows in other table

I'm not entirely sure if I'm even going about this in the right manner.
MVC+EF site so i could do this in the controller but I would prefer it in the DB if possible.
I have two tables. One contains entries and one contains a list of members. I want the list of members to have a column that contains the count of how many times the member name appears in the list of entries. Can I do this in the definition of the table itself?
I know this query works:
select count(*)
from dbo.Entries
where dbo.Entries.AssignedTo = 'Bob Smith'
But is there any way of doing this? What is the correct syntax?
CREATE TABLE [dbo].[Members] (
[ID] INT IDENTITY (1, 1) NOT NULL,
[Name] NVARCHAR (100) NOT NULL,
[Email] NVARCHAR (500) NOT NULL,
[Count] INT = select count(*) from dbo.Entries where dbo.Entries.AssignedTo = [Name]
PRIMARY KEY CLUSTERED ([ID] ASC)
I've done some searching and have tried a few different syntax's but I'm completely lost at this point so if anyone can get me headed in the correct direction I would really appreciate it.
Thanks in advance.
You could create Members as a view combining both the Entries data and another table. (Warning: Syntax not tested)
CREATE TABLE [_member_data] (
[ID] INT IDENTITY(1, 1) NOT NULL,
[Name] NVARCHAR (100) NOT NULL,
[Email] NVARCHAR (500) NOT NULL,
PRIMARY KEY CLUSTERED ([ID] ASC)
);
CREATE VIEW [dbo].[Members] AS
SELECT ID, Name, Email, COUNT(*)
FROM _member_data JOIN dbo.Entries ON [Name] = [AssignedTo]
GROUP BY 1, 2, 3;
It is possible to take this even further with triggers/rules that rewrite attempted inserts into Members as inserts to the appropriate backing table. But to get the kind of expressive information that you are looking for, you really want to explore using a view.

Update Query with NVARCHAR(max) in SQL Server

Have an issue with SQL Server performance and wanted to see if anyone can give some tips about improving the performance of an update query.
What I'm doing is updating one table with data from another table. Here's some of the basics:
SQL Server 2008 R2
Data is pumped to WO table originally from other system (pumped in using datareader and sqlbulkcopy in ADO.NET)
Additional data is pumped to TEMP_REMARKS (pumped in using datareader and sqlbulkcopy in ADO.NET)
Unfortunately, combining the WO and REMARKS in the originating system (via the reader query) is not possible (mainly performance reasons)
Update to WO occurs using value from TEMP_REMARKS where two columns are updated
Note that the column being transferred from TEMP_REMARKS to REMARKS is a nvarchar(max) and is being placed into another nvarchar(max) column (actually two - see query)
WO has 4m+ records
TEMP_REMARKS has 7m+ records
For the join between the two, the following is what is being used:
/* === UPDATE THE DESCRIPTION */
UPDATE WO
SET WO_DESCRIPTION = TEMP_REMARKS.REMARKS
FROM WO
INNER JOIN TEMP_REMARKS ON WO.WO_DESCRIPTION_ID = TEMP_REMARKS.REMARKS_ID;
/* === UPDATE THE FINDINGS */
UPDATE WO
SET FINDINGS = TEMP_REMARKS.REMARKS
FROM WO
INNER JOIN TEMP_REMARKS ON WO.FINDINGS_ID = TEMP_REMARKS.REMARKS_ID;
The problem at this point is that the update to the WO table is taking over two hours to complete. I've tried using the MERGE statement with no success. I've got other more completed procedures in the db that don't take nearly as long, so I'm convinced that it is not the configuration of the SQL Server itself.
Is there something that should be done when updating nvarchar(max) columns?
What can be done to improve the performance of this query?
Here are the table definitions:
CREATE TABLE [dbo].[WO](
[DOCUMENT_ID] [decimal](18, 0) NOT NULL,
[WO_DESCRIPTION_ID] [decimal](18, 0) NULL,
[WO_DESCRIPTION] [nvarchar](max) NULL,
[FINDINGS_ID] [decimal](18, 0) NULL,
[FINDINGS] [nvarchar](max) NULL,
.... bunch of other fields
CONSTRAINT [PK_WO] PRIMARY KEY CLUSTERED
(
[DOCUMENT_ID] ASC
)
This is the table definition for the TEMP_REMARKS:
CREATE TABLE [dbo].[TEMP_REMARKS](
[REMARKS_ID] [decimal](18, 0) NOT NULL,
[REMARKS] [nvarchar](max) NULL
) ON [PRIMARY]
I think, first of all you should consider to create primary key on TEMP_REMARKS, or at least some index on REMARKS_ID

Slow SQL performance

I have a query as follows;
SELECT COUNT(Id) FROM Table
The table contains 33 million records - it contains a primary key on Id and no other indices.
The query takes 30 seconds.
The actual execution plan shows it uses a clustered index scan.
We have analysed the table and found it isn't fragmented using the first query shown in this link: http://sqlserverpedia.com/wiki/Index_Maintenance.
Any ideas as to why this query is so slow and how to fix it.
The Table Definition:
CREATE TABLE [dbo].[DbConversation](
[ConversationID] [int] IDENTITY(1,1) NOT NULL,
[ConversationGroupID] [int] NOT NULL,
[InsideIP] [uniqueidentifier] NOT NULL,
[OutsideIP] [uniqueidentifier] NOT NULL,
[ServerPort] [int] NOT NULL,
[BytesOutbound] [bigint] NOT NULL,
[BytesInbound] [bigint] NOT NULL,
[ServerOutside] [bit] NOT NULL,
[LastFlowTime] [datetime] NOT NULL,
[LastClientPort] [int] NOT NULL,
[Protocol] [tinyint] NOT NULL,
[TypeOfService] [tinyint] NOT NULL,
CONSTRAINT [PK_Conversation_1] PRIMARY KEY CLUSTERED
(
[ConversationID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
One thing I have noticed is the database is set to grow in 1Mb chunks.
It's a live system so we restricted in what we can play with - any ideas?
UPDATE:
OK - we've improved performance in the actual query of interest by adding new non-clustered indices on appropriate columns so it's not a critical issue anymore.
SELECT COUNT is still slow though - tried it with NOLOCK hints - no difference.
We're all thinking it's something to do with the Autogrowth set to 1Mb rather than a larger number, but surprised it has this effect. Can MDF fragmentation on the disk be a possible cause?
Is this a frequently read/inserted/updated table? Is there update/insert activity concurrent with your select?
My guess is the delay is due to contention.
I'm able to run a count on 189m rows in 17 seconds on my dev server, but there's nothing else hitting that table.
If you aren't too worried about contention or absolute accuracy you can do:
exec sp_spaceused 'MyTableName' which will give a count based on meta-data.
If you want a more exact count but don't necessarily care if it reflect concurrent DELETE or INSERT activity you can do your current query with a NOLOCK hint:
SELECT COUNT(id) FROM MyTable WITH (NOLOCK) which will not get row-level locks for your query and should run faster.
Thoughts:
Use SELECT COUNT(*) which is correct for "how many rows" (as per ANSI SQL). Even if ID is the PK and thus not nullable, SQL Server will count ID. Not rows.
If you can live with approximate counts, then use sys.dm_db_partition_stats. See my answer here: Fastest way to count exact number of rows in a very large table?
If you can live with dirty reads use WITH (NOLOCK)
use [DatabaseName]
select tbl.name, dd.rows from sysindexes dd
inner join sysobjects tbl on dd.id = tbl.id where dd.indid < 2 and tbl.xtype = 'U'
select sum(dd.rows)from sysindexes dd
inner join sysobjects tbl on dd.id = tbl.id where dd.indid < 2 and tbl.xtype = 'U'
By using these queries you can fetch all tables' count within 0-5 seconds
use where clause according to your requirement.....
Another idea: When the files grow with 1MB parts, it may be fragmented on the file system. You can't see this by SQL, you see it using a disk defragmentation tool.