SQL Query Slow When Ordering by a PK Column - sql

Why does this query take a long time (30+ seconds) to run?
A:
SELECT TOP 10 * FROM [Workflow] ORDER BY ID DESC
But this query is fast (0 seconds):
B:
SELECT TOP 10 * FROM [Workflow] ORDER BY ReadTime DESC
And this is fast (0 seconds), too:
C:
SELECT TOP 10 * FROM [Workflow] WHERE SubId = '120611250634'
I get why B and C are fast:
B:
C:
But I don’t get why A takes so long when we have this:
Edit: The estimated execution plan using ID:
Edit: The estimated execution plan using ReadTime:

Well, your primary key is for both ID (ASC) and ReadTime (ASC). The order is not important when you're only having a single column index, but it does matter when you have more columns in the index (a composite key).
Composite clustered keys are not really made for ordering. I'd expect that using
SELECT TOP 10 * FROM [Workflow] ORDER BY ID ASC
Will be rather fast, and the best would be
SELECT TOP 10 * FROM [Workflow] ORDER BY ID, ReadTime
Reversing the order is a tricky operation on a composite key.
So in effect, when you order by ReadTime, you have an index ready for that, and that index also knows the exact key of the row involved (both its Id and ReadTime - another good reason to keep the clustered index very narrow). It can look up all the columns rather easily. However, when you order by Id, you don't have an exact fit of an index. The server doesn't trivially know how many rows there are for a given Id, which means the top gets a bit trickier than you'd guess. In effect, your clustered index turns into a waste of space and performance (as far as those sample queries are concerned).
Seeing just the tiny part of your database, I'd say having a clustered index on Id and ReadTime is a bad idea. Why do you do that?

It looks like ID isn't a PK by itself, but along with ReadTime (based on your 3rd picture).
Therefore the index is built on the (ID,ReadTime) pair, and this index isn't used by your query.
Try adding an index on ID only.

Related

MS SQL: Performance for querying ID descending

This question relates to a table in Microsoft SQL Server which is usually queried with ORDER BY Id DESC.
Would there be a performance benefit from setting the primary key to PRIMARY KEY CLUSTERED (Id DESC)? Or would there be a need for an index? Or is it as fast as it gets without any of it?
Table:
CREATE TABLE [dbo].[Items] (
[Id] INT IDENTITY (1, 1) NOT NULL,
[Category] INT NOT NULL,
[Name] NVARCHAR(255) NULL,
CONSTRAINT [PK_Items] PRIMARY KEY CLUSTERED ([Id] ASC)
)
Query:
SELECT TOP 1 * FROM [dbo].[Items]
WHERE Catgory = 123
ORDER BY [Id] DESC
Would there be a performance benefit from setting the primary key to PRIMARY KEY
CLUSTERED (Id DESC)?
Given as you show is: IT DEPENDS.
The filter is on Category = 123. To find all entries of Category 123, because there is NO INDEX defined, the server has to do a table scan. Unless you havea VERY large result set, and / or some awfully comically bad configured tempdb and very low memory (because disc is only used when running out of memory for tempdb) the sorting of hte result will be irrelevant compared to the table scan.
You are literally following the wrong tail. You are WAY more likely to speed up the query by adding a non-unique index to Cateogory so that the query can prefilter the data fast based on your query condition.
If you would analzy the query plan for this query (which you should - technically we should not even ANSWER this quesstion without you showing SOME effort, and a look at the query plan is like the FIRST thing you do) you would very likely see that the time is spent on on the query, NOT the result sort.
Creating an index in asc or desc order does not make a big difference in “ORDER BY” when there is only one column, but when there is a need to sort data in two different directions one column in ascending order and the other column in descending order the way the index is created does make a big difference.
Look this article that do many example:
https://www.mssqltips.com/sqlservertip/1337/building-sql-server-indexes-in-ascending-vs-descending-order/
In your scenario I advise you to create an index on Category Column without include “Id” because the clustered index is always included in non-clustered index.
There is no difference according to the following
I'd suggest defining an index on (category, id desc).
It will give you best performance for your query.
As others have indicated, an index on Category (assuming you don't have one) is the biggest performance boost possible here.
But as for your actual question. For a single order by query like you have, it does not matter if the query/index is ordered by desc or asc as far as performance goes. SQL Server can swap those easily (starting a the beginning or the end of the data structure)
Where performance becomes an issue for performance is when you:
Have more than order by column
Your index has more than one column
Your order by is opposing the order on the index.
So, say your Primary Key had ID asc and Category asc, and then you query by ID asc and Category desc. Then SQL Server can't use the order on the index to do the search.
There are a few caveats and gotchas. After searching a bit, this answer seems to have them listed:
SQL Server indexes - ascending or descending, what difference does it make?

db2 10.5 multi-column index explanation

My first time working with indexes in database and so far I've learn that if you have a multi-column index such as index('col1', 'col2', 'col3'), and if you do a query that uses where col2='col2' and col3='col3', that index would not be use.
I also learn that if a column is very low selectivity column. Indexing is useless.
However, from my test, it seems none of the above is true at all. Can someone explain more on this?
I have a table with more than 16 million records. Let's say claimID is the primary key, then there're a historynumber column that only have 3 distinct values (1,2,3), and a last column with storeNumber that has about 1 million distinct values.
I have an index for claimID alone, another index(historynumber, claimID), and other index with index(historynumber, storeNumber), and finally index(storeNumber, historynumber).
My guess was that if I do:
select * from my_table where claimId='123456' and historynumber = 1
would be much faster than
select * from my_table where historynumber = 1 and claimId = '123456'
However, the 2 have exactly the same performance (instant). So I thought the primary key index can work on any column order. Therefore, I tried the same thing but on historynumber and storeNumber instead. The result is exactly the same. Then I start trying out on columns that has no indexes and of course the result is the same also.
Finally, I do a
select * from my_table where historynumber = 1
and the query takes so long I had to cancel it.
So my conclusion is that the column order in where clause is completely useless, and so is the column order in the index definition since it seems like the database is smart enough to tell which column is the highest selectivity column.
Could someone give me an example that could prove otherwise?
Index explanation is a huge topic.
Don't worry about the sequence of different attributes in the SQL - it has no effect whether you specify
...where claimId='123456' and historynumber = 1
or the other way round. Each SQL is checked and optimized by the optimizer. To proove how the data gets accessed you could do a EXPLAIN. Check the documentation for more details.
For your other problem
select * from my_table where historynumber = 1
with an index of (storeNumber, historynumber).
Have you ever tried to lookup the name of a caller (having the telephone number) in a telephone book?
Well it is pretty much the same for an index - so the column order when creatin the index matters!
There are techniques which could help - i.e. index jump scan - but there is no guarantee.
Check out following sites to learn a little bit more about DB2 indexes:
http://db2commerce.com/2013/09/19/db2-luw-basics-indexes/
http://use-the-index-luke.com/sql/where-clause/the-equals-operator/concatenated-keys

SQL Server Index Usage with an Order By

I have a table named Workflow. There are 38M rows in the table. There is a PK on the following columns:
ID: Identity Int
ReadTime: dateTime
If I perform the following query, the PK is not used. The query plan shows an index scan being performed on one of the nonclustered indexes plus a sort. It takes a very long time with 38M rows.
Select TOP 100 ID From Workflow
Where ID > 1000
Order By ID
However, if I perform this query, a nonclustered index (on LastModifiedTime) is used. The query plan shows an index seek being performed. The query is very fast.
Select TOP 100 * From Workflow
Where LastModifiedTime > '6/12/2010'
Order By LastModifiedTime
So, my question is this. Why isn't the PK used in the first query, but the nonclustered index in the second query is used?
Without being able to fish around in your database, there are a few things that come to my mind.
Are you certain that the PK is (id, ReadTime) as opposed to (ReadTime, id)?
What execution plan does SELECT MAX(id) FROM WorkFlow yield?
What about if you create an index on (id, ReadTime) and then retry the test, or your query?
Since Id is an identity column, having ReadTime participate in the index is superfluous. The clustered key already points to the leaf data. I recommended you modify your indexes
CREATE TABLE Workflow
(
Id int IDENTITY,
ReadTime datetime,
-- ... other columns,
CONSTRAINT PK_WorkFlow
PRIMARY KEY CLUSTERED
(
Id
)
)
CREATE INDEX idx_LastModifiedTime
ON WorkFlow
(
LastModifiedTime
)
Also, check that statistics are up to date.
Finally, If there are 38 million rows in this table, then the optimizer may conclude that specifying criteria > 1000 on a unique column is non selective, because > 99.997% of the Ids are > 1000 (if your identity seed started at 1). In order for an index to considered helpful, the optimizer must conclude that < 5% of the records would be selected. You can use an index hint to force the issue (as already stated by Dan Andrews). What is the structure of the non-clustered index that was scanned?

In SQL Server, is TOP deterministic by default when used on a table with a clustered index?

So I was trying to explain to some people why this query is a bad idea:
SELECT z.ReportDate, z.Zipcode, SUM(z.Sales) AS Sales,
COALESCE(
(SELECT TOP (1) GroupName
FROM dbo.zipGroups
WHERE (Zipcode = z.Zipcode)), 'Unknown') AS GroupName,
COALESCE(
(SELECT TOP (1) GroupCode
FROM dbo.zipGroups
WHERE (Zipcode = z.Zipcode)), 0) AS GroupNumber
FROM dbo.Report_ByZipcode AS z
GROUP BY z.ReportDate, z.Zipcode
and suggesting a better way to write it, when my boss ended the discussion with, "Well, it's been returning the right data for the last year and we haven't had any problems with it, so it's fine."
At which point I thought to myself, how in the world is that even possible?
After some digging, I discovered these facts:
This query is supposed to group sales by Zipcode and date, and link those to the largest Group (by population size) that a Zipcode is assigned to by way of the zipGroups table.
Each Zipcode can be assigned to 0 to many Groups, and if a Zipcode is assigned to 0 Groups, it's simply not in the zipGroups table.
A Group is a geographical area, and the GroupNumbers are ranked by largest to smallest by population (for example, the group covering the NY-NJ-CT tri-state area is GroupNumber 1, and North Platte, Nebraska is GroupNumber 209).
The zipGroups table has not changed in at least 2 years.
The zipGroups table has a clustered index with Zipcode, GroupNumber (ascending) as the keys.
The combination of Zipcode, GroupNumber is unique in zipGroups.
So my question has 2 parts.
A) Even though there are no ORDER BY clauses in those SELECT TOP queries, are they actually deterministic because the clustered index is basically providing it a default ORDER BY?
B1) If that is true, is the query, however precariously, actually doing what it's supposed to do?
B2) If that is not true, can you help me prove it?
Note: I've already re-written this to use joins, so I don't need the SQL to fix it, I need to get it into production so I stop worrying about it breaking.
SQL Server makes no guarantees about the ordering of records in the absence of ORDER BY. It might yield the correct results 999,999 times and then fail on the millionth try. Don't do it.
Always use an order by with a TOP statement. The order is not guaranteed to be in the order of the clustered index as demonstrate in this blog post (complete with a query that disproves it):
Without ORDER BY, there is no default sort order.
Even if it did go by the clustered index, I wouldn't write queries that depend on undocumented behavior of the DB engine and it is better to be explicit for readability.
If you're relying on a clustered index rather than the collation, then getting the right order is coincidental, not deterministic.
In the real world, indexes can be changed from one kind to another, for good reasons, bad reasons, or no reason at all. And, in the real world, you don't necessarily get to choose which index SQL Server will use in executing a query. (Or whether it will use an index at all.)
Technically, the collation can also be changed for good reasons, bad reasons, or no reason at all. But everybody knows changing the collation will change the sort order--that's its job, after all--so it's not a surprise. (Ever heard of "the principle of least surprise"?)
The link by JohnFx is good, although long and hard to follow. Here's a small snippet on it's own that will show the data returning in non-clustered index order.
CREATE TABLE t1 (x INT NOT NULL PRIMARY KEY CLUSTERED, z INT NOT NULL UNIQUE);
INSERT INTO t1 (x,z) VALUES (1,4);
INSERT INTO t1 (x,z) VALUES (3,3);
INSERT INTO t1 (x,z) VALUES (2,2);
INSERT INTO t1 (x,z) VALUES (4,1);
SELECT x, z FROM t1;
Output (you should get)
x z
----------- -----------
4 1
2 2
3 3
1 4
The execution plan shows it using the Unique (or other) index instead of the clustered index.
Even if the clustered index is chosen, it may not sort correctly if the data is being merged from parallelism, if the TOP N count is high enough.
Having said that, since you are only using TOP(1) and if the table has only one index available, it can be considered deterministic since it will only use that index and pick the first entry in the index pages.
A) Even though there are no ORDER BY clauses in those SELECT TOP queries, are they actually deterministic because the clustered index is basically providing it a default ORDER BY?
B1) If that is true, is the query, however precariously, actually doing what it's supposed to do?
When top is specified without ordering, the ordering is a side effect of the method of access chosen by the query optimizer. Since the query optimizer would use the clustered index to resolve this query, you get a quite nice side effect.
I wouldn't use the word deterministic, as the query optimizer might not be deterministic. However in the case where the optimizer choses the clustered index, yes - the query does what it is supposed to do.
ORDER should still be specified, so as to lock the correctness into the query. One should separate correctness ("What do you want") and implementation ("How do you get it") into query and optimizer plan, respectively.
B2) If that is not true, can you help me prove it?
Assuming there are more columns in the ZipGroups table, a Nonclustered index containing the only two relevant columns could be added that would be preferred over the clustered index. If the nonclustered index had a different ordering (Zipcode asc, GroupNumber desc), then the query would break.

SQL Server slow select from large table

I have a table with about 20+ million records.
Structure is like:
EventId UNIQUEIDENTIFIER
SourceUserId UNIQUEIDENTIFIER
DestinationUserId UNIQUEIDENTIFIER
CreatedAt DATETIME
TypeId INT
MetaId INT
Table is receiving about 100k+ records each day.
I have indexes on each column except MetaId, as it is not used in 'where' clauses
The problem is when i want to pick up eg. latest 100 records for desired SourceUserId
Query sometimes takes up to 4 minutes to execute, which is not acceptable.
Eg.
SELECT TOP 100 * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND
(
TypeId IN (2, 3, 4)
OR
(TypeId = 60 AND SrcMemberId != DstMemberId)
)
ORDER BY CreatedAt DESC
I can't do partitioning etc as I am using Standard version of SQL Server and Enterprise is too expensive.
I also think that the table is quite small to be that slow.
I think the problem is with ORDER BY clause as db must go through much bigger set of data.
Any ideas how to make it quicker ?
Perhaps relational database is not a good idea for that kind of data.
Data is always being picked up ordered by CreatedAt DESC
Thank you for reading.
PabloX
You'll likely want to create a composite index for this type of query - when the query runs slowly it is most likely choosing to scan down an index on the CreatedAt column and perform a residual filter on the SourceUserId value, when in reality what you want to happen is to jump directly to all records for a given SourceUserId ordered properly - to achieve this, you'll want to create a composite index primarily on SourceUserId (performing an equality check) and secondarily on CreateAt (to preserve the order within a given SourceUserId value). You may want to try adding the TypeId in as well, depending on the selectivity of this column.
So, the 2 that will most likely give the best repeatable performance (try them out and compare) would be:
Index on (SourceUserId, CreatedAt)
Index on (SourceUserId, TypeId, CreatedAt)
As always, there are also many other considerations to take into account with determining how/what/where to index, as Remus discusses in a separate answer one big consideration is covering the query vs. keeping lookups. Additionally you'll need to consider write volumes, possible fragmentation impact (if any), singleton lookups vs. large sequential scans, etc., etc.
I have indexes on each column except
MetaId
Non-covering indexes will likely hit the 'tipping point' and the query would revert to a table scan. Just adding an index on every column because it is used in a where clause does not equate good index design. To take your query for example, a good 100% covering index would be:
INDEX ON (SourceUserId , CreatedAt) INCLUDE (TypeId, SrcMemberId, DstMemberId)
Following index is also usefull, altough it still going to cause lookups:
INDEX ON (SourceUserId , CreatedAt) INCLUDE (TypeId)
and finaly an index w/o any included column may help, but is just as likely will be ignored (depends on the column statistics and cardinality estimates):
INDEX ON (SourceUserId , CreatedAt)
But a separate index on SourceUSerId and one on CreatedAt is basically useless for your query.
See Index Design Basics.
The fact that the table has indexes built on GUID values, indicates a possible series of problems that would affect performance:
High index fragmentation: since new GUIDs are generated randomly, the index cannot organize them in a sequential order and the nodes are spread unevenly.
High number of page splits: the size of a GUID (16 bytes) causes many page splits in the index, since there's a greater chance than a new value wont't fit in the remaining space available in a page.
Slow value comparison: comparing two GUIDs is a relatively slow operation because all 33 characters must be matched.
Here a couple of resources on how to investigate and resolve these problems:
How to Detect Index Fragmentation in SQL Server 2000 and 2005
Reorganizing and Rebuilding Indexes
How Using GUIDs in SQL Server Affect Index Performance
I would recomend getting the data in 2 sep var tables
INSERT INTO #Table1
SELECT * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND
(
TypeId IN (2, 3, 4)
)
INSERT INTO #Table2
SELECT * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND
(
(TypeId = 60 AND SrcMemberId != DstMemberId)
)
then apply a unoin from the selects, ordered and top. Limit the data from the get go.
I suggest using a UNION:
SELECT TOP 100 x.*
FROM (SELECT a.*
FROM EVENTS a
WHERE a.typeid IN (2, 3, 4)
UNION ALL
SELECT b.*
FROM EVENTS b
WHERE b.typeid = 60
AND b.srcmemberid != b.dstmemberid) x
WHERE x.sourceuserid = '15b534b17-5a5a-415a-9fc0-7565199c3461'
We've realised a minor gain by moving to a BIGINT IDENTITY key for our event table; by using that as a clustered primary key, we can cheat and use that for date ordering.
I would make sure CreatedAt is indexed properly
you could split the query in two with an UNION to avoid the OR (which can cause your index not to be used), something like
SElect * FROM(
SELECT TOP 100 * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND TypeId IN (2, 3, 4)
UNION SELECT TOP 100 * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND TypeId = 60 AND SrcMemberId != DstMemberId
)
ORDER BY CreatedAt DESC
Also, check that the uniqueidentifier indexes are not CLUSTERED.
If there are 100K records added each day, you should check your index fragmentation.
And rebuild or reorganize it accordingly.
More info :
SQLauthority