This question relates to a table in Microsoft SQL Server which is usually queried with ORDER BY Id DESC.
Would there be a performance benefit from setting the primary key to PRIMARY KEY CLUSTERED (Id DESC)? Or would there be a need for an index? Or is it as fast as it gets without any of it?
Table:
CREATE TABLE [dbo].[Items] (
[Id] INT IDENTITY (1, 1) NOT NULL,
[Category] INT NOT NULL,
[Name] NVARCHAR(255) NULL,
CONSTRAINT [PK_Items] PRIMARY KEY CLUSTERED ([Id] ASC)
)
Query:
SELECT TOP 1 * FROM [dbo].[Items]
WHERE [Category] = 123
ORDER BY [Id] DESC
Would there be a performance benefit from setting the primary key to PRIMARY KEY
CLUSTERED (Id DESC)?
Given what you show, the answer is: IT DEPENDS.
The filter is on Category = 123. Because there is NO INDEX on Category, the server has to scan the table to find the entries of Category 123. Unless you have a VERY large result set, and/or a comically badly configured tempdb and very low memory (disk is only used for the sort when it runs out of memory), the sorting of the result will be irrelevant compared to the table scan.
You are literally chasing the wrong tail. You are WAY more likely to speed up the query by adding a non-unique index on Category so that the server can prefilter the data quickly based on your query condition.
If you analyze the query plan for this query (which you should; technically we should not even ANSWER this question without you showing SOME effort, and a look at the query plan is the FIRST thing you do), you would very likely see that the time is spent on the scan, NOT on the result sort.
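As a minimal sketch of that advice, the index and a plan check might look like the following. The index name IX_Items_Category is an assumption made for illustration; the table and query are the ones from the question.

```sql
-- Hypothetical index name; non-unique index on the filter column.
CREATE NONCLUSTERED INDEX IX_Items_Category
    ON dbo.Items (Category);
GO

-- Show the estimated plan as text instead of executing the query.
-- SET SHOWPLAN_TEXT must be the only statement in its batch.
SET SHOWPLAN_TEXT ON;
GO

SELECT TOP 1 *
FROM dbo.Items
WHERE Category = 123
ORDER BY Id DESC;
GO

SET SHOWPLAN_TEXT OFF;
GO
```

Comparing the plan text before and after creating the index should show whether the scan has been replaced by a seek on the new index.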
Creating an index in asc or desc order does not make a big difference in “ORDER BY” when there is only one column, but when there is a need to sort data in two different directions one column in ascending order and the other column in descending order the way the index is created does make a big difference.
See this article, which walks through many examples:
https://www.mssqltips.com/sqlservertip/1337/building-sql-server-indexes-in-ascending-vs-descending-order/
In your scenario I advise you to create an index on the Category column without including “Id”, because the clustered index key is always included in a non-clustered index.
There is no difference according to the following
I'd suggest defining an index on (category, id desc).
It will give you best performance for your query.
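A sketch of that composite index, assuming the table from the question (the index name is hypothetical):

```sql
-- Hypothetical index name. Category is the equality filter,
-- Id DESC matches the ORDER BY of the query.
CREATE NONCLUSTERED INDEX IX_Items_Category_Id
    ON dbo.Items (Category ASC, Id DESC);

-- With this index the server can seek to Category = 123 and read
-- the first entry, which already has the highest Id in that category.
SELECT TOP 1 *
FROM dbo.Items
WHERE Category = 123
ORDER BY Id DESC;
```

Because the query uses SELECT *, the Name column still has to be fetched from the clustered index with a key lookup, but with TOP 1 that is a single row.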
As others have indicated, an index on Category (assuming you don't have one) is the biggest performance boost possible here.
But as for your actual question: for a query with a single ORDER BY column like yours, it does not matter for performance whether the index is ordered DESC or ASC. SQL Server can read the index in either direction (starting at the beginning or at the end of the data structure).
Index direction becomes a performance issue when all of the following hold:
You have more than one ORDER BY column
Your index has more than one column
Your ORDER BY mixes directions in a way that is neither the index order nor its exact reverse
So, say your Primary Key had ID asc and Category asc, and then you query by ID asc and Category desc. Then SQL Server can't use the order on the index to do the search.
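The following sketch illustrates the mixed-direction case. The table dbo.OrdersDemo, its columns, and the index names are all hypothetical and exist only for this illustration.

```sql
-- Hypothetical demo table.
CREATE TABLE dbo.OrdersDemo (
    CustomerId INT      NOT NULL,
    OrderDate  DATETIME NOT NULL,
    Amount     MONEY    NOT NULL
);

CREATE INDEX IX_OrdersDemo_Asc
    ON dbo.OrdersDemo (CustomerId ASC, OrderDate ASC);

-- Both of these match the index order or its exact reverse,
-- so the index can supply the ordering (forward or backward scan).
SELECT CustomerId, OrderDate FROM dbo.OrdersDemo
ORDER BY CustomerId ASC, OrderDate ASC;

SELECT CustomerId, OrderDate FROM dbo.OrdersDemo
ORDER BY CustomerId DESC, OrderDate DESC;

-- Mixed directions: the index above cannot supply this ordering,
-- so a Sort operator is expected in the plan.
SELECT CustomerId, OrderDate FROM dbo.OrdersDemo
ORDER BY CustomerId ASC, OrderDate DESC;

-- An index declared with the same mixed directions removes the need to sort.
CREATE INDEX IX_OrdersDemo_Mixed
    ON dbo.OrdersDemo (CustomerId ASC, OrderDate DESC);
```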
There are a few caveats and gotchas. After searching a bit, this answer seems to have them listed:
SQL Server indexes - ascending or descending, what difference does it make?
I am unable to get a clear-cut answer to this contentious question.
MSDN documentation mentions
Clustered
Clustered indexes sort and store the data rows in the table or view
based on their key values. These are the columns included in the
index definition. There can be only one clustered index per table,
because the data rows themselves can be sorted in only one order.
The only time the data rows in a table are stored in sorted order is
when the table contains a clustered index. When a table has a
clustered index, the table is called a clustered table. If a table
has no clustered index, its data rows are stored in an unordered
structure called a heap.
While I see most of the answers
Does a SELECT query always return rows in the same order? Table with clustered index
http://sqlwithmanoj.com/2013/06/02/clustered-index-do-not-guarantee-physically-ordering-or-sorting-of-rows/
answering negative.
Which is it?
Just to be clear. Presumably, you are talking about a simple query such as:
select *
from table t;
First, if all the data on the table fits on a single page and there are no other indexes on the table, it is hard for me to imagine a scenario where the result set is not ordered by the primary key. However, this is because I think the most reasonable query plan would require a full-table scan, not because of any requirement -- documented or otherwise -- in SQL or SQL Server. Without an explicit order by, the ordering in the result set is a consequence of the query plan.
That gets to the heart of the issue. When you are talking about the ordering of the result sets, you are really talking about the query plan. And, the assumption of ordering by the primary key really means that you are assuming that the query uses a full-table scan. What is ironic is that people make the assumption, without actually understanding the "why". Furthermore, people have a tendency to generalize from small examples (okay, this is part of the basis of human intelligence). Unfortunately, they see consistently that result sets from simple queries on small tables are always in primary key order and generalize to larger tables. The induction step is incorrect in this example.
What can change this? Off-hand, I think that a full table scan would return the data in primary key order if the following conditions are met:
Single threaded server.
Single file filegroup
No competing indexes
No table partitions
I'm not saying this is always true. It just seems reasonable that under these circumstances such a query would use a full table scan starting at the beginning of the table.
Even on a small table, you can get surprises. Consider:
select NonPrimaryKeyColumn
from table
The query plan would probably decide to use an index on table(NonPrimaryKeyColumn) rather than doing a full table scan. The results would not be ordered by the primary key (unless by accident). I show this example because indexes can be used for a variety of purposes, not just order by or where filtering.
If you use a multi-threaded instance of the database and you have reasonably sized tables, you will quickly learn that results without an order by have no explicit ordering.
And finally, SQL Server has a pretty smart optimizer. I think there is some reluctance to use order by in a query because users think it will automatically do a sort. SQL Server works hard to find the best execution plan for the query. If it recognizes that the order by is redundant because of the rest of the plan, then the order by will not result in a sort.
And, of course, if you want to guarantee the ordering of results, you need order by in the outermost query. Even a query like this:
select *
from (select top 100 t.* from t order by col1) t
Does not guarantee that the results are ordered in the final result set. You really need to do:
select *
from (select top 100 t.* from t order by col1) t
order by col1;
to guarantee the results in a particular order. This behavior is documented here.
Without ORDER BY, there is no default sort order, even if you have a clustered index.
This link has a good example:
CREATE SCHEMA Data AUTHORIZATION dbo
GO
CREATE TABLE Data.Numbers(Number INT NOT NULL PRIMARY KEY)
GO
DECLARE @ID INT;
SET NOCOUNT ON;
SET @ID = 1;
WHILE @ID < 100000 BEGIN
INSERT INTO Data.Numbers(Number)
SELECT @ID;
SET @ID = @ID+1;
END
CREATE TABLE Data.WideTable(ID INT NOT NULL
CONSTRAINT PK_WideTable PRIMARY KEY,
RandomInt INT NOT NULL,
CHARFiller CHAR(1000))
GO
CREATE VIEW dbo.WrappedRand
AS
SELECT RAND() AS random_value
GO
CREATE FUNCTION dbo.RandomInt()
RETURNS INT
AS
BEGIN
DECLARE @ret INT;
SET @ret = (SELECT random_value*1000000 FROM dbo.WrappedRand);
RETURN @ret;
END
GO
INSERT INTO Data.WideTable(ID,RandomInt,CHARFiller)
SELECT Number, dbo.RandomInt(), 'asdf'
FROM Data.Numbers
GO
CREATE INDEX WideTable_RandomInt ON Data.WideTable(RandomInt)
GO
SELECT TOP 100 ID FROM Data.WideTable
OUTPUT:
1407
253
9175
6568
4506
1623
581
As you have seen, the optimizer has chosen to use a non-clustered
index to satisfy this SELECT TOP query.
Clearly you cannot assume that your results are ordered unless you
explicitly use ORDER BY clause.
One must specify ORDER BY in the outermost query in order to guarantee rows are returned in a particular order. The SQL Server optimizer will optimize the query and data access to improve performance which may result in rows being returned in a different order. Examples of this are allocation order scans and parallelism. A relational table should always be viewed as an unordered set of rows.
I wish the MSDN documentation were clearer about this "sorting". It is more correct to say that SQL Server b-tree indexes provide ordering by 1) storing adjacent keys in the same page and 2) linking index pages in key order.
I have a schema that looks like this:
create table image_tags (
image_tag_id serial primary key,
image_id int not null
);
create index on image_tags(image_id);
When I execute a query with two columns, it is ridiculously slow (eg, select * from image_tags order by image_id desc, image_tag_id desc limit 10;). If I drop one of those columns in the sort (doesn't matter which), it is super fast.
I used explain on both queries, but it didn't help me understand why two columns in the order by clause were so slow, it just showed me how much slower using two columns was.
For order by image_id desc, image_tag_id desc sorting to be optimized via indexes you need to have this index:
create index image_tags_id_tag on image_tags(image_id, image_tag_id);
Only a composite index (with few exceptions, I presume, but not in this case) lets the optimizer use it to determine the order straight away.
create index on image_tags(image_id, image_tag_id);
Try adding this index.
You only have an index on one of the columns used by the query you want to execute. For better speed you should create a two-column index such as
create index on image_tags(image_id, image_tag_id);
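As a sketch in the PostgreSQL dialect of the question, the effect of the composite index can be checked with EXPLAIN. The index name below is an assumption; the schema and query are from the question.

```sql
-- Hypothetical index name; columns in the same order as the ORDER BY.
create index image_tags_image_id_tag_id
    on image_tags (image_id, image_tag_id);

-- Both sort keys are descending, which is the exact reverse of the index,
-- so the planner can walk the index backward instead of sorting.
explain
select *
from image_tags
order by image_id desc, image_tag_id desc
limit 10;
```

With the index in place, the plan would be expected to show a backward index scan feeding the Limit node rather than a Sort over the whole table.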
I have a table that stored blog information and the information is identified by a column bId.
For example, if the first blog has bId 1, the latest blog after 10 posts will have bId 10.
However, this information can be accessed from anywhere as any post can be visited. But as a general understanding latest post will be visited more often than the older ones.
Which means there are more chances of fetching of information with bId 10 rather than 1.
So if I create an index on this blog table’s column bId, does it matter in the case above whether I declare the index ASC or DESC, or will it be the same?
And which one will be more appropriate in my case: CREATE INDEX [IDX_bId] ON [blog] ([bId] ASC); or CREATE INDEX [IDX_bId] ON [blog] ([bId] DESC);?
First you probably want to create that index as CLUSTERED. A clustered index defines the order for the actual data rows; it's essentially the order of the table itself. If you do not define a clustered index at all, your table is a 'heap', meaning that pages are stored in no particular order on disk.
To answer your question: if the only access pattern will be by ID, e.g.:
SELECT *
FROM blog
WHERE bId = 10
then it doesn't matter if your index is defined as ASC or DESC.
However for a query like this:
SELECT TOP (100) *
FROM blog
ORDER BY bID DESC
the clustered index should be defined as bID DESC. The reason for this is that if the index is ASC, then this query will need to perform a backward scan, and backward scans cannot use parallelism (as explained here: http://sqlmag.com/t-sql/descending-indexes). If your table is large, this may impact the performance.
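A minimal sketch of such a definition, assuming a simplified version of the blog table (the Title column and the constraint name are hypothetical):

```sql
-- Hypothetical, simplified table; only bId comes from the question.
CREATE TABLE dbo.blog (
    bId   INT IDENTITY (1, 1) NOT NULL,
    Title NVARCHAR(200)       NOT NULL,
    CONSTRAINT PK_blog PRIMARY KEY CLUSTERED (bId DESC)
);

-- Reads the clustered index in its declared order (a forward scan).
SELECT TOP (100) *
FROM dbo.blog
ORDER BY bId DESC;
```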
In short not really. The index on the column will provide the most important information. The optimiser will then decide how best to retrieve your rows. If you have a composite index then it may have an impact.
I have a table
Archive(VarId SMALLINT, Timestamp DATETIME, Value FLOAT)
VarId is not unique. The table contains measurements. I have a clustered index on Timestamp. Now I have the requirement of finding a measurement for a specific VarId before a specific date. So I do:
SELECT TOP(1) *
FROM Archive
WHERE VarId = 135
AND Timestamp < '2012-06-01 14:21:00'
ORDER BY Timestamp DESC;
If there is no such measurement this query searches the whole table. So I introduced another index on (VarId, Timestamp).
My problem is: SQL Server doesn't seem to care about it, the query still takes forever. When I explicitly state 'WITH (INDEX = <id>)' it works as it should. What can I do so SQL Server uses my index automatically?
I'm using SQL Server 2005.
There are different possibilities with this.
I'll try to help you isolate them:
It could be SQL Server is favouring your Clustered Index (very likely it's the Primary Key) over your newly created index. One way to solve this is to have a NonClustered Primary Key and cluster the index on the other two fields (varid and timestamp). That is, if you don't want varid and timestamp to be the PK.
Also, looking at the (estimated) execution plan might help.
But I believe #1 only works nicely if those 2 fields are the most commonly used (queried) index. To find out if this is the case, it would be good to analyse which indexes users are most likely to use (from http://sqlblog.com/blogs/louis_davidson/archive/2007/07/22/sys-dm-db-index-usage-stats.aspx):
select
ObjectName = object_schema_name(indexes.object_id) + '.' + object_name(indexes.object_id),
indexes.name,
case when is_unique = 1 then 'UNIQUE ' else '' end + indexes.type_desc,
ddius.user_seeks,
ddius.user_scans,
ddius.user_lookups,
ddius.user_updates
from
sys.indexes
left join sys.dm_db_index_usage_stats ddius on (
indexes.object_id = ddius.object_id
and indexes.index_id = ddius.index_id
and ddius.database_id = db_id()
)
WHERE
object_schema_name(indexes.object_id) != 'sys' -- exclude sys objects
AND object_name(indexes.object_id) LIKE 'Archive'
order by
ddius.user_seeks + ddius.user_scans + ddius.user_lookups
desc
Good luck
My guess is that your index design is the issue. You have a CLUSTERED index on a DATETIME field and I suspect that it is not unique data, much like VarId, and hence you did not declare it as UNIQUE. Because it is not unique there is a hidden, 4-byte "uniqueifier" field (so that each row can be physically unique regardless of you not giving it unique data) and the rows with the same DATETIME value are essentially random within the group of same DATETIME values (so even narrowing down a time still requires scanning through that grouping). You also have a NONCLUSTERED index on VarId, Timestamp. NONCLUSTERED indexes include the key of the CLUSTERED index, so internally your NONCLUSTERED index is really: VarId, Timestamp, Timestamp (from the CLUSTERED index). So you could have left off the Timestamp column in the NONCLUSTERED index and it would have all been the same to the optimizer, but in a sense it would have been better as it would be a smaller index.
So your physical layout is based on a date while the VarId values are spread across those dates. Hence VarId = 135 can be spread very far apart in terms of data pages. Yes, your non-clustered index does group them together, but the optimizer is probably looking at the fact that you are wanting all fields (the "SELECT *" part) and the Timestamp < '2012-06-01 14:21:00' condition in addition to that seems to get most of what you need as opposed to finding a few rows and doing a bookmark lookup to get the "Value" field to fulfill the "SELECT *". Quite possibly if you do just "SELECT TOP(1) VarId, Timestamp" it would more likely use your NONCLUSTERED index without needing the "INDEX =" hint.
Another issue affecting performance overall could be that the ORDER BY is requesting the Timestamp in DESC order and if you have the CLUSTERED index in ASC order then it would be the opposite direction of what you are looking for (at least in this query). Of course, in that case then it might be ok to have Timestamp in the NONCLUSTERED index if it was in DESC order.
My advice is to rethink the CLUSTERED index. Judging on just this query alone (other queries/uses might alter the recommendation), try dropping the NONCLUSTERED index and recreate the CLUSTERED index with the Timestamp field first, in DESC order, and also with the VarId so it can be declared UNIQUE. So:
CREATE UNIQUE CLUSTERED INDEX [UIX_Archive_Timestamp_VarId]
ON Archive (Timestamp DESC, VarId ASC)
This, of course, assumes that the Timestamp and VarId combination is unique. If not, then still try this without the UNIQUE keyword.
Update:
To pull all of this info and advice together:
When designing indexes you need to consider the distribution of the data and the use-cases for interacting with it. More often than not there is A LOT to consider and several different approaches will appear good in theory. You need to try a few approaches, profile/test them, and see which works best in reality. There is no "always do this" approach without knowing all aspects of what you are doing and what else is going on and what else is planned to use and/or modify this table which I suspect has not been presented in the original question.
So to start the journey, you are ordering records by date and are looking at ranges of dates AND dates naturally occur in order so putting Timestamp first benefits more of what you are doing and has less fragmentation, especially if defined as DESC in the CREATE. Having an NC index on just VarId at that point will then be fine, even if spread out, for looking at a set of rows for a particular VarId. So maybe start there (change order of direction of CLUSTERED index and remove Timestamp from the NC index). See how those changes compare to the existing structure. Then try moving the VarId field into the CLUSTERED index and remove the NC index. You say that the combination is also not unique but does increase the predictability of the ordering of the rows. See how that works. Does this table ever get updated? If not and if the Value field along with Timestamp and VarId would be unique, then try adding that to the CLUSTERED index and be sure to create with the UNIQUE keyword. See how these different approaches work by looking at the Actual Execution Plan and use SET STATISTICS IO ON before running the query and see how the logical reads between the different approaches compare.
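The comparison described above can be sketched like this, using the query from the question. Run it once per index design and compare the "logical reads" figures reported in the Messages output.

```sql
-- Report page reads per table for each statement that follows.
SET STATISTICS IO ON;

SELECT TOP(1) *
FROM Archive
WHERE VarId = 135
AND Timestamp < '2012-06-01 14:21:00'
ORDER BY Timestamp DESC;

SET STATISTICS IO OFF;
```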
Hope this helps :)
You might need to analyze your table to collect statistics, so the optimizer can determine whether to use the index or not.
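In SQL Server, refreshing the statistics might look like the sketch below. FULLSCAN is one possible choice here, not something the original answer specifies; it reads every row, so it can be slow on a large table.

```sql
-- Rebuild all statistics on the Archive table from a full scan of its rows.
UPDATE STATISTICS Archive WITH FULLSCAN;
```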
EDIT: I have added "Slug" column to address performance issues on specific record selection.
I have following columns in my table.
Id Int - Primary key (identity, clustered by default)
Slug varchar(100)
...
EntryDate DateTime
Most of the time, I'm ordering the select statement by EntryDate like below.
Select T.Id, T.Slug, ..., T.EntryDate
From (
Select Id, Slug, ..., EntryDate,
Row_Number() Over (Order By EntryDate Desc, Id Desc) AS RowNum
From TableName
Where ...
) As T
Where T.RowNum Between ... And ...
I'm ordering it by EntryDate and Id in case there are duplicate EntryDates.
When I'm selecting a single record, I do the following.
Select Id, Slug, ..., EntryDate
From TableName
Where Slug = @slug And Year(EntryDate) = @entryYear
And Month(EntryDate) = @entryMonth
I have a unique key of Slug & EntryDate.
What would be a smart choice of keys and indexes in my situation? I'm facing performance issues probably because I'm ordering by a column that is not clustered indexed.
Should I have Id set as non-clustered primary key and EntryDate as clustered index?
I appreciate all your help. Thanks.
EDIT:
I haven't tried adding non-clustered index on the EntryDate. Data inserted from back-end, so performance for insert isn't a big deal for me. Also, EntryDate is not always the date when it is inserted. It can be a past date. Back-end user picks the date.
Based on the current table layout you want some indexes like this.
CREATE INDEX IX_YourTable_1 ON dbo.YourTable
(EntryDate, Id)
INCLUDE (Slug)
WITH (FILLFACTOR=90);
CREATE INDEX IX_YourTable_2 ON dbo.YourTable
(EntryDate, Slug)
INCLUDE (Id)
WITH (FILLFACTOR=80);
Add any other columns you are returning to the INCLUDE line.
Change your second query to something like this.
Select Id, Slug, ..., EntryDate
From TableName
Where Slug = @slug
-- Half-open range: from the first day of the requested month
-- up to, but not including, the first day of the next month.
-- DATEADD(month, n, 0) counts n months from 1900-01-01.
AND EntryDate >= DATEADD(month, (@entryYear - 1900) * 12 + @entryMonth - 1, 0)
AND EntryDate < DATEADD(month, (@entryYear - 1900) * 12 + @entryMonth, 0)
The way your second query is currently written the index will never be used. If you can change the Slug column to a related table it will increase your performance and decrease your storage requirements.
Have you tried simply adding a non-clustered index on the entrydate to see what kind of performance gain you get?
Also, how often is new data added? And will new data that is added always be >= the last EntryDate?
You want to keep ID as a clustered index, as you will most likely join to the table off your id, and not entry date.
A simple non clustered index with just the date field would be fine to speed things up.
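A sketch of that index, using the placeholder table name from the question (the index name is hypothetical):

```sql
-- Hypothetical index name. The clustered key (Id) is carried in the
-- non-clustered index automatically, so it does not need to be listed.
CREATE NONCLUSTERED INDEX IX_TableName_EntryDate
    ON TableName (EntryDate);
```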
Clustering is a bit like "index paging": the index is "chunked" instead of simply being a long list. This is helpful when you've got a lot of data. The DB can search within cluster ranges, then find the individual record. It makes the index smaller, therefore faster to search, but less specific. Once it finds the correct spot in the cluster it then needs to search within the cluster.
It's faster with a lot of data, but slower with smaller data sets.
If you're not searching a lot using the primary key, then cluster the date and leave the primary key non-clustered. It really depends on how complex your queries are with joining other tables.
A clustered index will only make any difference at all if you are returning a bunch of records and some of the fields you return are not part of the index. Otherwise there's no benefit.
You need first to find out what the query plan tells you about why your current queries are slow. Without that, it's mostly idle speculation (which is usually counterproductive when optimizing queries.)
I wouldn't try anything (suggested by me or anyone else) without having a solid query plan to compare with, to at least know if you're doing good or harm.