Does an index help with a BETWEEN clause in SQL Server?
If I have a table with 20,000 rows and the query is:
select *
from employee
where empid between 10001 and 20000
Yes. This is sargable.
SQL Server can do a range seek on an index whose leading column is empid: it navigates the B-tree to find the first row >= 10001 and then reads rows in key order until the end of the range is reached.
You might not get this plan unless the index is covering though. Your query has select * so an index that only contains empid may potentially need to do 10,000 lookups to get the missing columns.
If empid is the primary key of employee then by default it will be the clustered index key (unless you specified otherwise) so this will automatically be covering and you should expect to see a clustered index seek.
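As a rough illustration of the range seek described above, here is a minimal Python sketch that models the index as a sorted list of keys. The function name range_seek and the empid values 1..20000 are assumptions made for the example, not anything taken from the poster's table, and a real B-tree stores keys in pages rather than in one list.

```python
from bisect import bisect_left, bisect_right

def range_seek(sorted_keys, low, high):
    """Return (matching keys, number of keys touched) for low <= key <= high."""
    start = bisect_left(sorted_keys, low)    # descend to the first key >= low
    stop = bisect_right(sorted_keys, high)   # first position past the last key <= high
    return sorted_keys[start:stop], stop - start

# Hypothetical data: empid values 1..20000, already in index order.
empids = list(range(1, 20001))
rows, touched = range_seek(empids, 10001, 20000)
print(len(rows), touched)  # 10000 10000
```

Only the keys inside the range are touched. The first 10,000 keys are skipped by the binary search, which is the part a scan cannot avoid.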
We have a query where the table is partitioned on column Adate.
Row count: 56595943, partition scheme - yearly, no of partitions - 300
Clustered index columns : empid, Adate
Query :
select top 1 Adate
from emp
where empid = 134556 and Adate <= {ts '7485-09-01 00:00:00.0'}
order by Adate desc
The actual execution plan returns a clustered index seek operation with 93% of the total query cost on clustered index key.
But why is the optimizer recommending a missing index with 92% of cost?
missing index details: Improve query cost:92%
create nonclustered index IDX_NC on dbo.emp([empid], [Adate])
The missing index has an improvement measure of 14755268. According to Microsoft, the improvement measure baseline is 1,000,000.
Why is this happening? Would you recommend creating a nonclustered index on columns that already make up the clustered index key?
Well - consider this:
You do have the clustered index on (empid, Adate).
The clustered index contains the whole data, i.e. the leaf-level pages of the clustered index contain the complete data records (all the columns in your table).
If the query uses the clustered index for the search, it might still need to load much more data than is actually needed: the whole record, once for every row that matches your criteria.
If you have a non-clustered index on just (empid, Adate), and your query really only requires Adate (in its SELECT list of columns), then this index will be much smaller - it contains only those two columns (none of the overhead of all the other columns, which are not needed for your current query). So scanning this index, or loading these index pages, will load much less data compared to the clustered index.
From that point of view, yes, even having a nonclustered index on the same columns that make up your clustered index can be beneficial for certain query scenarios - that's probably what the SQL Server query optimizer picks up here.
I am experiencing very slow performance (35 sec) when trying to join 2 tables: one has 39M rows, the other 10k. This runs on an Azure SQL Premium instance, which is a very decent server.
select m39.*
from [Table_With_39M_Rows] m39
inner join [Table_With_10K_Rows] k10 on m39.[Id] = k10.[Id]
Even a count(*) takes around 10 seconds:
select count(*)
from [Table_With_39M_Rows] m39
inner join [Table_With_10K_Rows] k10 on m39.[Id] = k10.[Id]
Here are the table details:
Table [Table_With_39M_Rows] has around 39 million rows (50 columns) with a clustered columnstore index:
CREATE CLUSTERED COLUMNSTORE INDEX CCI_Table_With_39M_Rows
ON Table_With_39M_Rows
CREATE UNIQUE NONCLUSTERED INDEX UNCI_Table_With_39M_Rows_Id
ON Table_With_39M_Rows (Id ASC)
Table [Table_With_10K_Rows] has around 10k rows (50 columns) and Id as the primary key
ALTER TABLE Table_With_10K_Rows
ADD CONSTRAINT PK_Table_With_10K_Rows
PRIMARY KEY CLUSTERED([Id] ASC)
The clustered columnstore index scan takes 99% of the cost and slows everything down.
How can I optimize this particular join? What indexing strategy should I employ?
Clustered columnstore indexes are helpful if rowgroup elimination works (think of it as skipping entire segments of rows that cannot satisfy the predicate) and if the queries are analytical in nature.
To check whether segment elimination is occurring, look at the STATISTICS IO output for the query.
Below is a sample demo with a query of my own (since we don't have your test data), which may help you understand more.
query:
select s.* from sales s
join
numbers n
on n.number=s.id
The numbers table has only 65356 rows and the sales table has more than 3 million rows. Each segment can hold about one million rows. In the STATISTICS IO output, SQL Server reads 2 segments (about 2 million rows) and skips 2, which is not great: I would expect only one segment to be read and the remaining three to be skipped. But 2 are read, as shown below:
Table 'sales'. Segment reads 2, segment skipped 2.
This probably happens because the clustered columnstore index was created from a heap, so the rows were not loaded in id order. Try the following.
Drop your existing clustered columnstore index. In my case it is:
drop index nci on sales
Now create a clustered index first and the clustered columnstore index next. This helps SQL Server insert the rows into the clustered columnstore index in order. You might also want to use MAXDOP 1 to avoid parallelism and unordered rows:
create clustered index nci on sales(id)
create clustered columnstore index nci on sales
with (drop_existing=on,maxdop =1)
If you run the query now, you can see that segment elimination occurs and the query is fast:
Table 'sales'. Segment reads 1, segment skipped 2.
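To make the effect of load order concrete, here is a small Python sketch of segment elimination. It is a simplified model under stated assumptions: each segment keeps only a (min, max) pair for id, the table has 4,000 hypothetical ids in segments of 1,000 rather than millions, and the helper names build_segments and segments_read are invented for the example.

```python
import random

def build_segments(ids, segment_size):
    """Split ids into fixed-size segments and record each segment's (min, max)."""
    chunks = [ids[i:i + segment_size] for i in range(0, len(ids), segment_size)]
    return [(min(c), max(c)) for c in chunks]

def segments_read(segments, low, high):
    """Count segments whose [min, max] overlaps the predicate range [low, high]."""
    return sum(1 for seg_min, seg_max in segments
               if seg_min <= high and seg_max >= low)

random.seed(0)
ids = list(range(1, 4001))      # hypothetical ids, loaded in order
shuffled = ids[:]
random.shuffle(shuffled)        # stands in for rows loaded from an unordered heap

ordered_segments = build_segments(ids, 1000)
heap_segments = build_segments(shuffled, 1000)

ordered_reads = segments_read(ordered_segments, 1, 650)
heap_reads = segments_read(heap_segments, 1, 650)
print(ordered_reads)  # 1: only the first segment can contain ids 1..650
print(heap_reads)     # every segment whose (min, max) overlaps the range
```

With ordered loading the min/max ranges of the segments do not overlap, so a narrow predicate touches one segment. With unordered loading almost every segment spans nearly the whole id range, so few or none can be skipped.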
References and further reading:
https://www.sqlpassion.at/archive/2017/01/30/columnstore-segment-elimination/
https://blogs.msdn.microsoft.com/sqlserverstorageengine/2016/07/17/columnstore-index-how-do-they-defer-from-traditional-btree-indices-on-rowstore-tables/
https://blogs.msdn.microsoft.com/sql_server_team/columnstore-index-performance-rowgroup-elimination/
I suggest you be consistent on use of [].
ID for a foreign key is not a good name.
Columnstore Indexes Described
Columnstore indexes give high performance gains for queries that use
full table scans, and are not well-suited for queries that seek into
the data, searching for a particular value.
Just because you need columnstore for other purposes does not make it a good fit for this query.
Try a regular nonclustered index on [Table_With_39M_Rows].[ID].
At my work we currently have a table with 50 million rows that has an index on two Varbinary(16) columns which are ip_start and ip_end.
PRIMARY KEY CLUSTERED
(
[ip_end] ASC,
[ip_start] ASC
)
The first few rows in the table are like this:
ip_start ip_end id
0x00000000 0x00000000 0
0x00000001 0x000000FF 1
0x00000100 0x00FFFFFF 2
0x01000000 0x010000FF 3
The query we use to find matches is:
SELECT TOP 1 id
FROM dbo.ip_ranges WITH (NOLOCK)
WHERE @lookup <= ip_end AND @lookup >= ip_start
When I look up an IP like 0x00000002 it returns id 1 instantly, but if I search for a value that falls between ranges, like 0x000000000000001, it takes several seconds to return NULL. Shouldn't SQL Server understand that the varbinary index is ordered and therefore return quickly if there are no matches?
Is there a better way to query this with the expectation that some ip's will be between ranges or a better way to index the table so that misses don't cause such a large hit?
Shouldn't SQL Server understand that the varbinary index is ordered and therefore return quickly if there are no matches?
SQL Server understands that the index is ordered, but it does not understand that the ranges do not overlap. The condition @lookup >= ip_start is true for a large share of the IP ranges (about half on average), and reading through those is the cost you see for a non-match. A B-tree index cannot use the second key column to narrow a seek when the predicate on the first key column is an inequality.
Unfortunately, standard B-Tree indexes are not optimal for this type of search (inequalities along two dimensions). An R-tree (which I originally learned as RD-tree) is better suited. Those are used primarily for spatial indexes.
I think I have had success with a query such as this:
SELECT ir.*
FROM (SELECT TOP 1 ir.*
      FROM dbo.ip_ranges ir
      WHERE @lookup >= ip_start
      ORDER BY ip_start DESC
     ) ir
WHERE @lookup <= ir.ip_end ;
Given an index that leads on ip_start, SQL Server should be able to use it for the subquery and quickly find the one range with the greatest ip_start that does not exceed the lookup value. You can then check separately whether the lookup value falls before the end of that range. This works because IP address ranges do not overlap.
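The same idea can be sketched in Python: take the range with the greatest ip_start that does not exceed the lookup value, then test ip_end on that single row. This is only a model under assumptions. The addresses are plain integers instead of varbinary, the four ranges are the sample rows from the question, and find_range_id is a hypothetical helper name.

```python
from bisect import bisect_right

# Sample rows from the question as (ip_start, ip_end, id), sorted by ip_start.
# Assumption: ranges never overlap.
ranges = [
    (0x00000000, 0x00000000, 0),
    (0x00000001, 0x000000FF, 1),
    (0x00000100, 0x00FFFFFF, 2),
    (0x01000000, 0x010000FF, 3),
]
starts = [r[0] for r in ranges]

def find_range_id(lookup):
    """Return the id of the range containing lookup, or None if it falls in a gap."""
    pos = bisect_right(starts, lookup) - 1   # last range with ip_start <= lookup
    if pos < 0:
        return None
    ip_start, ip_end, range_id = ranges[pos]
    return range_id if lookup <= ip_end else None

print(find_range_id(0x00000002))  # 1
print(find_range_id(0x01000100))  # None: past the end of the last range
```

A miss costs the same as a hit, one seek plus one comparison, because only a single candidate row is ever examined.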
Create a nonclustered index on ip_start with id as an included column.
Or change the clustered index to the single column ip_start and create a nonclustered index on ip_end with id as an included column.
I was recently assigned to database performance tuning. I know a little about SQL Server and decided to create some indexes.
I referred to this: http://sqlserverplanet.com/ddl/create-index
But I don't understand how the other index options, such as INCLUDE and WITH, will help. I searched Google too but could not find a simple description of when to use them.
CREATE NONCLUSTERED INDEX IX_NC_PresidentNumber
ON dbo.Presidents (PresidentNumber)
INCLUDE (President,YearsInOffice,RatingPoints)
WHERE ElectoralVotes IS NOT NULL
CREATE NONCLUSTERED INDEX IX_NC_PresidentNumber
ON dbo.Presidents (PresidentNumber)
WITH ( DATA_COMPRESSION = ROW )
CREATE NONCLUSTERED INDEX IX_NC_PresidentNumber
ON dbo.Presidents (PresidentNumber)
WITH ( DATA_COMPRESSION = PAGE )
In what scenarios should I use the above? Will they increase performance?
Data compression can help query performance too: compressed data occupies fewer pages, so a query has to load fewer pages and extents and I/O is reduced. The trade-off is some extra CPU to compress and decompress the data.
I can't speak to the DATA_COMPRESSION option, but the INCLUDE option can definitely improve performance. If you select only PresidentNumber and one or more of the President, YearsInOffice, or RatingPoints columns, and ElectoralVotes is not null, then your query will get its values from the index itself and not have to touch the underlying table. If your table has additional columns and you include one of those in your query, then it will have to retrieve values from the table as well as the index.
Select top 20 PresidentNumber, President, YearsInOffice, RatingPoints
From Presidents
where ElectoralVotes IS NOT NULL
The above query will only read from IX_NC_PresidentNumber and will not have to pull data from the Presidents table, because all columns in the query are included in the index.
Select top 20 PresidentNumber, President, YearsInOffice, PoliticalParty
From Presidents
where ElectoralVotes IS NOT NULL
This query will use the index IX_NC_PresidentNumber and the Presidents table as well because the PoliticalParty column in the query is not included in the index.
Select PresidentNumber, President, YearsInOffice, RatingPoints
From Presidents
Where RatingPoints > 50
This query will most likely end up doing a table scan, because the WHERE clause in the query does not match the WHERE clause used in the index and there is no limit on the row count.
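The covering test in the first two queries comes down to a set comparison, sketched below in Python. It is a deliberate simplification: it looks only at the columns in the SELECT list, ignores the filter predicate and the optimizer's cost decisions, and the function name index_covers is invented for the example.

```python
def index_covers(query_columns, key_columns, included_columns):
    """True if every column the query selects is stored in the index."""
    return set(query_columns) <= set(key_columns) | set(included_columns)

# Key and INCLUDE columns of IX_NC_PresidentNumber as defined above.
key_columns = ["PresidentNumber"]
included_columns = ["President", "YearsInOffice", "RatingPoints"]

first_query = ["PresidentNumber", "President", "YearsInOffice", "RatingPoints"]
second_query = ["PresidentNumber", "President", "YearsInOffice", "PoliticalParty"]

print(index_covers(first_query, key_columns, included_columns))   # True
print(index_covers(second_query, key_columns, included_columns))  # False
```

The second query fails the test only because of PoliticalParty, which is why it needs the base table in addition to the index.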
I have a table named Workflow. There are 38M rows in the table. There is a PK on the following columns:
ID: Identity Int
ReadTime: dateTime
If I perform the following query, the PK is not used. The query plan shows an index scan being performed on one of the nonclustered indexes plus a sort. It takes a very long time with 38M rows.
Select TOP 100 ID From Workflow
Where ID > 1000
Order By ID
However, if I perform this query, a nonclustered index (on LastModifiedTime) is used. The query plan shows an index seek being performed. The query is very fast.
Select TOP 100 * From Workflow
Where LastModifiedTime > '6/12/2010'
Order By LastModifiedTime
So, my question is this. Why isn't the PK used in the first query, but the nonclustered index in the second query is used?
Without being able to fish around in your database, there are a few things that come to my mind.
Are you certain that the PK is (id, ReadTime) as opposed to (ReadTime, id)?
What execution plan does SELECT MAX(id) FROM WorkFlow yield?
What about if you create an index on (id, ReadTime) and then retry the test, or your query?
Since Id is an identity column, having ReadTime participate in the index is superfluous. The clustered key already points to the leaf data. I recommend you modify your indexes:
CREATE TABLE Workflow
(
Id int IDENTITY,
ReadTime datetime,
-- ... other columns,
CONSTRAINT PK_WorkFlow
PRIMARY KEY CLUSTERED
(
Id
)
)
CREATE INDEX idx_LastModifiedTime
ON WorkFlow
(
LastModifiedTime
)
Also, check that statistics are up to date.
Finally, if there are 38 million rows in this table, then the optimizer may conclude that the criterion ID > 1000 on a unique column is not selective, because more than 99.997% of the Ids are > 1000 (if your identity seed started at 1). As a rough rule of thumb, for an index seek to be considered helpful the optimizer must estimate that only a small fraction of the records (a figure of under 5% is often quoted) would be selected. You can use an index hint to force the issue (as already stated by Dan Andrews). What is the structure of the nonclustered index that was scanned?
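The selectivity figure above is simple arithmetic, shown here as a short Python check. It assumes exactly 38,000,000 rows with Ids running from 1 upward without gaps, which the question does not confirm.

```python
total_rows = 38_000_000           # row count from the question, rounded
threshold = 1000                  # predicate: ID > 1000

# Assumption: Ids are 1..total_rows with no gaps, so Ids 1..1000 are excluded.
matching = total_rows - threshold
fraction = matching / total_rows
print(f"{fraction:.5%}")  # 99.99737%
```

A predicate that keeps more than 99.99% of the rows gives the optimizer almost nothing to filter on, so the estimated row count alone does not favour any particular index.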