At my work we currently have a table with 50 million rows that has an index on two Varbinary(16) columns which are ip_start and ip_end.
PRIMARY KEY CLUSTERED
(
[ip_end] ASC,
[ip_start] ASC
)
The first few rows in the table are like this:
ip_start ip_end id
0x00000000 0x00000000 0
0x00000001 0x000000FF 1
0x00000100 0x00FFFFFF 2
0x01000000 0x010000FF 3
The query we use to find matches is:
SELECT TOP 1 id
FROM dbo.ip_ranges WITH (NOLOCK)
WHERE @lookup <= ip_end AND @lookup >= ip_start
When I look up an IP like 0x00000002 it returns id 1 instantly, but if I search for a value that falls in a gap between two ranges, like 0x000000000000001, it takes several seconds to return NULL. Shouldn't SQL Server understand that the varbinary index is ordered and therefore return quickly if there are no matches?
Is there a better way to query this, given that some IPs will fall between ranges, or a better way to index the table so that misses don't cause such a large hit?
Shouldn't SQL Server understand that the varbinary index is ordered and therefore return quickly if there are no matches?
SQL Server understands that the index is ordered, but it does not understand that the ranges do not overlap. The condition @lookup >= ip_start is true for a large share of the IP ranges (about half on average), and scanning those rows is the cost you see for a non-match. A B-tree index cannot use the second key column for a seek when the predicate on the first key column is an inequality.
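To make the cost of a miss concrete, here is a rough Python model of a seek on an index keyed by (ip_end, ip_start), followed by a row-by-row check of ip_start. It is a sketch under stated assumptions, not SQL Server's actual executor: the synthetic ranges, the integer keys standing in for varbinary values, and the function name rows_examined are all hypothetical.

```python
from bisect import bisect_left

# Hypothetical data: 1000 non-overlapping ranges [i*10, i*10 + 4] with gaps
# between them, stored as (ip_end, ip_start, id) and sorted like the
# clustered index (ip_end, ip_start) from the question.
INDEX = [(i * 10 + 4, i * 10, i) for i in range(1000)]
ENDS = [row[0] for row in INDEX]


def rows_examined(lookup):
    """Return (id or None, number of index rows read) for one lookup."""
    # Seek on the leading key only: first row with ip_end >= lookup.
    pos = bisect_left(ENDS, lookup)
    examined = 0
    # The ip_start predicate is checked row by row; the model stops at the
    # first match, like TOP 1.
    for ip_end, ip_start, range_id in INDEX[pos:]:
        examined += 1
        if ip_start <= lookup:
            return range_id, examined
    return None, examined


hit = rows_examined(12)    # 12 lies inside range 1 = [10, 14]
miss = rows_examined(17)   # 17 lies in the gap between [10, 14] and [20, 24]
```

In this model a hit stops at the first row after the seek, while a miss has to read every remaining row, because nothing tells the scan that a later row cannot start at or below the lookup value.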
Unfortunately, standard B-Tree indexes are not optimal for this type of search (inequalities along two dimensions). An R-tree (which I originally learned as RD-tree) is better suited. Those are used primarily for spatial indexes.
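I think I have had success with a query such as this:
SELECT ir.*
FROM (SELECT TOP 1 ir.*
FROM dbo.ip_ranges ir
WHERE @lookup >= ip_start
ORDER BY ip_start DESC
) ir
WHERE @lookup <= ir.ip_end ;
Given an index whose leading column is ip_start, SQL Server should use it for the subquery, quickly finding the one row with the largest ip_start that does not exceed the lookup value (hence the descending sort). The outer query then checks separately whether the lookup value also falls before the end of that range. This works because IP address ranges do not overlap, so only that one row can possibly match.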
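The same two-step idea (seek to the closest range start, then check the end) can be sketched in Python with a binary search. This is only an illustration under assumptions: the rows come from the question's sample data, integers stand in for the varbinary values, and find_range_id is a hypothetical helper name.

```python
from bisect import bisect_right

# Sample rows from the question as (ip_start, ip_end, id), sorted by ip_start.
# Assumption: ranges never overlap, which is what makes one candidate enough.
RANGES = [
    (0x00000000, 0x00000000, 0),
    (0x00000001, 0x000000FF, 1),
    (0x00000100, 0x00FFFFFF, 2),
    (0x01000000, 0x010000FF, 3),
]
STARTS = [r[0] for r in RANGES]


def find_range_id(lookup):
    """Return the id of the range containing lookup, or None on a miss."""
    # Step 1 (the subquery): last range whose ip_start is <= lookup.
    pos = bisect_right(STARTS, lookup) - 1
    if pos < 0:
        return None
    ip_start, ip_end, range_id = RANGES[pos]
    # Step 2 (the outer WHERE): does that single candidate also cover lookup?
    return range_id if lookup <= ip_end else None
```

A miss costs the same as a hit here: one binary search plus one comparison, rather than a scan over every range that starts below the lookup value.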
Create a nonclustered index on ip_start that includes the id column.
Alternatively, change the clustered index to the single column ip_start
and create a nonclustered index on ip_end that includes the id column.
Does an index help with a BETWEEN clause in SQL Server?
If I have a table with 20,000 rows and the query is:
select *
from employee
where empid between 10001 and 20000
Yes. This is sargable.
It can do a range seek on an index with leading column empid: it navigates the B-tree to find the first row >= 10001 and then reads all rows in key order until the end of the range is reached.
You might not get this plan unless the index is covering though. Your query has select * so an index that only contains empid may potentially need to do 10,000 lookups to get the missing columns.
If empid is the primary key of employee then by default it will be the clustered index key (unless you specified otherwise) so this will automatically be covering and you should expect to see a clustered index seek.
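As a rough illustration of what a range seek does, the following Python sketch models the index as a sorted list of keys. It is not how SQL Server stores a B-tree; the range_seek helper and the assumption that empid runs from 1 to 20000 without gaps are hypothetical.

```python
from bisect import bisect_left


def range_seek(sorted_keys, low, high):
    """Return (matching keys, keys examined) for low <= key <= high."""
    # Navigate straight to the first key >= low (the "seek").
    pos = bisect_left(sorted_keys, low)
    matches = []
    examined = 0
    # Read forward in key order until a key passes the end of the range.
    while pos < len(sorted_keys):
        examined += 1
        if sorted_keys[pos] > high:
            break
        matches.append(sorted_keys[pos])
        pos += 1
    return matches, examined


# Hypothetical employee table with empid 1..20000.
empids = list(range(1, 20001))
rows, examined = range_seek(empids, 10001, 20000)
```

In this model the seek never touches the 10,000 keys below the range, which is the saving a sargable BETWEEN predicate gives over a full scan.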
I have a table with definition somewhat like the following:
create table offset_table (
id serial primary key,
offset numeric NOT NULL,
... other fields...
);
The table has about 70 million rows in it.
I envision doing the following query many times
select * from offset_table where offset > 0;
For speed issues, I am wondering whether it would be advised to create an index like:
create index on offset_table(offset);
I am trying to avoid creation of unnecessary indices on this table as it is pretty big already.
As you mentioned in the comments, it would be ~70% of rows that match the offset > 0 predicate.
In that case the index would not be beneficial, since PostgreSQL (and basically every other DBMS) would prefer a full table scan instead. A sequential scan of the whole table is faster than reading the index in order while fetching the matching rows from the table with random access.
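A back-of-the-envelope cost comparison shows why. The sketch below is a toy model, not the PostgreSQL planner: it uses the default values of the seq_page_cost (1.0) and random_page_cost (4.0) settings, ignores CPU costs, caching and bitmap scans, pessimistically charges one random page read per matching row, and assumes a hypothetical 100 rows per page.

```python
SEQ_PAGE_COST = 1.0      # PostgreSQL default for seq_page_cost
RANDOM_PAGE_COST = 4.0   # PostgreSQL default for random_page_cost

TOTAL_ROWS = 70_000_000  # table size from the question
ROWS_PER_PAGE = 100      # hypothetical; depends on the real row width


def seq_scan_cost(total_rows, rows_per_page):
    """Cost of reading every table page once, in order."""
    pages = -(-total_rows // rows_per_page)  # ceiling division
    return pages * SEQ_PAGE_COST


def index_scan_cost(total_rows, selectivity):
    """Pessimistic cost: one random table page fetch per matching row."""
    matching_rows = round(total_rows * selectivity)
    return matching_rows * RANDOM_PAGE_COST


def cheaper_plan(selectivity):
    """Name the cheaper access path for a given fraction of matching rows."""
    seq = seq_scan_cost(TOTAL_ROWS, ROWS_PER_PAGE)
    idx = index_scan_cost(TOTAL_ROWS, selectivity)
    return "seq scan" if seq <= idx else "index scan"
```

Under these assumptions, cheaper_plan(0.7) picks the sequential scan by a wide margin, while a highly selective predicate such as cheaper_plan(0.0001) would favour the index.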
I have the following SQL statement, which I would like to make more efficient. Looking through the execution plan I can see that there is a Clustered Index Scan on @newWebcastEvents. Is there a way I can make this into a seek? Or are there any other ways I can make the below more efficient?
declare @newWebcastEvents table (
webcastEventId int not null primary key clustered (webcastEventId) with (ignore_dup_key=off)
)
insert into @newWebcastEvents
select wel.WebcastEventId
from WebcastChannelWebcastEventLink wel with (nolock)
where wel.WebcastChannelId = 1178
Update WebcastEvent
set WebcastEventTitle = LEFT(WebcastEventTitle, CHARINDEX('(CLONE)',WebcastEventTitle,0) - 2)
where
WebcastEvent.WebcastEventId in (select webcastEventId from @newWebcastEvents)
The @newWebcastEvents table variable contains only a single column, and you're asking for all rows of that table variable in this where clause:
where
WebcastEvent.WebcastEventId in (select webcastEventId from @newWebcastEvents)
So doing a seek on this clustered index is rather pointless: the SQL Server query optimizer needs all columns and all rows of that table variable anyway, so it chooses an index scan.
I don't think this is a performance issue, anyway.
An index seek is useful if you need to pick a very small number of rows (<= 1-2% of the original number of rows) from a large table. Then, going through the clustered index navigation tree and finding those few rows involved makes a lot more sense than scanning the whole table. But here, with a single int column and 15 rows --> it's absolutely pointless to seek, it will be much faster to just read those 15 int values in a single scan and be done with it...
Update: not sure if it makes any difference in terms of performance, but I personally prefer to use joins rather than subselects for "connecting" two tables:
UPDATE we
SET we.WebcastEventTitle = LEFT(we.WebcastEventTitle, CHARINDEX('(CLONE)', we.WebcastEventTitle, 0) - 2)
FROM dbo.WebcastEvent we
INNER JOIN @newWebcastEvents nwe ON we.WebcastEventId = nwe.webcastEventId
I found that if I query the table with a less-than or greater-than operator, SQL Server indexes do not work properly.
Say I have a simple table (TestTable) with only 3 columns like this:
Column Name, column type, primary Key, index
iID, int, yes, cluster index
iCount, int, no, non-cluster index
name, nvarchar(255), no, no index
Now, I query the table like this:
SELECT * FROM TestTable WHERE iCount = 10
Very good, SQL Server will use the nonclustered index on column iCount to retrieve the result.
However, if I query the table like this:
SELECT * FROM TestTable WHERE iCount < 10
SQL Server will do an index scan over the clustered index on iID to retrieve the result.
I am wondering why SQL Server is not able to use the proper index when I use a less-than or greater-than operator in the query.
If the table has very few rows, it is cheaper for SQL Server to scan the clustered index rather than using the non-clustered index and then doing a lookup for the rest of the columns in the clustered index. If that's the case, change the query to SELECT iCount FROM... and you should see the query plan change to using the index as you are expecting.
I have a table named Workflow. There are 38M rows in the table. There is a PK on the following columns:
ID: Identity Int
ReadTime: dateTime
If I perform the following query, the PK is not used. The query plan shows an index scan being performed on one of the nonclustered indexes plus a sort. It takes a very long time with 38M rows.
Select TOP 100 ID From Workflow
Where ID > 1000
Order By ID
However, if I perform this query, a nonclustered index (on LastModifiedTime) is used. The query plan shows an index seek being performed. The query is very fast.
Select TOP 100 * From Workflow
Where LastModifiedTime > '6/12/2010'
Order By LastModifiedTime
So, my question is this. Why isn't the PK used in the first query, but the nonclustered index in the second query is used?
Without being able to fish around in your database, there are a few things that come to my mind.
Are you certain that the PK is (id, ReadTime) as opposed to (ReadTime, id)?
What execution plan does SELECT MAX(id) FROM WorkFlow yield?
What about if you create an index on (id, ReadTime) and then retry the test, or your query?
Since Id is an identity column, having ReadTime participate in the index is superfluous. The clustered key already points to the leaf data. I recommend you modify your indexes:
CREATE TABLE Workflow
(
Id int IDENTITY,
ReadTime datetime,
-- ... other columns,
CONSTRAINT PK_WorkFlow
PRIMARY KEY CLUSTERED
(
Id
)
)
CREATE INDEX idx_LastModifiedTime
ON WorkFlow
(
LastModifiedTime
)
Also, check that statistics are up to date.
Finally, if there are 38 million rows in this table, the optimizer may conclude that the criterion ID > 1000 on a unique column is not selective, because more than 99.997% of the Ids are > 1000 (if your identity seed started at 1). As a rough rule of thumb, the optimizer considers an index helpful only when it estimates that a small fraction of the rows (on the order of 5% or less) would be selected. You can use an index hint to force the issue (as already stated by Dan Andrews). What is the structure of the non-clustered index that was scanned?
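The selectivity figure above can be checked with a few lines of Python. The sketch assumes the identity values run from the seed upward without gaps, which a real table will only approximate; fraction_selected is a hypothetical helper name.

```python
def fraction_selected(total_rows, threshold, seed=1):
    """Fraction of rows with id > threshold, assuming gap-free identity values."""
    max_id = seed + total_rows - 1
    # Ids strictly greater than the threshold, never fewer than zero.
    matching = max(0, max_id - max(threshold, seed - 1))
    return matching / total_rows


# 38 million rows, predicate ID > 1000, identity seed 1.
selected = fraction_selected(38_000_000, 1000)
percent = selected * 100
```

With these inputs percent comes out just above 99.997, so the predicate filters out almost nothing.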