Index for table in SQL Server 2012

I had a question on indexes. I have a table like this:
id BIGINT PRIMARY KEY NOT NULL,
cust_id VARCHAR(8) NOT NULL,
dt DATE NOT NULL,
sale_type VARCHAR(10) NOT NULL,
sale_type_sub VARCHAR(40),
amount DOUBLE PRECISION NOT NULL
The table has several million rows. Assuming that queries will often filter results by date ranges, sale types, amounts above and below certain values, and that joins will occur on cust_id... what do you all think is the ideal index structure?
I wasn't sure if a clustered index would be best, or individual indexes on each column? Both?

Any serious table in SQL Server should always have a well-chosen, good clustering key - it makes so many things faster and more efficient. From your table structure, I'd use id as the clustering key.
Next, you say joins occur on cust_id - so I would put an index on cust_id. This speeds up joins in general and is a generally accepted recommendation.
Next, it really depends on your queries. Are they all using the same columns in their WHERE clauses? Or do you get queries that use dt, and others that use sale_type separately?
The point is: the fewer indices the better - so if ever possible, I'd try to find one compound index that covers all your needs. But if you have an index on three columns (e.g. on (sale_type, dt, amount)), then that index can be used for queries
using all three columns in the WHERE clause
using sale_type and dt in the WHERE clause
using only sale_type in the WHERE clause
but it could NOT be used for queries that use dt or amount alone. A compound index always requires you to use the n left-most columns in the index definition - otherwise it cannot be used.
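To make the rule concrete - a minimal sketch, where the table name dbo.Sales and the index name are made up, since the question doesn't give them:
CREATE NONCLUSTERED INDEX IX_Sales_SaleType_Dt_Amount
ON dbo.Sales (sale_type, dt, amount);
-- can use the index: the left-most column(s) are specified
SELECT id FROM dbo.Sales WHERE sale_type = 'retail' AND dt >= '2012-01-01' AND amount > 100;
SELECT id FROM dbo.Sales WHERE sale_type = 'retail' AND dt >= '2012-01-01';
SELECT id FROM dbo.Sales WHERE sale_type = 'retail';
-- cannot seek on the index: the leading column sale_type is missing
SELECT id FROM dbo.Sales WHERE dt >= '2012-01-01';
SELECT id FROM dbo.Sales WHERE amount > 100;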
So my recommendation would be:
define the clustering key on ID
define a nonclustered index on cust_id for the JOINs (sketched below)
examine your system to see what other queries you have - what criteria is being used for selection, how often do those queries execute? Don't over-optimize a query that's executed once a month - but do spend time on those that are executed dozens of times every hour.
Add one index at a time - let the system run for a bit - do you measure an improvement in query times? Does it feel faster? If so: leave that index. If not: drop it again. Iterate until you're happy with the overall system performance.
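In DDL terms, the first two recommendations might look like this - a sketch, reusing the assumed dbo.Sales table name from above:
-- id as the clustering key: an inline PRIMARY KEY becomes the
-- clustered index by default, so no extra statement is needed for it
-- nonclustered index to support the JOINs on cust_id
CREATE NONCLUSTERED INDEX IX_Sales_CustId
ON dbo.Sales (cust_id);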

The best way to find indexes for your table is SQL Server Profiler.

Related

SQL index for date range query

For a few days, I've been struggling with improving the performance of my database, and there are some issues that I'm still kind of confused about regarding indexing in a SQL Server database.
I'll try to be as informative as I can.
My database currently contains about 100k rows and will keep growing, therefore I'm trying to find a way to make it work faster.
I'm also writing to this table, so if your suggestion will drastically hurt write performance, please let me know.
The overall goal is to select all rows with a specific name that fall within a date range.
That will usually mean selecting 3,000+ rows out of a lot more.
Table schema:
CREATE TABLE [dbo].[reports]
(
[id] [int] IDENTITY(1,1) NOT NULL,
[IsDuplicate] [bit] NOT NULL,
[IsNotValid] [bit] NOT NULL,
[Time] [datetime] NOT NULL,
[ShortDate] [date] NOT NULL,
[Source] [nvarchar](350) NULL,
[Email] [nvarchar](350) NULL,
CONSTRAINT [PK_dbo.reports]
PRIMARY KEY CLUSTERED ([id] ASC)
) ON [PRIMARY]
This is the SQL query I'm using:
SELECT *
FROM [db].[dbo].[reports]
WHERE Source = 'name1'
AND ShortDate BETWEEN '2017-10-13' AND '2017-10-15'
As I understand it, my best approach to improve efficiency without hurting write time too much would be to create a nonclustered index on Source and ShortDate.
Which I did, like so (index schema):
CREATE NONCLUSTERED INDEX [Source&Time]
ON [dbo].[reports]([Source] ASC, [ShortDate] ASC)
Now we get to the tricky part which got me completely lost: the index above sometimes works, sometimes half works, and sometimes doesn't work at all...
(not sure if it matters, but currently 90% of the rows have the same Source, although this won't stay that way for long)
With the query below, the index isn't used at all. I'm using SQL Server 2014, and the execution plan shows only a clustered index scan:
SELECT *
FROM [db].[dbo].[reports]
WHERE Source = 'name1'
AND ShortDate BETWEEN '2017-10-10' AND '2017-10-15'
With this query the index isn't used at all either, although SQL Server suggests creating an index with the date first and the source second... I read that the index should match the order of the query? It also says to include all the columns I'm selecting - is that a must? Again, I read that I should include in the index only the columns I'm searching on.
SELECT *
FROM [db].[dbo].[reports]
WHERE Source = 'name1'
AND ShortDate = '2017-10-13'
SQL Server index suggestion -
/* The Query Processor estimates that implementing the following
index could improve the query cost by 86.2728%. */
/*
USE [db]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[reports] ([ShortDate], [Source])
INCLUDE ([id], [IsDuplicate], [IsNotValid], [Time], [Email])
GO
*/
Now, I tried the index SQL Server suggested and it works - it seems to use the nonclustered index 100% for both of the queries above.
I tried that index after deleting the included columns, and it doesn't work... it seems I must include all the columns I'm selecting in the index?
BTW, the index I made also works if I include all the columns.
To summarize: it seems the column order of the index didn't matter, as it worked both as Source + ShortDate and as ShortDate + Source.
But for some reason it's a must to include all the columns... (which will drastically affect writes to this table?)
Thanks a lot for reading. My goal is to understand why this happens and what I should do differently (not just the solution, as I'll need to apply it to other projects as well).
Cheers :)
Indexing in SQL Server is part know-how from long experience (and many hours of frustration), and part black magic. Don't beat yourself up over that too much - that's what a place like SO is ideal for - lots of brains, lots of experience from many hours of optimizing, that you can tap into.
I read that the index should be made by the order the query is?
If you read this - it is absolutely NOT TRUE. The order of the columns is relevant, but in a different way: a compound index (made up of multiple columns) will only ever be considered if your query specifies the n left-most columns from the index definition.
Classic example: a phone book with an index on (city, lastname, firstname). Such an index might be used:
in a query that specifies all three columns in its WHERE clause
in a query that uses city and lastname (find all "Miller" in "Detroit")
or in a query that only filters by city
but it can NEVER EVER be used if you want to search only by firstname... that's the trick about compound indexes you need to be aware of. But if you always use all the columns from an index, their ordering is typically not really relevant - the query optimizer will handle that for you.
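Translated into SQL - a toy sketch with a made-up PhoneBook table, just to illustrate the left-most-columns rule:
CREATE TABLE dbo.PhoneBook
(
id INT IDENTITY(1,1) PRIMARY KEY,
city NVARCHAR(100) NOT NULL,
lastname NVARCHAR(100) NOT NULL,
firstname NVARCHAR(100) NOT NULL
);
CREATE NONCLUSTERED INDEX IX_PhoneBook_City_Last_First
ON dbo.PhoneBook (city, lastname, firstname);
-- can seek: the left-most columns are specified
SELECT id FROM dbo.PhoneBook WHERE city = 'Detroit' AND lastname = 'Miller';
-- cannot seek: firstname alone skips the leading columns
SELECT id FROM dbo.PhoneBook WHERE firstname = 'John';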
As for the included columns - those are stored only in the leaf level of the nonclustered index - they are NOT part of the search structure of the index, and you cannot specify filter values for those included columns in your WHERE clause.
The main benefit of these included columns is this: if you search a nonclustered index and you actually find the value you're looking for - what do you have available at that point? The nonclustered index stores the columns from the index definition (ShortDate and Source), plus the clustering key (if you have one - and you should!) - but nothing else.
So in this case, once a match is found, and your query wants everything from that table, SQL Server has to do what is called a Key lookup (often also referred to as a bookmark lookup) in which it takes the clustered key and then does a Seek operation against the clustered index, to get to the actual data page that contains all the values you're looking for.
If you have included columns in your index, then the leaf level page of your non-clustered index contains
the columns as defined in the nonclustered index
the clustering key column(s)
all those additional columns as defined in your INCLUDE statement
If those columns "cover" your query, i.e. provide all the values that your query needs, then SQL Server is done once it finds the match in the nonclustered index - it can take all the values it needs from that leaf-level page, and it does NOT need to do another (expensive) key lookup into the clustered index to get the actual values.
Because of this, explicitly selecting only the columns you really need can be beneficial - in that case, you might be able to create an efficient covering index that provides all the values for your SELECT. Always using SELECT * makes that really hard, or next to impossible...
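Applied to your table, a covering index might look like this - a sketch, assuming you trim the SELECT to the columns you actually need (Time and Email are picked here purely for illustration):
CREATE NONCLUSTERED INDEX IX_reports_Source_ShortDate
ON dbo.reports ([Source], [ShortDate])
INCLUDE ([Time], [Email]);
-- covered: every column the query touches lives in the index leaf level,
-- so no key lookup into the clustered index is needed
SELECT [Time], [Email]
FROM dbo.reports
WHERE [Source] = 'name1'
AND [ShortDate] BETWEEN '2017-10-13' AND '2017-10-15';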
In general, you want the index to be from most selective (i.e. filtering out the most possible records) to least selective; if a column has low cardinality, the query optimizer may ignore it.
That makes intuitive sense - if you have a phone book and you're looking for people called "Smith" with the initial "A", you want to start by searching for "Smith" first and then the "A"s, rather than finding all people whose initial is "A" and then filtering out those called "Smith". After all, the odds are that one in 26 people has the initial "A".
So, in your example, I'd guess you have a wide range of values in ShortDate - so that's the first column the query optimizer tries to filter on. You say you have few distinct values in Source, so the query optimizer may decide to ignore it; in that case, having it as the second column in the index is no use either.
The order of the predicates in your WHERE clause is irrelevant - you can swap them around and get exactly the same results, so the query optimizer ignores that order.
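You can check the cardinality of both columns yourself - a quick sketch against your table:
-- how many distinct values does each column hold?
SELECT COUNT(DISTINCT [ShortDate]) AS DistinctDates,
COUNT(DISTINCT [Source]) AS DistinctSources,
COUNT(*) AS TotalRows
FROM dbo.reports;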
EDIT:
So, yes, make the index. Imagine you have a pile of cards to sort - on the first pass, you want to eliminate as many cards as possible. Assuming it's all evenly spread: if you have 1,000 distinct short dates over a million rows, starting with short_date leaves you with about 1,000 matching rows; starting with source leaves you with 100,000 rows.
The included columns of an index are for the columns you are selecting.
Because you do SELECT * (which isn't good practice), the index won't be used - SQL Server would have to go back to the table for every row to get the values of the remaining columns.
For your scenario, I would drop the default clustered index (if there is one) and create a new clustered index with the following statement:
USE [db]
GO
CREATE CLUSTERED INDEX CIX_reports
ON [dbo].[reports] ([ShortDate],[Source])
GO
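One caveat: [PK_dbo.reports] is currently the clustered index, so you would have to rebuild the primary key as nonclustered first - a sketch, assuming no foreign keys reference it:
ALTER TABLE [dbo].[reports] DROP CONSTRAINT [PK_dbo.reports];
ALTER TABLE [dbo].[reports]
ADD CONSTRAINT [PK_dbo.reports] PRIMARY KEY NONCLUSTERED ([id] ASC);
-- now the CREATE CLUSTERED INDEX statement above can run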

Optimizing my SQL queries - picking the right indexes

I have a basic table as follows.
create table Orders
(
ID INT IDENTITY(1,1) PRIMARY KEY,
Company VARCHAR(3),
ItemID INT,
BoxID INT,
OrderNum VARCHAR(5),
Status VARCHAR(5),
--about 10 more columns, varchars and ints and dates
)
I'm trying to optimize all my SQL since I am getting a fair few deadlocks and some slowness - but I'm no expert on this sort of thing!
I created a few indexes:
Clustered on the ID (Primary Key).
Non-Clustered index on ([ItemID])
Non-Clustered index on ([BoxID])
Non-Clustered index on ([Company],[OrderNum],[Status])
Maybe 1 or 2 more on some other columns
But I'm not 100% happy with the results.
SELECT * FROM Orders WHERE ItemID=100
Gives me an index seek + a key lookup and a Nested loop (Inner join).
I can see why - but don't know if I should do anything about it. The key lookup is 97% of the batch, which seems bad!
Every query used will pull back every column in the table, but I don't like the idea of including every column in the index.
I'm now making a change so that everything queries on the [Company] field as well. Every query will use it, because a result set should never contain more than one company. So they will all change:
SELECT * FROM Orders WHERE ItemID=100 --Old
SELECT * FROM Orders WHERE Company='a' and ItemID=100 --New
But the execution plan of that gives me exactly the same as not including company (which does surprise me!).
Why are the two execution plans above the same? (I have no index on [company] at the moment)
Is it worth adding [Company] to all my indexes, since it seems to make no difference to the execution plan?
Should I instead just add one single index on [Company] and keep the original indexes? But will that mean every query will have 2 seeks?
Is it worth 'including' all other columns in my indexes to avoid the key lookup (making the index a tonne bigger, but potentially speeding it up)? i.e.
CREATE NONCLUSTERED INDEX [IX_Orders_MyIndex] ON [Orders]
( [Company] ASC, [OrderNum] ASC, [Status] ASC )
INCLUDE ([ID],[ItemID],[BoxID],
[Column5],[Column6],[Column7],[Column8],[Column9],[Column10],etc)
That seems messy if I did it on 4 or 5 indexes.
Basically I have 4-5 queries which run quite often (some selects and updates) so I want to make it as efficient as possible.
All queries will use the [company] field, and at least one other. How should I go about it?
Any help appreciated :)
In your execution plan, you say that the key lookup takes 97% of the batch.
In this case, that doesn't mean much on its own, because an index seek is very fast and there isn't much other work in the plan for it to be compared against.
That lookup is the step that actually reads the rows found via the index you specified.
Why are the two execution plans above the same? (I have no index on [company] at the moment)
Non-Clustered index on ([Company],[OrderNum],[Status])
This index will be considered only if Company, OrderNum and Status appear in your where clause.
Concatenated (composite) indexes generate a key that is effectively the column values strung together, something like 0000000000000. When you pass only Company, you create an incomplete key that requires a wildcard for the other two values.
It would look a little like key LIKE 'XXX%', and that logic requires an index scan, which is time-consuming.
The optimizer will determine that it's preferable to first seek the rows from the ItemID index and then scan those to match any with the required company.
Is it worth adding [Company] to all my indexes since it seems to make no difference to the execution plan?
You should consider having a Company index instead of adding it to all your indexes.
A composite index could speed things up by reducing the number of nested loops, but you have to think it through thoroughly.
The order of the fields you add to such an index is very important; they should be ordered by uniqueness (most selective first) to allow a better seek. Also, you should never add a field that might not be used in a query.
Should I instead just add 1 single index to [Company] and keep the original indexes? - but will that mean every query will have 2 seeks?
Having more than one index seek is not all that bad; the seeks can often run in parallel, and only the results of both are matched together afterwards.
Is it worth 'including' all other columns in my indexes to avoid the key lookup? (making the index a tonne bigger, but potentially speeding it up?)
It is worth it when only a few fields might appear optionally in the WHERE clause, or when you have queries that select only those fields while using the specified index.
Last notes
Not all indexes are equal: comparing strings (varchar) is not the same as comparing numbers (integer, datetime, bytes, etc.).
Also, keeping them clean helps a lot: if your indexes are fragmented, they will be next to useless in terms of performance gain.
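If you want to check fragmentation, something along these lines works - a sketch using the standard DMV, assuming your Orders table lives in dbo:
-- fragmentation per index on the Orders table
SELECT i.name, ps.avg_fragmentation_in_percent, ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.Orders'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
ON i.object_id = ps.object_id AND i.index_id = ps.index_id;
-- lightweight defragmentation; for heavy fragmentation use REBUILD instead
ALTER INDEX ALL ON dbo.Orders REORGANIZE;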

Is the addition of a second ID column beneficial to index?

Let's say I have a table tbl_FacilityOrders with two foreign keys fk_FacilityID and fk_OrderID in SQL Server 2005. It could contain orders from a few hundred facilities. I need to query single records and will have both the facility ID and the order ID available to me. Is it better to define an index on fk_FacilityID then fk_OrderID and pass both to the query, or to just use fk_OrderID? Since there will be fewer facility IDs than order IDs, I could see weeding out the other facilities' records first possibly being beneficial.
A second question is: if I were using the two-column query above, does the order I write my WHERE clause columns in matter, or is the engine smart enough to evaluate them in the order of the index?
E.G. Is:
WHERE fk_facilityID = #FacilityID AND fk_OrderID = #OrderID
equivalent to:
WHERE fk_OrderID = #OrderID AND fk_FacilityID = #FacilityID
?
Is it better to define an index on fk_FacilityID then fk_OrderID and pass both to the query, or to just use fk_OrderID?
If OrderId is unique, there's no real added benefit to adding the other field for the scenario given. It is a good idea to index your FKs, though, since they will always be a JOIN key.
if I were using the two-column query above, does the order I write my WHERE clause columns in matter, or is the engine smart enough to evaluate them in the order of the index?
Nope, order is irrelevant here. All that matters is that the SETS of fields match, i.e. FieldA and FieldB are both in the index and in the WHERE clause.
The order of fields in the index DOES matter, though. You can't use the second field in an index without knowing the value of the first field.
You should create an index for each of your foreign keys... not just for the purpose of this question, but because indexing your foreign keys is good practice in general.
To answer your second question: the two statements are equivalent. SQL Server should internally re-order the predicates to arrive at the optimal execution plan... however, you should always validate the generated execution plan just to make sure it's behaving as you would expect.
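A sketch of the two foreign-key indexes for the table in the question (the index names are made up):
CREATE NONCLUSTERED INDEX IX_FacilityOrders_FacilityID
ON dbo.tbl_FacilityOrders (fk_FacilityID);
CREATE NONCLUSTERED INDEX IX_FacilityOrders_OrderID
ON dbo.tbl_FacilityOrders (fk_OrderID);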

SQL Server index included columns

I need help understanding how to create indexes. I have a table that looks like this
Id
Name
Age
Location
Education
PhoneNumber
My query looks like this:
SELECT *
FROM table1
WHERE name = 'sam'
What's the correct way to create an index for this with included columns?
What if the query has a order by statement?
SELECT *
FROM table1
WHERE name = 'sam'
ORDER BY id DESC
What if I have 2 parameters in my where statement?
SELECT *
FROM table1
WHERE name = 'sam'
AND age > 12
The correct way to create an index with included columns? Either via Management Studio/Toad/etc., or T-SQL (see the documentation):
CREATE INDEX idx_table1 ON dbo.table1 (name) INCLUDE (id);
What if the Query has an ORDER BY
The ORDER BY can use indexes, if the optimizer sees fit to (determined by table statistics & query). It's up to you to test if a composite index or an index with INCLUDE columns works best by reviewing the query cost.
If id is the clustered key (not always the primary key though), I probably wouldn't INCLUDE the column...
What if I have 2 parameters in my where statement?
Same as above - you need to test what works best for your query. Might be composite, or include, or separate indexes.
But keep in mind that:
tweaking for one query won't necessarily benefit every other query
indexes do slow down INSERT/UPDATE/DELETE statements, and require maintenance
You can use the Database Tuning Advisor (DTA) for index recommendations, including when some are redundant
Recommended reading
I highly recommend reading Kimberly Tripp's "The Tipping Point" for a better understanding of index decisions and impacts.
Since I do not know exactly which tasks your DB is going to perform, or how many records it holds, I would suggest that you take a look at the Index Basics MSDN article. It will help you decide for yourself which indexes to create.
If ID is your primary and/or clustered index key, just create an index on Name, Age. This will cover all three queries.
Included fields are best used to retrieve row-level values for columns that are not in the filter list, or to retrieve aggregate values where the sorted field is in the GROUP BY clause.
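A sketch of that suggested index (the name is made up):
-- supports WHERE name = ... as well as WHERE name = ... AND age > ...
CREATE NONCLUSTERED INDEX IX_table1_Name_Age
ON dbo.table1 (Name, Age);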
If inserts are rare, create as many indexes as you want.
For the first query, create an index on the Name column.
The Id column, I assume, is already the primary key...
For the second query, create an index on Name and Age. You could also keep just the one index on (Name, Age); it will not be much slower for the first query.

Why is this query faster without index?

I inherited a new system and I am trying to make some improvements on the data. I am trying to improve this table and can't seem to make sense of my findings.
I have the following table structure:
CREATE TABLE [dbo].[Calls](
[CallID] [varchar](8) NOT NULL PRIMARY KEY,
[RecvdDate] [varchar](10) NOT NULL,
[yr] [int] NOT NULL,
[Mnth] [int] NOT NULL,
[CallStatus] [varchar](50) NOT NULL,
[Category] [varchar](100) NOT NULL,
[QCall] [varchar](15) NOT NULL,
[KOUNT] [int] NOT NULL)
This table has about 220k records in it. I need to return all records that have a date greater than a specific date - in this case, 12/1/2009. This query returns about 66k records and takes about 4 seconds to run. From past systems I have worked on, this seems high, especially given how few records are in the table. So I would like to bring that time down.
So I'm wondering what some good ways to bring that down would be. I tried adding a date column to the table, converting the string date to an actual date column, and then adding an index on that date column, but the time stayed the same. Given that there aren't that many records, I can see how a table scan could be fast, but I would think an index would bring that time down.
I have also considered just querying off the month and year columns. But I haven't tried it yet. And would like to keep it off the date column if possible. But if not I can change it.
Any help is appreciated.
EDIT: Here is the query I am running to test the speed of the table. I usually list the columns explicitly, but just for simplicity I used *:
SELECT *
FROM _FirstSlaLevel_Tickets_New
WHERE TicketRecvdDateTime >= '12/01/2009'
EDIT 2: As I mentioned, I had tried to create a table with a date column that contains the RecvdDate data as a date rather than a varchar. That is what the TicketRecvdDateTime column is in the query above. The original query I am running against this table is:
SELECT *
FROM Calls
WHERE CAST(RecvdDate AS DATE) >= '12/01/2009'
You may be encountering what is referred to as the tipping point in SQL Server. Even though you have the appropriate index on the column, SQL Server may decide to do a table scan anyway if the expected number of rows returned exceeds some threshold (the 'tipping point').
In your example, this seems likely, since your query returns about a quarter of the rows in the table. The following is a good article that explains this: http://www.sqlskills.com/BLOGS/KIMBERLY/category/The-Tipping-Point.aspx
SELECT * will usually give poor performance.
Either the index will be ignored, or you'll end up with a key/bookmark lookup into the clustered index. Either way, the query can run badly.
For example, if you had the following query, and the index on TicketRecvdDateTime INCLUDEd CallStatus, then it would most likely run as expected, because the index would be covering:
SELECT CallStatus
FROM _FirstSlaLevel_Tickets_New
WHERE TicketRecvdDateTime >= '12/01/2009'
This is in addition to Randy Minder's answer: a key/bookmark lookup may be cheap enough for a handful of rows, but not for a large chunk of the table's data.
Your query is faster without an index (or, more precisely, the same speed with or without the index) because an index on RecvdDate will always be ignored in an expression like CAST(RecvdDate AS DATE) >= '12/01/2009'. This is a non-SARGable expression, as it requires the column to be transformed through a function. For the index even to be considered, you have to express your filter criteria directly on the column being indexed, not on an expression derived from it. That would be the first step.
There are more steps:
Get rid of the VARCHAR(10) column for dates and replace it with the appropriate DATE or DATETIME column. Storing date and/or time as strings is riddled with problems. Not only for indexing, but also for correctness.
A table that is frequently scanned on a range based on a column (as most such call log tables are) should be clustered by that column.
It is highly unlikely that you really need the yr and Mnth columns. If you really do need them, you probably want them as computed columns.
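If you do need them, a sketch of what the computed columns could look like - assuming RecvdDate has already been converted to a real datetime column, and the original int columns are dropped first:
-- derive yr and Mnth from the date column instead of storing them independently
ALTER TABLE dbo.Calls DROP COLUMN yr, Mnth;
ALTER TABLE dbo.Calls
ADD yr AS YEAR(RecvdDate) PERSISTED,
Mnth AS MONTH(RecvdDate) PERSISTED;
With those changes, the reworked table and query might look like this: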
CREATE TABLE [dbo].[Calls](
[CallID] [varchar](8) NOT NULL,
[RecvdDate] [datetime] NOT NULL,
[CallStatus] [varchar](50) NOT NULL,
[Category] [varchar](100) NOT NULL,
[QCall] [varchar](15) NOT NULL,
[KOUNT] [int] NOT NULL,
CONSTRAINT [PK_Calls_CallId] PRIMARY KEY NONCLUSTERED ([CallID]));
CREATE CLUSTERED INDEX cdxCalls ON Calls(RecvdDate);
SELECT *
FROM Calls
WHERE RecvdDate >= '12/01/2009';
Of course, the proper structure of the table and indexes should be the result of careful analysis, considering all factors involved, including update performance, other queries etc. I recommend you start by going through all the topics included in Designing Indexes.
Can you alter your query? If only a few columns are needed, you can alter the SELECT clause to return fewer columns. Then you can create a covering index that includes all the columns referenced, including TicketRecvdDateTime.
You might create the index on TicketRecvdDateTime, but you may not avoid the tipping point that @Randy Minder discusses. However, a scan on the smaller index (smaller than the table) would read fewer pages.
Assuming RecvdDate is the TicketRecvdDateTime you are talking about:
SQL Server only compares quoted values as dates if the field type is DATE. Your query is probably comparing them as VARCHAR. Try adding a row with '99/99/0001' and see if it shows up at the bottom.
If so, your query results are incorrect. Change the type to DATE.
Note that VARCHAR does not index well; DATETIME does.
Check the query plan to see if it's using indices. If the DB is small compared to available RAM, it may simply table scan and hold everything in memory.
EDIT: On seeing your CAST/DATETIME edit, let me point out that parsing a date from a VARCHAR is a very expensive operation, and you are doing it 220k times. That will kill performance.
Also, you are no longer filtering on an indexed field: a comparison against an expression involving an indexed field does not use the index.
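To make the SARGability point concrete - a minimal sketch against the Calls table:
-- non-SARGable: with RecvdDate stored as varchar, every row must be converted,
-- so an index on that column cannot be used for a seek
SELECT * FROM dbo.Calls WHERE CAST(RecvdDate AS DATE) >= '2009-12-01';
-- SARGable: with RecvdDate as a real datetime column, the plain comparison
-- can be answered with an index seek
SELECT * FROM dbo.Calls WHERE RecvdDate >= '2009-12-01';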