SQL Index - are both statements going to do the same? - sql

I was wondering whether, in SQL Server, these two statements to create a nonclustered index will have the same behavior?
create nonclustered index EmpLastname_Incl_Firstname
on employee(lastname) include (firstname);
create nonclustered index EmpLastnameFirstname
on employee(lastname, firstname)

No. The key columns are optimized for things like filtering and grouping, while the included columns are optimized for retrieval of the column only. So if a lot of your queries look like the following:
SELECT firstname, lastname
FROM mytable
WHERE lastname = 'Doe' AND firstname = 'John'
then the second index you showed would be preferred. If you only filter on lastname in your WHERE clause, as in the following query:
SELECT firstname, lastname
FROM mytable
WHERE lastname = 'Doe'
then the first index would be preferred.
If you have a mix of both queries, keep only the second index, since the second query is also able to make use of it.

Absolutely not.
INCLUDE means that the data from the column is stored in the index, but it is not part of the index's sort order.
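One way to see the effect of this is an ORDER BY on both columns. The sketch below reuses the two index names from the question; the SELECT statements and the index hints are assumptions added only to force each index for comparison, so check the actual execution plans on your own data.

```sql
-- Key columns (lastname, firstname): leaf rows are ordered by lastname,
-- then firstname, so this ORDER BY can be served in index order.
SELECT lastname, firstname
FROM employee WITH (INDEX (EmpLastnameFirstname))
ORDER BY lastname, firstname;

-- Key column (lastname) with firstname only INCLUDEd: rows with the same
-- lastname are not ordered by firstname, so the plan will typically
-- need an extra Sort operator to satisfy the same ORDER BY.
SELECT lastname, firstname
FROM employee WITH (INDEX (EmpLastname_Incl_Firstname))
ORDER BY lastname, firstname;
```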

Those statements will not have the same behavior. The index with the include will only allow index seeks on the lastname field, while the index without the include will allow index seeks on both the lastname and firstname fields. See the Microsoft documentation for indexes with included columns. This bit is especially important to your question:
Redesign nonclustered indexes with a large index key size so that only columns used for searching and lookups are key columns. Make all other columns that cover the query into nonkey columns. In this way, you will have all columns needed to cover the query, but the index key itself is small and efficient.
If you ever need to search by the firstname field, your index should have it as a key column.

Adding columns to include will store the respective data only on the leaf-node level of the b-tree (not in the tree itself).
Almost everything that can be accomplished with include can also be accomplished by putting the respective columns in the key part of the index. The exceptions are related to the length limits of the key. When in doubt, it might be best to leave the column in the key.
That said, there are some benefits to putting a column in include rather than in the key part:
the resulting index is slightly smaller (a few percent)
the tree of the index might be one level shallower
it documents what that column of the index is used for, which makes extending the index easier in the future
I find the last one the most important.
Have a look at my recent article about this topic for a better understanding:
https://use-the-index-luke.com/blog/2019-04/include-columns-in-btree-indexes

Related

SQL index for date range query

For a few days, I've been struggling to improve the performance of my database, and there are some issues regarding indexing in a SQL Server database that I'm still kind of confused about.
I'll try to be as informative as I can.
My database currently contains about 100k rows and will keep growing, therefore I'm trying to find a way to make it work faster.
I'm also writing to this table, so if your suggestion will drastically slow down writes, please let me know.
The overall goal is to select all rows with a specific name that fall within a date range.
That will usually mean selecting over 3,000 rows out of the whole table.
Table schema:
CREATE TABLE [dbo].[reports]
(
[id] [int] IDENTITY(1,1) NOT NULL,
[IsDuplicate] [bit] NOT NULL,
[IsNotValid] [bit] NOT NULL,
[Time] [datetime] NOT NULL,
[ShortDate] [date] NOT NULL,
[Source] [nvarchar](350) NULL,
[Email] [nvarchar](350) NULL,
CONSTRAINT [PK_dbo.reports]
PRIMARY KEY CLUSTERED ([id] ASC)
) ON [PRIMARY]
This is the SQL query I'm using:
SELECT *
FROM [db].[dbo].[reports]
WHERE Source = 'name1'
AND ShortDate BETWEEN '2017-10-13' AND '2017-10-15'
As I understood it, my best approach to improve efficiency without hurting the writing time too much would be to create a nonclustered index on Source and ShortDate.
I did so as follows (index definition):
CREATE NONCLUSTERED INDEX [Source&Time]
ON [dbo].[reports]([Source] ASC, [ShortDate] ASC)
Now we get to the tricky part, which got me completely lost: the index above sometimes works, sometimes half works and sometimes doesn't work at all...
(Not sure if it matters, but currently 90% of the rows have the same Source, although this won't stay like that for long.)
With the query below, the index isn't used at all. I'm using SQL Server 2014, and the execution plan shows only a clustered index scan:
SELECT *
FROM [db].[dbo].[reports]
WHERE Source = 'name1'
AND ShortDate BETWEEN '2017-10-10' AND '2017-10-15'
With this query, the index isn't used at all, although I'm getting a suggestion from SQL Server to create an index with the date first and the source second... I read that the index columns should follow the order in which the query lists them? It also says to include all the columns I'm selecting; is that a must? Again, I read that I should put in the index only the columns I'm searching on.
SELECT *
FROM [db].[dbo].[reports]
WHERE Source = 'name1'
AND ShortDate = '2017-10-13'
SQL Server index suggestion -
/* The Query Processor estimates that implementing the following
index could improve the query cost by 86.2728%. */
/*
USE [db]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[reports] ([ShortDate], [Source])
INCLUDE ([id], [IsDuplicate], [IsNotValid], [Time], [Email])
GO
*/
Now I tried the index SQL Server suggested and it works; it seems to use the nonclustered index for 100% of the cost with both of the queries above.
I tried the same index without the included columns and it doesn't work... it seems I must include in the index all the columns I'm selecting?
BTW, the index I made also works if I include all the columns.
To summarize: it seems the column order of the index didn't matter, as it worked both with Source + ShortDate and with ShortDate + Source.
But for some reason it's a must to include all the columns... (which will drastically affect writes to this table?)
Thanks a lot for reading. My goal is to understand why this happens and what I should do instead (not just the solution, as I'll need to apply it to other projects as well).
Cheers :)
Indexing in SQL Server is part know-how from long experience (and many hours of frustration), and part black magic. Don't beat yourself up over that too much - that's what a place like SO is ideal for - lots of brains, lots of experience from many hours of optimizing, that you can tap into.
I read that the index should be made by the order the query is?
If you read this - it is absolutely NOT TRUE - the order of the columns is relevant - but in a different way: a compound index (made up from multiple columns) will only ever be considered if you specify the n left-most columns in the index definition in your query.
Classic example: a phone book with an index on (city, lastname, firstname). Such an index might be used:
in a query that specifies all three columns in its WHERE clause
in a query that uses city and lastname (find all "Miller" in "Detroit")
or in a query that only filters by city
but it can NEVER EVER be used if you want to search only for firstname ..... that's the trick about compound indexes you need to be aware of. But if you always use all columns from an index, their ordering is typically not really relevant - the query optimizer will handle this for you.
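The left-most-columns rule can be sketched with the phone book example. The table and index names below are hypothetical and used for illustration only. Note that in the last query SQL Server cannot seek; at best it can scan the whole index.

```sql
-- Hypothetical table and index, named for illustration only.
CREATE TABLE dbo.PhoneBook (
    City        nvarchar(50) NOT NULL,
    LastName    nvarchar(50) NOT NULL,
    FirstName   nvarchar(50) NOT NULL,
    PhoneNumber nvarchar(50) NOT NULL
);

CREATE NONCLUSTERED INDEX IX_PhoneBook_City_LastName_FirstName
    ON dbo.PhoneBook (City, LastName, FirstName);

-- Can seek: each query filters on a left-most prefix of the key.
SELECT PhoneNumber FROM dbo.PhoneBook
WHERE City = N'Detroit';

SELECT PhoneNumber FROM dbo.PhoneBook
WHERE City = N'Detroit' AND LastName = N'Miller';

SELECT PhoneNumber FROM dbo.PhoneBook
WHERE City = N'Detroit' AND LastName = N'Miller' AND FirstName = N'John';

-- Cannot seek: there is no predicate on City, the leading key column.
SELECT PhoneNumber FROM dbo.PhoneBook
WHERE FirstName = N'John';
```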
As for the included columns - those are stored only in the leaf level of the nonclustered index - they are NOT part of the search structure of the index, so a filter on an included column in your WHERE clause cannot be used to seek into the index; it can only be checked against the rows that the key columns have already located.
The main benefit of these included columns is this: if you search in a nonclustered index, and in the end, you actually find the value you're looking for - what do you have available at that point? The nonclustered index will store the columns in the non-clustered index definition (ShortDate and Source), and it will store the clustering key (if you have one - and you should!) - but nothing else.
So in this case, once a match is found, and your query wants everything from that table, SQL Server has to do what is called a Key lookup (often also referred to as a bookmark lookup) in which it takes the clustered key and then does a Seek operation against the clustered index, to get to the actual data page that contains all the values you're looking for.
If you have included columns in your index, then the leaf level page of your non-clustered index contains
the columns as defined in the nonclustered index
the clustering key column(s)
all those additional columns as defined in your INCLUDE clause
If those columns "cover" your query, i.e. provide all the values that your query needs, then SQL Server is done once it finds the value you searched for in the nonclustered index - it can take all the values it needs from that leaf-level page of the nonclustered index, and it does NOT need to do another (expensive) key lookup into the clustered index to get the actual values.
Because of this, trying to always explicitly specify only those columns you really need in your SELECT can be beneficial - in this case, you might be able to create an efficient covering index that provides all the values for your SELECT - always using SELECT * makes that really hard or next to impossible.....
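As a sketch of that idea for the reports table from the question: suppose the query only needs id, Source, ShortDate, Email and Time. That column list and the index name are assumptions made for illustration. A narrower covering index could then look like this:

```sql
-- Assumed: the query needs only these columns, not SELECT *.
-- [id] is the clustering key, so it is already carried in the index
-- and does not need to be listed in INCLUDE.
CREATE NONCLUSTERED INDEX IX_reports_Source_ShortDate_covering
    ON [dbo].[reports] ([Source] ASC, [ShortDate] ASC)
    INCLUDE ([Email], [Time]);

-- Every referenced column is available in the index leaf level,
-- so no key lookup into the clustered index is required.
SELECT [id], [Source], [ShortDate], [Email], [Time]
FROM [dbo].[reports]
WHERE [Source] = N'name1'
  AND [ShortDate] BETWEEN '2017-10-13' AND '2017-10-15';
```

Fewer included columns mean less data to maintain on every write than including every column of the table.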
In general, you want the index to be from most selective (i.e. filtering out the most possible records) to least selective; if a column has low cardinality, the query optimizer may ignore it.
That makes intuitive sense - if you have a phone book, and you're looking for people called "Smith", with the initial "A", you want to start with searching for "Smith" first, and then the "A"s, rather than all people whose initial is "A" and then filter down to those called "Smith". After all, the odds are that one in 26 people have the initial "A".
So, in your example, I guess you have a wide range of values in ShortDate - so that's the first column the query optimizer wants to filter on. You say you have few different values in Source, so the query optimizer may decide to ignore it; in that case, the second column in that index is of no use either.
The order of the predicates in the WHERE clause is irrelevant - you can swap them around and get the exact same results, so the query optimizer ignores their order.
EDIT:
So, yes, make the index. Imagine you have a pile of cards to sort - in your first run, you want to remove as many cards as possible. Assuming it's all evenly spread - if you have 1000 separate short_dates over a million rows, that means you end up with 1000 items if your first run starts on short_date; if you sort by source, you have 100000 rows.
The included columns of an index are for the columns you are selecting.
Because you do select * (which isn't good practice), the index may not be used, because SQL Server would have to look up every matching row in the table to get the values for the remaining columns.
For your scenario, I would drop the existing clustered index (your primary key, if it is clustered) and create a new clustered index with the following statement:
USE [db]
GO
CREATE CLUSTERED INDEX CIX_reports
ON [dbo].[reports] ([ShortDate],[Source])
GO
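Because the table in the question already has a clustered primary key, the clustered index has to be moved rather than simply added. A possible sequence is sketched here, assuming the constraint name [PK_dbo.reports] from the question's schema and assuming no foreign keys reference that primary key:

```sql
-- 1. Drop the existing clustered primary key constraint.
ALTER TABLE [dbo].[reports]
    DROP CONSTRAINT [PK_dbo.reports];

-- 2. Create the new clustered index on the search columns.
CREATE CLUSTERED INDEX CIX_reports
    ON [dbo].[reports] ([ShortDate], [Source]);

-- 3. Re-add the primary key, this time as a nonclustered index.
ALTER TABLE [dbo].[reports]
    ADD CONSTRAINT [PK_dbo.reports] PRIMARY KEY NONCLUSTERED ([id]);
```

Rebuilding the clustered index rewrites the whole table, so on a large table this is best tried on a copy first.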

In a nonclustered index, how are the second, third, fourth ... columns sorted?

I have this question about SQL Server indexes that has been bugging me of late.
Imagine a table like this:
CREATE TABLE TelephoneBook (
FirstName nvarchar(50),
LastName nvarchar(50),
PhoneNumber nvarchar(50)
)
with an index like this:
CREATE NONCLUSTERED INDEX IX_LastName ON TelephoneBook (
LastName,
FirstName,
PhoneNumber
)
and imagine that this table has hundreds of thousands of rows.
Let's say I want to select everyone whose last name starts with a B and the firstname is 'John'. I would write the following query:
SELECT
*
FROM TelephoneBook
WHERE LastName like 'B%'
AND FirstName='John'
Since the index can help to reduce the number of rows we need to scan because it groups all of the LastNames that start with a B anyway, does it also do this for the FirstName? Or does the database scan every row that starts with a B to find the ones with the first name 'John'?
In other words, how are the second, third, fourth, ... columns sorted in an index? Are they alphabetical in this case as well, so it's pretty easy to find Johanna? Or are they in some sort of a random or different order?
EDIT: why I ask, is because I have just read that in the above SELECT statement, the index will only be used to narrow down the search to the records where the lastname starts with a B, but that the index will NOT be used to find all of the rows with Johanna in it (and will resort to scanning all of the 'B' rows). And I'm wondering why that is? What am I not getting?
As a convenient shorthand, the keys of an index are used for the where clause up to the first inequality. like with a wildcard is considered an inequality.
So, the index will only be used for looking up the first value. However, the entries will probably be scanned to match on the first name, so you will still get index usage.
Of course, the optimizer may decide not to use the index at all, if it decides that a full-table scan is more appropriate.
Gordon's answer is correct in this instance with the specified query. In general, you should be aware that it's not so much grouping records together in "buckets" based on the values of the columns, but rather ordering them according to the index's key columns. In other words, your records in this index will be ordered according to LastName, and for records that share the same LastName value they will be further ordered by FirstName value, and then by PhoneNumber value. You didn't specify a sort order for your columns on this index, but SQL Server defaults unspecified sort orders to ASC (ascending), so those columns are indeed lexically sorted in the index.
In your particular case, the query optimizer has decided to look at the index for the first column to determine which records to grab, as Gordon's answer mentions, but SQL Server will reorder predicates if the optimizer decides that would be better, and may use more columns of the index or none at all, depending on the query itself and statistics on the records you are querying.
Logically speaking, the index is sorted by key values in the order of the key. So in this case, LastName (sorted as text), FirstName (sorted as text) and then PhoneNumber (sorted as text)... Any included columns are not sorted at all.
In your case, we know that trailing wildcards are still SARGable, so we'd expect to see an index seek narrowing the data down to all rows with LastNames starting with "B"; from that pool, the rows are further filtered to keep only those that have FirstName = 'John'. You can think of it as an index seek on LastName followed by a filter on FirstName within that range.
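If this query shape is common, one option is an index that puts the equality column first and the range column second, so both predicates can take part in the seek. This is a sketch only; the index name and the choice to INCLUDE PhoneNumber are assumptions.

```sql
-- Assumed alternative index: equality column first, range column second.
CREATE NONCLUSTERED INDEX IX_FirstName_LastName
    ON TelephoneBook (FirstName, LastName)
    INCLUDE (PhoneNumber);

-- The seek can go to FirstName = 'John' and, within those entries,
-- read only the contiguous range of LastName values starting with 'B'.
SELECT FirstName, LastName, PhoneNumber
FROM TelephoneBook
WHERE FirstName = N'John'
  AND LastName LIKE N'B%';
```

With the original (LastName, FirstName, PhoneNumber) key, all entries for 'B%' last names are read and FirstName is checked on each one.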

SQL Server index included columns

I need help understanding how to create indexes. I have a table that looks like this
Id
Name
Age
Location
Education
PhoneNumber
My query looks like this:
SELECT *
FROM table1
WHERE name = 'sam'
What's the correct way to create an index for this with included columns?
What if the query has an ORDER BY clause?
SELECT *
FROM table1
WHERE name = 'sam'
ORDER BY id DESC
What if I have 2 parameters in my where statement?
SELECT *
FROM table1
WHERE name = 'sam'
AND age > 12
The correct way to create an index with included columns? Either via Management Studio/Toad/etc, or SQL (documentation):
CREATE INDEX idx_table1_name ON table1 (name) INCLUDE (id)
What if the Query has an ORDER BY
The ORDER BY can use indexes, if the optimizer sees fit to (determined by table statistics & query). It's up to you to test if a composite index or an index with INCLUDE columns works best by reviewing the query cost.
If id is the clustered key (not always the primary key though), I probably wouldn't INCLUDE the column...
What if I have 2 parameters in my where statement?
Same as above - you need to test what works best for your query. Might be composite, or include, or separate indexes.
But keep in mind that:
tweaking for one query won't necessarily benefit every other query
indexes do slow down INSERT/UPDATE/DELETE statements, and require maintenance
You can use the Database Engine Tuning Advisor (DTA) for index recommendations, including when some are redundant
Recommended reading
I highly recommend reading Kimberly Tripp's "The Tipping Point" for a better understanding of index decisions and impacts.
Since I do not know exactly which tasks your DB is going to perform or how many records it holds, I would suggest that you take a look at the Index Basics MSDN article. It will allow you to decide for yourself which indexes to create.
If ID is your primary and/or clustered index key, just create an index on Name, Age. This will support the filtering in all three queries.
Included fields are best used to retrieve row-level values for columns that are not in the filter list, or to retrieve aggregate values where the sorted field is in the GROUP BY clause.
If inserts are rare, create as many indexes as you want.
For the first query, create an index on the name column.
The Id column, I think, is already the primary key...
Create a 2nd index on name and age. You can also keep only one index, on 'name, age', and it will not be much slower for the 1st query.
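A minimal sketch of the single-index variant follows, using the column names from the question's table listing. The index name is an assumption, and the INCLUDE list is only worthwhile if the queries really need those columns; it makes the index nearly as wide as the table.

```sql
-- Key on (Name, Age): supports WHERE name = ... and
-- WHERE name = ... AND age > ... as seeks.
-- The INCLUDE list is optional; without it, SELECT * needs a lookup
-- per matching row to fetch the remaining columns.
CREATE NONCLUSTERED INDEX IX_table1_Name_Age
    ON table1 (Name, Age)
    INCLUDE (Location, Education, PhoneNumber);

SELECT *
FROM table1
WHERE name = 'sam'
  AND age > 12;
```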

Composite database indexes

I'm looking for confirmation of my understanding of composite indexes in databases - specifically in relation to SQL Server 2008 R2, if that makes a difference.
I think I understand that the order of the columns of the index is crucial in that if I have an index of { [Name], [Date] }, then a SELECT based on a WHERE clause based on [Date] won't be able to use the index, but an index of { [Date], [Name] } would. If the SELECT is based on both columns, either index would be usable.
Is that right? What are the benefits of using a composite index like this, over two indexes on each column (i.e. { [Date] }, and { [Name] }).
Thanks!
Not quite: a selection on date could still use the index, but not as effectively as a query that also includes name, since name would limit how much of the index has to be searched.
If you often have queries on name + date, and on date and on name separately, use 3 indexes, one for each combination.
Also, putting the most varied field first in an index limits the amount of the index to be searched sooner, making it faster.
You can also have included columns: data that's not indexed but that is often fetched based on the index.
That is correct.
A composite index is useful when the combined selectivity of the composite columns prunes the result set effectively.
If you add 'INCLUDED' columns to an index (composite or non-composite), you can create a 'covering' index to cover a query (or queries), which is desirable as it removes the need to perform a second lookup to obtain those columns (from the clustered index).
The choice of two single column indexes OR a composite index of the combined columns is determined by the total query workload against that table.
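The two options can be put side by side as follows. The table dbo.Events and all index names are hypothetical, chosen only to match the { [Name], [Date] } example in the question.

```sql
-- Hypothetical table for illustration.
CREATE TABLE dbo.Events (
    Id     int IDENTITY(1,1) PRIMARY KEY,
    [Name] nvarchar(100) NOT NULL,
    [Date] date NOT NULL
);

-- Option A: one composite index. Seeks on [Name], or on [Name] + [Date];
-- a filter on [Date] alone can only scan it.
CREATE NONCLUSTERED INDEX IX_Events_Name_Date
    ON dbo.Events ([Name], [Date]);

-- Option B: two single-column indexes. Each column can be sought on its
-- own, but a query on both must pick one index or combine the two.
CREATE NONCLUSTERED INDEX IX_Events_Name ON dbo.Events ([Name]);
CREATE NONCLUSTERED INDEX IX_Events_Date ON dbo.Events ([Date]);

-- A query on both columns: Option A answers it with a single seek.
SELECT Id
FROM dbo.Events
WHERE [Name] = N'sam'
  AND [Date] = '2017-10-13';
```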

When to use composite indexes?

What are the general rules in regards to using composite indexes? When should you use them, and when should you avoid them?
Composite indexes are useful when your SELECT queries use those columns frequently as criteria in your WHERE clauses. It improves retrieval speed. You should avoid them if they are not necessary.
This article provides some really good information.
A query that selects only a few fields can run completely on an index. For example, if you have an index on (OrderId) this query would require a table lookup:
select Status from Orders where OrderId = 42
But if you add a composite index on (OrderId,Status) the engine can retrieve all information it needs from the index.
A sort on multiple columns can benefit from a composite index. For example, an index on (LastName, FirstName) would benefit this query:
select * from Orders order by LastName, FirstName
Sometimes you have a unique constraint on multiple columns. Say for example that you restart order numbers every day. Then OrderNumber is not unique, but (OrderNumber, OrderDayOfYear) is. You can enforce that with a unique composite index.
I'm sure there are many more uses for a composite index, just listing a few examples.
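The unique composite index mentioned above could be sketched like this. The table definition is a hypothetical minimal one; only the OrderNumber and OrderDayOfYear columns come from the text, and the index name is an assumption.

```sql
-- Hypothetical minimal table; only the two columns from the text matter.
CREATE TABLE Orders (
    OrderId        int IDENTITY(1,1) PRIMARY KEY,
    OrderNumber    int NOT NULL,
    OrderDayOfYear smallint NOT NULL
);

-- Uniqueness is enforced on the pair, not on either column alone.
CREATE UNIQUE NONCLUSTERED INDEX UX_Orders_OrderNumber_OrderDayOfYear
    ON Orders (OrderNumber, OrderDayOfYear);

INSERT INTO Orders (OrderNumber, OrderDayOfYear) VALUES (1, 100); -- succeeds
INSERT INTO Orders (OrderNumber, OrderDayOfYear) VALUES (1, 101); -- succeeds: same number, different day
INSERT INTO Orders (OrderNumber, OrderDayOfYear) VALUES (1, 100); -- fails: duplicate key in the unique index
```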