What does `INCLUDE` do in an index? - sql

What does INCLUDE do in a nonclustered index?
CREATE NONCLUSTERED INDEX [MyIndex] ON [dbo].[Individual]
(
[IndivID] ASC
)
INCLUDE ( [LastName], [FirstName])
WITH (SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
I know the first part is used for the WHERE clause, but what do the INCLUDE columns do? What's the benefit of having them "added to the leaf level of the nonclustered index"?
Edit: Also, if I already have a clustered PK index on IndivID, why does Tuning Advisor recommend this index?

INCLUDE columns include the associated fields WITH the index. They are not used FOR indexing, but they are placed in the leaf node of the B-tree that makes up the index.
In essence: The index is still ON [IndivID] and [IndivID] alone. However, if your query only needs a subset of [IndivID], [LastName] and [FirstName], SQL doesn't need to go back to the table after it's found the [IndivID] it's searching for in the Index.
SEE: Covering Index
EDIT: B-tree assumes MS SQL Server. I'm not positive other implementations use the same data structure
Tuning Advisor (Speculation):: A clustered index places the entire data row at the leaf node of the index's B-tree, and this takes up a lot of space. If the Tuning Advisor sees that you're never accessing more than those three fields ([IndivID] + INCLUDEs), it will attempt to save you space (and insert/update time) by downgrading it to a non-clustered index with the only "important" fields present.
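A minimal sketch of what "covered" means for the index in the question. It assumes [dbo].[Individual] has more columns than these three (the question doesn't show the table definition), and the value 42 is purely illustrative.

```sql
-- Covered by [MyIndex]: the key column (IndivID) and both included columns
-- sit at the index leaf level, so no trip to the base table is needed.
SELECT [IndivID], [LastName], [FirstName]
FROM [dbo].[Individual]
WHERE [IndivID] = 42;  -- illustrative value

-- Not covered by [MyIndex]: SELECT * asks for every column, so the
-- remaining columns must be read from the clustered index (or heap).
SELECT *
FROM [dbo].[Individual]
WHERE [IndivID] = 42;
```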

INCLUDE adds those fields at the leaf level of the index. Basically the B-tree is not sorted by those fields, but once the index finds the row with the indexed field(s) it's looking for, it also has the other fields immediately.
If you use the phone book analogy, the INCLUDED fields in the phone book index (which is sorted by Lastname, Firstname) would be Phone Number and Address - you can't look up a person by those fields but once you have their name you can find them.
CLUSTERED indexes have all fields included already by design, so INCLUDE is invalid in a clustered index. You also shouldn't bother INCLUDEing the clustered field in a non-clustered index since it is already implicitly there as the row key.
I most often use the INCLUDE fields for aggregation. For instance, if I have an index on CalendarDate and CustomerID I can include PaidAmt and get
MAX(PaidAmt) WHERE CustomerID = x AND CalendarDate = '1/1/2011'
At the most basic level they are used to avoid a bookmark or key lookup.
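The aggregation case could look roughly like this. The table name dbo.Payments, the index name, and the literal values are all hypothetical; only the three column names come from the answer above.

```sql
-- Hypothetical table: dbo.Payments(CustomerID int, CalendarDate date, PaidAmt money, ...)
CREATE NONCLUSTERED INDEX IX_Payments_Customer_Date   -- hypothetical name
ON dbo.Payments (CustomerID, CalendarDate)            -- key columns: used to seek
INCLUDE (PaidAmt);                                    -- payload: only read, never searched

-- The seek locates the matching index rows; PaidAmt is read straight from
-- the index leaf level, so no key lookup should be needed for the aggregate.
SELECT MAX(PaidAmt) AS MaxPaid
FROM dbo.Payments
WHERE CustomerID = 1001              -- illustrative value
  AND CalendarDate = '20110101';     -- language-neutral date literal
```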

That is data that is included as payload in the index. It won't be used to filter, but it can be returned.
If you for example have a query that filters on age and return name:
select name
from persons
where age = 42
Then you could create an index for the age field, with the name field included. That way the database could use only the index to run the entire query, and doesn't have to read anything at all from the actual table.
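In SQL Server syntax, the index described above might be declared like this (the index name is made up for illustration):

```sql
-- age is the key column (used for the WHERE filter);
-- name is carried along at the leaf level so it can be returned without a lookup.
CREATE NONCLUSTERED INDEX IX_persons_age   -- hypothetical name
ON persons (age)
INCLUDE (name);
```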

From MSDN - CREATE INDEX (Transact-SQL):
INCLUDE (column [ ,... n ] )
Specifies the non-key columns to be added to the leaf level of the nonclustered index.
Meaning you can add more columns to the nonclustered index: if you return several fields every time you query on the key column, adding them to the index will improve performance because they are stored with it (a covering index).

Related

SQL Index - are both statements going to do the same?

I was wondering if in SQL server these two statements to create a non-clustered index will have the same behavior?
create nonclustered index EmpLastname_Incl_Firstname
on employee(lastname) include (firstname);
create nonclustered index EmpLastnameFirstname
on employee(lastname, firstname)
No. The key columns are optimized for things like filtering and grouping, while the included columns are optimized for retrieval of the column only. So if a lot of your queries look like the following:
SELECT firstname, lastname
FROM mytable
WHERE lastname = 'Doe' AND firstname = 'John'
then the second index you showed would be preferred. If you only filter on lastname in your WHERE clause, as in the following query:
SELECT firstname, lastname
FROM mytable
WHERE lastname = 'Doe'
Then the first index would be preferred.
If you have a mix of both queries, keep only the second index, since the lastname-only query can also make use of it.
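As a rough sketch of how the two definitions differ in practice, here are three queries against the employee table from the question, with comments on what each index can offer. Actual plan choices depend on statistics, so treat the comments as expectations rather than guarantees.

```sql
-- Index 1: key = (lastname), firstname stored at the leaf level only
-- Index 2: key = (lastname, firstname)

-- Either index can seek on lastname and return firstname without a lookup.
SELECT firstname, lastname
FROM employee
WHERE lastname = 'Doe';

-- Index 2 can seek on both columns at once.
-- Index 1 can only seek on lastname and then check firstname on the rows it found.
SELECT firstname, lastname
FROM employee
WHERE lastname = 'Doe' AND firstname = 'John';

-- Index 2 keeps firstname sorted within each lastname, so it may avoid a sort here.
-- Index 1 gives no ordering on firstname.
SELECT firstname, lastname
FROM employee
WHERE lastname = 'Doe'
ORDER BY firstname;
```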
absolutely no
INCLUDE means that the data from the column is stored in the index but it is not part of the index sorting
Those statements will not have the same behavior. The index with the include will only allow seeks on the lastname field, while the index without the include will allow seeks on both the lastname and firstname fields. See the Microsoft documentation for indexes with included columns. This bit is especially important to your question:
Redesign nonclustered indexes with a large index key size so that only columns used for searching and lookups are key columns. Make all other columns that cover the query into nonkey columns. In this way, you will have all columns needed to cover the query, but the index key itself is small and efficient.
If you ever need to search by the firstname field, your index should have it as a key column.
Adding columns to include will store the respective data only on the leaf-node level of the b-tree (not in the tree itself).
Almost everything that can be accomplished with include can also be accomplished by putting the respective columns in the key part of the index. The exceptions are related to the length limits of the key. When in doubt, it might be best to leave it in the key columns.
That said, there are some benefits to putting a column in include rather than the key part:
the resulting index is slightly smaller (a few percent)
the tree of the index might be one level shallower
it documents what the column in that index is used for, which makes extending the index easier in the future
I find the last one the most important one.
Have a look at my recent article about this topic for a better understanding:
https://use-the-index-luke.com/blog/2019-04/include-columns-in-btree-indexes

SQL Server non-clustered index

I have two different queries in SQL Server and I want to clarify
how the execution plan would be different, and
which of them is more efficient
Queries:
SELECT *
FROM table_name
WHERE column < 2
and
SELECT column
FROM table_name
WHERE column < 2
I have a non-clustered index on column.
I used to use PostgreSQL and I am not familiar with SQL Server and this kind of index.
As I read many questions here I kept two notes:
When I have a non-clustered index, I need one more step in order to have access to data
With a non-clustered index I could have a copy of part of the table and I get a quicker response time.
So, I got confused.
One more question: when I use "SELECT *", what is the influence of a non-clustered index?
1st query :
Depending on the size of the data, you might run into lookups such as key lookups and RID lookups.
2nd query :
It will be faster because it will not fetch columns that are not part of the index, though I recommend using a covering index.
I recommend you check this blog post
The first select will use the non-clustered index to find the clustering key [clustered index exists] or page and slot [no clustered index]. Then that will be used to get the row. The query plan will be different depending on your STATS (the data).
The second query is "covered" by the non-clustered index. What that means is that the non-clustered index contains all of the data that you are selecting. The clustering key is not needed, and the clustered index and/or heap is not needed to provide data to the select list.
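One way to see the difference yourself is to compare logical reads for the two queries. The sketch below uses a made-up table dbo.demo with a column val standing in for the table_name and column placeholders from the question; load it with representative data before comparing.

```sql
-- Hypothetical stand-in for the table in the question.
CREATE TABLE dbo.demo
(
    id      int IDENTITY(1,1) PRIMARY KEY CLUSTERED,
    val     int NOT NULL,
    payload char(200) NOT NULL DEFAULT 'x'   -- extra column that is not in the index
);
CREATE NONCLUSTERED INDEX IX_demo_val ON dbo.demo (val);

SET STATISTICS IO ON;   -- reports logical reads per statement in the Messages output

-- Covered: val is the index key, so the nonclustered index alone can answer this.
SELECT val FROM dbo.demo WHERE val < 2;

-- Not covered: payload is missing from the index, so the plan needs key lookups
-- or falls back to scanning the clustered index.
SELECT * FROM dbo.demo WHERE val < 2;

SET STATISTICS IO OFF;
```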

SQL index for date range query

For a few days I've been struggling to improve the performance of my database, and there are some issues that I'm still kind of confused about regarding indexing in a SQL Server database.
I'll try to be as informative as I can.
My database currently contains about 100k rows and will keep growing, therefore I'm trying to find a way to make it work faster.
I'm also writing to this table, so if your suggestion will drastically slow down writes, please let me know.
The overall goal is to select all rows with a specific name that are in a date range.
That will usually mean selecting over 3,000 rows out of a lot lol ...
Table schema:
CREATE TABLE [dbo].[reports]
(
[id] [int] IDENTITY(1,1) NOT NULL,
[IsDuplicate] [bit] NOT NULL,
[IsNotValid] [bit] NOT NULL,
[Time] [datetime] NOT NULL,
[ShortDate] [date] NOT NULL,
[Source] [nvarchar](350) NULL,
[Email] [nvarchar](350) NULL,
CONSTRAINT [PK_dbo.reports]
PRIMARY KEY CLUSTERED ([id] ASC)
) ON [PRIMARY]
This is the SQL query I'm using:
SELECT *
FROM [db].[dbo].[reports]
WHERE Source = 'name1'
AND ShortDate BETWEEN '2017-10-13' AND '2017-10-15'
As I understood it, my best approach to improve efficiency without hurting the write time too much would be to create a nonclustered index on Source and ShortDate.
Which I did like so (index schema):
CREATE NONCLUSTERED INDEX [Source&Time]
ON [dbo].[reports]([Source] ASC, [ShortDate] ASC)
Now we get to the tricky part, which got me completely lost: the index above sometimes works, sometimes half works, and sometimes doesn't work at all...
(not sure if it matters but currently 90% of the database rows has the same Source, although this won't stay like that for long)
With the query below, the index isn't used at all, I'm using SQL Server 2014 and in the Execution Plan it says it only uses the clustered index scan:
SELECT *
FROM [db].[dbo].[reports]
WHERE Source = 'name1'
AND ShortDate BETWEEN '2017-10-10' AND '2017-10-15'
With this query, the index isn't used at all, although I'm getting a suggestion from SQL Server to create an index with the date first and source second... I read that the index should be made by the order the query is? Also, it says to include all the columns I'm selecting; is that a must? Again, I read that I should include in the index only the columns I'm searching.
SELECT *
FROM [db].[dbo].[reports]
WHERE Source = 'name1'
AND ShortDate = '2017-10-13'
SQL Server index suggestion -
/* The Query Processor estimates that implementing the following
index could improve the query cost by 86.2728%. */
/*
USE [db]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[reports] ([ShortDate], [Source])
INCLUDE ([id], [IsDuplicate], [IsNotValid], [Time], [Email])
GO
*/
Now I tried using the index SQL Server suggested and it works; the plan seems to use the nonclustered index 100% for both of the queries above.
I tried to use this index without the included columns and it doesn't work... it seems I must include in the index all the columns I'm selecting?
BTW, it also works with the index I made if I include all the columns.
To summarize: it seems the order of the index columns didn't matter, as it worked both with Source + ShortDate and with ShortDate + Source.
But for some reason it's a must to include all the columns... (which will drastically affect writes to this table?)
Thanks a lot for reading, My goal is to understand why this stuff happens and what I should do otherwise (not just the solution as I'll need to apply it on other projects as well ).
Cheers :)
Indexing in SQL Server is part know-how from long experience (and many hours of frustration), and part black magic. Don't beat yourself up over that too much - that's what a place like SO is ideal for - lots of brains, lots of experience from many hours of optimizing, that you can tap into.
I read that the index should be made by the order the query is?
If you read this - it is absolutely NOT TRUE - the order of the columns is relevant - but in a different way: a compound index (made up from multiple columns) will only ever be considered if you specify the n left-most columns in the index definition in your query.
Classic example: a phone book with an index on (city, lastname, firstname). Such an index might be used:
in a query that specifies all three columns in its WHERE clause
in a query that uses city and lastname (find all "Miller" in "Detroit")
or in a query that only filters by city
but it can NEVER EVER be used if you want to search only for firstname ..... that's the trick about compound indexes you need to be aware of. But if you always use all columns from an index, their ordering is typically not really relevant - the query optimizer will handle this for you.
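The phone book example can be sketched as follows. The table dbo.PhoneBook, its phone column, and the index name are hypothetical; the comments describe what the index is able to support, not what the optimizer is guaranteed to pick.

```sql
-- Hypothetical table: dbo.PhoneBook(city, lastname, firstname, phone, ...)
CREATE NONCLUSTERED INDEX IX_PhoneBook_City_Last_First   -- hypothetical name
ON dbo.PhoneBook (city, lastname, firstname);

-- Seek possible: all three key columns specified.
SELECT phone FROM dbo.PhoneBook
WHERE city = 'Detroit' AND lastname = 'Miller' AND firstname = 'Ann';

-- Seek possible: the two left-most key columns specified.
SELECT phone FROM dbo.PhoneBook
WHERE city = 'Detroit' AND lastname = 'Miller';

-- Seek possible: the left-most key column specified.
SELECT phone FROM dbo.PhoneBook
WHERE city = 'Detroit';

-- No seek possible: the leading column (city) is missing, so the index
-- cannot be navigated by firstname alone; at best it would be scanned.
SELECT phone FROM dbo.PhoneBook
WHERE firstname = 'Ann';
```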
As for the included columns - those are stored only in the leaf level of the nonclustered index. They are NOT part of the search structure of the index, so a filter on an included column in your WHERE clause cannot be used to seek into the index; at best it is checked against rows the seek has already found.
The main benefit of these included columns is this: if you search in a nonclustered index, and in the end, you actually find the value you're looking for - what do you have available at that point? The nonclustered index will store the columns in the non-clustered index definition (ShortDate and Source), and it will store the clustering key (if you have one - and you should!) - but nothing else.
So in this case, once a match is found, and your query wants everything from that table, SQL Server has to do what is called a Key lookup (often also referred to as a bookmark lookup) in which it takes the clustered key and then does a Seek operation against the clustered index, to get to the actual data page that contains all the values you're looking for.
If you have included columns in your index, then the leaf level page of your non-clustered index contains
the columns as defined in the nonclustered index
the clustering key column(s)
all the additional columns defined in your INCLUDE clause
If those columns "cover" your query, i.e. provide all the values that your query needs, then SQL Server is done once it finds the value you searched for in the nonclustered index - it can take all the values it needs from that leaf-level page of the nonclustered index, and it does NOT need to do another (expensive) key lookup into the clustered index to get the actual values.
Because of this, trying to always explicitly specify only those columns you really need in your SELECT can be beneficial - in this case, you might be able to create an efficient covering index that provides all the values for your SELECT - always using SELECT * makes that really hard or next to impossible.....
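Applied to the reports table from the question, a narrower select list makes a small covering index feasible. This is only a sketch: it assumes the application needs just Email and Time besides the filter columns, and the index name is made up.

```sql
-- Key columns match the WHERE clause; INCLUDE carries only what the query returns.
-- [id] is the clustering key, so it is stored in the nonclustered index anyway.
CREATE NONCLUSTERED INDEX IX_reports_Source_ShortDate_inc   -- hypothetical name
ON [dbo].[reports] ([Source], [ShortDate])
INCLUDE ([Email], [Time]);

-- Every column referenced here is available in the index above,
-- so no key lookup into the clustered index should be required.
SELECT [id], [Source], [ShortDate], [Email], [Time]
FROM [dbo].[reports]
WHERE [Source] = 'name1'
  AND [ShortDate] BETWEEN '2017-10-13' AND '2017-10-15';
```

Fewer included columns also means less data to maintain on every INSERT and UPDATE than an index that includes the whole row.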
In general, you want the index to be from most selective (i.e. filtering out the most possible records) to least selective; if a column has low cardinality, the query optimizer may ignore it.
That makes intuitive sense - if you have a phone book, and you're looking for people called "smith", with the initial "A", you want to start with searching for "smith" first, and then the "A"s, rather than all people whose initial is "A" and then filter out those called "Smith". After all, the odds are that one in 26 people have the initial "A".
So, in your example, I guess you have a wide range of values in short date - so that's the first column the query optimizer is trying to filter out. You say you have few different values in "source", so the query optimizer may decide to ignore it; in that case, the second column in that index is no use either.
The order of the predicates in the WHERE clause is irrelevant - you can swap them round and get exactly the same results, so the query optimizer ignores that order.
EDIT:
So, yes, make the index. Imagine you have a pile of cards to sort - in your first run, you want to remove as many cards as possible. Assuming it's all evenly spread - if you have 1000 separate short_dates over a million rows, that means you end up with 1000 items if your first run starts on short_date; if you sort by source, you have 100000 rows.
The included columns of an index are for the columns you are selecting.
Because you use SELECT * (which isn't good practice), the index is unlikely to be used: for every matching row, SQL Server would still have to look up the remaining columns in the table.
For your scenario, I would drop the default clustered index (if there is one) and create a new clustered index with the following statement:
USE [db]
GO
CREATE CLUSTERED INDEX CIX_reports
ON [dbo].[reports] ([ShortDate],[Source])
GO

How to create Index for this scenario in SQL Server?

What is the best index on the Item table for the following query?
select
tt.itemlookupcode,
tt.TotalQuantity,
tt.ExtendedPrice,
tt.ExtendedCost,
items.ExtendedDescription,
items.SubDescription1,
dept.Name,
categories.Name,
sup.Code,
sup.SupplierName
from
#temp_tt tt
left join HQMatajer.dbo.Item items
on items.ItemLookupCode=tt.itemlookupcode
left join HQMatajer.dbo.Department dept
ON dept.ID=items.DepartmentID
left join HQMatajer.dbo.Category categories
on categories.ID=items.CategoryID
left join HQMatajer.dbo.Supplier sup
ON sup.ID=items.SupplierID
drop table #temp_tt
I created an index like this:
CREATE NONCLUSTERED INDEX [JFC_ItemLookupCode_DepartmentID_CategoryID_SupplierID_INC_Description_SubDescriptions] ON [dbo].[Item]
(
[DBTimeStamp] ASC,
[ItemLookupCode] ASC,
[DepartmentID] ASC,
[CategoryID] ASC,
[SupplierID] ASC
)
INCLUDE (
[Description],
[SubDescription1]
)
But when I check the execution plan, it picked another index, one that has only the TimeStamp column.
What is the best index for this scenario on that particular table?
The first column in the index should be one the query filters (or joins) on; otherwise the index will not be used for a seek. In your index the first column is DBTimeStamp, which is not filtered in your query. That is why your index is not used.
Also, your covering index includes [Description] and [SubDescription1], but the query selects ExtendedDescription and items.SubDescription1; this adds the overhead of a key/RID lookup.
Try altering your index like this:
CREATE NONCLUSTERED INDEX [JFC_ItemLookupCode_DepartmentID_CategoryID_SupplierID_INC_Description_SubDescriptions] ON [dbo].[Item]
(
[ItemLookupCode] ASC,
[DepartmentID] ASC,
[CategoryID] ASC,
[SupplierID] ASC
)
INCLUDE (
[ExtendedDescription],
[SubDescription1]
)
Having said all that, the optimizer may still go for a scan or choose some other index, depending on how much data is retrieved from the Item table.
I'm not surprised your index isn't used. DBTimeStamp is likely to be highly selective, and is not referenced in your query at all.
You might have forgotten to include an ORDER BY clause in your query which was intended to reference DBTimeStamp. But even then your query would probably need to scan the entire index. So it may as well scan the actual table.
The only way to make that index 'look enticing' would be to ensure it includes all columns that are used/returned. I.e. You'd need to add ExtendedDescription. The reason this can help is that indexes typically require less storage than the full table. So it's faster to read from disk. But if you're missing columns (in your case ExtendedDescription), then the engine needs to perform an additional lookup onto the full table in any case.
I can't comment why the DBTimeStamp column is preferred - you haven't given enough detail. But perhaps it's the CLUSTERED index?
Your index would be almost certain to be used if defined as:
CREATE NONCLUSTERED INDEX [IX_Item_ItemLookupCode] -- hypothetical name
ON [dbo].[Item]
(
[ItemLookupCode] ASC --The only thing you're actually filtering by
)
INCLUDE (
/* Moving the rest to include is most efficient for the index tree.
And by including ALL used columns, there's no need to perform
extra lookups to the full table.
*/
[DepartmentID],
[CategoryID],
[SupplierID],
[ExtendedDescription],
[SubDescription1]
)
Note however, that this kind of indexing strategy 'Find the best for each query used' is unsustainable.
You're better off finding 'narrower' indexes that are appropriate for multiple queries.
Every index slows down INSERT and UPDATE queries.
And indexes like this are impacted by more columns than the preferred 'narrower' indexes.
Index choice should focus on the selectivity of columns. I.e. Given a specific value or small range of values, what percentage of data is likely to be selected based on your queries?
In your case, I'd expect ItemLookupCode to be unique per item in the Items table. In other words indexing by that without any includes should be sufficient. However, since you're joining to a temp table that theoretically could include all item codes: in some cases it might be better to scan the CLUSTERED INDEX in any case.

How are the table data stored when it has a clustered index

I have found umpteen posts which begin like: "Quite a lot of time I have come across people saying 'Clustered Index Physically sorts the data inside the table based on the Clustered Index Keys'. It's not true!" Then such posts go on to describe how it is actually stored, via linked lists or whatever. For example, this post says that
Each Index row contains a Key value and a pointer to either an
Intermediate level page in the B-tree, or a Data row in the Leaf level
of the Index. The Pages in each level of the Index are linked in a
Doubly-linked list. The Pages in the Data chain and the rows in them
are ordered on the value of the Clustered Index key.
That brings me to my question, the data pages are the place where the table data are stored, right? So if they are sorted and the data within them also are sorted according to the indexed column value, why is it wrong to say that the clustered index keeps the table data in sorted order? Here is a pic from Kalen Delaney's book, which shows that the leaf pages in a table with CI are all sorted according to the CI value:
You're right.
Clustered indexes do not physically sort the data inside the table based on the Clustered Index Keys. If that was the case then the inserts into the middle of a large table with no free space would require huge amounts of IO to make room for the new record.
Instead a new page is allocated from anywhere in the file and linked into the linked list.
The degree to which the physical order of pages differs from the logical order is the extent of logical fragmentation. Rebuilding or reorganizing the index can reduce this.
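To check how far physical and logical order have drifted apart, something like the following could be used. The table name dbo.MyTable is a placeholder; sys.dm_db_index_physical_stats and ALTER INDEX are standard SQL Server features.

```sql
-- Report logical fragmentation for every index on a (hypothetical) table.
SELECT i.name AS index_name,
       ps.index_type_desc,
       ps.avg_fragmentation_in_percent,   -- share of pages out of logical order
       ps.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.MyTable'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id;

-- Lighter-weight option: reorders the leaf pages in place.
ALTER INDEX ALL ON dbo.MyTable REORGANIZE;

-- Heavier option: rebuilds the indexes from scratch.
-- ALTER INDEX ALL ON dbo.MyTable REBUILD;
```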
When you create an index, an index table is also created (I think it's called the Index Allocation Map (IAM), but I'm not sure about the name).
In case of clustered index, the index table contains the index column, and pointer to the actual records.
So When a table has a clustered index, data may not be physically sorted on the table..
The data in the disk will be maintained as a linked list and clustered index is a pointer to that data.
Now the index table will be sorted physically... not the actual table..and the index table is maintained as a B-Tree, so that searching would be faster.
Now when you create a non-clustered index, it will point to the clustered index table
Edit: (as marc_s pointed out) Leaf node of clustered index actually contains data, where as in non clustered index contains pointers..
But still I don't believe, it will reorder the data in the disk, it will just reorder the pointers
Clustered indexes order the table data by the columns of the index. Each new row will be positioned in the right spot of the table when inserted or updated.
This doesn't happen with nonclustered indexes.
My original statement here is WRONG
Because any index does NOT affect the data in the table at all. Clustered index is just another type of index pointing to data in the table. It does not change the order or does anything else to the data.
You can always fetch data directly from the table with row numbers before and after you create (clustered or unclustered) index.
End of Original statement
Correction is required (I don't use MSSQL very often, so never had a chance to test this before)
It seems that MSSQL implements clustered index as not really an index at all, but probably closer to trigger/constraint pair.
From my crude test right now:
1)
CREATE TABLE testTable ...
INSERT ... (few rows)
SELECT * FROM testTable
This shows ALL results in insertion order
2)
CREATE CLUSTERED INDEX ... ON testTable (...);
INSERT ... (few rows)
SELECT * FROM testTable
This shows ALL results ordered by fields in CLUSTERED INDEX
3)
DROP INDEX (CLUSTERED INDEX Name) ON testTable;
INSERT ... (few rows)
SELECT * FROM testTable
This shows ALL results from step 2) [before DROP INDEX] in the same order and rows inserted later [in step 3)] in insertion order again.
To me it means that MSSQL DOES re-order the actual data records (most likely at great cost on insert/delete).
So, I stand corrected and rebuked. In all honesty I never expected this (CLUSTERED INDEX behaviour, not me being proved wrong) to be the case.
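A concrete version of the crude test might look like this; the table, column, and index names are invented for illustration. One caveat when reading the results: without ORDER BY, SQL Server does not guarantee any particular row order, so what an unordered SELECT returns reflects the access path the plan happened to use, not a promise about storage.

```sql
CREATE TABLE dbo.testTable (val int NOT NULL);           -- hypothetical table (a heap at first)
INSERT INTO dbo.testTable (val) VALUES (3), (1), (2);

-- Building the clustered index rebuilds the table as the leaf level of a B-tree on val.
CREATE CLUSTERED INDEX CIX_testTable_val ON dbo.testTable (val);
INSERT INTO dbo.testTable (val) VALUES (0);

-- No ORDER BY: rows often come back in clustered key order here,
-- but that is a side effect of the scan, not a guarantee.
SELECT val FROM dbo.testTable;

-- The only way to guarantee the order of the result set.
SELECT val FROM dbo.testTable ORDER BY val;

-- Dropping the clustered index turns the table back into a heap.
DROP INDEX CIX_testTable_val ON dbo.testTable;
```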