What is the proper index to make SELECT queries more efficient? - sql

I have a table similar you can see below
Table Keywords
Column ID
Column Keyword
Column Keyword2
the first query is
select keyword from keywords with (nolock) where keyword = keyword
another query for the same tabel is
select keyword2 from with (nolock) keywords where keyword2 Like 'keyword%'
My question is what index type to set on which columns int this table
to make a select process more sufficient? Is it should be clustered index or non-clustered? and on which columns I need to set it?
This table contains about 600k rows and it constantly growing.
Another question I'm getting a dead-lock error when I trying to insert a new record to Keywords table. What can be the problem? I'm selecting records with nolock.
Thank you

Since your two queries are on totally separate columns, you will need two separate non-clustered indices:
one index on keyword to speed up the first query
a second index on keyword2 to speed up the second query
And assuming you're using SQL Server: neither of them really makes a good clustered index, I would say - but a good clustered index would be really beneficial!
A good clustered index should be:
unique
small
stable (never changing)
ever-increasing
Your best bet would be on an INT IDENTITY field. See Kimberly Tripp's outstanding blog post Ever-increasing clustering key - the Clustered Index Debate..........again! for more detailed background on the requirements for a good clustering key.

If we are really seeing the only use cases, you want a clustered key on keyword2 and then hope your DBMS is smart enough to optimize index use with LIKE operator. Clustering helps when the returned rows from a typical query are adjacent in the DB, so keeping the table in alphabetical order on keyword2 will mean fewer pages have to be scanned on the SELECT. Clustering a table where access is pretty much random (e.g., user names) won't give you any more than a standard index.

Related

SQL Server non-clustered index

I have two different queries in SQL Server and I want to clarify
how the execution plan would be different, and
which of them is more efficient
Queries:
SELECT *
FROM table_name
WHERE column < 2
and
SELECT column
FROM table_name
WHERE column < 2
I have a non-clustered index on column.
I used to use Postgresql and I am not familiar with SQL Server and these kind of indexes.
As I read many questions here I kept two notes:
When I have a non-clustered index, I need one more step in order to have access to data
With a non-clustered index I could have a copy of part of the table and I get a quicker response time.
So, I got confused.
One more question is that when I have "SELECT *" which is the influence of a non-clustered index?
1st query :
Depending on the size of the data you might face lookup issues such as Key lookup and RID lookups .
2nd query :
It will be faster because it will not fetch columns that are not part of the index , though i recommend using covering index ..
I recommend you check this blog post
The first select will use the non-clustered index to find the clustering key [clustered index exists] or page and slot [no clustered index]. Then that will be used to get the row. The query plan will be different depending on your STATS (the data).
The second query is "covered" by the non-clustered index. What that means is that the non-clustered index contains all of the data that you are selecting. The clustering key is not needed, and the clustered index and/or heap is not needed to provide data to the select list.

SQL index for date range query

For a few days, I've been struggling with improving the performance of my database and there are some issues that I'm still kind a confused about regarding indexing in a SQL Server database.
I'll try to be as informative as I can.
My database currently contains about 100k rows and will keep growing, therfore I'm trying to find a way to make it work faster.
I'm also writing to this table, so if you suggestion will drastically reduce the writing time please let me know.
Overall goal is to select all rows with a specific names that are in a date range.
That will usually be to select over 3,000 rows out of a lot lol ...
Table schema:
CREATE TABLE [dbo].[reports]
(
[id] [int] IDENTITY(1,1) NOT NULL,
[IsDuplicate] [bit] NOT NULL,
[IsNotValid] [bit] NOT NULL,
[Time] [datetime] NOT NULL,
[ShortDate] [date] NOT NULL,
[Source] [nvarchar](350) NULL,
[Email] [nvarchar](350) NULL,
CONSTRAINT [PK_dbo.reports]
PRIMARY KEY CLUSTERED ([id] ASC)
) ON [PRIMARY]
This is the SQL query I'm using:
SELECT *
FROM [db].[dbo].[reports]
WHERE Source = 'name1'
AND ShortDate BETWEEN '2017-10-13' AND '2017-10-15'
As I understood, my best approach to improve efficency without hurting the writing time as much would be to create a nonclustered index on the Source and ShortDate.
Which I did like such, index schema:
CREATE NONCLUSTERED INDEX [Source&Time]
ON [dbo].[reports]([Source] ASC, [ShortDate] ASC)
Now we are getting to the tricky part which got me completely lost, the index above sometimes works, sometime half works and sometime doesn't work at all....
(not sure if it matters but currently 90% of the database rows has the same Source, although this won't stay like that for long)
With the query below, the index isn't used at all, I'm using SQL Server 2014 and in the Execution Plan it says it only uses the clustered index scan:
SELECT *
FROM [db].[dbo].[reports]
WHERE Source = 'name1'
AND ShortDate BETWEEN '2017-10-10' AND '2017-10-15'
With this query, the index isn't used at all, although I'm getting a suggestion from SQL Server to create an index with the date first and source second... I read that the index should be made by the order the query is? Also it says to include all the columns Im selecting, is that a must?... again I read that I should include in the index only the columns I'm searching.
SELECT *
FROM [db].[dbo].[reports]
WHERE Source = 'name1'
AND ShortDate = '2017-10-13'
SQL Server index suggestion -
/* The Query Processor estimates that implementing the following
index could improve the query cost by 86.2728%. */
/*
USE [db]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[reports] ([ShortDate], [Source])
INCLUDE ([id], [IsDuplicate], [IsNotValid], [Time], [Email])
GO
*/
Now I tried using the index SQL Server suggested me to make and it works, seems like it uses 100% of the nonclustered index using both the queries above.
I tried to use this index but deleting the included columns and it doesn't work... seems like I must include in the index all the columns I'm selecting?
BTW it also work when using the index I made if I include all the columns.
To summarize: seems like the order of the index didn't matter, as it worked both when creating Source + ShortDate and ShortDate + Source
But for some reason its a must to include all the columns... (which will drastically affect the writing to this table?)
Thanks a lot for reading, My goal is to understand why this stuff happens and what I should do otherwise (not just the solution as I'll need to apply it on other projects as well ).
Cheers :)
Indexing in SQL Server is part know-how from long experience (and many hours of frustration), and part black magic. Don't beat yourself up over that too much - that's what a place like SO is ideal for - lots of brains, lots of experience from many hours of optimizing, that you can tap into.
I read that the index should be made by the order the query is?
If you read this - it is absolutely NOT TRUE - the order of the columns is relevant - but in a different way: a compound index (made up from multiple columns) will only ever be considered if you specify the n left-most columns in the index definition in your query.
Classic example: a phone book with an index on (city, lastname, firstname). Such an index might be used:
in a query that specifies all three columns in its WHERE clause
in a query that uses city and lastname (find all "Miller" in "Detroit")
or in a query that only filters by city
but it can NEVER EVER be used if you want to search only for firstname ..... that's the trick about compound indexes you need to be aware of. But if you always use all columns from an index, their ordering is typically not really relevant - the query optimizer will handle this for you.
As for the included columns - those are stored only in the leaf level of the nonclustered index - they are NOT part of the search structure of the index, and you cannot specify filter values for those included columns in your WHERE clause.
The main benefit of these included columns is this: if you search in a nonclustered index, and in the end, you actually find the value you're looking for - what do you have available at that point? The nonclustered index will store the columns in the non-clustered index definition (ShortDate and Source), and it will store the clustering key (if you have one - and you should!) - but nothing else.
So in this case, once a match is found, and your query wants everything from that table, SQL Server has to do what is called a Key lookup (often also referred to as a bookmark lookup) in which it takes the clustered key and then does a Seek operation against the clustered index, to get to the actual data page that contains all the values you're looking for.
If you have included columns in your index, then the leaf level page of your non-clustered index contains
the columns as defined in the nonclustered index
the clustering key column(s)
all those additional columns as defined in your INCLUDE statement
If those columns "cover" your query, e.g. provide all the values that your query needs, then SQL Server is done once it finds the value you searched for in the nonclustered index - it can take all the values it needs from that leaf-level page of the nonclustered index, and it does NOT need to do another (expensive) key lookup into the clustering index to get the actual values.
Because of this, trying to always explicitly specify only those columns you really need in your SELECT can be beneficial - in this case, you might be able to create an efficient covering index that provides all the values for your SELECT - always using SELECT * makes that really hard or next to impossible.....
In general, you want the index to be from most selective (i.e. filtering out the most possible records) to least selective; if a column has low cardinality, the query optimizer may ignore it.
That makes intuitive sense - if you have a phone book, and you're looking for people called "smith", with the initial "A", you want to start with searching for "smith" first, and then the "A"s, rather than all people whose initial is "A" and then filter out those called "Smith". After all, the odds are that one in 26 people have the initial "A".
So, in your example, I guess you have a wide range of values in short date - so that's the first column the query optimizer is trying to filter out. You say you have few different values in "source", so the query optimizer may decide to ignore it; in that case, the second column in that index is no use either.
The order of where clauses in the index is irrelevant - you can swap them round and achieve the exact same results, so the query optimizer ignores them.
EDIT:
So, yes, make the index. Imagine you have a pile of cards to sort - in your first run, you want to remove as many cards as possible. Assuming it's all evenly spread - if you have 1000 separate short_dates over a million rows, that means you end up with 1000 items if your first run starts on short_date; if you sort by source, you have 100000 rows.
The included columns of an index is for the columns you are selecting.
Due to the fact that you do select * (which isn't good practice), the index won't be used, because it would have to lookup the whole table to get the values for the columns.
For your scenario, I would drop the default clustered index (if there is one) and create a new clustered index with the following statement:
USE [db]
GO
CREATE CLUSTERED INDEX CIX_reports
ON [dbo].[reports] ([ShortDate],[Source])
GO

Why am I getting a Clustered Index Scan when the column is indexed?

So, we have a table, InventoryListItems, that has several columns. Because we're going to be looking for rows at times based on a particlar column (g_list_id, a foreign key), we have that foreign key column placed into a non-clustered index we'll call MYINDEX.
So when I search for data like this:
-- fake data for example
DECLARE #ListId uniqueidentifier
SELECT #ListId = '7BCD0E9F-28D9-4F40-BD67-803005179B04'
SELECT *
FROM [dbo].[InventoryListItems]
WHERE [g_list_id] = #ListId
I expected that it would use the MYINDEX index to find just the needed rows, and then look up the information in those rows. So not as good as just finding everything we need in the index itself, but still a big win over doing a full scan of the table.
But instead it seems that I'm still getting a clustered index scan. I can't figure out why that would happen.
If I do something like SELECTing only the values in the included columns of the index, it does what I would expect, an index seek, and just pulls everything from the index.
But if I SELECT *, why does it just bail on the index and do a scan when it seems like it would still benefit greatly from using it because it's referenced in the WHERE clause?
Since you're doing a SELECT * and thus you retrieve all columns, SQL Server's query optimizer may have decided it's easier and more efficient to just do a clustered index scan - since it needs to go to the clustered index leaf level to get all the columns anyway (and doing a seek first, and then a key lookup to actually get the whole data page, is quite an expensive operation - scan might just be more efficient in this setup).
I'm almost sure if you try
SELECT g_list_id
FROM [dbo].[InventoryListItems]
WHERE [g_list_id] = #ListId
then there will be an index seek instead (since you're only retrieving a single column - not everything).
That's one of the reasons why I would recommend to be extra careful when using SELECT * .... - try to avoid it if ever possible.

is count(indexed column) faster than count(*)? [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Performance of COUNT SQL function
Hi all,
I've very large tables and I need to know number of records in each , My question is does it reduce the run time if I run :
select count(indexed column like my PK) from tbTest
instead of
select count(*) from tbTest
see Performance of COUNT SQL function
The important thing to note is they are not equivalent
Since the question is whether or not there is a performance difference, it would depend on the index. When you do COUNT(*), it will use the PK column(s) to determine the number of rows. If you do not have any indexes besides a clustered index on the PK column(s), it will scan the leaf nodes on the clustered index. That's probably a lot of pages. If you have a non clustered index that is skinnier than the clustered index, it will choose that instead, resulting in less reads.
So, if the column you select is contained in the smallest possible non-clustered index on the table, the SQL query optimizer will choose that for both count() (if you have a clustered ix that is the PK) and count(indexed_column). If you choose a count(indexed_col) that is only contained in a wide index, then the count() will be faster if your PK is a clustered index. The reason this works is that there is a pointer to the clustered index in all non-clustered indexes and SQL Server can figure out the number of rows based on that non-clustered index.
So, as usual in SQL Server, it depends. Do a showplan and compare the queries to each other.
SELECT COUNT(*) may be faster. That is because using * gives the optimizer liberty to choose any column to count on. Say you have a primary key on a INT column, and a non clustered key on a different bigint column. But the primary key is likely the clustered index, and as such it is in fact significantly larger than the nonclustered bigint index (has more pages). So if the optimizer is free to choose the bigint non-clustered index, it can return the response faster. Possible much faster, depending on the table.
So overall is always better to leave it as COUNT(*) and let the optimizer choose.
most likely, if the query scans the index instead of the whole table.
it is an easy thing to test, become your own scientist.
Both are identical. If you look at the query execution plan for both, both will do an "index scan"

Seek & Scan in SQL Server

After googling i came to know that Index seek is better than scan.
How can I write the query that will yield to seek instead of scan. I am trying to find this in google but as of now no luck.
Any simple example with explanation will be appreciated.
Thanks
Search by the primary key column(s)
Search by column(s) with index(es) on them
An index is a data structure that improves the speed of data retrieval operations on a database table. Most dbs automatically create an index when a primary key is defined for a table. SQL Server creates an index for primary key (composite or otherwise) as a "clustered index", but it doesn't have to be the primary key - it can be other columns.
NOTE:
LIKE '%'+ criteria +'%' will not use an index; LIKE criteria +'%' will
Related reading:
SQL SERVER – Index Seek vs. Index Scan
Index
Which is better: Bookmark/Key Lookup or Index Scan
Extending rexem's feedback:
The clustered index idea for pkeys isn't arbitrary. It's simply a default to make the pkey clustered. And clustered means that values will be physically placed near each other on a Sql Server 8k page thus assuming that if you fetch one value by pkey, you will probably be interested in its neighbors. i don't think it's a good idea to do that for pkeys since they're usually unique but arbitrary identifiers. Better to cluster on more useful data. One clustered index per table btw.
In a nutshell: If you can filter your query on a clustered index column (that makes sense) then all the better.
An index seek is when SQL Server can use a binary search to quickly find the row. The rows in an index are sorted in a particular order, and your query has to specify enough information in the WHERE clause to allow SQL Server to make use of the sorted index.
An index scan is when SQL Server cannot use the sort order of the index, but can still use the index itself. This makes sense if the table rows are very large, but the index is relatively small. SQL Server will only have to read the smaller index from disk.
As a simple example, take a phonebook table:
id int identity primary key
lastname varchar(50)
phonenumber varchar(15)
Say that there is an index on (lastname). Then this query will result in an index seek:
select * from phonebook where lastname = 'leno'
This query will result in an index scan:
select * from phonebook where lastname like '%no'
The analogy with a real life phonebook is that you can't look up people whose name ends in 'no'. You have to browse the entire phonebook.