Defining four indexes versus defining one index on the search fields - indexing

I have an employee table that has the following columns:
employeeID (PK)
First_name
Last_name
Third_name
Age
Sex
Telephone
Dateofbirth
Position
And I have a search functionality to find an employee using any of these fields:
First_name
Last_name
Telephone
Dateofbirth
And the user can specify any combinations of these fields to be included for the search.
Now, to improve the search, what is the best way to create indexes? Should I:
create four separate indexes, one on each of the four search fields?
OR
create a single index that contains all four columns together?
OR
is there a third, better solution?
BR

The second option (one index on all four columns) is not very likely to be useful unless you know that users always include one column, almost always include a second column, etc. Any query that doesn't look at the first column in an index isn't going to make use of that index, at least for the purpose of an index seek. Such an index still might be used to retrieve the data in those cases, if no other columns are included in the select list.
Whether all four indexes are necessary, I'm not sure. Is it really likely that a search for first name will occur without last name? How often do you think users will search for John? I'd let those searches scan, and combine the two name columns into a single index (LastName, FirstName). I'd also be surprised if you had many searches for birth date. Is it really common practice that employees will know (or can easily find out) their co-workers' age / birthday?
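If you go that route, a minimal sketch might look like this (assuming SQL Server and the column names above; the index names are made up):
CREATE NONCLUSTERED INDEX IX_Employee_LastFirst
    ON dbo.Employee (Last_name, First_name); -- serves last-name and last+first searches
CREATE NONCLUSTERED INDEX IX_Employee_Telephone
    ON dbo.Employee (Telephone);
CREATE NONCLUSTERED INDEX IX_Employee_Dateofbirth
    ON dbo.Employee (Dateofbirth);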

There's not really an ideal solution using traditional indices. If you think about how it works, an index is really like sorting the table on a specific field. So if you only have one index with all four columns, it can only really use the first one anyway. If you use four indices, then SQL Server can search against each one in sequence.
However, the above is moot if you're searching against a wildcard pattern:
COLUMN LIKE '%searchstring%'
That search cannot use an index; it will have to examine the contents of every row.
All of which suggests that you should consider a full-text index, which is meant for exactly this kind of scenario.
http://msdn.microsoft.com/en-us/library/ms142571.aspx
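A minimal sketch of setting one up, assuming SQL Server, the employee table above, and a unique index named PK_Employee on the table (the catalog name is made up):
CREATE FULLTEXT CATALOG EmployeeCatalog;
CREATE FULLTEXT INDEX ON dbo.Employee (First_name, Last_name)
    KEY INDEX PK_Employee
    ON EmployeeCatalog;
-- CONTAINS can then match words and prefixes without a leading-wildcard scan:
SELECT employeeID, First_name, Last_name
FROM dbo.Employee
WHERE CONTAINS((First_name, Last_name), N'"smi*"');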

SQL Indexes and Multicolumn Database search

I have to implement a search that finds a substring in the name of a user.
A user has FirstName and LastName in 2 columns. It is good enough to do WHERE FirstName LIKE '%searchText%' OR LastName LIKE '%searchText%'.
The problem I want to solve is performance. Let's say that currently I expect 1000 users tops. I do not want the search to take ages, so I thought of indexes (those columns will not change much; I expect that their values will almost never change). Since I will be searching on both columns, I assume I need a multi-column index.
Is this the correct way of doing this?
Or is it better to use SQL full-text search for this (please provide a good link)?
Would it be better to create a view where FirstName and LastName are concatenated and to search there?
Or is it better to just use e.g. Azure Search? (Currently, this is the only entity I would need to search for, and it may be the last.)
I am using .NET 4.6 and EntityFramework hosted on Azure web apps.
Thanks
Because of the leading wildcard character, an index will not be used. An index will only be used if the wildcard is in the middle or at the end of the string. Adding one index on LastName, FirstName can be a good suggestion as well if your table is very wide, like David Browne said.
If you truly need to search with both wildcards (i.e. a partial match), then I would look at the amount of data in your table. If it's just a few thousand rows we're talking about, a table scan will still perform fine. If we're talking about something like 50,000 rows or more, then a full-text index would be best.
This seems like a good tutorial on the matter: https://learn.microsoft.com/en-us/sql/relational-databases/search/get-started-with-full-text-search
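As a rough sketch of that narrow composite index (assuming SQL Server and a hypothetical dbo.Users table):
CREATE NONCLUSTERED INDEX IX_Users_LastName_FirstName
    ON dbo.Users (LastName, FirstName);
-- The leading wildcard still forces a scan, but because this query is covered
-- by the two-column index, scanning it is far cheaper than scanning a wide table:
SELECT LastName, FirstName
FROM dbo.Users
WHERE FirstName LIKE '%searchText%' OR LastName LIKE '%searchText%';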

How to search millions of records in a SQL table faster?

I have a SQL table with millions of domain names. But now when I search for, let's say,
SELECT *
FROM tblDomainResults
WHERE domainName LIKE '%lifeis%'
It takes more than 10 minutes to get the results. I tried indexing but that didn't help.
What is the best way to store these millions of records and easily access the information in a short period of time?
There are about 50 million records and 5 columns so far.
Most likely, you tried a traditional index which cannot be used to optimize LIKE queries unless the pattern begins with a fixed string (e.g. 'lifeis%').
What you need for your query is a full-text index. Most DBMS support it these days.
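For example, in MySQL syntax (a sketch only; note that full-text search matches whole tokens, so a prefix term like 'lifeis*' works, but an arbitrary substring in the middle of a token still may not match):
ALTER TABLE tblDomainResults ADD FULLTEXT INDEX ft_domain (domainName);
SELECT *
FROM tblDomainResults
WHERE MATCH(domainName) AGAINST ('lifeis*' IN BOOLEAN MODE);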
Assuming that your 50 million row table includes duplicates (perhaps that is part of the problem), and assuming SQL Server (the syntax may change but the concept is similar on most RDBMSes), another option is to store domains in a lookup table, e.g.
CREATE TABLE dbo.Domains
(
DomainID INT IDENTITY(1,1) PRIMARY KEY,
DomainName VARCHAR(255) NOT NULL
);
CREATE UNIQUE INDEX dn ON dbo.Domains(DomainName);
When you load new data, check if any of the domain names are new - and insert those into the Domains table. Then in your big table, you just include the DomainID. Not only will this keep your 50 million row table much smaller, it will also make lookups like this much more efficient.
SELECT * -- please specify column names
FROM dbo.tblDomainResults AS dr
INNER JOIN dbo.Domains AS d
ON dr.DomainID = d.DomainID
WHERE d.DomainName LIKE '%lifeis%';
Of course except on the tiniest of tables, it will always help to avoid LIKE clauses with a leading wildcard.
Full-text indexing is the far-and-away best option here - how this is accomplished will depend on the DBMS you're using.
Short of that, ensuring that you have an index on the column being matched with the pattern will help performance, but by the sounds of it, you've tried this and it didn't help a great deal.
Stop using the LIKE statement. You could use full-text search, but it will require a MyISAM table and isn't all that good a solution.
I would recommend that you examine available third-party solutions, like Lucene and Sphinx. They will be superior.
One thing you might want to consider is having a separate search engine for such lookups. For example, you can use a Solr (Lucene) server to search on and retrieve the IDs of entries that match your search, then retrieve the data from the database by ID. Even having to make two different calls, it's very likely it will wind up being faster.
Indexes are slowed down whenever they have to go look up ("bookmark lookup") data that the index itself doesn't contain. For instance, if your index has 2 columns, ID and NAME, but you're selecting * (which is 5 columns total), the database has to read the index for the first two columns, then go look up the other 3 columns in a less efficient data structure somewhere else.
In this case, your index can't be used because of the LIKE. This is similar to not putting any WHERE filter on the query: since the engine has to read the whole table anyway, it will skip the index altogether and do just that (a "table scan"). There is a threshold (I think around 35-50% of the rows) where the engine normally flips over to this.
In short, it seems unlikely that you need all 50 million rows from the DB for a production application, but if you do... use a machine with more memory and try methods that keep that data in memory. Maybe a NoSQL DB would be a better option - MongoDB, CouchDB, Tokyo Cabinet, things like this. Good luck!
You could try breaking up the domain into chunks and then searching the chunks themselves. I did something like that years ago when I needed to search for words in sentences. I did not have full-text searching available, so I broke the sentences up into a word list and searched the words. It was really fast to find results since the words were indexed.
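A hypothetical sketch of that idea for domain names, in SQL Server syntax (all names are made up; one way to populate the chunk table is to store every suffix of each domain name):
CREATE TABLE dbo.DomainChunks
(
    DomainID INT NOT NULL,          -- points back to the domains table
    Chunk    VARCHAR(255) NOT NULL  -- e.g. 'lifeisgood', 'ifeisgood', 'feisgood', ...
);
CREATE INDEX IX_DomainChunks_Chunk ON dbo.DomainChunks(Chunk);
-- Because every suffix is stored, an infix search becomes a seekable prefix search:
SELECT DISTINCT c.DomainID
FROM dbo.DomainChunks AS c
WHERE c.Chunk LIKE 'lifeis%'; -- trailing wildcard only, so the index can seek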

Why can't I simply add an index that includes all columns?

I have a table in a SQL Server database which I want to be able to search and retrieve data from as fast as possible. I don't care how long it takes to insert into the table; I am only interested in the speed at which I can get data.
The problem is the table is accessed with 20 or more different types of queries. This makes it a tedious task to add an index specially designed for each query. I'm considering instead simply adding an index that includes ALL columns of the table. It's not something you would normally do in "good" database design, so I'm assuming there is some good reason why I shouldn't do it.
Can anyone tell me why I shouldn't do this?
UPDATE: I forgot to mention, I also don't care about the size of my database. It's OK if this means my database will grow larger than it needs to.
First of all, an index in SQL Server can only have at most 900 bytes in its index entry. That alone makes it impossible to have an index with all columns.
Most of all: such an index makes no sense at all. What are you trying to achieve?
Consider this: if you have an index on (LastName, FirstName, Street, City), that index will not be able to be used to speed up queries on
FirstName alone
City
Street
That index would be useful for searches on
(LastName), or
(LastName, FirstName), or
(LastName, FirstName, Street), or
(LastName, FirstName, Street, City)
but really nothing else - certainly not if you search for just Street or just City!
The order of the columns in your index makes quite a difference, and the query optimizer can't just use any column somewhere in the middle of an index for lookups.
Consider your phone book: it's probably ordered by LastName, FirstName, maybe Street. So does that indexing help you find all the "Joes" in your city? All the people living on "Main Street"? No - you can only look up by LastName first; then you get more specific inside that set of data. Just having an index over everything doesn't speed up searching on all columns at all.
If you want to be able to search by Street - you need to add a separate index on (Street) (and possibly another column or two that make sense).
If you want to be able to search by Occupation or whatever else - you need another specific index for that.
Just because your column exists in an index doesn't mean that'll speed up all searches for that column!
The main rule is: use as few indices as possible - too many indices can be even worse for a system than having no indices at all. Build your system, monitor its performance, and find those queries that cost the most - then optimize these, e.g. by adding indices.
Don't just blindly index every column just because you can - this is a guarantee for lousy system performance - any index also requires maintenance and upkeep, so the more indices you have, the more your INSERT, UPDATE and DELETE operations will suffer (get slower) since all those indices need to be updated.
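To make the column-order point concrete, a minimal sketch (the table and index names are hypothetical):
CREATE INDEX IX_People_Name
    ON dbo.People (LastName, FirstName, Street, City);
-- Can seek: the leading column (LastName) is constrained.
SELECT * FROM dbo.People WHERE LastName = 'Smith' AND FirstName = 'Joe';
-- Cannot seek on this index: the leading column is not constrained,
-- so a separate index on (Street) would be needed.
SELECT * FROM dbo.People WHERE Street = 'Main Street';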
You have a fundamental misunderstanding of how indexes work.
Read this explanation "how multi-column indexes work".
The next question you might have is why not create one index per column - but that's also a dead end if you are trying to reach top SELECT performance.
You might feel that it is a tedious task, but I would say it's a required task to index carefully. Sloppy indexing strikes back, as in this example.
Note: I am strongly convinced that proper indexing pays off, and I know that many people have the very same questions you have. That's why I'm writing a free book about it. The links above refer to the pages that might help you answer your question. However, you might also want to read it from the beginning.
...if you add an index that contains all columns, and a query were actually able to use that index, it would scan it in the order of the primary key, which means hitting nearly every record. The average search time would be on the order of n/2 rows, the same as scanning the actual table.
You need to read a bit more about indexes.
It might help if you consider an index on a table to be a bit like a Dictionary in C#.
var nameIndex = new Dictionary<String, List<int>>();
That means that the name column is indexed, and will return a list of primary keys.
var nameOccupationIndex = new Dictionary<String, Dictionary<String, List<int>>>();
That means that the name column + occupation column are indexed together. Now imagine an index containing 10 different columns, nested so deep that it effectively contains every single row in your table.
This isn't exactly how it works, mind you. But it should give you an idea of how indexes could work if implemented in C#. What you need to do is create indexes based on one or two keys that are queried on extensively, so that the index is more useful than scanning the entire table.
If this is a data-warehouse-type operation, where queries are highly optimized for reads, and you have 20 ways of dissecting the data, e.g.
WHERE clause involves..
Q1: status, type, customer
Q2: price, customer, band
Q3: sale_month, band, type, status
Q4: customer
etc
and you absolutely have plenty of fast storage space to burn, then by all means create an index on EVERY single column, separately, as sketched below. So a 20-column table will have 20 indexes, one for each individual column. I could probably say to ignore bit columns or low-cardinality columns, but since we're going this far, why bother with that admonition. They will just sit there and churn the WRITE time, but if you don't care about that part of the picture, then we're all good.
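A sketch of that per-column strategy (the table name is hypothetical; the columns come from the queries above):
CREATE INDEX IX_Sales_status     ON dbo.Sales (status);
CREATE INDEX IX_Sales_type       ON dbo.Sales (type);
CREATE INDEX IX_Sales_customer   ON dbo.Sales (customer);
CREATE INDEX IX_Sales_price      ON dbo.Sales (price);
CREATE INDEX IX_Sales_band       ON dbo.Sales (band);
CREATE INDEX IX_Sales_sale_month ON dbo.Sales (sale_month);
-- ...and so on, one index per remaining column.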
Analyze your 20 queries, and if you have hot queries (the hottest ones) that still won't go any faster, get an estimated plan in SSMS (press Ctrl-L) with one query in the query window. It will tell you what index can help that query - just create it; create them all, fully remembering that this again adds to the write cost, backup file size, db maintenance time, etc.
I think the questioner is asking
'why can't I make an index like':
create index index_name
on table_name
(
*
)
The problems with that have been addressed.
But it sounds like they are using MS SQL Server.
It's useful to understand that you can include nonkey columns in an index, so that the values of those columns are available for retrieval from the index but not used as selection criteria:
create index index_name
on table_name
(
foreign_key
)
include (a,b,c,d) -- every column except foreign key
I created two tables with a million identical rows
I indexed table A like this
create nonclustered index index_name_A
on A
(
foreign_key -- this is a guid
)
and table B like this
create nonclustered index index_name_B
on B
(
foreign_key -- this is a guid
)
include (id,a,b,c,d) -- every column except foreign_key
No surprise: table A was slightly faster to insert into.
But when I ran these queries:
select * from A where foreign_key = @guid
select * from B where foreign_key = @guid
On table A, SQL Server didn't even use the index; it did a table scan, and complained about a missing index including id,a,b,c,d.
On table B, the query was over 50 times faster, with much less IO.
Forcing the query on A to use the index didn't make it any faster:
select * from A where foreign_key = @guid
select * from A with (index(index_name_A)) where foreign_key = @guid
I'm considering instead simply adding an index that includes ALL columns of the table.
This is always a bad idea. Indexes in a database are not some sort of pixie dust that works magically. You have to analyze your queries and add indexes according to what is being queried and how.
It is not as simple as "add everything to an index and have a nap".
I see only long and complicated answers here so I thought I should give the simplest answer possible.
You cannot add an entire table, or all its columns, to an index because that just duplicates the table.
In simple terms, an index is just another table with selected data ordered in the order you normally expect to query it in, and a pointer to the row on disk where the rest of the data lives.
So, a level of indirection exists. You have a partial copy of the table in a preordered manner (both on disk and in RAM, assuming the index is not fragmented), which is faster to query for the columns defined in the index only, while the rest of the columns can be fetched without having to scan the disk for them, because the index contains a reference to the correct position on disk where the rest of each row's data lives.
1) Size: an index essentially builds a copy of the data in that column in some easily searchable structure, like a binary tree (I don't know the SQL Server specifics).
2) You mentioned speed: index structures are slower to add to.
That index would just be identical to your table (possibly sorted in another order).
It won't speed up your queries.

For a char/varchar/text column, why will an index for that column make it faster to search?

If it's an int, I know it will be faster; I just cannot understand why for string types.
Notes:
Most Asian languages don't have spaces between words, and MySQL cannot split a sentence into words. Also, I mean random search, that is, the words can appear anywhere in a sentence.
One big point is that an index won't help at all for certain kinds of searches. For example:
SELECT * FROM [MyTable] WHERE [MyVarcharColumn] LIKE '%' + @SearchText + '%'
No amount of normal indexing will help that query. It's forever doomed to be slow. That LIKE expression is just not sargable.
Why? You first need to understand how indexes work. They basically copy the columns being indexed, along with the primary key (record pointer), into a new table. They then sort that table on the indexed column rather than the key. When you do a lookup using the index, it can very quickly find the row(s) you want because this index is sorted to facilitate a more efficient search, using algorithms like binary search and others.
Now look at that query again. By placing a wildcard in front of the search text, you've just told the database that you don't know for sure what your column starts with. No amount of sorting will help; you still need to go through the entire table to be sure you find every record that matches the expression. And that means any normal index on the column is worthless for this query.
If you want to search a text column for a search string anywhere in the column, you need to use something a little different: a full-text index.
Now for contrast look at this query:
SELECT * FROM [MyTable] WHERE [MyVarcharColumn] LIKE @SearchText + '%'
This will work perfectly fine with a normal index, because you know how you expect the column to start. It can still match up with the sorted values stored in the index, and so we can say that it is sargable.
An index is sorted, a table is not. Therefore, when you're searching on an index, it's got a clue as to where to find the string, even if there's a different value for each row in the table.
Moreover, indices are (generally) smaller than the table, so scanning a column in the table means going over every row. An index seek is just finding the right spot in the index, selecting that, grabbing the pointer to the clustered index, and away you go to get the rest of the row.
The simplest answer is another couple of questions:
Why is finding a person by his/her last name very quick in the telephone book?
Why is finding a person by his/her first name not quick in the telephone book?
The index is essentially like an index in a book, where every word (depending on the book) that appears in the book is placed in the index, with the page numbers where that word appears. The index is alphabetically sorted, so it's quick to find the word in the index. If you didn't have the index in a book, the only way to find every instance of a specific word is to read through the entire book, noting where that word appears.
A table is just the same. If you search for a record that has a specific column value, and you don't have an index, the only thing the database can do is iterate through the entire table to find any match.
A phone book is indexed on last name. Can you imagine how slow looking up a number would be if it wasn't?
The index is essentially a 2-column table, with the indexed field in sorted order alongside the PK lookup. So for a string, it has the strings in sorted order. A search can then be done using a binary search instead of a table scan, which is going to be way faster for almost any length of table.
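To make that concrete, a minimal sketch (the names are hypothetical):
-- Conceptually, this creates a hidden, sorted two-column structure of (LastName, row pointer):
CREATE INDEX IX_People_LastName ON dbo.People (LastName);
-- The engine binary-searches the sorted index for 'Smith',
-- then follows the row pointer to fetch the remaining columns:
SELECT * FROM dbo.People WHERE LastName = 'Smith';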

RDBMS: when to use complex indexes for queries and when to use simple ones?

Suppose I have a table in my DB schema called TEST with fields (id, name, address, phone, comments). Now, I know that I'm going to perform a large set of different queries against that table, so my question is this: when and why should I create an index like ID_NAME_INDX (an index on id and name), and when is it more efficient to create a separate index on id and another on name (by "when" I mean for what type of query)?
The general aim would be to "cover" all columns so the query only has to use the index.
-- An index on Name including ID would be ideal
SELECT
[id]
FROM
TEST
WHERE
[name] = 'bob'
Say you need name and id but have separate indexes. You'll end up with a bookmark lookup from the index to the PK to get the other columns (assuming it doesn't just scan the PK).
Edit, after 1st comment:
select * from test where id='id1' and name='Name1'
For this query, the SELECT * works against any covering index, so the PK would be used.
If you had:
select address from test where id='id1' and name='Name1'
then an index on ID, name including address would "cover" it.
Using "OR" creates difficulties for any strategy. However,
select address from test where id='id1' or name='Name1'
would still most likely use the "id, name including address" index, but scan it rather than seek.
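A minimal sketch of that covering index, in SQL Server syntax:
CREATE NONCLUSTERED INDEX IX_TEST_id_name
    ON TEST (id, name)
    INCLUDE (address);
-- Covered: the query can be answered from the index alone, with no bookmark lookup.
SELECT address FROM TEST WHERE id = 'id1' AND name = 'Name1';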
Read this: Execution Plan Basics
I'm not sure your example explains the actual question you're asking. You're asking whether you should have an index on ID and an index on Name, as opposed to one index on both ID and Name. The thing is, I guess that ID is your primary key, so you're not likely to search on ID AND Name.
However, for a table with two IDs where you would want to search on either one or on both together, having three indexes - one on each of the IDs and one combined - will be the fastest. If you have only the two single-column indexes, then to find a record matching both, both indexes will need to be searched. If you have one index covering both IDs, then only that index needs to be searched.
As with all indexes though, as you add them, your database increases in size and you will see a reduction in insert/update performance. You always need to weigh the gains against the losses.
Add indexes to the absolutely obvious candidates, add indexes to the "maybe" ones as the need arises. Continue to monitor your database performance and run query analysers to see where any performance gains can be made over time.
Most database software includes some sort of tool to debug your queries. These tools can usually tell you which indexes the server considered and which it ended up using. This functionality is usually called "explain" or something similar.
Usually you should create indexes on columns that are used in the WHERE clause or in joins.
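For instance, in MySQL syntax (PostgreSQL also uses EXPLAIN; SQL Server shows plans in SSMS instead):
EXPLAIN SELECT id, name FROM TEST WHERE name = 'bob';
-- The possible_keys and key columns of the output show which indexes
-- were considered and which one the server actually chose.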