When should you use full-text indexing? - sql

We have a whole bunch of queries that "search" for clients, customers, etc. You can search by first name, email, etc. We're using LIKE statements in the following manner:
SELECT *
FROM customer
WHERE fname LIKE '%someName%'
Does full-text indexing help in this scenario? We're using SQL Server 2005.

It will depend on your DBMS. I believe that most systems will not take advantage of the full-text index unless you use the full-text functions (e.g. MATCH/AGAINST in MySQL or FREETEXT/CONTAINS in MS SQL).
Here are two good articles on when, why, and how to use full-text indexing in SQL Server:
How To Use SQL Server Full-Text Searching
Solving Complex SQL Problems with Full-Text Indexing

FTS can help in this scenario; the question is whether it is worth it.
To begin with, let's look at why LIKE may not be the most effective search. When you use LIKE, especially with a % at the beginning of the pattern, SQL Server has to scan every row of the table and check the column byte by byte.
FTS has better algorithms for matching data, as well as better statistics on variations of names. Therefore FTS can provide better performance for matching Smith, Smythe, Smithers, etc. when you look for Smith.
It is, however, a bit more complex to use FTS, as you'll need to master CONTAINS vs FREETEXT and the arcane format of the search. However, if you want to do a search where either FName or LName match, you can do that with one statement instead of an OR.
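As a rough sketch of the "one statement instead of an OR" point, assuming a full-text index already covers both columns (the lname column name is an assumption; the question only shows fname):

```sql
-- Assumes a full-text index on dbo.customer that includes both fname and lname.

-- LIKE version: two predicates joined by OR, each with a leading wildcard.
SELECT fname, lname
FROM dbo.customer
WHERE fname LIKE '%Smith%'
   OR lname LIKE '%Smith%';

-- Full-text version: one CONTAINS predicate over a column list.
SELECT fname, lname
FROM dbo.customer
WHERE CONTAINS((fname, lname), 'Smith');
```

The two queries are not strictly equivalent: CONTAINS matches whole words (or word prefixes), while LIKE '%Smith%' matches the characters anywhere in the value.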
To determine if FTS is going to be effective, determine how much data you have. I use FTS on a database of several hundred million rows and that's a real benefit over searching with LIKE, but I don't use it on every table.
If your table is more modest in size, say fewer than a few million rows, you can get similar speed by creating an index on each column that you're going to search; SQL Server should then perform an index scan rather than a table scan.

According to my test scenario:
SQL Server 2008
10,000,000 rows, each with a string like "wordA wordB wordC..." (varies between 1 and 30 words)
selecting count(*) with CONTAINS(column, 'wordB')
result size several hundred thousand rows
catalog size approx 1.8GB
The full-text query took about 2 seconds, whereas LIKE '% wordB %' took 1-2 minutes.
But this holds only if you don't use any additional selection criteria! E.g. if I additionally used a LIKE 'prefix%' on a primary key column, the performance was worse, since going into the full-text index costs more than doing a string search in a few fields (as long as there are not too many of them).
So I would recommend a full-text index only in cases where you have to do a "free string search" or need some of its special features.

To answer the question specifically for MSSQL, full-text indexing will NOT help in your scenario.
In order to improve that query you could do one of the following:
1. Configure a full-text catalog on the column and use the CONTAINS() function.
2. If you are primarily searching with a prefix (i.e. matching from the start of the name), you could change the predicate to the following and create an index over the column.
where fname like 'prefix%'
(1) is probably overkill for this, unless the performance of the query is a big problem.
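For reference, a minimal sketch of what option (1) might look like in T-SQL. The catalog name and the key index name are hypothetical; substitute whatever unique index your customer table actually has.

```sql
-- Assumes dbo.customer has a unique, single-column, non-nullable index
-- named PK_customer (hypothetical name); a full-text index requires one.
CREATE FULLTEXT CATALOG CustomerCatalog;   -- hypothetical catalog name

CREATE FULLTEXT INDEX ON dbo.customer (fname)
    KEY INDEX PK_customer
    ON CustomerCatalog;

-- Whole-word match.
SELECT fname
FROM dbo.customer
WHERE CONTAINS(fname, 'someName');

-- Prefix match on word starts; note the double quotes inside the literal.
SELECT fname
FROM dbo.customer
WHERE CONTAINS(fname, '"some*"');
```

The index is populated in the background, so queries issued immediately after creating it may return nothing until population finishes. Also note that the prefix form matches the start of a word, not an arbitrary substring, so it is not a drop-in replacement for LIKE '%someName%'.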

Related

Why I would bother using full text search?

I am new to Full Text Search. I used the following query:
Select * From Students Where FullName LIKE '%abc%'
The Students table contains a million records, all random, that look like this:
'QZAQHIEK VABCNLRM KFFZJYUU'
It took only 2 seconds and returned 1,100 rows. If a million records can be searched in two seconds, why would I bother using Full Text Search? Did the LIKE predicate use the full-text index as well?
No. LIKE does not make use of full text indexing. See here.
Computers are pretty darn fast these days but if you're seeing search results faster than you expect it's possible that you simply got back a cached result set because you executed the same query previously. To be sure you're not getting cached results you could use DBCC DROPCLEANBUFFERS. Take a look at this post for some SQL Server cache clearing options.
Excerpt from the linked page:
Comparing LIKE to Full-Text Search
In contrast to full-text search, the LIKE Transact-SQL predicate works on character patterns only. Also, you cannot use the LIKE predicate to query formatted binary data. Furthermore, a LIKE query against a large amount of unstructured text data is much slower than an equivalent full-text query against the same data. A LIKE query against millions of rows of text data can take minutes to return; whereas a full-text query can take only seconds or less against the same data, depending on the number of rows that are returned.
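To rule out caching as suggested above, something along these lines could be used to time the query against a cold buffer cache. This is a sketch for a test server only, since it flushes cached data pages for the whole instance.

```sql
-- Run only on a test server: this empties the buffer cache instance-wide.
CHECKPOINT;              -- write dirty pages to disk so they become "clean"
DBCC DROPCLEANBUFFERS;   -- drop clean pages from the buffer pool

SET STATISTICS TIME ON;  -- report CPU and elapsed time for each statement

SELECT COUNT(*)
FROM Students
WHERE FullName LIKE '%abc%';

SET STATISTICS TIME OFF;
```

Running the SELECT a second time without the DBCC call gives the warm-cache figure for comparison.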
I think you have answered your own question, at least to your own satisfaction. If your prototyping produces results in an acceptable amount of time, and you are certain that caching does not explain the quick response (per Paul Sasik), by all means skip the overhead of full-text indexing and proceed with LIKE.
You might be interested in full-text search if you care about ranking your result set or about lexical stemming.
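A hedged sketch of both features together, assuming a full-text index exists on Students.FullName and that StudentID (a hypothetical column) is its unique key:

```sql
-- StudentID is an assumed key column; CONTAINSTABLE returns it as [KEY].
-- FORMSOF(INFLECTIONAL, ...) asks for stemmed variants of the word
-- (for example run, runs, running).
SELECT s.FullName,
       ft.[RANK]          -- relevance score, higher means a better match
FROM CONTAINSTABLE(dbo.Students, FullName, 'FORMSOF(INFLECTIONAL, run)') AS ft
INNER JOIN dbo.Students AS s
    ON s.StudentID = ft.[KEY]
ORDER BY ft.[RANK] DESC;
```

Neither ranking nor stemming has a LIKE equivalent, which is why they are usually the deciding factor rather than raw speed.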
No, in fact your example query can't even take advantage of a regular index to speed things up because it doesn't know the first letters of any potential matches.
In general, a full-text search over a large amount of text is fast, whereas LIKE with a leading wildcard is considerably slower.

Performance with LIKE vs CONTAINS using full-text indexing

I have a table with a large(ish) number of rows (500k) on MSSQL Server 2008. One column holds an nvarchar product ID, usually 15 characters long and alphanumeric, e.g. FF93F348HJKCF5HW9. I would like to be able to search for this product ID with the best performance. I have done some research into full-text indexing on this column, and I don't really think that CONTAINS offers any benefit over LIKE '%%'. This seems to be because full-text indexing is more beneficial when searching for whole words rather than a series of characters.
Can somebody confirm/deny this for me?
Full-Text indexing is about searching for language words in unstructured text data. Your data doesn't contain words, just a sequence of characters.
I haven't tested this, but I would expect that LIKE would actually be faster, as long as your data is indexed. CONTAINS is meant for searching for words & word-like structures.
If your requirement is for "auto-complete", then LIKE will perform pretty well since the optimizer will use an INDEX SEEK when you search for something such as LIKE 'F5521%'.
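A small sketch of the auto-complete case. The table, column, and index names are hypothetical, since the question does not give them.

```sql
-- Hypothetical names: dbo.Products, ProductCode, IX_Products_ProductCode.
CREATE NONCLUSTERED INDEX IX_Products_ProductCode
    ON dbo.Products (ProductCode);

-- Fixed prefix: the optimizer can seek straight to the matching range.
SELECT ProductCode
FROM dbo.Products
WHERE ProductCode LIKE N'F5521%';

-- Leading wildcard: every index entry has to be examined.
SELECT ProductCode
FROM dbo.Products
WHERE ProductCode LIKE N'%F5521%';
```

The N prefix keeps the literal as nvarchar, matching the column type described in the question and avoiding an implicit conversion.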
This MSDN article explains the basics of the CONTAINS keyword.

Sql Search in millions of records. Possible?

I have a table in my SQL Server 2005 database which contains about 50 million records.
I have firstName and LastName columns, and I would like to allow the user to search on these columns without it taking forever.
Apart from indexing these columns, is there a way to make my query fast?
Also, I want to search for similar-sounding names. For example, if the user searches for Danny, I would like to return records with the names Dan and Daniel as well. It would be nice to show the user a rank, in %, of how close each result is to what was actually searched for.
I know this is a tough task, but I bet I'm not the first one in the world to face this issue :)
Thanks for your help.
We have databases with half a billion records (Oracle, but performance should be similar). You can search them within a few milliseconds if you have proper indexes. In your case, place an index on firstname and lastname. A B-tree index will perform well and will scale with the size of your database. Careful: LIKE clauses often prevent the index from being used and degrade performance considerably. I know MySQL can keep using indexes with LIKE clauses when the wildcards are only at the right of the string. You would have to check whether the same holds for SQL Server.
String similarity is indeed not simple. Have a look at http://en.wikipedia.org/wiki/Category:String_similarity_measures; you'll see some of the possible algorithms. I cannot say whether SQL Server implements any of them, as I don't know this database. Try Googling "SQL Server" plus the name of an algorithm to find what you need. Otherwise, Wikipedia provides code for various languages (maybe not SQL, but you should be able to adapt it for a stored procedure).
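For what it's worth, SQL Server does ship two built-in phonetic functions, SOUNDEX and DIFFERENCE. The following is a rough sketch of how they could be applied here; dbo.People is a placeholder table name, and the percentage is a crude scaling of the 0 to 4 DIFFERENCE score, not a real similarity measure.

```sql
-- dbo.People is a placeholder for the asker's table.
DECLARE @search VARCHAR(50);
SET @search = 'Danny';

SELECT firstName,
       LastName,
       DIFFERENCE(firstName, @search)      AS Closeness,    -- 0 (weakest) to 4 (strongest)
       DIFFERENCE(firstName, @search) * 25 AS RoughPercent  -- crude 0-100 scale
FROM dbo.People
WHERE DIFFERENCE(firstName, @search) >= 3
ORDER BY Closeness DESC;
```

Because DIFFERENCE is applied to the column, this query has to scan all rows. On 50 million records it may be worth storing SOUNDEX(firstName) in its own indexed column and filtering on equality with SOUNDEX(@search) first, then ranking the much smaller result.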
Have you tried full-text indexing? I used it on free-text fields in a table of over 1 million records and found it to be pretty fast. Plus you can add synonyms to it, so that Dan, Daniel, and Danny all index as the same (where you get the dictionary of name equivalents is a different story). It allows wildcard searches as well. Full-text indexing can also do ranking, though I found that less useful on names (better for documents).
Enable full-text search for this table and those columns; that will create a full-text index on them.

How to search millions of record in SQL table faster?

I have a SQL table with millions of domain names. But now when I search for, let's say,
SELECT *
FROM tblDomainResults
WHERE domainName LIKE '%lifeis%'
It takes more than 10 minutes to get the results. I tried indexing, but that didn't help.
What is the best way to store these millions of records and access the information quickly?
There are about 50 million records and 5 columns so far.
Most likely, you tried a traditional index which cannot be used to optimize LIKE queries unless the pattern begins with a fixed string (e.g. 'lifeis%').
What you need for your query is a full-text index. Most DBMS support it these days.
Assuming that your 50 million row table includes duplicates (perhaps that is part of the problem), and assuming SQL Server (the syntax may change but the concept is similar on most RDBMSes), another option is to store domains in a lookup table, e.g.
CREATE TABLE dbo.Domains
(
    DomainID INT IDENTITY(1,1) PRIMARY KEY,
    DomainName VARCHAR(255) NOT NULL
);
CREATE UNIQUE INDEX dn ON dbo.Domains(DomainName);
When you load new data, check if any of the domain names are new - and insert those into the Domains table. Then in your big table, you just include the DomainID. Not only will this keep your 50 million row table much smaller, it will also make lookups like this much more efficient.
SELECT * -- please specify column names
FROM dbo.tblDomainResults AS dr
INNER JOIN dbo.Domains AS d
    ON dr.DomainID = d.DomainID
WHERE d.DomainName LIKE '%lifeis%';
Of course, except on the tiniest of tables, it will always help to avoid LIKE clauses with a leading wildcard.
Full-text indexing is the far-and-away best option here - how this is accomplished will depend on the DBMS you're using.
Short of that, ensuring that you have an index on the column being matched with the pattern will help performance, but by the sounds of it, you've tried this and it didn't help a great deal.
Stop using the LIKE statement. You could use full-text search, but in MySQL it will require a MyISAM table, and it isn't all that good a solution.
I would recommend examining the available third-party solutions, like Lucene and Sphinx. They will be superior.
One thing you might want to consider is having a separate search engine for such lookups. For example, you can use a Solr (Lucene) server to run the search and retrieve the IDs of the matching entries, then retrieve the data from the database by ID. Even with two separate calls, it's very likely to end up faster.
Indexes are slowed down whenever the engine has to go look up ("bookmark lookup") data that the index itself doesn't contain. For instance, if your index has 2 columns, ID and NAME, but you're selecting * (5 columns in total), the database has to read the index for the first two columns, then look up the other 3 columns in a less efficient data structure somewhere else.
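One way to avoid the bookmark lookup, sketched here under the assumption of SQL Server 2005 or later, is a covering index. The registrar and createdOn columns are hypothetical, since the question only names domainName.

```sql
-- Hypothetical columns registrar and createdOn stand in for whatever
-- the application actually needs to return.
CREATE NONCLUSTERED INDEX IX_tblDomainResults_domainName
    ON dbo.tblDomainResults (domainName)
    INCLUDE (registrar, createdOn);  -- stored at the leaf level of the index

-- Every column in the query is in the index, so no lookup is needed.
SELECT domainName, registrar, createdOn
FROM dbo.tblDomainResults
WHERE domainName LIKE 'lifeis%';
```

This only removes the lookup cost. A pattern with a leading wildcard would still have to read the whole index, although a narrow index is usually cheaper to scan than the full table.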
In this case, your index can't be used because of the LIKE. This is similar to not putting any WHERE filter on the query: since it has to read the whole table anyway, the engine skips the index altogether and does just that (a "table scan"). There is a threshold (I think around 35-50% of rows) at which the engine normally flips over to this.
In short, it seems unlikely that you need all 50 million rows from the DB for a production application, but if you do, use a machine with more memory and try methods that keep the data in memory. Maybe a NoSQL DB would be a better option: MongoDB, CouchDB, Tokyo Cabinet, things like this. Good luck!
You could try breaking up the domain into chunks and then search the chunks themselves. I did something like that years ago when I needed to search for words in sentences. I did not have full-text searching available, so I broke up the sentences into a word list and searched the words. It was really fast to find the results since the words were indexed.
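One possible variant of the chunking idea for substring searches is a suffix table: store every suffix of each domain name, so that a "contains" search becomes a "starts with" search on an indexed column. The sketch below assumes the dbo.Domains lookup table from the earlier answer and a hypothetical dbo.Numbers helper table holding the integers 1 to 255 in a column named Number.

```sql
-- dbo.DomainSuffixes and dbo.Numbers are hypothetical helper tables.
CREATE TABLE dbo.DomainSuffixes
(
    DomainID INT NOT NULL,
    Suffix VARCHAR(255) NOT NULL
);
CREATE INDEX ix_suffix ON dbo.DomainSuffixes (Suffix);

-- One row per starting position: 'lifeisgood.com', 'ifeisgood.com', ...
INSERT INTO dbo.DomainSuffixes (DomainID, Suffix)
SELECT d.DomainID,
       SUBSTRING(d.DomainName, n.Number, 255)
FROM dbo.Domains AS d
INNER JOIN dbo.Numbers AS n
    ON n.Number <= LEN(d.DomainName);

-- LIKE '%lifeis%' on the name becomes LIKE 'lifeis%' on the suffix.
SELECT DISTINCT d.DomainName
FROM dbo.DomainSuffixes AS s
INNER JOIN dbo.Domains AS d
    ON d.DomainID = s.DomainID
WHERE s.Suffix LIKE 'lifeis%';
```

The trade-off is storage: each domain produces as many rows as it has characters, so this trades disk space and load time for query speed.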

How do I write a string search query that uses the non-clustered indexing I have in place on the field?

I'm looking to build a query that will use the non-clustered index I have on a street address field. The problem is that when searching for a street address I will most likely be using LIKE, and I suspect that this will cause a table scan instead of using the index. How would I go about writing the query in this case? Is it just pointless to put a non-clustered index on an address3 field? Thanks in advance.
varchar fields are indexed from left to right, much the same as a dictionary or encyclopedia is indexed.
If you know what the field starts with (e.g. LIKE 'streetname%'), then the index will be efficient. However, if you only know part of the field (e.g. LIKE '%something%'), then the index cannot be used for a seek.
If your LIKE expression is doing a start-of-string search (Address LIKE 'Blah%'), I would expect the index to be used, most likely through an index seek.
If you search for Address LIKE '%Blah%', a table scan/index scan will occur, depending on how many fields you return in your query and how selective the index is.
Using LIKE will not necessarily cause a table scan; it may make use of an index, depending on what string you're searching for. (For instance, LIKE 'something%' is generally able to use an index, whereas LIKE '%something' is probably not, although the server may still be able to at least do an index scan in that case, which is more expensive than a straight index lookup but still cheaper than a full table scan.) There's a good article here that talks about LIKE vs. indexes with respect to SQL Server (different DBMSs will implement it differently, obviously).
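Rather than guessing, you can ask SQL Server which plan it would choose. A sketch using placeholder names (dbo.Addresses, Address):

```sql
-- dbo.Addresses and Address are placeholder names.
SET SHOWPLAN_TEXT ON;   -- return the estimated plan instead of running the queries
GO

-- Fixed prefix: an Index Seek is the likely outcome.
SELECT Address FROM dbo.Addresses WHERE Address LIKE 'Blah%';

-- Leading wildcard: an Index Scan or a table scan is the likely outcome.
SELECT Address FROM dbo.Addresses WHERE Address LIKE '%Blah%';
GO

SET SHOWPLAN_TEXT OFF;
GO
```

SET SHOWPLAN_TEXT has to be the only statement in its batch, hence the GO separators. The operator actually chosen depends on the table's statistics and on which columns the query returns.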
In theory the database will use whatever index is best. What database server are you using, what are you really trying to achieve, and what is your LIKE statement going to be like? For instance, where the wildcard characters are can make a difference to the query plan that is used.
Other possibilities depending on what you want to achieve are performing some pre-processing of the data and having other columns that are useful for your search, or using an indexed view.
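As one example of such pre-processing, an "ends with" search can be turned into a "starts with" search by indexing the reversed string. This is a sketch under assumptions: the names are placeholders, and Address is assumed to be a VARCHAR of 255 characters or fewer so that it fits within the index key size limit.

```sql
-- Placeholder names; assumes Address is VARCHAR(255) or shorter.
ALTER TABLE dbo.Addresses
    ADD AddressReversed AS REVERSE(Address) PERSISTED;

CREATE NONCLUSTERED INDEX IX_Addresses_AddressReversed
    ON dbo.Addresses (AddressReversed);

-- LIKE '%Street' on Address becomes LIKE 'teertS%' on the reversed column.
DECLARE @pattern VARCHAR(256);
SET @pattern = REVERSE('Street') + '%';

SELECT Address
FROM dbo.Addresses
WHERE AddressReversed LIKE @pattern;
```

This helps only with patterns anchored at the end of the string. A true "contains" search ('%something%') still needs full-text indexing or some other pre-computed structure.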
Here's some discussion on the use of indexes with SQL Server 2005 and varchar fields.