Why would I bother using full text search? - sql

I am new to Full-Text Search. I used the following query:
Select * From Students Where FullName LIKE '%abc%'
The Students table contains a million records, all random, that look like this:
'QZAQHIEK VABCNLRM KFFZJYUU'
It took only 2 seconds and returned 1,100 rows. If a million records can be searched in two seconds, why would I bother using Full-Text Search? Did the LIKE predicate use the full-text index as well?

No. LIKE does not make use of full text indexing. See here.
Computers are pretty darn fast these days, but if you're seeing search results faster than you expect, it's possible that you simply got back a cached result set because you executed the same query previously. To be sure you're not getting cached results, you could use DBCC DROPCLEANBUFFERS. Take a look at this post for some SQL Server cache-clearing options.
Excerpt from the linked page:
Comparing LIKE to Full-Text Search
In contrast to full-text search, the LIKE Transact-SQL predicate works on character patterns only. Also, you cannot use the LIKE predicate to query formatted binary data. Furthermore, a LIKE query against a large amount of unstructured text data is much slower than an equivalent full-text query against the same data. A LIKE query against millions of rows of text data can take minutes to return; whereas a full-text query can take only seconds or less against the same data, depending on the number of rows that are returned.
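For reference, a minimal sketch of what the full-text counterpart of the question's query might look like, assuming a full-text index has already been created on FullName (note that CONTAINS matches whole words or prefixes, not arbitrary substrings, so it is not a drop-in replacement for '%abc%'):

```sql
-- LIKE with a leading wildcard: the whole column must be scanned.
SELECT * FROM Students WHERE FullName LIKE '%abc%';

-- Full-text prefix search: resolved through the full-text index.
-- Note the double quotes: '"abc*"' is a prefix term; 'abc' alone
-- would match only the complete word "abc".
SELECT * FROM Students WHERE CONTAINS(FullName, '"abc*"');
```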

I think you have answered your own question, at least to your own satisfaction. If your prototyping produces results in an acceptable amount of time, and you are certain that caching does not explain the quick response (per Paul Sasik), by all means skip the overhead of full-text indexing and proceed with LIKE.

You might be interested in full-text search if you care about ranking your result set or about lexical stemming.

No, in fact your example query can't even take advantage of a regular index to speed things up because it doesn't know the first letters of any potential matches.
In general, full-text search is faster than a regular index lookup for this kind of matching, and LIKE with wildcards on both sides is considerably slower than either.

Related

SQL Server: Performance of searching for hex strings in large tables (using LIKE, Full-Text Search, etc.)

I have a table with 40+ million rows in MS SQL Server 2019. One of the columns stores pure hexadecimal strings (both binary and readable ASCII content). I need to search this table for rows containing a specific hex string.
Normally, I would do this:
SELECT * FROM transactionoutputs WHERE outhex LIKE '%74657374%' ORDER BY id DESC OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY;
Since the results are paginated, it can take less than a second to find the first 10 results. However, when increasing the offset, or searching for strings that only appear 1-2 times in the entire table, it can take more than a minute, at which point my application will time out.
The execution plan for this query is this:
Are there any easy ways to improve the performance of such a search?
Using this answer, I was able to reduce the query time from 33 seconds to 27 seconds:
SELECT * FROM transactionoutputs WHERE
CHARINDEX('74657374' collate Latin1_General_BIN, outhex collate Latin1_General_BIN) > 0
ORDER BY id DESC OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY;
When I leave out the ORDER BY and pagination, I can reduce this to 19 seconds. This is not ideal because I need both the ordering and the pagination, and it still has to scan the entire table.
I have tried the following:
Create an index on that column. This has no noticeable effect.
I came across this article about slow queries. Initially, I was using parameterized queries in my application, which was much slower than running them in SSMS. I have since moved to the query shown above, but it is still slow.
I tried to enable Multiple Active Result Sets (MARS), but without any improvement in query time.
I also tried using Full-Text Search. This seemed to be the most promising solution as text search is exactly what I need. I created a full-text index and can do a similar query like above, but using the index:
SELECT * FROM transactionoutputs WHERE CONTAINS(outhex,'7465') ORDER BY id desc OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY;
This returns results almost instantly. However, when the query is longer than a few characters (often 4), it doesn't return anything. Am I doing something wrong or why is it doing that?
The execution plan:
My understanding is that my case is not the ideal use case for FTS, as it is designed to search readable text, not hexadecimal strings. Is it possible to use it anyway, and if so, how?
After reading dozens of articles and SO posts, I cannot confidently say I know how to improve the performance of such queries for my specific use case, if it is even possible at all. So, is there any easy option to improve this?
First, kudos for the fantastic explanation of your problem; this helps you get better answers fast. You should also include DDL, including indexes, when possible. This will become clear as I answer your question.
I'm going to tackle a couple of issues with your query which are unrelated to how you're parsing your text now, and talk about how to handle the string problem later tonight.
Answer Part 1: Unrelated to string parsing
It's quite possible that the way you are searching through the string is the main performance problem. Let's start with the SELECT * - do you absolutely need all the columns? Specifically, do you absolutely need all the columns that are included in that Key lookup? This is the most important thing to sort out. Let me explain.
Your query is performing a scan against your nonclustered index named outhex-index, then performs a Key Lookup to retrieve the columns not included in outhex-index. Key lookups destroy performance, especially against a clustered and nonclustered index with 40,000,000 rows.
If you do need those columns, then you should consider adding them as included columns to your outhex-index index. I say consider because I don't know how many columns there are, nor their data types. Included columns speed up queries by eliminating costly key lookups, but they slow down data modification, sometimes dramatically depending on the number/type of indexes. If you need the columns not included in outhex-index and they are big columns (MAX/BLOB/LOB data types, XML, etc.), then a covering index is NOT an option. If you don't need them, then you should refactor your SELECT * statement to only include the columns you need.
Full-text indexing is not an option here unless you can find a way to lose that sort. Sorting has N log N complexity, which means the sort gets more expensive the more rows you sort. A 40-million-row sort should be avoided whenever possible, and this will be hard to avoid with full-text indexing, for reasons which require more time to explain than I have. Adding or modifying a 40-million-row index can be expensive and take a lot of time. If you do go that route, I suggest taking an offline copy of that table to time how long the build takes. You can also consider creating a filtered index, if possible, to reduce your search area.
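As a hedged sketch of that last suggestion, a filtered index could look like the following; the id threshold and the choice of included column are pure assumptions about the workload, not something taken from the question:

```sql
-- Only index the "recent" slice of the table that searches usually hit.
-- The 30,000,000 cutoff is illustrative only.
CREATE NONCLUSTERED INDEX ix_outhex_recent
    ON dbo.transactionoutputs (id DESC)
    INCLUDE (outhex)
    WHERE id > 30000000;
```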
I noticed, too, that both queries are getting serial execution plans. I don't know if a parallel plan will help the first query with the key lookup, but I know that it will likely help with the second one, as there is a sort involved. Parallel execution plans can really speed up sorts. Consider testing your query with OPTION (QUERYTRACEON 8649) or make_parallel() by Adam Machanic.
I'll update this post tonight with some ideas to parse your string faster. One thing you could look into in the meantime is Paul White's clever Trigram Wildcard String Search trick which might be an option.

SQL search in millions of records. Possible?

I have a table in my sql server 2005 database which contains about 50 million records.
I have firstName and LastName columns, and I would like to be able to allow the user to search on these columns without it taking forever.
Besides indexing these columns, is there a way to make my query work fast?
Also, I want to search for similar-sounding names. For example, if the user searches for Danny, I would like to return records with the names Dan and Daniel as well. It would be nice to show the user a rank, as a percentage, of how close each result is to what he actually searched for.
I know this is a tough task, but I bet I'm not the first one in the world to face this issue :)
Thanks for your help.
We have databases with half a billion records (Oracle, but performance should be similar). You can search them within a few milliseconds if you have proper indexes. In your case, place an index on firstname and lastname. A binary-tree index will perform well and will scale with the size of your database. Careful: LIKE clauses often break the use of the index and largely degrade performance. I know MySQL can keep using indexes with LIKE clauses when wildcards are only at the right of the string; you would have to check for similar behaviour in SQL Server.
String similarity is indeed not simple. Have a look at http://en.wikipedia.org/wiki/Category:String_similarity_measures; you'll see some of the possible algorithms. I can't say whether SQL Server implements any of them, as I don't know that database. Try Googling "SQL Server" plus the name of an algorithm to maybe find what you need. Otherwise, the wiki provides code in various languages (maybe not SQL, but you should be able to adapt it for a stored procedure).
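For what it's worth, SQL Server does ship two simple phonetic functions, SOUNDEX and DIFFERENCE, which cover the Danny/Dan/Daniel case from the question. A sketch (table and column names are assumptions, and this predicate cannot use an index, so test it against your row counts):

```sql
-- DIFFERENCE returns 0-4 (4 = best phonetic match), which can be
-- scaled into the rough "% closeness" rank the question asks for.
SELECT firstName,
       lastName,
       DIFFERENCE(firstName, 'Danny') * 25 AS MatchPercent
FROM   Customers
WHERE  DIFFERENCE(firstName, 'Danny') >= 3   -- Danny, Dan, Daniel, ...
ORDER  BY MatchPercent DESC;
```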
Have you tried full-text indexing? I used it on free-text fields in a table with over 1 million records and found it to be pretty fast. Plus you can add synonyms to it, so that Dan, Daniel, and Danny all index as the same (where you get the dictionary of name equivalents is a different story). It allows wildcard searches as well. Full-text indexing can also do rank, though I found it less useful on names (better for documents).
Enable FULL TEXT SEARCH for this table and those columns; that will create a full-text index on them.

Lucene.Net memory consumption and slow search when too many clauses used

I have a DB holding text-file attributes and text-file primary-key IDs, and
I have indexed around 1 million text files along with their IDs (primary keys in the DB).
Now, I am searching at two levels.
First is a straightforward DB search, where I get primary keys as the result (roughly 2 or 3 million IDs).
Then I make a Boolean query, for instance as follows:
+Text:"test*" +(pkID:1 pkID:4 pkID:100 pkID:115 pkID:1041 .... )
and search it in my Index file.
The problem is that such a query (having 2 million clauses) takes far too much time to return a result and consumes far too much memory.
Is there any optimization solution for this problem ?
Assuming you can reuse the pkID part of your queries:
Split the query into two parts: one part (the text query) will become the query and the other part (the pkID query) will become the filter
Make both parts into queries
Convert the pkid query to a filter (by using QueryWrapperFilter)
Convert the filter into a cached filter (using CachingWrapperFilter)
Hang onto the filter, perhaps via some kind of dictionary
Next time you do a search, use the overload that allows you to use a query and filter
As long as the pkID search can be reused, you should see quite a large improvement. As long as you don't optimize your index, the effect of caching should even persist through commit points (I understand the bit sets are calculated on a per-segment basis).
HTH
p.s.
I think it would be remiss of me not to note that I think you're putting your index through all sorts of abuse by using it like this!
The best optimization is NOT to use the query with 2 million clauses. Any Lucene query with 2 million clauses will run slowly no matter how you optimize it.
In your particular case, I think it will be much more practical to search your index first with +Text:"test*" query and then limit the results by running a DB query on Lucene hits.

SQL full text search vs "LIKE"

Let's say I have a fairly simple app that lets users store information on DVDs they own (title, actors, year, description, etc.) and I want to allow users to search their collection by any of these fields (e.g. "Keanu Reeves" or "The Matrix" would be valid search queries).
What's the advantage of going with SQL full text search vs simply splitting the query up by spaces and doing a few "LIKE" clauses in the SQL statement? Does it simply perform better or will it actually return results that are more accurate?
Full-text search is likely to be quicker, since it will benefit from an index of words that it uses to look up the records, whereas using LIKE is going to need a full table scan.
In some cases LIKE will be more accurate, since LIKE "%The%" AND LIKE "%Matrix" will pick out "The Matrix" but not "Matrix Reloaded", whereas full-text search will ignore "The" and return both. That said, returning both would likely have been the better result.
Full-text indexes (which are indexes) are much faster than using LIKE (which essentially examines each row every time). However, if you know the database will be small, there may not be a performance need to use full-text indexes. The only way to determine this is with some intelligent averaging and some testing based on that information.
Accuracy is a different question. Full-text indexing allows you to do several things (weighting, automatically matching eat/eats/eating, etc.) that you couldn't possibly implement in any sort of reasonable time-frame using LIKE. The real question is whether you need those features.
Without reading the full-text documentation's description of these features, you're really not going to know how you should proceed. So, read up!
Also, some basic tests (insert a bunch of rows in a table, maybe with some sort of public dictionary as a source of words) will go a long way to helping you decide.
A full-text search query is much faster, especially when working with lots of data in various columns.
Additionally, you get language-specific search support. For example, German umlauts like the "ü" in "über" will also be found when stored as "ueber". You can also use synonyms to automatically expand search queries, or to replace or substitute specific phrases.
In some cases LIKE will be more accurate, since LIKE "%The%" AND LIKE "%Matrix" will pick out "The Matrix" but not "Matrix Reloaded", whereas full-text search will ignore "The" and return both. That said, returning both would likely have been the better result.
That is not correct. The full-text search syntax lets you specify "how" you want to search. E.g. by using the CONTAINS statement you can use exact term matching as well as fuzzy matching, weights, etc.
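A sketch of a few of those CONTAINS forms (table, column, and key names are illustrative, not from the question):

```sql
-- Exact phrase match:
SELECT * FROM Dvds WHERE CONTAINS(Title, '"The Matrix"');

-- Inflectional (stemming) match: finds eat, eats, eating, ...
SELECT * FROM Dvds WHERE CONTAINS(Description, 'FORMSOF(INFLECTIONAL, eat)');

-- Weighted terms with a rank, via CONTAINSTABLE:
SELECT d.Title, k.RANK
FROM   CONTAINSTABLE(Dvds, Description,
           'ISABOUT (Matrix WEIGHT(0.8), Reloaded WEIGHT(0.2))') AS k
JOIN   Dvds AS d ON d.DvdId = k.[KEY]
ORDER  BY k.RANK DESC;
```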
So if you have performance issues or would like to provide a more "Google-like" search experience, go for the full text search engine. It is also very easy to configure.
Just a few notes:
LIKE can use an Index Seek if you don't start your LIKE with %. Example: LIKE 'Santa M%' is good! LIKE '%Maria' is bad, and can cause a Table or Index Scan, because this can't be indexed in the standard way.
This is very important: Full-Text Index updates are asynchronous. For instance, if you perform an INSERT on a table followed by a SELECT with full-text search where you expect the new data to appear, you might not get the data immediately. Based on your configuration, you may have to wait a few seconds or a day. Generally, full-text indexes are populated when your system does not have many requests.
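If you hit this, you can check the population status or kick off a population by hand; a sketch with illustrative catalog/table names:

```sql
-- 0 means the catalog is idle, i.e. the index has caught up.
SELECT FULLTEXTCATALOGPROPERTY('MyCatalog', 'PopulateStatus');

-- Fold pending change-tracking entries into the index now.
ALTER FULLTEXT INDEX ON dbo.MyTable START UPDATE POPULATION;
```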
It will perform better, but unless you have a lot of data you won't notice the difference. A SQL full-text search index lets you use operators that are more advanced than a simple "LIKE" operation, but if all you do is the equivalent of a LIKE operation against your full-text index, then your results will be the same.
Imagine that you allow users to enter notes/descriptions on their DVDs.
In that case it would be good to allow searching by description, and full-text search will do the better job.
You may get slightly better results, or else at least have an easier implementation with full text indexing. But it depends on how you want it to work ...
What I have in mind is that if you are searching for two words, with LIKE you have to then manually implement (for example) a method to weight those with both higher in the list. A fulltext index should do this for you, and allow you to influence the weightings too using relevant syntax.
To make full-text search in SQL Server behave like LIKE:
First, you have to create a stoplist and assign it to your table:
CREATE FULLTEXT STOPLIST [MyStopList];
GO
ALTER FULLTEXT INDEX ON dbo.[MyTableName] SET STOPLIST [MyStopList]
GO
Second, use the following T-SQL script:
SELECT * FROM dbo.[MyTableName] AS mt
WHERE CONTAINS((mt.ColumnName1,mt.ColumnName2,mt.ColumnName3), N'"*search text s*"')
If you are not just searching for English words (say you search for a Chinese word), then how your FTS tokenizes words will make a big difference to your search, as I showed in an example here: https://stackoverflow.com/a/31396975/301513. But I don't know how SQL Server tokenizes Chinese words; does it do a good job with them?

When should you use full-text indexing?

We have a whole bunch of queries that "search" for clients, customers, etc. You can search by first name, email, etc. We're using LIKE statements in the following manner:
SELECT *
FROM customer
WHERE fname LIKE '%someName%'
Does full-text indexing help in the scenario? We're using SQL Server 2005.
It will depend upon your DBMS. I believe that most systems will not take advantage of the full-text index unless you use the full-text functions (e.g. MATCH/AGAINST in MySQL or FREETEXT/CONTAINS in MS SQL).
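A sketch of both, reusing the question's customer/fname names:

```sql
-- MySQL: only MATCH ... AGAINST goes through the full-text index.
SELECT * FROM customer WHERE MATCH(fname) AGAINST ('someName');

-- SQL Server: likewise, only CONTAINS/FREETEXT use the full-text index;
-- LIKE '%someName%' ignores it entirely.
SELECT * FROM customer WHERE CONTAINS(fname, 'someName');
SELECT * FROM customer WHERE FREETEXT(fname, 'someName');
```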
Here are two good articles on when, why, and how to use full-text indexing in SQL Server:
How To Use SQL Server Full-Text Searching
Solving Complex SQL Problems with Full-Text Indexing
FTS can help in this scenario, the question is whether it is worth it or not.
To begin with, let's look at why LIKE may not be the most effective search. When you use LIKE, especially when you are searching with a % at the beginning of your comparison, SQL Server needs to perform both a table scan of every single row and a byte by byte check of the column you are checking.
FTS has some better algorithms for matching data, as well as better statistics on variations of names. Therefore FTS can provide better performance for matching Smith, Smythe, Smithers, etc. when you look for Smith.
It is, however, a bit more complex to use FTS, as you'll need to master CONTAINS vs FREETEXT and the arcane format of the search. However, if you want to do a search where either FName or LName match, you can do that with one statement instead of an OR.
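A minimal sketch of that single-statement form, assuming both columns are part of the full-text index:

```sql
-- One CONTAINS over both columns...
SELECT * FROM customer WHERE CONTAINS((fname, lname), 'Smith');

-- ...replaces the OR of two LIKEs, which scans the table:
SELECT * FROM customer WHERE fname LIKE '%Smith%' OR lname LIKE '%Smith%';
```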
To determine if FTS is going to be effective, determine how much data you have. I use FTS on a database of several hundred million rows and that's a real benefit over searching with LIKE, but I don't use it on every table.
If your table size is more reasonable, less than a few million, you can get similar speed by creating an index for each column that you're going to be searching on and SQL Server should perform an index scan rather than a table scan.
According to my test scenario:
SQL Server 2008
10,000,000 rows, each with a string like "wordA wordB wordC..." (varies between 1 and 30 words)
selecting count(*) with CONTAINS(column, "wordB")
result size of several hundred thousand
catalog size approx. 1.8 GB
The full-text index was in the range of 2 seconds, whereas like '% wordB %' was in the range of 1-2 minutes.
But this holds only if you don't use any additional selection criteria! E.g. when I additionally used "like 'prefix%'" on a primary key column, the performance was worse, since going into the full-text index costs more than doing a string search in a few fields (as long as there are not too many).
So I would recommend a full-text index only in cases where you have to do a "free string search" or need some of its special features...
To answer the question specifically for MSSQL, full-text indexing will NOT help in your scenario.
In order to improve that query you could do one of the following:
Configure a full-text catalog on the column and use the CONTAINS() function.
If you were primarily searching with a prefix (i.e. matching from the start of the name), you could change the predicate to the following and create an index over the column.
where fname like 'prefix%'
(1) is probably overkill for this, unless the performance of the query is a big problem.
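A sketch of option (2), using the question's table and column:

```sql
-- A plain nonclustered index makes the prefix form seekable:
CREATE NONCLUSTERED INDEX ix_customer_fname ON customer (fname);

SELECT * FROM customer
WHERE  fname LIKE 'prefix%';      -- index seek
-- WHERE fname LIKE '%someName%'  -- would still scan despite the index
```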