Best way to index a SQL table to find best matching string

Let's say I have a SQL table with an int PK column and an nvarchar(max) column. In the nvarchar(max) column, I have a bunch of table entries that all look like this:
SOME_PEOPLE_LIKE_APPLES
SOME_PEOPLE_LIKE_APPLES_ON_TUESDAY
SOME_PEOPLE_LIKE_APPLES_ON_THE_MOON
SOME_PEOPLE_LIKE_APPLES_ON_THE_MOON_CAFE
SOME_PEOPLE_LIKE_APPLES_ON_THE_RIVER
.
.
.
SOME_ANTS_HATE_SYRUP
SOME_ANTS_HATE_SYRUP_WITH_STRAWBERRIES
There are millions of these rows. My goal is to find the row with the most overlap for an input searchTerm. So in this case, if I input SOME_PEOPLE_LIKE_APPLES_ON_THE_MOON_MOUNTAIN, the returned entry would be the third entry from the table above, SOME_PEOPLE_LIKE_APPLES_ON_THE_MOON.
I have a SPROC that does this very naively; it goes through the entire table as follows:
SELECT DISTINCT phrase, len(phrase) l, [id] FROM X WHERE searchTerm LIKE phrase + '%'
-- phrase is the row entry being searched against
-- searchTerm is the phrase we're searching for
I then ORDER BY length and pick the TOP only
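Spelled out, the naive version looks roughly like this (a sketch; @searchTerm stands for the SPROC's input parameter):
SELECT DISTINCT TOP 1 phrase, LEN(phrase) AS l, [id]
FROM X
WHERE @searchTerm LIKE phrase + '%'
ORDER BY l DESC;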
Would there be a way to speed this up, perhaps by doing some indexing?
If this is confusing, think of it as tableRowEntry + wildcard = searchTerm
I'm on MSSQL 2008 if that makes any difference

If there is an index on your NVARCHAR column, a LIKE 'Something%' search will be able to use it and should be pretty fast.
If there is a wildcard at the beginning, you are out of luck. But - in your case - this should work.
You might use an indexed, persisted computed column storing the length of the string. In this case you might reduce the workload enormously by filtering out all strings which are too short or too long.
If there are certain words in your search terms which appear often but not everywhere, you might use side columns again and filter with something like AND IncludePEOPLE=1 AND IncludeMOON=1.
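As a rough sketch of that idea, assuming the Phrase table created in the example below (the flag column names and predicates are made up for illustration):
ALTER TABLE Phrase ADD IncludePEOPLE AS CASE WHEN Phrase LIKE '%PEOPLE%' THEN 1 ELSE 0 END PERSISTED;
ALTER TABLE Phrase ADD IncludeMOON AS CASE WHEN Phrase LIKE '%MOON%' THEN 1 ELSE 0 END PERSISTED;
CREATE INDEX IX_Phrase_Flags ON Phrase(IncludePEOPLE, IncludeMOON);
-- Pre-filter on the flags before the (more expensive) LIKE comparison
SELECT TOP 1 *
FROM Phrase
WHERE IncludePEOPLE = 1 AND IncludeMOON = 1
AND @SearchTerm LIKE Phrase + '%'
ORDER BY PhraseLength DESC;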
UPDATE
Here is an example
CREATE TABLE Phrase(ID INT IDENTITY
,Phrase NVARCHAR(100)
,PhraseLength AS LEN(Phrase) PERSISTED);
CREATE INDEX IX_Phrase_Phrase ON Phrase(Phrase);
CREATE INDEX IX_Phrase_PhraseLength ON Phrase(PhraseLength);
INSERT INTO Phrase
VALUES
('SOME_PEOPLE_LIKE_APPLES')
,('SOME_PEOPLE_LIKE_APPLES_ON_TUESDAY')
,('SOME_PEOPLE_LIKE_APPLES_ON_THE_MOON')
,('SOME_PEOPLE_LIKE_APPLES_ON_THE_MOON_CAFE')
,('SOME_PEOPLE_LIKE_APPLES_ON_THE_RIVER')
,('SOME_ANTS_HATE_SYRUP')
,('SOME_ANTS_HATE_SYRUP_WITH_STRAWBERRIES');
DECLARE @SearchTerm NVARCHAR(100)=N'SOME_PEOPLE_LIKE_APPLES_ON_THE_MOON_MOUNTAIN';
--This uses the index (checked against execution plan)
SELECT TOP 1 *
FROM Phrase
WHERE @SearchTerm LIKE Phrase + '%'
ORDER BY PhraseLength DESC;
--This might be even better, check with your high row count.
SELECT TOP 1 *
FROM Phrase
WHERE Phrase=LEFT(@SearchTerm,PhraseLength)
ORDER BY PhraseLength DESC;
GO
--Clean-Up
DROP TABLE Phrase;

The best solution here is to create a full-text search index:
https://msdn.microsoft.com/en-us/library/ms142571.aspx
Full-text search is optimized for this task; once the index is created, you can use full-text queries with the CONTAINS full-text function to find the matches efficiently:
SELECT DISTINCT phrase, len(phrase) l, [id] FROM X WHERE CONTAINS(phrase, @searchPhrase)
Full-text search not only allows custom optimization through query hints like OPTIMIZE FOR, it also supports logical operators such as AND and OR within the search terms, and a variety of other text-searching goodies, like being able to find spelling variations of the same word automatically, filter by relevance, etc.
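For reference, a rough sketch of setting up such an index on the table X from the question (the catalog name and the KEY INDEX name are assumptions):
CREATE FULLTEXT CATALOG SearchCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON X(phrase) KEY INDEX PK_X; -- PK_X: the unique index on the int PK column
GO
DECLARE @searchPhrase NVARCHAR(200) = N'"SOME_PEOPLE_LIKE_APPLES*"';
SELECT phrase, LEN(phrase) AS l, [id] FROM X WHERE CONTAINS(phrase, @searchPhrase);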

Related

Full text search VS Fuzzy Search based on many columns

I have an Employee table with these columns:
EmployeeId
Fullname
Phone
Department
Team
Function
Manager
I have a form with a search text input, where a user can search by one column or by a combination of them, for example:
a user can search by Fullname only
a user can search by combining Fullname + Phone + Team
What is the difference between full-text search and fuzzy search in SQL Server in this case?
So you have two options:
Using full-text search:
If the data is huge and you are looking for a scalable data search, this method is preferable, however it is harder to maintain. So I suggest you add a computed column to that table and put a full-text index on it:
alter table tablename
add cmptcolumn as concat_ws(',', EmployeeId, FullName, PhoneNumber, ...)
--full text catalog
CREATE FULLTEXT CATALOG catalogName AS DEFAULT;
-- full text index (KEY INDEX must name the table's unique key index; the name here is assumed)
create fulltext index on tablename (cmptcolumn) key index PK_tablename;
-- search :
select * from tablename
where contains(cmptcolumn, 'SearchString');
With full-text search you can also search for synonyms and related words:
select * from tablename
where freetext(cmptcolumn, 'SearchString');
read more about different full text search options here
Using a search query, where again you can benefit from the computed column, or search inside each column separately:
select *
from tablename
where (Fullname like '%'+@fullNameSearchString+'%' or @fullNameSearchString is null)
and (Department = @DepartmentSearchString or @DepartmentSearchString is null)
and ...
While the first method is a faster way to search inside strings, the second method provides more accurate results. However, FREETEXT looks for the meaning of the word as well, so it might be slower.
In the second method, either way you go (with or without the computed column), having an index on the column(s) is a necessity to improve performance; however, a leading-wildcard LIKE ('%...%') usually can't use an index as it should.
You can also create a stored procedure and pass the columns comma-separated; check this example:
CREATE PROCEDURE [dbo].[Create_Sp]
@SearchColumn varchar(500) = null
AS
BEGIN
DECLARE @SQL nvarchar(max)
SELECT @SQL = N'select ' + @SearchColumn + ' from Employee'
EXECUTE sp_executesql @SQL
END
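A hypothetical call, passing the columns to return as a comma-separated list:
EXEC [dbo].[Create_Sp] @SearchColumn = 'EmployeeId, FullName, Phone';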
Well, we know that the main difference between fuzzy and full-text is "exactly what I'm looking for" versus "similarities." As that applies to SQL Server and to your example, it comes down to full-text search on FullName as opposed to mining similarities across three different columns, which might give a lot of noise depending on how clean your data is or how "guided" the end-user experience is in terms of entering those values.
With full-text search on FullName, I'm pretty sure it's a straightforward answer:
SELECT * FROM tblEmployee WHERE FullName=@FullNameParameter;
With fuzzy search, it can get very gross and nasty, as without a clearer understanding of what the UI is doing I am forced to assume we want to check similarities on all parameters for all fields. To be fairly honest, I'm pretty sure this query is the absolute worst, but it does demonstrate the idea for you.
SELECT * FROM tblEmployee WHERE FullName LIKE '%'+@Parameter1+'%'
UNION
SELECT * FROM tblEmployee WHERE FullName LIKE '%'+@Parameter2+'%'
UNION
SELECT * FROM tblEmployee WHERE FullName LIKE '%'+@Parameter3+'%'
UNION
SELECT * FROM tblEmployee WHERE Phone LIKE '%'+@Parameter1+'%'
UNION
SELECT * FROM tblEmployee WHERE Phone LIKE '%'+@Parameter2+'%'
UNION
SELECT * FROM tblEmployee WHERE Phone LIKE '%'+@Parameter3+'%'
UNION
SELECT * FROM tblEmployee WHERE Team LIKE '%'+@Parameter1+'%'
UNION
SELECT * FROM tblEmployee WHERE Team LIKE '%'+@Parameter2+'%'
UNION
SELECT * FROM tblEmployee WHERE Team LIKE '%'+@Parameter3+'%'
A very good solution is searloc. It is a CLR library with zero dependencies and with many features.
It supports full-text search, phonetic matching for all languages, keyboard matching, fuzzy search, multi-column search, and many others.
Most importantly, it is very fast; it needs just a few milliseconds for millions of records.

Need help in optimizing an Oracle query

I came up with a query to fetch data from a table containing 93781665 entries, to display the results as suggestions in an autocomplete text box.
But it takes more than 300 seconds to fetch results.
The query is given below.
select * from table
where upper(column1||' '||column2||' '||column3) like upper('searchstring%')
and rownum <= 10;
Kindly help me to optimize it.
The WHERE clause in your query is not sargable, meaning that no index can be used there. This rules out most of the methods you might use here to optimize the query. Here is one suggestion:
SELECT *
FROM yourTable
WHERE column4 LIKE 'SEARCHSTRING%';
Here, column4 is a new column in your table, which contains the concatenation of the first three columns. Furthermore, all text in column4 will always be uppercase, and the search string you pass into the query will also always be uppercase. Given these assumptions, the following index might help the query:
CREATE INDEX idx ON yourTable (column4);
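If you are on Oracle 11g or later, column4 could also be kept in sync automatically as a virtual column rather than a column you maintain yourself; a rough sketch under that assumption:
ALTER TABLE yourTable ADD (
column4 GENERATED ALWAYS AS (UPPER(column1 || ' ' || column2 || ' ' || column3)) VIRTUAL
);
CREATE INDEX idx ON yourTable (column4);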
In Oracle, you can index an expression:
create index idx_t_columns on t(upper(column1||' '||column2||' '||column3))
Then, this condition can use the index:
where upper(column1||' '||column2||' '||column3) like 'searchstring%'
If the search string is constant, then this should also work:
where upper(column1||' '||column2||' '||column3) like upper('searchstring%')
Note that a wildcard at the beginning of the like pattern would preclude the use of an index.

Improving performance on an alphanumeric text search query

I have a table with millions of records; I'm just posting sample data. I'm looking to get only the Endorsement data by using LIKE or LEFT, but there is no difference between them in execution time. Is there any better way to get the data in less time when dealing with alphanumeric data? I have 4.4M records in the table. Suggest me.
declare @t table (val varchar(50))
insert into @t(val) values
('0-1AB11BC11yerw123Endorsement'),
('0-1AB114578Endorsement'),
('0-1BC11BC11yerw122553Endorsement'),
('0-1AB11BC11yerw123newBusiness'),
('0-1AB114578newBusiness'),
('0-1BC11BC11yerw122553newBusiness'),
('0-1AB11BC11yerw123Renewal'),
('0-1AB114578Renewal'),
('0-1BC11BC11yerw122553Renewal')
SELECT * FROM @t where RIGHT(val,11) = 'Endorsement'
SELECT * FROM @t where val like '%Endorsement%'
Imagine you'd have to find names in a telephone book that end with a certain string. All you could do is read every single name and compare. It doesn't help you at all to see where the names with A, B, C, etc. start, because you are not interested in the initial characters of the names but only in the last characters instead. Well, the only thing you could do to speed this up is ask some friends to help you and each person scans a range of pages only. In a DBMS it is the same. The DBMS performs a full table scan and does this parallelized if possible.
If however you had a telephone book listing the words backwards, so you'd see which words end with A, B, C, etc., that sure would help. In SQL Server: Create a computed column on the reverse string:
alter table t add reverse_val as reverse(val);
And add an index:
create index idx_reverse_val on t(reverse_val);
Then query the string with LIKE. The DBMS should notice that it can use the index for speeding up the search process.
select * from t where reverse_val like reverse('Endorsement') + '%';
Having said this, it seems strange that you are interested in the end of your strings at all. In a good database you store atomic information, e.g. you would not store a person's name and birthdate in the same column ('John Miller 12.12.2000'), but in separate columns instead. Sure, it does happen that you store names and want to look for names starting with, ending with, containing substrings, but this is a rare thing after all. Check your column and think about whether its content should be separate columns instead. If you had the string ('Endorsement', 'Renewal', etc.) in a separate column, this would really speed up the lookup, because all you'd have to do is ask where val = 'Endorsement' and with an index on that column this is a super-simple task for the DBMS.
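If splitting that information out is an option, here is a rough sketch of the idea (the column name and the list of suffixes are assumptions based on the sample data):
alter table t add record_type varchar(20);
GO
update t set record_type = case
when val like '%Endorsement' then 'Endorsement'
when val like '%newBusiness' then 'newBusiness'
when val like '%Renewal' then 'Renewal'
end;
create index idx_record_type on t(record_type);
select * from t where record_type = 'Endorsement';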
try charindex or patindex:
SELECT *
FROM @t t
WHERE CHARINDEX('endorsement', t.val) > 0
SELECT *
FROM @t t
WHERE PATINDEX('%endorsement%', t.val) > 0
CREATE TABLE tbl
(val varchar(50));
insert into tbl(val)values
('0-1AB11BC11yerw123Endorsement'),
('0-1AB114578Endorsement'),
('0-1BC11BC11yerw122553Endorsement'),
('0-1AB11BC11yerw123newBusiness'),
('0-1AB114578newBusiness'),
('0-1BC11BC11yerw122553newBusiness'),
('0-1AB11BC11yerw123Renewal'),
('0-1AB114578Renewal'),
('0-1BC11BC11yerw122553Renewal');
CREATE CLUSTERED INDEX inx
ON dbo.tbl(val)
SELECT * FROM tbl where val like '%Endorsement';
--LIKE '%Endorsement' will give better performance; it can utilize the index more efficiently than RIGHT(val,11)

Full-Text Catalog unable to search Primary Key

I have created this little sample table, Person.
The Id is a primary key and an identity column.
Now when I try to create a Full-Text Catalog, I'm unable to search for the Id.
If I follow this wizard, I'll end up with a catalog where it's only possible to search for a person's name. I would really like to know how I can make it possible to do a full-text search for both the Id and the Name.
edit
SELECT *
FROM [TestDB].[dbo].[Person]
WHERE FREETEXT (*, 'anders' );
SELECT *
FROM [TestDB].[dbo].[Person]
WHERE FREETEXT (*, '1' );
I would like them to return the same result; the first returns id = 1, name = Anders, while the second query doesn't return anything.
edit 2
Looks like the problem is in using int, but is it not possible to trick full-text into supporting it?
edit 3
I created a view where I convert the int to an nvarchar: CONVERT(nvarchar(50), Id) AS PersonId. This made it possible for me to select that column when creating the Full-Text Catalog, but it still won't let me find it when searching for the Id.
From reading your question, I am not sure that you understand the purpose of a full-text index. A full-text index is intended to search TEXT (one or more columns) on a table. And not just as a replacement for:
SELECT *
FROM table
WHERE col1 LIKE 'Bob''s Pizzaria%'
OR col2 LIKE 'Bob''s Pizzaria%'
OR col3 LIKE 'Bob''s Pizzaria%'
It also allows you to search for variations of "Bob's Pizzaria" like "Bobs Pizzeria" (in case someone misspells Pizzeria or forgot to put in the ' or over-zealous anti-SQL-injection code stripped the ') or "Robert's Pizza" or "Bob's Pizzeria" or "Bob's Pizza", etc. It also allows you to search "in the middle" of a text column (char, varchar, nchar, nvarchar, etc.) without the dreaded "%Bob%Pizza%" that eliminates any chance of using a traditional index.
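As an illustration of the kind of query full-text search enables without a leading-wildcard LIKE (the table and column names here are placeholders, not from the question):
SELECT *
FROM dbo.SomeTable
WHERE CONTAINS((col1, col2, col3), '"Bob" AND "Pizza"');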
Enough with the lecture, however. To answer your specific question, I would create a separate column (not a "computed column") "IdText varchar(10)" and then an AFTER INSERT trigger something like this:
-- Trigger name assumed; table name taken from the question
CREATE TRIGGER trg_Person_IdText ON dbo.Person AFTER INSERT AS
UPDATE t
SET IdText = CAST(Id AS varchar(10))
FROM dbo.Person AS t
INNER JOIN inserted i ON i.Id = t.Id;
If you don't know what "inserted" is, see this MSDN article or search Stack Overflow for "trigger inserted". Then you can include the IdText column in your full-text index.
Again, from your example I am not sure that a full-text index is what you should use here but then again your actual situation might be something different and you just created this example for the question. Also, full-text indexes are relatively "expensive" so make sure you do a cost-benefit analysis. Here is a Stack Overflow question about full-text index usage.
Why not just do a query like this?
SELECT *
FROM [TestDB].[dbo].[Person]
WHERE FREETEXT (*, '1' )
OR ID = 1
You can leave off the "OR ID = 1" part by first checking whether the search term is a number.
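One way to do that check in T-SQL (a sketch; the variable name is an assumption):
DECLARE @term nvarchar(100) = N'1';
IF ISNUMERIC(@term) = 1
SELECT * FROM [TestDB].[dbo].[Person]
WHERE FREETEXT (*, @term) OR Id = CAST(@term AS int);
ELSE
SELECT * FROM [TestDB].[dbo].[Person]
WHERE FREETEXT (*, @term);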

How to implement a Keyword Search in MySQL?

I am new to SQL programming.
I have a table job where the fields are id, position, category, location, salary range, description, refno.
I want to implement a keyword search from the front end. The keyword can reside in any of the fields of the above table.
This is the query I have tried, but it returns so many duplicate rows:
SELECT
a.*,
b.catname
FROM
job a,
category b
WHERE
a.catid = b.catid AND
a.jobsalrange = '15001-20000' AND
a.jobloc = 'Berkshire' AND
a.jobpos LIKE '%sales%' OR
a.jobloc LIKE '%sales%' OR
a.jobsal LIKE '%sales%' OR
a.jobref LIKE '%sales%' OR
a.jobemail LIKE '%sales%' OR
a.jobsalrange LIKE '%sales%' OR
b.catname LIKE '%sales%'
For a single keyword on VARCHAR fields you can use LIKE:
SELECT id, category, location
FROM table
WHERE
(
category LIKE '%keyword%'
OR location LIKE '%keyword%'
)
For a description you're usually better adding a full text index and doing a Full-Text Search (MyISAM only):
SELECT id, description
FROM table
WHERE MATCH (description) AGAINST('keyword1 keyword2')
SELECT
*
FROM
yourtable
WHERE
id LIKE '%keyword%'
OR position LIKE '%keyword%'
OR category LIKE '%keyword%'
OR location LIKE '%keyword%'
OR description LIKE '%keyword%'
OR refno LIKE '%keyword%';
Ideally, have a keyword table containing the fields:
Keyword
Id
Count (possibly)
with an index on Keyword. Create an insert/update/delete trigger on the other table so that, when a row is changed, every keyword is extracted and put into (or replaced in) this table.
You'll also need a table of words to not count as keywords (if, and, so, but, ...).
In this way, you'll get the best speed for queries wanting to look for the keywords and you can implement (relatively easily) more complex queries such as "contains Java and RCA1802".
"LIKE" queries will work but they won't scale as well.
Personally, I wouldn't use the LIKE string comparison on the ID field or any other numeric field. It doesn't make sense for a search for ID# "216" to return 16216, 21651, 3216087, 5321668..., and so on and so forth; likewise with salary.
Also, if you want to use prepared statements to prevent SQL injections, you would use a query string like:
SELECT * FROM job WHERE `position` LIKE CONCAT('%', ? ,'%') OR ...
I will explain the method I usually prefer.
First of all, you need to take into consideration that for this method you will sacrifice memory with the aim of gaining computation speed.
Second, you need to have the right to edit the table structure.
1) Add a field (I usually call it "digest") where you store all the data from the table.
The field will look like:
"n-n1-n2-n3-n4-n5-n6-n7-n8-n9" etc., where n is a single word.
I achieve this using a regular expression that replaces " " with "-".
This field is the result of all the table data "digested" into one single string (a sketch of this step follows after the example queries below).
2) Use a LIKE '%keyword%' statement on the digest field:
SELECT * FROM table WHERE digest LIKE '%keyword%'
You can even build a query with a little loop so you can search for multiple keywords at the same time, looking like:
SELECT * FROM table WHERE
digest LIKE '%keyword1%' AND
digest LIKE '%keyword2%' AND
digest LIKE '%keyword3%' ...
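A sketch of step 1 above against the job table from the question (using the column names that appear in the question's query; the digest size is an assumption):
ALTER TABLE job ADD COLUMN digest VARCHAR(1000);
UPDATE job
SET digest = REPLACE(CONCAT_WS(' ', jobpos, jobloc, jobsal, jobref, jobsalrange), ' ', '-');
-- then search as in step 2
SELECT * FROM job WHERE digest LIKE '%keyword%';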
You can find another, simpler option in a thread here: Match Against, with more detailed help in 11.9.2. Boolean Full-Text Searches.
This is just in case someone needs a more compact option. It requires creating a FULLTEXT index on the table, which can be accomplished easily.
Information on how to create Indexes (MySQL): MySQL FULLTEXT Indexing and Searching
In the FULLTEXT index you can have more than one column listed; the result would be an SQL statement using an index named search:
SELECT *, MATCH (`column`) AGAINST('+keyword1* +keyword2* +keyword3*') AS relevance
FROM `documents` USE INDEX(search)
WHERE MATCH (`column`) AGAINST('+keyword1* +keyword2* +keyword3*' IN BOOLEAN MODE)
ORDER BY relevance;
I tried with multiple columns, with no luck. Even though multiple columns are allowed in indexes, you still need an index for each column to use with the MATCH/AGAINST statement.
Depending on your criteria, you can use either option.
I know this is a bit late, but this is what I did for our application. Hope this helps someone; it works for me:
SELECT * FROM `landmarks`
WHERE `landmark_name` LIKE '%keyword%'
OR `landmark_description` LIKE '%keyword%'
OR `landmark_address` LIKE '%keyword%';