FTS doesn't work as expected with emails with dots - sql

We're developing a search as a part of a bigger system.
We have Microsoft SQL Server 2014 - 12.0.2000.8 (X64) Standard Edition (64-bit) with this setup:
CREATE TABLE NewCompanies(
[Id] [uniqueidentifier] NOT NULL,
[Name] [nvarchar](400) NOT NULL,
[Phone] [nvarchar](max) NULL,
[Email] [nvarchar](max) NULL,
[Contacts1] [nvarchar](max) NULL,
[Contacts2] [nvarchar](max) NULL,
[Contacts3] [nvarchar](max) NULL,
[Contacts4] [nvarchar](max) NULL,
[Address] [nvarchar](max) NULL,
CONSTRAINT PK_Id PRIMARY KEY (Id)
);
Phone is a structured comma separated digits string like
"77777777777, 88888888888"
Email is structured emails string with commas like
"email1#gmail.com, email2#gmail.com" (or without commas at all like
"email1#gmail.com")
Contacts1, Contacts2, Contacts3, Contacts4 are text fields where users can specify contact details in free form. Like "John Smith +1 202 555 0156" or "Bob, +1-999-888-0156, bob#company.com". These fields can contain emails and phones we want to search further.
Here we create full-text stuff
-- FULL TEXT SEARCH
CREATE FULLTEXT CATALOG NewCompanySearch AS DEFAULT;
CREATE FULLTEXT INDEX ON NewCompanies(Name, Phone, Email, Contacts1, Contacts2, Contacts3, Contacts4, Address)
KEY INDEX PK_Id
Here is a data sample
INSERT INTO NewCompanies(Id, Name, Phone, Email, Contacts1, Contacts2, Contacts3, Contacts4)
VALUES ('7BA05F18-1337-4AFB-80D9-00001A777E4F', 'PJSC Azimuth', '79001002030, 78005005044', 'regular#hotmail.com, s.m.s#gmail.com', 'John Smith', 'Call only at weekends +7-999-666-22-11', NULL, NULL)
Actually we have about 100 thousands of such records.
We expect users can specify a part of email like "#gmail.com" and this should return all the rows with Gmail email addresses in any of Email, Contacts1, Contacts2, Contacts3, Contacts4 fields.
The same for phone numbers. Users can search for a pattern like "70283" and a query should return phones with these digits in them. It's even for free form Contacts1, Contacts2, Contacts3, Contacts4 fields where we probably should remove all but digits and space characters firstly before searching.
We used to use LIKE for the search when we had about 1500 records and it worked fine but now we have a lot of records and the LIKE search takes infinite to get results.
This is how we try to get data from there:
SELECT * FROM NewCompanies WHERE CONTAINS((Email, Contacts1, Contacts2, Contacts3, Contacts4), '"s.m.s#gmail.com*"') -- this doesn't get the row
SELECT * FROM NewCompanies WHERE CONTAINS((Phone, Contacts1, Contacts2, Contacts3, Contacts4), '"6662211*"') -- doesn't get anything
SELECT * FROM NewCompanies WHERE CONTAINS(Name, '"zimuth*"') -- doesn't get anything

Actually requests
SELECT [...] CONTAINS([...], '"6662211*"') -- doesn't get anything
against 'Call only at weekends +7-999-666-22-11'
and
SELECT [...] CONTAINS(Name, '"zimuth*"') -- doesn't get anything
against 'PJSC Azimuth'
do work as expected.
See Prefix Term. Because 6662211* is not a prefix of +7-999-666-22-11 as well as zimuth* is not a prefix of Azimuth
As for
SELECT [...] CONTAINS([...], '"s.m.s#gmail.com*"') -- this doesn't get the row
This is probably due to word breakers as alwayslearning pointed out in comments. See word-breakers
I don't think that Full-Text Search is applicable for your task.
Why use for FTS in the exact same tasks that LIKE operator is used for? If there were a better index type for LIKE queries... then there would be the better index type, not the totally different technology and syntax.
And in no way it will help you to match "6662211*" against "666some arbitrary char22some arbitrary char11".
Full Text search is not about regex-es (and "6662211*" is not even a correct expression for the job - there is nothing about "some arbitrary char" part) it's about synonyms, word forms, etc.
But is it at all possible to search for substrings effectively?
Yes it is. Leaving aside such prospects as writing your own search engine, what can we do within SQL?
First of all - it is an imperative to cleanup your data!
If you want to return to the users the exact strings they have entered
users can specify contact details in free form
...you can save them as is... and leave them along.
Then you need to extract data from the free form text (it is not so hard for emails and phone numbers) and save the data in some canonical form.
For email, the only thing you really need to do - make them all lowercase or uppercase (doesn't matter), and maybe split then on the # sing. But in phone numbers you need to leave only digits
(...And then you can even store them as numbers. That can save you some space and time. But the search will be different... For now let's dive into a more simple and universal solution using strings.)
As MatthewBaker mentioned you can create a table of suffixes.
Then you can search like so
SELECT DISTINCT * FROM NewCompanies JOIN Sufficies ON NewCompanies.Id = Sufficies.Id WHERE Sufficies.sufficies LIKE 'some text%'
You should place the wildcard % only at the end. Or there would be no benefits from the Suffixes table.
Let take for example a phone number
+7-999-666-22-11
After we get rid of waste chars in it, it will have 11 digits. That means we'll need 11 suffixes for one phone number
1
11
211
2211
62211
662211
6662211
96662211
996662211
9996662211
79996662211
So the space complexity for this solution is linear... not so bad, I'd say... But wait it's complexity in the number of records. But in symbols... we need N(N+1)/2 symbols to store all the suffixes - that is quadratic complexity... not good... but if you have now 100 000 records and do not have plans for millions in the near future - you can go with this solution.
Can we reduce space complexity?
I will only describe the idea, implementing it will take some effort. And probably we'll need to cross the boundaries of SQL
Let's say you have 2 rows in NewCompanies and 2 strings of free form text in it:
aaaaa
11111
How big should the Suffixes table be? Obviously, we need only 2 records.
Let's take another example. Also 2 rows, 2 free text strings to search for. But now it's:
aa11aa
cc11cc
Let's see how many suffixes do we need now:
a // no need, LIKE `a%` will match against 'aa' and 'a11aa' and 'aa11aa'
aa // no need, LIKE `aa%` will match against 'aa11aa'
1aa
11aa
a11aa
aa11aa
c // no need, LIKE `c%` will match against 'cc' and 'c11cc' and 'cc11cc'
cc // no need, LIKE `cc%` will match against 'cc11cc'
1cc
11cc
c11cc
cc11cc
No so bad, but not so good either.
What else can we do?
Let's say, user enters "c11" in the search field. Then LIKE 'c11%' needs 'c11cc' suffix to succeed. But if instead of searching for "c11" we first search for "c%", then for "c1%" and so on? The first search will give as only one row from NewCompanies. And there would be no need for subsequent searches. And we can
1aa // drop this as well, because LIKE '1%' matches '11aa'
11aa
a11aa // drop this as well, because LIKE 'a%' matches 'aa11aa'
aa11aa
1cc // same here
11cc
c11cc // same here
cc11cc
and we end up with only 4 suffixes
11aa
aa11aa
11cc
cc11cc
I can't say what the space complexity would be in this case, but it feels like it would be acceptable.

In cases like this full text searching is less than ideal. I was in the same boat as you are. Like searches are too slow, and full text searches search for words that start with a term rather than contains a term.
We tried several solutions, one pure SQL option is to build your own version of full text search, in particular an inverted index search. We tried this, and it was successful, but took a lot of space. We created a secondary holding table for partial search terms, and used full text indexing on that. However this mean we repeatedly stored multiple copies of the same thing. For example we stored "longword" as Longword, ongword, ngword, gword.... etc. So any contained phrase would always be at the start of the indexed term. A horrendous solution, full of flaws, but it worked.
We then looked at hosting a separate server for lookups. Googling Lucene and elastisearch will give you good information on these off the shelf packages.
Eventually, we developed our own in house search engine, which runs along side SQL. This has allowed us to implement phonetic searches (double metaphone) and then using levenshtein calculations along side soundex to establish relevance. Overkill for a lot of solutions, but worth the effort in our use case. We even now have an option of leveraging Nvidia GPUs for cuda searches, but this represented a whole new set of headaches and sleepless nights. Relevance of all these will depend on how often you see your searches being performed, and how reactive you need them to be.

Full-Text Indexes have a number of limitations. You can use wildcards on words that the index finds are whole "parts" but even then you are constrained to the ending part of the word. That is why you can use CONTAINS(Name, '"Azimut*"') but not CONTAINS(Name, '"zimuth*"')
From the Microsoft documentation:
When the prefix term is a phrase, each token making up the phrase is
considered a separate prefix term. All rows that have words beginning
with the prefix terms will be returned. For example, the prefix term
"light bread*" will find rows with text of "light breaded," "lightly
breaded," or "light bread," but it will not return "lightly toasted
bread."
The dots in the email, as indicated by the title, are not the main issue. This, for example, works:
SELECT * FROM NewCompanies
WHERE CONTAINS((Email, Contacts1, Contacts2, Contacts3, Contacts4), 's.m.s#gmail.com')
In this case, the index identifies the whole email string as valid, as well as "gmail" and "gmail.com." Just "s.m.s" though is not valid.
The last example is similar. The parts of the phone number are indexed (666-22-11 and 999-666-22-11 for example), but removing the hyphens is not a string that the index is going to know about. Otherwise, this does work:
SELECT * FROM NewCompanies
WHERE CONTAINS((Phone, Contacts1, Contacts2, Contacts3, Contacts4), '"666-22-11*"')

Related

Optimising LIKE expressions that start with wildcards

I have a table in a SQL Server database with an address field (ex. 1 Farnham Road, Guildford, Surrey, GU2XFF) which I want to search with a wildcard before and after the search string.
SELECT *
FROM Table
WHERE Address_Field LIKE '%nham%'
I have around 2 million records in this table and I'm finding that queries take anywhere from 5-10s, which isn't ideal. I believe this is because of the preceding wildcard.
I think I'm right in saying that any indexes won't be used for seek operations because of the preceeding wildcard.
Using full text searching and CONTAINS isn't possible because I want to search for the latter parts of words (I know that you could replace the search string for Guil* in the below query and this would return results). Certainly running the following returns no results
SELECT *
FROM Table
WHERE CONTAINS(Address_Field, '"nham"')
Is there any way to optimise queries with preceding wildcards?
Here is one (not really recommended) solution.
Create a table AddressSubstrings. This table would have multiple rows per address and the primary key of table.
When you insert an address into table, insert substrings starting from each position. So, if you want to insert 'abcd', then you would insert:
abcd
bcd
cd
d
along with the unique id of the row in Table. (This can all be done using a trigger.)
Create an index on AddressSubstrings(AddressSubstring).
Then you can phrase your query as:
SELECT *
FROM Table t JOIN
AddressSubstrings ads
ON t.table_id = ads.table_id
WHERE ads.AddressSubstring LIKE 'nham%';
Now there will be a matching row starting with nham. So, like should make use of an index (and a full text index also works).
If you are interesting in the right way to handle this problem, a reasonable place to start is the Postgres documentation. This uses a method similar to the above, but using n-grams. The only problem with n-grams for your particular problem is that they require re-writing the comparison as well as changing the storing.
I can't offer a complete solution to this difficult problem.
But if you're looking to create a suffix search capability, in which, for example, you'd be able to find the row containing HWilson with ilson and the row containing ABC123000654 with 654, here's a suggestion.
WHERE REVERSE(textcolumn) LIKE REVERSE('ilson') + '%'
Of course this isn't sargable the way I wrote it here. But many modern DBMSs, including recent versions of SQL server, allow the definition, and indexing, of computed or virtual columns.
I've deployed this technique, to the delight of end users, in a health-care system with lots of record IDs like ABC123000654.
Not without a serious preparation effort, hwilson1.
At the risk of repeating the obvious - any search path optimisation - leading to the decision whether an index is used, or which type of join operator to use, etc. (independently of which DBMS we're talking about) - works on equality (equal to) or range checking (greater-than and less-than).
With leading wildcards, you're out of luck.
The workaround is a serious preparation effort, as stated up front:
It would boil down to Vertica's text search feature, where that problem is solved. See here:
https://my.vertica.com/docs/8.0.x/HTML/index.htm#Authoring/AdministratorsGuide/Tables/TextSearch/UsingTextSearch.htm
For any other database platform, including MS SQL, you'll have to do that manually.
In a nutshell: It relies on a primary key or unique identifier of the table whose text search you want to optimise.
You create an auxiliary table, whose primary key is the primary key of your base table, plus a sequence number, and a VARCHAR column that will contain a series of substrings of the base table's string you initially searched using wildcards. In an over-simplified way:
If your input table (just showing the columns that matter) is this:
id |the_search_col |other_col
42|The Restaurant at the End of the Universe|Arthur Dent
43|The Hitch-Hiker's Guide to the Galaxy |Ford Prefect
Your auxiliary search table could contain:
id |seq|search_token
42| 1|Restaurant
42| 2|End
42| 3|Universe
43| 1|Hitch-Hiker
43| 2|Guide
43| 3|Galaxy
Normally, you suppress typical "fillers" like articles and prepositions and apostrophe-s , and split into tokens separated by punctuation and white space. For your '%nham%' example, however, you'd probably need to talk to a linguist who has specialised in English morphology to find splitting token candidates .... :-]
You could start by the same technique that I use when I un-pivot a horizontal series of measures without the PIVOT clause, like here:
Pivot sql convert rows to columns
Then, use a combination of, probably nested, CHARINDEX() and SUBSTRING() using the index you get from the CROSS JOIN with a series of index integers as described in my post suggested above, and use that very index as the sequence for the auxiliary search table.
Lay an index on search_token and you'll have a very fast access path to a big table.
Not a stroll in the park, I agree, but promising ...
Happy playing -
Marco the Sane

SQL Server full text search and spaces

I have a column with a product names. Some names look like ‘ab-cd’ ‘ab cd’
Is it possible to use full text search to get these names when user types ‘abc’ (without spaces) ? The like operator is working for me, but I’d like to know if it’s possible to use full text search.
If you want to use FTS to find terms that are adjacent to each other, like words separated by a space you should use a proximity term.
You can define a proximity term by using the NEAR keyword or the ~ operator in the search expression, as documented here.
So if you want to find ab followed immediately by cd you could use the expression,
'NEAR((ab,cd), 0)'
searching for the word ab followed by the word cd with 0 terms in-between.
No, unfortunately you cannot make such search via full-text. You can only use LIKE in that case LIKE ('ab%c%')
EDIT1:
You can create a view (WITH SCHEMABINDING!) with some id and column name in which you want to search:
CREATE VIEW dbo.ftview WITH SCHEMABINDING
AS
SELECT id,
REPLACE(columnname,' ','') as search_string
FROM YourTable
Then create index
CREATE UNIQUE CLUSTERED INDEX UCI_ftview ON dbo.ftview (id ASC)
Then create full-text search index on search_string field.
After that you can run CONTAINS query with "abc*" search and it will find what you need.
EDIT2:
But it wont help if search_string does not start with your search term.
For example:
ab c d -> abcd and you search cd
No. Full Text Search is based on WORDS and Phrases. It does not store the original text. In fact, depending on configuration it will not even store all words - there are so called stop words that never go into the index. Example: in english the word "in" is not selective enough to be considered worth storing.
Some names look like ‘ab-cd’ ‘ab cd’
Those likely do not get stored at all. At least the 2nd example is actually 2 extremely short words - quite likely they get totally ignored.
So, no - full text search is not suitable for this.

How to tackle efficient searching of a string that could have multiple variations?

My title sounds complicated, but the situation is very simple. People search on my site using a term such as "blackfriday".
When they conduct the search, my SQL code needs to look in various places such as a ProductTitle and ProductDescription field to find this term. For example:
SELECT *
FROM dbo.Products
WHERE ProductTitle LIKE '%blackfriday%' OR
ProductDescription LIKE '%blackfriday%'
However, the term appears differently in the database fields. It is most like to appear with a space between the words as such "Black Friday USA 2015". So without going through and adding more combinations to the WHERE clause such as WHERE ProductTitle LIKE '%Black-Friday%', is there a better way to accomplish this kind of fuzzy searching?
I have full-text search enabled on the above fields but its really not that good when I use the CONTAINS clause. And of course other terms may not be as neat as this example.
I should start by saying that "variations (of a string)" is a bit vague. You could mean plurality, verb tenses, synonyms, and/or combined words (or, ignoring spaces and punctuation between 2 words) like the example you posted: "blackfriday" vs. "black friday" vs "black-friday". I have a few solutions of which 1 or more together may work for you depending on your use case.
Ignoring punctuation
Full Text searches already ignore punctuation and match them to spaces. So black-friday will match black friday whether using FREETEXT or CONTAINS. But it won't match blackfriday.
Synonyms and combined words
Using FREETEXT or FREETEXTTABLE for your full text search is a good way to handle synonyms and some matching of combined words (I don't know which ones). You can customize the thesaurus to add more combined words assuming it's practical for you to come up with such a list.
Handling combinations of any 2 words
Maybe your use case calls for you to match poorly formatted text or hashtags. In that case I have a couple of ideas:
Write the full text query to cover each combination of words using a dictionary. For example your data layer can rewrite a search for black friday as CONTAINS(*, '"black friday" OR "blackfriday"'). This may have to get complex, for example would black friday treehouse have to be ("black friday" OR "blackfriday") AND ("treehouse" OR "tree house")? You would need a dictionary to figure out that "treehouse" is made up of 2 words and thus can be split.
If it's not practical to use a dictionary for the words being searched for (I don't know why, maybe acronyms or new memes) you could create a long query to cover every letter combination. So searching for do-re-mi could be "do re mi" OR "doremi" OR "do remi" OR "dore mi" OR "d oremi" OR "d o remi" .... Yes it will be a lot of combinations, but surprisingly it may run quickly because of how full text efficiently looks up words in the index.
A hack / workaround if searching for multiple variations is very important.
Define which fields in the DB are searchable (e.g ProductTitle, ProductDescription)
Before saving these fields in the DB, replace each space (or consecutive spaces by a placeholder e.g "%")
Search the DB for variation matches employing the placeholder
Do the reverse process when displaying these fields on your site (i.e replace placeholder with space)
Alternatively you can enable regex matching for your users (meaning they can define a regex either explicitly or let your app build one from their search term). But it is slower and probably error-prone to do it this way
After looking into everything, I have settled for using SQL's FREETEXT full-text search. Its not ideal, or accurate, but for now it will have to do.
My answer is probably inadequate but do you have any scenarios which wont be addressed by query below.
SELECT *
FROM dbo.Products
WHERE ProductTitle LIKE '%black%friday%' OR
ProductDescription LIKE '%black%friday%'

Strange issue with SQL contains - ignoring starting characters of a string

I am experiencing a strange issue with the sql full text indexing. Basically i am searching a column which is used to house email addresses. Seems to be working as expected for all cases i tested except one!
SELECT *
FROM Table
WHERE CONTAINS(Email, '"email#me.com"')
For a certain email address it is completely ignoring the "email" part above and is instead doing
SELECT *
FROM Table
WHERE CONTAINS(Email, '#me.com')
There was only one case that i could find that this was happening for. I repopulated the index, but no joy. Also rebuilt the catalog.
Any ideas??
Edit:
I cannot put someone's email address on a public website, so I will give more appropriate examples. The one that is causing the issue is of the form:
a.b.c#somedomain.net.au
When i do
WHERE CONTAINS(Email, "'a.b.c#somedomain.net.au"')
The matching rows which are returned are all of the form .*#somedomain.net.au. I.e. it is ignoring the a.b.c part.
Full stops are treated as noise words (or stopwords) in a fulltext index, you can find a list of the excluded characters by checking the system stopwords:
SELECT * FROM sys.fulltext_system_stopwords WHERE language_id = 2057 --this is the lang Id for British English (change accordingly)
So your email address which is "a.b.c#somedomain.net.au" is actually treated as "a b c#somedomain.net.au" and in this particular case as individual letters are also excluded from the index you end up searching on "#somedomain.net.au"
You really have two choices, you can either replace the character you want to include before indexing (so replace the special characters with a match tag) or you remove the words/character you which to include from the Full Text Stoplist.
NT// If you choose the latter I would be careful as this can bloat your index significantly.
Here are some links that should help you :
Configure and Manage Stopwords and Stoplists for Full-Text Search
Create Full Text Stoplists

FREETEXT queries in SQL Server 2008 not phrase matching

I have a full text indexed table in SQL Server 2008 that I am trying to query for an exact phrase match using FULLTEXT. I don't believe using CONTAINS or LIKE is appropriate for this, because in other cases the query might not be exact (user doesn't surround phrase in double quotes) and in general I want to flexibility of FREETEXT.
According to the documentation[MSDN] for FREETEXT:
If freetext_string is enclosed in double quotation marks, a phrase match is instead performed; stemming and thesaurus are not performed.
which would lead me to believe a query like this:
SELECT Description
FROM Projects
WHERE FREETEXT(Description, '"City Hall"')
would only return results where the term "City Hall" appears in the Description field, but instead I get results like this:
1 Design of handicap ramp at Manning Hall.
2 Antenna investigation. Client: City of Cranston Engineering Dept.
3 Structural investigation regarding fire damage to International Tennis Hall of Fame.
4 Investigation Roof investigation for proposed satellite design on Herald Hall.
... etc
Obviously those results include at least one of the words in my phrase, but not the phrase itself. What's worse, I had thought the results would be ranked but the two results I actually wanted (because they include the actual phrase) are buried.
SELECT Description
FROM Projects
WHERE Description LIKE '%City Hall%'
1 Major exterior and interior renovation of the existing city hall for Quincy Massachusetts
2 Cursory structural investigation of Pawtucket City Hall tower plagued by leaks.
I'm sure this is a case of me not understanding the documentation, but is there a way to achieve what I'm looking for? Namely, to be able to pass in a search string without quotes and get exactly what I'm getting now or with quotes and get only that exact phrase?
As you said, FREETEXT looks up every word in your phrase, not the phrase as an all. For that you need to use the CONTAINS statement. Like this:
SELECT Description
FROM Projects
WHERE CONTAINS(Description, '"City Hall"')
If you want to get the rank of the results, you have to use CONTAINSTABLE. It works roughly the same, but it returns a table with two columns: [Key] wich contains the primary key of the search table and [Rank], which gives you the rank of the result.