CONTAINSTABLE adding 'of*' returns no results - sql

I have a CONTAINSTABLE query on SQL Server 2008:
SELECT contacts.*, [Rank] FROM
CONTAINSTABLE(Contacts, SearchName, '("department*") AND ("work*")') tmp
JOIN contacts on contacts.contactid = tmp.[key]
WHERE contacts.deleted = 0
This returns 1 result as expected, however if the user has entered "of" in their search criteria the query returns no results:
SELECT contacts.*, [Rank] FROM
CONTAINSTABLE(Contacts, SearchName, '("department*") AND ("of*") AND ("work*")') tmp
JOIN contacts on contacts.contactid = tmp.[key]
WHERE contacts.deleted = 0
The full name of the contact record is "department of work and pensions".
The same happens if the user includes "and" in their search. Why are these words breaking the query, and is there a way around it, or do I have to strip out the words before executing the search?

You need to learn about stop words. These are explained well in the documentation.
The short explanation, though, is that full-text engines keep a list of words that are not indexed. Prominent among these are words like "of", "the", and similar words that carry little content. You can configure the server to index these words, and this is very important in some applications: stop words happen to be very useful, for example, when trying to determine the language of a document.
In any case, the word "of" is in the stop word list. So it is not indexed and you cannot find it using CONTAINSTABLE. If you need to search for it, you can implement your own custom stop word list and rebuild the index.
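If changing the stoplist is not an option, the alternative raised in the question (stripping the words before executing the search) can be sketched on the application side. This is a minimal sketch, not the server's own logic: the `STOP_WORDS` set is a small hypothetical sample rather than the real system stoplist, and `build_search_condition` is an illustrative helper name.

```python
# Hypothetical sample of stop words; the authoritative list lives in the
# server's full-text stoplist and differs per language.
STOP_WORDS = {"of", "and", "the", "in", "a", "an", "to"}


def build_search_condition(user_input, stop_words=STOP_WORDS):
    """Build a CONTAINSTABLE condition of prefix terms, skipping stop words."""
    terms = []
    for word in user_input.split():
        # Drop quote characters so the term cannot break out of its quotes.
        cleaned = word.replace('"', "").replace("'", "")
        if cleaned and cleaned.lower() not in stop_words:
            terms.append(f'("{cleaned}*")')
    return " AND ".join(terms)


condition = build_search_condition("department of work")
print(condition)  # ("department*") AND ("work*")
```

The resulting string would be passed to CONTAINSTABLE as a query parameter rather than concatenated into the SQL text. An empty result means the input held only stop words, so the caller should skip the full-text query in that case.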

Related

SQL Server full-text search for Latex content

I have a web app that allows users to save LaTeX content to a SQL Server 2012 database. I am running the full-text query below to search for a LaTeX expression.
SELECT MessageID, Message FROM Messages m WHERE CONTAINS (m.Message, N'2x-4=0');
The problem is that some of the messages returned by this query do not contain the LaTeX expression 2x-4=0. For example, a message with the saved value below is also returned, although it clearly does not contain 2x-4=0.
<p>Another example of inline Latex is \$x=34\$.</p>
<p>What are the roots of following equation: \$x^2 - 2x + 1 = 0\$?</p>
Question
Why is this happening and is there a way to get correct records returned when doing a full text search to look for the latex expression 2x-4 = 0? I have tried to repopulate the full text search data for the table being used, but it had no effect.
UPDATE 1
Strange, but the following Latex expression filter always returns exact matching results. I am now looking for $2x-4=0$ rather than 2x-4=0.
SELECT MessageID, Message FROM Messages m WHERE CONTAINS (m.Message, N'$2x-4=0$');
I have two types of delimiters for latex expression in my app: $$ for paragraph display and \$ for inline display of Latex expression, and therefore there will always be a $ symbol surrounding the latex expression stored in database, though the trailing delimiter could be \$ but full-text search seems to ignore the backslash character.
Why this modified query returns exact matches is not clear to me.
UPDATE 2
Another approach that works accurately is as mentioned in the answer. The full query for this is mentioned below. So, the LIKE operator ends up scanning only those rows that are selected by full-text search query.
WITH x AS
(
    SELECT MessageID, Message
    FROM Messages m
    WHERE CONTAINS (m.Message, N'2x-4=0')
)
SELECT MessageID, Message
FROM x
WHERE x.Message LIKE '%2x-4=0%'
To understand why it happens you can run the following query (1033 is the English language id):
select * from sys.dm_fts_parser('2x-4=0', 1033, 0,1)
In my instance it would return the following results:
Note, all other parts of the search criteria are considered to be noise words except for 2x. Therefore, I suspect your full text index simply does not have the full 2x-4=0 string and instead you get results with occurrences of 2x.
I tried adding 2x-4=0 to my own FTS index and CONTAINS was able to find it as the top result for both CONTAINS(col, '2x-4=0') and CONTAINS(col, '"2x-4=0"'). However, partial matches were included too right after the exact match.
Note that when extra white space is added around = in the search term, the FTS parser rejects it and reports a syntax error.
CONTAINS is more like an end-user search operation, with support for keywords like NEAR, AND and OR. Try adding quotes within the quotes, to force the exact search term:
SELECT MessageID, Message FROM Messages m WHERE CONTAINS (m.Message, N'"2x-4=0"');
This is called <simple-term> in the documentation.
You can also try the LIKE operator:
SELECT MessageID, Message FROM Messages m WHERE m.Message LIKE '%2x-4=0%';
But note that this is probably slower than CONTAINS because it doesn't use a full text search index. If it's too slow, maybe you can even combine both of them in one query, so the CONTAINS is used to filter the result set down to the non-noise words using the index, and then LIKE applies the final matching.

Search multiple keywords in DocumentDB collection

I have an Azure DocumentDB collection with 100 documents. I have tokenized an array of search terms in each document for performing a search based on keywords.
I was able to search on just one keyword using below SQL query for DocumentDB:
SELECT VALUE c FROM root c JOIN word IN c.tags WHERE
CONTAINS(LOWER(word), LOWER('keyword'))
However, this only allows search based on single keyword. I want to be able to search given multiple keywords. For this, I tried below query:
SELECT * FROM c WHERE ARRAY_CONTAINS(c.tags, "Food") OR
ARRAY_CONTAINS(c.tags, "Dessert") OR ARRAY_CONTAINS(c.tags, "Spicy")
This works, but is case-sensitive. How do I make this case-insensitive? I tried using scalar function LOWER like this
LOWER(c.tags), LOWER("Dessert")
but this doesn't seem to work with ARRAY_CONTAINS.
Any idea how I can perform a case-insensitive search on multiple keywords using SQL query for DocumentDB?
Thanks,
AB
The best way to deal with the case sensitivity is to store them in the tags array with all lower case (or upper case) and then just do LOWER(<user-input-tag>) at query time.
As for your desire to search on multiple user input tags, your approach of building a series of OR clauses is probably the best approach.
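Both suggestions can be combined when building the query on the client. The following is a minimal sketch, assuming the tags are already stored in lower case in each document; `build_tag_query` and the `@tag0`, `@tag1`, ... parameter names are illustrative choices, not part of any SDK.

```python
def build_tag_query(user_tags):
    """Build a parameterized query spec with one ARRAY_CONTAINS clause per tag."""
    if not user_tags:
        raise ValueError("at least one tag is required")
    clauses = []
    parameters = []
    for i, tag in enumerate(user_tags):
        name = f"@tag{i}"
        clauses.append(f"ARRAY_CONTAINS(c.tags, {name})")
        # Lower-case the user input to match the lower-cased stored tags.
        parameters.append({"name": name, "value": tag.lower()})
    query = "SELECT * FROM c WHERE " + " OR ".join(clauses)
    return {"query": query, "parameters": parameters}


spec = build_tag_query(["Food", "Dessert", "Spicy"])
print(spec["query"])
print(spec["parameters"])
```

Using parameters instead of string concatenation keeps user input out of the query text, and the number of OR clauses grows with the number of tags the user enters.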

SQL Server full text search and spaces

I have a column with product names. Some names look like ‘ab-cd’ or ‘ab cd’.
Is it possible to use full text search to get these names when the user types ‘abc’ (without spaces)? The LIKE operator is working for me, but I’d like to know if it’s possible to use full text search.
If you want to use FTS to find terms that are adjacent to each other, like words separated by a space you should use a proximity term.
You can define a proximity term by using the NEAR keyword or the ~ operator in the search expression, as documented here.
So if you want to find ab followed immediately by cd you could use the expression,
'NEAR((ab,cd), 0)'
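which searches for the words ab and cd with 0 terms in between. By default the terms may appear in either order; to require that ab comes first, pass TRUE as the optional third argument (match_order), as in 'NEAR((ab,cd), 0, TRUE)'.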
No, unfortunately you cannot do such a search via full-text. In that case you can only use LIKE, for example LIKE ('ab%c%').
EDIT1:
You can create a view (WITH SCHEMABINDING!) with some id and column name in which you want to search:
CREATE VIEW dbo.ftview WITH SCHEMABINDING
AS
SELECT id,
       REPLACE(columnname, ' ', '') AS search_string
FROM dbo.YourTable -- schema-bound views require two-part object names
Then create index
CREATE UNIQUE CLUSTERED INDEX UCI_ftview ON dbo.ftview (id ASC)
Then create full-text search index on search_string field.
After that you can run CONTAINS query with "abc*" search and it will find what you need.
EDIT2:
But it won't help if search_string does not start with your search term.
For example:
ab c d -> abcd and you search cd
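The idea behind the view, and the limitation noted in EDIT2, can be illustrated outside the database. This is only an approximation of what the view plus an "abc*" prefix term would do: it ignores the full-text word breaker entirely, and stripping hyphens as well as spaces is an added assumption (the view above removes spaces only). The helper names are illustrative.

```python
import re


def normalize(name):
    """Remove spaces and hyphens and lower-case, e.g. 'ab-cd' -> 'abcd'."""
    return re.sub(r"[\s\-]+", "", name).lower()


def prefix_matches(names, term):
    """Return the names whose normalized form starts with the normalized term."""
    needle = normalize(term)
    return [n for n in names if normalize(n).startswith(needle)]


names = ["ab-cd", "ab cd", "ab c d", "xy ab-cd"]
print(prefix_matches(names, "abc"))  # ['ab-cd', 'ab cd', 'ab c d']
print(prefix_matches(names, "cd"))   # []
```

The second call shows the EDIT2 caveat: a prefix search only finds values whose normalized string begins with the search term, so 'cd' matches nothing even though every name contains it.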
No. Full Text Search is based on words and phrases. It does not store the original text. In fact, depending on configuration, it will not even store all words: there are so-called stop words that never go into the index. Example: in English the word "in" is not selective enough to be considered worth storing.
Some names look like ‘ab-cd’ ‘ab cd’
Those likely do not get stored at all. At least the 2nd example is actually 2 extremely short words - quite likely they get totally ignored.
So, no - full text search is not suitable for this.

Strange issue with SQL contains - ignoring starting characters of a string

I am experiencing a strange issue with SQL full text indexing. Basically I am searching a column which is used to house email addresses. It seems to be working as expected for all cases I tested except one!
SELECT *
FROM Table
WHERE CONTAINS(Email, '"email#me.com"')
For a certain email address it is completely ignoring the "email" part above and is instead doing
SELECT *
FROM Table
WHERE CONTAINS(Email, '#me.com')
There was only one case that I could find that this was happening for. I repopulated the index, but no joy. I also rebuilt the catalog.
Any ideas??
Edit:
I cannot put someone's email address on a public website, so I will give more appropriate examples. The one that is causing the issue is of the form:
a.b.c#somedomain.net.au
When I do
WHERE CONTAINS(Email, '"a.b.c#somedomain.net.au"')
The matching rows which are returned are all of the form .*#somedomain.net.au. I.e. it is ignoring the a.b.c part.
Full stops are treated as noise words (or stopwords) in a full-text index. You can find the list of excluded words by checking the system stopwords:
SELECT * FROM sys.fulltext_system_stopwords WHERE language_id = 2057 --this is the lang Id for British English (change accordingly)
So your email address, which is "a.b.c#somedomain.net.au", is actually treated as "a b c#somedomain.net.au", and in this particular case, as individual letters are also excluded from the index, you end up searching on "#somedomain.net.au".
You really have two choices: you can either replace the characters you want to include before indexing (so replace the special characters with a match tag), or you can remove the words/characters you wish to include from the Full Text Stoplist.
Note: if you choose the latter, I would be careful, as this can bloat your index significantly.
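The first option (replacing special characters with a match tag before indexing) could look roughly like the sketch below. The tag strings `DOT`, `HASH` and `AT` and the helper name `encode_for_index` are hypothetical; the same encoding would have to be applied both to the stored value that gets indexed and to every search term.

```python
# Hypothetical match tags; any strings that cannot occur in real data would do.
REPLACEMENTS = {".": "DOT", "#": "HASH", "@": "AT"}


def encode_for_index(value):
    """Replace separator characters with tags so the address stays one token."""
    return "".join(REPLACEMENTS.get(ch, ch) for ch in value)


encoded = encode_for_index("a.b.c#somedomain.net.au")
print(encoded)  # aDOTbDOTcHASHsomedomainDOTnetDOTau
```

A weakness of this simple mapping is that an address which already contains one of the tag strings in upper case would be ambiguous after encoding, so a real implementation would need tags chosen to avoid that.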
Here are some links that should help you :
Configure and Manage Stopwords and Stoplists for Full-Text Search
Create Full Text Stoplists

sql query relative searching to previous searched words

I have a list of words in a table. I want to find all records such as book and books, or pen and pens, that is, every word that also appears with a trailing 's'. The query should show both the word without 's' and the word with 's'.
So it is not simply a query like "SELECT * FROM words WHERE word LIKE '%s'".
schema definition is,
words = <word, part_of_speech>
I have to search on 'word'
How can I do this?
The result could be,
book
books
pen
pens
It's something like this: if there is a value in the column such as 'word' and there is another value 'word'+'s', then show the rows of both 'word' and 'word'+'s'.
I'm using sqlite.
SELECT word FROM words WHERE word LIKE 'book%'
will match 'book', 'books', 'bookmark', etc
If you want to search for only a specific suffix, then try:
-- '%s' is a placeholder for the base word (e.g. 'book'), filled in before the query runs
SELECT *
FROM words
WHERE word = '%s'
   OR word = '%s' || 's' -- change 's' to any suffix you want to try
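To return every such pair without naming the base word in advance, as the question asks, the table can be checked against itself. This is a minimal sketch using Python's built-in sqlite3 module with an in-memory database; the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT, part_of_speech TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?, ?)",
    [("book", "noun"), ("books", "noun"), ("pen", "noun"),
     ("pens", "noun"), ("glass", "noun"), ("bus", "noun")],
)

# Keep a word if its plural (word || 's') exists, or if it is itself
# the plural of another word in the table.
query = """
SELECT w.word
FROM words AS w
WHERE EXISTS (SELECT 1 FROM words AS p WHERE p.word = w.word || 's')
   OR EXISTS (SELECT 1 FROM words AS b WHERE b.word || 's' = w.word)
ORDER BY w.word
"""
rows = [r[0] for r in conn.execute(query)]
print(rows)  # ['book', 'books', 'pen', 'pens']
```

Words such as 'glass' and 'bus' end in 's' but are left out, because neither 'glas' nor 'bu' is in the table; that is the difference from a plain LIKE '%s' filter.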
Google the "Porter Stemming Algorithm" and apply it to your data before you load it. This algorithm is as close as you can get to converting not just plurals but many other forms of word to a single word. e.g., "scholarly" becomes "scholar" and stuff like that.
If that does not meet your quality standards, because it will not trap for "mice" and other examples given in other answers, you will have to find a "stemming file". I know of no free ones (which does not mean there are none), but the one we use at my shop is part of a commercial package, so I've never had to find a free one.
At any rate, once you have applied the stemming to the words on the way in, you no longer have to search for multiple versions of a word, you just search for the stem.