MySQL MATCH...AGAINST sometimes finds answer, sometimes doesn't

The following two queries return the same (expected) result when I query my database:
SELECT * FROM articles
WHERE content LIKE '%Euskaldunak%';
SELECT * FROM articles
WHERE MATCH (content) AGAINST ('+"Euskaldunak"' IN BOOLEAN MODE);
The text in the content field that it's searching looks like this: "...These Euskaldunak, or newcomers..."
However, the following query on the same table returns the expected single result:
SELECT * FROM articles
WHERE content LIKE '%PCC%'
And the following query returns an empty result:
SELECT * FROM articles
WHERE MATCH (content) AGAINST ('+"PCC"' IN BOOLEAN MODE)
The text in the content field that matches this result looks like this: "...Portland Community College (PCC) is the largest..."
I can't figure out why searching for "Euskaldunak" works with that MATCH...AGAINST syntax but "PCC" doesn't. Does anyone see something that I'm not seeing?
(Also: "PCC" is not a common phrase in this field - no other rows contain the word, so the natural language search shouldn't be excluding it.)

Your fulltext minimum word length is probably set too high. I think the default is 4, which would explain what you are seeing. Set it to 1 if you want all words indexed regardless of length.
Run this query:
show variables like 'ft_min_word_len';
If the value is greater than 3 and you want to get hits on words shorter than that, edit your /etc/my.cnf and add or update this line in the [mysqld] section, using a value appropriate for your application:
ft_min_word_len = 1
Then restart MySQL and rebuild your fulltext indexes and you should be all set.
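To see why a minimum word length would produce exactly this symptom, here is a rough sketch in Python. It is a toy model, not MySQL's actual parser: the index_tokens and search helpers and the tokenizing regex are assumptions made up for illustration, and the two rows are the snippets quoted in the question.

```python
import re

# The two snippets quoted in the question.
rows = [
    "These Euskaldunak, or newcomers",
    "Portland Community College (PCC) is the largest",
]

def index_tokens(text, min_word_len):
    """Toy indexer: keep only words at least min_word_len characters long."""
    words = re.findall(r"[A-Za-z0-9_']+", text)
    return {w.lower() for w in words if len(w) >= min_word_len}

def search(term, min_word_len):
    """Return the rows whose toy index contains the term."""
    term = term.lower()
    return [r for r in rows if term in index_tokens(r, min_word_len)]

print(search("Euskaldunak", 4))  # long word, indexed
print(search("PCC", 4))          # 3 characters, never indexed
print(search("PCC", 1))          # indexed once the minimum is lowered
```

With a minimum of 4, "PCC" never makes it into the index, so no query can find it, while LIKE still sees it because LIKE scans the raw text.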

There are two things I can think of right away. The first is that your ft_min_word_len value is set to more than 3 characters. Any "word" shorter than ft_min_word_len will not get indexed.
The second is that more than 50% of your records contain the 'PCC' string. In a natural language search on a MyISAM table, a word that matches more than 50% of the records is considered irrelevant and returns nothing. That threshold does not apply to IN BOOLEAN MODE searches, though, so it is unlikely to explain the queries above.
Full text indexes have different rules than regular string indexes. For example, there is a stop words list, so certain common words like "to", "the", and "and" don't get indexed.

Related

SQL Server full-text search for Latex content

I have a web app that allows users to save LaTeX content to a SQL Server 2012 database. I am running the full-text query below to search for a LaTeX expression.
SELECT MessageID, Message FROM Messages m WHERE CONTAINS (m.Message, N'2x-4=0');
The problem is that some of the messages returned by this query do not contain the LaTeX expression 2x-4=0. For example, the message below is also returned, even though it clearly does not contain 2x-4=0.
<p>Another example of inline Latex is \$x=34\$.</p>
<p>What are the roots of following equation: \$x^2 - 2x + 1 = 0\$?</p>
Question
Why is this happening and is there a way to get correct records returned when doing a full text search to look for the latex expression 2x-4 = 0? I have tried to repopulate the full text search data for the table being used, but it had no effect.
UPDATE 1
Strange, but the following Latex expression filter always returns exact matching results. I am now looking for $2x-4=0$ rather than 2x-4=0.
SELECT MessageID, Message FROM Messages m WHERE CONTAINS (m.Message, N'$2x-4=0$');
My app uses two kinds of delimiters for LaTeX expressions: $$ for paragraph display and \$ for inline display. So a $ symbol always surrounds a LaTeX expression stored in the database. The trailing delimiter could be \$, but full-text search seems to ignore the backslash character.
Why this modified query returns exact matches is not clear to me.
UPDATE 2
Another approach that works accurately is the one mentioned in the answer: combine CONTAINS with LIKE. The full query is below. The LIKE operator ends up scanning only the rows selected by the full-text search query.
WITH x AS (
    SELECT MessageID, Message
    FROM Messages m
    WHERE CONTAINS (m.Message, N'2x-4=0')
)
SELECT MessageID, Message
FROM x
WHERE x.Message LIKE '%2x-4=0%';
To understand why it happens you can run the following query (1033 is the English language id):
select * from sys.dm_fts_parser('2x-4=0', 1033, 0,1)
In my instance it would return the following results:
Note, all other parts of the search criteria are considered to be noise words except for 2x. Therefore, I suspect your full text index simply does not have the full 2x-4=0 string and instead you get results with occurrences of 2x.
I tried adding 2x-4=0 to my own FTS index and CONTAINS was able to find it as the top result for both CONTAINS(col, '2x-4=0') and CONTAINS(col, '"2x-4=0"'). However, partial matches were included too right after the exact match.
Note that when extra white space is added around = in the search term, the FTS parser won't accept it and complains about a syntax error.
CONTAINS is more like an end-user search operation, with support for keywords like NEAR, AND and OR. Try adding quotes within the quotes, to force the exact search term:
SELECT MessageID, Message FROM Messages m WHERE CONTAINS (m.Message, N'"2x-4=0"');
This is called <simple-term> in the documentation.
You can also try the LIKE operator:
SELECT MessageID, Message FROM Messages m WHERE m.Message LIKE '%2x-4=0%';
But note that this is probably slower than CONTAINS because it doesn't use a full text search index. If it's too slow, maybe you can even combine both of them in one query, so the CONTAINS is used to filter the result set down to the non-noise words using the index, and then LIKE applies the final matching.
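The two-stage idea (a cheap index lookup first, then an exact substring check on the survivors) can be sketched in Python. This is only a toy model under stated assumptions: the tokens function, which drops single-character tokens as noise, is an invented stand-in for the real word breaker, and message 1 is a hypothetical row added for illustration; messages 2 and 3 come from the example in the question.

```python
import re

messages = {
    1: r"<p>Solve \$2x-4=0\$ for x.</p>",  # hypothetical row with the exact expression
    2: r"<p>What are the roots of following equation: \$x^2 - 2x + 1 = 0\$?</p>",
    3: r"<p>Another example of inline Latex is \$x=34\$.</p>",
}

def tokens(text):
    """Toy word breaker: alphanumeric runs, single characters dropped as noise."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) > 1}

def coarse_match(query):
    """Stage 1 (stands in for CONTAINS): every surviving query token must be present."""
    wanted = tokens(query)
    return [mid for mid, body in messages.items() if wanted <= tokens(body)]

def exact_match(query):
    """Stage 2 (stands in for LIKE '%...%'): substring check on stage-1 rows only."""
    return [mid for mid in coarse_match(query) if query in messages[mid]]

print(coarse_match("2x-4=0"))  # rows that merely contain the token 2x
print(exact_match("2x-4=0"))   # rows that contain the whole expression
```

In this model only 2x survives tokenization of the query, which is why the coarse stage also returns the x^2 - 2x + 1 = 0 message; the substring stage then removes it.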

Why is SQL Server full-text search not matching numbers?

I'm using SQL Server 2014 Express, and have a full-text index setup on a table.
The full-text index only indexes a single column, in this example named foo.
The table has 3 rows in it. The values in the 3 rows, for that full-text indexed column are like so ...
test 1
test 2
test 3 test 1
Each new line above is a new row in the table, and that text is literally what is in the full-text indexed column. So, using SQL Server's CONTAINS function, if I perform the following query, I get all rows back as matches, as expected.
SELECT * FROM example WHERE CONTAINS(foo, 'test')
But, if I run the following query, I also get all of the rows back as matches, which I am not expecting. In the following query, I only expected one row as a match.
SELECT * FROM example WHERE CONTAINS(foo, '"test 3"')
Lastly, simply searching for "3" returns no matching rows, which I also did not expect. I'd expect one matching row from the following query, but get none.
SELECT * FROM example WHERE CONTAINS(foo, '3')
I've read the MSDN pages on CONTAINS and full-text indexing, but I can't figure out this behavior. I must be doing something wrong.
Would anybody be able to explain to me what's happening and how to perform the searches I've described?
While this may not be the answer, it solved my original question. My full-text index was using the system stop list. For whatever reason, certain individual numbers, such as "1" in "test 1", were being skipped or whatever the stop list does.
The following question and answer, here on SO, suggested disabling the stop list altogether. I did this and now my full-text searches match as I expected them to, at the expense of a larger full-text index, it looks like.
Full text search does not work if stop word is included even though stop word list is empty
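As a rough illustration of how a stop list could produce both surprises, here is a toy Python model. It assumes, as the answer above concludes, that the single digits are on the stop list; the search helper and the DIGIT_STOPLIST set are made up for this sketch, and unlike a real full-text index it does not keep the positions of stop words.

```python
rows = ["test 1", "test 2", "test 3 test 1"]

# Assumption for illustration: the stop list contains these single digits.
DIGIT_STOPLIST = {"1", "2", "3"}

def search(phrase, stoplist):
    """Toy phrase search: stop words are dropped from both query and rows."""
    terms = [t for t in phrase.lower().split() if t not in stoplist]
    if not terms:
        return []  # the query consisted only of stop words
    n = len(terms)
    hits = []
    for row in rows:
        toks = [t for t in row.lower().split() if t not in stoplist]
        if any(toks[i:i + n] == terms for i in range(len(toks) - n + 1)):
            hits.append(row)
    return hits

print(search("test 3", DIGIT_STOPLIST))  # behaves like a search for "test"
print(search("3", DIGIT_STOPLIST))       # nothing left to search for
print(search("test 3", set()))           # stop list disabled
```

With the digits treated as stop words, "test 3" degrades to "test" and matches every row, while "3" on its own matches nothing. With an empty stop list both searches return only the third row.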

Strange issue with SQL contains - ignoring starting characters of a string

I am experiencing a strange issue with SQL full-text indexing. Basically, I am searching a column that holds email addresses. It seems to work as expected for all the cases I tested except one!
SELECT *
FROM Table
WHERE CONTAINS(Email, '"email#me.com"')
For a certain email address it is completely ignoring the "email" part above and is instead doing
SELECT *
FROM Table
WHERE CONTAINS(Email, '#me.com')
There was only one case I could find where this was happening. I repopulated the index, but no joy. I also rebuilt the catalog.
Any ideas??
Edit:
I cannot put someone's email address on a public website, so I will give more appropriate examples. The one that is causing the issue is of the form:
a.b.c#somedomain.net.au
When I run
WHERE CONTAINS(Email, '"a.b.c#somedomain.net.au"')
The matching rows which are returned are all of the form .*#somedomain.net.au. I.e. it is ignoring the a.b.c part.
Full stops are treated as noise words (or stopwords) in a fulltext index, you can find a list of the excluded characters by checking the system stopwords:
SELECT * FROM sys.fulltext_system_stopwords WHERE language_id = 2057 --this is the lang Id for British English (change accordingly)
So your email address, "a.b.c#somedomain.net.au", is actually treated as "a b c#somedomain.net.au", and because individual letters are also excluded from the index in this case, you end up searching on "#somedomain.net.au".
You really have two choices: you can either replace the characters you want to include before indexing (so replace the special characters with a match tag), or you can remove the words/characters you wish to include from the Full Text Stoplist.
Note: If you choose the latter, I would be careful, as this can bloat your index significantly.
Here are some links that should help you :
Configure and Manage Stopwords and Stoplists for Full-Text Search
Create Full Text Stoplists

CONTAINS in SQL 2000

I have a table 'Asset' with a column 'AssetDescription'. Every row of it has some group of words/sentences, separated by commas.
row1: - flowers, full color, female, Trend
row2:- baby smelling flowers, heart
Now if I run a search query like:
select * from Asset where contains(AssetDescription,'flower')
It returns nothing.
I have one more table 'SearchData' with a column 'SearchCol', having rows similar to those shown above for table 'Asset'. Now if I run a search query like:
select * from SearchData where contains(SearchCol,'flower')
It returns both the rows.
QUESTION:
Why doesn't the first query return any result, while the second one works correctly?
If 'Full Text Search' has something to do with the first question, then what should I do about it? I'm using SQL Server 2000.
CONTAINS requires a full text search index, and for full text search indexing to be enabled.
LIKE doesn't require full text search.
The advantage of using CONTAINS over LIKE is that CONTAINS is more flexible and potentially a lot faster. LIKE may require a full table scan depending how you use it.
From the SQL Server docs
In contrast to full-text search, the LIKE Transact-SQL predicate works
on character patterns only. Also, you cannot use the LIKE predicate to
query formatted binary data. Furthermore, a LIKE query against a large
amount of unstructured text data is much slower than an equivalent
full-text query against the same data. A LIKE query against millions
of rows of text data can take minutes to return; whereas a full-text
query can take only seconds or less against the same data, depending
on the number of rows that are returned.
Your first query isn't matching anything because you're not using a wildcard character. Your rows contain the word 'flowers' whereas you're searching for rows containing 'flower'. You would need to change the query to:
select * from asset where contains(AssetDescription, 'flower*')
Try rebuilding your full-text index. Could be that it's out of date and hence not finding them when you use CONTAINS.
Assuming SQL Server, to use contains with a word prefix, you use a wildcard.
More here: http://msdn.microsoft.com/en-us/library/ms187787.aspx
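The difference between a simple term and a prefix term can be sketched with a small Python model. This is not SQL Server's matching logic, only an illustration under the assumption that a simple term must equal a whole indexed word while a trailing * matches any word starting with the prefix; the words and contains helpers are invented for the example, and the rows are the two from the question.

```python
import re

rows = {
    1: "flowers, full color, female, Trend",
    2: "baby smelling flowers, heart",
}

def words(text):
    """Toy word breaker: lower-cased alphabetic runs."""
    return re.findall(r"[a-z]+", text.lower())

def contains(term):
    """Whole-word match, or prefix match when the term ends with *."""
    if term.endswith("*"):
        prefix = term[:-1].lower()
        return [k for k, v in rows.items()
                if any(w.startswith(prefix) for w in words(v))]
    return [k for k, v in rows.items() if term.lower() in words(v)]

print(contains("flower"))   # no row has the exact word "flower"
print(contains("flower*"))  # prefix term matches "flowers"
print(contains("flowers"))  # exact word matches
```

Under that assumption, 'flower' finds nothing because the rows only contain 'flowers', while 'flower*' finds both rows.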

SQL Contains - only match at start

For some reason I cannot find the answer on Google! With the SQL CONTAINS function, how can I tell it to match only at the beginning of a string? That is, I am looking for the full-text equivalent of
LIKE 'some_term%'.
I know I can use like, but since I already have the full-text index set up, AND the table is expected to have thousands of rows, I would prefer to use Contains.
Thanks!
You want something like this:
Rather than specify multiple terms, you can use a 'prefix term' if the
terms begin with the same characters. To use a prefix term, specify
the beginning characters, then add an asterisk (*) wildcard to the end
of the term. Enclose the prefix term in double quotes. The following
statement returns the same results as the previous one.
-- Search for all terms that begin with 'storm'
SELECT StormID, StormHead, StormBody FROM StormyWeather
WHERE CONTAINS(StormHead, '"storm*"')
http://www.simple-talk.com/sql/learn-sql-server/full-text-indexing-workbench/
You can use CONTAINS with a LIKE subquery to match only at the start:
SELECT *
FROM (
SELECT *
FROM myTable WHERE CONTAINS(edition, '"Alice in wonderland"')
) AS S1
WHERE S1.edition LIKE 'Alice in wonderland%'
This way, the slow LIKE query will be run against a smaller set.
The only solution I can think of is to actually prepend a unique word to the beginning of every field in the table.
e.g. Update every row so that 'xfirstword ' appears at the start of the text (e.g. Field1). Then you can search for CONTAINS(Field1, 'NEAR ((xfirstword, "TERM*"),0)')
Pretty crappy solution, especially as we know that the full text index stores the actual position of each word in the text (see this link for details: http://msdn.microsoft.com/en-us/library/ms142551.aspx)
I am facing a similar issue. This is what I have implemented as a workaround.
I created another table and copied into it only the rows matching LIKE 'some_term%'.
Then I set up full-text search on this new table.
Please let me know if you have tried a better approach.