T-SQL CONTAINS having problems with fulltext index on computed column with a long number - sql-server-2012

I have a computed field with a full-text index on it. It's working properly except for just a few records out of thousands, and why remains a mystery:
[fake name]
JAMES J BRATWURST LTDJ LCHM LCDAA1 ACD 1215041803 111.223.3333
A select-query with
... where CONTAINS(searchname,'bratwurst')
works fine, and
... where CONTAINS(searchname,'111.223.3333')
works fine too. But
... where CONTAINS(searchname,'1215041803') does not return anything.
Edit: this wildcard search works too (which shows that the index has been populated):
... where CONTAINS(searchname, '"121504180*"')
Yet on other similar records, searching the searchname column for a 10-digit "number" using CONTAINS does return values. So it's not that the full-text tokenizer is ignoring numbers.
Thinking that it was possibly a stoplist issue, I turned stoplist off and repopulated the index, but to no avail.
I am open to suggestions for other things to check! Thanks

Related

Full Text Search Finding Word Only Occasionally

I have a table with a full text index. This query returns 2 results:
SELECT *
FROM SampleTable
WHERE ContentForSearch LIKE '% mount %'
ORDER BY 1
This query returns 1 result:
SELECT *
FROM SampleTable
WHERE CONTAINS(ContentForSearch, '"mount"')
ORDER BY 1
A newly added row containing the word "mount" does show up in the CONTAINS search. Why is the existing row missing?
I've already checked the stopwords list as best as I knew how to. This returns no results:
SELECT *
FROM sys.fulltext_stoplists
This returns no results:
SELECT *
FROM sys.fulltext_system_stopwords
WHERE stopword like '%mount%'
I also checked whether the index was up to date. This returned the current time (minus a few minutes) and a 0, indicating idle:
SELECT DATEADD(ss, FULLTEXTCATALOGPROPERTY('SampleTableCatalog','PopulateCompletionAge'), '1/1/1990') AS LastPopulated,
FULLTEXTCATALOGPROPERTY('SampleTableCatalog','PopulateStatus')
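The DATEADD expression above works because PopulateCompletionAge is reported as a number of seconds counted from 1 January 1990. As a rough illustration of the same arithmetic outside SQL Server, here is a minimal Python sketch; the function name and the sample value are made up for the example.

```python
from datetime import datetime, timedelta

# Base date that PopulateCompletionAge is counted from
# (the same '1/1/1990' used in the DATEADD call).
BASE = datetime(1990, 1, 1)

def last_populated(completion_age_seconds: int) -> datetime:
    """Convert a PopulateCompletionAge value (seconds since 1990-01-01) to a datetime."""
    return BASE + timedelta(seconds=completion_age_seconds)

# Hypothetical sample value: exactly one day after the base date.
print(last_populated(86400))
```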
I also did some searches in the string that doesn't show up in the CONTAINS result to see if the ASCII values were strange (and can provide queries if needed), but they were exactly the same as the one that did show up.
On one copy of the database, someone ran:
ALTER FULLTEXT INDEX ON SampleTable SET STOPLIST = OFF;
ALTER FULLTEXT INDEX ON SampleTable SET STOPLIST = SYSTEM;
and that seemed to fix it, but I have no idea why, and I'm uncomfortable making changes I don't understand.
UPDATE
Stoleg's comments eventually led me to the solution. Full-text indexing had somehow been turned off on a certain database server. When that database was then restored to another server, the entries that had not been indexed on the first server were still not indexed, even though the new server was properly updating the index. I found this by using Stoleg's queries to check which rows were missing from the index, and then checking the modified date for those rows (which luckily was stored). The pattern was that rows from the dates when the database was on the other server were not in the index. The solution on the problem server was to turn on full-text indexing and rebuild the catalogs. As to how the indexing got turned off, I don't understand it myself. The DBA's comment on how he solved it was "I added full text search as resource to cluster node."
Well, the obvious question is: Is 'mount' in your stoplist?
Microsoft's "Configure and Manage Stopwords and Stoplists for Full-Text Search" shows you how to query and update your stopwords.
You might also want to review the general info on stoplist from Microsoft.
ADDED
Don't take it as an insult (not that you sounded insulted). Way too many times, people say they have checked something when they only thought they had -- looking at the wrong database, etc. So I wanted you to make sure. I interpreted your "some of the time" as "works with LIKE, not with CONTAINS", so I thought the stoplist was the more likely cause.
The only other "obvious" solution would be to rebuild the full text index -- with the thought that changing the stoplist has the same effect on the other database. I suppose you could restart the server first too. But, as another mysterious solution, not a first choice.
Changes the full-text stoplist that is associated with the index, if any.
OFF
Specifies that no stoplist be associated with the full-text index.
SYSTEM
Specifies that the default full-text system STOPLIST should be used for this full-text index.
stoplist_name
Specifies the name of the stoplist to be associated with the full-text index.
For more information, see Configure and Manage Stopwords and Stoplists for Full-Text Search.
This just removed and reset the stoplist to system default.
It is because of the extra spaces in '% mount %'. Try:
SELECT *
FROM SampleTable
WHERE ContentForSearch LIKE '%mount%'
ORDER BY 1
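To see why the two LIKE patterns can disagree, here is a small Python sketch that mimics them with plain substring checks. It only illustrates the pattern semantics (it ignores collation rules), and the sample rows are invented.

```python
# Invented sample rows; only the position of "mount" matters.
rows = [
    "please mount the drive",  # word surrounded by spaces
    "mount the drive first",   # word at the very start of the text
    "run the mount.",          # word followed by punctuation
]

def like_contains(text: str, fragment: str) -> bool:
    """Mimic LIKE '%fragment%' as a case-insensitive substring test."""
    return fragment.lower() in text.lower()

# Equivalent of LIKE '% mount %': needs a space on both sides.
with_spaces = [r for r in rows if like_contains(r, " mount ")]
# Equivalent of LIKE '%mount%': matches the letters anywhere.
without_spaces = [r for r in rows if like_contains(r, "mount")]

print(len(with_spaces), len(without_spaces))
```

Note that '%mount%' also matches longer words such as "mounted", so it is a looser test than CONTAINS with "mount" and the two are not expected to return identical row counts.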

Why is SQL Server full-text search not matching numbers?

I'm using SQL Server 2014 Express, and have a full-text index setup on a table.
The full-text index only indexes a single column, in this example named foo.
The table has 3 rows in it. The values in the 3 rows, for that full-text indexed column are like so ...
test 1
test 2
test 3 test 1
Each new line above is a new row in the table, and that text is literally what is in the full-text indexed column. So, using SQL Server's CONTAINS function, if I perform the following query, I get all rows back as matches, as expected.
SELECT * FROM example WHERE CONTAINS(foo, 'test')
But, if I run the following query, I also get all of the rows back as matches, which I am not expecting. In the following query, I only expected one row as a match.
SELECT * FROM example WHERE CONTAINS(foo, '"test 3"')
Lastly, simply searching for "3" returns no matching rows, which I also did not expect. I'd expect one matching row from the following query, but get none.
SELECT * FROM example WHERE CONTAINS(foo, '3')
I've read the MSDN pages on CONTAINS and full-text indexing, but I can't figure out this behavior. I must be doing something wrong.
Would anybody be able to explain to me what's happening and how to perform the searches I've described?
While this may not be the answer, it solved my original question. My full-text index was using the system stop list. For whatever reason, certain individual numbers, such as "1" in "test 1", were being skipped or whatever the stop list does.
The following question and answer, here on SO, suggested disabling the stop list altogether. I did this, and now my full-text searches match as I expected them to, at the expense of a larger full-text index, it looks like.
Full text search does not work if stop word is included even though stop word list is empty

Strange issue with SQL contains - ignoring starting characters of a string

I am experiencing a strange issue with SQL full-text indexing. Basically, I am searching a column that holds email addresses. It seems to work as expected for all the cases I tested except one!
SELECT *
FROM Table
WHERE CONTAINS(Email, '"email#me.com"')
For a certain email address it is completely ignoring the "email" part above and is instead doing
SELECT *
FROM Table
WHERE CONTAINS(Email, '#me.com')
There was only one case I could find where this was happening. I repopulated the index, but no joy. I also rebuilt the catalog.
Any ideas??
Edit:
I cannot put someone's email address on a public website, so I will give more appropriate examples. The one that is causing the issue is of the form:
a.b.c#somedomain.net.au
When I do
WHERE CONTAINS(Email, '"a.b.c#somedomain.net.au"')
The matching rows which are returned are all of the form .*#somedomain.net.au. I.e. it is ignoring the a.b.c part.
Full stops are treated as noise words (or stopwords) in a full-text index. You can find a list of the excluded characters by checking the system stopwords:
SELECT * FROM sys.fulltext_system_stopwords WHERE language_id = 2057 --this is the lang Id for British English (change accordingly)
So your email address, which is "a.b.c#somedomain.net.au", is actually treated as "a b c#somedomain.net.au", and in this particular case, as individual letters are also excluded from the index, you end up searching on "#somedomain.net.au".
You really have two choices: you can either replace the character you want to include before indexing (so replace the special characters with a match tag), or remove the words/characters you wish to include from the full-text stoplist.
Note: If you choose the latter, be careful, as this can bloat your index significantly.
Here are some links that should help you :
Configure and Manage Stopwords and Stoplists for Full-Text Search
Create Full Text Stoplists

Lucene numeric range search with LUKE

I have a number of numeric Lucene indexed fields:
60000
78500
105000
If I use LUKE to query for 78500 as follows:
price:78500
It returns the correct record. However, if I try to return all three records as a range, I get no results.
price:[60000 TO 105000]
I realise this is due to padding, as numbers are treated as strings by Lucene. However, I just wish to know what I should be putting into LUKE to return the three records.
Many thanks for any help.
If the fields are indexed as NumericField, you must use the "Use XML Query Parser" option in the query parser tab and the 3.5 version of Luke:
https://code.google.com/p/luke/downloads/detail?name=lukeall-3.5.0.jar&can=2&q=
An example of query with a string and numeric field is:
<BooleanQuery>
<Clause fieldName="colour" occurs="must">
<TermQuery>rojo</TermQuery>
</Clause>
<Clause fieldName="price" occurs="must">
<NumericRangeQuery type="int" lowerTerm="4000" upperTerm="5000" />
</Clause>
</BooleanQuery>
The solution I used was to add the price values to the index in zero-padded form and then query using the padded value, which works great. The new values in the index were therefore:
060000
078500
105000
This solution was tied to an Examine search issue for Umbraco, so there is a thread on the forum with an end-to-end walkthrough of how to implement a numeric range search, if anyone requires it:
Umbraco Forum Thread
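To illustrate why the padding helps, here is a minimal Python sketch that compares the prices as plain strings, the way a text-term range is evaluated. It is only a model of lexicographic comparison, not Lucene code, and the helper names are made up for the example.

```python
# The three prices from the question, stored as text terms.
prices = ["60000", "78500", "105000"]

def in_range(term: str, lower: str, upper: str) -> bool:
    """Range test on strings, compared character by character."""
    return lower <= term <= upper

def pad(value: str, width: int = 6) -> str:
    """Left-pad with zeros so every term has the same length."""
    return value.zfill(width)

# Unpadded: "60000" sorts after "105000" because '6' > '1',
# so the range [60000 TO 105000] is empty when read as text.
unpadded = [p for p in prices if in_range(p, "60000", "105000")]

# Padded: "060000" <= "078500" <= "105000", so string order equals numeric order.
padded = [p for p in prices if in_range(pad(p), pad("60000"), pad("105000"))]

print(unpadded)
print(padded)
```

The width of 6 is an assumption chosen to fit the sample data; in practice it has to cover the largest value that will ever be indexed, and the query bounds must be padded the same way.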
Zero padding won't come into this particular query since all the numbers you've shown have the same number of digits
The range query you've shown has too many zeros on the second part of the range
So the query for the data you've shown would be price:[10500 TO 78500]
Hope this helps,
I assume these fields are indexed as NumericFields. The problem with them is that Lucene/Luke does not know how to parse numeric queries automatically. You need to override Lucene's QueryParser and provide your own logic for how these numbers should be interpreted.
As far as I know, Luke allows plugging in your custom parser; it just needs to be present in the CLASSPATH.
Have a look at this thread on Lucene mailing list:
http://mail-archives.apache.org/mod_mbox/lucene-java-user/201102.mbox/%3CAANLkTi=XUpyw09tcbjuTzNRpMJa730Cq-6_1agMAjYz6#mail.gmail.com%3E

MySQL MATCH...AGAINST sometimes finds answer, sometimes doesn't

The following two queries return the same (expected) result when I query my database:
SELECT * FROM articles
WHERE content LIKE '%Euskaldunak%'
SELECT * FROM articles
WHERE MATCH (content) AGAINST ('+"Euskaldunak"' IN BOOLEAN MODE)
The text in the content field that it's searching looks like this: "...These Euskaldunak, or newcomers..."
However, the following query on the same table returns the expected single result:
SELECT * FROM articles
WHERE content LIKE '%PCC%'
And the following query returns an empty result:
SELECT * FROM articles
WHERE MATCH (content) AGAINST ('+"PCC"' IN BOOLEAN MODE)
The text in the content field that matches this result looks like this: "...Portland Community College (PCC) is the largest..."
I can't figure out why searching for "Euskaldunak" works with that MATCH...AGAINST syntax but "PCC" doesn't. Does anyone see something that I'm not seeing?
(Also: "PCC" is not a common phrase in this field - no other rows contain the word, so the natural language search shouldn't be excluding it.)
Your fulltext minimum word length is probably set too high. I think the default is 4, which would explain what you are seeing. Set it to 1 if you want all words indexed regardless of length.
Run this query:
show variables like 'ft_min_word_len';
If the value is greater than 3 and you want to get hits on words shorter than that, edit your /etc/my.cnf and add or update this line in the [mysqld] section, using a value appropriate for your application:
ft_min_word_len = 1
Then restart MySQL and rebuild your fulltext indexes and you should be all set.
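As a rough model of what the minimum word length does, the following Python sketch drops every word shorter than a given length before "indexing". It is a simplified stand-in, not MySQL's actual parser (it ignores stopwords and MySQL's own word-boundary rules), and the function name is invented for the example.

```python
import re

def indexed_words(text: str, min_word_len: int) -> set:
    """Split on non-word characters and keep only words of at least min_word_len characters."""
    words = re.findall(r"\w+", text.lower())
    return {w for w in words if len(w) >= min_word_len}

pcc_text = "Portland Community College (PCC) is the largest"
eusk_text = "These Euskaldunak, or newcomers"

# With a minimum length of 4, the three-letter "pcc" never reaches the index.
print("pcc" in indexed_words(pcc_text, 4))
# Lowering the minimum to 3 lets it through.
print("pcc" in indexed_words(pcc_text, 3))
# An eleven-letter word is indexed either way.
print("euskaldunak" in indexed_words(eusk_text, 4))
```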
There are two things I can think of right away. The first is your ft_min_word_len value is set to more than 3 characters. Any "word" less than the ft_min_word_len length will not get indexed.
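The second is that more than 50% of your records contain the 'PCC' string. A natural-language full-text search on a MyISAM table treats a word that matches more than 50% of the records as irrelevant and returns nothing for it. Note, however, that boolean mode searches like the one in the question do not apply this threshold.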
Full-text indexes have different rules than regular string indexes. For example, there is a stopword list, so certain common words like "to", "the", and "and" don't get indexed.