Why is SQL Server full-text search not matching numbers? - sql

I'm using SQL Server 2014 Express, and have a full-text index setup on a table.
The full-text index only indexes a single column, in this example named foo.
The table has 3 rows in it. The values in the 3 rows, for that full-text indexed column are like so ...
test 1
test 2
test 3 test 1
Each new line above is a new row in the table, and that text is literally what is in the full-text indexed column. So, using SQL Server's CONTAINS function, if I perform the following query, I get all rows back as matches, as expected.
SELECT * FROM example WHERE CONTAINS(foo, 'test')
But, if I run the following query, I also get all of the rows back as matches, which I am not expecting. In the following query, I only expected one row as a match.
SELECT * FROM example WHERE CONTAINS(foo, '"test 3"')
Lastly, simply searching for "3" returns no matching rows, which I also did not expect. I'd expect one matching row from the following query, but get none.
SELECT * FROM example WHERE CONTAINS(foo, '3')
I've read the MSDN pages on CONTAINS and full-text indexing, but I can't figure out this behavior. I must be doing something wrong.
Would anybody be able to explain to me what's happening and how to perform the searches I've described?

While this may not be the complete answer, it solved my original problem. My full-text index was using the system stoplist, and certain single-digit tokens, such as the "1" in "test 1", are stop words that the stoplist keeps out of the index. That would presumably also explain the phrase query: with "3" discarded as a stop word, '"test 3"' ends up matching every row that contains "test" followed by any token.
The following question and answer, here on SO, suggested disabling the stoplist altogether. I did this and now my full-text searches match as I expected, apparently at the expense of a larger full-text index.
Full text search does not work if stop word is included even though stop word list is empty
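For reference, a minimal sketch of what disabling the stoplist can look like on the example table from the question. It assumes the index is on a table named example in the current database and that the index language is English (LCID 1033); adjust both to your setup.

```sql
-- Check which single digits the English (1033) system stoplist treats as stop words.
SELECT stopword
FROM sys.fulltext_system_stopwords
WHERE language_id = 1033
  AND stopword LIKE '[0-9]';

-- Detach the full-text index on the example table from any stoplist.
-- Unless WITH NO POPULATION is specified, this triggers a repopulation of the index.
ALTER FULLTEXT INDEX ON example SET STOPLIST = OFF;

-- Once the repopulation has finished, digits are indexed like any other token,
-- so this query can match rows containing "3".
SELECT * FROM example WHERE CONTAINS(foo, '3');
```

The trade-off is the one mentioned above: with no stoplist, every token is indexed, so the index grows.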

Related

SQL Server full-text search for Latex content

I have a web app that allows users to save Latex content to a SQL Server 2012 database. I am running a full-text query as below to search for Latex expression.
SELECT MessageID, Message FROM Messages m WHERE CONTAINS (m.Message, N'2x-4=0');
The problem I am facing with above query is that some of the messages being returned by above query do not contain the latex expression 2x-4=0. For example, a message whose saved value is as below is also being returned by above query. You can clearly see that there is no 2x-4=0 contained in this message.
<p>Another example of inline Latex is \$x=34\$.</p>
<p>What are the roots of following equation: \$x^2 - 2x + 1 = 0\$?</p>
Question
Why is this happening and is there a way to get correct records returned when doing a full text search to look for the latex expression 2x-4 = 0? I have tried to repopulate the full text search data for the table being used, but it had no effect.
UPDATE 1
Strange, but the following Latex expression filter always returns exact matching results. I am now looking for $2x-4=0$ rather than 2x-4=0.
SELECT MessageID, Message FROM Messages m WHERE CONTAINS (m.Message, N'$2x-4=0$');
I have two types of delimiters for Latex expressions in my app: $$ for paragraph display and \$ for inline display. There will therefore always be a $ symbol surrounding the Latex expression stored in the database. The trailing delimiter could be \$, but full-text search seems to ignore the backslash character.
Why this modified query returns exact matches is not clear to me.
UPDATE 2
Another approach that works accurately is as mentioned in the answer. The full query for this is mentioned below. So, the LIKE operator ends up scanning only those rows that are selected by full-text search query.
WITH x AS
(
    SELECT MessageID, Message
    FROM Messages m
    WHERE CONTAINS (m.Message, N'2x-4=0')
)
SELECT MessageID, Message
FROM x
WHERE x.Message LIKE '%2x-4=0%'
To understand why it happens you can run the following query (1033 is the English language id):
select * from sys.dm_fts_parser('2x-4=0', 1033, 0,1)
In my instance it would return the following results:
Note, all other parts of the search criteria are considered to be noise words except for 2x. Therefore, I suspect your full text index simply does not have the full 2x-4=0 string and instead you get results with occurrences of 2x.
I tried adding 2x-4=0 to my own FTS index and CONTAINS was able to find it as the top result for both CONTAINS(col, '2x-4=0') and CONTAINS(col, '"2x-4=0"'). However, partial matches were included too right after the exact match.
Note that when extra white space is added around = in the search term, the FTS parser won't accept it and complains about a syntax error.
CONTAINS is more like an end-user search operation, with support for keywords like NEAR, AND and OR. Try adding quotes within the quotes, to force the exact search term:
SELECT MessageID, Message FROM Messages m WHERE CONTAINS (m.Message, N'"2x-4=0"');
This is called <simple-term> in the documentation.
You can also try the LIKE operator:
SELECT MessageID, Message FROM Messages m WHERE m.Message LIKE '%2x-4=0%';
But note that this is probably slower than CONTAINS because it doesn't use a full text search index. If it's too slow, maybe you can even combine both of them in one query, so the CONTAINS is used to filter the result set down to the non-noise words using the index, and then LIKE applies the final matching.

SQL: LIKE and Contains — Different results

I am using the CONTAINS function in MS SQL Server Express to select data. However, when I selected the same data with the LIKE operator, I realised that CONTAINS is missing a few rows.
I rebuilt the indexes, but it didn't help.
Sql: brs.SearchText like '%aprilis%' and CONTAINS(brs.SearchText, '*aprilis*')
The contains function missed rows like:
22-28.aprīlis
[1.aprīlis]
Sīraprīlis
PS. If I search directly CONTAINS(brs.SearchText, '*22-28.aprīlis*'), then it finds them
CONTAINS is functionality based on the full-text index. It supports words, phrases, and prefix matches on words, but not suffix matches. So you can match words that start with 'aprilis' but not words that end with it or that contain it somewhere in the middle. You might be able to take advantage of a thesaurus for these terms.
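A minimal sketch of the prefix form, assuming a hypothetical table name SearchRows (the question only shows the alias brs). In a prefix term, the word and the trailing asterisk go inside double quotation marks within the search string:

```sql
-- Prefix match: finds words that START with "aprilis".
-- SearchRows is a hypothetical table name; only the alias brs appears in the question.
SELECT brs.SearchText
FROM SearchRows AS brs
WHERE CONTAINS(brs.SearchText, '"aprilis*"');

-- Words that merely contain or end with the term (such as "Sīraprīlis")
-- are out of reach for the full-text index, so they still need LIKE.
SELECT brs.SearchText
FROM SearchRows AS brs
WHERE brs.SearchText LIKE '%aprilis%';
```

Whether the unaccented term also matches accented text such as "aprīlis" likely depends on the accent sensitivity of the full-text catalog and on the column collation, so treat the results above as something to verify against your own data.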
This is explained in more detail in the documentation.

CONTAINS in SQL 2000

I have a table 'Asset' with a column 'AssetDescription'. Every row of it has some group of words/sentences, separated by commas.
row1: - flowers, full color, female, Trend
row2:- baby smelling flowers, heart
Now if I run a search query like:-
select * from Asset where contains(AssetDescription,'flower')
It returns nothing.
I have one more table 'SearchData' with column 'SearchCol', having rows similar to those in table 'Asset' above. Now if I run a search query like:-
select * from SearchData where contains(SearchCol,'flower')
It returns both the rows.
QUESTION:-
Why doesn't the first query return any result, when the second one works correctly?
If 'Full Text Search' has something to do with the first question, then what should I do about it? I'm using SQL Server 2000.
CONTAINS requires a full text search index, and for full text search indexing to be enabled.
LIKE doesn't require full text search.
The advantage of using CONTAINS over LIKE is that CONTAINS is more flexible and potentially a lot faster. LIKE may require a full table scan depending how you use it.
From the SQL Server docs
In contrast to full-text search, the LIKE Transact-SQL predicate works
on character patterns only. Also, you cannot use the LIKE predicate to
query formatted binary data. Furthermore, a LIKE query against a large
amount of unstructured text data is much slower than an equivalent
full-text query against the same data. A LIKE query against millions
of rows of text data can take minutes to return; whereas a full-text
query can take only seconds or less against the same data, depending
on the number of rows that are returned.
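Your first query isn't matching anything because you're not using a wildcard character. Your rows contain the word 'flowers' whereas you're searching for rows containing 'flower'. A prefix term has to be enclosed in double quotation marks inside the search string, otherwise the asterisk is not treated as a wildcard, so you would need to change the query to:
select * from asset where contains(AssetDescription, '"flower*"')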
Try rebuilding your full-text index. Could be that it's out of date and hence not finding them when you use CONTAINS.
Assuming SQL Server, to use contains with a word prefix, you use a wildcard.
More here: http://msdn.microsoft.com/en-us/library/ms187787.aspx
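Putting the suggestions above together, here is a rough checklist in T-SQL. The catalog name AssetCatalog is hypothetical (use whatever catalog the Asset table is actually registered in), and the system procedure shown is the SQL Server 2000 era interface, so check it against your version's documentation.

```sql
-- 1. Is the column registered for full-text indexing at all? (1 = yes, 0 = no)
SELECT COLUMNPROPERTY(OBJECT_ID('Asset'), 'AssetDescription', 'IsFulltextIndexed') AS IsIndexed;

-- 2. Rebuild the catalog, then start a full population.
--    'AssetCatalog' is a hypothetical catalog name.
EXEC sp_fulltext_catalog 'AssetCatalog', 'rebuild';
EXEC sp_fulltext_catalog 'AssetCatalog', 'start_full';

-- 3a. Prefix search: matches "flower", "flowers", "flowering", ...
SELECT * FROM Asset WHERE CONTAINS(AssetDescription, '"flower*"');

-- 3b. Inflectional search: matches grammatical forms of "flower" such as "flowers".
SELECT * FROM Asset WHERE CONTAINS(AssetDescription, 'FORMSOF(INFLECTIONAL, flower)');
```

Population runs in the background, so the queries in step 3 may return nothing until it has completed.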

Can scalar functions be applied before filtering when executing a SQL Statement?

I suppose I have always naively assumed that scalar functions in the select part of a SQL query will only get applied to the rows that meet all the criteria of the where clause.
Today I was debugging some code from a vendor and had that assumption challenged. The only reason I can think of for this code failing is that the Substring() function is getting called on data that should have been filtered out by the WHERE clause. It appears that the substring call is being applied before the filtering happens, and so the query fails.
Here is an example of what I mean. Let's say we have two tables, each with 2 columns and having 2 rows and 1 row respectively. The first column in each is just an id. NAME is just a string, and NAME_LENGTH tells us how many characters in the name with the same ID. Note that only names with more than one character have a corresponding row in the LONG_NAMES table.
NAMES: ID, NAME
1, "Peter"
2, "X"
LONG_NAMES: ID, NAME_LENGTH
1, 5
If I want a query to print each name with the last 3 letters cut off, I might first try something like this (assuming SQL Server syntax for now):
SELECT substring(NAME,1,len(NAME)-3)
FROM NAMES;
I would soon find out that this gives me an error, because when it reaches "X" it will try to use a negative number for the length argument in the substring call, and it will fail.
The way my vendor decided to solve this was by filtering out rows where the strings were too short for the len - 3 query to work. He did it by joining to another table:
SELECT substring(NAMES.NAME,1,len(NAMES.NAME)-3)
FROM NAMES
INNER JOIN LONG_NAMES
ON NAMES.ID = LONG_NAMES.ID;
At first glance, this query looks like it might work. The join condition will eliminate any rows that have NAME fields short enough for the substring call to fail.
However, from what I can observe, SQL Server will sometimes try to calculate the substring expression for everything in the table, and then apply the join to filter out rows. Is this supposed to happen this way? Is there a documented order of operations where I can find out when certain things will happen? Is it specific to a particular database engine or part of the SQL standard? If I decided to include some predicate on my NAMES table to filter out short names (like len(NAME) > 3), could SQL Server also choose to apply that after trying to apply the substring? If so, then it seems the only safe way to do a substring would be to wrap it in a "case when" construct in the select?
Martin gave this link that pretty much explains what is going on: the query optimizer has free rein to reorder things however it likes. I am including this as an answer so I can accept something. Martin, if you create an answer with your link in it I will gladly accept that instead of this one.
I do want to leave my question here because I think it is a tricky one to search for, and my particular phrasing of the issue may be easier for someone else to find in the future.
TSQL divide by zero encountered despite no columns containing 0
EDIT: As more responses have come in, I am again confused. It does not seem clear yet when exactly the optimizer is allowed to evaluate things in the select clause. I guess I'll have to go find the SQL standard myself and see if I can make sense of it.
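For what it's worth, the "case when" guard from the question can be sketched like this, using the NAMES and LONG_NAMES tables from the example. The column alias TRIMMED_NAME is made up for illustration. The idea is that the length check sits in the same expression as the substring call, so it does not matter whether the join is applied before or after the expression is evaluated.

```sql
-- Guard the SUBSTRING call inside the expression itself rather than
-- relying on the join (or a WHERE clause) to filter short names first.
SELECT CASE
         WHEN LEN(NAMES.NAME) > 3
           THEN SUBSTRING(NAMES.NAME, 1, LEN(NAMES.NAME) - 3)
         ELSE NULL  -- name too short to trim
       END AS TRIMMED_NAME
FROM NAMES
INNER JOIN LONG_NAMES
    ON NAMES.ID = LONG_NAMES.ID;
```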
Joe Celko, who helped write early SQL standards, has posted something similar to this several times in various USENET newsgroups. (I'm skipping over the clauses that don't apply to your SELECT statement.) He usually said something like "This is how statements are supposed to act like they work". In other words, SQL implementations should behave exactly as if they did these steps, without actually being required to do each of these steps.
1. Build a working table from all of the table constructors in the FROM clause.
2. Remove from the working table those rows that do not satisfy the WHERE clause.
3. Construct the expressions in the SELECT clause against the working table.
So, following this, no SQL dbms should act like it evaluates functions in the SELECT clause before it acts like it applies the WHERE clause.
In a recent posting, Joe expands the steps to include CTEs.
CJ Date and Hugh Darwen say essentially the same thing in chapter 11 ("Table Expressions") of their book A Guide to the SQL Standard. They also note that this chapter corresponds to the "Query Specification" section (sections?) in the SQL standards.
You are thinking about something called the query execution plan. It's based on query optimization rules, indexes, temporary buffers and execution-time statistics. If you are using SQL Server Management Studio, there is a toolbar above the query editor where you can look at the estimated execution plan, which shows how your query will be transformed to gain some speed. So if you have just used your NAMES table and it is in the buffer, the engine might first evaluate the expression on that data and then join it with the other table.

MySQL MATCH...AGAINST sometimes finds answer, sometimes doesn't

The following two queries return the same (expected) result when I query my database:
SELECT * FROM articles
WHERE content LIKE '%Euskaldunak%'
SELECT * FROM articles
WHERE MATCH (content) AGAINST ('+"Euskaldunak"' IN BOOLEAN MODE)
The text in the content field that it's searching looks like this: "...These Euskaldunak, or newcomers..."
However, the following query on the same table returns the expected single result:
SELECT * FROM articles
WHERE content LIKE '%PCC%'
And the following query returns an empty result:
SELECT * FROM articles
WHERE MATCH (content) AGAINST ('+"PCC"' IN BOOLEAN MODE)
The text in the content field that matches this result looks like this: "...Portland Community College (PCC) is the largest..."
I can't figure out why searching for "Euskaldunak" works with that MATCH...AGAINST syntax but "PCC" doesn't. Does anyone see something that I'm not seeing?
(Also: "PCC" is not a common phrase in this field - no other rows contain the word, so the natural language search shouldn't be excluding it.)
Your fulltext minimum word length is probably set too high. I think the default is 4, which would explain what you are seeing. Set it to 1 if you want all words indexed regardless of length.
Run this query:
show variables like 'ft_min_word_len';
If the value is greater than 3 and you want to get hits on words shorter than that, edit your /etc/my.cnf and add or update this line in the [mysqld] section using a value appropriate for your application:
ft_min_word_len = 1
Then restart MySQL and rebuild your fulltext indexes and you should be all set.
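As a rough sketch of the check-and-rebuild steps, assuming the articles table uses the MyISAM engine (InnoDB full-text indexes use a different setting, innodb_ft_min_token_size, and are rebuilt differently):

```sql
-- Confirm the new minimum word length after the restart.
SHOW VARIABLES LIKE 'ft_min_word_len';

-- Rebuild the FULLTEXT index on the MyISAM table so that
-- shorter words such as "PCC" are picked up.
REPAIR TABLE articles QUICK;

-- Re-run the query that previously returned nothing.
SELECT * FROM articles
WHERE MATCH (content) AGAINST ('+"PCC"' IN BOOLEAN MODE);
```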
There are two things I can think of right away. The first is your ft_min_word_len value is set to more than 3 characters. Any "word" less than the ft_min_word_len length will not get indexed.
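The second is that more than 50% of your records contain the 'PCC' string. A natural-language full-text search that matches more than 50% of the records is considered irrelevant and returns nothing. Note that, per the MySQL documentation, boolean mode searches do not apply this 50% threshold, so it would only matter if you switched to a natural-language search.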
Full text indexes have different rules than regular string indexes. For example, there is a stop words list so certain common words like to, the, and, don't get indexed.