SQL Server full-text search for LaTeX content

I have a web app that lets users save LaTeX content to a SQL Server 2012 database. I run the full-text query below to search for a LaTeX expression.
SELECT MessageID, Message FROM Messages m WHERE CONTAINS (m.Message, N'2x-4=0');
The problem is that some of the messages it returns do not contain the LaTeX expression 2x-4=0. For example, the message below is returned even though it clearly does not contain 2x-4=0.
<p>Another example of inline Latex is \$x=34\$.</p>
<p>What are the roots of following equation: \$x^2 - 2x + 1 = 0\$?</p>
Question
Why is this happening, and is there a way to get only the correct records back when running a full-text search for the LaTeX expression 2x-4=0? I tried repopulating the full-text index for the table, but it had no effect.
UPDATE 1
Strangely, the following filter always returns exact matches. It searches for $2x-4=0$ rather than 2x-4=0.
SELECT MessageID, Message FROM Messages m WHERE CONTAINS (m.Message, N'$2x-4=0$');
My app uses two delimiters for LaTeX expressions: $$ for paragraph display and \$ for inline display. There is therefore always a $ symbol on each side of an expression stored in the database. The trailing delimiter may be \$, but full-text search seems to ignore the backslash character.
It is not clear to me why this modified query returns exact matches.
UPDATE 2
Another approach that works accurately is the one suggested in the answer. The full query is below. The LIKE operator ends up scanning only the rows selected by the full-text query.
WITH x AS
(SELECT MessageID,
Message
FROM Messages m
WHERE CONTAINS (m.Message,
N'2x-4=0') )
SELECT MessageID,
Message
FROM x
WHERE x.Message LIKE '%2x-4=0%'

To understand why it happens you can run the following query (1033 is the English language id):
select * from sys.dm_fts_parser('2x-4=0', 1033, 0,1)
In my instance it would return the following results:
Note that every part of the search term except 2x is treated as a noise word. I therefore suspect that your full-text index does not contain the whole string 2x-4=0, so you get every row that contains 2x.
I tried adding 2x-4=0 to my own FTS index, and CONTAINS found it as the top result for both CONTAINS(col, '2x-4=0') and CONTAINS(col, '"2x-4=0"'). However, partial matches were still returned right after the exact match.
Note also that when extra white space is added around = in the search term, the FTS parser rejects it and reports a syntax error.
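The tokenization described in this answer can be modeled with a few lines of Python. This is only a toy sketch: the regular expression and the digit-only stop list are assumptions chosen to reproduce the reported behavior (only 2x survives), not the real SQL Server word breaker or system stoplist, and `tokenize` and `toy_contains` are hypothetical helper names.

```python
import re

# Assumption: single digits stand in for the system stoplist.
# The real stoplist and word breaker are more elaborate than this.
STOP_WORDS = {str(d) for d in range(10)}

def tokenize(text):
    """Split on non-alphanumeric characters and drop stop words."""
    found = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in found if t not in STOP_WORDS]

def toy_contains(message, search_term):
    """True when every surviving search token occurs somewhere in the message."""
    wanted = tokenize(search_term)
    return bool(wanted) and set(wanted) <= set(tokenize(message))

# The message from the question (both paragraphs of the saved value).
message = (
    r"<p>Another example of inline Latex is \$x=34\$.</p>"
    r"<p>What are the roots of following equation: \$x^2 - 2x + 1 = 0\$?</p>"
)
# Hypothetical message that does not contain 2x at all.
unrelated = "<p>Plain text without the term.</p>"

print(tokenize("2x-4=0"))                  # ['2x']
print(toy_contains(message, "2x-4=0"))     # True
print(toy_contains(unrelated, "2x-4=0"))   # False
```

In this model the search term collapses to the single token 2x, which is why a message containing x^2 - 2x + 1 = 0 still counts as a hit.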

CONTAINS is more like an end-user search operation, with support for keywords like NEAR, AND and OR. Try adding quotes within the quotes, to force the exact search term:
SELECT MessageID, Message FROM Messages m WHERE CONTAINS (m.Message, N'"2x-4=0"');
This is called <simple-term> in the documentation.
You can also try the LIKE operator:
SELECT MessageID, Message FROM Messages m WHERE m.Message LIKE '%2x-4=0%';
But note that this is probably slower than CONTAINS because it doesn't use a full text search index. If it's too slow, maybe you can even combine both of them in one query, so the CONTAINS is used to filter the result set down to the non-noise words using the index, and then LIKE applies the final matching.
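The two-stage idea (use the index to narrow the candidates, then apply an exact substring test) can be sketched in Python as follows. This is a minimal illustration under assumptions: the inverted index is a plain dictionary, the sample rows are partly made up, and `build_index` and `search` are hypothetical names, not part of any SQL Server API.

```python
import re

def tokens(text):
    """Lower-cased alphanumeric tokens of a text (toy word breaker)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_index(rows):
    """Map each token to the set of row ids that contain it."""
    index = {}
    for row_id, text in rows.items():
        for tok in tokens(text):
            index.setdefault(tok, set()).add(row_id)
    return index

def search(rows, index, needle, index_token):
    """Stage 1: cheap index lookup. Stage 2: exact substring check (the LIKE step)."""
    candidates = index.get(index_token, set())
    return sorted(rid for rid in candidates if needle in rows[rid])

# Rows 1 and 2 come from the question; row 3 is a made-up exact match.
rows = {
    1: "Another example of inline Latex is $x=34$.",
    2: "What are the roots of $x^2 - 2x + 1 = 0$?",
    3: "Solve $2x-4=0$ for x.",
}
index = build_index(rows)

print(sorted(index["2x"]))                   # [2, 3]
print(search(rows, index, "2x-4=0", "2x"))   # [3]
```

The substring test only runs on the rows the index returned, which is the reason the combined query can be cheaper than a bare LIKE over the whole table.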

Related

Full Text Search Find Exact Word

I use full-text search on my SynonymWords table. I need to find an exact word.
I put my search word between quotes, but I still get a result.
I expected no result. Where did I make a mistake?
My query is
SELECT * FROM SynonymWords WHERE CONTAINS(Words,'"zebra"')
The result is here
A full-text index doesn't work that way.
It finds zebra wherever it has a space or a punctuation mark before and after it, so that it can be identified as a word.
That is how full-text search works, and you can't change it.
A word like zebrajsduerhn would not be found by your query.
To match such words as well, you would change it to
SELECT * FROM SynonymWords WHERE CONTAINS( [words], '"zebra*"')
It takes some time to get used to it.
If you want an exact search for a word, I recommend reading the thread "SQL Server query to match the exact word", which covers this and also has an answer using regular expressions.
If you need something faster, take a look at Elasticsearch and OpenSearch, but it also takes some time to understand their concepts.

Why is SQL Server full-text search not matching numbers?

I'm using SQL Server 2014 Express, and have a full-text index setup on a table.
The full-text index only indexes a single column, in this example named foo.
The table has 3 rows in it. The values in the 3 rows, for that full-text indexed column are like so ...
test 1
test 2
test 3 test 1
Each new line above is a new row in the table, and that text is literally what is in the full-text indexed column. So, using SQL Server's CONTAINS function, if I perform the following query, I get all rows back as matches, as expected.
SELECT * FROM example WHERE CONTAINS(foo, 'test')
But, if I run the following query, I also get all of the rows back as matches, which I am not expecting. In the following query, I only expected one row as a match.
SELECT * FROM example WHERE CONTAINS(foo, '"test 3"')
Lastly, simply searching for "3" returns no matching rows, which I also did not expect. I'd expect one matching row from the following query, but get none.
SELECT * FROM example WHERE CONTAINS(foo, '3')
I've read the MSDN pages on CONTAINS and full-text indexing, but I can't figure out this behavior. I must be doing something wrong.
Would anybody be able to explain to me what's happening and how to perform the searches I've described?
While this may not be the answer, it solved my original problem. My full-text index was using the system stop list. For whatever reason, certain individual numbers, such as "1" in "test 1", were being treated as stop words and left out of the index.
The following question and answer, here on SO, suggested disabling the stop list altogether. I did this, and my full-text searches now match as I expected, apparently at the expense of a larger full-text index.
Full text search does not work if stop word is included even though stop word list is empty
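The effect of switching the stop list off can be reproduced with a toy model in Python. This is a sketch under assumptions: the digit-only stop list and the way stop words are simply dropped from both the index and the query are simplifications that reproduce the behavior reported above, not a description of SQL Server internals, and `toy_search` is a hypothetical name.

```python
import re

# The three rows from the question.
ROWS = {1: "test 1", 2: "test 2", 3: "test 3 test 1"}

# Assumption: single digits stand in for the system stop list.
DIGIT_STOPLIST = {str(d) for d in range(10)}

def words(text, stoplist):
    """Tokens of a text with stop words removed."""
    return [w for w in re.findall(r"\w+", text.lower()) if w not in stoplist]

def toy_search(query, stoplist):
    """Return ids of rows whose indexed words contain the query words as a phrase."""
    wanted = words(query, stoplist)
    if not wanted:
        return []  # nothing left to search for
    n = len(wanted)
    hits = []
    for row_id, text in ROWS.items():
        indexed = words(text, stoplist)
        if any(indexed[i:i + n] == wanted for i in range(len(indexed) - n + 1)):
            hits.append(row_id)
    return hits

print(toy_search("test 3", DIGIT_STOPLIST))  # [1, 2, 3]
print(toy_search("3", DIGIT_STOPLIST))       # []
print(toy_search("test 3", set()))           # [3]
print(toy_search("3", set()))                # [3]
```

With the stop list active, "test 3" degrades to "test" and "3" degrades to nothing; with an empty stop list both queries match only the third row.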

Examine lucene.net custom query after analyzer tokenizes

I'm using Examine in Umbraco to query Lucene index of content nodes. I have a field "completeNodeText" that is the concatenation of all the node properties (to keep things simple and not search across multiple fields).
I'm accepting user-submitted search terms. When the search term is multiple words (i.e., "firstterm secondterm"), I want the resulting query to be an OR query: bring me back results where completeNodeText contains firstterm OR secondterm.
I want:
{+completeNodeText:"firstterm ? secondterm"}
but instead, I'm getting:
{+completeNodeText:"firstterm secondterm"}
If I search for "firstterm OR secondterm" instead of "firstterm secondterm", then the generated query is correctly: {+completeNodeText:"firstterm ? secondterm"}
I'm using the following API calls:
var searcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
var searchCriteria = searcher.CreateSearchCriteria();
var query = searchCriteria.Field("completeNodeText", term).Compile();
Is there an easy way to force Examine to generate this "OR" query? Or do I have to manually construct the raw query by calling the StandardAnalyzer to tokenize the user input and concatenating together a query by iterating through the tokens? And bypassing the entire Examine fluent query API?
I don't think that question mark means what you think it means.
It looks like you are generating a PhraseQuery, but you want two disjoint TermQueries. In Lucene query syntax, a phrase query is enclosed in quotes.
"firstterm secondterm"
A phrase query looks for precisely that phrase, with the two terms appearing consecutively and in order. Placing an OR within a phrase query does not perform any sort of boolean logic; it is treated as the word "OR". The question mark is a placeholder used in PhraseQuery.toString() to represent a removed stop word (see #Lucene-1396). You are still performing a phrase query, but now it expects a three-word phrase: firstterm, followed by a removed stop word, followed by secondterm.
To simply search for two separate terms, get rid of the quotes.
firstterm secondterm
This searches for any document containing either of those terms (with a higher score given to documents containing both).
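The difference between the two query shapes can be illustrated with a small Python model. This is only a sketch of the matching semantics described above: it uses whitespace splitting instead of Lucene's StandardAnalyzer, ignores scoring, and `phrase_match` and `any_term_match` are hypothetical names rather than Lucene or Examine APIs.

```python
def terms(text):
    """Whitespace tokenization, lower-cased (a stand-in for a real analyzer)."""
    return text.lower().split()

def phrase_match(doc, phrase):
    """True when the phrase terms occur consecutively and in order."""
    d, p = terms(doc), terms(phrase)
    return any(d[i:i + len(p)] == p for i in range(len(d) - len(p) + 1))

def any_term_match(doc, query):
    """True when at least one query term occurs in the document."""
    return bool(set(terms(doc)) & set(terms(query)))

docs = [
    "firstterm secondterm together",
    "only firstterm here",
    "secondterm then firstterm",
    "nothing relevant",
]
query = "firstterm secondterm"

print([phrase_match(d, query) for d in docs])    # [True, False, False, False]
print([any_term_match(d, query) for d in docs])  # [True, True, True, False]
```

The phrase form accepts only the first document, while the unquoted form accepts every document that mentions either term.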

SQL Contains - only match at start

For some reason I cannot find the answer on Google! With the SQL CONTAINS function, how can I tell it to match only at the beginning of a string? That is, I am looking for the full-text equivalent of
LIKE 'some_term%'.
I know I can use like, but since I already have the full-text index set up, AND the table is expected to have thousands of rows, I would prefer to use Contains.
Thanks!
You want something like this:
Rather than specify multiple terms, you can use a 'prefix term' if the
terms begin with the same characters. To use a prefix term, specify
the beginning characters, then add an asterisk (*) wildcard to the end
of the term. Enclose the prefix term in double quotes. The following
statement returns the same results as the previous one.
-- Search for all terms that begin with 'storm'
SELECT StormID, StormHead, StormBody FROM StormyWeather
WHERE CONTAINS(StormHead, '"storm*"')
http://www.simple-talk.com/sql/learn-sql-server/full-text-indexing-workbench/
You can use CONTAINS with a LIKE subquery for matching only a start:
SELECT *
FROM (
SELECT *
FROM myTable WHERE CONTAINS(edition, '"Alice in wonderland"')
) AS S1
WHERE S1.edition LIKE 'Alice in wonderland%'
This way, the slow LIKE query will be run against a smaller set.
The only solution I can think of is to actually prepend a unique word to the beginning of every field in the table.
e.g. Update every row so that 'xfirstword ' appears at the start of the text (e.g. Field1). Then you can search for CONTAINS(Field1, 'NEAR ((xfirstword, "TERM*"),0)')
Pretty crappy solution, especially as we know that the full text index stores the actual position of each word in the text (see this link for details: http://msdn.microsoft.com/en-us/library/ms142551.aspx)
I am facing a similar issue. This is what I have implemented as a workaround.
I created another table and copied into it only the rows matching LIKE 'some_term%'.
I then set up the full-text search on this new table.
Please let me know if you have tried a better approach.

MySQL MATCH...AGAINST sometimes finds answer, sometimes doesn't

The following two queries return the same (expected) result when I query my database:
SELECT * FROM articles
WHERE content LIKE '%Euskaldunak%'
SELECT * FROM articles
WHERE MATCH (content) AGAINST ('+"Euskaldunak"' IN BOOLEAN MODE)
The text in the content field that it's searching looks like this: "...These Euskaldunak, or newcomers..."
However, the following query on the same table returns the expected single result:
SELECT * FROM articles
WHERE content LIKE '%PCC%'
And the following query returns an empty result:
SELECT * FROM articles
WHERE MATCH (content) AGAINST ('+"PCC"' IN BOOLEAN MODE)
The text in the content field that matches this result looks like this: "...Portland Community College (PCC) is the largest..."
I can't figure out why searching for "Euskaldunak" works with that MATCH...AGAINST syntax but "PCC" doesn't. Does anyone see something that I'm not seeing?
(Also: "PCC" is not a common phrase in this field - no other rows contain the word, so the natural language search shouldn't be excluding it.)
Your fulltext minimum word length is probably set too high. I think the default is 4, which would explain what you are seeing. Set it to 1 if you want all words indexed regardless of length.
Run this query:
show variables like 'ft_min_word_len';
If the value is greater than 3 and you want to get hits on words shorter than that, edit your /etc/my.cnf and add or update this line in the [mysqld] section, using a value appropriate for your application:
ft_min_word_len = 1
Then restart MySQL and rebuild your fulltext indexes and you should be all set.
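The effect of the minimum word length can be sketched with a toy indexer in Python. This is an illustration under assumptions: `indexed_words` is a hypothetical helper that only models the length cutoff, not MySQL's actual full-text parser or stopword handling.

```python
import re

def indexed_words(text, min_word_len):
    """Words that would reach the index under a simple length cutoff."""
    return {w for w in re.findall(r"\w+", text.lower()) if len(w) >= min_word_len}

pcc_text = "Portland Community College (PCC) is the largest"
eusk_text = "These Euskaldunak, or newcomers"

print("pcc" in indexed_words(pcc_text, 4))           # False
print("pcc" in indexed_words(pcc_text, 3))           # True
print("euskaldunak" in indexed_words(eusk_text, 4))  # True
```

With a cutoff of 4 the three-letter word never reaches the index, so no query can find it, while the longer word is indexed either way.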
There are two things I can think of right away. The first is that your ft_min_word_len value is set to more than 3 characters. Any "word" shorter than ft_min_word_len will not get indexed.
The second is that more than 50% of your records contain the 'PCC' string. A natural-language full-text search term that matches more than 50% of the records is considered irrelevant and returns nothing. Note that this threshold does not apply to BOOLEAN MODE searches such as the one in the question.
Full-text indexes have different rules than regular string indexes. For example, there is a stop word list, so certain common words such as "to", "the", and "and" don't get indexed.