How to match against subsets of a search string in SOLR/lucene

I've got an unusual situation. Normally when you search a text index you are searching for a small number of keywords against documents with a larger number of terms.
For example you might search for "quick brown" and expect to match "the quick brown fox jumps over the lazy dog".
I have the situation where I have lots of small phrases in my document store and I wish to match them against a larger query phrase.
For example if I have a query:
"the quick brown fox jumps over the lazy dog"
and the documents
"quick brown"
"fox over"
"lazy dog"
I'd like to find the documents that have a phrase that occurs in the query. In this case "quick brown" and "lazy dog" (but not "fox over" because although the tokens match it's not a phrase in the search string).
Is this sort of query possible with SOLR/lucene?

It sounds like you want to use ShingleFilter in your analysis so that you index word bigrams: add ShingleFilterFactory at both index and query time.
At index time your documents are then indexed like this:
"quick brown" -> quick_brown
"fox over" -> fox_over
"lazy dog" -> lazy_dog
At query time your query becomes:
"the quick brown fox jumps over the lazy dog" -> "the_quick quick_brown brown_fox fox_jumps jumps_over over_the the_lazy lazy_dog"
This is still no good: by default it will form a phrase query.
So, in your query analyzer only, add PositionFilterFactory after the ShingleFilterFactory. This "flattens" the positions in the query so that the query parser treats the output as synonyms, which yields a BooleanQuery with these sub-queries (all SHOULD clauses, so it's basically an OR query):
BooleanQuery:
the_quick OR
quick_brown OR
brown_fox OR
...
This should be the most performant way, as then it's really just a BooleanQuery of TermQueries.
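A minimal sketch of the equivalent analyzer chains in Lucene Java (Solr's ShingleFilterFactory and PositionFilterFactory wrap these classes, which live in the contrib analyzers jar in 3.x); the underscore separator and the class wiring are illustrative assumptions:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.position.PositionFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Index time: emit only word bigrams, e.g. "quick brown" -> quick_brown.
class ShingleIndexAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new StandardTokenizer(Version.LUCENE_30, reader);
        ts = new LowerCaseFilter(ts);
        ShingleFilter shingles = new ShingleFilter(ts, 2); // bigrams only
        shingles.setOutputUnigrams(false);                 // drop single words
        shingles.setTokenSeparator("_");
        return shingles;
    }
}

// Query time: the same shingling, plus PositionFilter so the query parser
// sees all bigrams at the same position and builds an OR of TermQueries
// instead of a phrase query.
class ShingleQueryAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new StandardTokenizer(Version.LUCENE_30, reader);
        ts = new LowerCaseFilter(ts);
        ShingleFilter shingles = new ShingleFilter(ts, 2);
        shingles.setOutputUnigrams(false);
        shingles.setTokenSeparator("_");
        return new PositionFilter(shingles); // flatten positions
    }
}

One caveat: with setOutputUnigrams(false) a single-word document produces no tokens at all, so keep unigrams if one-word phrases need to be searchable.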

Sounds like you want the DisMax "minimum match" (mm) parameter. I wrote a blog article on the concept a little while back: http://blog.websolr.com/post/1299174416. There's also the Solr wiki on minimum match.
The "minimum match" concept is applied against all the "optional" terms in your query -- terms that aren't explicitly specified, using +/-, whether they are "+mandatory" or "-prohibited". By default, the minimum match is 100%, meaning that 100% of the optional terms must be present. In other words, all of your terms are considered mandatory.
This is why your longer query isn't currently matching documents containing shorter fragments of that phrase. The other keywords in the longer search phrase are treated as mandatory.
If you drop the minimum match down to 1, then only one of your optional terms will be considered mandatory. In some ways this is the opposite of the default of 100%. It's like your query of quick brown fox… is turned into quick OR brown OR fox OR … and so on.
If you set your minimum match to 2, then your search phrase effectively gets broken up into groups of two terms. A search for quick brown fox turns into (quick brown) OR (brown fox) OR (quick fox) … and so on. (Excuse my pseudo-query there, I trust you see the point.)
The minimum match parameter also supports percentages -- say, 20% -- and some even more complex expressions. So there's a fair amount of tweakability.
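For example, a dismax request with mm set to 2 might look like this (the field name text is an assumption):

q=the quick brown fox jumps over the lazy dog
defType=dismax
qf=text
mm=2

This asks Solr to return documents matching at least two of the optional terms.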

Only setting the mm parameter will not satisfy your needs, since
"the quick brown fox jumps over the lazy dog"
will match all three documents
"quick brown"
"fox over"
"lazy dog"
and as you said:
I'd like to find the documents that have a phrase that occurs in the query. In this case "quick brown" and "lazy dog" (but not "fox over" because although the tokens match it's not a phrase in the search string).

Related

SQL Server Full Text Search with complete sentences

I have an Azure SQL database and have tried the full text search.
Is it possible to search for a complete sentence?
E.g. a query with the LIKE operator that works (but is probably not as fast as full text search):
SELECT Sentence
FROM Sentences
WHERE 'This is a whole sentence for example.' LIKE '%'+Sentence+'%'
Would return: "a whole sentence"
I need something like that with full text search:
SELECT Sentence
FROM Sentences
WHERE FREETEXT(WorkingExperience,'This is a whole sentence for example.')
This will return each hit on a word, but not on the complete sentence.
E.g. it would return: "a whole sentence" and "another sentence".
Is that possible or do I have to use the LIKE operator?
Have you tried this:
SELECT Sentence
FROM Sentences
WHERE FREETEXT(WorkingExperience,'"This is a whole sentence for example."')
If the above doesn't work, you may need to construct the proper FTS search string using the AND operator, like below:
SELECT Sentence
FROM Sentences
WHERE FREETEXT(WorkingExperience,'"This" AND "is" AND "a" AND "whole"
AND "sentence" AND "for" AND "example."')
Also, for more precise matching I recommend using CONTAINS or CONTAINSTABLE:
SELECT Sentence
FROM Sentences
WHERE CONTAINS(WorkingExperience,'"This is a whole sentence for example."')
HTH
If anyone else is interested, here is a link to a good article with examples:
https://www.microsoftpressstore.com/articles/article.aspx?p=2201634&seqNum=3
You can choose the best method to accommodate your need from the examples.
To me, matching a whole sentence can easily be done with the WHERE clause below, as mentioned in the other answer:
WHERE CONTAINS(WorkingExperience,'"This is a whole sentence for example."')
If you need to look for all the words but the user might input them out of order, I would suggest using:
WHERE CONTAINS(WorkingExperience, N'NEAR(This, whole, sentence, is, a, for, example)')
You can do other tricks with this, which can be found in the article above. If you need to order the results based on the hit score/rank, you will need to use CONTAINSTABLE instead of CONTAINS.
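A minimal sketch of the ranked variant, assuming SentenceId is the table's full-text key column:

SELECT s.Sentence, k.[RANK]
FROM Sentences AS s
INNER JOIN CONTAINSTABLE(Sentences, WorkingExperience,
    '"This is a whole sentence for example."') AS k
    ON s.SentenceId = k.[KEY]
ORDER BY k.[RANK] DESC;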

Lucene.net GetFieldQuery vs TermQuery

Using Lucene's standard analyzer. The title field in question is non-stored and analyzed. The query is as follows:
title:"Some-Url-Friendly-Title"
In Luke, this query gets correctly rewritten as:
title:"some url friendly title" (hyphens replaced by whitespace, everything lowercased).
I thought the Lucene.net version would be:
new TermQuery(new Term("title","Some-Url-Friendly-Title"))
However, no results are returned.
Then I tried:
_parser.GetFieldQuery("title","Some-Url-Friendly-Title")
And it worked as expected!
Both queries were executed via:
_searcher.Search([query object], [sort object])
Can somebody point me in the right direction to see what the differences between TermQuery and _parser.GetFieldQuery() are?
A TermQuery is much simpler than running a query through a query parser. Not only is the text not lowercased and hyphenated terms not broken up, it isn't even tokenized. It just searches for the exact term you give it. That means it is looking for "Some-Url-Friendly-Title" as a single untokenized keyword in your index. I assume you are using an analyzer, so chances are no such token exists.
To take it a step further, if you had been searching for "Some Url Friendly Title" as the term text, you still wouldn't come up with anything, since it's looking for "Some Url Friendly Title" as a single token, not as the four tokens (or rather, terms) in your index.
If you look at what the standard query parser generates when you parse your query, you'll see that TermQueries are only one of the building blocks it uses to assemble the complete query, along with BooleanQuery, and possibly PhraseQuery, PrefixQuery, etc.
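A minimal sketch of the contrast in Lucene 3.x Java (the Lucene.Net 3.0.3 API mirrors it closely); the field name comes from the question, everything else is illustrative:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.Version;

public class TermQueryVsParser {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        QueryParser parser = new QueryParser(Version.LUCENE_30, "title", analyzer);

        // The parser runs the text through the analyzer: hyphens split,
        // tokens lowercased, and a PhraseQuery is built.
        Query parsed = parser.parse("\"Some-Url-Friendly-Title\"");
        System.out.println(parsed); // title:"some url friendly title"

        // TermQuery bypasses analysis entirely: it looks up the literal,
        // untokenized term, which never exists in an analyzed index.
        Query raw = new TermQuery(new Term("title", "Some-Url-Friendly-Title"));
        System.out.println(raw);    // title:Some-Url-Friendly-Title
    }
}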
In Lucene.Net version 3.0.3 the GetFieldQuery is inaccessible due to it's protection modifier. Use
MultiFieldQueryParser.Parse(searchText, field)
instead.

How is exact phrase search performed by a search engine?

I am using Lucene to search in a data set, and I need to know how the "" search (I mean exact phrase search) mechanism is implemented.
I want to make it return all "little cat" hits when the user enters "littlecat". I know that I should manipulate the indexing code, but at least I should know how the "" search works.
I want to make it return all "little cat" hits when the user enters "littlecat"
This might sound easy, but it is very tough to implement. For a human being "little" and "cat" are two different words, but a computer cannot tell that "littlecat" is made of "little" and "cat" unless you have a dictionary and your code checks those two words against it. On the other hand, a search for "little cat" can easily be made to match "littlecat" as well. And I believe this goes beyond the concept of an exact phrase search: exact phrase search will only return "littlecat" if you search for "littlecat", and vice versa. Even Google, seemingly (and expectedly), doesn't return "little cat" results for a littlecat search.
A way to implement this is dynamic programming: using a dictionary/corpus to compare your candidate words against (and also the leftover words after you have parsed the text into substrings).
Think of it like you were writing a custom spell-checker or similar. There is also a scenario where more than one combination of words may be left over, e.g. "walkingmydoginrain": here you could break the first word as "walk" or as "walking", and this is the beauty of DP, since you know (from your corpus) that you can't form legitimate words from "ingmydoginrain" (i.e. the rest of the string). You have just discovered that in this context you should pick the segmented word "walking" and NOT "walk".
Also, think of failing to find a match as adding to a cost function that you define, so you get optimal results: you can be sure that your text (unseparated by white space) will be broken into legitimate words, though there may be more than one possible word sequence in that line (and hence more than one possible intent of the person searching).
You should be able to find pretty good base implementations on the web for your use case (read also: how Google implements "Did you mean").
For now, see also: How to split text without spaces into list of words?
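To make the DP idea concrete, here is a minimal word-break sketch in Java; the tiny hardcoded dictionary is a stand-in for the corpus the answer mentions:

import java.util.*;

public class WordBreak {
    // Returns one legitimate segmentation of text, or null if none exists.
    static List<String> segment(String text, Set<String> dictionary) {
        int n = text.length();
        // split[i] = j means text[j..i) is a dictionary word and
        // text[0..j) is itself segmentable; -1 means unreachable.
        int[] split = new int[n + 1];
        Arrays.fill(split, -1);
        split[0] = 0;
        for (int i = 1; i <= n; i++) {
            for (int j = 0; j < i; j++) {
                if (split[j] >= 0 && dictionary.contains(text.substring(j, i))) {
                    split[i] = j;
                    break;
                }
            }
        }
        if (split[n] < 0) return null; // no segmentation into legitimate words
        LinkedList<String> words = new LinkedList<String>();
        for (int i = n; i > 0; i = split[i]) {
            words.addFirst(text.substring(split[i], i));
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<String>(Arrays.asList("little", "cat"));
        System.out.println(segment("littlecat", dict)); // [little, cat]
    }
}

A real implementation would score competing segmentations with the cost function described above instead of taking the first word found.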

Comparison of Lucene Analyzers

Can someone please explain the difference between the different analyzers within Lucene? I am getting a maxClauseCount exception, and I understand that I can avoid this by using a KeywordAnalyzer, but I don't want to change from the StandardAnalyzer without understanding the issues surrounding analyzers. Thanks very much.
In general, any analyzer in Lucene is tokenizer + stemmer + stop-words filter.
The tokenizer splits your text into chunks, and since different analyzers may use different tokenizers, you can get different output token streams, i.e. sequences of chunks of text. For example, the KeywordAnalyzer you mentioned doesn't split the text at all and takes the whole field as a single token. At the same time, StandardAnalyzer (and most other analyzers) uses spaces and punctuation as split points. For example, for the phrase "I am very happy" it will produce the list ["i", "am", "very", "happy"] (or something like that). For more information on specific analyzers/tokenizers, see their Javadocs.
Stemmers are used to get the base form of a word. It heavily depends on the language used. For example, for the previous phrase in English something like ["i", "be", "veri", "happi"] will be produced, and for the French "Je suis très heureux" some kind of French analyzer (like SnowballAnalyzer initialized with "French") will produce ["je", "être", "tre", "heur"]. Of course, if you use an analyzer of one language to stem text in another, rules from the other language will be used and the stemmer may produce incorrect results. The whole system won't fail, but search results may then be less accurate.
KeywordAnalyzer doesn't use any stemmers, it passes all the field unmodified. So, if you are going to search some words in English text, it isn't a good idea to use this analyzer.
Stop words are the most frequent and almost useless words. Again, it heavily depends on the language. For English these words are "a", "the", "I", "be", "have", etc. Stop-word filters remove them from the token stream to lower noise in search results, so finally our phrase "I'm very happy", with stemming and stop-word filtering applied, will be transformed to the list ["veri", "happi"].
And KeywordAnalyzer again does nothing. So, KeywordAnalyzer is used for things like IDs or phone numbers, not for usual text.
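A minimal sketch of how to see what an analyzer actually emits, using the Lucene 3.1+ attribute API; the outputs in the comments are what the default configurations produce:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {
    static void printTokens(Analyzer analyzer, String text) throws Exception {
        TokenStream ts = analyzer.tokenStream("field", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.print("[" + term.toString() + "] ");
        }
        System.out.println();
        ts.close();
    }

    public static void main(String[] args) throws Exception {
        String text = "I am very happy";
        printTokens(new StandardAnalyzer(Version.LUCENE_30), text);
        // [i] [am] [very] [happy]   (split on whitespace, lowercased)
        printTokens(new KeywordAnalyzer(), text);
        // [I am very happy]         (the whole field as one token)
    }
}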
And as for your maxClauseCount exception, I believe you get it when searching. In this case it is most probably caused by a too-complex search query. Try splitting it into several queries or using lower-level functions.
From my experience, I have used StandardAnalyzer and SmartChineseAnalyzer, as I have to search text in Chinese. Obviously, SmartChineseAnalyzer is better at handling Chinese. For different purposes, you have to choose the most appropriate analyzer.

Indexing n-word expressions as a single term in Lucene

I want to index a "compound word" like "New York" as a single term in Lucene not like "new", "york". In such a way that if someone searches for "new place", documents containing "new york" won't match.
I think this is not a case for N-grams (i.e. NGramTokenizer), because I won't index just any n-gram; I want to index only some specific n-grams.
I've done some research and I know I should write my own Analyzer and maybe my own Tokenizer. But I'm a bit lost extending TokenStream/TokenFilter/Tokenizer.
Thanks
I presume you have some way of detecting the multi-word units (MWUs) that you want to preserve. Then what you can do is replace the whitespace in them with an underscore and use a WhitespaceAnalyzer instead of a StandardAnalyzer (which throws away punctuation), perhaps with a LowerCaseFilter.
Writing your own Tokenizer requires quite some Lucene black magic. I've never been able to wrap my head around the Lucene 2.9+ APIs, but check out the TokenStream docs if you really want to try.
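A minimal sketch of the underscore trick, assuming you already have the list of MWUs to preserve (the hardcoded list and naive string replacement are just for illustration):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class MwuExample {
    // Join each known MWU with an underscore before indexing (and apply
    // the same rewrite to queries). Case handling is ignored here.
    static String joinMwus(String text, String[] mwus) {
        for (String mwu : mwus) {
            text = text.replace(mwu, mwu.replace(' ', '_'));
        }
        return text;
    }

    // WhitespaceTokenizer keeps "New_York" intact; the LowerCaseFilter
    // replaces the lowercasing that StandardAnalyzer would have done.
    static class MwuAnalyzer extends Analyzer {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new LowerCaseFilter(new WhitespaceTokenizer(reader));
        }
    }

    public static void main(String[] args) {
        System.out.println(joinMwus("I moved to New York last year",
                new String[] { "New York" }));
        // I moved to New_York last year
    }
}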
I did it by creating a field which is indexed but not analyzed.
For this I used Field.Index.NOT_ANALYZED, so the value is not passed through the StandardAnalyzer:
doc.add(new Field("fieldName", "value", Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.YES));
I worked on Lucene 3.0.2.