Lucene Fuzzy Match on Phrase instead of Single Word - lucene

I'm trying to do a fuzzy match on the Phrase "Grand Prarie" (deliberately misspelled) using Apache Lucene. Part of my issue is that the ~ operator only does fuzzy matches on single word terms and behaves as a proximity match for phrases.
Is there a way to do a fuzzy match on a phrase with lucene?

Lucene 3.0 has ComplexPhraseQueryParser that supports fuzzy phrase query. This is in the contrib package.

Came across this through Google and felt solutions where not what I was after.
In my case, solution was to simply repeat the search sequence against the solr API.
So for example if I was looking for: title_t to include match for "dog~" and "cat~", I added some manual code to generate query as:
((title_t:dog~) and (title_t:cat~))
It might just be what above queries are about, however links seems dead.

There's no direct support for a fuzzy phrase, but you can simulate it by explicitly enumerating the fuzzy terms and then adding them to a MultiPhraseQuery. The resulting query would look like:
<MultiPhraseQuery: "grand (prarie prairie)">

Related

Lucene.NET 2.9 - MultiFieldQueryParser, boosted fields, stemming and prefixes

I have a system where the search queries multiple fields with different boost values. It is running on Lucene.NET 2.9.4 because it's an Umbraco (6.x) site, and that's what version of Lucene.NET the CMS uses.
My client asked me if I could add stemming, so I wrote a custom analyzer that does Standard / Lowercase / Stop / PorterStemmer. The stemming filter seems to work fine.
But now, when I try to use my new analyzer with the MultiFieldQueryParser, it's not finding anything.
The MultiFieldQueryParser is returning a query containing stemmed words - e.g. if I search for "the figure", what I get as part of the query it returns is:
keywords:figur^4.0 Title:figur^3.0 Collection:figur^2.0
i.e. it's searching the correct fields and applying the correct boosts, but trying to do an exact search on stemmed terms on indexes that contained unstemmed words.
I think what's actually needed is for the MultiFieldQueryParser to return a list of clauses which are of type PrefixQuery. so it'll output a query like
keywords:figur*^4.0 Title:figur*^3.0 Collection:figur*^2.0
If I try to just add a wildcard to the end of the term, and feed that into the parser, the stemmer doesn't kick in. i.e. it builds a query to look for "figure*".
Is there any way to combine MultiFieldQueryParser boosting and prefix queries?
You need to reindex using your custom analyzer. Applying a stemmer only at query time is useless. You might kludge together something using wildcards, but it would remain an ugly, unreliable kludge.

Cloudant Search: Why are my wildcards not working?

I have a Cloudant database with a search index. In the search index I index the titles of my documents. For instance, search for 'rijkspersoneel':
http://wetten.cloudant.com/regelingen/_design/RegelingInfo/_search/regeling?q=title:rijkspersoneel
Returns 48 rows.
However, when I replace the 'o' with a ? wildcard:
http://wetten.cloudant.com/regelingen/_design/RegelingInfo/_search/regeling?q=title:rijkspers?neel
I get 0 results. Why is that? The Cloudant docs say that this should match 'rijkspersoneel' as well!
My previous answer was definitely mistaken. Internal wildcads do appear to be supported. Try:
title:rijkspe*on*
title rijksper?on*
Fairly sure what is happening here is an analysis issue. Fairly sure you are using a stemming analyzer. I'm not really all the familiar with cloudant and their implementation of this, but in Lucene, wildcard queries are not subject to the same analysis as term queries. I'm guessing that your analysis of this field includes a stemmer, in which case "rijkspersoneel" is actually indexed as "rijkspersone".
So, when you search for
rijkspersonee*
or
rijkper?oneel
Since the "el" is missing from the end in the indexed form, you find no matches. When just searching for rijkpersoneel it does get analyzed though, and you search for the stemmed form of the word, and will find matches.
Stemming and wildcards just don't get along.

How lucene can be used to search words prefixed with adverbs/negations?

I am a newbie to Lucene. I wanted to know how can i use Lucene to search a word which may be prefixed with an adverb. The document contains only words and no adverbs prefixed to them.
For example: If term to be searched is 'very beautiful' and my document contains
only beautiful, then i want a Hit. The word can also be prefixed with
negations like 'not very beautiful' or my not have a prefix at all
like 'beautiful'. I just can't drop off the prefixes because I need to
keep a track of Negations which change flow of further processing.
I tried Fuzzy search but results are not that satisfactory. Is there any way to find accomplish this?
I could not find relevant answers for this.
If I was doing this, I would Google on "part of speech tagging" and "natural language processing". Once you have tagged your parts of speech, you could then apply Lucene indexing.
One way to implement this would be to index the words with their tags, like:
n:he v:is a:a aj:big n:man
for
he is a big man
where he and man are nouns, is is a verb, a is an article, and big is an adjective.

not query in lucene

i need to do not queries on my lucene index. Lucene currently allows not only when we have two or more terms in the query:
So I can do something like:
country:canada not sweden
but I can't run a query like:
country:not sweden
Could you please let me know if there is some efficient solution for this problem
Thanks
A very late reply, but it might be useful for somebody else later:
*:* AND NOT country:sweden
IF I'm not mistaken this should do a logical "AND" with all documents and the documents with a country that is different from "sweden".
Try with the following query in the search box:
NOT message:"warning"
message being the search field
Please check answer for similar question. The solution is to use MatchAllDocsQuery.
The short answer is that this is not possible using the standard Lucene.
Lucene does not allow NOT queries as a single term for the same reason it does not allow prefix queries - to perform either, the engine would have to look through each document to ascertain whether the document is/is not a hit. It has to look through each document because it cannot use the search term as the key to look up documents in the inverted index (used to store the indexed documents).
To take your case as an example:
To search for not sweden, the simplest (and possibly most efficient) approach would be to search for sweden and then "invert" the result set to return all documents that are not in that result set. Doing this would require finding all the required (ie. not in the result set) documents in the index, but without a key to look them up by. This would be done by iterating over the documents in the index - a task it is not optimised for, and hence speed would suffer.
If you really need this functionality, you could maintain your own list of items when indexing, so that a not sweden search becomes a sweden search using Lucene, followed by an inversion of the results using your set of items.
OK, I see what you are trying to do.
You can use it as a query refinement since there are no unary Boolean operators in Lucene. Despite the answers above, I believe this is a better and most forward approach (note the space before the wildcard):
&query= *&qf=-country:Canada

Wildcard for terms in phrase - Lucene

Google's query syntax allows to search phrases like "as * as a skyscraper" where the asterisk can match any term (or terms). Is there a way to achieve the same thing in Lucene? The proximity operator ~ could be of use but it is not what I exactly want.
Try a SpanNearQuery. Mark Miller's SpanQuery Blog Post explains how to use it, and the examples are similar to what you describe.