Question related to phrase search in Lucene/Solr - lucene

Is it possible to perform a phrase search with wildcards in Solr/Lucene? I have two queries that both return exactly the same results.
One is:
+Contents:"change market"
and the other is:
+Contents:"change* market"
I assumed the second should match "changes market", but it does not return any matches.

You can do this in Lucene with ComplexPhraseQueryParser. Solr lets you plug in a custom query parser through QParserPlugin. You can possibly combine the two to get the desired functionality in Solr as well.
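To make the expected behaviour concrete, here is a minimal sketch in plain Python of what a wildcard inside a phrase is supposed to match. It is only an illustration of the semantics, not how ComplexPhraseQueryParser is implemented; the function name `phrase_matches` and the whitespace tokenisation are assumptions made for this example.

```python
from fnmatch import fnmatchcase


def phrase_matches(phrase, text):
    """Return True if some run of consecutive tokens in `text` matches
    the phrase, where each phrase word may contain * or ? wildcards."""
    patterns = phrase.lower().split()
    tokens = text.lower().split()
    # Slide a window of len(patterns) tokens over the text.
    for start in range(len(tokens) - len(patterns) + 1):
        window = tokens[start:start + len(patterns)]
        if all(fnmatchcase(tok, pat) for tok, pat in zip(window, patterns)):
            return True
    return False
```

With this toy matcher, "change* market" matches a text containing "changes market", while the plain phrase "change market" does not.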

IMO it is not possible to search for wildcards within a phrase.
You might want to consider using two query terms with proximity search instead (q=change* market&qs=1):
http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_search_for_one_term_near_another_term_.28say.2C_.22batman.22_and_.22movie.22.29

Related

How to improve Search query efficiency in lucene?

I'm building a search for my application. For the entered search term (foo),
1) I look for exact match (foo), if it returns NULL
2) I use fuzzy search (foo~), if it returns NULL
3) I use wildcard (foo*).
Is this an efficient way? Or is there any lucene method to do all these?
There is no built-in way of doing this in Lucene; this case is usually handled outside Lucene, on the client side. In my experience it is an efficient approach, since it usually gives high-precision results. Some sources on the internet call it staged search.
E.g., you create a query for the exact match, say TermQuery("field", "foo"); if that query returns nothing, you then use FuzzyQuery, and finally PrefixQuery (which I would recommend over WildcardQuery for that last case).
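The staged approach can be sketched client-side roughly as follows. This is plain Python over a made-up vocabulary, not Lucene code: `VOCAB`, `staged_search`, and the 0.8 similarity cutoff are assumptions for the example, and `difflib` merely stands in for a real fuzzy query.

```python
from difflib import get_close_matches

# Hypothetical set of indexed terms, for illustration only.
VOCAB = ["food", "footer", "bar", "bazaar"]


def staged_search(term, vocabulary):
    """Try exact, then fuzzy, then prefix matching; stop at the first
    stage that returns anything. Returns (stage_name, matching_terms)."""
    # Stage 1: exact match (the TermQuery step).
    if term in vocabulary:
        return "exact", [term]
    # Stage 2: fuzzy match (stand-in for the FuzzyQuery step).
    fuzzy = get_close_matches(term, vocabulary, n=5, cutoff=0.8)
    if fuzzy:
        return "fuzzy", fuzzy
    # Stage 3: prefix match (the PrefixQuery step).
    prefix = sorted(t for t in vocabulary if t.startswith(term))
    if prefix:
        return "prefix", prefix
    return "none", []
```

Each stage only runs when the previous one came back empty, which is why the more precise matches win.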

Cloudant Search: Why are my wildcards not working?

I have a Cloudant database with a search index. In the search index I index the titles of my documents. For instance, search for 'rijkspersoneel':
http://wetten.cloudant.com/regelingen/_design/RegelingInfo/_search/regeling?q=title:rijkspersoneel
Returns 48 rows.
However, when I replace the 'o' with a ? wildcard:
http://wetten.cloudant.com/regelingen/_design/RegelingInfo/_search/regeling?q=title:rijkspers?neel
I get 0 results. Why is that? The Cloudant docs say that this should match 'rijkspersoneel' as well!
My previous answer was definitely mistaken. Internal wildcards do appear to be supported. Try:
title:rijkspe*on*
title:rijksper?on*
I am fairly sure this is an analysis issue. I'm not really all that familiar with Cloudant and their implementation of this, but in Lucene, wildcard queries are not subject to the same analysis as term queries. I'm guessing that your analysis of this field includes a stemmer, in which case "rijkspersoneel" is actually indexed as "rijkspersone".
So, when you search for
rijkspersonee*
or
rijksper?oneel
you find no matches, since the "el" is missing from the end of the indexed form. When you search for just rijkspersoneel, though, the query does get analyzed, so you search for the stemmed form of the word and do find matches.
Stemming and wildcards just don't get along.
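A small sketch of the suspected mismatch, in plain Python. The stemmer here is a toy that just strips a trailing "el" (an assumption chosen to reproduce the guess above, not Cloudant's or Lucene's real analyzer), and `term_search` / `wildcard_search` are hypothetical helper names.

```python
from fnmatch import fnmatchcase


def toy_stem(token):
    # Toy stand-in for a real stemmer: strips a trailing "el".
    return token[:-2] if token.endswith("el") else token


# The index stores the analyzed (stemmed) form of the title term.
indexed_terms = {toy_stem("rijkspersoneel")}  # holds "rijkspersone"


def term_search(query):
    # Term queries are analyzed, so the query is stemmed before lookup.
    return toy_stem(query) in indexed_terms


def wildcard_search(pattern):
    # Wildcard queries skip analysis and are matched against indexed terms as-is.
    return any(fnmatchcase(term, pattern) for term in indexed_terms)
```

Under these assumptions, the plain term and the shorter wildcard patterns find the document, while any pattern that requires the trailing "eel" cannot.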

not query in lucene

I need to run NOT queries on my Lucene index. Lucene currently allows NOT only when there are two or more terms in the query.
So I can do something like:
country:canada not sweden
but I can't run a query like:
country:not sweden
Could you please let me know if there is an efficient solution to this problem?
Thanks
A very late reply, but it might be useful for somebody else later:
*:* AND NOT country:sweden
If I'm not mistaken, this does a logical "AND" between all documents and the documents whose country is different from "sweden".
Try with the following query in the search box:
NOT message:"warning"
message being the search field
Please check the answer to a similar question. The solution is to use MatchAllDocsQuery.
The short answer is that this is not possible using standard Lucene.
Lucene does not allow NOT queries as a single term for the same reason it does not, by default, allow leading-wildcard queries: to perform either, the engine would have to look through each document to ascertain whether the document is or is not a hit. It has to look through each document because it cannot use the search term as the key to look up documents in the inverted index (used to store the indexed documents).
To take your case as an example:
To search for not sweden, the simplest (and possibly most efficient) approach would be to search for sweden and then "invert" the result set to return all documents that are not in that result set. Doing this would require finding all the required (ie. not in the result set) documents in the index, but without a key to look them up by. This would be done by iterating over the documents in the index - a task it is not optimised for, and hence speed would suffer.
If you really need this functionality, you could maintain your own list of items when indexing, so that a not sweden search becomes a sweden search using Lucene, followed by an inversion of the results using your set of items.
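The inversion idea can be sketched with plain sets. This is a toy inverted index in Python, not Lucene code; the sample documents and the names `index`, `all_doc_ids`, and `not_query` are made up for the illustration.

```python
# Hypothetical documents: doc id -> value of the "country" field.
docs = {1: "canada", 2: "sweden", 3: "norway", 4: "sweden"}

# Toy inverted index: term -> set of doc ids containing it.
index = {}
for doc_id, country in docs.items():
    index.setdefault(country, set()).add(doc_id)

# The separately maintained list of every indexed item.
all_doc_ids = set(docs)


def not_query(term):
    """Run the positive lookup, then invert it against the full id set."""
    return all_doc_ids - index.get(term, set())
```

This is also the effect of the `*:* AND NOT country:sweden` form: start from every document, then subtract the ones that match.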
OK, I see what you are trying to do.
You can use it as a query refinement, since there are no unary Boolean operators in Lucene. Despite the answers above, I believe this is a better and more straightforward approach (note the space before the wildcard):
&query= *&qf=-country:Canada

Lucene Fuzzy Match on Phrase instead of Single Word

I'm trying to do a fuzzy match on the Phrase "Grand Prarie" (deliberately misspelled) using Apache Lucene. Part of my issue is that the ~ operator only does fuzzy matches on single word terms and behaves as a proximity match for phrases.
Is there a way to do a fuzzy match on a phrase with lucene?
Lucene 3.0 has ComplexPhraseQueryParser that supports fuzzy phrase query. This is in the contrib package.
I came across this through Google and felt the solutions were not what I was after.
In my case, the solution was simply to repeat the fuzzy search for each term in the query sent to the Solr API.
So, for example, if I wanted title_t to match both "dog~" and "cat~", I added some manual code to generate the query as:
((title_t:dog~) and (title_t:cat~))
This may be just what the answers above are about; however, the links seem to be dead.
There's no direct support for a fuzzy phrase, but you can simulate it by explicitly enumerating the fuzzy terms and then adding them to a MultiPhraseQuery. The resulting query would look like:
<MultiPhraseQuery: "grand (prarie prairie)">
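The enumerate-then-match idea can be sketched as follows. This is plain Python, with `difflib` standing in for Lucene's fuzzy term enumeration; `VOCAB`, `expand_phrase`, `multi_phrase_matches`, and the 0.8 cutoff are assumptions for the example rather than anything from the Lucene API.

```python
from difflib import get_close_matches

# Hypothetical indexed vocabulary, for illustration only.
VOCAB = ["grand", "prairie", "market", "change"]


def expand_phrase(phrase, vocabulary, cutoff=0.8):
    """For each word of the phrase, list the indexed terms close to it."""
    return [
        get_close_matches(word, vocabulary, n=5, cutoff=cutoff)
        for word in phrase.lower().split()
    ]


def multi_phrase_matches(alternatives, text):
    """True if consecutive tokens of `text` each fall within the
    alternatives for the corresponding phrase position."""
    tokens = text.lower().split()
    n = len(alternatives)
    return any(
        all(tokens[start + i] in alternatives[i] for i in range(n))
        for start in range(len(tokens) - n + 1)
    )
```

With the sample vocabulary, expanding "Grand Prarie" gives one list of alternatives per position, which then matches text containing "Grand Prairie".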

All of these words feature

I have a "description" field indexed in Lucene. This field contains a book's description.
How do I achieve "All of these words" functionality on this field using the BooleanQuery class?
For example if a user types in "top selling book" then it should return books which have all of these words in its description.
Thanks!
There are two pieces to get this to work:
1) You need the incoming documents to be analysed properly, so that individual words are tokenised and indexed separately.
2) The user query needs to be tokenised, and the tokens combined with the AND operator.
For #1, there are a number of Analyzers and Tokenizers that come with Lucene - have a look in the org.apache.lucene.analysis package. There are options for many different languages, stemming, stopwords and so on.
For #2, there are again a lot of query parsers that come with Lucene, mainly in the org.apache.lucene.queryParser package. MultiFieldQueryParser might be good for you: to require every term to be present, just call
QueryParser.setDefaultOperator(QueryParser.AND_OPERATOR)
Lucene in Action, although a few versions old, is still accurate and extremely useful for more information on analysis and query parsing.
I believe if you add all query parts (one per term) via
BooleanQuery.add(Query, BooleanClause.Occur)
and set that second parameter to the constant BooleanClause.Occur.MUST, then you should get what you want. The equivalent query syntax would be "+term1 +term2 +term3 ...".
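As a rough illustration of the "every term is required" idea, the sketch below builds the equivalent query string and checks a description against it. It is plain Python with whitespace tokenisation, not the Lucene API; `all_words_query` and `matches_all` are hypothetical helper names.

```python
def all_words_query(user_input):
    """Build the '+term1 +term2 ...' form, marking every term as required."""
    return " ".join("+" + word for word in user_input.lower().split())


def matches_all(description, user_input):
    """True only if every word of the user input occurs in the description."""
    words = set(description.lower().split())
    return all(word in words for word in user_input.lower().split())
```

For the input "top selling book" this yields "+top +selling +book", and only descriptions containing all three words pass the check.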