Lucene term query - lucene

In my application we have deals and each deal has a target user group which may include several fields like gender, age and city. For the gender part a deal's target could be MALE FEMALE or BOTH. I wanted to find deals which are either for males or both.I created the following query but it doesn't work...
TermQuery maleQuery = new TermQuery(new Term("gender","MALE"));
TermQuery bothQuery = new TermQuery(new Term("gender","BOTH"));
BooleanQuery query = new BooleanQuery();
query.add(maleQuery,BooleanClause.Occur.SHOULD);
query.add(bothQuery,BooleanClause.Occur.SHOULD);
Please suggest if I am making some mistake. Somehow it seems to spit out only MALE deals,and not BOTH.
I am using version 4.2.1 and Standard Analyzer as the analyzer.

Several solutions possible are:
Use a QueryParser to construct the query instead of using TermQuery using the same analyser used at indexing time.For eg in my case it would be:
Query query = new QueryParser(version,"gender",new StandardAnalyzer()).parse("MALE BOTH");
Use a different analyzer for indexing that does a case insensitive indexing.
(this one applies for StandardAnalyzer, for other analyzers solutions may be different) LOWERCASE your search terms before searching.
Explaination
A brief explaination to the situation would be:
I used a StandardAnalyzer for indexing which lower cases input tokens so that a case insensitive search could be materialised.
Then I used a QueryParser configured with the same analyzer to construct a query instance for searching the user's query at front end. The search worked because the parser working in accordance with standard analyzer lower cased the user's terms and made a case sensitive search.
Then I needed to filter search results for which I wrote TermQuerys instead of using parsers, in which I used capitalized text which was not indexed that way, so the search failed.

Your query seems perfectly valid, I'd look for something else that might be wrong (e.g. are you not using LowerCaseFilter for MALE/BOTH/FEMALE terms by a chance when indexing?).
You might want to read this article on how to combine various queries into a single BooleanQuery.

Related

Lucene.NET 2.9 - MultiFieldQueryParser, boosted fields, stemming and prefixes

I have a system where the search queries multiple fields with different boost values. It is running on Lucene.NET 2.9.4 because it's an Umbraco (6.x) site, and that's what version of Lucene.NET the CMS uses.
My client asked me if I could add stemming, so I wrote a custom analyzer that does Standard / Lowercase / Stop / PorterStemmer. The stemming filter seems to work fine.
But now, when I try to use my new analyzer with the MultiFieldQueryParser, it's not finding anything.
The MultiFieldQueryParser is returning a query containing stemmed words - e.g. if I search for "the figure", what I get as part of the query it returns is:
keywords:figur^4.0 Title:figur^3.0 Collection:figur^2.0
i.e. it's searching the correct fields and applying the correct boosts, but trying to do an exact search on stemmed terms on indexes that contained unstemmed words.
I think what's actually needed is for the MultiFieldQueryParser to return a list of clauses which are of type PrefixQuery. so it'll output a query like
keywords:figur*^4.0 Title:figur*^3.0 Collection:figur*^2.0
If I try to just add a wildcard to the end of the term, and feed that into the parser, the stemmer doesn't kick in. i.e. it builds a query to look for "figure*".
Is there any way to combine MultiFieldQueryParser boosting and prefix queries?
You need to reindex using your custom analyzer. Applying a stemmer only at query time is useless. You might kludge together something using wildcards, but it would remain an ugly, unreliable kludge.

Lucene query fails with mixed MUST/MUST_NOT

Given a document with this text, indexed in a field named Content:
The dish ran away with the spoon.
The following query fails to match that document:
+Content:dish +(-Content:xyz) <-- no results!
I want the query to be treated as must include "dish", must not include "xyz". It's the "must not" part that is failing.
I know the +- combination looks funny but syntactically it should be correct, especially considering that the following variations all work:
+Content:dish +(-Content:xyz +Content:spoon) <-- this works
+Content:dish -Content:xyz <-- this works
So why doesn't +(-Content:xyz) work? Is that by design, or a bug, or am I just missing something? I'm using Lucene.Net but I assume regular Lucene behaves the same.
Lucene doesn't start with a full view of everything, like a SQL database. Lucene starts with no documents matched, and finds things based on the clauses searched on. This is why:
-Content:xyz
On it's own doesn't really work. It knows not to bring in content:xyz, but hasn't been given any documents to match. The same is true of your query, because it's placed in a subquery.
-Content:xyz is evaluated first, which gets no docs on it's own. So then you have, effectively
+Content:dish +(no documents)
It's useful to think of - as an AND NOT rather than simply a NOT (though don't take that to imply the +/- and AND/OR/NOT syntax necessarily map to each other directly).
If you want to be able to execute a lonely negative query like that, you need to bring in all documents first. The MatchAllDocsQuery is the best way to accomplish that, something like:
BooleanQuery query = new BooleanQuery();
query.add(new BooleanClause(new MatchAllDocsQuery(), BooleanClause.Occur.SHOULD));
query.add(new BooleanClause(new TermQuery(new Term("Content","xyz")), BooleanClause.Occur.MUST_NOT));
Would be the equivalent of a SQL style query with only a negation for a WHERE clause.
Of course, this isn't really necessary in the case you've listed since:
+Content:dish -Content:xyz
Is perfectly adequate.

Lucene mutli-language analyzer/index approach

I have a working Lucene index supporting a suggestion service. When a user types into a search box it queries the index by the SUGGESTION_FIELD. Each entry in SUGGESTION_FIELD can be one of many supported languages and each is stored using an appropriate language specific analyzer. In order to know what analyzer was used there is second field per entry which stores the LOCALE. So during a query I can say something like the code below to do a language specific query using appropriate analyzer
QueryParser parser = new QueryParser(Version.LUCENE_33, SUGGESTION_FIELD, getLangaugeAnalyzer(locale));
return searcher.search(parser.parse("SUGGESTION_FIELD:" + queryString + " AND LOCALE:"
+ locale), 100);
The works.... But now the client wants to be able to search using multiple languages at once.
My Question: What would be the fastest querying solution bearing in mind that a suggestion service needs to be very fast?...
Sol. #1. The simplest solution would seem to be; do the query multiple times. Once for each locale, thereby applying the corresponding language analyser each time. Finally append the results from each query in some sensible fashion
Sol. #2. Alternatively I could re-index using a column for each locale such that:
SUGGESTION_FIELD_en, SUGGESTION_FIELD_fr, SUGGESTION_FIELD_es etc..
using a different analyzer for each field (using PerFieldAnalyzerWrapper) and then query using a more complex query string such that:
"SUGGESTION_FIELD_en:" + queryString + " AND SUGGESTION_FIELD_fr:" + queryString + " AND SUGGESTION_FIELD_es:" + queryString
Please help if you think you :)
Your query is going to be something like this: (sugField:queryString1 AND locale:loc1) OR (sugField:queryString2 AND locale:loc2) OR .... This is a top-level BooleanQuery with subordinate BooleanQueries added with occurs=SHOULD, where each subordinate query has its terms with occurs=MUST. The queryString1, queryString2, etc. are the outputs from different language analyzers having the same input, the string the user entered.
Each subordinate query involves mandatory terms (from your query string) that are rare in the index and Lucene knows this at the outset (it knows the total doc count for each Term in the index) so it will first constrain the result by the queryString and then additionally intersect that with the locale terms. This will be VERY efficient no matter how large your index.
As for the different analyzers, I suggest you don't use the QueryParser, but create the entire query programmatically. This is a good general advice whenever you don't enter the query by hand and in your case it is the only way to gain control of the analyzing aspect. Run your query string through each of the language-specific analyzers and add their output tokens as TermQueries to the subordinate BooleanQueries.

Lucene: search within search using FuzzyQuery

I need to make a FuzzyQuery using an index that contains around 8 million lines. That kind of query is pretty slow, needing about 20 seconds for every match. The fact is that I can narrow down the results using another field to about 5000 hits before doing the fuzzy search. For this to work, I should be able to make a search by the "narrower" field first, and then use the fuzzy search within those results.
According to the lucene FAQ, the only thing I have to do is a BooleanQuery, where the "narrower" should be required (BooleanClause.Occur.MUST in lucene 3).
Now I have tried two different approaches:
a) Using the Query Parser, with an input like:
narrower:+narrowing_text fuzzy:fuzzy_text~0.9
b) Constructing a BooleanQuery with a TermQuery and a FuzzyQuery
Neither did work, I'm getting about the same times than the ones when the narrower is not used.
Also, just to check that if the narrower was working the times should be much better, I reindexed only the 5000 items that match the narrower, and the search went fast as hell.
In case anyone wonders, I'm using pylucene 3.0.2.
Doppleganger, you can probably use a Filter, specifically a QueryWrapperFilter.
Follow the example from Lucene in Action. You may have to make some modifications for use in python, but otherwise it should be simple:
Create the query that narrows this down to 5000 hits.
Use it to build a QueryWrapperFilter.
Use the filter in a search involving the fuzzy query.

All of these words feature

I have a "description" field indexed in Lucene.This field contains a book's description.
How do i achieve "All of these words" functionality on this field using BooleanQuery class?
For example if a user types in "top selling book" then it should return books which have all of these words in its description.
Thanks!
There are two pieces to get this to work:
You need the incoming documents to be analysed properly, so that individual words are tokenised and indexed separately
The user query needs to be tokenised, and the tokens combined with the AND operator.
For #1, there are a number of Analyzers and Tokenizers that come with Lucene - have a look in the org.apache.lucene.analysis package. There are options for many different languages, stemming, stopwords and so on.
For #2, there are again a lot of query parsers that come with Lucene, mainly in the org.apache.lucene.queryParser packagage. MultiFieldQueryParser might be good for you: to require every term to be present, just call
QueryParser.setDefaultOperator(QueryParser.AND_OPERATOR)
Lucene in Action, although a few versions old, is still accurate and extremely useful for more information on analysis and query parsing.
I believe if you add all query parts (one per term) via
BooleanQuery.add(Query, BooleanClause.Occur)
and set that second parameter to the constant BooleanClause.Occur.MUST, then you should get what you want. The equivalent query syntax would be "+term1+term2 +term3 ...".