Lucene.NET 2.9 - MultiFieldQueryParser, boosted fields, stemming and prefixes

I have a system where the search queries multiple fields with different boost values. It is running on Lucene.NET 2.9.4 because it's an Umbraco (6.x) site, and that's what version of Lucene.NET the CMS uses.
My client asked me if I could add stemming, so I wrote a custom analyzer that does Standard / Lowercase / Stop / PorterStemmer. The stemming filter seems to work fine.
But now, when I try to use my new analyzer with the MultiFieldQueryParser, it's not finding anything.
The MultiFieldQueryParser is returning a query containing stemmed words - e.g. if I search for "the figure", what I get as part of the query it returns is:
keywords:figur^4.0 Title:figur^3.0 Collection:figur^2.0
i.e. it's searching the correct fields and applying the correct boosts, but trying to do an exact search on stemmed terms on indexes that contained unstemmed words.
I think what's actually needed is for the MultiFieldQueryParser to return a list of clauses of type PrefixQuery, so it'll output a query like
keywords:figur*^4.0 Title:figur*^3.0 Collection:figur*^2.0
If I try to just add a wildcard to the end of the term, and feed that into the parser, the stemmer doesn't kick in. i.e. it builds a query to look for "figure*".
Is there any way to combine MultiFieldQueryParser boosting and prefix queries?

You need to reindex using your custom analyzer. Applying a stemmer only at query time is useless. You might kludge together something using wildcards, but it would remain an ugly, unreliable kludge.
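For the sake of illustration, here is a minimal sketch (Java syntax; Lucene.NET 2.9.4 mirrors the Java 2.9 API almost method for method) of handing the same analyzer instance to both the IndexWriter and the MultiFieldQueryParser. MyStemmingAnalyzer, the sample field values and the directory are placeholders, not your actual code.

Analyzer analyzer = new MyStemmingAnalyzer(); // your Standard/Lowercase/Stop/Porter chain

// Index time: the documents are stemmed with the same analyzer...
IndexWriter writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field("keywords", "the figure", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("Title", "Figure drawing", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);
writer.close();

// ...query time: the same analyzer stems the user's input, so "figure" becomes
// "figur" on both sides and the boosted term queries match again.
String[] fields = { "keywords", "Title", "Collection" };
Map<String, Float> boosts = new HashMap<String, Float>();
boosts.put("keywords", 4.0f);
boosts.put("Title", 3.0f);
boosts.put("Collection", 2.0f);
MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_29, fields, analyzer, boosts);
Query query = parser.parse("the figure");

Once the index itself holds stemmed terms, the plain boosted term queries the parser already builds (keywords:figur^4.0 Title:figur^3.0 ...) will match, and no prefix trickery is needed.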

Related

Lucene term query

In my application we have deals, and each deal has a target user group which may include several fields like gender, age and city. For the gender part, a deal's target could be MALE, FEMALE or BOTH. I wanted to find deals which are either for males or both. I created the following query but it doesn't work...
TermQuery maleQuery = new TermQuery(new Term("gender", "MALE"));
TermQuery bothQuery = new TermQuery(new Term("gender", "BOTH"));
BooleanQuery query = new BooleanQuery();
query.add(maleQuery, BooleanClause.Occur.SHOULD);
query.add(bothQuery, BooleanClause.Occur.SHOULD);
Please suggest if I am making some mistake. Somehow it seems to spit out only MALE deals, and not BOTH.
I am using version 4.2.1 and Standard Analyzer as the analyzer.
Several solutions possible are:
Use a QueryParser to construct the query instead of TermQuery, using the same analyzer that was used at indexing time. For example, in my case it would be:
Query query = new QueryParser(version, "gender", new StandardAnalyzer()).parse("MALE BOTH");
Use a different analyzer for indexing that does a case insensitive indexing.
Lowercase your search terms before searching (this applies to StandardAnalyzer; for other analyzers the solution may differ).
Explanation
A brief explanation of the situation:
I used a StandardAnalyzer for indexing, which lowercases input tokens so that searches can be case-insensitive.
Then I used a QueryParser configured with the same analyzer to construct a query instance from the user's input at the front end. The search worked because the parser, in accordance with the StandardAnalyzer, lowercased the user's terms so they matched the lowercased index.
Then I needed to filter search results, for which I wrote TermQuerys directly instead of using the parser. In those I used capitalized text, which was not how the terms were indexed, so the search failed.
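To make the third option concrete, a minimal sketch of the filter queries with the terms lowercased to match what StandardAnalyzer actually wrote to the index (field and variable names as in the question):

TermQuery maleQuery = new TermQuery(new Term("gender", "male")); // lowercased to match the index
TermQuery bothQuery = new TermQuery(new Term("gender", "both"));
BooleanQuery query = new BooleanQuery();
query.add(maleQuery, BooleanClause.Occur.SHOULD); // male-only deals...
query.add(bothQuery, BooleanClause.Occur.SHOULD); // ...or deals targeting both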
Your query seems perfectly valid, so I'd look for something else that might be wrong (e.g. are you not using a LowerCaseFilter on the MALE/BOTH/FEMALE terms by any chance when indexing?).
You might want to read this article on how to combine various queries into a single BooleanQuery.

Lucene Tag Searching problems with C#, escape problems?

I am using Lucene 2.9.2 (.NET doesn't have a Lucene 3).
"tag:C#" gets me the same results as "tag:c". How do I allow 'C#' to be a search word? I tried changing Field.Index.ANALYZED to Field.Index.NOT_ANALYZED but that gave me no results.
I'm assuming I need to escape each tag; how might I do that?
The problem isn't the query, it's the analyzer you are using, which is removing the "#" from both the query and (if you are using the same analyzer for indexing, which you should be) the field.
You will need to find an analyzer that preserves special characters like that or write a custom one.
Edit: Check out KeywordAnalyzer - it might just do the trick:
"Tokenizes" the entire stream as a single token. This is useful for data like zip codes, ids, and some product names.
According to the Java documentation for Lucene 2.9.2, '#' is not a special character that needs escaping in the query. Can you check (e.g. by opening the index with Luke) how the value 'C#' is actually stored in the index?
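Building on the KeywordAnalyzer suggestion above, a hedged sketch (Java syntax; Lucene.NET 2.9 exposes the same classes): keep your current analyzer for normal text fields and map only the tag field to KeywordAnalyzer, on both the indexing and the query side. The field name "tag" is taken from your query; the rest is illustrative.

// StandardAnalyzer everywhere except "tag", which is kept as one verbatim token,
// so the '#' in "C#" survives both indexing and query parsing.
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_29));
analyzer.addAnalyzer("tag", new KeywordAnalyzer());

// Pass the same wrapper to the IndexWriter and to the QueryParser.
QueryParser parser = new QueryParser(Version.LUCENE_29, "tag", analyzer);
Query query = parser.parse("tag:C#");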

Lucene: search within search using FuzzyQuery

I need to make a FuzzyQuery using an index that contains around 8 million lines. That kind of query is pretty slow, needing about 20 seconds for every match. The fact is that I can narrow down the results using another field to about 5000 hits before doing the fuzzy search. For this to work, I should be able to make a search by the "narrower" field first, and then use the fuzzy search within those results.
According to the lucene FAQ, the only thing I have to do is use a BooleanQuery where the "narrower" clause is required (BooleanClause.Occur.MUST in lucene 3).
Now I have tried two different approaches:
a) Using the Query Parser, with an input like:
narrower:+narrowing_text fuzzy:fuzzy_text~0.9
b) Constructing a BooleanQuery with a TermQuery and a FuzzyQuery
Neither worked; I'm getting about the same times as when the narrower is not used.
Also, just to check that the times would indeed be much better if the narrower were working, I reindexed only the 5000 items that match the narrower, and the search went fast as hell.
In case anyone wonders, I'm using pylucene 3.0.2.
Doppleganger, you can probably use a Filter, specifically a QueryWrapperFilter.
Follow the example from Lucene in Action. You may have to make some modifications for use in python, but otherwise it should be simple:
Create the query that narrows this down to 5000 hits.
Use it to build a QueryWrapperFilter.
Use the filter in a search involving the fuzzy query.
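In Java syntax (pylucene 3.0 exposes the same classes), a minimal sketch of those three steps; searcher and the field/term values are placeholders for your own:

// 1. The cheap query that narrows the candidate set down to ~5000 documents.
Query narrower = new TermQuery(new Term("narrower", "narrowing_text"));

// 2. Wrap it in a filter: only documents accepted by the filter are scored.
Filter filter = new QueryWrapperFilter(narrower);

// 3. Run the expensive fuzzy query against that restricted set only.
Query fuzzy = new FuzzyQuery(new Term("fuzzy", "fuzzy_text"), 0.9f);
TopDocs hits = searcher.search(fuzzy, filter, 100);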

Setting wildcard queries as default for QueryParser

When my users enter a term like "word" I would like it be treated as a wildcard query "word*" so all terms beginning "word" are found. Is there a way to tell the QueryParser to automatically create wildcard queries or do I have to parse the query myself? This shouldn't be a problem for simple queries but it may become tricky for more complex queries.
Unless I am missing something, a wildcard query for every query is usually inadvisable: it is very expensive and could cause a lot of problems. If you are trying to find results including variants of a stem (e.g. win -> winner, winning, etc.), you should consider an n-gram approach, sketched below.
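A hedged sketch of the n-gram idea in Java: at index time, emit every leading edge n-gram of each token ("winner" becomes "w", "wi", "win", ...), so a plain term query for "win" matches without any wildcard expansion at search time. EdgeNGramTokenFilter sits in the contrib/analyzers module in older Lucene releases and its constructor has changed between versions, so treat the exact signatures as illustrative.

// Index-time analyzer that emits leading prefixes (length 1..20) of each token,
// so the plain term "word" matches documents containing "words", "wording", etc.
class EdgeNGramAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(Version.LUCENE_30, reader);
        stream = new LowerCaseFilter(stream);
        return new EdgeNGramTokenFilter(stream, EdgeNGramTokenFilter.Side.FRONT, 1, 20);
    }
}

At query time you keep a plain analyzer (no n-gram filter), so the user's term is looked up exactly as typed and the index does the prefix work for you.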

Prevent "Too Many Clauses" on lucene query

In my tests I suddenly bumped into a Too Many Clauses exception when trying to get the hits from a boolean query that consisted of a termquery and a wildcard query.
I searched around the net and the resources I found suggest increasing BooleanQuery.SetMaxClauseCount().
This sounds fishy to me. What should I raise it to? How can I be sure this new magic number will be sufficient for my query? How far can I increase this number before all hell breaks loose?
In general I feel this is not a solution. There must be a deeper problem..
The query was +{+companyName:mercedes +paintCode:a*} and the index has ~2.5M documents.
The paintCode:a* part of the query is a prefix query for any paintCode beginning with an "a". Is that what you're aiming for?
Lucene expands prefix queries into a boolean query containing all the possible terms that match the prefix. In your case, apparently there are more than 1024 possible paintCodes that begin with an "a".
If it sounds to you like prefix queries are useless, you're not far from the truth.
I would suggest you change your indexing scheme to avoid using a Prefix Query. I'm not sure what you're trying to accomplish with your example, but if you want to search for paint codes by first letter, make a paintCodeFirstLetter field and search by that field.
ADDED
If you're desperate, and are willing to accept partial results, you can build your own Lucene version from source. You need to make changes to the files PrefixQuery.java and MultiTermQuery.java, both under org/apache/lucene/search. In the rewrite method of both classes, change the line
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
to
try {
    query.add(tq, BooleanClause.Occur.SHOULD); // add to query
} catch (TooManyClauses e) {
    break;
}
I did this for my own project and it works.
If you really don't like the idea of changing Lucene, you could write your own PrefixQuery variant and your own QueryParser, but I don't think it's much better.
It seems like you are using this on a field that is sort of a Keyword type (meaning there will not be multiple tokens in your data source field).
There is a suggestion here that seems pretty elegant to me: http://grokbase.com/t/lucene.apache.org/java-user/2007/11/substring-indexing-to-avoid-toomanyclauses-exception/12f7s7kzp2emktbn66tdmfpcxfya
The basic idea is to break down your term into multiple fields with increasing length until you are pretty sure you will not hit the clause limit.
Example:
Imagine a paintCode like this:
"a4c2d3"
When indexing this value, you create the following field values in your document:
[paintCode]: "a4c2d3"
[paintCode1n]: "a"
[paintCode2n]: "a4"
[paintCode3n]: "a4c"
At query time, the number of characters in your term decides which field to search on. This means that you will perform a prefix query only for terms with more than 3 characters, which greatly decreases the internal result count, preventing the infamous TooManyClauses exception. Apparently this also speeds up the searching process.
You can easily automate a process that breaks the terms down and fills the documents with values according to this naming scheme during indexing, as in the sketch below.
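A small sketch of that automation in Java; the Field flags and the three-prefix cap are assumptions chosen to match the example above:

// Index time: store the full code plus its 1..3 character prefixes in dedicated fields.
void addPaintCode(Document doc, String paintCode) {
    doc.add(new Field("paintCode", paintCode, Field.Store.YES, Field.Index.NOT_ANALYZED));
    for (int len = 1; len <= 3 && len <= paintCode.length(); len++) {
        doc.add(new Field("paintCode" + len + "n", paintCode.substring(0, len),
                Field.Store.NO, Field.Index.NOT_ANALYZED));
    }
}

// Query time: short prefixes become cheap exact term lookups; anything longer
// falls back to a real PrefixQuery on the full field.
Query paintCodeQuery(String prefix) {
    if (prefix.length() <= 3) {
        return new TermQuery(new Term("paintCode" + prefix.length() + "n", prefix));
    }
    return new PrefixQuery(new Term("paintCode", prefix));
}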
Some issues may arise if you have multiple tokens in each field. You can find more details in the article linked above.