Lucene search and underscores - lucene

When I use Luke to search my Lucene index using a standard analyzer, I can see the field I am searching for contains values of the form MY_VALUE.
When I search for field:"MY_VALUE", however, the query is parsed as field:"my value".
Is there a simple way to escape the underscore (_) character so that it will search for it?
EDIT:
4/1/2010 11:08AM PST
I think there is a bug in the tokenizer for Lucene 2.9.1 and it was probably there before.
Load up Luke and try to search for "BB_HHH_FFFF5_SSSS". When a number is present, the following tokens are returned:
"bb hhh_ffff5_ssss"
After some testing, I've found that this is because of the number. If I input
"BB_HHH_FFFF_SSSS", I get
"bb hhh ffff ssss"
At this point, I'm leaning towards a tokenizer bug, unless the presence of the number is supposed to cause this behavior, but I fail to see why it would.
Can anyone confirm this?

It doesn't look like you used the StandardAnalyzer to index that field. In Luke you'll need to select the analyzer that you used to index that field in order to match MY_VALUE correctly.
Incidentally, you might be able to match MY_VALUE by using the KeywordAnalyzer.
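For instance, with the KeywordAnalyzer at both index and query time, MY_VALUE stays a single unaltered term. This is an untested sketch against the Java Lucene 2.9 API; the in-memory index and the field name are just for illustration:

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class KeywordFieldExample {
    public static void main(String[] args) throws Exception {
        // Index the value un-tokenized so it is stored as the single term "MY_VALUE".
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new KeywordAnalyzer(),
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("field", "MY_VALUE", Field.Store.YES,
                Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Query with the same analyzer so the term is not split or lower-cased.
        QueryParser parser = new QueryParser(Version.LUCENE_29, "field",
                new KeywordAnalyzer());
        Query q = parser.parse("field:\"MY_VALUE\"");
        System.out.println(q); // prints field:MY_VALUE
    }
}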

I don't think you'll be able to use the standard analyser for this use case.
Judging by what I think your requirements are, the keyword analyser should work fine for little effort (the whole field becomes a single term).
I think some of the confusion arises when looking at the field with Luke. The stored value is not what's used by queries; what you need are the terms. I suspect that when you look at the terms stored for your field, they'll be "my" and "value".
Hope this helps,

Related

Lucene, certain keywords in queries (e.g. "TO" in range queries) are case sensitive

In Lucene, searches look case-insensitive to the user by default due to the standard analyzer. That is what users expect, and that works fine.
However, for a few words like "TO" in range queries, or "AND"/"OR", those keywords are case sensitive. That's not what users expect.
Is there a reason for this? Lucene basically "just works" by default, so I am a little surprised by that. Maybe there's a good reason behind it and I shouldn't touch it.
How would I go about making those keywords case insensitive? As the rest of the query is case insensitive by default, I could just convert the entire query to uppercase? Are there any problems I'm going to encounter if I do that? Is there a better way?
Is there a reason for this?
The real question here might not be "why does Lucene do this?", but rather "why does Google do this?", as I believe Google's use of this pattern predates Lucene's. Regardless, though, the reasoning isn't too hard to deduce. There needs to be a way of differentiating the word "and" from the query operator "AND".
Say my query is: Jack and Jill went up the hill
I'm just searching a phrase that happens to contain the word "and". The end result I want is (eliminating stop words, and such):
field:jack field:jill field:went field:up field:hill
Rather than:
+field:jack +field:jill field:went field:up field:hill
If the word is uppercased, it's a decent indicator the user intended the word as an operator.
If all ands became operators, users might be confused why a search for "bread and butter pickles" (which becomes +bread +butter pickles) turns up hits about toast, but not about other types of pickles.
Similarly for lists of things, like "Abby, Ben, Chris, Dave and Elmer" (which becomes abby ben chris +dave +elmer), where all hits would require Dave and Elmer to be present, but the rest of the names would be optional.
How to make them case insensitive?
Uppercasing the whole thing, or every instance of an AND, OR or TO, could be a bit problematic. Take these, for example:
[to TO tz] works, [TO TO TZ] throws an exception
and another thing works, AND ANOTHER THING throws an exception
You could check for a ParseException after uppercasing, and try parsing the original query in that case. Might create a bit of an inconsistency, but it beats just failing entirely.
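As a rough, untested sketch of that fallback (classic Lucene 2.9 QueryParser in Java; the default field name and analyzer are just placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class UppercaseFallback {
    // Try the upper-cased query first so "and"/"or"/"to" act as operators;
    // if that fails to parse (e.g. "TO TO TZ"), fall back to the raw input.
    static Query parseLenient(String input) throws ParseException {
        QueryParser parser = new QueryParser(Version.LUCENE_29, "field",
                new StandardAnalyzer(Version.LUCENE_29));
        try {
            return parser.parse(input.toUpperCase());
        } catch (ParseException e) {
            return parser.parse(input);
        }
    }
}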

Lucene.NET 2.9 - MultiFieldQueryParser, boosted fields, stemming and prefixes

I have a system where the search queries multiple fields with different boost values. It is running on Lucene.NET 2.9.4 because it's an Umbraco (6.x) site, and that's what version of Lucene.NET the CMS uses.
My client asked me if I could add stemming, so I wrote a custom analyzer that does Standard / Lowercase / Stop / PorterStemmer. The stemming filter seems to work fine.
But now, when I try to use my new analyzer with the MultiFieldQueryParser, it's not finding anything.
The MultiFieldQueryParser is returning a query containing stemmed words - e.g. if I search for "the figure", what I get as part of the query it returns is:
keywords:figur^4.0 Title:figur^3.0 Collection:figur^2.0
i.e. it's searching the correct fields and applying the correct boosts, but trying to do an exact search on stemmed terms on indexes that contained unstemmed words.
I think what's actually needed is for the MultiFieldQueryParser to return a list of clauses which are of type PrefixQuery, so it'll output a query like:
keywords:figur*^4.0 Title:figur*^3.0 Collection:figur*^2.0
If I try to just add a wildcard to the end of the term, and feed that into the parser, the stemmer doesn't kick in. i.e. it builds a query to look for "figure*".
Is there any way to combine MultiFieldQueryParser boosting and prefix queries?
You need to reindex using your custom analyzer. Applying a stemmer only at query time is useless, because the index still contains the unstemmed terms. You might kludge together something using wildcards, but it would remain an ugly, unreliable kludge.
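For reference, here is a rough, untested sketch of the analyzer chain described in the question, written against the Java Lucene 2.9 API (the Lucene.NET 2.9 class names mirror these closely). The key point is to pass this same analyzer to the IndexWriter when reindexing and to the MultiFieldQueryParser at query time:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Standard -> Lowercase -> Stop -> PorterStemmer, as described in the question.
public class StemmingAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(Version.LUCENE_29, reader);
        stream = new LowerCaseFilter(stream);
        stream = new StopFilter(true, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        stream = new PorterStemFilter(stream);
        return stream;
    }
}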

Searching "AND" in lucene index

I have Lucene indexes built using the StandardAnalyzer. The index contains a field with the value "AND".
When I try to search for the field value AND using MultiFieldQueryParser, the search results in an error.
E.g.: field1:* AND field2:AND
field1:* AND field2:"AND"
I have tried escaping, but that only escapes the field value. I have also tried double quotes ("AND"), but could not succeed in getting the correct value.
Any advice in this regard would be helpful.
Thanks in advance.
I suspect that there are probably two issues in play here:
Query syntax: I think you'll get further by putting the "and" in lower case. Boolean operators in the standard query parser must be in upper case, so a lower-case "and" won't be parsed as an operator. And given that one of the steps of the standard analyser is lower-casing, this shouldn't hurt matching.
The next problem is stop words: I suspect that "and" is excluded from the set of analysed terms by the standard analyser's stop word list. You could get around this by using a different stop word list with the standard analyser, one that doesn't exclude "and" as a term.
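Something like this builds a StandardAnalyzer whose stop list no longer drops "and" (an untested Java sketch for Lucene 2.9, assuming the default stop set can be copied into a plain HashSet; Lucene.NET is analogous). Use it both for indexing and for the query parser:

import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

// Copy the default English stop set, drop "and", and build the analyzer with it.
Set<Object> stopWords = new HashSet<Object>(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
stopWords.remove("and");
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_29, stopWords);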
Good luck,

Lucene Tag Searching problems with C#, escape problems?

I am using Lucene 2.9.2 (.NET doesn't have a Lucene 3).
"tag:C#" gets me the same results as "tag:c". How do I allow 'C#' to be a search word? I tried changing Field.Index.ANALYZED to Field.Index.NOT_ANALYZED, but that gave me no results.
I'm assuming I need to escape each tag; how might I do that?
The problem isn't the query, it's the analyzer you are using, which is removing the "#" from the query and (if you are using the same analyzer for insertion, which you should be) from the field as well.
You will need to find an analyzer that preserves special characters like that or write a custom one.
Edit: Check out KeywordAnalyzer - it might just do the trick:
"Tokenizes" the entire stream as a single token. This is useful for data like zip codes, ids, and some product names.
According to the Java documentation for Lucene 2.9.2, '#' is not a special character that needs escaping in the query. Can you check (e.g. by opening the index with Luke) how the value 'C#' is actually stored in the index?

Prevent "Too Many Clauses" on lucene query

In my tests I suddenly bumped into a Too Many Clauses exception when trying to get the hits from a boolean query that consisted of a termquery and a wildcard query.
I searched around the net, and the resources I found suggest increasing BooleanQuery.SetMaxClauseCount().
This sounds fishy to me. To what should I raise it? How can I be sure that this new magic number will be sufficient for my query? How far can I increase this number before all hell breaks loose?
In general I feel this is not a solution. There must be a deeper problem.
The query was +{+companyName:mercedes +paintCode:a*} and the index has ~2.5M documents.
The paintCode:a* part of the query is a prefix query for any paintCode beginning with an "a". Is that what you're aiming for?
Lucene expands prefix queries into a boolean query containing all the possible terms that match the prefix. In your case, apparently there are more than 1024 possible paintCodes that begin with an "a".
If it sounds to you like prefix queries are useless, you're not far from the truth.
I would suggest you change your indexing scheme to avoid using a Prefix Query. I'm not sure what you're trying to accomplish with your example, but if you want to search for paint codes by first letter, make a paintCodeFirstLetter field and search by that field.
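As a rough illustration of that idea (untested, Java Lucene 2.9 API; the field names just follow the example in this thread):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Indexing: store the full code plus a one-letter "bucket" field.
Document doc = new Document();
doc.add(new Field("paintCode", "a4c2d3", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("paintCodeFirstLetter", "a", Field.Store.NO, Field.Index.NOT_ANALYZED));

// Querying: a TermQuery on the bucket field matches a single term instead of
// expanding into every paintCode that starts with "a".
BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("companyName", "mercedes")), BooleanClause.Occur.MUST);
query.add(new TermQuery(new Term("paintCodeFirstLetter", "a")), BooleanClause.Occur.MUST);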
ADDED
If you're desperate, and are willing to accept partial results, you can build your own Lucene version from source. You need to make changes to the files PrefixQuery.java and MultiTermQuery.java, both under org/apache/lucene/search. In the rewrite method of both classes, change the line
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
to
try {
    query.add(tq, BooleanClause.Occur.SHOULD); // add to query
} catch (BooleanQuery.TooManyClauses e) {
    break;
}
I did this for my own project and it works.
If you really don't like the idea of changing Lucene, you could write your own PrefixQuery variant and your own QueryParser, but I don't think it's much better.
It seems like you are using this on a field that is sort of a Keyword type (meaning there will not be multiple tokens in your data source field).
There is a suggestion here that seems pretty elegant to me: http://grokbase.com/t/lucene.apache.org/java-user/2007/11/substring-indexing-to-avoid-toomanyclauses-exception/12f7s7kzp2emktbn66tdmfpcxfya
The basic idea is to break down your term into multiple fields with increasing length until you are pretty sure you will not hit the clause limit.
Example:
Imagine a paintCode like this:
"a4c2d3"
When indexing this value, you create the following field values in your document:
[paintCode]: "a4c2d3"
[paintCode1n]: "a"
[paintCode2n]: "a4"
[paintCode3n]: "a4c"
When you query, the number of characters in your term decides which field to search on. This means that you will perform a prefix query only for terms with more than 3 characters, which greatly decreases the internal term count and prevents the infamous TooManyClauses exception. Apparently this also speeds up the searching process.
You can easily automate a process that breaks the terms down and fills the documents with values according to a naming scheme during indexing.
Some issues may arise if you have multiple tokens for each field; you can find more details in the article linked above.
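A rough, untested Java sketch of that scheme against the Lucene 2.9 API (the field names follow the example above; the 3-character cut-off is just the value used in this example):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class PrefixBuckets {
    static final int MAX_PREFIX = 3;

    // Indexing: add one exact-prefix field per length up to MAX_PREFIX.
    static void addPaintCode(Document doc, String code) {
        doc.add(new Field("paintCode", code, Field.Store.YES, Field.Index.NOT_ANALYZED));
        for (int len = 1; len <= MAX_PREFIX && len <= code.length(); len++) {
            doc.add(new Field("paintCode" + len + "n", code.substring(0, len),
                    Field.Store.NO, Field.Index.NOT_ANALYZED));
        }
    }

    // Querying: short prefixes hit the matching bucket with a cheap TermQuery;
    // only longer prefixes fall back to a real PrefixQuery on the full field.
    static Query prefixQuery(String prefix) {
        if (prefix.length() <= MAX_PREFIX) {
            return new TermQuery(new Term("paintCode" + prefix.length() + "n", prefix));
        }
        return new PrefixQuery(new Term("paintCode", prefix));
    }
}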