Stemming + wildcarding: unexpected effects - lucene

I am editing a Lucene.Net implementation (2.3.2) at work to include stemming and automatic wildcarding (adding of * at the ends of words).
I have found that exact words with a wildcard appended don't work (so stack* gets a hit for stackoverflow, but stackoverflow* does not), and I was wondering what causes this and how it might be fixed.
Thanks in advance. (Also thanks for not asking why I am implementing both automatic wildcarding and stemming.)
I am about to make the query always a prefix query so I don't have to do any nasty appending of "*"s to query strings, so we will see if anything becomes clearer then.
Edit: Only words that are stemmed do not work when wildcarded. For example, silicate* doesn't work, but silic* does.

The reason it doesn't work is that you stem the content, thus changing the Term.
For example, consider the word "valve". The Snowball analyzer will stem it down to "valv".
So at search time, since you stem the input query, both "valve" and "valves" will be stemmed down to "valv". A TermQuery using the stemmed Term "valv" will yield a match on occurrences of both "valve" and "valves".
But since the index stores the Term "valv", a query for "valve*" will not match anything. That is because the QueryParser does not run the Analyzer on wildcard queries.
There is the AnalyzingQueryParser, which can handle some of these cases, but I don't think it was available in the 2.3.x versions of Lucene. Anyway, it's not a universal fix; the documentation says:
Warning: This class should only be used with analyzers that do not use stopwords or that add tokens. Also, several stemming analyzers are inappropriate: for example, GermanAnalyzer will turn Häuser into hau, but H?user will become h?user when using this parser and thus no match would be found (i.e. using this parser will be no improvement over QueryParser in such cases).
The solution mentioned in the duplicate I linked works in all cases, but you will get bigger indexes.
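
For what it's worth, here is a sketch (not the exact solution from the linked duplicate, just one common way to get the same effect) of an index-time analyzer that keeps the original term next to the stemmed one, so prefix queries like silicate* keep working. It assumes Lucene 5 or later (KeywordRepeatFilter did not exist in 2.3.x), uses the Porter stemmer instead of Snowball for brevity, and some package locations differ between versions:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter;
    import org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    // Emits both the original and the stemmed form of each token at the same
    // position, so "silicate" and "silic" both end up in the index and a
    // prefix query such as silicate* can still match.
    public class StemAndKeepOriginalAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            TokenStream stream = new LowerCaseFilter(source);
            stream = new KeywordRepeatFilter(stream);          // duplicate every token
            stream = new PorterStemFilter(stream);             // stem one of the copies
            stream = new RemoveDuplicatesTokenFilter(stream);  // drop copies the stemmer left unchanged
            return new TokenStreamComponents(source, stream);
        }
    }

The cost is exactly the bigger index mentioned above, since most terms are stored twice.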

Related

Stopwords and stemming in Lucene demo

I have two major questions about the Lucene Demo. Does the Lucene demo use stopwords before any modification?
What about stemming? If so, what stemmer does it use?
Which demo are you referring to?
If it's this one, then the answers are:
(a) Stop words: no, it does not. It uses StandardAnalyzer, which does not use stop words when created with no arguments (but it can, if you choose to provide some).
(b) Stemming: no, it does not use stemming - there are no stemming classes involved in the demo code, because the StandardAnalyzer does not do any stemming.
Take a look at the javadoc for the StandardAnalyzer. You will see the following:
Filters StandardTokenizer with LowerCaseFilter and StopFilter, using a configurable list of stop words.
So, this tells you how your input documents are analyzed:
Using the StandardTokenizer, the rules for which you can read about here.
Using the LowerCaseFilter - which works like you would expect.
Using the StopFilter - for which you may or may not have provided any stop words.
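
If you want to see exactly what that chain produces, a small sketch like the following prints the tokens with and without stop words. It is written against a recent Lucene rather than necessarily the demo's version, the stop word list here is made up, and CharArraySet lives in a different package in older releases:

    import java.util.Arrays;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class AnalyzerDemo {
        public static void main(String[] args) throws Exception {
            // No arguments: tokenization and lower-casing, no stop words (as noted above).
            Analyzer plain = new StandardAnalyzer();

            // The same analyzer with an explicit (made-up) stop word list.
            CharArraySet stopWords = new CharArraySet(Arrays.asList("the", "of", "a"), true);
            Analyzer withStops = new StandardAnalyzer(stopWords);

            printTokens(plain, "The Houses of Parliament");      // the houses of parliament
            printTokens(withStops, "The Houses of Parliament");  // houses parliament
        }

        static void printTokens(Analyzer analyzer, String text) throws Exception {
            try (TokenStream ts = analyzer.tokenStream("contents", text)) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.print(term.toString() + " ");
                }
                ts.end();
                System.out.println();
            }
        }
    }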

Lucene query-time boosting culture code

I'm using the Lucene.Net implementation packaged with the Kentico CMS. The site that we're indexing has articles in various languages. If a user is viewing the Japanese version of the site (for example) and runs a search for 'VPN', we'd like them to see Japanese articles about VPN first, but also see other language articles in the results.
I'm trying to achieve this with query-time boosting of the _culture field. Since we're using the standard analyzer (really don't want to change that), and the standard analyzer treats hyphens as whitespace, I thought I'd try appending '(_culture:jp)^4' to the user's query. As you can see from the Luke tool's Explain output, that isn't doing anything to boost the documents with 'jp' in the field. What gives?
I've also tried:
_culture:"en-jp"
_culture:en AND _culture:jp
_culture:"en jp"
Update: It's something with the field. There's another field in the index named 'documentculture' that contains the same data (don't know why). But when I try '(documentculture:jp)^4', it works as I expect. That solves my problem, but I still have an academic question of how the fields are different.
Even though the standard analyzer ignores hyphens, I don't believe it will treat the two parts of your culture code as separate terms. Therefore, under normal circumstances, a wildcard would help you here. For example, the query vpn (_culture:en*)^4 would boost all documents with a culture starting with en.
However, in your case you want to match the end of the term. Unfortunately, Lucene syntax doesn't support wildcards at the start of terms (according to this reference). Therefore I think you're going to have to consider changing the analyzer you're using. I generally find the whitespace analyzer fits my needs best. I've just tried your scenario using the whitespace analyzer and found that vpn (_culture:en-jp)^4 will give you what you need.
I understand if you don't accept this answer though since you stated you didn't want to change the analyzer!
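
For illustration, here is the same idea in Java (the Lucene.Net API mirrors it closely; the "content" field name is just a placeholder, and this assumes a reasonably recent Lucene):

    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;

    public class CultureBoostDemo {
        public static void main(String[] args) throws Exception {
            // With the whitespace analyzer, "en-jp" survives as a single term,
            // so the boosted clause matches the _culture field as stored.
            QueryParser parser = new QueryParser("content", new WhitespaceAnalyzer());
            Query query = parser.parse("vpn (_culture:en-jp)^4");
            System.out.println(query);  // something like: content:vpn (_culture:en-jp)^4.0
        }
    }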

Lucene searching stack traces: splitting on dots

I'm writing an app that embeds Lucene to search for, amongst other things, parts of stack traces, including class names etc. For example, if a document contained:
java.lang.NullPointerException
The documents can also contain ordinary English text.
I'd like to be able to query for either NullPointerException or java.lang.NullPointerException and find the document. Using the StandardAnalyzer, I only get a match if I search for the full java.lang.NullPointerException.
What's the best way to go about supporting this? Can I get multiple tokens emitted? e.g. java, lang, NullPointerException and java.lang.NullPointerException? Or would I be better replacing all the . characters with spaces up front? Or something else?
The dot character is considered an "ambiguous terminator" for the purposes of the algorithm used by StandardAnalyzer. Lucene attempts to be intelligent about this and make the best possible guess for the situation.
You have a couple of options here:
If you don't want Lucene to apply a bunch of complicated lexical tokenization rules, you can try a simpler analyzer, such as SimpleAnalyzer, which will just create tokens of uninterrupted strings of letters.
Implement a filter that applies your own specialized rules, and incorporate it into an Analyzer similar to the StandardAnalyzer. This would allow you to test whatever identification techniques you like to recognize that a token is an exception, and split them up during the analysis phase.
As you said, you can replace the periods with spaces before they ever hit the analyzer at all.
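
If you go the pre-processing route, a rough sketch of the idea (the helper name is made up) is to keep each dotted token and also append its parts, so that both java.lang.NullPointerException and NullPointerException become searchable after analysis:

    // Expands dotted identifiers so that both the full form and its parts are
    // indexed, e.g. "java.lang.NullPointerException" also yields "java",
    // "lang" and "NullPointerException". Run this on the raw text before
    // handing it to the analyzer.
    static String expandDottedTokens(String text) {
        StringBuilder out = new StringBuilder();
        for (String token : text.split("\\s+")) {
            out.append(token).append(' ');
            if (token.indexOf('.') >= 0) {
                for (String part : token.split("\\.")) {
                    if (!part.isEmpty()) {
                        out.append(part).append(' ');
                    }
                }
            }
        }
        return out.toString().trim();
    }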

lucene / solr remove common phrases (stop phrases)

I would like to eliminate from the search query the words/phrases that add no meaning to the query (we could call them stop phrases). Example:
"How to .."
"Where can I find .."
"What is the meaning of .."
etc.
Where to find / how to compute a list of 'common phrases' for English and for French?
How to implement it in Solr (Is there anything more advanced than the stopwords feature?)
I think that you shouldn't try to completely get rid of these phrases, because they reveal the intent of the searcher. You can try to leverage their presence by using a natural language question answering system like Ephyra. There is even a project aimed at integrating it with Lucene. I haven't used it myself, but maybe at least evaluating it is worth a try.
If you are determined to remove them, then I think you need to write a custom QueryParser that filters the query and delegates the further processing to a parser of your choice.
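
As a rough sketch of that second option (the phrase list and class name are just examples), you can strip known stop phrases from the raw query string and then hand whatever remains to your usual parser:

    import java.util.Arrays;
    import java.util.List;
    import java.util.Locale;

    public class StopPhraseStripper {
        // Example stop phrases; a real list would be built per language.
        private static final List<String> STOP_PHRASES = Arrays.asList(
                "how to", "where can i find", "what is the meaning of");

        // Removes a leading stop phrase from the raw query before it
        // reaches the query parser.
        public static String strip(String rawQuery) {
            String normalized = rawQuery.trim().toLowerCase(Locale.ROOT);
            for (String phrase : STOP_PHRASES) {
                if (normalized.startsWith(phrase)) {
                    return rawQuery.trim().substring(phrase.length()).trim();
                }
            }
            return rawQuery.trim();
        }

        public static void main(String[] args) {
            System.out.println(strip("How to configure a VPN"));  // -> configure a VPN
        }
    }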

Lucene gotchas with punctuation

Whilst building some unit tests for my Lucene queries I noticed some strange behavior related to punctuation, in particular around parentheses.
What are some of the best ways to deal with search fields that contain significant amounts of punctuation?
If you haven't customized the query parser, Lucene should behave according to the default query parser syntax. Are you getting something different from that? Do you want punctuation to have a special meaning, or just to remove the punctuation from searches?
The other usual suspect here is the Analyzer, which determines how your field is indexed and how the query is broken into pieces for searching. Can you post specific examples of bad behavior?
It is not just parentheses; other punctuation such as colons and hyphens will cause issues as well. Here is a way to deal with them.
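
If the goal is simply to stop punctuation in user input from being interpreted as query syntax, the classic query parser ships an escape helper; a minimal sketch (the field name and analyzer choice are arbitrary, and this assumes a recent Lucene):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;

    public class EscapeDemo {
        public static void main(String[] args) throws Exception {
            String userInput = "foo(bar): baz-qux";
            // QueryParser.escape backslash-escapes the characters that have
            // special meaning in the query syntax, such as ( ) : - and *.
            String safe = QueryParser.escape(userInput);
            QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
            Query query = parser.parse(safe);
            System.out.println(query);
        }
    }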