Lucene searching stack traces: splitting on dots


I'm writing an app that embeds Lucene to search for, amongst other things, parts of stack traces, including class names etc. For example, if a document contained:
java.lang.NullPointerException
The documents can also contain ordinary English text.
I'd like to be able to query for either NullPointerException or java.lang.NullPointerException and find the document. Using the StandardAnalyzer, I only get a match if I search for the full java.lang.NullPointerException.
What's the best way to go about supporting this? Can I get multiple tokens emitted? e.g. java, lang, NullPointerException and java.lang.NullPointerException? Or would I be better replacing all the . characters with spaces up front? Or something else?

The dot character is considered an "ambiguous terminator" for the purposes of the algorithm used by StandardAnalyzer. Lucene attempts to be intelligent about this and make the best possible guess for the situation.
You have a couple of options here:
If you don't want Lucene to apply a bunch of complicated lexical tokenization rules, you can try a simpler analyzer, such as SimpleAnalyzer, which will just create tokens of uninterrupted strings of letters.
Implement a filter that applies your own specialized rules, and incorporate it into an Analyzer similar to the StandardAnalyzer. This would allow you to test whatever identification techniques you like to recognize that a token is an exception, and split them up during the analysis phase.
As you said, you can replace the periods with spaces before they ever hit the analyzer at all.
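To make the "multiple tokens" idea from the question concrete, here is a minimal plain-Java sketch of the splitting rule a custom filter could apply. It is not a Lucene TokenFilter and uses no Lucene classes; the class name DottedNameTokens and the method expand are made up for illustration. For a dotted name it returns each part plus the full name, so a search for either form could match.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; this is not a Lucene TokenFilter.
// DottedNameTokens and expand are hypothetical names for this sketch.
final class DottedNameTokens {

    // For a dotted name, returns each dot-separated part followed by the
    // full name. A token without dots is returned unchanged.
    static List<String> expand(String token) {
        List<String> out = new ArrayList<>();
        if (!token.contains(".")) {
            out.add(token);
            return out;
        }
        for (String part : token.split("\\.")) {
            if (!part.isEmpty()) {
                out.add(part);
            }
        }
        out.add(token);
        return out;
    }

    public static void main(String[] args) {
        // prints [java, lang, NullPointerException, java.lang.NullPointerException]
        System.out.println(expand("java.lang.NullPointerException"));
    }
}
```

In a real analyzer the same rule would live inside a filter so that index-time and query-time analysis stay consistent; replacing dots with spaces up front is the simpler variant that drops the full-name token.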

Related

Lucene query-time boosting culture code

I'm using the Lucene.Net implementation packaged with the Kentico CMS. The site that we're indexing has articles in various languages. If a user is viewing the Japanese version of the site (for example) and runs a search for 'VPN', we'd like them to see Japanese articles about VPN first, but also see other language articles in the results.
I'm trying to achieve this with query-time boosting of the _culture field. Since we're using the standard analyzer (really don't want to change that), and the standard analyzer treats hyphens as whitespace, I thought I'd try appending '(_culture:jp)^4' to the user's query. As you can see from the Luke tool's Explain output, that isn't doing anything to boost the documents with 'jp' in the field. What gives?
I've also tried:
_culture:"en-jp"
_culture:en AND _culture:jp
_culture:"en jp"
Update: It's something with the field. There's another field in the index named 'documentculture' that contains the same data (don't know why). But when I try '(documentculture:jp)^4', it works as I expect. That solves my problem, but I still have an academic question of how the fields are different.

Even though the standard analyzer ignores hyphens I don't believe it will treat the two parts of your culture code as separate terms. Therefore under normal circumstances a wildcard would help you here. For example, the query vpn (_culture:en*)^4 would boost all documents with a culture starting with en.
However, in your case you want to match the end of the term. Unfortunately, Lucene's query syntax doesn't support wildcards at the start of terms by default (according to this reference). Therefore I think you're going to have to consider changing the analyzer you're using. I generally find the Whitespace analyzer fits my needs best. I've just tried your scenario using the Whitespace analyzer and have found that vpn (_culture:en-jp)^4 will give you what you need.
I understand if you don't accept this answer though since you stated you didn't want to change the analyzer!
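For what it's worth, appending the boost clause is just string handling. A minimal sketch, assuming the culture field is indexed so that en-jp is a single term (as with the Whitespace analyzer above); CultureBoost and withCultureBoost are hypothetical names, and the result still has to be passed to whatever query parser you use.

```java
// Plain string handling only; no Lucene classes are involved.
// CultureBoost and withCultureBoost are hypothetical names for this sketch.
final class CultureBoost {

    // Appends a boosted clause on the given field to the user's query text.
    static String withCultureBoost(String userQuery, String field, String culture, int boost) {
        return userQuery + " (" + field + ":" + culture + ")^" + boost;
    }

    public static void main(String[] args) {
        // prints vpn (_culture:en-jp)^4
        System.out.println(withCultureBoost("vpn", "_culture", "en-jp", 4));
    }
}
```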

Stemming + wildcarding: unexpected effects

I am editing a Lucene.Net implementation (2.3.2) at work to include stemming and automatic wildcarding (adding * at the ends of words).
I have found that exact words with wildcarding don't work (so stack* works for stackoverflow, but stackoverflow* does not get a hit), and was wondering what causes this and how it might be fixed.
Thanks in advance. (Also thanks for not asking why I am implementing both automatic wildcarding and stemming.)
I am about to make every query a prefix query so I don't have to do any nasty adding of "*"s to queries, so we will see if anything becomes clear then.
Edit: Only words that are stemmed fail when wildcarded. Example: Silicate* doesn't work, but silic* does.

The reason it doesn't work is that you stem the content, thus changing the Term.
For example, consider the word "valve". The snowball analyzer will stem it down to "valv".
So at search time, since you stem the input query, both "valve" and "valves" will be stemmed down to "valv". A TermQuery using the stemmed Term "valv" will yield a match on both "valve" and "valves" occurrences.
But now, since in the Index you stored the Term "valv", a query for "valve*" will not match anything. That is because the QueryParser does not run the Analyzer on Wildcard Queries.
There is the AnalyzingQueryParser that can handle some of these cases, but I don't think it was in 2.3.x versions of Lucene. Anyway, it's not a universal fit; the documentation says:
Warning: This class should only be used with analyzers that do not use stopwords or that add tokens. Also, several stemming analyzers are inappropriate: for example, GermanAnalyzer will turn Häuser into hau, but H?user will become h?user when using this parser and thus no match would be found (i.e. using this parser will be no improvement over QueryParser in such cases).
The solution mentioned in the duplicate I linked works for all cases, but you will get bigger indexes.
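The mismatch can be reproduced without Lucene at all. Below is a small plain-Java sketch under loud assumptions: toyStem is a toy rule that strips a trailing "es" or "e" (it is NOT the Snowball algorithm), and the "index" is just a set of stemmed terms. It shows a stemmed term lookup matching while an unstemmed prefix does not.

```java
import java.util.Set;
import java.util.TreeSet;

// Illustration only; StemVsPrefix and its methods are hypothetical names.
final class StemVsPrefix {

    // Toy stemmer (NOT Snowball): lowercases, then strips a trailing "es" or "e".
    static String toyStem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("es")) return w.substring(0, w.length() - 2);
        if (w.endsWith("e")) return w.substring(0, w.length() - 1);
        return w;
    }

    // Index-time analysis: every word is stored in its stemmed form.
    static Set<String> index(String... words) {
        Set<String> terms = new TreeSet<>();
        for (String w : words) terms.add(toyStem(w));
        return terms;
    }

    // Term lookup: the query word is stemmed the same way as the content.
    static boolean termMatches(Set<String> terms, String word) {
        return terms.contains(toyStem(word));
    }

    // Prefix lookup: the raw prefix is compared to stored terms, unstemmed.
    static boolean prefixMatches(Set<String> terms, String prefix) {
        for (String t : terms) {
            if (t.startsWith(prefix)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> terms = index("valve", "valves");
        System.out.println(terms);                         // prints [valv]
        System.out.println(termMatches(terms, "valves"));  // prints true
        System.out.println(prefixMatches(terms, "valve")); // prints false
        System.out.println(prefixMatches(terms, "valv"));  // prints true
    }
}
```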

Creating a Lucene Analyzer

I want to do some basic Hebrew stemming.
All the examples of custom analyzers I could find always merge other analyzers and filters but never do any string-level processing themselves.
What do I have to do, for example, if I want to create an analyzer that, for each term in the stream it gets, emits either one or two terms by the following rules:
if the incoming term begins with anything other than "a" it should be passed as is.
if the incoming term begins with "a" then two terms should be emitted: the original term and a second one without the leading "a" and with a lower boost.
So that if the document has "help away" it will return "help", "away", and "way^0.8".
What methods of the analyzer should I override to do this?
(A pointer to a similar nature example would be very helpful).
Thanks

Here's one example: http://www.java2s.com/Open-Source/Java-Document/Search-Engine/lucene/org/apache/lucene/wordnet/SynonymTokenFilter.java.htm
Briefly scanning the code, it seems it should emit additional tokens at the same position (a synonym). It does that by overriding incrementToken() which you'll have to do for your problem (maintain a stack of next tokens, returning one by one).
If this example doesn't work, just try to find one that explains how you could implement a synonym filter with Lucene; it's almost identical to your problem. The Lucene in Action book has a good example of this, and the code is available here: http://www.manning.com/hatcher3/LIAsourcecode.zip, class SynonymFilter.
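The "keep a stack of pending tokens and return them one by one" pattern can be sketched in plain Java. This is not a Lucene TokenFilter: LeadingAFilter, Tok and next() are hypothetical names, next() only mimics the role incrementToken() plays, and the boost is just a field on the made-up Tok type.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Illustration of the pending-token pattern; no Lucene classes are used.
final class LeadingAFilter {

    // Hypothetical token type for this sketch: a term and its boost.
    static final class Tok {
        final String term;
        final float boost;
        Tok(String term, float boost) { this.term = term; this.boost = boost; }
        @Override public String toString() {
            return boost == 1.0f ? term : term + "^" + boost;
        }
    }

    private final Iterator<String> input;
    private final Deque<Tok> pending = new ArrayDeque<>();

    LeadingAFilter(Iterator<String> input) { this.input = input; }

    // Returns the next token, or null when the input is exhausted.
    Tok next() {
        if (!pending.isEmpty()) return pending.poll(); // drain queued variants first
        if (!input.hasNext()) return null;
        String term = input.next();
        if (term.length() > 1 && term.startsWith("a")) {
            // Queue the variant without the leading "a", with a lower boost.
            pending.add(new Tok(term.substring(1), 0.8f));
        }
        return new Tok(term, 1.0f);
    }

    // Convenience wrapper: whitespace-splits the text and collects all tokens.
    static List<String> analyze(String text) {
        LeadingAFilter filter = new LeadingAFilter(Arrays.asList(text.split("\\s+")).iterator());
        List<String> out = new ArrayList<>();
        for (Tok t = filter.next(); t != null; t = filter.next()) {
            out.add(t.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("help away")); // prints [help, away, way^0.8]
    }
}
```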

lucene / solr remove common phrases (stop phrases)

I would like to eliminate from the search query the words/phrases that bring no meaning to the query (we could call them stop phrases). Example:
"How to .."
"Where can I find .."
"What is the meaning of .."
etc.
Where to find / how to compute a list of 'common phrases' for English and for French?
How to implement it in Solr (Is there anything more advanced than the stopwords feature?)

I think that you shouldn't try to completely get rid of these phrases, because they reveal the intent of the searcher. You can try to leverage their existence by using a natural language question answering system like Ephyra. There is even a project aimed at integrating it with Lucene. I haven't used it myself, but maybe at least evaluating it is worth a try.
If you are determined to remove them, then I think that you need to write a custom QueryParser that will filter the query, delegating the further processing to a parser of your choice.
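The filtering step itself can be quite small. A minimal sketch, assuming a hand-written phrase list and that only leading phrases are stripped; StopPhraseStripper, STOP_PHRASES and strip are hypothetical names, and the cleaned string would then be handed to the real parser.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Pre-processing sketch only; it does not parse Lucene/Solr query syntax.
final class StopPhraseStripper {

    // Hypothetical phrase list; a real one would be curated per language.
    static final List<String> STOP_PHRASES = Arrays.asList(
            "how to",
            "where can i find",
            "what is the meaning of");

    // Removes one leading stop phrase (case-insensitive), if present.
    static String strip(String query) {
        String trimmed = query.trim();
        String lower = trimmed.toLowerCase(Locale.ROOT);
        for (String phrase : STOP_PHRASES) {
            if (lower.startsWith(phrase + " ")) {
                return trimmed.substring(phrase.length()).trim();
            }
        }
        return trimmed;
    }

    public static void main(String[] args) {
        System.out.println(strip("How to install Solr"));          // prints install Solr
        System.out.println(strip("Where can I find Lucene docs")); // prints Lucene docs
    }
}
```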

Lucene gotchas with punctuation

Whilst building some unit tests for my Lucene queries I noticed some strange behavior related to punctuation, in particular around parentheses.
What are some of the best ways to deal with search fields that contain significant amounts of punctuation?

If you haven't customized the query parser, Lucene should behave according to the default query parser syntax. Are you getting something different than that? Do you want punctuation to have a special meaning or just to remove the punctuation from searches?
The other usual suspect here is the Analyzer, which determines how your field is indexed and how the query is broken into pieces for searching. Can you post specific examples of bad behavior?

It is not just parentheses; other punctuation such as the colon, hyphen etc. will cause issues. Here is a way to deal with them.
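One common approach (not necessarily the one linked above) is to backslash-escape the query parser's special characters before parsing. Lucene's QueryParser ships a static escape method for this; the plain-Java sketch below only approximates the idea so it can run without Lucene. QueryEscaper is a hypothetical name, and the character list is an assumption based on the documented query syntax, which varies between versions.

```java
// Approximation of query-syntax escaping; prefer the parser's own escape
// method in real code. QueryEscaper is a hypothetical name for this sketch.
final class QueryEscaper {

    // Assumed set of special characters: \ + - ! ( ) : ^ [ ] " { } ~ * ? | & /
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    // Prefixes every special character with a backslash.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("foo(bar):baz")); // prints foo\(bar\)\:baz
    }
}
```

Escaping only stops the parser from treating the characters as operators. Whether the punctuation survives into the indexed terms is still decided by the Analyzer.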
