Creating a Lucene Analyzer

I want to do some basic hebrew stemming.
All the examples of custom analyzers I could find always combine other analyzers and filters, but never do any string-level processing themselves.
What do I have to do, for example, if I want to create an analyzer that, for each term in the stream it gets, emits either one or two terms according to the following rules:
if the incoming term begins with anything other than "a", it should be passed through as is.
if the incoming term begins with "a", two terms should be emitted: the original term, and a second one without the leading "a" and with a lower boost.
So that if the document has "help away" it will return "help", "away", and "way^0.8".
What methods of the analyzer should I override to do this?
(A pointer to a similar nature example would be very helpful).
Thanks

Here's one example: http://www.java2s.com/Open-Source/Java-Document/Search-Engine/lucene/org/apache/lucene/wordnet/SynonymTokenFilter.java.htm
Briefly scanning the code, it seems to emit additional tokens at the same position (synonyms). It does that by overriding incrementToken(), which is what you'll have to do for your problem too (maintain a stack of pending tokens and return them one by one).
If this example doesn't work, just try to find one that explains how to implement a synonym filter with Lucene; it's almost identical to your problem. The Lucene in Action book has a good example of this, and the code is available here: http://www.manning.com/hatcher3/LIAsourcecode.zip, class SynonymFilter.
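To make that concrete, here is a minimal sketch of such a filter, written against the attribute-based TokenStream API of recent Lucene versions (the class name is made up for illustration). Note that an analyzer can't attach the 0.8 boost to the extra token by itself; you would typically encode it as a payload or apply the boost at query time.

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Passes every token through unchanged; for tokens starting with "a"
// (and longer than one character), also emits the token minus the
// leading "a" at the same position, like a synonym.
public final class LeadingAFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
  private State savedState;    // offsets etc. of the original token
  private String pendingTerm;  // stripped term waiting to be emitted

  public LeadingAFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pendingTerm != null) {
      restoreState(savedState);
      termAtt.setEmpty().append(pendingTerm);
      posIncAtt.setPositionIncrement(0); // stack at the same position
      pendingTerm = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    if (termAtt.length() > 1 && termAtt.charAt(0) == 'a') {
      pendingTerm = termAtt.subSequence(1, termAtt.length()).toString();
      savedState = captureState();
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pendingTerm = null;
    savedState = null;
  }
}
```

You would then wrap it in an Analyzer after your tokenizer of choice, exactly like the synonym examples do.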

Find indexed terms in non-indexed document/string

Sorry if I'm using the wrong terminology here, I'm new to Lucene :D
Assume that I have indexed all titles of the English Wikipedia in Lucene.
Let's say I'm visiting a news website. Within the article I want to convert all phrases (that match a title in the Wikipedia) into a link to the Wikipedia page.
To clarify: I don't want to put the news article into the Lucene index, but rather use the indexed WP titles to find matches within a given string (the article). We also don't want to bother with the JS/HTML stuff, just focus on Lucene for now.
I'd also like to match greedily: i.e. if the text contains "Stack Overflow", I'd like to link to SO rather than to "Stack" and "Overflow". But if I can get shorter matches as well, that would be neat, too. (I kind of want to do both, but I'll settle for either one if having both is difficult.)
Naive solution: I could query for single words iteratively and, whenever I get a hit, try the current word plus the next word, and so on until I miss. Then convert the last match into a link and continue after it, until I'm through the complete document.
But, that seems really awkward and I have a suspicion that Lucene might have some functionality that could support me here (or at least I hope so :D), but I have no clue what I'd be looking for.
Lucene's inverted index should make this a pretty fast operation, so I guess this might perform reasonably well, even with the naive approach.
Any pointers? I'm stuck :3
There is such a thing: it's called the Tagger Handler (a Solr request handler):
Given a dictionary (a Solr index) with a name-like field, you can post text to this request handler and it will return every occurrence of one of those names with offsets and other document metadata desired. It’s used for named entity recognition (NER).
It seems a bit fiddly to set up, but it's exactly what I wanted :D
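For reference, the naive greedy approach from the question could be sketched roughly as below. This assumes the titles were indexed verbatim (untokenized and case-sensitive) in a field called title_exact; both the field name and that indexing choice are illustration-only assumptions.

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Arrays;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

// For each start position, tries every phrase length and keeps the
// longest phrase that exists as an exact term in the index, then
// resumes scanning after the match (greedy longest-match).
public class GreedyTitleMatcher {
  public static void main(String[] args) throws IOException {
    String[] words = "I read about Stack Overflow yesterday".split("\\s+");
    try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("wp-index")))) {
      int i = 0;
      while (i < words.length) {
        int bestEnd = -1;
        StringBuilder phrase = new StringBuilder();
        for (int j = i; j < words.length; j++) {
          if (j > i) phrase.append(' ');
          phrase.append(words[j]);
          // docFreq > 0 means this exact phrase is an indexed title
          if (reader.docFreq(new Term("title_exact", phrase.toString())) > 0) {
            bestEnd = j;
          }
        }
        if (bestEnd >= 0) {
          System.out.println("link: " + String.join(" ", Arrays.copyOfRange(words, i, bestEnd + 1)));
          i = bestEnd + 1; // skip past the longest match
        } else {
          i++;
        }
      }
    }
  }
}
```

The Tagger Handler does essentially this, but far more efficiently than issuing one lookup per candidate phrase.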

Azure Search - issues with Phonetic Analyzer

Our clients query our Azure Search index, mostly for people's names. We are using the Lucene analyzer for all of our fields. We build the query string by turning the client's input name into a phrase and adding a proximity of 3. Because we search using a phrase, we cannot use the fuzzy search capability of the Lucene analyzer, as it only works on single words.
We therefore needed a way to bring back results with names that weren't spelled exactly as the client entered them. We came across the phonetic analyzer and have just implemented the Metaphone algorithm in our index. We've run some tests, and while it gets us closer to what we need, we still see some issues:
The analyzer's scope is so wide that it brings back a lot of false positives. For example, a search on Kenneth Gooden brings back Kenneth Cotton. That's just a little too far to be considered phonetically similar, in our opinion. Can the sensitivity be tweaked in any way, or can something be done with some other parameter to remedy this?
When doing a search on Barry Soper, the first and highest-scored result that comes back is "Barry Spear." The second result, scored lower, is "Soper, Barry Russell." To a certain extent I can maybe see why it's scored that way (because the second one is last name first), but then... not really. The second result contains both exact terms within the required proximity. Maybe Azure Search gives priority to the order of words in the phrase before applying the analyzer? It still doesn't make sense to me. (Side note: this query also brings back "Barh Super"; see issue #1 above.)
I would like to know if someone could offer suggestions to tweak Azure Search's behavior to work more along the lines of what we need, or perhaps suggest an alternative to the phonetic analyzer. We haven't tried any of the other available phonetic algorithms yet, only because Metaphone seems to be the best and most commonly used. But we're open to suggestions regarding the other algorithms as well.
Thanks.
You are correct that the fuzzy operator only works on single terms. In this case, you can use a custom analyzer (a phonetic token filter) or the Synonyms feature (in preview). I am not sure what you meant by "we have just implemented the Metaphone algorithm in our index", but there are several phonetic token filters to choose from in the Azure Search custom analysis stack. Synonyms is a newer feature, currently available in preview. For synonyms, you will need to define synonym rules, say 'Nate, Nathan, Nathaniel' for example, and at query time searching for one automatically includes results for the others.
Okay, then how should you use these building blocks to control relevance for your search? One way to model this is to use a separate field for each expansion strategy. For example, instead of a single name field, you could have three fields, say 'name', 'name_synonym', and 'name_phonetic'. The first field, 'name', is for exact matches; 'name_synonym' has synonyms enabled; and the third uses a phonetic analyzer, which broadens the search the most. You can then use a scoring profile to boost scores from matches in each field; for example, a boost value of 10 for exact matches, 5 for synonyms, and 1 for phonetic expansions. Your search will be issued against these three internal fields.
Regarding your question as to why 'Soper, Barry Russell' is ranked lower than 'Barry Spear': after the phonetic analysis, the words 'soper' and 'spear' reduce to the same form both at indexing and query time and are treated as identical terms. In computing the score and ranking, the search engine uses the analyzed form of the terms, so phonetic similarity has no influence on the score. That's why secondary factors, like field length, play a more significant role in the relevance score.
Hope this helps. I described one way to model this, but you could also take a look at term boosting in the full Lucene query syntax.
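For instance, with the hypothetical three-field layout above, a boosted query in the full Lucene syntax (queryType=full) might look like this; the field names and boost values are illustration-only:

```
search=name:"barry soper" OR name_synonym:"barry soper"^0.5 OR name_phonetic:"barry soper"^0.2
```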
Let me know if you have any additional questions.
Nate

Lucene searching stack traces: splitting on dots

I'm writing an app that embeds Lucene to search for, amongst other things, parts of stack traces, including class names etc. For example, if a document contained:
java.lang.NullPointerException
The documents can also contain ordinary English text.
I'd like to be able to query for either NullPointerException or java.lang.NullPointerException and find the document. Using the StandardAnalyzer, I only get a match if I search for the full java.lang.NullPointerException.
What's the best way to go about supporting this? Can I get multiple tokens emitted? e.g. java, lang, NullPointerException and java.lang.NullPointerException? Or would I be better replacing all the . characters with spaces up front? Or something else?
The dot character is considered an "ambiguous terminator" for the purposes of the algorithm used by StandardAnalyzer. Lucene attempts to be intelligent about this and makes the best possible guess for the situation.
You have a couple of options here:
If you don't want Lucene to apply a bunch of complicated lexical tokenization rules, you can try a simpler analyzer, such as SimpleAnalyzer, which just creates tokens from uninterrupted strings of letters.
Implement a filter that applies your own specialized rules and incorporate it into an Analyzer, similar to the StandardAnalyzer. This would allow you to use whatever identification technique you like to recognize that a token is an exception class name and split it up during the analysis phase (see the sketch after this list).
As you said, you can replace the periods with spaces before the text ever hits the analyzer at all.
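A minimal sketch of the second option, assuming a reasonably recent Lucene (6.5 or later, where WordDelimiterGraphFilter exists): instead of hand-writing the splitting logic, it reuses that built-in filter, whose PRESERVE_ORIGINAL flag keeps the full dotted token alongside the parts.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;

// Whitespace tokenization keeps "java.lang.NullPointerException" as a
// single token; WordDelimiterGraphFilter then splits it on the dots
// into "java", "lang", "NullPointerException", while PRESERVE_ORIGINAL
// also keeps the full dotted form, so both kinds of query can match.
public class StackTraceAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    int flags = WordDelimiterGraphFilter.GENERATE_WORD_PARTS
              | WordDelimiterGraphFilter.PRESERVE_ORIGINAL;
    TokenStream result = new WordDelimiterGraphFilter(source, flags, null);
    return new TokenStreamComponents(source, result);
  }
}
```

A LowerCaseFilter (and anything else you need) can be appended to the chain in the same way.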

Stemming + wildcarding: unexpected effects

I am editing a Lucene.NET implementation (2.3.2) at work to include stemming and automatic wildcarding (appending * to the ends of words).
I have found that exact words with wildcarding don't work (so stack* gets a hit for stackoverflow, but stackoverflow* does not), and I was wondering what causes this and how it might be fixed.
Thanks in advance. (Also thanks for not asking why I am implementing both automatic wildcarding and stemming.)
I am about to make the query always a prefix query, so I don't have to do any nasty appending of "*"s to queries; we will see if anything becomes clear then.
Edit: Only words that are stemmed do not work wildcarded. For example, silicate* doesn't work, but silic* does.
The reason it doesn't work is that you stem the content, thus changing the term.
For example consider the word "valve". The snowball analyzer will stem it down to "valv".
So at search time, since you stem the input query, both "valve" and "valves" will be stemmed down to "valv". A TermQuery using the stemmed term "valv" will yield a match on both "valve" and "valves" occurrences.
But now, since the index stores the term "valv", a query for "valve*" will not match anything. That is because the QueryParser does not run the Analyzer on wildcard queries.
There is the AnalyzingQueryParser that can handle some of these cases, but I don't think it was in the 2.3.x versions of Lucene. Anyway, it's not a universal fit; the documentation says:
Warning: This class should only be used with analyzers that do not use stopwords or that add tokens. Also, several stemming analyzers are inappropriate: for example, GermanAnalyzer will turn Häuser into hau, but H?user will become h?user when using this parser and thus no match would be found (i.e. using this parser will be no improvement over QueryParser in such cases).
The solution mentioned in the duplicate I linked works for all cases, but you will get bigger indexes.
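To see the mismatch directly, here is a small sketch of the effect; it's written against a recent Java Lucene rather than Lucene.NET 2.3.2, and uses EnglishAnalyzer (a Porter stemmer) where the question uses the snowball analyzer, but the behavior is analogous:

```java
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Prints what the analyzer actually stores in the index: "silicate"
// and "silicates" both come out as "silic", which is why the wildcard
// query "silicate*" (never analyzed by QueryParser) matches nothing.
public class StemDemo {
  public static void main(String[] args) throws IOException {
    try (Analyzer analyzer = new EnglishAnalyzer();
         TokenStream ts = analyzer.tokenStream("body", "silicate silicates")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term); // prints "silic" twice
      }
      ts.end();
    }
  }
}
```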

Lucene Based Searching

I have a problem with Lucene-based searching. I have designed a document with five fields; consider the document to be an Address with addressline1, addressline2, city, state, and pin. A search has to be performed across all the fields, so I'm using boolean term queries, and that retrieves the results. But now I also have to respond not only with the matching documents but also with the matching field. For example, if the city field matches the search, then I should report that city matched, along with the actual search response. Is there any Lucene API to accommodate this?
AFAIK there's no simple solution to find out which field matched the query.
Your options are:
try using the hit highlighter (it knows where the match occurred, but it's noticeably slow on large result sets)
fiddle with IndexSearcher's explain method (see the sketch at the end of this answer)
build your own custom solution
See also: hit highlighter experience and workaround findings.
IMHO it shouldn't be hard to implement this yourself, since at some point Lucene surely knows which field yielded a match; it just discards that information as unnecessary weight by the time it composes your response.
I stumbled upon this custom approach.
Try to find more resources on search-lucene.com, the best Lucene/Solr related search engine.
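For the explain-based option above, a rough sketch: re-test each per-field query against every hit and collect the fields whose Explanation reports a match. The per-field TermQuery assumes a single, already-analyzed search term; adapt it to however you build your boolean query.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;

// After running the overall boolean query, call this per hit:
// explain() re-evaluates each field's sub-query against the document,
// and isMatch() reveals whether that field contributed to the hit.
public class MatchingFields {
  static final String[] FIELDS = {"addressline1", "addressline2", "city", "state", "pin"};

  static List<String> matchingFields(IndexSearcher searcher, String text, ScoreDoc hit)
      throws IOException {
    List<String> matched = new ArrayList<>();
    for (String field : FIELDS) {
      Query perField = new TermQuery(new Term(field, text));
      if (searcher.explain(perField, hit.doc).isMatch()) {
        matched.add(field);
      }
    }
    return matched;
  }
}
```

Note that explain() is relatively expensive, so like the highlighter approach it's best reserved for the page of results you actually display.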