Lucene Standard Analyzer hyphen consideration

While indexing my documents with Lucene's StandardAnalyzer, I ran into a problem.
For example:
my document contained the word "plag-iarism". The analyzer indexed it as "plag" and "iarism", but I want it indexed as "plagiarism". What do I have to do to get the whole word?

StandardAnalyzer delegates tokenization to StandardTokenizer.
You can create your own tokenizer to match your exact needs (you could base it on StandardTokenizer).
Alternatively, if you prefer, you could do a dirty hack: run a String.replaceAll() with the relevant regular expression on the text just before the analyzer runs. Yeah. Ugly.
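A minimal sketch of the pre-processing hack mentioned above, in plain Java with no Lucene calls. The class name HyphenJoiner and the method joinHyphenated are hypothetical names chosen for illustration, and the regular expression is only one possible choice: it removes a hyphen only when there is a letter on both sides. Note that it would also join genuine compounds such as "well-known", so it may be too aggressive for some corpora.

```java
import java.util.regex.Pattern;

// Hypothetical helper: cleans text before it is handed to the analyzer.
class HyphenJoiner {
    // Matches a hyphen that has a letter directly before and after it,
    // e.g. the one in "plag-iarism". A free-standing " - " is left alone.
    private static final Pattern INNER_HYPHEN =
            Pattern.compile("(?<=\\p{L})-(?=\\p{L})");

    // Removes intra-word hyphens so the tokenizer sees a single word.
    static String joinHyphenated(String text) {
        return INNER_HYPHEN.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // prints: a case of plagiarism - twice
        System.out.println(joinHyphenated("a case of plag-iarism - twice"));
    }
}
```

The cleaned string would then be passed to the analyzer in place of the raw text, both when indexing and when parsing queries.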

Related

Lucene Incomplete Word Spellchecking

I am using Lucene to perform spell checking. I am using
https://lucene.apache.org/core/5_4_1/suggest/org/apache/lucene/search/spell/SpellChecker.html
What I really want is, for example, I have a word:
spellingmistake
And now I type:
speli.
In this case, I want the spellchecker to return the correct word, or at least return spell. To achieve this, I used an EdgeNGramTokenizer in the IndexWriter config while indexing the dictionary, assuming it would work for this case. Unfortunately, it is not working.
How do I get this working?
Thanks.

Looking for an Indonesian language stemmer

I'm processing some Indonesian texts in a Java application, and I need to stem them.
Currently I am using the Lucene Indonesian stemmer,
org.apache.lucene.analysis.id.IndonesianAnalyzer;
but the results are not satisfactory.
Could anyone suggest a different stemmer?
"enang" is a stem. Stems need not be actual words. For instance, in English, "argue", "argues" and "arguing" reduce to the stem "argu". "argu" isn't an English word, but it is a meaningful stem. This is how stemmers work. As long as you apply the stemmer the same way to the indexed data and the query, it should work well.
If you don't want behavior like that, it doesn't make any sense to use a stemmer at all.
Aside from the stemmer, IndonesianAnalyzer is fairly easily replicated. Its other components are just a StandardTokenizer, StandardFilter, LowerCaseFilter, and a StopFilter. That's just a StandardAnalyzer with an Indonesian stopword set, when you get right down to it, so you can create an IndonesianAnalyzer without the stemmer as simply as:
//If you are using the default stopword location defined in the IndonesianAnalyzer you could load them like this.
//Note: loadStopwordSet is a protected static method, so this call only compiles inside a subclass of StopwordAnalyzerBase;
//elsewhere, IndonesianAnalyzer.getDefaultStopSet() should give you the same set.
CharArraySet defaultStopSet = StopwordAnalyzerBase.loadStopwordSet(false, IndonesianAnalyzer.class, IndonesianAnalyzer.DEFAULT_STOPWORD_FILE, "#");
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43, defaultStopSet);
I'm not sure whether you would run into problems just passing a reader on the default stop word file into the StandardAnalyzer constructor.

Luke Lucene QueryParser Case Sensitivity

In Luke, if I enter the search expression docfile:Tomatoes.jpg* the parsed query is docfile:Tomatoes.jpg*. When the search expression is docfile:Tomatoes.jpg, (no asterisk *) the parsed query is docfile:tomatoes.jpg with a lowercase 't'.
Why?
How can I change this?
BTW, I am using org.apache.lucene.analysis.standard.StandardAnalyzer.
StandardAnalyzer uses LowerCaseFilter which means it lowercases your queries and data. This is described in the Javadocs http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html.
If I remember correctly WhitespaceAnalyzer does not lowercase, but verify it suits your needs http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/analysis/WhitespaceAnalyzer.html.
For Lucene 5.3.0 the problem was solved by using the SimpleAnalyzer.
Example:
Analyzer analyzer = new org.apache.lucene.analysis.core.SimpleAnalyzer();
Finally, use the same analyzer for building the index and searching.
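To illustrate why the same analysis has to be applied on both sides, here is a small plain-Java sketch. It does not use Lucene: normalize is a hypothetical stand-in for an analyzer that lowercases terms, and a HashSet stands in for the index.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy model of "index with one analysis, search with another".
class SameAnalysisSketch {
    // Hypothetical stand-in for an analyzer that lowercases terms.
    static String normalize(String term) {
        return term.toLowerCase(Locale.ROOT);
    }

    // Builds a toy "index" holding the normalized form of each term.
    static Set<String> buildIndex(String... terms) {
        Set<String> index = new HashSet<>();
        for (String term : terms) {
            index.add(normalize(term));
        }
        return index;
    }

    public static void main(String[] args) {
        Set<String> index = buildIndex("Tomatoes.jpg"); // stored as "tomatoes.jpg"

        // Query term used as typed: no match, prints false
        System.out.println(index.contains("Tomatoes.jpg"));
        // Query term passed through the same normalization: prints true
        System.out.println(index.contains(normalize("Tomatoes.jpg")));
    }
}
```

The same reasoning applies in the other direction: if the index was built without lowercasing, a lowercased query term would miss the original mixed-case term.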

Lucene - Which analyzer to use to avoid prepositions

I am using the Lucene StandardAnalyzer to parse text. However, it is returning prepositions as well as words like "i", "the", "and", etc.
Is there an Analyzer I can use that will not return these words?
Thanks
StandardAnalyzer uses StopFilter.
By default the words in the STOP_WORDS_SET are excluded. If this is not sufficient, there are constructors which allow you to pass in a list of stop words which should be removed from the token stream. You can provide the list using a File, a Set, or a Reader.
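As a rough picture of what stop-word removal does to a token stream, here is a plain-Java sketch. It is not Lucene's StopFilter: the class StopWordSketch, the method removeStopWords, the tokenizing regular expression and the stop-word list are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stop-word filter: lowercases, splits on non-alphanumerics, drops stop words.
class StopWordSketch {
    static List<String> removeStopWords(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        // Split on any run of characters that are neither letters nor digits.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical custom stop-word list, including some prepositions.
        Set<String> stopWords = new HashSet<>(Arrays.asList("i", "the", "and", "of", "in"));
        // prints: [saw, cat, dog, garden]
        System.out.println(removeStopWords("I saw the cat and the dog in the garden", stopWords));
    }
}
```

With Lucene itself, the equivalent step would be to pass your own stop-word set to one of the StandardAnalyzer constructors mentioned above instead of relying on the default set.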

Using MultiFieldQueryParser

I am using MultiFieldQueryParser to parse strings like a.a., b.b., etc.
But after parsing, it removes the dots from the string.
What am I missing here?
Thanks.
I'm not sure the MultiFieldQueryParser does what you think it does. Also...I'm not sure I know what you're trying to do.
I do know that with any query parser, strings like 'a.a.' and 'b.b.' will have the periods stripped out because, at least with the default Analyzer, all punctuation is treated as white space.
As far as the MultiFieldQueryParser goes, that's just a QueryParser that allows you to specify multiple default fields to search. So with the query
title:"Of Mice and Men" "John Steinbeck"
The string "John Steinbeck" will be looked for in all of your default fields whereas "Of Mice and Men" will only be searched for in the title field.
What analyzer is your parser using? If it's StopAnalyzer then the dot could be a stop word and is thus ignored. Same thing if it's StandardAnalyzer which cleans up input (includes removing dots).
(Repeating my answer from the dupe. One of these should be deleted).
The StandardAnalyzer specifically handles acronyms, and converts C.F.A. (for example) to cfa. This means you should be able to do the search, as long as you make sure you use the same analyzer for the indexing and for the query parsing.
I would suggest you run some more basic test cases to eliminate other factors. Try to use an ordinary QueryParser instead of a multi-field one.
Here's some code I wrote to play with the StandardAnalyzer:
StringReader testReader = new StringReader("C.F.A. C.F.A word");
StandardAnalyzer analyzer = new StandardAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("title", testReader);
System.out.println(tokenStream.next());
System.out.println(tokenStream.next());
System.out.println(tokenStream.next());
The output for this, by the way, was:
(cfa,0,6,type=<ACRONYM>)
(c.f.a,7,12,type=<HOST>)
(word,13,17,type=<ALPHANUM>)
Note, for example, that if the acronym doesn't end with a dot then the analyzer assumes it's an internet host name, so searching for "C.F.A" will not match "C.F.A." in the text.