Lucene 4.1: How to split words that contain dots when indexing? - lucene

I'm trying to figure out what I should do to index keywords that contain "." .
ex: this.name
I want to index the terms this and name in my index.
I use the StandardAnalyzer. I tried extending WhitespaceTokenizer or TokenFilter, but I'm not sure if I'm headed in the right direction.
If I use the StandardAnalyzer, I'll obtain "this.name" as a keyword, and that's not what I want, but the analyzer does the rest correctly for me.

You can put a CharFilter in front of StandardTokenizer that converts periods and underscores to spaces. MappingCharFilter will work.
Here's MappingCharFilter added to a stripped-down StandardAnalyzer (see the original 4.1 version here):
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;

import java.io.IOException;
import java.io.Reader;

public final class MyAnalyzer extends StopwordAnalyzerBase {
    private int maxTokenLength = 255;

    public MyAnalyzer() {
        super(Version.LUCENE_41, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    }

    @Override
    protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
        final StandardTokenizer src = new StandardTokenizer(matchVersion, reader);
        src.setMaxTokenLength(maxTokenLength);
        TokenStream tok = new StandardFilter(matchVersion, src);
        tok = new LowerCaseFilter(matchVersion, tok);
        tok = new StopFilter(matchVersion, tok, stopwords);
        return new TokenStreamComponents(src, tok) {
            @Override
            protected void setReader(final Reader reader) throws IOException {
                src.setMaxTokenLength(MyAnalyzer.this.maxTokenLength);
                super.setReader(reader);
            }
        };
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Map periods and underscores to spaces before the tokenizer sees the text.
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add(".", " ");
        builder.add("_", " ");
        NormalizeCharMap normMap = builder.build();
        return new MappingCharFilter(normMap, reader);
    }
}
Here's a quick test to demonstrate it works:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;

public class TestMyAnalyzer extends BaseTokenStreamTestCase {
    private Analyzer analyzer = new MyAnalyzer();

    public void testPeriods() throws Exception {
        BaseTokenStreamTestCase.assertAnalyzesTo(
            analyzer,
            "this.name; here.i.am; sentences ... end with periods.",
            new String[] { "name", "here", "i", "am", "sentences", "end", "periods" });
    }

    public void testUnderscores() throws Exception {
        BaseTokenStreamTestCase.assertAnalyzesTo(
            analyzer,
            "some_underscore_term _and____ stuff that is_not in it",
            new String[] { "some", "underscore", "term", "stuff" });
    }
}

If I understand you correctly, you need to use a tokenizer that removes dots -- that is, any name that contains a dot should be split at that point ("here.i.am" becomes "here" + "i" + "am").

You are getting caught by behavior documented here:
However, a dot that's not followed by whitespace is considered part of a token.
StandardTokenizer applies more complex parsing rules than you may be looking for. This particular rule is intended to keep URLs, IPs, identifiers, etc. from being split apart. A simpler implementation, such as LetterTokenizer, might suit your needs.
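For instance, a minimal analyzer built around LetterTokenizer might look like this (a sketch against the Lucene 4.1 API; LetterOnlyAnalyzer is just an illustrative name, and note that it drops StandardTokenizer's URL/email handling and stop-word filtering entirely):
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.util.Version;

import java.io.Reader;

public final class LetterOnlyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // LetterTokenizer emits maximal runs of letters, so "this.name" becomes "this" + "name".
        LetterTokenizer src = new LetterTokenizer(Version.LUCENE_41, reader);
        return new TokenStreamComponents(src, new LowerCaseFilter(Version.LUCENE_41, src));
    }
}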
If that doesn't really suit your needs (and it might well turn out to be throwing the baby out with the bathwater), then you may need to modify StandardTokenizer yourself, which is explicitly encouraged by the Lucene docs:
Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.

Sebastien Dionne: I didn't understand how to split a word. Do I have to parse the document char by char?
Sebastien Dionne: I still want to know how to split a token into multiple parts and index them all.
You may have to write a custom analyzer.
An Analyzer is a combination of a Tokenizer and, optionally, a chain of TokenFilter instances.
Tokenizer: takes the input text you pass in, typically as a java.io.Reader, and JUST breaks it down into tokens. It doesn't alter the text, it only breaks it down.
TokenFilter: takes the tokens emitted by the Tokenizer, then adds, removes, or alters tokens and emits them one by one until all are finished.
If it replaces a token with multiple tokens based on your requirements, it buffers them all and emits them to the indexer one by one, as in the sketch below.
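For example, a minimal filter along these lines might look like this (a sketch against the Lucene 4.x TokenStream API; DotSplitFilter is a hypothetical name, and a production version would also adjust position increments and offsets for the buffered parts):
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.io.IOException;
import java.util.LinkedList;
import java.util.Queue;

// Splits each incoming token on '.', buffers the parts, and emits them one at a time.
public final class DotSplitFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final Queue<String> pending = new LinkedList<String>();

    public DotSplitFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        while (true) {
            if (!pending.isEmpty()) {
                // Emit the next buffered part (offsets/position of the original token are kept as-is).
                termAtt.setEmpty().append(pending.poll());
                return true;
            }
            if (!input.incrementToken()) {
                return false;
            }
            String term = termAtt.toString();
            if (term.indexOf('.') < 0) {
                return true; // no dot, pass the token through unchanged
            }
            for (String part : term.split("\\.")) {
                if (!part.isEmpty()) {
                    pending.add(part);
                }
            }
            // Loop again: emit the first buffered part, or fetch the next token if none remained.
        }
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending.clear();
    }
}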
You may check the following resource; unfortunately, you may have to sign up for a trial membership.
By writing a custom analyzer, you can break down the text the way you want. You may even reuse existing components like LowerCaseFilter. Fortunately, if you can't find a suitable analyzer built in or on the web, it is achievable with Lucene to come up with one that serves your purpose.
" Writing Custom Filters: Lucene in Action 2"

Related

Why WordNet and JWI stemmer gives "ord" and "orde" in result of "order" stemming?

I'm working on a project using WordNet and JWI 2.4.0.
Currently, I'm running a lot of words through the included stemmer; it seemed to work, until I asked for "order".
The stemmer answers that "order", "orde", and "ord" are the possible stems of "order".
I'm not a native English speaker, but... I have never seen the word "ord" in my life, and when I asked the WordNet dictionary for its definition, there is obviously nothing. (On BabelNet online, I found that it is a town in Nebraska!)
Well, why is there this strange stem?
How can I filter out the stems that are not present in the WordNet dictionary? (Because when I re-use the stemmed words, "orde" makes the program crash.)
Thank you!
ANSWER: I hadn't understood what a stem really is, so this question doesn't make sense.
Here is some code to test :
package JWIExplorer;

import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.Arrays;
import java.util.Date;
import java.util.Iterator;
import java.util.List;

import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.morph.WordnetStemmer;

public class TestJWI
{
    public static void main(String[] args) throws IOException
    {
        List<String> WordList_Research = Arrays.asList("dog", "cat", "mouse");
        List<String> WordList_Research2 = Arrays.asList("order");

        String path = "./" + File.separator + "dict";
        URL url;
        url = new URL("file", null, path);

        System.out.println("BEGIN : " + new Date());
        for (Iterator<String> iterstr = WordList_Research2.iterator(); iterstr.hasNext();)
        {
            String str = iterstr.next();
            TestStem(url, str);
        }
        System.out.println("END : " + new Date());
    }

    public static void TestStem(URL url, String ResearchedWord) throws IOException
    {
        // construct the dictionary object and open it
        IDictionary dict = new Dictionary(url);
        dict.open();

        // First, let's check for the stem word
        WordnetStemmer Stemmer = new WordnetStemmer(dict);
        List<String> StemmedWords;

        // null for all words, POS.NOUN for nouns
        StemmedWords = Stemmer.findStems(ResearchedWord, null);
        if (StemmedWords.isEmpty())
            return;

        for (Iterator<String> iterstr = StemmedWords.iterator(); iterstr.hasNext();)
        {
            String str = iterstr.next();
            System.out.println("Local stemmed iteration on : " + str);
        }
    }
}
Stems do not necessarily need to be words by themselves. "Order" and "Ordinal" share the stem "Ord".
The fundamental problem here is that stems are related to spelling, but language evolution and spelling are only weakly related (especially in English). As programmers, we'd much rather describe a stem as a regex, e.g. ^ord[ie]; this captures that it is not the stem of "ordained".
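To answer the filtering part of the question: a stem can be checked against the dictionary before it is reused. A minimal sketch, assuming an already-opened IDictionary like the one in the code above (JWI's getIndexWord returns null when WordNet has no entry for that lemma and part of speech):
import java.util.ArrayList;
import java.util.List;

import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.POS;
import edu.mit.jwi.morph.WordnetStemmer;

public class StemFilterExample
{
    // Keep only the stems that WordNet actually knows as nouns (hypothetical helper).
    public static List<String> filterKnownStems(IDictionary dict, String word)
    {
        WordnetStemmer stemmer = new WordnetStemmer(dict);
        List<String> known = new ArrayList<String>();
        for (String stem : stemmer.findStems(word, POS.NOUN))
        {
            if (dict.getIndexWord(stem, POS.NOUN) != null)
            {
                known.add(stem);
            }
        }
        return known;
    }
}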

Tess4J doOCR() for *First Page* of pdf / tif

Is there a way to tell Tess4J to only OCR a certain number of pages / characters?
I will potentially be working with 200+ page PDFs, but I really only want to OCR the first page, if that!
As far as I understand, the common sample
package net.sourceforge.tess4j.example;

import java.io.File;
import net.sourceforge.tess4j.*;

public class TesseractExample {
    public static void main(String[] args) {
        File imageFile = new File("eurotext.tif");
        Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping
        // Tesseract1 instance = new Tesseract1(); // JNA Direct Mapping

        try {
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}
would attempt to OCR the entire 200+ page PDF into a single String.
For my particular case, that is way more than I need, and I'm worried it could take a very long time if I let it process all 200+ pages and then just substring the first 500 characters or so.
The library has a PdfUtilities class that can extract certain pages of a PDF.
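For instance (a rough sketch based on the question's sample; PdfUtilities lived in net.sourceforge.vietocr in older Tess4J releases and moved to net.sourceforge.tess4j.util, with File/int arguments for splitPdf, in newer ones, so check the signature for your version):
import java.io.File;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import net.sourceforge.vietocr.PdfUtilities; // package and signature differ in newer Tess4J versions

public class FirstPageOcr {
    public static void main(String[] args) {
        // Split off page 1 into its own PDF, then OCR only that file.
        PdfUtilities.splitPdf("input.pdf", "first-page.pdf", "1", "1");

        Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping
        try {
            String result = instance.doOCR(new File("first-page.pdf"));
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}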

How to implement just some basic keywords highlighting in text editor?

I'm a novice programmer trying to learn plug-in development. I'd like to upgrade the sample XML editor so that some words like "cat", "dog", "hamster", "rabbit" and "bird" are highlighted when they appear in an XML file (it's just for learning purposes). Can anyone give me some implementation tips or suggestions? I am clueless... (But I am carrying out my own research on this as well; I'm not being lazy. You have my word.) Thanks in advance.
You can detect words in the plain text part of the XML by modifying the sample XML editor as follows.
We can use the provided WordRule class to detect the words. The XMLScanner class which scans the plain text needs to be updated to include the word rule:
public XMLScanner(final ColorManager manager)
{
    IToken procInstr = new Token(new TextAttribute(manager.getColor(IXMLColorConstants.PROC_INSTR)));

    WordRule words = new WordRule(new WordDetector());
    words.addWord("cat", procInstr);
    words.addWord("dog", procInstr);
    // TODO add more words here

    IRule[] rules = new IRule[] {
        // Add rule for processing instructions
        new SingleLineRule("<?", "?>", procInstr),
        // Add generic whitespace rule.
        new WhitespaceRule(new XMLWhitespaceDetector()),
        // Words rules
        words
    };

    setRules(rules);
}
I have used the existing processing instruction token here to reduce the amount of new code, but you should define a new color and use a new token.
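For example, a dedicated token might look roughly like this inside the XMLScanner constructor (a sketch; KEYWORD is a hypothetical RGB constant you would add alongside the existing IXMLColorConstants entries):
// Hypothetical: define a new constant in IXMLColorConstants, e.g. RGB KEYWORD = new RGB(0, 128, 0);
IToken keywordToken = new Token(
        new TextAttribute(manager.getColor(IXMLColorConstants.KEYWORD)));
WordRule words = new WordRule(new WordDetector());
words.addWord("cat", keywordToken);
words.addWord("dog", keywordToken);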
The WordRule constructor requires an IWordDetector; we can use a very simple detector here:
class WordDetector implements IWordDetector
{
    @Override
    public boolean isWordStart(final char c)
    {
        return Character.isLetter(c);
    }

    @Override
    public boolean isWordPart(final char c)
    {
        return Character.isLetter(c);
    }
}
This is just accepting letters in words.

How to add options for Analyze in Apache Lucene?

Lucene has Analyzers that basically tokenize and filter the corpus when indexing. Operations include converting tokens to lowercase, stemming, removing stopwords, etc.
I'm running an experiment where I want to try all possible combinations of analysis operations: stemming only, stopping only, stemming and stopping, ...
In total, there are 36 combinations that I want to try.
How can I easily and gracefully do this?
I know that I can extend the Analyzer class and implement the tokenStream() function to create my own Analyzer:
public class MyAnalyzer extends Analyzer
{
    public TokenStream tokenStream(String field, final Reader reader) {
        return new NameFilter(
            new CaseNumberFilter(
                new StopFilter(
                    new LowerCaseFilter(
                        new StandardFilter(
                            new StandardTokenizer(reader))),
                    StopAnalyzer.ENGLISH_STOP_WORDS)));
    }
}
What I'd like to do is write one such class, which can somehow take boolean values for each of the possible operations (doStopping, doStemming, etc.). I don't want to have to write 36 different Analyzer classes that each perform one of the 36 combinations. What makes it difficult is the way the filters are all combined together in their constructors.
Any ideas on how to do this gracefully?
EDIT: By "gracefully", I mean that I can easily create a new Analyzer in some sort of loop:
analyzer = new MyAnalyzer(doStemming, doStopping, ...)
where doStemming and doStopping change with each loop iteration.
Solr solves this problem by using Tokenizer and TokenFilter factories. You could do the same, for example:
public interface TokenizerFactory {
    Tokenizer newTokenizer(Reader reader);
}

public interface TokenFilterFactory {
    TokenFilter newTokenFilter(TokenStream source);
}

public class ConfigurableAnalyzer {
    private final TokenizerFactory tokenizerFactory;
    private final List<TokenFilterFactory> tokenFilterFactories;

    public ConfigurableAnalyzer(TokenizerFactory tokenizerFactory, TokenFilterFactory... tokenFilterFactories) {
        this.tokenizerFactory = tokenizerFactory;
        this.tokenFilterFactories = Arrays.asList(tokenFilterFactories);
    }

    public TokenStream tokenStream(String field, Reader source) {
        TokenStream sink = tokenizerFactory.newTokenizer(source);
        for (TokenFilterFactory tokenFilterFactory : tokenFilterFactories) {
            sink = tokenFilterFactory.newTokenFilter(sink);
        }
        return sink;
    }
}
This way, you can configure your analyzer by passing a factory for one tokenizer and 0 to n filters as constructor arguments.
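For example, the combinations could then be assembled like this (a sketch that uses the factory interfaces defined above together with the pre-4.0 Lucene classes that appear elsewhere in this question):
// Hypothetical usage: toggle stopping/stemming by choosing which factories to pass.
TokenizerFactory lowerCaseTokenizer = new TokenizerFactory() {
    public Tokenizer newTokenizer(Reader reader) {
        return new LowerCaseTokenizer(reader);
    }
};
TokenFilterFactory stopping = new TokenFilterFactory() {
    public TokenFilter newTokenFilter(TokenStream source) {
        // StopFilter(boolean, TokenStream, Set) as in Lucene 2.9/3.0
        return new StopFilter(true, source, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    }
};
TokenFilterFactory stemming = new TokenFilterFactory() {
    public TokenFilter newTokenFilter(TokenStream source) {
        return new PorterStemFilter(source);
    }
};

// Pick and choose per experiment:
ConfigurableAnalyzer stopOnly    = new ConfigurableAnalyzer(lowerCaseTokenizer, stopping);
ConfigurableAnalyzer stemOnly    = new ConfigurableAnalyzer(lowerCaseTokenizer, stemming);
ConfigurableAnalyzer stopAndStem = new ConfigurableAnalyzer(lowerCaseTokenizer, stopping, stemming);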
Add some class variables to the custom Analyzer class which can be easily set and unset on the fly. Then, in the tokenStream() function, use those variables to determine which filters to perform.
public class MyAnalyzer extends Analyzer {
    private Set customStopSet;
    public static final String[] STOP_WORDS = ...;
    private boolean doStemming = false;
    private boolean doStopping = false;

    public MyAnalyzer() {
        super();
        customStopSet = StopFilter.makeStopSet(STOP_WORDS);
    }

    public void setDoStemming(boolean val) {
        this.doStemming = val;
    }

    public void setDoStopping(boolean val) {
        this.doStopping = val;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // First, convert to lower case
        TokenStream out = new LowerCaseTokenizer(reader);
        if (this.doStopping) {
            out = new StopFilter(true, out, customStopSet);
        }
        if (this.doStemming) {
            out = new PorterStemFilter(out);
        }
        return out;
    }
}
There is one gotcha: LowerCaseTokenizer takes as input the reader variable, and returns a TokenStream. This is fine for the following filters (StopFilter, PorterStemFilter), because they take TokenStreams as input and return them as output, and so we can chain them together nicely. However, this means you can't have a filter before the LowerCaseTokenizer that returns a TokenStream. In my case, I wanted to split camelCase words into parts, and this has to be done before converting to lower case. My solution was to perform the splitting manually in the custom Indexer class, so by the time MyAnalyzer sees the text, it has already been split.
(I have also added a boolean flag to my custom Indexer class, so now both can work based solely on flags.)
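As an aside, that manual pre-splitting can be as small as a regex pass over the text before it reaches the analyzer (a sketch; splitCamelCase is a hypothetical helper, not the Indexer code referred to above):
public final class CamelCaseSplitter {
    // Hypothetical helper: "getFooBar" -> "get Foo Bar", applied before the text reaches MyAnalyzer.
    public static String splitCamelCase(String s) {
        return s.replaceAll("(?<=[a-z0-9])(?=[A-Z])", " ");
    }
}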
Is there a better answer?

What analyzer should I use for a URL in lucene.net?

I'm having problems getting a simple URL to tokenize properly so that you can search it as expected.
I'm indexing "http://news.bbc.co.uk/sport1/hi/football/internationals/8196322.stm" with the StandardAnalyzer and it is tokenizing the string as the following (debug output):
(http,0,4,type=<ALPHANUM>)
(news.bbc.co.uk,7,21,type=<HOST>)
(sport1/hi,22,31,type=<NUM>)
(football,32,40,type=<ALPHANUM>)
(internationals/8196322.stm,41,67,type=<NUM>)
In general it looks good: http itself, then the hostname, but the issue seems to come with the forward slashes. Surely it should treat them as separate words?
What do I need to do to correct this?
Thanks
P.S. I'm using Lucene.NET but I really don't think it makes much of a difference with regards to the answers.
The StandardAnalyzer, which uses the StandardTokenizer, doesn't tokenize URLs (although it recognises emails and treats them as one token). What you are seeing is its default behaviour: splitting on various punctuation characters. The simplest solution might be to write a custom Analyzer and supply a UrlTokenizer that extends/modifies the code in StandardTokenizer to tokenize URLs. Something like:
public class MyAnalyzer extends Analyzer {
    public MyAnalyzer() {
        super();
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new MyUrlTokenizer(reader);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result);
        result = new SynonymFilter(result);
        return result;
    }
}
Where the URLTokenizer splits on /, - _ and whatever else you want. Nutch may also have some relevant code, but I don't know if there's a .NET version.
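A sketch of such a tokenizer, built on the Lucene 2.x-era CharTokenizer base class that matches the API used above (it simply treats '/', '.', '-', '_', ':' and whitespace as token separators; adjust to taste):
import java.io.Reader;
import org.apache.lucene.analysis.CharTokenizer;

// Splits on '/', '.', '-', '_', ':' and whitespace; everything else is part of a token.
public class MyUrlTokenizer extends CharTokenizer {
    public MyUrlTokenizer(Reader input) {
        super(input);
    }

    @Override
    protected boolean isTokenChar(char c) {
        return !Character.isWhitespace(c)
                && c != '/' && c != '.' && c != '-' && c != '_' && c != ':';
    }
}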
Note that if you have a distinct fieldName for URLs, then you can modify the above code to use the UrlTokenizer for that field and the StandardTokenizer by default.
e.g.
public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = null;
    if (fieldName.equals("url")) {
        result = new MyUrlTokenizer(reader);
    } else {
        result = new StandardTokenizer(reader);
    }
    return result;
}
You should parse the URL yourself (I imagine there's at least one .Net class that can parse a URL string and tease out the different elements), then add those elements (such as the host, or whatever else you're interested in filtering on) as Keywords; don't Analyze them at all.