Lucene's StandardAnalyzer removes dots from acronyms and similar strings when indexing.
I want Lucene to retain the dots, and hence I'm using the WhitespaceAnalyzer class.
I can give my list of stop words to StandardAnalyzer... but how do I give it to WhitespaceAnalyzer?
Thanks for reading.
Create your own analyzer by extending WhitespaceAnalyzer and override the tokenStream method as follows.
public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = super.tokenStream(fieldName, reader);
    result = new StopFilter(result, stopSet);
    return result;
}
Here stopSet is the Set of stop words, which you can build by adding a constructor to your analyzer that accepts a list of stop words (see the sketch below).
You may also wish to override reusableTokenStream() method in similar fashion if you plan to reuse the TokenStream.
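Putting the pieces together, a minimal sketch could look like this. It uses the older (pre-4.0) API that the snippet above uses; the class name StopWordWhitespaceAnalyzer is just illustrative, it composes a WhitespaceTokenizer with a StopFilter (equivalent to subclassing WhitespaceAnalyzer, which may be final in some Lucene versions), and the exact StopFilter/makeStopSet signatures vary a little between versions.

import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class StopWordWhitespaceAnalyzer extends Analyzer {
    private final Set<?> stopSet;

    public StopWordWhitespaceAnalyzer(String[] stopWords) {
        // Build the stop-word set from the caller's list.
        this.stopSet = StopFilter.makeStopSet(stopWords);
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Whitespace tokenization keeps dots inside tokens; stop words are then removed.
        TokenStream result = new WhitespaceTokenizer(reader);
        result = new StopFilter(result, stopSet);
        return result;
    }
}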
I have the following situation:
I have a collection of documents to index. But I need to be selective in what I index.
Selection criteria: the document must contain one of the keywords from a given Set.
That part is easy, I can check if any of those keywords are present in the document and only then index the document.
The tricky part (for me, anyway!) is that I want to index only these keywords. And these keywords can be multi-worded, or even regex expressions, say.
What these keywords are going to be is immaterial to this post, because I can abstract that out: I can generate the list of keywords that need to be indexed.
Is there an existing TokenStream, Analyzer, Filter combination that I can use?
And if there isn't, please could someone point me in the right direction.
If my question isn't clear enough:
HashSet<String> impKeywords = new HashSet<String>(Arrays.asList("Java", "Lucene"));
I have a class Content which I use, say:
Content content = new Content("I am only interested in Java, Lucene, Nutch, Luke, CommonLisp.");
And, say I have a method to get matching keywords:
HashSet<String> matchingKeywords = content.getMatchingKeywords(impKeywords); // returns a set with "Java" and "Lucene"
And if there are matchingKeywords, only then proceed to index the document; so:
if(!matchingKeywords.isEmpty()) {
// prepare document for indexing, and index.
// But what should be my Analyzer and TokenStream?
}
I want to be able to create an Analyzer with a TokenStream that only returns these matching keywords, so only these tokens are indexed.
End notes: One possibility appears to be that for each document I add a variable number of fields, one for each matching keyword, where these fields are indexed but not analyzed using Field.Index.NOT_ANALYZED. However, it would be better if I could find a pre-existing Analyzer/TokenStream for this purpose instead of playing around with fields.
Following @femtoRgon's advice, I have resolved the said problem as follows.
As explained in the question, I have:
HashSet<String> impKeywords = new HashSet<String>(Arrays.asList("Java", "Lucene"));
And I have a class Content which I use, say as follows:
Content content = new Content("I am only interested in Java, Lucene, Nutch, Luke, CommonLisp.");
And, I have a method to get matching keywords:
HashSet<String> matchingKeywords = content.getMatchingKeywords(impKeywords); // returns a set with "Java" and "Lucene" for this example `content`.
And if there are matchingKeywords, only then proceed to index the document; so while indexing I did:
if(!matchingKeywords.isEmpty()) {
    Document doc = new Document();
    for(String keyword: matchingKeywords) {
        doc.add(new Field("keyword", keyword, Field.Store.YES, Field.Index.NOT_ANALYZED));
    }
    iwriter.addDocument(doc); // iwriter is the instance of IndexWriter
}
Then, while searching I created the following boolean query:
BooleanQuery boolQuery = new BooleanQuery();
for(String queryKeyword: searchKeywords) {
    boolQuery.add(new TermQuery(new Term("keyword", queryKeyword)), BooleanClause.Occur.SHOULD);
}
ScoreDoc[] hits = isearcher.search(boolQuery, null, 1000).scoreDocs; // isearcher is the instance of IndexSearcher
Hope this answer helps someone with similar needs.
I'm working on indexing tweets that are in English using Lucene 4.3; however, I'm not sure which Analyzer to use. What's the difference between Lucene's StandardAnalyzer and EnglishAnalyzer?
Also, I tried to test the StandardAnalyzer with this text: "XY&Z Corporation - xyz@example.com". The output is: [xy] [z] [corporation] [xyz] [example.com]; however, I thought the output would be: [XY&Z] [Corporation] [xyz@example.com]
Am I doing something wrong?
Take a look at the source. Generally, analyzers are pretty readable. You just need to look at the createComponents method to see the Tokenizer and Filters it uses:
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final Tokenizer source = new StandardTokenizer(matchVersion, reader);
    TokenStream result = new StandardFilter(matchVersion, source);
    // prior to this we get the classic behavior, standardfilter does it for us.
    if (matchVersion.onOrAfter(Version.LUCENE_31))
        result = new EnglishPossessiveFilter(matchVersion, result);
    result = new LowerCaseFilter(matchVersion, result);
    result = new StopFilter(matchVersion, result, stopwords);
    if (!stemExclusionSet.isEmpty())
        result = new KeywordMarkerFilter(result, stemExclusionSet);
    result = new PorterStemFilter(result);
    return new TokenStreamComponents(source, result);
}
StandardAnalyzer, by contrast, is just a StandardTokenizer, StandardFilter, LowerCaseFilter, and StopFilter, whereas EnglishAnalyzer also rolls in an EnglishPossessiveFilter, KeywordMarkerFilter, and PorterStemFilter.
Mainly, the EnglishAnalyzer rolls in some English stemming enhancements, which should work well for plain English text.
For StandardAnalyzer, the only assumption I'm aware of that ties it directly to English analysis is the default stopword set, which is, of course, just a default and can be changed. StandardAnalyzer now implements Unicode Standard Annex #29, which attempts to provide non-language-specific text segmentation.
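If it helps to see the difference concretely, here is a small sketch (Lucene 4.3 API; the field name "f" and the class name CompareAnalyzers are arbitrary) that prints the tokens each analyzer produces for the text from the question:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class CompareAnalyzers {

    // Prints each token produced by the analyzer in [brackets].
    static void printTokens(Analyzer analyzer, String text) throws IOException {
        TokenStream ts = analyzer.tokenStream("f", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.print("[" + term + "] ");
        }
        ts.end();
        ts.close();
        System.out.println();
    }

    public static void main(String[] args) throws IOException {
        String text = "XY&Z Corporation - xyz@example.com";
        printTokens(new StandardAnalyzer(Version.LUCENE_43), text); // [xy] [z] [corporation] [xyz] [example.com]
        printTokens(new EnglishAnalyzer(Version.LUCENE_43), text);  // same split, with stemming applied on top
    }
}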
I'm trying to figure out what I should do to index keywords that contain a dot (".").
Example: this.name
I want to index the terms this and name in my index.
I use the StandardAnalyzer. I tried extending WhitespaceTokenizer or TokenFilter, but I'm not sure if I'm going in the right direction.
If I use the StandardAnalyzer, I'll obtain "this.name" as a keyword, and that's not what I want, but the analyzer does the rest correctly for me.
You can put a CharFilter in front of StandardTokenizer that converts periods and underscores to spaces. MappingCharFilter will work.
Here's MappingCharFilter added to a stripped-down StandardAnalyzer (see the original 4.1 version here):
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;

import java.io.IOException;
import java.io.Reader;

public final class MyAnalyzer extends StopwordAnalyzerBase {
    private int maxTokenLength = 255;

    public MyAnalyzer() {
        super(Version.LUCENE_41, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    }

    @Override
    protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
        final StandardTokenizer src = new StandardTokenizer(matchVersion, reader);
        src.setMaxTokenLength(maxTokenLength);
        TokenStream tok = new StandardFilter(matchVersion, src);
        tok = new LowerCaseFilter(matchVersion, tok);
        tok = new StopFilter(matchVersion, tok, stopwords);
        return new TokenStreamComponents(src, tok) {
            @Override
            protected void setReader(final Reader reader) throws IOException {
                src.setMaxTokenLength(MyAnalyzer.this.maxTokenLength);
                super.setReader(reader);
            }
        };
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add(".", " ");
        builder.add("_", " ");
        NormalizeCharMap normMap = builder.build();
        return new MappingCharFilter(normMap, reader);
    }
}
Here's a quick test to demonstrate it works:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;

public class TestMyAnalyzer extends BaseTokenStreamTestCase {
    private Analyzer analyzer = new MyAnalyzer();

    public void testPeriods() throws Exception {
        BaseTokenStreamTestCase.assertAnalyzesTo(
            analyzer,
            "this.name; here.i.am; sentences ... end with periods.",
            new String[] { "name", "here", "i", "am", "sentences", "end", "periods" });
    }

    public void testUnderscores() throws Exception {
        BaseTokenStreamTestCase.assertAnalyzesTo(
            analyzer,
            "some_underscore_term _and____ stuff that is_not in it",
            new String[] { "some", "underscore", "term", "stuff" });
    }
}
If I understand you correctly, you need to use a tokenizer that removes dots -- that is, any name that contains a dot should be split at that point ("here.i.am" becomes "here" + "i" + "am").
You are getting caught by the behavior documented here:
However, a dot that's not followed by whitespace is considered part of a token.
StandardTokenizer introduces some more complex parsing rules than you may be looking for. This one, in particular, is intended to prevent tokenization of URLs, IPs, identifiers, etc. A simpler implementation, like LetterTokenizer, might suit your needs.
If that doesn't really suit your needs (and it might well turn out to be throwing the baby out with the bathwater), then you may need to modify StandardTokenizer yourself, which is explicitly encouraged by the Lucene docs:
Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.
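Going back to the LetterTokenizer suggestion above, here is a minimal sketch of that route (Lucene 4.x createComponents style, matching the MyAnalyzer example earlier; the class name LetterAnalyzer is just illustrative). LetterTokenizer splits on any non-letter character, so "this.name" becomes "this" and "name", but note it also drops digits.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.util.Version;

public final class LetterAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // LetterTokenizer keeps only runs of letters, dropping '.', '_', digits, etc.
        Tokenizer source = new LetterTokenizer(Version.LUCENE_41, reader);
        TokenStream result = new LowerCaseFilter(Version.LUCENE_41, source);
        return new TokenStreamComponents(source, result);
    }
}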
Sebastien Dionne: I didn't understand how to split a word. Do I have to parse the document char by char?
Sebastien Dionne: I still want to know how to split a token into multiple parts, and index them all.
You may have to write a custom analyzer.
Analyzer is a combination of Tokenizer and possibly a chain of TokenFilter instances.
Tokenizer: Takes in the input text you pass, typically as a java.io.Reader. It JUST breaks down the text; it doesn't alter it.
TokenFilter: Takes in the tokens emitted by the Tokenizer, adds / removes / alters tokens, and emits them one by one until all are finished.
If it needs to replace a token with multiple tokens, it buffers them all and emits them one by one to the indexer, as in the sketch below.
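To make that concrete, here is a rough sketch of such a filter using the Lucene 4.x attribute API; the class name DotSplitFilter is made up, and details like offset handling are glossed over. It buffers the dot-separated parts of each token and emits them one at a time.

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class DotSplitFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
    private final Deque<String> pending = new ArrayDeque<String>();

    public DotSplitFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!pending.isEmpty()) {
            // Emit a buffered part of a previously split token.
            termAtt.setEmpty().append(pending.poll());
            posIncAtt.setPositionIncrement(1);
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.toString();
        if (term.indexOf('.') < 0) {
            return true; // nothing to split, pass the token through
        }
        for (String part : term.split("\\.")) {
            if (!part.isEmpty()) {
                pending.add(part);
            }
        }
        if (pending.isEmpty()) {
            return true; // token was only dots; emit as-is
        }
        termAtt.setEmpty().append(pending.poll()); // emit the first part now
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending.clear();
    }
}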
You may check the following resource; unfortunately, you may have to sign up for a trial membership.
By writing a custom analyzer, you can break down the text the way you want. You may even reuse existing components like LowerCaseFilter. Fortunately, with Lucene it is achievable to come up with an Analyzer that serves your purpose if you can't find one built in or on the web.
"Writing Custom Filters", Lucene in Action, 2nd edition.
I am trying to build Lucene autocomplete using Lucene's Dictionary and spellcheck classes, but so far I have only been successful in making it work for single terms.
I googled and found out that we need to make use of the ShingleMatrixFilter to get the work done. Can someone experienced with Lucene show me a way to do it?
All I need is for it to generate words for autocomplete with phrases. For example, if I have a doc like "This is a long line with very long rant with too many words in it", then I should be able to generate words like "long line", "long rant", "many words", etc.
Possible?
Thanks.
writer = new IndexWriter(dir,
        new ShingleAnalyzerWrapper(new StandardAnalyzer(
                Version.LUCENE_CURRENT,
                Collections.emptySet()), 3),
        false,
        IndexWriter.MaxFieldLength.UNLIMITED);
This did the job for me...
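To turn the shingled index into suggestions, the lookup side could look roughly like this. This is a sketch against the 3.x-era API used in the snippet above; the field name "content", the spellDir Directory, and the exact indexDictionary signature are assumptions that depend on your setup and Lucene version. LuceneDictionary and SpellChecker live in the spellchecker module.

// Because the "content" field was indexed with shingles, the dictionary terms
// (and therefore the suggestions) can be multi-word phrases.
IndexReader reader = IndexReader.open(dir);              // dir: the index built above
SpellChecker spellChecker = new SpellChecker(spellDir);  // spellDir: a separate Directory for the suggester
spellChecker.indexDictionary(new LuceneDictionary(reader, "content")); // signature differs in later versions
String[] suggestions = spellChecker.suggestSimilar("long li", 10);     // e.g. "long line"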
You can write your own Analyzer by inheriting from the Lucene.Net.Analysis.Analyzer class and implementing the TokenStream function. There you can use the ShingleFilter to get multi-word tokens from the token stream:
public override Lucene.Net.Analysis.TokenStream TokenStream(String fieldName, System.IO.TextReader reader)
{
    Lucene.Net.Analysis.TokenStream tokenStream = new
        Lucene.Net.Analysis.Standard.StandardTokenizer(Lucene.Net.Util.Version.LUCENE_30, reader);
    tokenStream = new Lucene.Net.Analysis.Shingle.ShingleFilter(tokenStream, maxShingleSize);
    return tokenStream;
}
maxShingleSize identifies the maximum length of a multi-word unit.
I'm having problems getting a simple URL to tokenize properly so that you can search it as expected.
I'm indexing "http://news.bbc.co.uk/sport1/hi/football/internationals/8196322.stm" with the StandardAnalyzer and it is tokenizing the string as the following (debug output):
(http,0,4,type=<ALPHANUM>)
(news.bbc.co.uk,7,21,type=<HOST>)
(sport1/hi,22,31,type=<NUM>)
(football,32,40,type=<ALPHANUM>)
(internationals/8196322.stm,41,67,type=<NUM>)
In general it looks good: http itself, then the hostname, but the issue seems to come with the forward slashes. Surely it should consider them as separate words?
What do I need to do to correct this?
Thanks
P.S. I'm using Lucene.NET but I really don't think it makes much of a difference with regards to the answers.
The StandardAnalyzer, which uses the StandardTokenizer, doesn't tokenize URLs (although it recognises emails and treats them as one token). What you are seeing is its default behaviour: splitting on various punctuation characters. The simplest solution might be to write a custom Analyzer and supply a UrlTokenizer that extends/modifies the code in StandardTokenizer to tokenize URLs. Something like:
public class MyAnalyzer extends Analyzer {

    public MyAnalyzer() {
        super();
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new MyUrlTokenizer(reader);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result);
        result = new SynonymFilter(result);
        return result;
    }
}
Where the UrlTokenizer splits on /, -, _, and whatever else you want. Nutch may also have some relevant code, but I don't know if there's a .NET version.
Note that if you have a distinct fieldName for URLs, then you can modify the above code to use the StandardTokenizer by default, and the UrlTokenizer otherwise.
e.g.
public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = null;
    if (fieldName.equals("url")) {
        result = new MyUrlTokenizer(reader);
    } else {
        result = new StandardTokenizer(reader);
    }
    return result;
}
You should parse the URL yourself (I imagine there's at least one .Net class that can parse a URL string and tease out the different elements), then add those elements (such as the host, or whatever else you're interested in filtering on) as Keywords; don't Analyze them at all.
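As a rough illustration of that approach (Java shown, consistent with the other snippets on this page; the Lucene.NET calls are analogous, and the field names "host" and "pathSegment" are made up):

import java.net.URI;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Parse the URL yourself and index the interesting pieces as un-analyzed keyword fields.
URI uri = URI.create("http://news.bbc.co.uk/sport1/hi/football/internationals/8196322.stm");
Document doc = new Document();
doc.add(new Field("host", uri.getHost(), Field.Store.YES, Field.Index.NOT_ANALYZED));
for (String segment : uri.getPath().split("/")) {
    if (!segment.isEmpty()) {
        doc.add(new Field("pathSegment", segment, Field.Store.YES, Field.Index.NOT_ANALYZED));
    }
}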