What analyzer should I use for a URL in lucene.net?

I'm having problems getting a simple URL to tokenize properly so that you can search it as expected.
I'm indexing "http://news.bbc.co.uk/sport1/hi/football/internationals/8196322.stm" with the StandardAnalyzer and it is tokenizing the string as the following (debug output):
(http,0,4,type=<ALPHANUM>)
(news.bbc.co.uk,7,21,type=<HOST>)
(sport1/hi,22,31,type=<NUM>)
(football,32,40,type=<ALPHANUM>)
(internationals/8196322.stm,41,67,type=<NUM>)
In general it looks good: http itself, then the hostname, but the issue seems to come with the forward slashes. Surely it should treat them as separate words?
What do I need to do to correct this?
Thanks
P.S. I'm using Lucene.NET but I really don't think it makes much of a difference with regards to the answers.

The StandardAnalyzer, which uses the StandardTokenizer, doesn't tokenize URLs (although it recognizes emails and treats them as one token). What you are seeing is its default behaviour of splitting on various punctuation characters. The simplest solution might be to write a custom Analyzer and supply a UrlTokenizer (extending/modifying the code in StandardTokenizer) to tokenize URLs. Something like:
public class MyAnalyzer extends Analyzer {

    public MyAnalyzer() {
        super();
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new MyUrlTokenizer(reader);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result);
        result = new SynonymFilter(result);
        return result;
    }
}
Where the UrlTokenizer splits on /, -, _ and whatever else you want. Nutch may also have some relevant code, but I don't know if there's a .NET version.
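For illustration, here is a minimal sketch of what such a UrlTokenizer could look like, assuming the older Lucene 2.x/3.0 CharTokenizer API (where isTokenChar takes a char); the class name and the exact set of break characters are just examples:

import java.io.Reader;

import org.apache.lucene.analysis.CharTokenizer;

// Hypothetical tokenizer: every character in the "break" set ends a token,
// so "sport1/hi" becomes "sport1" and "hi".
public class MyUrlTokenizer extends CharTokenizer {

    public MyUrlTokenizer(Reader reader) {
        super(reader);
    }

    @Override
    protected boolean isTokenChar(char c) {
        // Treat common URL punctuation as separators; keep everything else.
        return !(c == '/' || c == '.' || c == ':' || c == '?'
                 || c == '&' || c == '=' || c == '-' || c == '_');
    }
}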
Note that if you have a distinct fieldName for URLs, you can modify the above code to use the UrlTokenizer for that field and the StandardTokenizer by default.
e.g.
public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = null;
    if (fieldName.equals("url")) {
        result = new MyUrlTokenizer(reader);
    } else {
        result = new StandardTokenizer(reader);
    }
    return result;
}

You should parse the URL yourself (I imagine there's at least one .Net class that can parse a URL string and tease out the different elements), then add those elements (such as the host, or whatever else you're interested in filtering on) as Keywords; don't Analyze them at all.
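As a rough sketch of that idea (in Java, to match the other answers here; the field names and the helper class are purely illustrative, and the older Field API's NOT_ANALYZED corresponds to the keyword-style indexing described above):

import java.net.URI;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Hypothetical helper: parse the URL yourself and index the pieces
// as un-analyzed keyword fields.
public class UrlDocumentBuilder {

    public static Document build(String url) throws Exception {
        URI uri = new URI(url);
        Document doc = new Document();
        // Keep the full URL, plus the parts you want to filter on.
        doc.add(new Field("url", url, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("host", uri.getHost(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("path", uri.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }
}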

Related

How to implement just some basic keyword highlighting in a text editor?

I'm a novice programmer trying to learn plug-in development. I'd like to upgrade the sample XML editor so that words like "cat", "dog", "hamster", "rabbit" and "bird" are highlighted when they appear in an XML file (it's just for learning purposes). Can anyone give me some implementation tips or suggestions? I am clueless. (But I am carrying out my research on this as well, I'm not being lazy. You have my word.) Thanks in advance.
You can detect words in the plain text part of the XML by modifying the sample XML editor as follows.
We can use the provided WordRule class to detect the words. The XMLScanner class, which scans the plain text, needs to be updated to include the word rule:
public XMLScanner(final ColorManager manager)
{
    IToken procInstr = new Token(new TextAttribute(manager.getColor(IXMLColorConstants.PROC_INSTR)));

    WordRule words = new WordRule(new WordDetector());
    words.addWord("cat", procInstr);
    words.addWord("dog", procInstr);
    // TODO add more words here

    IRule[] rules = new IRule[] {
        // Add rule for processing instructions
        new SingleLineRule("<?", "?>", procInstr),
        // Add generic whitespace rule.
        new WhitespaceRule(new XMLWhitespaceDetector()),
        // Word rule
        words
    };

    setRules(rules);
}
I have used the existing processing instruction token here to reduce the amount of new code, but you should define a new color and use a new token.
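For instance, a dedicated token could be set up inside the constructor shown above like this (the RGB value and the animalColor/animalToken names are made up for illustration; ColorManager.getColor(RGB) is the same lookup already used for the processing-instruction token):

// Hypothetical: use a dedicated color/token instead of reusing procInstr.
RGB animalColor = new RGB(128, 0, 128);
IToken animalToken = new Token(new TextAttribute(manager.getColor(animalColor)));
words.addWord("cat", animalToken);
words.addWord("dog", animalToken);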
The WordRule constructor requires an IWordDetector; we can use a very simple detector here:
class WordDetector implements IWordDetector
{
    @Override
    public boolean isWordStart(final char c)
    {
        return Character.isLetter(c);
    }

    @Override
    public boolean isWordPart(final char c)
    {
        return Character.isLetter(c);
    }
}
This is just accepting letters in words.

What's the difference between Lucene StandardAnalyzer and EnglishAnalyzer?

I'm working on indexing English-language tweets using Lucene 4.3; however, I'm not sure which Analyzer to use. What's the difference between Lucene's StandardAnalyzer and EnglishAnalyzer?
Also I tried to test the StandardAnalyzer with this text: "XY&Z Corporation - xyz@example.com". The output is: [xy] [z] [corporation] [xyz] [example.com]; however, I thought the output would be: [XY&Z] [Corporation] [xyz@example.com].
Am I doing something wrong?
Take a look at the source. Generally, analyzers are pretty readable. You just need to look at the createComponents method to see the Tokenizer and Filters it uses:
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final Tokenizer source = new StandardTokenizer(matchVersion, reader);
    TokenStream result = new StandardFilter(matchVersion, source);
    // prior to this we get the classic behavior, standardfilter does it for us.
    if (matchVersion.onOrAfter(Version.LUCENE_31))
        result = new EnglishPossessiveFilter(matchVersion, result);
    result = new LowerCaseFilter(matchVersion, result);
    result = new StopFilter(matchVersion, result, stopwords);
    if (!stemExclusionSet.isEmpty())
        result = new KeywordMarkerFilter(result, stemExclusionSet);
    result = new PorterStemFilter(result);
    return new TokenStreamComponents(source, result);
}
That snippet is EnglishAnalyzer's createComponents. StandardAnalyzer, by contrast, is just a StandardTokenizer, StandardFilter, LowerCaseFilter, and StopFilter; EnglishAnalyzer adds an EnglishPossessiveFilter, KeywordMarkerFilter, and PorterStemFilter on top of that.
Mainly, the EnglishAnalyzer rolls in some English stemming enhancements, which should work well for plain English text.
For StandardAnalyzer, the only thing I'm aware of that ties it directly to English analysis is the default stopword set, which is, of course, just a default and can be changed. StandardAnalyzer now implements Unicode Standard Annex #29, which attempts to provide non-language-specific text segmentation.
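If you want to see the difference for yourself, here is a quick sketch that prints the tokens each analyzer emits (assuming Lucene 4.3, as in the question; the class and field names are just examples):

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Prints the tokens each analyzer produces so the differences are easy to see.
public class CompareAnalyzers {

    public static void main(String[] args) throws Exception {
        String text = "XY&Z Corporation - xyz@example.com";
        print(new StandardAnalyzer(Version.LUCENE_43), text);
        print(new EnglishAnalyzer(Version.LUCENE_43), text);
    }

    static void print(Analyzer analyzer, String text) throws Exception {
        TokenStream ts = analyzer.tokenStream("field", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.print("[" + term.toString() + "] ");
        }
        ts.end();
        ts.close();
        System.out.println();
    }
}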

Lucene 4.1: How to split words that contain "dots" when indexing?

I'm trying to figure out what I should do to index my keywords that contain ".".
ex: this.name
I want to index the terms this and name in my index.
I use the StandardAnalyzer. I tried extending WhitespaceTokenizer or TokenFilter, but I'm not sure if I'm going in the right direction.
If I use the StandardAnalyzer, I'll obtain "this.name" as a keyword, and that's not what I want, but the analyzer does the rest correctly for me.
You can put a CharFilter in front of StandardTokenizer that converts periods and underscores to spaces. MappingCharFilter will work.
Here's MappingCharFilter added to a stripped-down StandardAnalyzer (see the original 4.1 version here):
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;

import java.io.IOException;
import java.io.Reader;

public final class MyAnalyzer extends StopwordAnalyzerBase {

    private int maxTokenLength = 255;

    public MyAnalyzer() {
        super(Version.LUCENE_41, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    }

    @Override
    protected TokenStreamComponents createComponents
            (final String fieldName, final Reader reader) {
        final StandardTokenizer src = new StandardTokenizer(matchVersion, reader);
        src.setMaxTokenLength(maxTokenLength);
        TokenStream tok = new StandardFilter(matchVersion, src);
        tok = new LowerCaseFilter(matchVersion, tok);
        tok = new StopFilter(matchVersion, tok, stopwords);
        return new TokenStreamComponents(src, tok) {
            @Override
            protected void setReader(final Reader reader) throws IOException {
                src.setMaxTokenLength(MyAnalyzer.this.maxTokenLength);
                super.setReader(reader);
            }
        };
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Map '.' and '_' to spaces before tokenization so they act as word breaks.
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add(".", " ");
        builder.add("_", " ");
        NormalizeCharMap normMap = builder.build();
        return new MappingCharFilter(normMap, reader);
    }
}
Here's a quick test to demonstrate it works:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;

public class TestMyAnalyzer extends BaseTokenStreamTestCase {

    private Analyzer analyzer = new MyAnalyzer();

    public void testPeriods() throws Exception {
        BaseTokenStreamTestCase.assertAnalyzesTo
            (analyzer,
             "this.name; here.i.am; sentences ... end with periods.",
             new String[] { "name", "here", "i", "am", "sentences", "end", "periods" });
    }

    public void testUnderscores() throws Exception {
        BaseTokenStreamTestCase.assertAnalyzesTo
            (analyzer,
             "some_underscore_term _and____ stuff that is_not in it",
             new String[] { "some", "underscore", "term", "stuff" });
    }
}
If I understand you correctly, you need to use a tokenizer that removes dots -- that is, any name that contains a dot should be split at that point ("here.i.am" becomes "here" + "i" + "am").
You are getting caught by the behavior documented here:
However, a dot that's not followed by whitespace is considered part of a token.
StandardTokenizer introduces more complex parsing rules than you may be looking for. This one, in particular, is intended to prevent tokenization of URLs, IPs, identifiers, etc. A simpler implementation, such as LetterTokenizer, might suit your needs.
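A minimal sketch of that simpler route, assuming the Lucene 4.1 API used in the question (the class name LetterOnlyAnalyzer is made up):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.util.Version;

// Hypothetical analyzer: every non-letter character (including '.') is a
// token boundary, so "this.name" becomes "this" and "name".
public final class LetterOnlyAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        LetterTokenizer source = new LetterTokenizer(Version.LUCENE_41, reader);
        return new TokenStreamComponents(source, new LowerCaseFilter(Version.LUCENE_41, source));
    }
}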
If that doesn't really suit your needs (and it might well turn out to be throwing the baby out with the bathwater), then you may need to modify StandardTokenizer yourself, which is explicitly encouraged by the Lucene docs:
Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.
Sebastien Dionne: I didn't understand how to split a word, do I have to parse the document char by char ?
Sebastien Dionne: I still want to know how to split a token into multiple part, and index them all
You may have to write a custom analyzer.
Analyzer is a combination of Tokenizer and possibly a chain of TokenFilter instances.
Tokenizer: takes in the input text you pass, probably as a java.io.Reader, and JUST breaks it down. It doesn't alter the text, it only breaks it into tokens.
TokenFilter: takes in the tokens emitted by the Tokenizer, adds/removes/alters them, and emits them one by one until all are finished.
If it replaces a token with multiple tokens based on requirements, it buffers them all and emits them one by one to the indexer.
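To make that buffering idea concrete, here is a rough sketch of a hypothetical TokenFilter that splits each incoming token on a delimiter and emits the parts one by one (assuming the Lucene 4.x attribute API; DelimiterSplitFilter is not a real Lucene class):

import java.io.IOException;
import java.util.LinkedList;
import java.util.Queue;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public final class DelimiterSplitFilter extends TokenFilter {

    private final Pattern delimiter;
    private final Queue<String> pending = new LinkedList<String>();
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
    private AttributeSource.State savedState;

    public DelimiterSplitFilter(TokenStream input, char delimiterChar) {
        super(input);
        this.delimiter = Pattern.compile(Pattern.quote(String.valueOf(delimiterChar)));
    }

    @Override
    public boolean incrementToken() throws IOException {
        // First drain any parts buffered from a previously split token.
        if (!pending.isEmpty()) {
            restoreState(savedState);                 // reuse offsets/type of the original token
            termAtt.setEmpty().append(pending.poll());
            posIncAtt.setPositionIncrement(1);        // each part gets its own position
            return true;
        }
        if (!input.incrementToken()) {
            return false;                             // upstream is exhausted
        }
        String term = termAtt.toString();
        String[] parts = delimiter.split(term);
        if (parts.length <= 1) {
            return true;                              // nothing to split, pass through unchanged
        }
        savedState = captureState();                  // remember the original token's attributes
        boolean emittedFirst = false;
        for (String part : parts) {
            if (part.isEmpty()) {
                continue;                             // e.g. leading/trailing delimiters
            }
            if (!emittedFirst) {
                termAtt.setEmpty().append(part);      // current token becomes the first part
                emittedFirst = true;
            } else {
                pending.add(part);                    // later parts come out on subsequent calls
            }
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending.clear();
        savedState = null;
    }
}

In a custom analyzer you would then append it to the filter chain, e.g. tok = new DelimiterSplitFilter(tok, '.');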
You may check the following resource; unfortunately, you may have to sign up for a trial membership.
By writing a custom analyzer, you can break down the text the way you want to. You may even reuse some existing components like LowerCaseFilter. With Lucene it is quite feasible to come up with an Analyzer that serves your purpose if you can't find one built in or on the web.
"Writing Custom Filters", Lucene in Action, Second Edition

Using stop words with WhitespaceAnalyzer

Lucene's StandardAnalyzer removes dots from string/acronyms when indexing it.
I want Lucene to retain dots and hence I'm using WhitespaceAnalyzer class.
I can give my list of stop words to StandardAnalyzer... but how do I give it to WhitespaceAnalyzer?
Thanks for reading.
Create your own analyzer by extending WhitespaceAnalyzer and overriding the tokenStream method as follows.
public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = super.tokenStream(fieldName, reader);
    result = new StopFilter(result, stopSet);
    return result;
}
Here stopSet is the Set of stop words, which you could supply by adding a constructor to your analyzer that accepts a list of stop words.
You may also wish to override the reusableTokenStream() method in a similar fashion if you plan to reuse the TokenStream.
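Putting those pieces together, a sketch of such an analyzer might look like this (assuming a pre-4.0 Lucene where WhitespaceAnalyzer is not final and tokenStream can be overridden; the class name is made up):

import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;

// Hypothetical analyzer: whitespace tokenization plus a caller-supplied stop word set.
public class StopWhitespaceAnalyzer extends WhitespaceAnalyzer {

    private final Set<?> stopSet;

    public StopWhitespaceAnalyzer(Set<?> stopWords) {
        this.stopSet = stopWords;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = super.tokenStream(fieldName, reader);
        // enablePositionIncrements = true keeps gaps where stop words were removed.
        return new StopFilter(true, result, stopSet);
    }
}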

Does the "Cartridge" pattern exist?

AndroMDA uses the term "cartridge" (e.g. for out-of-the-box NHibernate support).
As I understood it, it takes an API/component, wraps it, never adds new features, simplifies it, often taking away "the full power", but works well for most cases.
My questions:
Is the term widely used?
Can one properly define it?
Should the suffix "Cartridge" be used in class/method names?
An example: is the following Base64 helper a cartridge for Base64 conversion?
You give away all the performance-tuning power, but if you simply want to decode a simple (and small) string, it works fine:
Usage:
Base64StringCartridge.Decode(input);
Implementation
public static string Decode(string data)
{
    try
    {
        System.Text.UTF8Encoding encoder = new System.Text.UTF8Encoding();
        System.Text.Decoder utf8Decode = encoder.GetDecoder();

        byte[] todecode_byte = System.Convert.FromBase64String(data);
        int charCount = utf8Decode.GetCharCount(todecode_byte, 0, todecode_byte.Length);
        char[] decoded_char = new char[charCount];
        utf8Decode.GetChars(todecode_byte, 0, todecode_byte.Length, decoded_char, 0);
        string result = new String(decoded_char);
        return result;
    }
    catch
    {
        return "";
    }
}
It's called the Facade Pattern. Presumably the AndroMDA folks are big fans of old video game machines...