I'm working on indexing tweets that are in English using Lucene 4.3, but I'm not sure which Analyzer to use. What's the difference between Lucene's StandardAnalyzer and EnglishAnalyzer?
Also, I tried to test the StandardAnalyzer with this text: "XY&Z Corporation - xyz@example.com". The output is: [xy] [z] [corporation] [xyz] [example.com], but I thought the output would be: [XY&Z] [Corporation] [xyz@example.com]
Am I doing something wrong?
Take a look at the source. Generally, analyzers are pretty readable. You just need to look at the createComponents method to see the Tokenizer and Filters it uses:
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final Tokenizer source = new StandardTokenizer(matchVersion, reader);
    TokenStream result = new StandardFilter(matchVersion, source);
    // prior to this we get the classic behavior, StandardFilter does it for us.
    if (matchVersion.onOrAfter(Version.LUCENE_31))
        result = new EnglishPossessiveFilter(matchVersion, result);
    result = new LowerCaseFilter(matchVersion, result);
    result = new StopFilter(matchVersion, result, stopwords);
    if (!stemExclusionSet.isEmpty())
        result = new KeywordMarkerFilter(result, stemExclusionSet);
    result = new PorterStemFilter(result);
    return new TokenStreamComponents(source, result);
}
StandardAnalyzer, by contrast, is just a StandardTokenizer, StandardFilter, LowerCaseFilter, and StopFilter. EnglishAnalyzer additionally rolls in an EnglishPossessiveFilter, KeywordMarkerFilter, and PorterStemFilter.
Mainly, the EnglishAnalyzer rolls in some English stemming enhancements, which should work well for plain English text.
For StandardAnalyzer, the only assumption I'm aware of that ties it directly to English analysis is the default stop-word set, which is, of course, just a default and can be changed. StandardAnalyzer now implements Unicode Standard Annex #29, which attempts to provide non-language-specific text segmentation.
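For example, a quick way to see the difference for yourself is to print the tokens each analyzer produces. This is a minimal sketch assuming Lucene 4.3 on the classpath; the field name "field" and the sample text are arbitrary:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class AnalyzerDemo {
        static void printTokens(Analyzer analyzer, String text) throws Exception {
            TokenStream ts = analyzer.tokenStream("field", new StringReader(text));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.print("[" + term.toString() + "] ");
            }
            ts.end();
            ts.close();
            System.out.println();
        }

        public static void main(String[] args) throws Exception {
            String text = "XY&Z Corporation - xyz@example.com";
            // StandardAnalyzer: [xy] [z] [corporation] [xyz] [example.com]
            printTokens(new StandardAnalyzer(Version.LUCENE_43), text);
            // EnglishAnalyzer adds possessive stripping and Porter stemming on top
            printTokens(new EnglishAnalyzer(Version.LUCENE_43), text);
        }
    }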
Related
I am creating a search engine for a large number of HTML documents using lucene.
I know I can use PostingsHighlighter and friends to show snippets, with bold words, similar to Google Search results, also similar to this random lucene-based example.
However, unlike these examples, I need a solution that preserves highlighted words, even after the matched document is opened by the user, similar to Google Books.
Some words are hyphenated, in the form <div> ... an inter-</div><div...>national audience ...</div>. I am thinking I need to convert these to plain text first and write some code to merge the hyphenated words before I send them to Lucene.
Once the resulting document is opened by the user, I'm hoping that I can use lucene to get character offsets of each matched word in the document.
I will have to cross-reference the offsets in the plain text back to the original HTML, and write code to highlight <b> the words based on said offsets.
<div> ... an <b>inter-</b></div><div...><b>national</b> audience ...</div>
How can I get what I need from lucene? Surely I don't have to write my own search for this 'final inch'?
OK, I figured out something I can get started with. :)
To index:
StandardAnalyzer analyzer = new StandardAnalyzer();
Directory index = FSDirectory.open(new File("...").toPath());
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(index, config);

// documents need to be read from the data source..
// only add once, or else your docs will be duplicated as you continue to use the system
addDoc(writer, "...", "...");
addDoc(writer, "...", "...");
addDoc(writer, "...", "...");

writer.close();
Specify offsets to store for highlighting:
private static final FieldType typeOffsets;
static {
    typeOffsets = new FieldType(TextField.TYPE_STORED);
    typeOffsets.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
}
The addDoc method:
void addDoc(IndexWriter writer, String title, String body) throws IOException {
    Document doc = new Document();
    doc.add(new Field("title", title, typeOffsets));
    doc.add(new Field("body", body, typeOffsets));
    // you can also add and store a TextField that does not have offsets,
    // like a file ID that you wouldn't search on, just to reference the original doc.
    writer.addDocument(doc);
}
Perform your first search:
String q = "...";
String[] fields = new String[] {"title", "body"};
QueryParser parser = new MultiFieldQueryParser(fields, analyzer);
Query query = parser.parse(q);
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(index));
PostingsHighlighter highlighter = new PostingsHighlighter();
TopDocs topDocs = searcher.search(query, 10, Sort.RELEVANCE);
Get highlighted snippets with highlighter.highlightFields(fields, query, searcher, topDocs). You can iterate over the results.
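For instance, a rough sketch of that iteration (highlightFields returns a Map keyed by field name, with one snippet per hit; an entry can be null when a field had no match for a given hit):

    Map<String, String[]> snippets = highlighter.highlightFields(fields, query, searcher, topDocs);
    for (int i = 0; i < topDocs.scoreDocs.length; i++) {
        Document hit = searcher.doc(topDocs.scoreDocs[i].doc); // stored fields of this hit, e.g. a file ID
        String titleSnippet = snippets.get("title")[i];
        String bodySnippet = snippets.get("body")[i];
        System.out.println(titleSnippet + " ... " + bodySnippet);
    }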
When you want to highlight the end document (i.e. after the search is completed and user selected the result), use this solution (needs minor edits). It works by using NullFragmenter to turn the whole thing into one snippet.
public static String highlight(String pText, String pQuery) throws Exception
{
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
    QueryParser parser = new QueryParser(Version.LUCENE_30, "", analyzer);
    Highlighter highlighter = new Highlighter(new QueryScorer(parser.parse(pQuery)));
    highlighter.setTextFragmenter(new NullFragmenter());

    String text = highlighter.getBestFragment(analyzer, "", pText);
    if (text != null)
    {
        return text;
    }
    return pText;
}
Edit: You can actually use PostingsHighlighter for this last step instead of Highlighter, but you have to override getBreakIterator, supplying a BreakIterator that treats the whole document as one sentence.
Edit: You can override getFormatter to capture the offsets, rather than trying to parse the <b> tags normally output by PostingsHighlighter.
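A minimal sketch of that last step, assuming WholeBreakIterator (shipped in the postingshighlight package of recent 4.x releases) is available; otherwise any BreakIterator that reports the whole text as a single sentence will do:

    PostingsHighlighter highlighter = new PostingsHighlighter() {
        @Override
        protected BreakIterator getBreakIterator(String field) {
            // treat the entire field value as one "sentence" so the whole
            // document comes back as a single highlighted fragment
            return new WholeBreakIterator();
        }
    };

If your documents are long, check the maxLength constructor argument as well, since the highlighter only looks at the first part of each field by default.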
I am using Lucene 4.6 with a PhraseQuery to search for words in a PDF. Below is my code. I am able to get the output text from the PDF, and the query prints as contents:"Following are the", but the number of hits is 0. Any suggestions? Thanks in advance.
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
try {
    // Store the index in memory:
    Directory directory = new RAMDirectory();
    // To store an index on disk, use this instead:
    //Directory directory = FSDirectory.open("/tmp/testindex");
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);
    iwriter.deleteAll();
    iwriter.commit();

    Document doc = new Document();
    PDDocument document = null;
    try {
        document = PDDocument.load(strFilepath);
    } catch (IOException ex) {
        System.out.println("Exception occurred while loading the document: " + ex);
    }
    String output = new PDFTextStripper().getText(document);
    System.out.println(output);

    //String text = "This is the text to be indexed";
    doc.add(new Field("contents", output, TextField.TYPE_STORED));
    iwriter.addDocument(doc);
    iwriter.close();

    // Now search the index
    DirectoryReader ireader = DirectoryReader.open(directory);
    IndexSearcher isearcher = new IndexSearcher(ireader);

    String sentence = "Following are the";
    //IndexSearcher searcher = new IndexSearcher(directory);
    if (output.contains(sentence)) {
        System.out.println("");
    }

    PhraseQuery query = new PhraseQuery();
    String[] words = sentence.split(" ");
    for (String word : words) {
        query.add(new Term("contents", word));
    }

    ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
    // Iterate through the results:
    if (hits.length > 0) {
        System.out.println("Searched text existed in the PDF.");
    }
    ireader.close();
    directory.close();
}
catch (Exception e) {
    System.out.println("Exception: " + e.getMessage());
}
There are two reasons why your PhraseQuery is not working:
StandardAnalyzer uses ENGLISH_STOP_WORDS_SET, which contains a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with. These words are removed from the TokenStream while indexing. That means when you search for "Following are the", the terms are and the will not be found in the index, so a PhraseQuery containing them can never match, since are and the were never there to search against in the first place.
The solution is to use this constructor while indexing:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46, CharArraySet.EMPTY_SET);
This makes sure the StopFilter does not remove any word from the TokenStream while indexing.
StandardAnalyzer also uses LowerCaseFilter, which means all tokens are normalized to lower case. So Following is indexed as following, and searching for "Following" won't return a result. Calling .toLowerCase() on your sentence before building the query fixes that, and you should then get results from the search.
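Putting both fixes together, a minimal sketch (Lucene 4.6, reusing the contents field from your code):

    // index with an empty stop-word set so "are" and "the" are kept
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46, CharArraySet.EMPTY_SET);
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
    // ... build the index exactly as before ...

    // lower-case the phrase before building the query, to match what LowerCaseFilter indexed
    PhraseQuery query = new PhraseQuery();
    for (String word : "Following are the".toLowerCase().split(" ")) {
        query.add(new Term("contents", word));
    }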
Also have a look at this link, which describes Unicode Standard Annex #29, the segmentation rules followed by StandardTokenizer. From a brief look at it, APOSTROPHE, QUOTATION MARK, FULL STOP, SMALL COMMA and many other characters will, under certain conditions, be ignored during tokenization.
I am trying to build Lucene autocomplete using Lucene's Dictionary and spell-check classes, but so far I have only been successful in making it work for single terms.
I googled and found out that we need to make use of the Shingle Matrix filter to get this done. Can someone experienced with Lucene show me a way to do it?
All I need is for it to generate phrases for autocomplete. For example, if I have a doc like "This is a long line with very long rant with too many words in it", then I should be able to generate suggestions like "long line", "long rant", "many words", etc.
Possible?
Thanks.
writer = new IndexWriter(dir,
        new ShingleAnalyzerWrapper(new StandardAnalyzer(
                Version.LUCENE_CURRENT,
                Collections.emptySet()), 3),
        false,
        IndexWriter.MaxFieldLength.UNLIMITED);
This did the job for me...
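To see which shingles the wrapper actually produces, you can run a quick check like this (a sketch against the 3.x TokenStream API; the field name is arbitrary):

    Analyzer analyzer = new ShingleAnalyzerWrapper(
            new StandardAnalyzer(Version.LUCENE_CURRENT, Collections.emptySet()), 3);
    TokenStream ts = analyzer.tokenStream("field",
            new StringReader("This is a long line with very long rant with too many words in it"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        // prints single words plus 2- and 3-word shingles such as "long line" and "long rant with"
        System.out.println(term.toString());
    }
    ts.end();
    ts.close();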
You can write your own Analyzer by inheriting from the Lucene.Net.Analysis.Analyzer class and implementing the TokenStream method. There you can use ShingleFilter to get multi-word tokens from the token stream:
public override Lucene.Net.Analysis.TokenStream TokenStream(String fieldName, System.IO.TextReader reader)
{
    Lucene.Net.Analysis.TokenStream tokenStream =
        new Lucene.Net.Analysis.Standard.StandardTokenizer(Lucene.Net.Util.Version.LUCENE_30, reader);
    tokenStream = new Lucene.Net.Analysis.Shingle.ShingleFilter(tokenStream, maxShingleSize);
    return tokenStream;
}
maxShingleSize specifies the maximum length of a multi-word unit.
I'm having problems getting a simple URL to tokenize properly so that it can be searched as expected.
I'm indexing "http://news.bbc.co.uk/sport1/hi/football/internationals/8196322.stm" with the StandardAnalyzer and it is tokenizing the string as the following (debug output):
(http,0,4,type=<ALPHANUM>)
(news.bbc.co.uk,7,21,type=<HOST>)
(sport1/hi,22,31,type=<NUM>)
(football,32,40,type=<ALPHANUM>)
(internationals/8196322.stm,41,67,type=<NUM>)
In general it looks good: http by itself, then the hostname, but the issue seems to be with the forward slashes. Surely it should treat them as separate words?
What do I need to do to correct this?
Thanks
P.S. I'm using Lucene.NET but I really don't think it makes much of a difference with regards to the answers.
The StandardAnalyzer, which uses the StandardTokenizer, doesn't tokenize URLs (although it recognises emails and treats them as one token). What you are seeing is its default behaviour: splitting on various punctuation characters. The simplest solution might be to write a custom Analyzer and supply a UrlTokenizer, extending/modifying the code in StandardTokenizer, to tokenize URLs. Something like:
public class MyAnalyzer extends Analyzer {

    public MyAnalyzer() {
        super();
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new MyUrlTokenizer(reader);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result);
        result = new SynonymFilter(result);
        return result;
    }
}
Here the URLTokenizer splits on /, -, _ and whatever else you want. Nutch may also have some relevant code, but I don't know if there's a .NET version.
Note that if you have a distinct fieldName for URLs then you can modify the above code to use the StandardTokenizer by default and the UrlTokenizer for the url field.
e.g.
public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = null;
    if (fieldName.equals("url")) {
        result = new MyUrlTokenizer(reader);
    } else {
        result = new StandardTokenizer(reader);
    }
    return result;
}
You should parse the URL yourself (I imagine there's at least one .Net class that can parse a URL string and tease out the different elements), then add those elements (such as the host, or whatever else you're interested in filtering on) as Keywords; don't Analyze them at all.
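For example, something along these lines (a Java sketch using java.net.URI and the older Field API from that era; in Lucene.NET the System.Uri class and the matching Field constructor play the same roles, and the field names here are just placeholders):

    URI uri = new URI("http://news.bbc.co.uk/sport1/hi/football/internationals/8196322.stm");
    Document doc = new Document();
    // keep the full URL as one untokenized keyword for exact matches
    doc.add(new Field("url", uri.toString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    // the host as a single keyword, e.g. "news.bbc.co.uk"
    doc.add(new Field("host", uri.getHost(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    // each path segment as its own keyword: "sport1", "hi", "football", ...
    for (String segment : uri.getPath().split("/")) {
        if (segment.length() > 0) {
            doc.add(new Field("path", segment, Field.Store.YES, Field.Index.NOT_ANALYZED));
        }
    }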
Lucene's StandardAnalyzer removes dots from string/acronyms when indexing it.
I want Lucene to retain dots and hence I'm using WhitespaceAnalyzer class.
I can give my list of stop words to StandardAnalyzer... but how do I give it to WhitespaceAnalyzer?
Thanks for reading.
Create your own analyzer by extending WhitespaceAnalyzer and overriding the tokenStream method as follows.
public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = super.tokenStream(fieldName, reader);
    result = new StopFilter(result, stopSet);
    return result;
}
Here the stopSet is the Set of stop words, which you could get by adding a constructor to your analyzer which accepts a list of stop words.
You may also wish to override reusableTokenStream() method in similar fashion if you plan to reuse the TokenStream.
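Put together, the whole analyzer might look something like this. This is only a sketch against the same pre-4.0 API as the snippet above; the class name is made up, and it builds the chain directly on a WhitespaceTokenizer, which yields the same tokens as WhitespaceAnalyzer and avoids trouble if that class is final in your version:

    public class StopWhitespaceAnalyzer extends Analyzer {

        private final Set<?> stopSet;

        public StopWhitespaceAnalyzer(Set<?> stopWords) {
            this.stopSet = stopWords;
        }

        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // whitespace-only tokenization keeps dots and acronyms intact
            TokenStream result = new WhitespaceTokenizer(reader);
            result = new StopFilter(result, stopSet);
            return result;
        }
    }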