Stop Words Removal in Lucene

How do I remove stop words in Lucene for the given String "This is the chemical orientation"?

I think that Lucene's StopFilter is what you are looking for.
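For instance, here is a minimal sketch of that approach, assuming a Lucene 5.x/6.x-style API (the class name and analysis chain are illustrative), which runs the example string through StandardTokenizer, LowerCaseFilter, and StopFilter and prints the tokens that survive:

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StopWordDemo {
    public static void main(String[] args) throws IOException {
        StandardTokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader("This is the chemical orientation"));
        // Lowercase first so "This" matches the (lowercase) default stop set,
        // then drop anything in StandardAnalyzer's English stop-word set.
        TokenStream stream = new StopFilter(new LowerCaseFilter(tokenizer),
                StandardAnalyzer.STOP_WORDS_SET);
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.println(term.toString()); // prints: chemical, orientation
        }
        stream.end();
        stream.close();
    }
}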

You should use StandardAnalyzer, which knows about certain token types, lowercases tokens, removes stop words, and so on.
An example of creating an IndexWriter with StandardAnalyzer:
public IndexWriter Indexer(String dir) throws IOException {
    IndexWriter writer;
    Directory indexDir = FSDirectory.open(new File(dir).toPath());
    Analyzer analyzer = new StandardAnalyzer();
    IndexWriterConfig cfg = new IndexWriterConfig(analyzer);
    cfg.setOpenMode(OpenMode.CREATE);
    writer = new IndexWriter(indexDir, cfg);
    return writer;
}

Related

PhraseQuery + Lucene 4.6 not working for PDF word search

I am using Lucene 4.6 with a PhraseQuery to search for words extracted from a PDF. Below is my code. I am able to get the output text from the PDF, and the query prints as contents:"Following are the", but the number of hits shows as 0. Any suggestions? Thanks in advance.
try {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
    // Store the index in memory:
    Directory directory = new RAMDirectory();
    // To store an index on disk, use this instead:
    //Directory directory = FSDirectory.open("/tmp/testindex");
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);
    iwriter.deleteAll();
    iwriter.commit();
    Document doc = new Document();
    PDDocument document = null;
    try {
        document = PDDocument.load(strFilepath);
    } catch (IOException ex) {
        System.out.println("Exception occurred while loading the document: " + ex);
    }
    String output = new PDFTextStripper().getText(document);
    System.out.println(output);
    //String text = "This is the text to be indexed";
    doc.add(new Field("contents", output, TextField.TYPE_STORED));
    iwriter.addDocument(doc);
    iwriter.close();
    // Now search the index
    DirectoryReader ireader = DirectoryReader.open(directory);
    IndexSearcher isearcher = new IndexSearcher(ireader);
    String sentence = "Following are the";
    //IndexSearcher searcher = new IndexSearcher(directory);
    if (output.contains(sentence)) {
        System.out.println("");
    }
    PhraseQuery query = new PhraseQuery();
    String[] words = sentence.split(" ");
    for (String word : words) {
        query.add(new Term("contents", word));
    }
    ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
    // Iterate through the results:
    if (hits.length > 0) {
        System.out.println("Searched text existed in the PDF.");
    }
    ireader.close();
    directory.close();
} catch (Exception e) {
    System.out.println("Exception: " + e.getMessage());
}
There are two reasons why your PhraseQuery is not working:
StandardAnalyzer uses ENGLISH_STOP_WORDS_SET, which contains a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with. These words are removed from the TokenStream while indexing, so when you search for "Following are the" in the index, are and the will not be found. Such a PhraseQuery can never return a result, because are and the were never there to search for in the first place.
The solution is to use this constructor while indexing:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46, CharArraySet.EMPTY_SET);
This makes sure that the StopFilter does not remove any word from the TokenStream while indexing.
StandardAnalyzer also uses LowerCaseFilter, which means all tokens are normalized to lower case: Following is indexed as following, so searching for "Following" won't give you a result. Here .toLowerCase() comes to your rescue; just apply it to your sentence and you should get results from the search.
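Putting both fixes together, a minimal self-contained sketch against the Lucene 4.6 API (the class name and document text are illustrative):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class PhraseQueryFix {
    public static void main(String[] args) throws Exception {
        // Fix 1: empty stop set, so "are" and "the" are actually indexed.
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46, CharArraySet.EMPTY_SET);
        Directory directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(directory,
                new IndexWriterConfig(Version.LUCENE_46, analyzer));
        Document doc = new Document();
        doc.add(new Field("contents", "Following are the contents of the PDF",
                TextField.TYPE_STORED));
        writer.addDocument(doc);
        writer.close();

        DirectoryReader reader = DirectoryReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(reader);
        // Fix 2: lowercase the phrase to match LowerCaseFilter's output.
        PhraseQuery query = new PhraseQuery();
        for (String word : "Following are the".toLowerCase().split(" ")) {
            query.add(new Term("contents", word));
        }
        System.out.println("hits: " + searcher.search(query, 10).totalHits); // hits: 1
        reader.close();
        directory.close();
    }
}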
Also have a look at this link, which specifies Unicode Standard Annex #29, the rules StandardTokenizer follows. From a brief look at it, APOSTROPHE, QUOTATION MARK, FULL STOP, SMALL COMMA and many other characters will, under certain conditions, be ignored while indexing.

Set Lucene IndexWriter max fields

I started working my way through the second edition of 'Lucene in Action', which uses the 3.0 API. The author creates a basic IndexWriter with the following method:
private IndexWriter getIndexWriter() throws CorruptIndexException, LockObtainFailedException, IOException {
    return new IndexWriter(directory, new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.Unlimited);
}
In the code below I've made the changes according to the current API, except that I cannot figure out how to set the writer's max field length to unlimited, like the constant in the book example. I've just inserted the int 1000 below. Is this unlimited constant gone completely from the current API?
private IndexWriter getIndexWriter() throws CorruptIndexException, LockObtainFailedException, IOException {
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36,
            new LimitTokenCountAnalyzer(new WhitespaceAnalyzer(Version.LUCENE_36), 1000));
    return new IndexWriter(directory, iwc);
}
Thanks; this is just out of curiosity.
The IndexWriter javadoc says:
@deprecated use LimitTokenCountAnalyzer instead. Note that the behavior slightly changed: the analyzer limits the number of tokens per token stream created, while this setting limits the total number of tokens to index. This only matters if you index many multi-valued fields, though.
So, in other words, a hard-wired method has been replaced with a nice adapter/delegate pattern.
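If you want the old unlimited behaviour back, one option, sketched here against the Lucene 3.6 API and reusing the directory field from the question, is to pass Integer.MAX_VALUE as the token limit:

private IndexWriter getIndexWriter() throws CorruptIndexException, LockObtainFailedException, IOException {
    // Integer.MAX_VALUE effectively restores MaxFieldLength.UNLIMITED:
    // the delegate analyzer's token streams are never truncated in practice.
    Analyzer analyzer = new LimitTokenCountAnalyzer(
            new WhitespaceAnalyzer(Version.LUCENE_36), Integer.MAX_VALUE);
    return new IndexWriter(directory, new IndexWriterConfig(Version.LUCENE_36, analyzer));
}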

Lucene behaviour in mocked unit tests

Now this is just strange:
The code below, as it is, works fine in an NUnit unit test with RhinoMocks (the assert passes).
That version creates the IndexSearcher directly in the code.
Now, if I use the mocked version of Get (swap the commented assignment of IndexSearcher) so that the searcher is returned by the mock, the assertion does not pass.
Can anyone figure out why that is? (NUnit 2.5.2, RhinoMocks 3.6, Lucene 2.9.2)
[Test]
public void Test()
{
    ISearcherManager searcherManager = _repository.StrictMock<ISearcherManager>();
    Directory directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
    searcherManager.Expect(item => item.Get()).Return(new IndexSearcher(writer.GetReader())).Repeat.AtLeastOnce();
    _repository.ReplayAll();
    //searcherManager.Get();
    Document doc = new Document();
    doc.Add(new Field("F", "hello you", Field.Store.YES, Field.Index.ANALYZED));
    writer.AddDocument(doc);
    IndexSearcher searcher = searcherManager.Get();
    //IndexSearcher searcher = new IndexSearcher(writer.GetReader());
    QueryParser parser = new QueryParser("F", new StandardAnalyzer());
    Query q = parser.Parse("hello");
    TopDocs hits = searcher.Search(q, 2);
    Assert.AreEqual(1, hits.totalHits);
}
I'm not familiar with Lucene, but the only difference I see is that, via the Expect call, you are creating your IndexSearcher before adding the document to the writer. In the commented-out code, the IndexSearcher is created after you add the document to the writer. Is that an important distinction?
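That distinction does matter: a Lucene index reader is a point-in-time snapshot of the index, so a searcher opened before the document is added will never see it. A Java sketch of the same effect against the Lucene 2.9 API (class name illustrative; the .NET port behaves the same way):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class SnapshotDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(),
                true, IndexWriter.MaxFieldLength.UNLIMITED);

        // Reader opened BEFORE the add: a snapshot with 0 docs.
        IndexSearcher early = new IndexSearcher(writer.getReader());

        Document doc = new Document();
        doc.add(new Field("F", "hello you", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);

        // Reader opened AFTER the add sees the document.
        IndexSearcher late = new IndexSearcher(writer.getReader());

        QueryParser parser = new QueryParser("F", new StandardAnalyzer());
        System.out.println(early.search(parser.parse("hello"), 2).totalHits); // 0
        System.out.println(late.search(parser.parse("hello"), 2).totalHits);  // 1
    }
}

In the mocked test, writer.GetReader() runs when the expectation is set up, before AddDocument, which would explain the failing assertion.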

Lucene HTMLFormatter skipping last character

I have this simple Lucene search code (modified from http://www.lucenetutorial.com/lucene-in-5-minutes.html):
class Program
{
    static void Main(string[] args)
    {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new RAMDirectory();
        IndexWriter w = new IndexWriter(index, analyzer, true,
            IndexWriter.MaxFieldLength.UNLIMITED);
        addDoc(w, "Table 1 <table> content </table>");
        addDoc(w, "Table 2");
        addDoc(w, "<table> content </table>");
        addDoc(w, "The Art of Computer Science");
        w.Close();

        String querystr = "table";
        Query q = new QueryParser("title", analyzer).Parse(querystr);
        Lucene.Net.Search.IndexSearcher searcher = new Lucene.Net.Search.IndexSearcher(index);
        Hits hitsFound = searcher.Search(q);
        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("*", "*");
        Highlighter highlighter = new Highlighter(formatter, new QueryScorer(searcher.Rewrite(q)));
        for (int i = 0; i < hitsFound.Length(); i++)
        {
            Console.WriteLine(highlighter.GetBestFragment(analyzer, "title", hitsFound.Doc(i).Get("title")));
            // Console.WriteLine(hitsFound.Doc(i).Get("title"));
        }
        Console.ReadKey();
    }

    private static void addDoc(IndexWriter w, String value)
    {
        Document doc = new Document();
        doc.Add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
        w.AddDocument(doc);
    }
}
The highlighted results always seem to skip the closing '>' of my last table tag. Any suggestions?
Lucene's highlighter, out of the box, is geared to handle plain text. It will work incorrectly if you try to highlight HTML or any other marked-up text.
I recently ran into the same problem and found a solution in Solr's HTMLStripReader, which skips the content inside tags. The solution is outlined on my blog at the following URL:
http://sigabrt.blogspot.com/2010/04/highlighting-query-in-entire-html.html
I could have posted the code here, but my solution is for Lucene Java. For .NET, you would have to find the equivalent of HTMLStripReader.
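For what it's worth, Solr's HTMLStripReader later moved into Lucene's analysis module as HTMLStripCharFilter. A minimal Java sketch (assuming Lucene 4+; the sample markup mirrors the question) of stripping tags before the text reaches the analyzer or highlighter:

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;

public class StripDemo {
    public static void main(String[] args) throws IOException {
        // A CharFilter is a Reader: pull the stripped text out of it.
        Reader stripped = new HTMLStripCharFilter(
                new StringReader("Table 1 <table> content </table>"));
        StringBuilder sb = new StringBuilder();
        int ch;
        while ((ch = stripped.read()) != -1) {
            sb.append((char) ch);
        }
        System.out.println(sb); // the <table> tags are stripped; only text content remains
    }
}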
Solved. Apparently my Highlighter.Net version was archaic; upgrading to 2.3.2.1 solved the problem.

Using RAMDirectory

When should I use Lucene's RAMDirectory? What are its advantages over other storage mechanisms? Finally, where can I find a simple code example?
When you don't want to store your index data permanently. I use it for testing purposes: add data to your RAMDirectory and run your unit tests against it.
e.g.
public static void main(String[] args) {
    try {
        Directory directory = new RAMDirectory();
        Analyzer analyzer = new SimpleAnalyzer();
        IndexWriter writer = new IndexWriter(directory, analyzer, true);
        // ... add documents, run your test queries, then close the writer ...
        writer.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
OR
public void testRAMDirectory() throws IOException {
    Directory dir = FSDirectory.getDirectory(indexDir);
    MockRAMDirectory ramDir = new MockRAMDirectory(dir);
    // close the underlying directory
    dir.close();
    // check the size
    assertEquals(ramDir.sizeInBytes(), ramDir.getRecomputedSizeInBytes());
    // open a reader to test the document count
    IndexReader reader = IndexReader.open(ramDir);
    assertEquals(docsToAdd, reader.numDocs());
    // open a searcher to check that all docs are there
    IndexSearcher searcher = new IndexSearcher(reader);
    // fetch all documents
    for (int i = 0; i < docsToAdd; i++) {
        Document doc = searcher.doc(i);
        assertTrue(doc.getField("content") != null);
    }
    // cleanup
    reader.close();
    searcher.close();
}
Usually, if things work out with RAMDirectory, they will pretty much work fine with the others, i.e. with permanent index storage.
The alternative to this is FSDirectory. In that case you will have to take care of filesystem permissions (which is not an issue with RAMDirectory).
Functionally, there is no distinct advantage of RAMDirectory over FSDirectory (other than the fact that RAMDirectory will be visibly faster). They serve two different needs.
RAMDirectory -> Primary memory
FSDirectory -> Secondary memory
Pretty similar to RAM vs. hard disk.
I am not sure what will happen to a RAMDirectory if it exceeds the memory limit. I'd expect an OutOfMemoryException (System.SystemException) to be thrown.