Why I can't get the doc that added recently by IndexWriter in the search result in Lucene 4.0? - lucene

Such as the title said, I have encountered a puzzled problem.
I have built an index for my test program, then I use IndexWriter to add a document into index. The code is :
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc1 = new Document();
doc1.add(new Field("name", "张三", Field.Store.YES, Field.Index.ANALYZED));
doc1.add(new IntField("year", 2013, Field.Store.YES));
doc1.add(new TextField("content", "123456789", Field.Store.YES));
iwriter.addDocument(doc1);
iwriter.commit();
iwriter.close();
When I try to search in this index, I can't get this doc. I really get a correct result count, it is one more than before. But when I try to print the doc.get('name'), the output is wrong.
The code in search part is:
DirectoryReader ireader = DirectoryReader.open(directory);
System.out.println(ireader.numDeletedDocs());
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "name", analyzer);
Query query = parser.parse("张");
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
System.out.println(hits.length);
In results, there is a "Name: 李四".
I'm sure that I use the StandardAnalyzer during indexing and searching. And StandardAnalyzer will make one Chinese character as a single token. Why when I search "张", I will get "李四"? Is there anything wrong when I add a doc? Or the docid is mismatch?

Did you (re)open the index after adding the doc? Lucene searches only return the documents that existed as of the time the index was opened for searching.
[edit...]
Use IndexReader.Open() or IndexReader.doOpenIfChanged() to open the index again. doOpenIfChanged() has the advantage that it returns null if you can still use the old IndexReader instance (because the index has not changed).
(If I recall correctly, DirectoryReader.Open() just opens the index directory, so the higher-level Lucene code does not realize that the index has changed if you just call DirectoryReader.Open.)

Related

Index field using: new TextField(String fieldName, Reader reader)

I've been trying to index a field using the readerValue() that Lucene provides in the Fields. The thing is that the terms are not being indexed. This is the interesting part of the code:
IndexWriterConfig config = new IndexWriterConfig(new SimpleAnalyzer());
IndexWriter indexWriter = new IndexWriter(directory, config);
indexWriter.deleteAll();
String str = "Some random text to be indexed";
Reader reader = new StringReader(str);
Document doc = new Document();
doc.add(new TextField("content", reader));
indexWriter.addDocument(doc);
Now, if I index that text as a String with the other TextField constructor it works fine, but like this it does not index the terms, instead returns null when I try to get the value of the field after a search:
QueryParser queryParser = new QueryParser("content", new SimpleAnalyzer());
Query query = queryParser.parse(text);
TopDocs topDocs = indexSearcher.search(query,10);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
Document document = indexSearcher.doc(scoreDoc.doc);
Reader r = document.getField("content").readerValue();
I really can't see the problem, maybe it is some dumb thing that I missed, or maybe I'm using it wrong? Thanks in advance for any help.
By default, TextField is unstored. The behavior you are seeing is expected for an un-stored field. You should be able to search on it, but not retrieve it from the index. The constructor that takes a string argument for the field contents allows you to set whether to store the field or not, thus the different behavior.
The reason the store option is not available on that constructor, is that Lucene explicitly disallows a stored field to be set with a Reader or TokenStream value. If you want to store the field, you will simply need get the string value from the Reader yourself.

After Lucene search, get character offsets of all matched words in document? (not just preview snippet)

I am creating a search engine for a large number of HTML documents using lucene.
I know I can use PostingsHighlighter and friends to show snippets, with bold words, similar to Google Search results, also similar to this random lucene-based example.
However, unlike these examples, I need a solution that preserves highlighted words, even after the matched document is opened by the user, similar to Google Books.
Some words are hyphenated, in the form <div> ... an inter-</div><div...>national audience ...</div> I am thinking I need to convert these to plain text first, and write some code to merge words that were hyphenated, before I send them to lucene.
Once the resulting document is opened by the user, I'm hoping that I can use lucene to get character offsets of each matched word in the document.
I will have to cross-reference the offsets in the plain text back to the original HTML, and write code to highlight <b> the words based on said offsets.
<div> ... an <b>inter-</b></div><div...><b>national</b> audience ...</div>
How can I get what I need from lucene? Surely I don't have to write my own search for this 'final inch'?
OK, I figured out something I can get started with. :)
To index:
StandardAnalyzer analyzer - new StandardAnalyzer()
Directory index = FSDirectory.open(new File("...").toPath());
IndexWriterConfig config = new IndexWriterConfig(analyzer);
addDoc(writer, "...", "...");
addDoc(writer, "...", "...");
addDoc(writer, "...", "...");
// documents need to be read from the data source..
// only add once, or else your docs will be duplicated as you continue to use the system
writer.close();
specify offsets to store for highlighting
private static final FieldType typeOffsets;
static {
typeOffsets = new FieldType(textField.TYPE_STORED);
typeOffsets.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
}
method addDoc
void addDoc(IndexWriter writer, String title, String body) {
Document doc = new Document();
doc.add(new Field("title", body, typeOffsets));
doc.add(new Field("body", body, typeOffsets));
// you can also add an store a TextField that does not have offsets,
// like a file ID that you wouldn't search on, just need to reference original doc.
writer.addDocument(doc);
}
Perform your first search
String q = "...";
String[] fields = new String[] {"title", "body"};
QueryParser parser = new MultiFieldQueryParser(fields, analyzer)
Query query = parser.parse(q)
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(index));
PostingsHighlighter highlighter = new PostingsHighlighter();
TopDocs topDocs = searcher.search(query, 10, Sort.RELEVANCE);
Get highlighted snippets with highlighter.highlightFields(fields, query, searcher, topDocs). You can iterate over the results.
When you want to highlight the end document (i.e. after the search is completed and user selected the result), use this solution (needs minor edits). It works by using NullFragmenter to turn the whole thing into one snippet.
public static String highlight(String pText, String pQuery) throws Exception
{
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
QueryParser parser = new QueryParser(Version.LUCENE_30, "", analyzer);
Highlighter highlighter = new Highlighter(new QueryScorer(parser.parse(pQuery)));
highlighter.setTextFragmenter(new NullFragmenter());
String text = highlighter.getBestFragment(analyzer, "", pText);
if (text != null)
{
return text;
}
return pText;
}
Edit: You can actually use PostingsHighlighter for this last step instead of Highlighter, but you have to override getBreakIterator, and then override your BreakIterator so that it thinks the whole document is one sentance.
Edit: You can override getFormatter to capture the offsets, rather than trying to parse the <b> tags normally output by PostingsHighlighter.

PhraseQuery+Lucene 4.6 is not working for PDF Word search

Iam Using lucene 4.6 version with Phrase Query for searching the words from PDF. Below is my code. Here Iam able to get the out put text from the PDF also getting the query as contents:"Following are the". But No.of hits is showing as 0. Any suggestions on it?? Thanks in advance.
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
// Store the index in memory:
Directory directory = new RAMDirectory();
// To store an index on disk, use this instead:
//Directory directory = FSDirectory.open("/tmp/testindex");
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
iwriter.deleteAll();
iwriter.commit();
Document doc = new Document();
PDDocument document = null;
try {
document = PDDocument.load(strFilepath);
}
catch (IOException ex) {
System.out.println("Exception Occured while Loading the document: " + ex);
}
String output=new PDFTextStripper().getText(document);
System.out.println(output);
//String text = "This is the text to be indexed";
doc.add(new Field("contents", output, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();
// Now search the index
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
String sentence = "Following are the";
//IndexSearcher searcher = new IndexSearcher(directory);
if(output.contains(sentence)){
System.out.println("");
}
PhraseQuery query = new PhraseQuery();
String[] words = sentence.split(" ");
for (String word : words) {
query.add(new Term("contents", word));
}
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
// Iterate through the results:
if(hits.length>0){
System.out.println("Searched text existed in the PDF.");
}
ireader.close();
directory.close();
}
catch(Exception e){
System.out.println("Exception: "+e.getMessage());
}
There are two reasons why your PhraseQuery is not working
StandardAnalyzer uses ENGLISH_STOP_WORDS_SET which contains a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with these words which will be removed from TokenStream while indexing. That means when you search "Following are the" in index, are and the will not be found. so you will never get any result for such a PhraseQuery as are and the will never be there in first place to search with.
Solution for this is use this constructor for
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46, CharArraySet.EMPTY_SET); while indexing this will make sure that StopFilter will not remove any word from TokenStream while indexing.
StandardAnalyzer also uses LowerCaseFilter that means all tokens will be normalized to lower case. so Following will be indexed as following that means searching "Following" won't give you result. For this .toLowerCase() will come to your rescue, just use this on your sentence and you should get results from search.
Also have a look at this link which specify Unicode Standard Annex #29 which is followed by StandardTokenizer. And from brief look at it, it looks like APOSTROPHE, QUOTATION MARK, FULL STOP, SMALL COMMA and many other characters under certain condition will be ignored while indexing.

Lucene - Specialized TokenStream/Analyzer given a set of indexable keywords

I have the following situation
I have a collection of documents to index. But I need to be selective in what I index.
Selection criteria: the document must contain one of the keywords from a given Set.
That part is easy, I can check if any of those keywords are present in the document and only then index the document.
The tricky situation is (for me anyway!), that I want to index only these keywords. And these keywords can be multiworded, or regex expressions as well, say.
What these keywords are going to be is meaningless to this post, because I can abstract that out - I can generate the list of keywords that need to be indexed.
Is there an existing TokenStream, Analyzer, Filter combination that I can use?
And if there isn't, please could someone point me in the right direction.
If my question isn't clear enough:
HashSet<String> impKeywords = new HashSet<String>(new String[] {"Java", "Lucene"});
I have a class Content which I use, say:
Content content = new Content("I am only interested in Java, Lucene, Nutch, Luke, CommonLisp.");
And, say I have a method to get matching keywords:
HashSet<String> matchingKeywords = content.getMatchingKeywords(impKeywords); // returns a set with "Java" and "Lucene"
And if there are matchingKeywords, only then proceed to index the document; so:
if(!matchingKeywords.isEmpty()) {
// prepare document for indexing, and index.
// But what should be my Analyzer and TokenStream?
}
I want to be able to create an Analyzer with a TokenStream that only returns these matching keywords, so only these tokens are indexed.
End notes: One possibility appears to be that for each document I add a variable number of fields with each of the matching keywords. Where these fields are Indexed but not Analyzed using Field.Index.NOT_ANALYZED. However, it would be better if I'm able to figure out a pre-existing Analyzer/TokenStream for this purpose instead of playing around with fields.
Following #femtoRgon's advise I have resolved the said problem as follows.
As explained in the question, I have:
HashSet<String> impKeywords = new HashSet<String>(new String[] {"Java", "Lucene"});
And I have a class Content which I use, say as follows:
Content content = new Content("I am only interested in Java, Lucene, Nutch, Luke, CommonLisp.");
And, I have a method to get matching keywords:
HashSet<String> matchingKeywords = content.getMatchingKeywords(impKeywords); // returns a set with "Java" and "Lucene" for this example `content`.
And if there are matchingKeywords, only then proceed to index the document; so while indexing I did:
if(!matchingKeywords.isEmpty()) {
Document doc = new Document();
for(String keyword: matchingKeywords) {
doc.add(new Field("keyword", keyword, Field.Store.YES, Field.Index.NOT_ANALYZED);
}
iwriter.addDocument(doc); // iwriter is the instance of IndexWriter
}
Then, while searching I created the following boolean query:
BooleanQuery boolQuery = new BooleanQuery();
for(String queryKeyword: searchKeywords)) {
boolQuery.add(new TermQuery(new Term("keyword", queryKeyword)), BooleanClause.Occur.SHOULD);
}
ScoreDoc[] hits = isearcher.search(boolQuery, null, 1000).scoreDocs; // isearcher is the instance of IndexSearcher
Hope this answer helps someone with similar needs.

Use lucene index in java application

Recently i stared working on solr. I have created index in solr and i want to query on it through my java application. I don't want to use solr.war in my application. How can i use it through solrj api or lucene java api? My thinking is to add those index in project context and use it. I gone through some examples/tutorials but did not find any on how to work with already created index. Please tell me a proper solution for it or any link specifying the solution will be appreciated.
You can use Lucene apis to create/update and search on an index.
As solr is based on lucene, the underlying index is the lucene index.
Lucene exposes classes as IndexWriter and IndexSearcher, which would help you interact with index.
Example for searching over an solr/lucene index -
Directory index = FSDirectory.open(new File("/path/to/index"));
IndexSearcher searcher = new IndexSearcher(index, true);
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
Should be able to find examples on this.
Yes, you can use a Solr-created index with Lucene, there's nothing particular about it because Solr itself uses Lucene. So all Lucene documentation applies unchanged.
Or if you don't want to use Solr as a server you can use it embedded in your Java application.
I made it this way..
String realPath = request.getRealPath("/");
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
Directory index = FSDirectory.open(new File(realPath+"/index"));
IndexSearcher indexSearcher = new IndexSearcher(index, true);
TopScoreDocCollector collector = TopScoreDocCollector.create(2000, true);
QueryParser query = new QueryParser(Version.LUCENE_CURRENT, "name", analyzer);
Query q = null;
try {
q = query.parse("*:*");
} catch (ParseException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
indexSearcher.search(q, collector);
ScoreDoc[] scoreDoc = collector.topDocs().scoreDocs;