How to get unique documents when searching indexed documents in Lucene

I need to index documents and search them, querying multiple fields. When I search the indexed files, I get repeated documents in the results. This is how I build the query:
MultiFieldQueryParser parser = new MultiFieldQueryParser( Version.LUCENE_40, new String[] {"title", "abs"}, analyzer);
Query query = parser.parse(querystr);
Here is my display:
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
for (int i = 0; i < hits.length; ++i) {
    int docId = hits[i].doc;
    Document d = searcher.doc(docId);
    System.out.println((i + 1) + ". " + d.get("pmid") + "\t" + d.get("title"));
}

Your code looks fine, but it would help to print the docId of each hit.
If the repeated results have different docId values, the same document was most likely added to the index more than once, and that is what you need to fix.
Remember that Lucene has no built-in concept of identity: to Lucene, all documents are distinct even if they contain exactly the same terms. The standard fix is to store an external ID (e.g. a database ID) in a non-analyzed field and, whenever a document changes, delete the old version by that ID and re-add the updated one.
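As a stopgap while the index is being fixed, duplicates can also be collapsed at display time. The sketch below is plain Java with no Lucene types: the Hit class and the sample pmid values are hypothetical stand-ins for the stored fields printed in the question. The index-time fix described above is normally done with IndexWriter.updateDocument(Term, Document), which deletes by term and then adds.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for search hits; no Lucene types are used here.
class UniqueHitsSketch {

    // Hypothetical holder for the two stored fields printed in the question.
    static final class Hit {
        final String pmid;
        final String title;

        Hit(String pmid, String title) {
            this.pmid = pmid;
            this.title = title;
        }
    }

    // Keeps the first (highest-ranked) hit for each pmid, preserving rank order.
    static List<Hit> uniqueByPmid(List<Hit> rankedHits) {
        Map<String, Hit> firstSeen = new LinkedHashMap<>();
        for (Hit hit : rankedHits) {
            firstSeen.putIfAbsent(hit.pmid, hit);
        }
        return new ArrayList<>(firstSeen.values());
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("101", "Gene expression"));
        hits.add(new Hit("101", "Gene expression")); // same record indexed twice
        hits.add(new Hit("202", "Protein folding"));
        for (Hit h : uniqueByPmid(hits)) {
            System.out.println(h.pmid + "\t" + h.title);
        }
    }
}
```

This only hides the symptom: the duplicates still occupy slots in the top-N results, so fixing the index remains the better option.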

Related

In Lucene, how to find documents that contain only words from the search query

The indexed documents are:
1) experience in
2) with experience
3) with experience in
4) proficiency in
5) knowledge in
6) experience of
7) strong knowledge
8) knowledge of
9) responsible for
If my search query is "Candidates with experience in", only the first 3 documents should be retrieved, i.e. the documents that contain only words from the search query. For example, the 4th document contains "proficiency", which is not in the search query, so it should not be retrieved.
I tried a BooleanQuery with SHOULD clauses, but it also returned the partially matching documents (4-6).
String[] searchWords = searchQuery.split(" ");
for (String searchWord : searchWords) {
    TermQuery tq = new TermQuery(new Term("fieldName", searchWord));
    bq.add(new BooleanClause(tq, BooleanClause.Occur.SHOULD));
}
search(bq);
I need only documents 1-3.
1) Create a vocabulary of all words that occur in the index.
2) Add a MUST_NOT clause for every vocabulary word that is not in the query.
// Arrays.asList returns a List, not an ArrayList
List<String> searchWords = Arrays.asList(searchQuery.split(" "));
for (String searchWord : searchWords) {
    TermQuery tq = new TermQuery(new Term("fieldName", searchWord));
    bq.add(new BooleanClause(tq, BooleanClause.Occur.SHOULD));
}
for (String word : vocabulary) {
    if (!searchWords.contains(word)) {
        TermQuery tq = new TermQuery(new Term("fieldName", word));
        bq.add(new BooleanClause(tq, BooleanClause.Occur.MUST_NOT));
    }
}
search(bq);
3) If looping through the vocabulary is slow, try Java 8 streams to optimize it. Also note that BooleanQuery caps the number of clauses (1024 by default), so a large vocabulary will need that limit raised.
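The matching rule itself ("every word of the document occurs in the query") can be checked in isolation. The following is a plain-Java sketch of that rule, not Lucene code: the tokens() helper is a hypothetical stand-in for an analyzer and simply lowercases and splits on whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SubsetMatchSketch {

    // Lowercases and splits on whitespace; a stand-in for a real analyzer.
    static List<String> tokens(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Returns the 1-based positions of documents whose every token occurs in the query.
    static List<Integer> matching(String query, List<String> docs) {
        Set<String> allowed = new HashSet<>(tokens(query));
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            List<String> docTokens = tokens(docs.get(i));
            if (!docTokens.isEmpty() && allowed.containsAll(docTokens)) {
                result.add(i + 1);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "experience in", "with experience", "with experience in",
            "proficiency in", "knowledge in", "experience of",
            "strong knowledge", "knowledge of", "responsible for");
        // prints [1, 2, 3]
        System.out.println(matching("Candidates with experience in", docs));
    }
}
```

Note the lowercasing: if the field is analyzed at index time, the query words have to be normalized the same way before they are compared with indexed terms.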

Number of PhraseQuery matches in a document

This is my code to perform a PhraseQuery using Lucene. While it is clear how to get the score for each matching document in the index, I do not understand how to extract the total number of matches within a single document.
The following is my code performing the query:
PhraseQuery.Builder builder = new PhraseQuery.Builder();
builder.add(new Term("contents", "word1"), 0);
builder.add(new Term("contents", "word2"), 1);
builder.add(new Term("contents", "word3"), 2);
builder.setSlop(3);
PhraseQuery pq = builder.build();
int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs docs = searcher.search(pq, hitsPerPage);
ScoreDoc[] hits = docs.scoreDocs;
System.out.println("Found " + hits.length + " hits.");
for (int i = 0; i < hits.length; ++i) {
    int docId = hits[i].doc;
    Document d = searcher.doc(docId);
    System.out.println(docId + " " + hits[i].score);
}
Is there a method to extract the total number of matches for each document rather than the score?
Approach A. This might not be the best way, but it gives you quick insight. You can use the explain() method of the IndexSearcher class; it takes the query and the document number and returns an Explanation whose string form contains a lot of information, including the phrase frequency in that document. Add this code inside your for loop:
System.out.println(searcher.explain(pq, docId));
Approach B. A more systematic way is to do what explain() itself does. To compute the phrase frequency, explain() builds a scorer object for the phrase query and calls freq() on it. Most of the methods and classes involved are private or protected, so I am not sure you can actually use them. Still, it may help to read the code of explain() in the PhraseWeight class inside PhraseQuery, and the ExactPhraseScorer class. (Some of these classes are not public, so you will need to download the source code to see them.)
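To make "phrase frequency" concrete, here is a plain-Java sketch that counts exact, in-order occurrences of a phrase in an already tokenized text. It corresponds to slop 0 only; Lucene's sloppy matching (slop 3 in the question) is more involved and is not reproduced here. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class PhraseCountSketch {

    // Counts positions where the phrase terms appear consecutively and in order (slop 0).
    static int countExact(List<String> docTokens, List<String> phrase) {
        if (phrase.isEmpty()) {
            return 0;
        }
        int count = 0;
        for (int start = 0; start + phrase.size() <= docTokens.size(); start++) {
            if (docTokens.subList(start, start + phrase.size()).equals(phrase)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
            "word1", "word2", "word3", "other", "word1", "word2", "word3");
        List<String> phrase = Arrays.asList("word1", "word2", "word3");
        System.out.println(countExact(doc, phrase)); // prints 2
    }
}
```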

Lucene not indexing String field with value "this"

I am adding the document to lucene index as follows:
Document doc = new Document();
String stringObj = (String)field.get(obj);
doc.add(new TextField(fieldName, stringObj.toLowerCase(), org.apache.lucene.document.Field.Store.YES));
indexWriter.addDocument(doc);
I am doing a wildcard search as follows:
searchTerm = "*" + searchTerm + "*";
term = new Term(field, searchTerm.toLowerCase());
Query query = new WildcardQuery(term);
TotalHitCountCollector collector = new TotalHitCountCollector();
indexSearcher.search(query, collector);
if (collector.getTotalHits() > 0) {
    TopDocs hits = indexSearcher.search(query, collector.getTotalHits());
}
When a field's value is the string "this", it does not get added to the index, so I get no results when searching for "this". I am using a StandardAnalyzer.
Common English words such as prepositions and pronouns are treated as stop words, and the analyzer drops them before indexing. You can define a custom analyzer, or give your analyzer a custom stop word list. That way you can keep the stop words you need searchable and still omit the words you don't want indexed.
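The effect of a stop word list can be seen in a small plain-Java sketch. This is not Lucene's analyzer: the analyze() helper and the sample stop set below are illustrative assumptions, and the set is not Lucene's actual list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordSketch {

    // A few illustrative stop words; NOT Lucene's actual list.
    static final Set<String> SAMPLE_STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "this", "that", "of", "in"));

    // Lowercases, splits on whitespace, and drops every token found in stopWords.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // With the sample stop set, "this" yields no tokens, so nothing is indexed.
        System.out.println(analyze("this", SAMPLE_STOP_WORDS));
        // With an empty stop set, "this" survives and becomes searchable.
        System.out.println(analyze("this", Collections.<String>emptySet()));
    }
}
```

The same idea applies to the real analyzer: supplying an empty or reduced stop set at construction time keeps words like "this" in the index, as long as the same analyzer settings are used for both indexing and querying.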

How to score a small set of docs in Lucene

I would like to compute the scores for a small number of documents (rather than for the entire collection) for a given query. My attempt, as follows, returns 0 scores for each document, even though the queries I test with were derived from the terms in the documents I am trying to score. I am using Lucene 3.0.3.
List<Float> score(IndexReader reader, Query query, List<Integer> newDocs) {
    // List is an interface, so instantiate an ArrayList
    List<Float> scores = new ArrayList<Float>();
    IndexSearcher searcher = reader.getSearcher();
    Collector collector = TopScoreDocCollector.create(newDocs.size(), true);
    Weight weight = query.createWeight(searcher);
    Scorer scorer = weight.scorer(reader, true, true);
    collector.setScorer(scorer);
    float score = 0.0f;
    for (Integer d : newDocs) {
        scorer.advance(d);
        collector.collect(d);
        score = scorer.score();
        System.out.println("doc: " + d + "; score=" + score);
        scores.add(new Float(score));
    }
    return scores;
}
I am obviously missing something in the setup of scoring, but I cannot figure out from the Lucene source code what that might be.
Thanks in advance,
Gene
Use a filter: build a Filter that accepts only the documents you care about and pass it to the search along with the query. Then iterate through the results as you would with a normal search; Lucene will handle the filtering.
In general, if you are working with doc IDs directly, you are probably using a lower-level API than you need to, and it is going to give you trouble.

Is it possible to iterate through documents stored in Lucene Index?

I have some documents stored in a Lucene index with a docId field.
I want to get all the docIds stored in the index. There is one complication: the index holds about 300,000 documents, so I would prefer to fetch the docIds in chunks of 500. Is that possible?
Lucene 3.x and earlier:
IndexReader reader = // create IndexReader
for (int i = 0; i < reader.maxDoc(); i++) {
    if (reader.isDeleted(i))
        continue;
    Document doc = reader.document(i);
    String docId = doc.get("docId");
    // do something with docId here...
}
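Lucene 4 (isDeleted() was removed; use the live-docs Bits instead):
Bits liveDocs = MultiFields.getLiveDocs(reader);
for (int i = 0; i < reader.maxDoc(); i++) {
    // liveDocs is null when the index has no deletions
    if (liveDocs != null && !liveDocs.get(i))
        continue;
    Document doc = reader.document(i);
}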
See LUCENE-2600 on this page for details: https://lucene.apache.org/core/4_0_0/MIGRATE.html
There is a query class named MatchAllDocsQuery, I think it can be used in this case:
Query query = new MatchAllDocsQuery();
TopDocs topDocs = getIndexSearcher.search(query, RESULT_LIMIT);
Document numbers (or IDs) are consecutive numbers from 0 to IndexReader.maxDoc()-1. These numbers are not persistent and are valid only for the IndexReader that is currently open. You can check whether a document is deleted with the IndexReader.isDeleted(int documentNumber) method.
If you use .document(i) as in the examples above and skip over deleted documents, be careful when using this approach to paginate results.
For example, suppose you show 10 documents per page and need the documents for page 6. Your input might be something like offset=60, count=10 (documents 60 to 69).
IndexReader reader = // create IndexReader
for (int i = offset; i < offset + 10; i++) {
    if (reader.isDeleted(i))
        continue;
    Document doc = reader.document(i);
    String docId = doc.get("docId");
}
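You will run into problems with deleted documents, because the page should start not at document number 60 but after the first 60 live documents, i.e. at 60 plus the number of deleted documents before that point. A page that contains deleted documents will also come back with fewer than 10 entries.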
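One way to get the offset right is to count live documents rather than document numbers. The sketch below is plain Java under stated assumptions: a java.util.BitSet stands in for the index's deletion information (a set bit means "not deleted"), and the page() helper is hypothetical, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class LiveDocPagingSketch {

    // Returns up to `count` live document numbers after skipping `offset` live ones.
    // `liveDocs` stands in for the index's deletion info: a set bit means "not deleted".
    static List<Integer> page(BitSet liveDocs, int maxDoc, int offset, int count) {
        List<Integer> result = new ArrayList<>();
        int skipped = 0;
        for (int doc = 0; doc < maxDoc && result.size() < count; doc++) {
            if (!liveDocs.get(doc)) {
                continue; // deleted: does not count toward the offset or the page
            }
            if (skipped < offset) {
                skipped++;
                continue;
            }
            result.add(doc);
        }
        return result;
    }

    public static void main(String[] args) {
        BitSet live = new BitSet(10);
        live.set(0, 10); // documents 0..9 exist
        live.clear(1);   // document 1 was deleted
        live.clear(4);   // document 4 was deleted
        // Live docs are 0,2,3,5,6,7,8,9; skipping two of them leaves 3,5,6 for the page.
        System.out.println(page(live, 10, 2, 3)); // prints [3, 5, 6]
    }
}
```

The trade-off is that every request scans from document 0, so deep pages cost more; remembering the last document number of the previous page would avoid the rescan.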
An alternative I found is something like this:
is = getIndexSearcher(); //new IndexSearcher(indexReader)
//get all results without any conditions attached.
Term term = new Term([[any mandatory field name]], "*");
Query query = new WildcardQuery(term);
topCollector = TopScoreDocCollector.create([[int max hits to get]], true);
is.search(query, topCollector);
TopDocs topDocs = topCollector.topDocs(offset, count);
Note: replace the text between [[ ]] with your own values; the max hits value must be at least offset + count.
I ran this on a large index with 1.5 million entries and got an arbitrary page of 10 results in less than a second.
Admittedly this is slower, but at least deleted documents are skipped automatically if you need pagination.