Number of PhraseQuery matches in a document - Lucene

This is my code to perform a PhraseQuery using Lucene. While it is clear how to get the score for each matching document in the index, I do not understand how to extract the total number of phrase matches within a single document.
The following is my code performing the query:
PhraseQuery.Builder builder = new PhraseQuery.Builder();
builder.add(new Term("contents", "word1"), 0);
builder.add(new Term("contents", "word2"), 1);
builder.add(new Term("contents", "word3"), 2);
builder.setSlop(3);
PhraseQuery pq = builder.build();
int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs docs = searcher.search(pq, hitsPerPage);
ScoreDoc[] hits = docs.scoreDocs;
System.out.println("Found " + hits.length + " hits.");
for (int i = 0; i < hits.length; ++i) {
    int docId = hits[i].doc;
    Document d = searcher.doc(docId);
    System.out.println(docId + " " + hits[i].score);
}
Is there a method to extract the total number of matches for each document rather than the score?

Approach A. This might not be the best way, but it will give you quick insight. You can use the explain() method of the IndexSearcher class, which returns an Explanation object containing a lot of scoring detail, including the phrase frequency in the document. Note that explain() takes the document id, not the Document itself. Add this code inside your for loop:
System.out.println(searcher.explain(pq, docId));
Approach B. A more systematic way of doing this is to do the same thing that explain() does internally. To compute the phrase frequency, explain() builds a scorer object for the phrase query and calls freq() on it. Most of the classes involved are private or package-private, so I am not sure you can use them directly; still, it is helpful to look at the code of explain() in the PhraseWeight class inside PhraseQuery, and at the ExactPhraseScorer class. (Some of these classes are not public, so you will need to download the Lucene source code to see them.)
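As a rough illustration of Approach B, here is a minimal sketch that walks the per-segment scorers itself and reads freq() per matching document. It assumes a Lucene 5.x release (the PhraseQuery.Builder era) in which Scorer is still a DocIdSetIterator and exposes a public freq(); in some 5.x versions Weight.scorer() also takes a Bits acceptDocs argument, so check the API of your exact version:
// Sketch only: per-segment phrase-frequency extraction (assumed Lucene 5.x API).
Weight weight = searcher.createNormalizedWeight(pq, true); // true = needs scores
for (LeafReaderContext leaf : reader.leaves()) {
    Scorer scorer = weight.scorer(leaf);
    if (scorer == null) {
        continue; // no phrase matches in this segment
    }
    int doc;
    while ((doc = scorer.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        // freq() is the number of phrase occurrences within this document
        System.out.println("doc=" + (leaf.docBase + doc) + " freq=" + scorer.freq());
    }
}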

Related

Lucene not indexing String field with value "this"

I am adding the document to the Lucene index as follows:
Document doc = new Document();
String stringObj = (String)field.get(obj);
doc.add(new TextField(fieldName, stringObj.toLowerCase(), org.apache.lucene.document.Field.Store.YES));
indexWriter.addDocument(doc);
I am doing a wild card search as follows:
searchTerm = "*" + searchTerm + "*";
Term term = new Term(field, searchTerm.toLowerCase()); // was: sTerm, an undefined variable
Query query = new WildcardQuery(term);
TotalHitCountCollector collector = new TotalHitCountCollector();
indexSearcher.search(query, collector);
if (collector.getTotalHits() > 0) {
    TopDocs hits = indexSearcher.search(query, collector.getTotalHits());
}
When a string has the value "this", it is not getting added to the index, hence I do not get a result when searching for "this". I am using a StandardAnalyzer.
Common terms of the English language, such as prepositions and pronouns, are treated as stop words and omitted before indexing; "this" is in StandardAnalyzer's default stop word set. You can define a custom analyzer, or pass a custom stop word list to your analyzer. That way you can omit the words you don't want indexed while keeping the stop words you need.
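For example, here is a minimal sketch of building a StandardAnalyzer with a reduced stop word set that keeps "this" indexable (constructor signatures assume Lucene 5.x; 4.x versions also take a Version argument, and CharArraySet lives in org.apache.lucene.analysis.util):
// Keep "this" indexable by supplying our own stop word list.
CharArraySet stopWords = new CharArraySet(Arrays.asList("a", "an", "the"), true);
Analyzer analyzer = new StandardAnalyzer(stopWords);
// Or disable stop word removal entirely:
Analyzer keepEverything = new StandardAnalyzer(CharArraySet.EMPTY_SET);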

How to get unique documents when searching indexed documents in lucene

I need to index documents and search them; I have multiple fields to query. When I search the indexed files, I get repeated documents. This is how I build the query:
MultiFieldQueryParser parser = new MultiFieldQueryParser( Version.LUCENE_40, new String[] {"title", "abs"}, analyzer);
Query query = parser.parse(querystr);
Here is how I display the results:
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
for (int i = 0; i < hits.length; ++i) {
    int docId = hits[i].doc;
    Document d = searcher.doc(docId);
    System.out.println((i + 1) + ". " + d.get("pmid") + "\t" + d.get("title"));
}
Your code looks OK; however, it would be helpful if you printed the docId value.
If this value differs between the repeated hits, then you most likely added the document to the index multiple times, which you need to fix.
Remember, Lucene has no built-in concept of identity. To Lucene, all documents are distinct even if they contain exactly the same terms. The standard way to fix this is to add a non-analyzed field containing some external ID (e.g. a database id) and to remove and re-add the updated document when changes happen, as in the sketch below.
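A minimal sketch of that pattern, reusing the "pmid" field from your display code as the external key (the variable names here are placeholders):
// Index the external key as a non-analyzed StringField so it can act as identity.
Document doc = new Document();
doc.add(new StringField("pmid", pmid, Field.Store.YES)); // not tokenized
doc.add(new TextField("title", title, Field.Store.YES));
doc.add(new TextField("abs", abstractText, Field.Store.NO));
// updateDocument first deletes any document matching the term, then adds the
// new one, so re-indexing the same record never creates duplicates.
indexWriter.updateDocument(new Term("pmid", pmid), doc);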

High performance unique document id retrieval

Currently I am working on a high-performance NRT system using Lucene 4.9.0 on the Java platform which detects near-duplicate text documents.
For this purpose I query Lucene to return some set of matching candidates and do the near-duplicate calculation locally (by retrieving and caching term vectors). But my main concern is the performance cost of mapping Lucene's docId (which can change) to my own unique, immutable document id stored within the index.
My flow is as follows:
query for documents in Lucene
for each document:
fetch my unique document id based on Lucene docId
get the term vector from the cache for my document id (if it doesn't exist, fetch it from Lucene and populate the cache)
do maths...
My major bottleneck is the "fetch my unique document id" step, which introduces huge performance degradation (especially since I sometimes have to do the calculation for, say, 40,000 term vectors in a single loop).
try {
    Document document = indexReader.document(id);
    return document.getField(ID_FIELD_NAME).numericValue().intValue();
} catch (IOException e) {
    throw new IndexException(e);
}
Possible solutions I was considering were:
trying Zoie, which handles unique and persistent doc identifiers,
using FieldCache (still very inefficient),
using payloads (as described at http://invertedindex.blogspot.com/2009/04/lucene-dociduid-mapping-and-payload.html), but I have no idea how to apply them.
Any other suggestions?
I have figured out how to partially solve the issue by taking advantage of Lucene's AtomicReader. For this purpose I use a global cache that keeps each segment's FieldCache, keyed by the segment's core cache key.
Map<Object, FieldCache.Ints> fieldCacheMap = new HashMap<Object, FieldCache.Ints>();
In my method I use the following piece of code:
Query query = new TermQuery(new Term(FIELD_NAME, fieldValue));
IndexReader indexReader = DirectoryReader.open(indexWriter, true);
List<AtomicReaderContext> leaves = indexReader.getContext().leaves();
// track which segment caches were touched by this pass (was missing a declaration)
Set<Object> usedReaderSet = new HashSet<Object>();
// process each segment separately
for (AtomicReaderContext leaf : leaves) {
    AtomicReader reader = leaf.reader();
    FieldCache.Ints fieldCache;
    Object fieldCacheKey = reader.getCoreCacheKey();
    synchronized (fieldCacheMap) {
        fieldCache = fieldCacheMap.get(fieldCacheKey);
        if (fieldCache == null) {
            fieldCache = FieldCache.DEFAULT.getInts(reader, ID_FIELD_NAME, true);
            fieldCacheMap.put(fieldCacheKey, fieldCache);
        }
        usedReaderSet.add(fieldCacheKey);
    }
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    for (int i = 0; i < scoreDocs.length; i++) {
        int docID = scoreDocs[i].doc;
        int offerId = fieldCache.get(docID);
        // do your processing here
    }
}
// evict cache entries for segments that no longer exist
synchronized (fieldCacheMap) {
    Set<Object> inCacheSet = fieldCacheMap.keySet();
    Set<Object> toRemove = new HashSet<Object>();
    for (Object inCache : inCacheSet) {
        if (!usedReaderSet.contains(inCache)) {
            toRemove.add(inCache);
        }
    }
    for (Object subject : toRemove) {
        fieldCacheMap.remove(subject);
    }
}
indexReader.close();
It works pretty fast. My main concern is memory usage, which can be really high when using a large index.
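If memory does become a problem, one variation to consider (my suggestion, not part of the original solution) is to additionally index the id as a NumericDocValuesField and read it per segment; in Lucene 4.9 doc values are column-stride fields that largely stay on disk instead of living in a heap cache:
// At index time (hypothetical, reusing ID_FIELD_NAME):
// doc.add(new NumericDocValuesField(ID_FIELD_NAME, myId));
// At search time, per segment:
for (AtomicReaderContext leaf : indexReader.getContext().leaves()) {
    NumericDocValues ids = leaf.reader().getNumericDocValues(ID_FIELD_NAME);
    if (ids == null) {
        continue; // segment has no doc values for this field
    }
    IndexSearcher searcher = new IndexSearcher(leaf.reader());
    TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);
    for (ScoreDoc sd : topDocs.scoreDocs) {
        int myId = (int) ids.get(sd.doc); // segment-local docID -> my unique id
        // do your processing here
    }
}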

Lucene 4.4: How to get term frequency over the whole index?

I'm trying to compute the tf-idf value of each term in a document. So, I iterate through the terms in a document and want to find the frequency of each term in the whole corpus, and the number of documents in which the term appears. Following is my code:
// @param index  path to the index directory
// @param docNbr the document number in the index
public void readingIndex(String index, int docNbr) throws IOException {
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
    Document doc = reader.document(docNbr);
    System.out.println("Processing file: " + doc.get("id"));
    Terms termVector = reader.getTermVector(docNbr, "contents");
    TermsEnum itr = termVector.iterator(null);
    BytesRef term = null;
    while ((term = itr.next()) != null) {
        String termText = term.utf8ToString();
        long termFreq = itr.totalTermFreq(); // FIXME: this only returns the frequency within this doc
        long docCount = itr.docFreq();       // FIXME: docCount = 1 in all cases
        System.out.println("term: " + termText + ", termFreq = " + termFreq + ", docCount = " + docCount);
    }
    reader.close();
}
Although the documentation says totalTermFreq() returns the total number of occurrences of the term across all documents, when testing I found that it only returns the frequency of the term within the document given by docNbr, and that docFreq() always returns 1.
How can I get the frequency of a term across the whole index?
Update
Of course, I could build a map from each term to its frequency, then iterate through every document and count the total number of times each term occurs. However, I thought Lucene should have a built-in method for that purpose.
Thank you,
IndexReader.totalTermFreq(Term) will provide this for you. Your calls to the similar methods on the TermsEnum are indeed providing the stats for all documents in that enumeration; however, the enumeration comes from a single document's term vector, so it covers only that one document. Using the reader gets you the stats for all documents in the index. Something like:
String termText = term.utf8ToString();
Term termInstance = new Term("contents", term);
long termFreq = reader.totalTermFreq(termInstance);
long docCount = reader.docFreq(termInstance);
System.out.println("term: "+termText+", termFreq = "+termFreq+", docCount = "+docCount);

How to score a small set of docs in Lucene

I would like to compute the scores for a small number of documents (rather than for the entire collection) for a given query. My attempt, as follows, returns 0 scores for each document, even though the queries I test with were derived from the terms in the documents I am trying to score. I am using Lucene 3.0.3.
List<Float> score(IndexReader reader, Query query, List<Integer> newDocs) throws IOException {
    List<Float> scores = new ArrayList<Float>();        // List is an interface; instantiate ArrayList
    IndexSearcher searcher = new IndexSearcher(reader); // IndexReader has no getSearcher()
    Collector collector = TopScoreDocCollector.create(newDocs.size(), true);
    Weight weight = query.createWeight(searcher);
    Scorer scorer = weight.scorer(reader, true, true);
    collector.setScorer(scorer);
    float score = 0.0f;
    for (Integer d : newDocs) {
        scorer.advance(d);
        collector.collect(d);
        score = scorer.score();
        System.out.println("doc: " + d + "; score=" + score);
        scores.add(Float.valueOf(score));
    }
    return scores;
}
I am obviously missing something in the setup of scoring, but I cannot figure out from the Lucene source code what that might be.
Thanks in advance,
Gene
Use a filter, and do a search with that filter. Then just iterate through the results as you would with a normal search; Lucene will handle the filtering.
In general, if you are manipulating docIds directly, you are probably using a lower-level API than you need, and it is going to give you trouble.
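As a rough sketch of that suggestion against the Lucene 3.x API (DocIdFilter is a made-up name): a custom Filter can restrict a normally scored search to a fixed set of docIds. One caveat: since Lucene 2.9, searches run per segment, so the ids passed to getDocIdSet() are segment-relative; this sketch assumes a single-segment (optimized) index.
// Restrict a normal scored search to an explicit set of docIds (Lucene 3.x).
class DocIdFilter extends Filter {
    private final List<Integer> docs;

    DocIdFilter(List<Integer> docs) {
        this.docs = docs;
    }

    @Override
    public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
        OpenBitSet bits = new OpenBitSet(reader.maxDoc()); // OpenBitSet is a DocIdSet
        for (int d : docs) {
            bits.set(d);
        }
        return bits;
    }
}

// Usage: scores come back exactly as in a normal search, but only for the listed docs.
TopDocs top = searcher.search(query, new DocIdFilter(newDocs), newDocs.size());
for (ScoreDoc sd : top.scoreDocs) {
    System.out.println("doc: " + sd.doc + "; score=" + sd.score);
}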