Lucene 4.4: How to get term frequency over the whole index? - lucene

I'm trying to compute tf-idf value of each term in a document. So, I iterate through the terms in a document and want to find the frequency of the term in the whole corpus and the number of documents in which the term appears. Following is my code:
//#param index path to index directory
//#param docNbr the document number in the index
public void readingIndex(String index, int docNbr) throws IOException {
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
Document doc = reader.document(docNbr);
System.out.println("Processing file: "+doc.get("id"));
Terms termVector = reader.getTermVector(docNbr, "contents");
TermsEnum itr = termVector.iterator(null);
BytesRef term = null;
while ((term = itr.next()) != null) {
String termText = term.utf8ToString();
long termFreq = itr.totalTermFreq(); //FIXME: this only return frequency in this doc
long docCount = itr.docFreq(); //FIXME: docCount = 1 in all cases
System.out.println("term: "+termText+", termFreq = "+termFreq+", docCount = "+docCount);
}
reader.close();
}
Although the documentation says totalTermFreq() returns the total number of occurrences of this term across all documents, in testing I found that it returns only the frequency of the term in the document given by docNbr, and docFreq() always returns 1.
How can I get frequency of a term across the whole index?
Update
Of course, I can create a map from each term to its frequency, then iterate through every document to count the total number of times a term occurs. However, I thought Lucene would have a built-in method for that purpose.
Thank you,

IndexReader.totalTermFreq(Term) will provide this for you. Your calls to the similarly named methods on the TermsEnum do return the stats for all documents in the enumeration, but an enumeration obtained from a term vector covers only that one document. Calling the methods on the reader gets you the stats for all the documents in the index itself. Something like:
String termText = term.utf8ToString();
Term termInstance = new Term("contents", term);
long termFreq = reader.totalTermFreq(termInstance);
long docCount = reader.docFreq(termInstance);
System.out.println("term: "+termText+", termFreq = "+termFreq+", docCount = "+docCount);
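Once the reader-level statistics are available, the tf-idf value the question asks about can be computed with plain arithmetic. The following is a minimal sketch, not Lucene code: it assumes the classic weighting tf * ln(N / df), which differs from the formulas used by Lucene's own Similarity classes. The class name TfIdfSketch and its method are hypothetical; the inputs are assumed to come from the term vector (in-document frequency), reader.docFreq(term), and reader.numDocs().

```java
// Minimal sketch (hypothetical helper, not part of Lucene):
// classic tf-idf computed as tf * ln(N / df).
final class TfIdfSketch {

    /**
     * @param termFreqInDoc occurrences of the term in one document
     *                      (what the term vector's TermsEnum reports)
     * @param docFreq       number of documents containing the term
     *                      (assumed to come from reader.docFreq(term))
     * @param numDocs       total number of documents in the index
     *                      (assumed to come from reader.numDocs())
     */
    static double tfIdf(long termFreqInDoc, long docFreq, long numDocs) {
        if (termFreqInDoc <= 0 || docFreq <= 0 || numDocs <= 0) {
            return 0.0; // term absent or empty index: no weight
        }
        double idf = Math.log((double) numDocs / docFreq);
        return termFreqInDoc * idf;
    }

    public static void main(String[] args) {
        // Made-up numbers, for illustration only.
        System.out.println(tfIdf(3, 10, 1000));
        // A term that occurs in every document carries no weight.
        System.out.println(tfIdf(3, 1000, 1000)); // prints 0.0
    }
}
```

A term found in every document gets an idf of ln(1) = 0, which is why the index-wide docFreq matters: with the term vector's constant value of 1, every term would receive the same idf.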

Related

Number of PhraseQuery matches in a document

This is my code to perform a PhraseQuery using Lucene. While it is clear how to get the score for each matching document in the index, I do not understand how to extract the total number of matches within a single document.
The following is my code performing the query:
PhraseQuery.Builder builder = new PhraseQuery.Builder();
builder.add(new Term("contents", "word1"), 0);
builder.add(new Term("contents", "word2"), 1);
builder.add(new Term("contents", "word3"), 2);
builder.setSlop(3);
PhraseQuery pq = builder.build();
int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs docs = searcher.search(pq, hitsPerPage);
ScoreDoc[] hits = docs.scoreDocs;
System.out.println("Found " + hits.length + " hits.");
for(int i=0;i<hits.length;++i)
{
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println(docId + " " + hits[i].score);
}
Is there a method to extract the total number of matches for each document rather than the score?
Approach A. This might not be the best way, but it gives you quick insight. You can use the explain() method of the IndexSearcher class, which returns an Explanation whose text contains a lot of information, including the phrase frequency in the document. Add this line inside your for loop:
System.out.println(searcher.explain(pq, docId));
Approach B. A more systematic way of doing this is to do the same thing that explain() function does. To compute the phrase frequency, explain() builds a scorer object for the phrase query and calls freq() on it. Most of the methods/classes used to do this are private/protected so I am not sure if you can really use them. However it might be helpful to look at the code of explain() in PhraseWeight class inside PhraseQuery and ExactPhraseScorer class. (Some of these classes are not public and you should download the source code to be able to see them).

lucene, efficient way to get offsets of a set of terms in documents

Suppose I have indexed a set of documents and am now given a set of terms that I know were generated by the indexing process. I would like to get every occurrence of each of these terms, i.e., which document and which offsets. I have done this using, for each term, one PostingsEnum that lets me iterate through the set of documents the term appears in; then, within each document, a second PostingsEnum obtained from the document's term vector, which contains the offset information of that term in that document.
But this is not very efficient, as there is a loop inside a loop, and it can be quite slow. The code is below. Any suggestions on whether this can be done in a better way?
The field schema:
<field name="terms" type="token_ngram" indexed="true" stored="false" multiValued="false" termVectors="true" termPositions="true" termOffsets="true"/>
Code:
IndexReader indexReader = ...//init an index reader
Set<String> termSet = .... //set containing e.g., 10000 terms.
for(String term: termSet){
//get a postingenum used to iterate through docs containing the term
//this "postings" does not have valid offset information (see comment below)
PostingsEnum postings =
MultiFields.getTermDocsEnum(indexReader, "terms", new BytesRef(term.getBytes()));
/*I also tried:
*PostingsEnum postings =
* MultiFields.getTermDocsEnum(indexReader, "terms", new BytesRef(term.getBytes()), PostingsEnum.OFFSETS);
* But the resulting "postings" object also does not contain valid offset info (always -1)
*/
//now go through each document
int docId = postings.nextDoc();
while (docId != PostingsEnum.NO_MORE_DOCS) {
//get the term vector for that document.
TermsEnum it = indexReader.getTermVector(docId, ngramInfoFieldname).iterator();
//find the term of interest
it.seekExact(new BytesRef(term.getBytes()));
//get its posting info. this will contain offset info
PostingsEnum postingsInDoc = it.postings(null, PostingsEnum.OFFSETS);
//From Line A to Line B below, if I replace "postingsInDoc" with "postings", startOffset() and endOffset() always return -1
postingsInDoc.nextDoc(); //line A
int totalFreq = postingsInDoc.freq();
for (int i = 0; i < totalFreq; i++) {
postingsInDoc.nextPosition();
System.out.println(postingsInDoc.startOffset() + ", " + postingsInDoc.endOffset());
} //Line B
docId=postings.nextDoc();
}
}

Lucene - Iterating through TermsEnum for docfreq

I am trying to get the document frequency for each term in a TermsEnum, but I get only 1 as the document frequency for every term. Any hint as to what the problem could be? This is my code:
Terms terms = reader.getTermVector(docId, field);
TermsEnum termsEnum = null;
termsEnum = terms.iterator(termsEnum);
BytesRef termText = null;
while((termsEnum.next()) != null){
int docNumbersWithTerm = termsEnum.docFreq();
System.out.println(docNumbersWithTerm);
}
The Terms instance from IndexReader.getTermVector acts as if you had a single-document index consisting entirely of the specified document. Since there is only one document to consider in this context, you will always get docFreq() = 1. You can get the document frequency from the full index using the IndexReader.docFreq method:
int docNumbersWithTerm = reader.docFreq(new Term(field, termsEnum.term()));
System.out.println(docNumbersWithTerm);
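The difference between the two views can be illustrated without Lucene. The following toy model is only a sketch under stated assumptions: documents are plain lists of tokens, and the class DocFreqSketch and its docFreqs method are hypothetical names, not Lucene APIs. It counts document frequency once over the whole collection and once over a one-document "index", which is roughly how a term vector behaves.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model (hypothetical, not Lucene code): document frequency
// depends on which set of documents the statistics are computed over.
final class DocFreqSketch {

    /** Counts, for each term, how many of the given documents contain it. */
    static Map<String, Integer> docFreqs(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // A document counts once per term, however often the term repeats.
            Set<String> unique = new HashSet<>(doc);
            for (String term : unique) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    public static void main(String[] args) {
        List<List<String>> index = Arrays.asList(
                Arrays.asList("lucene", "index", "lucene"),
                Arrays.asList("lucene", "search"),
                Arrays.asList("index", "reader"));

        // Whole collection: "lucene" appears in two documents.
        System.out.println(docFreqs(index).get("lucene")); // prints 2

        // One-document view, analogous to a term vector: always 1.
        List<List<String>> singleDoc = Collections.singletonList(index.get(0));
        System.out.println(docFreqs(singleDoc).get("lucene")); // prints 1
    }
}
```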

Find list of terms indexed by Lucene

Is it possible to extract the list of all the terms in a Lucene index as a list of strings? I couldn't find that functionality in the doc. Thanks!
In Lucene 4 (and 5):
Terms terms = SlowCompositeReaderWrapper.wrap(directoryReader).terms("field");
Edit:
This seems to be the 'correct' way now (Lucene 6 and up):
LuceneDictionary ld = new LuceneDictionary( indexReader, "field" );
BytesRefIterator iterator = ld.getEntryIterator();
BytesRef byteRef = null;
while ( ( byteRef = iterator.next() ) != null )
{
String term = byteRef.utf8ToString();
}
Lucene 3:
C#: C# Lucene get all the index
Java:
IndexReader indexReader = IndexReader.open(path);
TermEnum termEnum = indexReader.terms();
while (termEnum.next()) {
Term term = termEnum.term();
System.out.println(term.text());
}
termEnum.close();
indexReader.close();
Java (all terms for a specific field): How can I get the list of unique terms from a specific field in Lucene?
Python: Finding a single fields terms with Lucene (PyLucene)

How to score a small set of docs in Lucene

I would like to compute the scores for a small number of documents (rather than for the entire collection) for a given query. My attempt, as follows, returns 0 scores for each document, even though the queries I test with were derived from the terms in the documents I am trying to score. I am using Lucene 3.0.3.
List<Float> score(IndexReader reader, Query query, List<Integer> newDocs ) {
List<Float> scores = new ArrayList<Float>();
IndexSearcher searcher = new IndexSearcher(reader);
Collector collector = TopScoreDocCollector.create(newDocs.size(), true);
Weight weight = query.createWeight(searcher);
Scorer scorer = weight.scorer(reader, true, true);
collector.setScorer(scorer);
float score = 0.0f;
for(Integer d: newDocs) {
scorer.advance(d);
collector.collect(d);
score = scorer.score();
System.out.println( "doc: " + d + "; score=" + score);
scores.add( new Float(score) );
}
return scores;
}
I am obviously missing something in the setup of scoring, but I cannot figure out from the Lucene source code what that might be.
Thanks in advance,
Gene
Use a filter, and do a search with that filter. Then just iterate through the results as you would with a normal search - Lucene will handle the filtering.
In general, if you are looking at DocIds, you're probably using a lower-level API than you need to, and it's going to give you trouble.