Find list of terms indexed by Lucene - lucene

Is it possible to extract the list of all the terms in a Lucene index as a list of strings? I couldn't find that functionality in the doc. Thanks!

In Lucene 4 (and 5):
Terms terms = SlowCompositeReaderWrapper.wrap(directoryReader).terms("field");
Edit:
This seems to be the 'correct' way now (Lucene 6 and up):
LuceneDictionary ld = new LuceneDictionary( indexReader, "field" );
BytesRefIterator iterator = ld.getWordsIterator();
BytesRef byteRef = null;
while ( ( byteRef = iterator.next() ) != null )
{
String term = byteRef.utf8ToString();
}

Lucene 3:
C#: C# Lucene get all the index
Java:
IndexReader indexReader = IndexReader.open(path);
TermEnum termEnum = indexReader.terms();
while (termEnum.next()) {
Term term = termEnum.term();
System.out.println(term.text());
}
termEnum.close();
indexReader.close();
Java (all terms for a specific field): How can I get the list of unique terms from a specific field in Lucene?
Python: Finding a single fields terms with Lucene (PyLucene)

Related

What is an alternative for Lucene Query's extractTerms?

In Lucene 4.6.0 there was the method extractTerms that provided the extraction of terms from a query (Query 4.6.0). However, from Lucene 6.2.1, it does no longer exist (Query Lucene 6.2.1). Is there a valid alternative for it?
What I'd need is to parse terms (and corrispondent fields) of a Query built by QueryParser.
Maybe not the best answer but one way is to use the same analyzer and tokenize the query string:
Analyzer anal = new StandardAnalyzer();
TokenStream ts = anal.tokenStream("title", query); // string query
CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
System.out.println(termAtt.toString());
}
anal.close();
I have temporarely solved my problem with the following code. Smarter alternatives will be well accepted:
QueryParser qp = new QueryParser("title", a);
Query q = qp.parse(query);
Set<Term> termQuerySet = new HashSet<Term>();
Weight w = searcher.createWeight(q, true, 3.4f);
w.extractTerms(termQuerySet);

Lucene - Iterating through TermsEnum for docfreq

I am trying to get the doc frequency for each term in term enum. But I getting everytime only a "1" for the document frequency for all terms. Any hint, what the problem could be? This is my code:
Terms terms = reader.getTermVector(docId, field);
TermsEnum termsEnum = null;
termsEnum = terms.iterator(termsEnum);
BytesRef termText = null;
while((termsEnum.next()) != null){
int docNumbersWithTerm = termsEnum.docfreq();
System.out.println(docNumbersWithTerm);
}
The Terms instance from IndexReader.getTermVector acts as if you have a single-document index, comprised entirely of the document specified. Since there is only one document to consider in this context, you should always get docfreq() = 1. You could generate the docfreq from the full index using the IndexReader.docFreq method:
int docNumbersWithTerm = reader.docFreq(new Term(termsEnum.term(), field));
System.out.println(docNumbersWithTerm);

Lucene 4.4. How to get term frequency over all index?

I'm trying to compute tf-idf value of each term in a document. So, I iterate through the terms in a document and want to find the frequency of the term in the whole corpus and the number of documents in which the term appears. Following is my code:
//#param index path to index directory
//#param docNbr the document number in the index
public void readingIndex(String index, int docNbr) {
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
Document doc = reader.document(docNbr);
System.out.println("Processing file: "+doc.get("id"));
Terms termVector = reader.getTermVector(docNbr, "contents");
TermsEnum itr = termVector.iterator(null);
BytesRef term = null;
while ((term = itr.next()) != null) {
String termText = term.utf8ToString();
long termFreq = itr.totalTermFreq(); //FIXME: this only return frequency in this doc
long docCount = itr.docFreq(); //FIXME: docCount = 1 in all cases
System.out.println("term: "+termText+", termFreq = "+termFreq+", docCount = "+docCount);
}
reader.close();
}
Although the documentation says totalTermFreq() returns the total number of occurrences of this term across all documents, when testing I found it only returns the frequency of the term in the document given by docNbr. and docFreq() always return 1.
How can I get frequency of a term across the whole index?
Update
Of course, I can create a map to map a term to its frequency. Then iterate through each document to count the total number of time a term occur. However, I thought Lucene should have a built in method for that purpose.
Thank you,
IndexReader.TotalTermFreq(Term) will provide this for you. Your calls to the similar methods on the TermsEnum are indeed providing the stats for all documents, in the enumeration. Using the reader should get you the stats for all the documents in the index itself. Something like:
String termText = term.utf8ToString();
Term termInstance = new Term("contents", term);
long termFreq = reader.totalTermFreq(termInstance);
long docCount = reader.docFreq(termInstance);
System.out.println("term: "+termText+", termFreq = "+termFreq+", docCount = "+docCount);

Is there any way to extract all the tokens from solr?

How can one extract all the tokens from solr? Not from one document, but from all the documents indexed in solr?
Thanks!
You may do something like this(This sample is approved to be working on a lucene 4.x index):
IndexSearcher isearcher = new IndexSearcher(dir, true);
IndexReader reader = isearcher.getIndexReader();
Fields fields = MultiFields.getFields(reader);
Collection<String> cols = reader.getFieldNames(IndexReader.FieldOption.ALL);
for (String col : cols) {
Terms te = fields.terms(col);
if (te != null) {
TermsEnum tex = te.getThreadTermsEnum();
while (tex.next() != null)
// do something
tex.getTerm().text();
}
}
This iterates over all columns and also over every term per col. You may lookup the methods provided by TermsEnum like getTerm().

How to score a small set of docs in Lucene

I would like to compute the scores for a small number of documents (rather than for the entire collection) for a given query. My attempt, as follows, returns 0 scores for each document, even though the queries I test with were derived from the terms in the documents I am trying to score. I am using Lucene 3.0.3.
List<Float> score(IndexReader reader, Query query, List<Integer> newDocs ) {
List<Float> scores = new List<Float>();
IndexSearcher searcher = reader.getSearcher();
Collector collector = TopScoreDocCollector.create(newDocs.size(), true);
Weight weight = query.createWeight(searcher);
Scorer scorer = weight.scorer(reader, true, true);
collector.setScorer(scorer);
float score = 0.0f;
for(Integer d: newDocs) {
scorer.advance(d);
collector.collect(d);
score = scorer.score();
System.out.println( "doc: " + d + "; score=" + score);
scores.add( new Float(score) );
}
return scores;
}
I am obviously missing something in the setup of scoring, but I cannot figure out from the Lucene source code what that might be.
Thanks in advance,
Gene
Use a filter, and do a search with that filter. Then just iterate through the results as you would with a normal search - Lucene will handle the filtering.
In general, if you are looking at DocIds, you're probably using a lower-level API than you need to, and it's going to give you trouble.