TermFreqVector lucene .net - lucene

I can get docs by category like this:
IndexSearcher searcher = new IndexSearcher(dir);
Term t = new Term("category", "Feline");
Query query = new TermQuery(t);
Hits hits = searcher.Search(query);
for (int c = 0; c < hits.Length(); c++)
{
Document d = hits.Doc(c);
Console.WriteLine(c + " " + d.GetField("category").StringValue());
}
Now I would like to obtain the TermFreqVector for the docs in hits. I would usually do this like so:
for (int c = 0; c < searcher.MaxDoc(); c++)
{
TermFreqVector TermFreqVector = IndexReader.GetTermFreqVector(c, "content");
String[] terms = TermFreqVector.GetTerms();//get the terms
int[] freqs = TermFreqVector.GetTermFrequencies();//
}
However, I am not sure how to do it in my scenario (i.e. just get them for the docs in hits). The docs also have a db pk.
Thanks.
Christian

The first parameter to IndexReader.GetTermFreqVector ("c" in your example) is the document number. hits.Id(c) returns the document number of the c-th result, so inside your loop over hits you'd do something like:
int id = hits.Id(c);
TermFreqVector termFreqVector = IndexReader.GetTermFreqVector(id, "content");
// etc.
(As a side note: the Hits class is deprecated; you probably want to use something like HitCollector or a different search overload instead.)

Related

What is the right way to get term positions in a Lucene document?

The example in this question and some others I've seen on the web use postings method of a TermVector to get terms positions. Copy paste from the example in the linked question:
IndexReader ir = obtainIndexReader();
Terms tv = ir.getTermVector( doc, field );
TermsEnum terms = tv.iterator();
PostingsEnum p = null;
while( terms.next() != null ) {
p = terms.postings( p, PostingsEnum.ALL );
while( p.nextDoc() != PostingsEnum.NO_MORE_DOCS ) {
int freq = p.freq();
for( int i = 0; i < freq; i++ ) {
int pos = p.nextPosition(); // Always returns -1!!!
BytesRef data = p.getPayload();
doStuff( freq, pos, data ); // Fails miserably, of course.
}
}
}
This code works for me, but what drives me mad is that the Terms type is where the position information is kept. All the documentation I've seen keeps saying that term vectors keep position data. However, there are no methods on this type to get that information!
Older versions of Lucene apparently had a method but as of at least version 6.5.1 of Lucene, that is not the case.
Instead I'm supposed to use postings method and traverse the documents but I already know which document I want to work on!
The API documentation does not say anything about postings returning only the current document (the one the term vector belongs to) but when I run it, I only get the current doc.
Is this the correct and only way to get position data from term vectors? Why such an unintuitive API? Is there a document that explains why the previous approach changed in favour of this?
Don't know about "right or wrong" but for version 6.6.3 this seems to work.
private void run() throws Exception {
Directory directory = new RAMDirectory();
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new StandardAnalyzer());
IndexWriter writer = new IndexWriter(directory, indexWriterConfig);
Document doc = new Document();
// Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES
FieldType type = new FieldType();
type.setStoreTermVectors(true);
type.setStoreTermVectorPositions(true);
type.setStoreTermVectorOffsets(true);
type.setIndexOptions(IndexOptions.DOCS);
Field fieldStore = new Field("tags", "foo bar and then some", type);
doc.add(fieldStore);
writer.addDocument(doc);
writer.close();
DirectoryReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
Term t = new Term("tags", "bar");
Query q = new TermQuery(t);
TopDocs results = searcher.search(q, 1);
for ( ScoreDoc scoreDoc: results.scoreDocs ) {
Fields termVs = reader.getTermVectors(scoreDoc.doc);
Terms f = termVs.terms("tags");
TermsEnum te = f.iterator();
PostingsEnum docsAndPosEnum = null;
BytesRef bytesRef;
while ( (bytesRef = te.next()) != null ) {
docsAndPosEnum = te.postings(docsAndPosEnum, PostingsEnum.ALL);
// for each term (iterator next) in this field (field)
// iterate over the docs (should only be one)
int nextDoc = docsAndPosEnum.nextDoc();
assert nextDoc != DocIdSetIterator.NO_MORE_DOCS;
final int fr = docsAndPosEnum.freq();
final int p = docsAndPosEnum.nextPosition();
final int o = docsAndPosEnum.startOffset();
System.out.println("p="+ p + ", o=" + o + ", l=" + bytesRef.length + ", f=" + fr + ", s=" + bytesRef.utf8ToString());
}
}
}

Lucene - Iterating through TermsEnum for docfreq

I am trying to get the document frequency for each term in a TermsEnum, but I always get "1" as the document frequency for every term. Any hint as to what the problem could be? This is my code:
Terms terms = reader.getTermVector(docId, field);
TermsEnum termsEnum = null;
termsEnum = terms.iterator(termsEnum);
BytesRef termText = null;
while((termsEnum.next()) != null){
int docNumbersWithTerm = termsEnum.docFreq();
System.out.println(docNumbersWithTerm);
}
The Terms instance from IndexReader.getTermVector acts as if you have a single-document index, composed entirely of the document specified. Since there is only one document to consider in this context, you will always get docFreq() = 1. You can get the document frequency from the full index using the IndexReader.docFreq method (note that the Term constructor takes the field first, then the term bytes):
int docNumbersWithTerm = reader.docFreq(new Term(field, termsEnum.term()));
System.out.println(docNumbersWithTerm);
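To make the distinction concrete, here is a minimal sketch in plain Java that does not use Lucene at all. The class name DocFreqSketch and the toy documents are made up for illustration. It counts document frequencies over a list of tokenized documents, once for a single-document "index" (which is effectively what a term vector is) and once for the whole collection.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model, NOT Lucene: shows why a per-document view always reports
// a document frequency of 1 for every term it contains.
class DocFreqSketch {

    // For each term, counts how many of the given documents contain it at least once.
    static Map<String, Integer> docFreqs(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // De-duplicate so a term repeated inside one document counts once.
            Set<String> unique = new HashSet<>(doc);
            for (String term : unique) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    public static void main(String[] args) {
        List<List<String>> index = Arrays.asList(
                Arrays.asList("cat", "sat", "cat"),
                Arrays.asList("cat", "ran"),
                Arrays.asList("dog", "ran"));

        // "Term vector" view: only the first document is visible.
        System.out.println("single-doc view: " + docFreqs(index.subList(0, 1)));
        // Full-index view: all documents are visible.
        System.out.println("full-index view: " + docFreqs(index));
    }
}
```

In the single-document view every term maps to 1, no matter how often it repeats inside that document; only the full-index view can report that "cat" occurs in two documents.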

How to get Document ids for Document Term Vector in Lucene

I am new to the Lucene world and don't have much working knowledge of the subject. I need to extract the document term vector, and I found the following code online in the question "How to extract Document Term Vector in Lucene 3.5.0".
/**
* Sums the term frequency vector of each document into a single term frequency map
* @param indexReader the index reader, the document numbers are specific to this reader
* @param docNumbers document numbers to retrieve frequency vectors from
* @param fieldNames field names to retrieve frequency vectors from
* @param stopWords terms to ignore
* @return a map of each term to its frequency
* @throws IOException
*/
private Map<String,Integer> getTermFrequencyMap(IndexReader indexReader, List<Integer> docNumbers, String[] fieldNames, Set<String> stopWords)
throws IOException {
Map<String,Integer> totalTfv = new HashMap<String,Integer>(1024);
for (Integer docNum : docNumbers) {
for (String fieldName : fieldNames) {
TermFreqVector tfv = indexReader.getTermFreqVector(docNum, fieldName);
if (tfv == null) {
// ignore empty fields
continue;
}
String terms[] = tfv.getTerms();
int termCount = terms.length;
int freqs[] = tfv.getTermFrequencies();
for (int t=0; t < termCount; t++) {
String term = terms[t];
int freq = freqs[t];
// filter out single-letter words and stop words
if (StringUtils.length(term) < 2 ||
stopWords.contains(term)) {
continue; // stop
}
Integer totalFreq = totalTfv.get(term);
totalFreq = (totalFreq == null) ? freq : freq + totalFreq;
totalTfv.put(term, totalFreq);
}
}
}
return totalTfv;
}
I have created the index which resides in the following directory.
String indexDir = "C:\\Lucene\\Output\\";
Directory dir = FSDirectory.open(new File(indexDir));
IndexReader reader = IndexReader.open(dir);
My problem is that I do not know how to get the doc ids (the List<Integer> docNumbers argument) required by the function above. I have tried a couple of methods like
TermDocs docs = reader.termDocs();
but it did not work.
Lucene starts assigning ids from zero, and maxDoc() is the upper limit, so you can simply loop to get all ids, skipping deleted documents (Lucene marks them for deletion when you call deleteDocument):
for (int docNum=0; docNum < reader.maxDoc(); docNum++) {
if (reader.isDeleted(docNum)) {
continue;
}
TermFreqVector tfv = reader.getTermFreqVector(docNum, "fieldName");
...
}
For this to work, you have to enable term vectors for the field during indexing; see Field.TermVector.
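If you want the ids as a List<Integer> to pass to getTermFrequencyMap, the loop can be factored into a small helper. This is only a sketch: the class name LiveDocNumbers is hypothetical, and the IntPredicate stands in for the reader's deletion check so the code runs without Lucene on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical helper: collects every document number in [0, maxDoc)
// that the supplied predicate does not report as deleted.
class LiveDocNumbers {

    static List<Integer> collect(int maxDoc, IntPredicate isDeleted) {
        List<Integer> docNumbers = new ArrayList<>();
        for (int docNum = 0; docNum < maxDoc; docNum++) {
            if (!isDeleted.test(docNum)) {
                docNumbers.add(docNum);
            }
        }
        return docNumbers;
    }

    public static void main(String[] args) {
        // Pretend documents 1 and 3 were deleted from a 5-document index.
        List<Integer> live = collect(5, n -> n == 1 || n == 3);
        System.out.println(live);
    }
}
```

With a Lucene 3.x reader, the assumption is that you would call it as collect(reader.maxDoc(), reader::isDeleted) on Java 8 or later, or pass an equivalent anonymous class on older runtimes.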

loading objects from a list of composite-ids in nhibernate

What I want to do is build an HQL query that accepts a list of ids and returns a list of loaded objects. After a while, I found that something like this could work:
from Foo foo where foo.ID in (:IdList)
However, this only works for single ids, because when I try to use it for composite ids the app throws the following exception:
System.ArgumentOutOfRangeException : Index was out of range. Must be non-negative and less than the size of the collection. Parameter name: index
I'm clueless...
I created a custom type for my id object, hoping I could tell NHibernate how to use it, but it didn't work out.
So do you have any ideas?
thanks
I can't think of a SQL query that can do this (as far as I know, IN can't take pairs as input).
Would this suffice? (Off the top of my head; I can't test it right now.)
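var query = "from Foo foo where ";
for (int i = 0; i < idlist.Count; i++)
{
// put "or" only between conditions, never directly after "where"
if (i > 0) query += " or ";
query += "foo.ID = :p" + i;
}
var hqlquery = session.CreateQuery(query);
for (int i = 0; i < idlist.Count; i++)
{
// bind the i-th id (not always the first one)
hqlquery.SetParameter("p" + i, idlist[i]);
}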
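The string-building step is easy to get subtly wrong (a leading "or", a missing space), so it can help to isolate and test it on its own. Below is a minimal sketch in plain Java rather than C#, with no NHibernate involved; the class name OrClauseBuilder and its parameters are made up for illustration. It only produces the query text with one named parameter per id. Binding the values is still done on the query object.

```java
// Hypothetical helper: builds "<prefix><property> = :p0 or <property> = :p1 ..."
// with one named parameter per id. It does not talk to any ORM.
class OrClauseBuilder {

    static String build(String prefix, String property, int idCount) {
        if (idCount <= 0) {
            // An empty "where" clause would be invalid, so refuse early.
            throw new IllegalArgumentException("idCount must be positive");
        }
        StringBuilder sb = new StringBuilder(prefix);
        for (int i = 0; i < idCount; i++) {
            if (i > 0) {
                sb.append(" or "); // separator goes between conditions only
            }
            sb.append(property).append(" = :p").append(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("from Foo foo where ", "foo.ID", 3));
    }
}
```

For a large id list this produces a long query; splitting the list into batches is one common way to keep the statement size manageable.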

how to achieve pagination in lucene?

Wondering how to achieve pagination in Lucene, as it does not inherently support pagination. I basically need to search for 'top 10 entries' (based on some parameter) then 'next 10 entries' and so on. And at the same time I don't want Lucene to hog memory.
Any piece of advice would be appreciated.
Thanks in advance.
You will need to apply your own paging mechanism, something similar to that below.
IList<Document> luceneDocuments = new List<Document>();
IndexReader indexReader = IndexReader.Open(directory); // IndexReader is abstract, so use the factory method
Searcher searcher = new IndexSearcher(indexReader);
TopDocs results = searcher.Search("Your Query", null, skipRecords + takeRecords);
ScoreDoc[] scoreDocs = results.scoreDocs;
for (int i = skipRecords; i < results.totalHits; i++)
{
if (i > (skipRecords + takeRecords) - 1)
{
break;
}
luceneDocuments.Add(searcher.Doc(scoreDocs[i].doc));
}
You will find that iterating the scoreDocs array is lightweight, as the data contained within the index is not really used until the searcher.Doc method is called.
Please note that this example was written against a slightly modified version of Lucene.NET 2.3.2, but the basic principle should work against any recent version of Lucene.
Another version of the loop, continuing from Kane's code snippet:
....................
ScoreDoc[] scoreDocs = results.scoreDocs;
int pageIndex = [User Value];
int pageSize = [Configured Value];
int startIndex = (pageIndex - 1) * pageSize;
int endIndex = pageIndex * pageSize;
endIndex = results.totalHits < endIndex? results.totalHits:endIndex;
for (int i = startIndex ; i < endIndex ; i++)
{
luceneDocuments.Add(searcher.Doc(scoreDocs[i].doc));
}
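The index arithmetic in the loop above can be pulled into a small helper and checked in isolation. This is a sketch in plain Java, not part of Lucene; PageWindow and its field names are hypothetical. It uses 1-based page numbers as in the snippet above and clamps both ends, so a page past the last result yields an empty window instead of an out-of-range index.

```java
// Hypothetical helper: computes the half-open range [start, end) of result
// positions for a 1-based page, clamped to the number of available results.
class PageWindow {
    final int start;
    final int end;

    private PageWindow(int start, int end) {
        this.start = start;
        this.end = end;
    }

    static PageWindow of(int pageIndex, int pageSize, int available) {
        if (pageIndex < 1 || pageSize < 1 || available < 0) {
            throw new IllegalArgumentException("invalid paging arguments");
        }
        // Use long arithmetic so large page numbers cannot overflow int.
        long rawStart = (long) (pageIndex - 1) * pageSize;
        int start = (int) Math.min(rawStart, available);
        int end = (int) Math.min(rawStart + pageSize, available);
        return new PageWindow(start, end);
    }

    public static void main(String[] args) {
        PageWindow w = PageWindow.of(3, 10, 25);
        System.out.println(w.start + ".." + w.end);
    }
}
```

One thing to keep in mind: scoreDocs holds at most the number of hits requested from the search call, which can be smaller than totalHits. Passing scoreDocs.length as the available count is therefore the safer choice when indexing into that array.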
I use the following approach to paginate; maybe it will help someone. If you know a better strategy, specifically from a performance point of view, please share.
public TopDocs search(String query, int pageNumber) throws IOException, ParseException {
Query searchQuery = parser.parse(query);
TopScoreDocCollector collector = TopScoreDocCollector.create(1000, true);
int startIndex = (pageNumber - 1) * MyApp.SEARCH_RESULT_PAGE_SIZE;
searcher.search(searchQuery, collector);
TopDocs topDocs = collector.topDocs(startIndex, MyApp.SEARCH_RESULT_PAGE_SIZE);
return topDocs;
}