How to achieve pagination in Lucene?

Wondering how to achieve pagination in Lucene, as it does not inherently support pagination. I basically need to search for the 'top 10 entries' (based on some parameter), then the 'next 10 entries', and so on. At the same time, I don't want Lucene to hog memory.
Any piece of advice would be appreciated.
Thanks in advance.

You will need to apply your own paging mechanism, something similar to that below.
// skipRecords = number of hits to skip, takeRecords = page size
IList<Document> luceneDocuments = new List<Document>();
IndexReader indexReader = IndexReader.Open(directory); // IndexReader is abstract; obtain one via Open
Searcher searcher = new IndexSearcher(indexReader);
// Ask for just enough hits to cover the requested page; query is your parsed Query object.
TopDocs results = searcher.Search(query, null, skipRecords + takeRecords);
ScoreDoc[] scoreDocs = results.scoreDocs;
// scoreDocs holds at most skipRecords + takeRecords entries, so looping to its
// length is equivalent to checking the skip/take bounds on every iteration.
for (int i = skipRecords; i < scoreDocs.Length; i++)
{
    luceneDocuments.Add(searcher.Doc(scoreDocs[i].doc));
}
You will find that iterating the scoreDocs array is lightweight, as the data contained in the index is not actually loaded until the searcher.Doc method is called.
Please note that this example was written against a slightly modified version of Lucene.NET 2.3.2, but the basic principle should work against any recent version of Lucene.

Another version of the loop, continuing from Kane's code snippet:
....................
ScoreDoc[] scoreDocs = results.scoreDocs;
int pageIndex = [User Value];
int pageSize = [Configured Value];
int startIndex = (pageIndex - 1) * pageSize;
int endIndex = pageIndex * pageSize;
// Never read past the end of the result set.
endIndex = results.totalHits < endIndex ? results.totalHits : endIndex;
for (int i = startIndex; i < endIndex; i++)
{
    luceneDocuments.Add(searcher.Doc(scoreDocs[i].doc));
}
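For example, with pageIndex = 3 and pageSize = 10, startIndex is 20 and endIndex is 30 (capped at totalHits), so the loop copies hits 20 through 29. Just make sure the original search requested at least endIndex results.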

I use the following way to paginate; maybe it helps someone. If you know a better strategy, specifically from a performance point of view, please share.
public TopDocs search(String query, int pageNumber) throws IOException, ParseException {
    Query searchQuery = parser.parse(query);
    // Collect the top 1000 hits; pages beyond that cap cannot be served.
    TopScoreDocCollector collector = TopScoreDocCollector.create(1000, true);
    int startIndex = (pageNumber - 1) * MyApp.SEARCH_RESULT_PAGE_SIZE;
    searcher.search(searchQuery, collector);
    // Slice just the requested page out of the collected hits.
    TopDocs topDocs = collector.topDocs(startIndex, MyApp.SEARCH_RESULT_PAGE_SIZE);
    return topDocs;
}
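For deeper pages it can be cheaper to use a search-after cursor instead of re-collecting everything before the requested page. Here is a minimal sketch, assuming a searcher and query are already in place; PAGE_SIZE is a hypothetical constant, and IndexSearcher.searchAfter is available from Lucene 3.5 on:

ScoreDoc lastHit = null; // null means "start from the first page"
TopDocs page;
do {
    // Resume collecting hits after the last hit of the previous page.
    page = searcher.searchAfter(lastHit, query, PAGE_SIZE);
    for (ScoreDoc sd : page.scoreDocs) {
        Document doc = searcher.doc(sd.doc); // stored fields fetched per hit
        // render doc ...
    }
    if (page.scoreDocs.length > 0) {
        lastHit = page.scoreDocs[page.scoreDocs.length - 1];
    }
} while (page.scoreDocs.length == PAGE_SIZE);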

Related

What is the right way to get term positions in a Lucene document?

The example in this question and some others I've seen on the web use the postings method of a term vector's TermsEnum to get term positions. Copied from the example in the linked question:
IndexReader ir = obtainIndexReader();
Terms tv = ir.getTermVector( doc, field );
TermsEnum terms = tv.iterator();
PostingsEnum p = null;
while ( terms.next() != null ) {
    p = terms.postings( p, PostingsEnum.ALL );
    while ( p.nextDoc() != PostingsEnum.NO_MORE_DOCS ) {
        int freq = p.freq();
        for ( int i = 0; i < freq; i++ ) {
            int pos = p.nextPosition(); // Always returns -1!!!
            BytesRef data = p.getPayload();
            doStuff( freq, pos, data ); // Fails miserably, of course.
        }
    }
}
This code works for me, but what drives me mad is that the Terms type is where the position information is supposed to be kept. All the documentation I've seen keeps saying that term vectors keep position data. However, there are no methods on this type to get that information!
Older versions of Lucene apparently had such a method, but as of at least Lucene 6.5.1 that is no longer the case.
Instead, I'm supposed to use the postings method and traverse the documents, but I already know which document I want to work on!
The API documentation does not say anything about postings returning only the current document (the one the term vector belongs to), but when I run it, I only get the current doc.
Is this the correct and only way to get position data from term vectors? Why such an unintuitive API? Is there a document that explains why the previous approach changed in favour of this?
I don't know about "right or wrong", but for version 6.6.3 this seems to work.
private void run() throws Exception {
    Directory directory = new RAMDirectory();
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new StandardAnalyzer());
    IndexWriter writer = new IndexWriter(directory, indexWriterConfig);

    Document doc = new Document();
    // Equivalent of the old Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES
    FieldType type = new FieldType();
    type.setStoreTermVectors(true);
    type.setStoreTermVectorPositions(true);
    type.setStoreTermVectorOffsets(true);
    type.setIndexOptions(IndexOptions.DOCS);
    Field fieldStore = new Field("tags", "foo bar and then some", type);
    doc.add(fieldStore);
    writer.addDocument(doc);
    writer.close();

    DirectoryReader reader = DirectoryReader.open(directory);
    IndexSearcher searcher = new IndexSearcher(reader);
    Term t = new Term("tags", "bar");
    Query q = new TermQuery(t);
    TopDocs results = searcher.search(q, 1);

    for ( ScoreDoc scoreDoc : results.scoreDocs ) {
        Fields termVs = reader.getTermVectors(scoreDoc.doc);
        Terms f = termVs.terms("tags");
        TermsEnum te = f.iterator();
        PostingsEnum docsAndPosEnum = null;
        BytesRef bytesRef;
        while ( (bytesRef = te.next()) != null ) {
            docsAndPosEnum = te.postings(docsAndPosEnum, PostingsEnum.ALL);
            // For each term (iterator next) in this field,
            // iterate over the docs (there should be only one).
            int nextDoc = docsAndPosEnum.nextDoc();
            assert nextDoc != DocIdSetIterator.NO_MORE_DOCS;
            final int fr = docsAndPosEnum.freq();
            final int p = docsAndPosEnum.nextPosition();
            final int o = docsAndPosEnum.startOffset();
            System.out.println("p=" + p + ", o=" + o + ", l=" + bytesRef.length + ", f=" + fr + ", s=" + bytesRef.utf8ToString());
        }
    }
}
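Note that the FieldType above explicitly enables term-vector positions and offsets (setStoreTermVectorPositions / setStoreTermVectorOffsets). If a term vector was indexed without positions, nextPosition returns -1, which matches the symptom described in the question.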

MoreLikeThis not returning 100% score rate in Lucene.Net when comparing the same document with each other

I don't know if I am calling Lucene.Net correctly. I'm trying to call the MoreLikeThis function to compare a document to itself, and I'm only getting a score of 0.3174651 where I think I should be getting a score of 1.0. Am I expecting the wrong thing?
This is my code:
int docId = hits[i].Doc;
var query2 = mlt.Like(docId);
TopScoreDocCollector collector = TopScoreDocCollector.Create(100, true);
searcher.Search(query2, collector);
ScoreDoc[] hits2 = collector.TopDocs().ScoreDocs;
var result = new List<string>();
for (int k = 0; k < hits2.Length; k++)
{
    docId = hits2[k].Doc;
    float score = hits2[k].Score;
}
Am I doing something wrong?
The only thing you are doing wrong is thinking that Lucene scores are percentages. They aren't.
Document scores for a query are to be used to compare the strength of matches within the context of that single query. They are quite effective at sorting results, but they are not percentages, and are not generally suitable for display to a user.

TermFreqVector lucene .net

I can get docs by category like this:
IndexSearcher searcher = new IndexSearcher(dir);
Term t = new Term("category", "Feline");
Query query = new TermQuery(t);
Hits hits = searcher.Search(query);
for (int c = 0; c < hits.Length(); c++)
{
    Document d = hits.Doc(c);
    Console.WriteLine(c + " " + d.GetField("category").StringValue());
}
Now I would like to obtain the TermFreqVector for the docs in hits. I would usually do this like so:
IndexReader reader = IndexReader.Open(dir); // GetTermFreqVector is an instance method on IndexReader
for (int c = 0; c < reader.MaxDoc(); c++)
{
    TermFreqVector termFreqVector = reader.GetTermFreqVector(c, "content");
    String[] terms = termFreqVector.GetTerms();          // get the terms
    int[] freqs = termFreqVector.GetTermFrequencies();   // and their frequencies
}
However, I am not sure how to do it in my scenario (i.e. just get them for the docs in hits). The docs also have a db pk.
Thanks.
Christian
The first parameter to GetTermFreqVector ("c" in your example) is the internal document number. hits.Id(c) will return the document number of the c-th result. So you'd do something like:
int id = hits.Id(c);
TermFreqVector termFreqVector = reader.GetTermFreqVector(id, "content");
// etc.
(As a side note: the Hits class is deprecated; you probably want to use something like HitCollector or a different search overload instead.)
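For reference, the same idea expressed against the Java API (Lucene 2.x/3.x), sticking with the deprecated Hits class from the question for illustration. This is only a sketch: dir is assumed to be your index Directory, and the field must have been indexed with term vectors or getTermFreqVector returns null.

IndexReader reader = IndexReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
Hits hits = searcher.search(new TermQuery(new Term("category", "Feline")));
for (int c = 0; c < hits.length(); c++) {
    int docNumber = hits.id(c); // internal doc number of the c-th hit
    TermFreqVector tfv = reader.getTermFreqVector(docNumber, "content");
    if (tfv != null) { // null when the field has no term vector
        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();
    }
}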

Lucene DuplicateFilter question

Why doesn't DuplicateFilter work together with other filters? For example, with a small remake of the DuplicateFilterTest test, I get the impression that the filter is not composed with the other filters but trims the results first:
public void testKeepsLastFilter() throws Throwable {
    DuplicateFilter df = new DuplicateFilter(KEY_FIELD);
    df.setKeepMode(DuplicateFilter.KM_USE_LAST_OCCURRENCE);
    Query q = new ConstantScoreQuery(new ChainedFilter(new Filter[]{
        new QueryWrapperFilter(tq),
        // new QueryWrapperFilter(new TermQuery(new Term("text", "out"))), // works right, it is the last document.
        new QueryWrapperFilter(new TermQuery(new Term("text", "now"))) // why doesn't this work? It is the third document, but the hit count is 0.
    }, ChainedFilter.AND));
    // these variants don't hit either:
    // ScoreDoc[] hits = searcher.search(new FilteredQuery(tq, df), new QueryWrapperFilter(new TermQuery(new Term("text", "now"))), 1000).scoreDocs;
    // ScoreDoc[] hits = searcher.search(new FilteredQuery(tq, new QueryWrapperFilter(new TermQuery(new Term("text", "now")))), df, 1000).scoreDocs;
    ScoreDoc[] hits = searcher.search(q, df, 1000).scoreDocs;
    assertTrue("Filtered searching should have found some matches", hits.length > 0);
    for (int i = 0; i < hits.length; i++) {
        Document d = searcher.doc(hits[i].doc);
        String url = d.get(KEY_FIELD);
        TermDocs td = reader.termDocs(new Term(KEY_FIELD, url));
        int lastDoc = 0;
        while (td.next()) {
            lastDoc = td.doc();
        }
        assertEquals("Duplicate urls should return last doc", lastDoc, hits[i].doc);
    }
}
DuplicateFilter independently constructs a filter which chooses either the first or last occurrence of all documents containing each key. This can be cached with minimal memory overhead.
Your second filter independently selects some other documents, and the two choices may not coincide. In your test, the document containing "now" is evidently not the last occurrence of its key, so DuplicateFilter discards it and the AND of the two filters is empty. Filtering duplicates relative to an arbitrary subset of all docs would probably require a field cache to be performant, and that is where things get expensive RAM-wise.

Is it possible to iterate through documents stored in Lucene Index?

I have some documents stored in a Lucene index with a docId field.
I want to get all docIds stored in the index. There is one complication: the number of documents is about 300,000, so I would prefer to get these docIds in chunks of size 500. Is it possible to do so?
IndexReader reader = // create IndexReader
for (int i = 0; i < reader.maxDoc(); i++) {
    if (reader.isDeleted(i))
        continue;
    Document doc = reader.document(i);
    String docId = doc.get("docId");
    // do something with docId here...
}
For Lucene 4:
Bits liveDocs = MultiFields.getLiveDocs(reader);
for (int i = 0; i < reader.maxDoc(); i++) {
    if (liveDocs != null && !liveDocs.get(i))
        continue; // document is deleted
    Document doc = reader.document(i);
}
See LUCENE-2600 on this page for details: https://lucene.apache.org/core/4_0_0/MIGRATE.html
There is a query class named MatchAllDocsQuery; I think it can be used in this case:
Query query = new MatchAllDocsQuery();
TopDocs topDocs = getIndexSearcher().search(query, RESULT_LIMIT);
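Combined with a paging collector, this can serve the chunks of 500 the question asks for. A rough sketch, assuming a searcher is available and a hypothetical chunk counter that starts at 0:

int chunkSize = 500;
Query all = new MatchAllDocsQuery();
// Collect enough hits to cover the requested chunk, then slice it out.
TopScoreDocCollector collector = TopScoreDocCollector.create((chunk + 1) * chunkSize, true);
searcher.search(all, collector);
TopDocs topDocs = collector.topDocs(chunk * chunkSize, chunkSize);
for (ScoreDoc sd : topDocs.scoreDocs) {
    String docId = searcher.doc(sd.doc).get("docId");
    // process docId ...
}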
Document numbers (or ids) are consecutive numbers from 0 to IndexReader.maxDoc() - 1. These numbers are not persistent and are valid only for the opened IndexReader. You can check whether a document is deleted with the IndexReader.isDeleted(int documentNumber) method.
If you use .document(i) as in the examples above and skip over deleted documents, be careful when using this method for paginating results.
i.e.: You have a list of 10 docs per page and you need to get the docs for page 6. Your input might be something like this: offset = 60, count = 10 (documents 60 to 70).
IndexReader reader = // create IndexReader
for (int i = offset; i < offset + 10; i++) {
    if (reader.isDeleted(i))
        continue;
    Document doc = reader.document(i);
    String docId = doc.get("docId");
}
You will have some problems with the deleted ones, because you should not start from offset = 60 but from offset = 60 plus the number of deleted documents that appear before 60; one way around this is sketched below.
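A minimal sketch of that idea, counting only live documents so deletions cannot shift the page (Lucene 3.x-style API; offset and count are the page parameters):

int seen = 0;      // live documents passed so far
int collected = 0; // documents collected for this page
for (int i = 0; i < reader.maxDoc() && collected < count; i++) {
    if (reader.isDeleted(i)) {
        continue; // deleted docs do not advance the live-document counter
    }
    if (seen++ >= offset) {
        Document doc = reader.document(i);
        String docId = doc.get("docId");
        collected++;
        // do something with docId ...
    }
}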
An alternative I found is something like this:
IndexSearcher is = getIndexSearcher(); // new IndexSearcher(indexReader)
// Get all results without any conditions attached.
Term term = new Term([[any mandatory field name]], "*");
Query query = new WildcardQuery(term);
TopScoreDocCollector topCollector = TopScoreDocCollector.create([[int max hits to get]], true);
is.search(query, topCollector);
TopDocs topDocs = topCollector.topDocs(offset, count);
Note: replace the text between [[ ]] with your own values.
I ran this on a large index with 1.5 million entries and got an arbitrary page of 10 results in less than a second.
Agreed, it is slower, but at least you can ignore deleted documents if you need pagination.