Is it possible to iterate through documents stored in Lucene Index? - lucene

I have some documents stored in a Lucene index with a docId field.
I want to get all docIds stored in the index. There is one complication: there are about 300,000 documents, so I would prefer to get the docIds in chunks of 500. Is it possible to do so?

IndexReader reader = // create IndexReader
for (int i = 0; i < reader.maxDoc(); i++) {
    if (reader.isDeleted(i))
        continue;

    Document doc = reader.document(i);
    String docId = doc.get("docId");
    // do something with docId here...
}

Lucene 4
Bits liveDocs = MultiFields.getLiveDocs(reader);
for (int i = 0; i < reader.maxDoc(); i++) {
    if (liveDocs != null && !liveDocs.get(i))
        continue;

    Document doc = reader.document(i);
}
See LUCENE-2600 on this page for details: https://lucene.apache.org/core/4_0_0/MIGRATE.html
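To address the chunking part of the question, here is a minimal sketch of my own (assuming the Lucene 4.x liveDocs loop above and a stored "docId" field; processBatch is a hypothetical callback) that collects the ids in batches of 500:

int batchSize = 500;
List<String> batch = new ArrayList<String>(batchSize);
Bits liveDocs = MultiFields.getLiveDocs(reader);
for (int i = 0; i < reader.maxDoc(); i++) {
    if (liveDocs != null && !liveDocs.get(i))
        continue;                      // skip deleted documents

    Document doc = reader.document(i);
    batch.add(doc.get("docId"));
    if (batch.size() == batchSize) {
        processBatch(batch);           // hypothetical callback - replace with your own handling
        batch.clear();
    }
}
if (!batch.isEmpty()) {
    processBatch(batch);               // flush the last, partially filled batch
}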

There is a query class named MatchAllDocsQuery; I think it can be used in this case:

Query query = new MatchAllDocsQuery();
TopDocs topDocs = getIndexSearcher().search(query, RESULT_LIMIT);
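If the docIds are needed in pages of 500, one option (my own sketch, not part of the original answer) is IndexSearcher.searchAfter, which continues the MatchAllDocsQuery from the last hit of the previous page:

IndexSearcher searcher = getIndexSearcher();
Query query = new MatchAllDocsQuery();
int pageSize = 500;
ScoreDoc lastHit = null;                          // null means "start from the beginning"
while (true) {
    TopDocs page = (lastHit == null)
            ? searcher.search(query, pageSize)
            : searcher.searchAfter(lastHit, query, pageSize);
    if (page.scoreDocs.length == 0)
        break;                                    // no more documents
    for (ScoreDoc sd : page.scoreDocs) {
        String docId = searcher.doc(sd.doc).get("docId");
        // do something with docId here...
    }
    lastHit = page.scoreDocs[page.scoreDocs.length - 1];
}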

Document numbers (or ids) are consecutive integers from 0 to IndexReader.maxDoc()-1. These numbers are not persistent and are only valid for the open IndexReader. You can check whether a document is deleted with the IndexReader.isDeleted(int documentNumber) method.
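For illustration (my own addition), the relationship between these counters looks like this:

IndexReader reader = // create IndexReader
int total = reader.maxDoc();      // highest document number + 1, including deleted docs
int live = reader.numDocs();      // documents that are not deleted
int deleted = total - live;       // reader.numDeletedDocs() gives the same value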

If you use .document(i) as in the examples above and skip over deleted documents, be careful when using this method for paginating results.
For example: you have a list with 10 docs per page and you need the docs for page 6. Your input might be something like this: offset=60, count=10 (documents 60 to 70).
IndexReader reader = // create IndexReader
for (int i = offset; i < offset + 10; i++) {
    if (reader.isDeleted(i))
        continue;

    Document doc = reader.document(i);
    String docId = doc.get("docId");
}
You will have problems with the deleted ones, because you should not start from offset=60, but from offset=60 plus the number of deleted documents that appear before 60.
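For completeness, a minimal sketch of that offset correction (my own addition, using the same Lucene 3.x isDeleted API as above; the page loop then starts at the corrected position and should also count live hits rather than fixed slots):

// Walk forward counting only live (non-deleted) documents until 'offset' of them
// have been skipped; 'start' is then the internal doc number where the page begins.
int live = 0;
int start = 0;
while (start < reader.maxDoc() && live < offset) {
    if (!reader.isDeleted(start))
        live++;
    start++;
}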
An alternative I found is something like this:
IndexSearcher is = getIndexSearcher(); // new IndexSearcher(indexReader)
// get all results without any conditions attached
Term term = new Term([[any mandatory field name]], "*");
Query query = new WildcardQuery(term);
TopScoreDocCollector topCollector = TopScoreDocCollector.create([[int max hits to get]], true);
is.search(query, topCollector);
TopDocs topDocs = topCollector.topDocs(offset, count);

Note: replace the text between [[ ]] with your own values.
I ran this on a large index with 1.5 million entries and got 10 random results in less than a second.
Agreed, it is slower, but at least you can ignore deleted documents if you need pagination.
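As a variation (my own note, not part of the original answer), the MatchAllDocsQuery mentioned in another answer should work in place of the wildcard query here and avoids expanding "*" against the term dictionary:

IndexSearcher is = getIndexSearcher();  // assumed helper, as in the snippet above
TopScoreDocCollector topCollector = TopScoreDocCollector.create([[int max hits to get]], true);
is.search(new MatchAllDocsQuery(), topCollector);
TopDocs topDocs = topCollector.topDocs(offset, count);  // deletion-aware pagination, as above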

Related

lucene, efficient way to get offsets of a set of terms in documents

Suppose I have indexed a set of documents and now I am given a set of terms that I know were generated by the indexing process. I would like to get the occurrences of each of these terms, i.e., in which document and at what offsets. I have done this by using, for each term, one PostingsEnum that lets me iterate through the set of documents the term appears in; then, within each document, another PostingsEnum obtained from the document's term vector, which contains the offset information for that term in that document.
But this is not very efficient, as there is a loop inside a loop and it can get quite slow. The code is below. Any suggestions on whether this can be done in a better way?
The field schema:
<field name="terms" type="token_ngram" indexed="true" stored="false" multiValued="false" termVectors="true" termPositions="true" termOffsets="true"/>
Code:
IndexReader indexReader = ... // init an index reader
Set<String> termSet = ....    // set containing e.g. 10000 terms
for (String term : termSet) {
    // get a PostingsEnum used to iterate through docs containing the term
    // this "postings" does not have valid offset information (see comment below)
    PostingsEnum postings =
            MultiFields.getTermDocsEnum(indexReader, "terms", new BytesRef(term.getBytes()));
    /* I also tried:
     * PostingsEnum postings =
     *     MultiFields.getTermDocsEnum(indexReader, "terms", new BytesRef(term.getBytes()), PostingsEnum.OFFSETS);
     * but the resulting "postings" object also does not contain valid offset info (always -1)
     */

    // now go through each document
    int docId = postings.nextDoc();
    while (docId != PostingsEnum.NO_MORE_DOCS) {
        // get the term vector for that document
        TermsEnum it = indexReader.getTermVector(docId, ngramInfoFieldname).iterator();
        // find the term of interest
        it.seekExact(new BytesRef(term.getBytes()));
        // get its postings info; this will contain offset info
        PostingsEnum postingsInDoc = it.postings(null, PostingsEnum.OFFSETS);
        // From Line A to Line B below, if I replace "postingsInDoc" with "postings",
        // "startOffset()" and "endOffset()" always return -1
        postingsInDoc.nextDoc(); // line A
        int totalFreq = postingsInDoc.freq();
        for (int i = 0; i < totalFreq; i++) {
            postingsInDoc.nextPosition();
            System.out.println(postingsInDoc.startOffset() + "," + postingsInDoc.endOffset());
        } // Line B
        docId = postings.nextDoc();
    }
}
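As an aside (my own observation, not from the original post): a PostingsEnum obtained from the inverted index only carries offsets if offsets were indexed into the postings themselves; the schema above requests offsets only in the term vectors (termOffsets="true"), which would explain the -1 values. If re-indexing is an option, storing offsets with positions (the storeOffsetsWithPositions="true" attribute in Solr) lets the outer PostingsEnum return them directly and removes the inner term-vector loop. A rough sketch of the plain-Lucene equivalent, with hypothetical variable names:

// Hypothetical field setup so that the postings (not just the term vectors) carry offsets.
FieldType ft = new FieldType(TextField.TYPE_NOT_STORED);
ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
ft.setStoreTermVectors(true);          // keep term vectors if they are still needed elsewhere
ft.setStoreTermVectorOffsets(true);
ft.freeze();
doc.add(new Field("terms", fieldText, ft));   // fieldText is a placeholder for the indexed content

// With such a field, this call should yield valid startOffset()/endOffset() values:
PostingsEnum postings =
        MultiFields.getTermDocsEnum(indexReader, "terms", new BytesRef(term), PostingsEnum.OFFSETS);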

Neo4j\Lucene multiterm wildcard at the end of query

I'm trying to create auto-suggestion based on a Lucene full-text index.
The main issue is how to create auto-suggestion (autocomplete) based on multi-term phrases, for example -
nosql dat*
results can be
nosql database
nosql data
but not
perfect nosql database
What is the correct syntax for a Lucene query in order to create auto-suggestion based on the first words of a multi-term query with a wildcard at the end?
I had a similar requirement.
Lucene has span queries that allow you to use the location of words in the text in queries.
I've implemented it in Lucene using SpanFirstQuery (read about it in the docs).
Here I use SpanNearQuery to force all the words to be next to each other, and SpanFirstQuery to force all of them to be at the start of the text.
if (querystr.contains(" ")) // more than one word?
{
    String[] words = querystr.split(" ");
    SpanQuery[] clausesWildCard = new SpanQuery[words.length];
    for (int i = 0; i < words.length; i++) {
        if (i == words.length - 1) // last word, add wildcard clause
        {
            PrefixQuery pq = new PrefixQuery(new Term(VALUE, words[i]));
            clausesWildCard[i] = new SpanMultiTermQueryWrapper<PrefixQuery>(pq);
        }
        else
        {
            Term clause = new Term(VALUE, words[i]);
            clausesWildCard[i] = new SpanTermQuery(clause);
        }
    }
    SpanQuery allTheWordsNear = new SpanNearQuery(clausesWildCard, 0, true);
    prefixquery = new SpanFirstQuery(allTheWordsNear, words.length);
}
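A small usage sketch of my own (assuming an IndexSearcher named searcher and the prefixquery built above) showing how the suggestion candidates could be fetched:

// Run the span query and read back the suggestion field from the top hits.
TopDocs suggestions = searcher.search(prefixquery, 10);
for (ScoreDoc sd : suggestions.scoreDocs) {
    String suggestion = searcher.doc(sd.doc).get(VALUE);   // VALUE is the field name used above
    System.out.println(suggestion);
}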

High performance unique document id retrieval

Currently I am working on a high-performance NRT system using Lucene 4.9.0 on the Java platform which detects near-duplicate text documents.
For this purpose I query Lucene to return some set of matching candidates and do the near-duplicate calculation locally (by retrieving and caching term vectors). But my main concern is the performance of binding Lucene's docId (which can change) to my own unique and immutable document id stored within the index.
My flow is as follows:
query for documents in Lucene
for each document:
fetch my unique document id based on the Lucene docId
get the term vector from the cache for my document id (if it doesn't exist, fetch it from Lucene and populate the cache)
do maths...
My major bottleneck is the "fetch my unique document id" step, which introduces huge performance degradation (especially since sometimes I have to do the calculation for, say, 40,000 term vectors in a single loop).
try {
    Document document = indexReader.document(id);
    return document.getField(ID_FIELD_NAME).numericValue().intValue();
} catch (IOException e) {
    throw new IndexException(e);
}
Possible solutions I was considering were:
trying Zoie, which handles unique and persistent doc identifiers,
using FieldCache (still very inefficient),
using payloads (according to http://invertedindex.blogspot.com/2009/04/lucene-dociduid-mapping-and-payload.html) - but I do not have any idea how to apply them.
Any other suggestions?
I have figured out how to partially solve the issue by taking advantage of Lucene's AtomicReader. For this purpose I use a global cache in order to keep the already instantiated segments' FieldCache.
Map<Object, FieldCache.Ints> fieldCacheMap = new HashMap<Object, FieldCache.Ints>();
In my method I use the following piece of code:
Query query = new TermQuery(new Term(FIELD_NAME, fieldValue));
IndexReader indexReader = DirectoryReader.open(indexWriter, true);
Set<Object> usedReaderSet = new HashSet<Object>(); // tracks cache keys of segments seen in this pass

List<AtomicReaderContext> leaves = indexReader.getContext().leaves();
// process each segment separately
for (AtomicReaderContext leaf : leaves) {
    AtomicReader reader = leaf.reader();

    FieldCache.Ints fieldCache;
    Object fieldCacheKey = reader.getCoreCacheKey();
    synchronized (fieldCacheMap) {
        fieldCache = fieldCacheMap.get(fieldCacheKey);
        if (fieldCache == null) {
            fieldCache = FieldCache.DEFAULT.getInts(reader, ID_FIELD_NAME, true);
            fieldCacheMap.put(fieldCacheKey, fieldCache);
        }
        usedReaderSet.add(fieldCacheKey);
    }

    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    for (int i = 0; i < scoreDocs.length; i++) {
        int docID = scoreDocs[i].doc;
        int offerId = fieldCache.get(docID);
        // do your processing here
    }
}

// remove entries for segments that are no longer in use
synchronized (fieldCacheMap) {
    Set<Object> inCacheSet = fieldCacheMap.keySet();
    Set<Object> toRemove = new HashSet<Object>();
    for (Object inCache : inCacheSet) {
        if (!usedReaderSet.contains(inCache)) {
            toRemove.add(inCache);
        }
    }
    for (Object subject : toRemove) {
        fieldCacheMap.remove(subject);
    }
}

indexReader.close();
It works pretty fast. My main concern is memory usage, which can be really high when using a large index.
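As a side note (my own addition, not from the original answer): independent of the per-segment FieldCache above, the stored-field lookup in the original bottleneck snippet can itself be made cheaper by asking Lucene to decode only the id field rather than the whole document, via the fieldsToLoad overload of IndexReader.document:

// Load only ID_FIELD_NAME instead of every stored field of the document.
Set<String> fieldsToLoad = Collections.singleton(ID_FIELD_NAME);
try {
    Document document = indexReader.document(id, fieldsToLoad);
    return document.getField(ID_FIELD_NAME).numericValue().intValue();
} catch (IOException e) {
    throw new IndexException(e);
}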

MoreLikeThis not returning a 100% score in Lucene.Net when comparing a document with itself

I don't know if I am calling Lucene.Net correctly. I'm trying to call the MoreLikeThis function to compare a document to itself, and I'm only getting a score of 0.3174651 where I think I should be getting a score of 1.0. Am I expecting the wrong result?
This is my code:
int docId = hits[i].Doc;
var query2 = mlt.Like(docId);
TopScoreDocCollector collector = TopScoreDocCollector.Create(100, true);
searcher.Search(query2, collector);
ScoreDoc[] hits2 = collector.TopDocs().ScoreDocs;
var result = new List<string>();
for (int k = 0; k < hits2.Length; k++)
{
    docId = hits2[k].Doc;
    float score = hits2[k].Score;
}
Am I doing something wrong here?
The only thing you are doing wrong is thinking that Lucene scores are percentages. They aren't.
Document scores for a query are to be used to compare the strength of matches within the context of that single query. They are quite effective at sorting results, but they are not percentages, and are not generally suitable for display to a user.

Is there any way to extract all the tokens from solr?

How can one extract all the tokens from Solr? Not from one document, but from all the documents indexed in Solr?
Thanks!
You may do something like this (this sample targets a Lucene 4.x index):
IndexReader reader = DirectoryReader.open(dir);
Fields fields = MultiFields.getFields(reader);   // all indexed fields across segments
for (String col : fields) {                      // Fields is Iterable over the field names
    Terms te = fields.terms(col);
    if (te != null) {
        TermsEnum tex = te.iterator(null);
        BytesRef term;
        while ((term = tex.next()) != null) {
            // do something with each token, e.g.:
            String token = term.utf8ToString();
        }
    }
}
This iterates over all fields and, within each field, over every term. You can also look up the other methods provided by TermsEnum, such as docFreq().