High performance unique document id retrieval - lucene

I am currently working on a high-performance NRT (near-real-time) system that detects near-duplicate text documents, using Lucene 4.9.0 on the Java platform.
To do this, I query Lucene for a set of matching candidates and do the near-duplicate calculation locally (by retrieving and caching term vectors). My main concern is the cost of mapping Lucene's docId (which can change) to my own unique, immutable document ID stored in the index.
My flow is as follows:
1. Query for documents in Lucene.
2. For each document:
   a. fetch my unique document ID based on the Lucene docId;
   b. get the term vector for my document ID from the cache (if it doesn't exist, fetch it from Lucene and populate the cache);
   c. do the maths...
My major bottleneck is the "fetch my unique document ID" step, which causes a huge performance degradation (especially since I sometimes have to do the calculation for, say, 40,000 term vectors in a single loop):
try {
    Document document = indexReader.document(id);
    return document.getField(ID_FIELD_NAME).numericValue().intValue();
} catch (IOException e) {
    throw new IndexException(e);
}
The possible solutions I have considered are:
- trying Zoie, which handles unique and persistent document identifiers;
- using FieldCache (still very inefficient);
- using payloads (according to http://invertedindex.blogspot.com/2009/04/lucene-dociduid-mapping-and-payload.html), but I have no idea how to apply them.
Any other suggestions?

I have figured out how to partially solve the issue by working with Lucene's AtomicReader, i.e. segment by segment. I use a global cache that keeps the FieldCache entries of segments that have already been loaded.
Map<Object, FieldCache.Ints> fieldCacheMap = new HashMap<Object, FieldCache.Ints>();
In my method I use the following piece of code:
Query query = new TermQuery(new Term(FIELD_NAME, fieldValue));
IndexReader indexReader = DirectoryReader.open(indexWriter, true);
List<AtomicReaderContext> leaves = indexReader.getContext().leaves();
// cache keys of the segments seen during this pass
Set<Object> usedReaderSet = new HashSet<Object>();
// process each segment separately
for (AtomicReaderContext leaf : leaves) {
    AtomicReader reader = leaf.reader();
    FieldCache.Ints fieldCache;
    Object fieldCacheKey = reader.getCoreCacheKey();
    synchronized (fieldCacheMap) {
        fieldCache = fieldCacheMap.get(fieldCacheKey);
        if (fieldCache == null) {
            fieldCache = FieldCache.DEFAULT.getInts(reader, ID_FIELD_NAME, true);
            fieldCacheMap.put(fieldCacheKey, fieldCache);
        }
        usedReaderSet.add(fieldCacheKey);
    }
    // searching the segment reader directly yields segment-relative docIDs,
    // which is what the per-segment FieldCache expects
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    for (int i = 0; i < scoreDocs.length; i++) {
        int docID = scoreDocs[i].doc;
        int offerId = fieldCache.get(docID);
        // do your processing here
    }
}
// drop cache entries for segments that were not seen in this pass
synchronized (fieldCacheMap) {
    fieldCacheMap.keySet().retainAll(usedReaderSet);
}
indexReader.close();
It works pretty fast. My main concern is memory usage, which can be really high with a large index.
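The bookkeeping in the code above (one cache entry per segment key, stale entries pruned after each pass) can be sketched without Lucene. In this minimal sketch the class `SegmentCache`, its method names and the `int[]` value type are hypothetical stand-ins for the map of `FieldCache.Ints` keyed by `getCoreCacheKey()`; it only illustrates the caching pattern and is not the Lucene API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical stand-in for the per-segment ID cache; not a Lucene class. */
class SegmentCache {
    private final Map<Object, int[]> idsBySegment = new HashMap<>();
    private final Set<Object> usedThisPass = new HashSet<>();

    /** Returns the cached ID array for a segment key, loading it on first use. */
    synchronized int[] idsFor(Object segmentKey, Function<Object, int[]> loader) {
        usedThisPass.add(segmentKey);
        return idsBySegment.computeIfAbsent(segmentKey, loader);
    }

    /** Drops entries for segments that were not touched since the last call. */
    synchronized void endPass() {
        idsBySegment.keySet().retainAll(usedThisPass);
        usedThisPass.clear();
    }

    /** Number of segments currently held in the cache. */
    synchronized int size() {
        return idsBySegment.size();
    }
}
```

With a plain `int[]` per segment, as assumed here, each live entry costs about four bytes per document in that segment, so the footprint is driven by the total number of documents in the segments that are still referenced.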


Hibernate Search manual indexing throws "org.hibernate.TransientObjectException: The instance was not associated with this session"

I use Hibernate Search 5.11 in my Spring Boot 2 application to provide full-text search.
This library requires documents to be indexed.
While my app is running, I manually re-index the data of an indexed entity (MyEntity.class) every five minutes (for a specific reason related to my server context).
MyEntity.class has a property attachedFiles, a HashSet populated through a @OneToMany() join with lazy loading enabled:
@OneToMany(mappedBy = "myEntity", cascade = CascadeType.ALL, orphanRemoval = true)
private Set<AttachedFile> attachedFiles = new HashSet<>();
I wrote the required indexing process, but an exception is thrown at "fullTextSession.index(result);" when the attachedFiles property of a given entity contains one or more items:
org.hibernate.TransientObjectException: The instance was not associated with this session
In that case, the debugger shows a message like "Unable to load [...]" for the entity's HashSet value.
If the HashSet is empty (not null, just empty), no exception is thrown.
My indexing method:
private void indexDocumentsByEntityIds(List<Long> ids) {
    final int BATCH_SIZE = 128;
    Session session = entityManager.unwrap(Session.class);
    FullTextSession fullTextSession = Search.getFullTextSession(session);
    fullTextSession.setFlushMode(FlushMode.MANUAL);
    fullTextSession.setCacheMode(CacheMode.IGNORE);
    CriteriaBuilder builder = session.getCriteriaBuilder();
    CriteriaQuery<MyEntity> criteria = builder.createQuery(MyEntity.class);
    Root<MyEntity> root = criteria.from(MyEntity.class);
    criteria.select(root).where(root.get("id").in(ids));
    TypedQuery<MyEntity> query = fullTextSession.createQuery(criteria);
    List<MyEntity> results = query.getResultList();
    int index = 0;
    for (MyEntity result : results) {
        index++;
        try {
            fullTextSession.index(result); //index each element
            if (index % BATCH_SIZE == 0 || index == ids.size()) {
                fullTextSession.flushToIndexes(); //apply changes to indexes
                fullTextSession.clear(); //free memory since the queue is processed
            }
        } catch (TransientObjectException toEx) {
            LOGGER.info(toEx.getMessage());
            throw toEx;
        }
    }
}
Does anyone have an idea?
Thanks!
This is probably caused by the "clear" call you have in your loop.
In essence, what you're doing is:
1. load all entities to reindex into the session
2. index one batch of entities
3. remove all entities from the session (fullTextSession.clear())
4. try to index the next batch of entities, even though they are not in the session anymore... ?
What you need to do is load each batch of entities only after clearing the session, so that you're sure they are still in the session when you index them.
There's an example of how to do this in the documentation, using a scroll and an appropriate batch size: https://docs.jboss.org/hibernate/search/5.11/reference/en-US/html_single/#search-batchindex-flushtoindexes
Alternatively, you can just split your ID list into smaller lists of 128 elements, and for each of these lists, run a query to get the corresponding entities, reindex all these 128 entities, then flush and clear.
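The reasoning above can be illustrated with a toy model in plain Java. Everything here is hypothetical: `ToySession` is not the Hibernate API, entities are represented by their `Long` IDs, and an `IllegalStateException` stands in for the `TransientObjectException`. The sketch only shows why loading everything up front and clearing mid-loop fails on the second batch, while loading each batch after the previous clear does not.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of a session; hypothetical, not the Hibernate API. */
class ToySession {
    private final Set<Long> managed = new HashSet<>();

    /** "Loads" the given IDs, making them managed by this session. */
    List<Long> load(List<Long> ids) {
        managed.addAll(ids);
        return new ArrayList<>(ids);
    }

    /** Indexing only works for entities the session still manages. */
    void index(Long entity) {
        if (!managed.contains(entity)) {
            throw new IllegalStateException("not associated with this session: " + entity);
        }
    }

    /** Detaches everything, like clearing a persistence context. */
    void clear() {
        managed.clear();
    }

    /** Loads everything first and clears after each batch: fails on the second batch. */
    static int indexLoadAllFirst(ToySession session, List<Long> ids, int batchSize) {
        List<Long> all = session.load(ids);
        int indexed = 0;
        for (Long entity : all) {
            session.index(entity);
            indexed++;
            if (indexed % batchSize == 0) {
                session.clear();
            }
        }
        return indexed;
    }

    /** Loads each batch after the previous clear: every entity is managed when indexed. */
    static int indexPerBatch(ToySession session, List<Long> ids, int batchSize) {
        int indexed = 0;
        for (int from = 0; from < ids.size(); from += batchSize) {
            int to = Math.min(from + batchSize, ids.size());
            for (Long entity : session.load(ids.subList(from, to))) {
                session.index(entity);
                indexed++;
            }
            session.clear();
        }
        return indexed;
    }
}
```

In this model, `indexLoadAllFirst` with five IDs and a batch size of two throws as soon as it reaches the third entity, whereas `indexPerBatch` indexes all five.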
Thanks for the explanation, @yrodiere, it helped me a lot!
I chose your alternative solution:
"Alternatively, you can just split your ID list into smaller lists of 128 elements, and for each of these lists, run a query to get the corresponding entities, reindex all these 128 entities, then flush and clear."
...and everything works perfectly!
Well spotted!
See the code below:
// Splits a list into consecutive sub-lists of at most subListSize elements.
private <T> List<List<T>> splitList(List<T> list, int subListSize) {
    List<List<T>> batches = new ArrayList<>();
    if (!CollectionUtils.isEmpty(list)) {
        for (int i = 0; i < list.size(); i += subListSize) {
            batches.add(list.subList(i, Math.min(i + subListSize, list.size())));
        }
    }
    return batches;
}

private <T, I> void indexDocumentsByEntityIds(Class<T> clazz, String entityIdPropertyName, List<I> ids) {
    Session session = entityManager.unwrap(Session.class);
    FullTextSession fullTextSession = Search.getFullTextSession(session);
    fullTextSession.setFlushMode(FlushMode.MANUAL);
    fullTextSession.setCacheMode(CacheMode.IGNORE);
    for (List<I> idBatch : splitList(ids, 128)) {
        Transaction transaction = fullTextSession.beginTransaction();
        // load only this batch, so its entities are still in the session when indexed
        CriteriaBuilder builder = session.getCriteriaBuilder();
        CriteriaQuery<T> criteria = builder.createQuery(clazz);
        Root<T> root = criteria.from(clazz);
        criteria.select(root).where(root.get(entityIdPropertyName).in(idBatch));
        List<T> results = fullTextSession.createQuery(criteria).getResultList();
        try {
            for (T result : results) {
                fullTextSession.index(result); // index each element
            }
            // flush once per batch, even if some IDs matched no entity
            fullTextSession.flushToIndexes(); // apply changes to indexes
            fullTextSession.clear(); // free memory since the queue is processed
        } catch (TransientObjectException toEx) {
            LOGGER.info(toEx.getMessage());
            throw toEx;
        }
        transaction.commit();
    }
}

Number of PhraseQuery matches in a document

This is my code to perform a PhraseQuery using Lucene. While it is clear how to get the score for each matching document in the index, I do not understand how to extract the total number of matches within a single document.
The following is my code performing the query:
PhraseQuery.Builder builder = new PhraseQuery.Builder();
builder.add(new Term("contents", "word1"), 0);
builder.add(new Term("contents", "word2"), 1);
builder.add(new Term("contents", "word3"), 2);
builder.setSlop(3);
PhraseQuery pq = builder.build();
int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs docs = searcher.search(pq, hitsPerPage);
ScoreDoc[] hits = docs.scoreDocs;
System.out.println("Found " + hits.length + " hits.");
for (int i = 0; i < hits.length; ++i) {
    int docId = hits[i].doc;
    Document d = searcher.doc(docId);
    System.out.println(docId + " " + hits[i].score);
}
Is there a method to extract the total number of matches for each document rather than the score?
Approach A. This might not be the best way, but it gives you quick insight. You can use the explain() method of the IndexSearcher class, which returns an Explanation whose text output contains a lot of information, including the phrase frequency in the document. It takes the query and the document ID (not the Document object). Add this code inside your for loop:
System.out.println(searcher.explain(pq, docId));
Approach B. A more systematic way of doing this is to do the same thing that explain() function does. To compute the phrase frequency, explain() builds a scorer object for the phrase query and calls freq() on it. Most of the methods/classes used to do this are private/protected so I am not sure if you can really use them. However it might be helpful to look at the code of explain() in PhraseWeight class inside PhraseQuery and ExactPhraseScorer class. (Some of these classes are not public and you should download the source code to be able to see them).
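To make "number of matches in a document" concrete: for an exact phrase (slop 0), it is the number of positions at which the phrase terms appear consecutively in the field's token stream. The sketch below counts that in plain Java. `PhraseCounter` is a hypothetical helper, not Lucene code, and it deliberately ignores slop. With a slop of 3, as in the question, Lucene's sloppy matching also accepts terms that are a few positions out of place, so its phrase frequency can differ from this count.

```java
import java.util.List;

/** Plain-Java illustration of exact phrase frequency; hypothetical helper, not Lucene code. */
class PhraseCounter {
    /** Counts the start positions in tokens where phrase occurs as a consecutive run. */
    static int countExact(List<String> tokens, List<String> phrase) {
        if (phrase.isEmpty() || phrase.size() > tokens.size()) {
            return 0;
        }
        int count = 0;
        for (int start = 0; start + phrase.size() <= tokens.size(); start++) {
            // compare the window starting at 'start' with the phrase, term by term
            if (tokens.subList(start, start + phrase.size()).equals(phrase)) {
                count++;
            }
        }
        return count;
    }
}
```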

Lucene - Iterating through TermsEnum for docfreq

I am trying to get the document frequency for each term in a TermsEnum, but I always get "1" as the document frequency for every term. Any hint as to what the problem could be? This is my code:
Terms terms = reader.getTermVector(docId, field);
TermsEnum termsEnum = null;
termsEnum = terms.iterator(termsEnum);
BytesRef termText = null;
while ((termsEnum.next()) != null) {
    int docNumbersWithTerm = termsEnum.docFreq();
    System.out.println(docNumbersWithTerm);
}
The Terms instance from IndexReader.getTermVector acts as if you have a single-document index, consisting entirely of the document specified. Since there is only one document to consider in this context, you should always get docFreq() = 1. You can get the document frequency across the full index with the IndexReader.docFreq method (note that the Term constructor takes the field first, then the term bytes):
int docNumbersWithTerm = reader.docFreq(new Term(field, termsEnum.term()));
System.out.println(docNumbersWithTerm);
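The difference between the two views can be seen in a small plain-Java model. `DocFreqDemo` is a hypothetical helper, not a Lucene class: it computes document frequency over whatever list of documents it is given, so passing a single document (the term-vector view) yields 1 for every term, while passing the whole collection yields the index-wide counts.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Toy document-frequency computation; hypothetical, not the Lucene API. */
class DocFreqDemo {
    /** Document frequency: the number of documents that contain each term at least once. */
    static Map<String, Integer> docFreq(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // de-duplicate so that each document counts once per term
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }
}
```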

Lucene 4.0 sample code

I can't get this to work with Lucene 4.0 and its new features... Could somebody please help me??
I have crawled a bunch of html-documents from the web. Now I would like to count the number of distinct words of every Document.
This is how I did it with Lucene 3.5 (for a single document. To get them all I loop over all documents... every time with a new RAMDirectory containing only one doc) :
Analyzer analyzer = some Lucene Analyzer;
RAMDirectory index;
index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
String _words = new String();
// get somehow the String containing a certain text:
_words = doc.getPageDescription();
try {
    IndexWriter w = new IndexWriter(index, config);
    addDoc(w, _words);
    w.close();
} catch (IOException e) {
    e.printStackTrace();
} catch (Exception e) {
    e.printStackTrace();
}
try {
    // System.out.print(", count Terms... ");
    IndexReader reader = IndexReader.open(index);
    TermFreqVector[] freqVector = reader.getTermFreqVectors(0);
    if (freqVector == null) {
        System.out.println("Count words: 0");
    }
    for (TermFreqVector vector : freqVector) {
        String[] terms = vector.getTerms();
        int[] freq = vector.getTermFrequencies();
        int n = terms.length;
        System.out.println("Count words: " + n);
        ....
How can I do this with Lucene 4.0?
However, I'd prefer to do this using an FSDirectory instead of a RAMDirectory; I guess that performs better if I have a fairly large number of documents?
Thanks and regards
C.
Use the Fields/Terms APIs.
See especially the example 'access term vector fields for a specific document'.
Since you are looping over all documents, if your end goal is really something like the average number of unique terms across all documents, keep reading to the 'index statistics' section. In that case, for example, you can compute it efficiently as #postings / #documents: getSumDocFreq()/maxDoc()
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/index/package-summary.html#package_description
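The #postings / #documents identity can be checked on a toy collection in plain Java. `AvgUniqueTerms` is a hypothetical helper, not Lucene code: each document contributes one posting per distinct term, so the sum of document frequencies over all terms equals the sum of per-document unique-term counts, and dividing by the number of documents gives the average. In a real index, keep in mind that maxDoc() also counts deleted documents, so the result is an approximation when deletions are present.

```java
import java.util.HashSet;
import java.util.List;

/** Toy check of average unique terms = sumDocFreq / number of documents; not Lucene code. */
class AvgUniqueTerms {
    /** Sum of document frequencies over all terms, i.e. the number of (term, document) postings. */
    static long sumDocFreq(List<List<String>> docs) {
        long postings = 0;
        for (List<String> doc : docs) {
            // one posting per distinct term in this document
            postings += new HashSet<>(doc).size();
        }
        return postings;
    }

    /** Average number of distinct terms per document. */
    static double averageUniqueTerms(List<List<String>> docs) {
        return docs.isEmpty() ? 0.0 : (double) sumDocFreq(docs) / docs.size();
    }
}
```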

How to score a small set of docs in Lucene

I would like to compute the scores for a small number of documents (rather than for the entire collection) for a given query. My attempt, as follows, returns 0 scores for each document, even though the queries I test with were derived from the terms in the documents I am trying to score. I am using Lucene 3.0.3.
List<Float> score(IndexReader reader, Query query, List<Integer> newDocs ) {
    List<Float> scores = new ArrayList<Float>(); // List is an interface and cannot be instantiated
    IndexSearcher searcher = reader.getSearcher();
    Collector collector = TopScoreDocCollector.create(newDocs.size(), true);
    Weight weight = query.createWeight(searcher);
    Scorer scorer = weight.scorer(reader, true, true);
    collector.setScorer(scorer);
    float score = 0.0f;
    for (Integer d : newDocs) {
        scorer.advance(d);
        collector.collect(d);
        score = scorer.score();
        System.out.println( "doc: " + d + "; score=" + score);
        scores.add( new Float(score) );
    }
    return scores;
}
I am obviously missing something in the setup of scoring, but I cannot figure out from the Lucene source code what that might be.
Thanks in advance,
Gene
Use a filter, and do a search with that filter. Then just iterate through the results as you would with a normal search - Lucene will handle the filtering.
In general, if you are looking at DocIds, you're probably using a lower-level API than you need to, and it's going to give you trouble.