Hibernate Search manual indexing throws "org.hibernate.TransientObjectException: The instance was not associated with this session"

I use Hibernate Search 5.11 in my Spring Boot 2 application to provide full-text search.
This library requires documents to be indexed.
When my app is launched, I try to manually re-index the data of an indexed entity (MyEntity.class) every five minutes (for a specific reason, due to my server context).
I try to index the data of MyEntity.class.
MyEntity.class has a property attachedFiles, which is a HashSet filled through a @OneToMany() join with lazy loading enabled:
@OneToMany(mappedBy = "myEntity", cascade = CascadeType.ALL, orphanRemoval = true)
private Set<AttachedFile> attachedFiles = new HashSet<>();
I wrote the required indexing process, but an exception is thrown on "fullTextSession.index(result);" when the attachedFiles property of a given entity contains one or more items:
org.hibernate.TransientObjectException: The instance was not associated with this session
In debug mode, a message like "Unable to load [...]" is shown for the entity's HashSet value in this case.
If the HashSet is empty (not null, just empty), no exception is thrown.
My indexing method:
private void indexDocumentsByEntityIds(List<Long> ids) {
    final int BATCH_SIZE = 128;

    Session session = entityManager.unwrap(Session.class);

    FullTextSession fullTextSession = Search.getFullTextSession(session);
    fullTextSession.setFlushMode(FlushMode.MANUAL);
    fullTextSession.setCacheMode(CacheMode.IGNORE);

    CriteriaBuilder builder = session.getCriteriaBuilder();
    CriteriaQuery<MyEntity> criteria = builder.createQuery(MyEntity.class);
    Root<MyEntity> root = criteria.from(MyEntity.class);
    criteria.select(root).where(root.get("id").in(ids));

    TypedQuery<MyEntity> query = fullTextSession.createQuery(criteria);
    List<MyEntity> results = query.getResultList();

    int index = 0;
    for (MyEntity result : results) {
        index++;
        try {
            fullTextSession.index(result); //index each element
            if (index % BATCH_SIZE == 0 || index == ids.size()) {
                fullTextSession.flushToIndexes(); //apply changes to indexes
                fullTextSession.clear();          //free memory since the queue is processed
            }
        } catch (TransientObjectException toEx) {
            LOGGER.info(toEx.getMessage());
            throw toEx;
        }
    }
}
Does anyone have an idea?
Thanks!

This is probably caused by the "clear" call you have in your loop.
In essence, what you're doing is:
1. load all the entities to reindex into the session
2. index one batch of entities
3. remove all the entities from the session (fullTextSession.clear())
4. try to index the next batch of entities, even though they are no longer in the session...?
What you need to do is load each batch of entities only after the session has been cleared, so that you're sure they are still in the session when you index them.
There's an example of how to do this in the documentation, using a scroll and an appropriate batch size: https://docs.jboss.org/hibernate/search/5.11/reference/en-US/html_single/#search-batchindex-flushtoindexes
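For reference, here is a minimal sketch of that scroll-based approach, adapted to the MyEntity class and BATCH_SIZE constant used in the question (the id filter is omitted for brevity; treat this as an illustration of the linked documentation, not the exact documented code):
FullTextSession fullTextSession = Search.getFullTextSession(session);
fullTextSession.setFlushMode(FlushMode.MANUAL);
fullTextSession.setCacheMode(CacheMode.IGNORE);
Transaction transaction = fullTextSession.beginTransaction();
// Scroll through the entities so that each batch is loaded after the previous clear()
ScrollableResults results = fullTextSession.createCriteria(MyEntity.class)
    .setFetchSize(BATCH_SIZE)
    .scroll(ScrollMode.FORWARD_ONLY);
int index = 0;
while (results.next()) {
    index++;
    fullTextSession.index(results.get(0)); // index each element
    if (index % BATCH_SIZE == 0) {
        fullTextSession.flushToIndexes(); // apply changes to indexes
        fullTextSession.clear();          // clear the session; the scroll reloads the next batch afterwards
    }
}
transaction.commit();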
Alternatively, you can just split your ID list in smaller lists of 128 elements, and for each of these lists, run a query to get the corresponding entities, reindex all these 128 entities, then flush and clear.

Thanks for the explanations @yrodiere, they helped me a lot!
I chose your alternative solution:
Alternatively, you can just split your ID list in smaller lists of 128 elements, and for each of these lists, run a query to get the corresponding entities, reindex all these 128 entities, then flush and clear.
...and everything works perfectly!
Good catch!
See the code solution below:
private List<List<Object>> splitList(List<Object> list, int subListSize) {
    List<List<Object>> splittedList = new ArrayList<>();
    if (!CollectionUtils.isEmpty(list)) {
        int i = 0;
        int nbItems = list.size();
        while (i < nbItems) {
            int maxLastSubListIndex = i + subListSize;
            int lastSubListIndex = (maxLastSubListIndex > nbItems) ? nbItems : maxLastSubListIndex;
            List<Object> subList = list.subList(i, lastSubListIndex);
            splittedList.add(subList);
            i = lastSubListIndex;
        }
    }
    return splittedList;
}

private void indexDocumentsByEntityIds(Class<Object> clazz, String entityIdPropertyName, List<Object> ids) {
    Session session = entityManager.unwrap(Session.class);

    List<List<Object>> splittedIdsLists = splitList(ids, 128);

    for (List<Object> splittedIds : splittedIdsLists) {
        FullTextSession fullTextSession = Search.getFullTextSession(session);
        fullTextSession.setFlushMode(FlushMode.MANUAL);
        fullTextSession.setCacheMode(CacheMode.IGNORE);

        Transaction transaction = fullTextSession.beginTransaction();

        CriteriaBuilder builder = session.getCriteriaBuilder();
        CriteriaQuery<Object> criteria = builder.createQuery(clazz);
        Root<Object> root = criteria.from(clazz);
        criteria.select(root).where(root.get(entityIdPropertyName).in(splittedIds));

        TypedQuery<Object> query = fullTextSession.createQuery(criteria);
        List<Object> results = query.getResultList();

        int index = 0;
        for (Object result : results) {
            index++;
            try {
                fullTextSession.index(result); //index each element
                if (index == splittedIds.size()) {
                    fullTextSession.flushToIndexes(); //apply changes to indexes
                    fullTextSession.clear();          //free memory since the queue is processed
                }
            } catch (TransientObjectException toEx) {
                LOGGER.info(toEx.getMessage());
                throw toEx;
            }
        }

        transaction.commit();
    }
}
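As a side note (an assumption on my part, not something from the original answer): if Guava is on the classpath, its Lists.partition utility performs the same splitting as the hand-rolled splitList helper above:
// import com.google.common.collect.Lists;
List<List<Object>> splittedIdsLists = Lists.partition(ids, 128); // consecutive sublist views; the last one may be shorter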

Related

ChronicleMap cannot store/use the defined number of max entries after removing a few entries?

Chronicle Map versions I used: 3.22ea5 / 3.21.86
I am trying to use ChronicleMap as an LRU cache.
I have two ChronicleMaps with identical configuration and allowSegmentTiering set to false. Consider one the main map and the other the backup.
So, when the main map gets full, a few entries are removed from the main map and, in parallel, the backup map is used. Once the entries are removed from the main map, the entries from the backup map are moved back into the main map.
A sample is shown below.
ChronicleMap<ByteBuffer, ByteBuffer> main = ChronicleMapBuilder.of(ByteBuffer.class, ByteBuffer.class).name("main")
.entries(61500)
.averageKey(ByteBuffer.wrap(new byte[500]))
.averageValue(ByteBuffer.wrap(new byte[5120]))
.allowSegmentTiering(false)
.create();
ChronicleMap<ByteBuffer, ByteBuffer> backup = ChronicleMapBuilder.of(ByteBuffer.class, ByteBuffer.class).name("backup")
.entries(100)
.averageKey(ByteBuffer.wrap(new byte[500]))
.averageValue(ByteBuffer.wrap(new byte[5120]))
.allowSegmentTiering(false)
.create();
System.out.println("Main Heap Size -> "+main.offHeapMemoryUsed());
SecureRandom random = new SecureRandom();
while (true)
{
System.out.println();
AtomicInteger entriesAdded = new AtomicInteger(0);
try
{
int mainEntries = main.size();
while /*(true) Loop until error is thrown */(mainEntries < 61500)
{
try
{
byte[] keyN = new byte[500];
byte[] valueN = new byte[5120];
random.nextBytes(keyN);
random.nextBytes(valueN);
main.put(ByteBuffer.wrap(keyN), ByteBuffer.wrap(valueN));
mainEntries++;
}
catch (Throwable t)
{
System.out.println("Max Entries is not yet reached!!!");
break;
}
}
System.out.println("Main Entries -> "+main.size());
for (int i = 0; i < 10; i++)
{
byte[] keyN = new byte[500];
byte[] valueN = new byte[5120];
random.nextBytes(keyN);
random.nextBytes(valueN);
backup.put(ByteBuffer.wrap(keyN), ByteBuffer.wrap(valueN));
}
AtomicInteger removed = new AtomicInteger(0);
AtomicInteger i = new AtomicInteger(Math.max( (backup.size() * 5), ( (main.size() * 5) / 100 ) ));
main.forEachEntry(c -> {
if (i.get() > 0)
{
c.context().remove(c);
i.decrementAndGet();
removed.incrementAndGet();
}
});
System.out.println("Removed "+removed.get()+" Entries from Main");
backup.forEachEntry(b -> {
ByteBuffer key = b.key().get();
ByteBuffer value = b.value().get();
b.context().remove(b);
main.put(key, value);
entriesAdded.incrementAndGet();
});
if(backup.size() > 0)
{
System.out.println("It will never be logged");
backup.clear();
}
}
catch (Throwable t)
{
// System.out.println();
// t.printStackTrace(System.out);
System.out.println();
System.out.println("-------------------------Failed----------------------------");
System.out.println("Added "+entriesAdded.get()+" Entries in Main | Lost "+(backup.size() + 1)+" Entries in backup");
backup.clear();
break;
}
}
main.close();
backup.close();
The above code yields the following result.
Main Entries -> 61500
Removed 3075 Entries from Main
Main Entries -> 61500
Removed 3075 Entries from Main
Main Entries -> 61500
Removed 3075 Entries from Main
Max Entries is not yet reached!!!
Main Entries -> 59125
Removed 2956 Entries from Main
Max Entries is not yet reached!!!
Main Entries -> 56227
Removed 2811 Entries from Main
Max Entries is not yet reached!!!
Main Entries -> 53470
Removed 2673 Entries from Main
-------------------------Failed----------------------------
Added 7 Entries in Main | Lost 3 Entries in backup
In the above result, the maximum number of entries the main map holds decreases in the subsequent iterations, and the refilling from the backup map also fails.
In issue 128 it was said that entries are deleted properly.
Then why does the above sample code fail? What am I doing wrong here? Is ChronicleMap not designed for such a usage pattern?
Even if I use only one map, the maximum number of entries the map can hold gets reduced after each removal of entries.

Update context in SQL Server from ASP.NET Core 2.2

_context.Update(v);
_context.SaveChanges();
When I use this code, SQL Server adds a new record instead of updating the current record.
[HttpPost]
public IActionResult PageVote(List<string> Sar)
{
    string name_voter = ViewBag.getValue = TempData["Namevalue"];
    int count = 0;
    foreach (var item in Sar)
    {
        count = count + 1;
    }
    if (count == 6)
    {
        Vote v = new Vote()
        {
            VoteSarparast1 = Sar[0],
            VoteSarparast2 = Sar[1],
            VoteSarparast3 = Sar[2],
            VoteSarparast4 = Sar[3],
            VoteSarparast5 = Sar[4],
            VoteSarparast6 = Sar[5],
        };
        var voter = _context.Votes.FirstOrDefault(u => u.Voter == name_voter && u.IsVoted == true);
        if (voter == null)
        {
            v.IsVoted = true;
            v.Voter = name_voter;
            _context.Add(v);
            _context.SaveChanges();
            ViewBag.Greeting = "رای شما با موفقیت ثبت شد"; // "Your vote was registered successfully"
            return RedirectToAction(nameof(end));
        }
        v.IsVoted = true;
        v.Voter = name_voter;
        _context.Update(v);
        _context.SaveChanges();
        return RedirectToAction(nameof(end));
    }
    else
    {
        return View(_context.Applicants.ToList());
    }
}
You need to tell the DbContext about your entity. If you do var vote = new Vote(), vote has no Id. The DbContext sees this and thinks you want to Add a new entity, so it simply does that. The DbContext tracks all the entities that you load from it, but since this is just a new instance, it knows nothing about it.
To actually perform an update, you have two options:
1 - Load the Vote from the database in some way; if you get an Id, use that to find it.
// Loads the current vote by its id (or whatever other field..)
var existingVote = context.Votes.Single(p => p.Id == id_from_param);
// Perform the changes you want..
existingVote.SomeField = "NewValue";
// Then call save normally.
context.SaveChanges();
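Applied to the PageVote action from the question, option 1 would look roughly like this (a sketch reusing the lookup the question already performs; it assumes the existing vote was found):
var existingVote = _context.Votes.FirstOrDefault(u => u.Voter == name_voter && u.IsVoted == true);
if (existingVote != null)
{
    // update the tracked entity instead of creating and Update()-ing a new Vote instance
    existingVote.VoteSarparast1 = Sar[0];
    existingVote.VoteSarparast2 = Sar[1];
    existingVote.VoteSarparast3 = Sar[2];
    existingVote.VoteSarparast4 = Sar[3];
    existingVote.VoteSarparast5 = Sar[4];
    existingVote.VoteSarparast6 = Sar[5];
    _context.SaveChanges();
}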
2 - Or if you don't want to load it from Db, you have to manually tell the DbContext what to do:
// create a new "vote"...
var vote = new Vote
{
    // Since it's an update, you must have the Id somehow.. so you must set it manually
    Id = id_from_param,
    // do the changes you want. Be careful, because this can cause data loss!
    SomeField = "NewValue"
};
// This is you telling the DbContext: Hey, I control this entity.
// I know it exists in the DB and it's modified
context.Entry(vote).State = EntityState.Modified;
// Then call save normally.
context.SaveChanges();
Either of these two approaches should fix your issue, but I suggest you read a little more about how Entity Framework works. This is crucial for the success (and performance) of your apps. Option 2 in particular can cause many issues. There's a reason the DbContext keeps track of entities so you don't have to: it's very complicated, and things can go south fast.
Some links for you:
ChangeTracker in Entity Framework Core
Working with Disconnected Entity Graph in Entity Framework Core

High performance unique document id retrieval

Currently I am working on a high-performance NRT system using Lucene 4.9.0 on the Java platform which detects near-duplicate text documents.
For this purpose I query Lucene to return some set of matching candidates and do the near-duplicate calculation locally (by retrieving and caching term vectors). But my main concern is the performance of binding Lucene's docId (which can change) to my own unique and immutable document id stored within the index.
My flow is as follows:
query for documents in Lucene
for each document:
fetch my unique document id based on Lucene docId
get term vector from cache for my document id (if it doesn't exist - fetch it from Lucene and populate the cache)
do maths...
My major bottleneck is the "fetch my unique document id" step, which introduces huge performance degradation (especially since I sometimes have to do the calculation for, say, 40000 term vectors in a single loop).
try {
    Document document = indexReader.document(id);
    return document.getField(ID_FIELD_NAME).numericValue().intValue();
} catch (IOException e) {
    throw new IndexException(e);
}
Possible solutions I was considering were:
trying Zoie, which handles unique and persistent doc identifiers,
using FieldCache (still very inefficient),
using payloads (according to http://invertedindex.blogspot.com/2009/04/lucene-dociduid-mapping-and-payload.html) - but I have no idea how to apply it.
Any other suggestions?
I have figured out how to partially solve the issue by taking advantage of Lucene's AtomicReader. For this purpose I use a global cache in order to keep the FieldCache of already instantiated segments.
Map<Object, FieldCache.Ints> fieldCacheMap = new HashMap<Object, FieldCache.Ints>();
// cache keys of the segment readers used in the current pass
Set<Object> usedReaderSet = new HashSet<>();
In my method I use the following piece of code:
Query query = new TermQuery(new Term(FIELD_NAME, fieldValue));
IndexReader indexReader = DirectoryReader.open(indexWriter, true);
List<AtomicReaderContext> leaves = indexReader.getContext().leaves();

// process each segment separately
for (AtomicReaderContext leave : leaves) {
    AtomicReader reader = leave.reader();

    FieldCache.Ints fieldCache;
    Object fieldCacheKey = reader.getCoreCacheKey();

    synchronized (fieldCacheMap) {
        fieldCache = fieldCacheMap.get(fieldCacheKey);
        if (fieldCache == null) {
            fieldCache = FieldCache.DEFAULT.getInts(reader, ID_FIELD_NAME, true);
            fieldCacheMap.put(fieldCacheKey, fieldCache);
        }
        usedReaderSet.add(fieldCacheKey);
    }

    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;

    for (int i = 0; i < scoreDocs.length; i++) {
        int docID = scoreDocs[i].doc;
        int offerId = fieldCache.get(docID);
        // do your processing here
    }
}

// remove unused entries in cache set
synchronized (fieldCacheMap) {
    Set<Object> inCacheSet = fieldCacheMap.keySet();
    Set<Object> toRemove = new HashSet<>();
    for (Object inCache : inCacheSet) {
        if (!usedReaderSet.contains(inCache)) {
            toRemove.add(inCache);
        }
    }
    for (Object subject : toRemove) {
        fieldCacheMap.remove(subject);
    }
}

indexReader.close();
It works pretty fast. My main concern is memory usage, which can be really high when using a large index.

Insert 1000000 documents into RavenDB

I want to insert 1000000 documents into RavenDB.
class Program
{
private static string serverName;
private static string databaseName;
private static DocumentStore documentstore;
private static IDocumentSession _session;
static void Main(string[] args)
{
Console.WriteLine("Start...");
serverName = ConfigurationManager.AppSettings["ServerName"];
databaseName = ConfigurationManager.AppSettings["Database"];
documentstore = new DocumentStore { Url = serverName };
documentstore.Initialize();
Console.WriteLine("Initial Databse...");
_session = documentstore.OpenSession(databaseName);
for (int i = 0; i < 1000000; i++)
{
var person = new Person()
{
Fname = "Meysam" + i,
Lname = " Savameri" + i,
Bdate = DateTime.Now,
Salary = 6001 + i,
Address = "BITS provides one foreground and three background priority levels that" +
"you can use to prioritize transBfer jobs. Higher priority jobs preempt"+
"lower priority jobs. Jobs at the same priority level share transfer time,"+
"which prevents a large job from blocking small jobs in the transfer"+
"queue. Lower priority jobs do not receive transfer time until all the "+
"higher priority jobs are complete or in an error state. Background"+
"transfers are optimal because BITS uses idle network bandwidth to"+
"transfer the files. BITS increases or decreases the rate at which files "+
"are transferred based on the amount of idle network bandwidth that is"+
"available. If a network application begins to consume more bandwidth,"+
"BITS decreases its transfer rate to preserve the user's interactive"+
"experience. BITS supports multiple foreground jobs and one background"+
"transfer job at the same time.",
Email = "Meysam" + i + "#hotmail.com",
};
_session.Store(person);
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine("Count:" + i);
Console.ForegroundColor = ConsoleColor.White;
}
Console.WriteLine("Commit...");
_session.SaveChanges();
documentstore.Dispose();
_session.Dispose();
Console.WriteLine("Complete...");
Console.ReadLine();
}
}
but the session doesn't save the changes; I get an error:
An unhandled exception of type 'System.OutOfMemoryException' occurred in mscorlib.dll
A document session is intended to handle a small number of requests. Instead, experiment with inserting in batches of 1024. After that, dispose of the session and create a new one. The reason you get an OutOfMemoryException is that the document session caches all constituent objects to provide a unit of work, which is why you should dispose of the session after inserting each batch.
A neat way to do this is with a Batch LINQ extension:
foreach (var batch in Enumerable.Range(1, 1000000)
                                .Select(i => new Person { /* set properties */ })
                                .Batch(1024))
{
    using (var session = documentstore.OpenSession())
    {
        foreach (var person in batch)
        {
            session.Store(person);
        }
        session.SaveChanges();
    }
}
The implementations of both Enumerable.Range and Batch are lazy and don't keep all the objects in memory.
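Batch is not one of the standard LINQ operators; it typically comes from a helper library such as MoreLINQ, or can be written by hand. A minimal hand-rolled version might look like this (a sketch, not the library implementation):
using System.Collections.Generic;

public static class EnumerableExtensions
{
    // Lazily splits a sequence into consecutive buckets of at most 'size' elements.
    public static IEnumerable<List<T>> Batch<T>(this IEnumerable<T> source, int size)
    {
        var bucket = new List<T>(size);
        foreach (var item in source)
        {
            bucket.Add(item);
            if (bucket.Count == size)
            {
                yield return bucket;
                bucket = new List<T>(size);
            }
        }
        if (bucket.Count > 0)
        {
            yield return bucket;
        }
    }
}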
RavenDB also has a bulk API that does a similar thing without the need for additional LINQ extensions:
using (var bulkInsert = store.BulkInsert())
{
    for (int i = 0; i < 1000 * 1000; i++)
    {
        bulkInsert.Store(new User
        {
            Name = "Users #" + i
        });
    }
}
Note that .SaveChanges() isn't called explicitly; it will be called either when the batch size is reached (the size can be defined in BulkInsert() if needed) or when bulkInsert is disposed of.

RavenDB Paging Behaviour

I have the following test for skip/take:
[Test]
public void RavenPagingBehaviour()
{
    const int count = 2048;
    var eventEntities = PopulateEvents(count);
    PopulateEventsToRaven(eventEntities);

    using (var session = Store.OpenSession(_testDataBase))
    {
        var queryable =
            session.Query<EventEntity>().Customize(x => x.WaitForNonStaleResultsAsOfLastWrite()).Skip(0).Take(1024);
        var entities = queryable.ToArray();

        foreach (var eventEntity in entities)
        {
            eventEntity.Key = "Modified";
        }
        session.SaveChanges();

        queryable = session.Query<EventEntity>().Customize(x => x.WaitForNonStaleResultsAsOfLastWrite()).Skip(0).Take(1024);
        entities = queryable.ToArray();

        foreach (var eventEntity in entities)
        {
            Assert.AreEqual(eventEntity.Key, "Modified");
        }
    }
}
PopulateEventsToRaven simply adds 2048 very simple documents to the database.
The first skip/take combination gets the first 1024 documents, modifies them, and then commits the changes.
The next skip/take combination again asks for the first 1024 documents, but this time it gets documents 1024 to 2048 and hence fails the test. Why is this? I would expect the first 1024 again.
Edit: I have verified that if I don't modify the documents, the behaviour is fine.
The problem is that you don't specify an order by, which means RavenDB is free to choose which items to return; those aren't necessarily the same items it returned in the previous call.
Use an OrderBy and it will be consistent.
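For example, the queries in the test could be changed to something like this (a sketch; ordering by Id is an assumption, any stable property works):
var queryable = session.Query<EventEntity>()
    .Customize(x => x.WaitForNonStaleResultsAsOfLastWrite())
    .OrderBy(e => e.Id) // a stable ordering makes paging deterministic
    .Skip(0)
    .Take(1024);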