Insert 1,000,000 documents into RavenDB

I want to insert 1,000,000 documents into RavenDB. Here is my code:
class Program
{
    private static string serverName;
    private static string databaseName;
    private static DocumentStore documentstore;
    private static IDocumentSession _session;

    static void Main(string[] args)
    {
        Console.WriteLine("Start...");
        serverName = ConfigurationManager.AppSettings["ServerName"];
        databaseName = ConfigurationManager.AppSettings["Database"];
        documentstore = new DocumentStore { Url = serverName };
        documentstore.Initialize();
        Console.WriteLine("Initialize Database...");
        _session = documentstore.OpenSession(databaseName);
        for (int i = 0; i < 1000000; i++)
        {
            var person = new Person()
            {
                Fname = "Meysam" + i,
                Lname = " Savameri" + i,
                Bdate = DateTime.Now,
                Salary = 6001 + i,
                Address = "BITS provides one foreground and three background priority levels that " +
                          "you can use to prioritize transfer jobs. Higher priority jobs preempt " +
                          "lower priority jobs. Jobs at the same priority level share transfer time, " +
                          "which prevents a large job from blocking small jobs in the transfer " +
                          "queue. Lower priority jobs do not receive transfer time until all the " +
                          "higher priority jobs are complete or in an error state. Background " +
                          "transfers are optimal because BITS uses idle network bandwidth to " +
                          "transfer the files. BITS increases or decreases the rate at which files " +
                          "are transferred based on the amount of idle network bandwidth that is " +
                          "available. If a network application begins to consume more bandwidth, " +
                          "BITS decreases its transfer rate to preserve the user's interactive " +
                          "experience. BITS supports multiple foreground jobs and one background " +
                          "transfer job at the same time.",
                Email = "Meysam" + i + "@hotmail.com",
            };
            _session.Store(person);
            Console.ForegroundColor = ConsoleColor.Green;
            Console.WriteLine("Count:" + i);
            Console.ForegroundColor = ConsoleColor.White;
        }
        Console.WriteLine("Commit...");
        _session.SaveChanges();
        _session.Dispose();
        documentstore.Dispose();
        Console.WriteLine("Complete...");
        Console.ReadLine();
    }
}
But the session doesn't save the changes; I get an error:
An unhandled exception of type 'System.OutOfMemoryException' occurred in mscorlib.dll

A document session is intended to handle a small number of requests. Instead, try inserting in batches of 1024, and after each batch dispose the session and create a new one. You get an OutOfMemoryException because the document session caches every object it touches in order to provide a unit of work, which is why you should dispose of the session after inserting a batch.
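Applied to your loop, a minimal sketch of that batching (keeping your Person initializer elided and reusing your documentstore and databaseName fields) looks like this:

const int batchSize = 1024;
for (int i = 0; i < 1000000; i += batchSize)
{
    // One short-lived session per batch keeps the session's internal cache small.
    using (var session = documentstore.OpenSession(databaseName))
    {
        for (int j = i; j < i + batchSize && j < 1000000; j++)
        {
            session.Store(new Person { /* set properties as before, using j */ });
        }
        session.SaveChanges();
    }
}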
A neat way to do this is with a Batch LINQ extension:
foreach (var batch in Enumerable.Range(1, 1000000)
                                .Select(i => new Person { /* set properties */ })
                                .Batch(1024))
{
    using (var session = documentstore.OpenSession())
    {
        foreach (var person in batch)
        {
            session.Store(person);
        }
        session.SaveChanges();
    }
}
The implementations of both Enumerable.Range and Batch are lazy and don't keep all the objects in memory.
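Batch isn't part of the base class library (it ships with MoreLINQ, and newer .NET versions have Enumerable.Chunk). If you'd rather not take a dependency, a minimal hand-rolled version along these lines will do; the name and return type here are just one way to write it:

using System.Collections.Generic;

public static class BatchExtensions
{
    // Lazily yields the source sequence in buckets of at most 'size' items,
    // so only one bucket is held in memory at a time.
    public static IEnumerable<List<T>> Batch<T>(this IEnumerable<T> source, int size)
    {
        var bucket = new List<T>(size);
        foreach (var item in source)
        {
            bucket.Add(item);
            if (bucket.Count == size)
            {
                yield return bucket;
                bucket = new List<T>(size);
            }
        }
        if (bucket.Count > 0)
        {
            yield return bucket;
        }
    }
}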

RavenDB also has a bulk API that does a similar thing without the need for additional LINQ extensions:
using (var bulkInsert = store.BulkInsert())
{
    for (int i = 0; i < 1000 * 1000; i++)
    {
        bulkInsert.Store(new User
        {
            Name = "Users #" + i
        });
    }
}
Note that .SaveChanges() isn't called; the client flushes either when a batch size is reached (you can define it in the BulkInsert() call if needed) or when the bulkInsert is disposed of.
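For example, the older RavenDB clients (around 2.x/3.x) accept a BulkInsertOptions object; the exact class and property names depend on your client version, so treat this as a sketch rather than the definitive API:

// BulkInsertOptions and BatchSize as found in the 2.x/3.x clients; verify against your version.
using (var bulkInsert = store.BulkInsert(databaseName, new BulkInsertOptions { BatchSize = 1024 }))
{
    for (int i = 0; i < 1000 * 1000; i++)
    {
        bulkInsert.Store(new User { Name = "Users #" + i });
    }
}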

Related

ChronicleMap cannot store/use the defined number of max entries after removing a few entries?

Chronicle Map versions I used: 3.22ea5 / 3.21.86
I am trying to use ChronicleMap as an LRU cache.
I have two ChronicleMaps, identical in configuration, with allowSegmentTiering set to false. Consider one the main map and the other the backup.
When the main map gets full, a few entries are removed from it and, in parallel, the backup map is used. Once the entries have been removed from the main map, the entries from the backup map are refilled into the main map.
A code sample is shown below.
ChronicleMap<ByteBuffer, ByteBuffer> main = ChronicleMapBuilder.of(ByteBuffer.class, ByteBuffer.class).name("main")
        .entries(61500)
        .averageKey(ByteBuffer.wrap(new byte[500]))
        .averageValue(ByteBuffer.wrap(new byte[5120]))
        .allowSegmentTiering(false)
        .create();
ChronicleMap<ByteBuffer, ByteBuffer> backup = ChronicleMapBuilder.of(ByteBuffer.class, ByteBuffer.class).name("backup")
        .entries(100)
        .averageKey(ByteBuffer.wrap(new byte[500]))
        .averageValue(ByteBuffer.wrap(new byte[5120]))
        .allowSegmentTiering(false)
        .create();
System.out.println("Main Heap Size -> " + main.offHeapMemoryUsed());
SecureRandom random = new SecureRandom();
while (true)
{
    System.out.println();
    AtomicInteger entriesAdded = new AtomicInteger(0);
    try
    {
        int mainEntries = main.size();
        while /*(true) Loop until error is thrown */ (mainEntries < 61500)
        {
            try
            {
                byte[] keyN = new byte[500];
                byte[] valueN = new byte[5120];
                random.nextBytes(keyN);
                random.nextBytes(valueN);
                main.put(ByteBuffer.wrap(keyN), ByteBuffer.wrap(valueN));
                mainEntries++;
            }
            catch (Throwable t)
            {
                System.out.println("Max Entries is not yet reached!!!");
                break;
            }
        }
        System.out.println("Main Entries -> " + main.size());
        for (int i = 0; i < 10; i++)
        {
            byte[] keyN = new byte[500];
            byte[] valueN = new byte[5120];
            random.nextBytes(keyN);
            random.nextBytes(valueN);
            backup.put(ByteBuffer.wrap(keyN), ByteBuffer.wrap(valueN));
        }
        AtomicInteger removed = new AtomicInteger(0);
        AtomicInteger i = new AtomicInteger(Math.max((backup.size() * 5), ((main.size() * 5) / 100)));
        main.forEachEntry(c -> {
            if (i.get() > 0)
            {
                c.context().remove(c);
                i.decrementAndGet();
                removed.incrementAndGet();
            }
        });
        System.out.println("Removed " + removed.get() + " Entries from Main");
        backup.forEachEntry(b -> {
            ByteBuffer key = b.key().get();
            ByteBuffer value = b.value().get();
            b.context().remove(b);
            main.put(key, value);
            entriesAdded.incrementAndGet();
        });
        if (backup.size() > 0)
        {
            System.out.println("It will never be logged");
            backup.clear();
        }
    }
    catch (Throwable t)
    {
        // System.out.println();
        // t.printStackTrace(System.out);
        System.out.println();
        System.out.println("-------------------------Failed----------------------------");
        System.out.println("Added " + entriesAdded.get() + " Entries in Main | Lost " + (backup.size() + 1) + " Entries in backup");
        backup.clear();
        break;
    }
}
main.close();
backup.close();
The above code yields the following result.
Main Entries -> 61500
Removed 3075 Entries from Main
Main Entries -> 61500
Removed 3075 Entries from Main
Main Entries -> 61500
Removed 3075 Entries from Main
Max Entries is not yet reached!!!
Main Entries -> 59125
Removed 2956 Entries from Main
Max Entries is not yet reached!!!
Main Entries -> 56227
Removed 2811 Entries from Main
Max Entries is not yet reached!!!
Main Entries -> 53470
Removed 2673 Entries from Main
-------------------------Failed----------------------------
Added 7 Entries in Main | Lost 3 Entries in backup
In the above result, the maximum number of entries the main map can hold decreases in subsequent iterations, and the refilling from the backup map also fails.
In issue 128 it was said that the entries are deleted properly.
Then why does the above sample code fail? What am I doing wrong here? Is ChronicleMap not designed for such a usage pattern?
Even if I use only one map, the maximum number of entries the map can hold gets reduced after each removal of entries.

Hibernate Search manual indexing throws an "org.hibernate.TransientObjectException: The instance was not associated with this session"

I use Hibernate Search 5.11 in my Spring Boot 2 application to provide full-text search.
This library requires documents to be indexed.
When my app is launched, I try to manually re-index the data of an indexed entity (MyEntity.class) every five minutes (for a specific reason, due to my server context).
MyEntity.class has a property attachedFiles, which is a HashSet filled through a @OneToMany() join with lazy loading enabled:
@OneToMany(mappedBy = "myEntity", cascade = CascadeType.ALL, orphanRemoval = true)
private Set<AttachedFile> attachedFiles = new HashSet<>();
I coded the required indexing process, but an exception is thrown on "fullTextSession.index(result);" when the attachedFiles property of a given entity contains one or more items:
org.hibernate.TransientObjectException: The instance was not associated with this session
The debugger shows a message like "Unable to load [...]" for the entity's HashSet value in this case.
If the HashSet is empty (not null, just empty), no exception is thrown.
My indexing method:
private void indexDocumentsByEntityIds(List<Long> ids) {
    final int BATCH_SIZE = 128;
    Session session = entityManager.unwrap(Session.class);
    FullTextSession fullTextSession = Search.getFullTextSession(session);
    fullTextSession.setFlushMode(FlushMode.MANUAL);
    fullTextSession.setCacheMode(CacheMode.IGNORE);
    CriteriaBuilder builder = session.getCriteriaBuilder();
    CriteriaQuery<MyEntity> criteria = builder.createQuery(MyEntity.class);
    Root<MyEntity> root = criteria.from(MyEntity.class);
    criteria.select(root).where(root.get("id").in(ids));
    TypedQuery<MyEntity> query = fullTextSession.createQuery(criteria);
    List<MyEntity> results = query.getResultList();
    int index = 0;
    for (MyEntity result : results) {
        index++;
        try {
            fullTextSession.index(result); // index each element
            if (index % BATCH_SIZE == 0 || index == ids.size()) {
                fullTextSession.flushToIndexes(); // apply changes to indexes
                fullTextSession.clear();          // free memory since the queue is processed
            }
        } catch (TransientObjectException toEx) {
            LOGGER.info(toEx.getMessage());
            throw toEx;
        }
    }
}
Does someone have an idea?
Thanks!
This is probably caused by the "clear" call you have in your loop.
In essence, what you're doing is:
1. load all entities to reindex into the session
2. index one batch of entities
3. remove all entities from the session (fullTextSession.clear())
4. try to index the next batch of entities, even though they are not in the session anymore...?
What you need to do is load each batch of entities only after clearing the session, so that you're sure they are still in the session when you index them.
There's an example of how to do this in the documentation, using a scroll and an appropriate batch size: https://docs.jboss.org/hibernate/search/5.11/reference/en-US/html_single/#search-batchindex-flushtoindexes
Alternatively, you can just split your ID list in smaller lists of 128 elements, and for each of these lists, run a query to get the corresponding entities, reindex all these 128 entities, then flush and clear.
Thanks for the explanations @yrodiere, they helped me a lot!
I chose your alternative solution:
"Alternatively, you can just split your ID list in smaller lists of 128 elements, and for each of these lists, run a query to get the corresponding entities, reindex all these 128 entities, then flush and clear."
...and everything works perfectly!
Well spotted!
See the code solution below:
private List<List<Object>> splitList(List<Object> list, int subListSize) {
    List<List<Object>> splittedList = new ArrayList<>();
    if (!CollectionUtils.isEmpty(list)) {
        int i = 0;
        int nbItems = list.size();
        while (i < nbItems) {
            int maxLastSubListIndex = i + subListSize;
            int lastSubListIndex = (maxLastSubListIndex > nbItems) ? nbItems : maxLastSubListIndex;
            List<Object> subList = list.subList(i, lastSubListIndex);
            splittedList.add(subList);
            i = lastSubListIndex;
        }
    }
    return splittedList;
}

private void indexDocumentsByEntityIds(Class<Object> clazz, String entityIdPropertyName, List<Object> ids) {
    Session session = entityManager.unwrap(Session.class);
    List<List<Object>> splittedIdsLists = splitList(ids, 128);
    for (List<Object> splittedIds : splittedIdsLists) {
        FullTextSession fullTextSession = Search.getFullTextSession(session);
        fullTextSession.setFlushMode(FlushMode.MANUAL);
        fullTextSession.setCacheMode(CacheMode.IGNORE);
        Transaction transaction = fullTextSession.beginTransaction();
        CriteriaBuilder builder = session.getCriteriaBuilder();
        CriteriaQuery<Object> criteria = builder.createQuery(clazz);
        Root<Object> root = criteria.from(clazz);
        criteria.select(root).where(root.get(entityIdPropertyName).in(splittedIds));
        TypedQuery<Object> query = fullTextSession.createQuery(criteria);
        List<Object> results = query.getResultList();
        int index = 0;
        for (Object result : results) {
            index++;
            try {
                fullTextSession.index(result); // index each element
                if (index == splittedIds.size()) {
                    fullTextSession.flushToIndexes(); // apply changes to indexes
                    fullTextSession.clear();          // free memory since the queue is processed
                }
            } catch (TransientObjectException toEx) {
                LOGGER.info(toEx.getMessage());
                throw toEx;
            }
        }
        transaction.commit();
    }
}

XGBoost4J Synchronization issue?

We are using XGBoost4J for ML predictions. We expose the predictor as a RESTful web service so that various components within the platform can call it, e.g. finding a product's category tree from its title and description.
The code below is a simplified depiction of what we implemented.
// This is done in the init method; for every model there is one singleton Booster object loaded.
class Predictor {

    private Booster xgboost;

    // init call from Service initialization while injecting Predictor
    public void init(final String modelFile, final Integer numThreads) {
        if (!(new File(modelFile).exists())) {
            throw new IOException("Modelfile " + modelFile + " does not exist");
        }
        // we use a util class Params to handle parameters, as an example
        final Iterable<Entry<String, Object>> param = new Params() {
            {
                put("nthread", numThreads);
            }
        };
        xgboost = new Booster(param, modelFile);
    }

    // Predict method
    public String predict(final String predictionString) {
        final String dummyLabel = "-1";
        final String x_n = dummyLabel + "\t" + x_n_libsvm_idxStr;
        final DataLoader.CSRSparseData spData = XGboostSparseData.format(x_n);
        final DMatrix x_n_dmatrix = new DMatrix(spData.rowHeaders,
                spData.colIndex, spData.data, DMatrix.SparseType.CSR);
        final float[][] predict = xgboost.predict(x_n_dmatrix);
        // Then there is conversion logic from predict to the predicted model result, which returns predictions
        String prediction = getPrediction(predict);
        return prediction;
    }
}
The Predictor class above is a singleton injected into the web service's Service class, so every service call thread calls:
service.predict(predictionString);
The problem in the Tomcat container is that when multiple concurrent threads call the predict method, the Booster's pred method is synchronized:
private synchronized float[][] pred(DMatrix data, boolean outPutMargin, long treeLimit, boolean predLeaf) throws XGBoostError {
    byte optionMask = 0;
    if (outPutMargin) {
        optionMask = 1;
    }
    if (predLeaf) {
        optionMask = 2;
    }
    float[][] rawPredicts = new float[1][];
    ErrorHandle.checkCall(XgboostJNI.XGBoosterPredict(this.handle, data.getHandle(), optionMask, treeLimit, rawPredicts));
    int row = (int) data.rowNum();
    int col = rawPredicts[0].length / row;
    float[][] predicts = new float[row][col];
    for (int i = 0; i < rawPredicts[0].length; ++i) {
        int r = i / col;
        int c = i % col;
        predicts[r][c] = rawPredicts[0][i];
    }
    return predicts;
}
This causes threads to wait and lock on the synchronized block, and as a result the web service does not scale.
We tried removing synchronized from the XGBoost4J source code and compiling the jar, but it crashes within the first 1-2 minutes. The heap dump shows it crashing at the line below while making the native call to XgboostJNI:
ErrorHandle.checkCall(XgboostJNI.XGBoosterPredict(this.handle, data.getHandle(), optionMask, treeLimit, rawPredicts));
Does anyone know a better way of implementing XGBoost4J in Java for a highly scalable web service?
You could use PMML (https://github.com/jpmml/jpmml-xgboost); see https://github.com/jpmml/jpmml-xgboost/issues/7#issuecomment-250965282.

Setting data in the Cloned Seismic Cube

I am quite new to the Ocean Framework. I cloned a pre-existing seismic cube to create a new seismic cube.
// Getting the parent cube
SeismicCube ParentCube = InputSeismicLine3D.SeismicCube;
// Getting the seismic collection
SeismicCollection Sc = ParentCube.SeismicCollection;
if (Sc.CanCreateSeismicCube(ParentCube))
{
    SeismicCube NewCube = Sc.CreateSeismicCube(ParentCube, ParentCube.Template);
}
Can anyone tell me how to set the trace data in NewCube?
Thanks in advance.
SeismicCube has a Traces property.
From the SDK:
Writing is only possible for cubes that return IsWritable as true and are locked in a transaction. Trace value changes (like trace[12] = 123.0) are flushed automatically at regular intervals when you advance to the next trace with enumerator.MoveNext(). The value range is recomputed when you finish iterating (MoveNext() returns false).
In your example:
if (Sc.CanCreateSeismicCube(ParentCube))
{
    SeismicCube NewCube = Sc.CreateSeismicCube(ParentCube, ParentCube.Template);
    if (!NewCube.IsWriteable)
        return;
    using (ITransaction trans = DataManager.NewTransaction())
    {
        trans.Lock(NewCube);
        foreach (ITrace trace in NewCube.Traces)
        {
            // Do some setting of trace values here. Example only:
            for (int i = 0; i < trace.Length; i++)
            {
                trace[i] = trace.I + trace.J + i;
            }
        }
        trans.Commit();
    }
}

RavenDB Paging Behaviour

I have the following test for skip/take:
[Test]
public void RavenPagingBehaviour()
{
    const int count = 2048;
    var eventEntities = PopulateEvents(count);
    PopulateEventsToRaven(eventEntities);
    using (var session = Store.OpenSession(_testDataBase))
    {
        var queryable =
            session.Query<EventEntity>().Customize(x => x.WaitForNonStaleResultsAsOfLastWrite()).Skip(0).Take(1024);
        var entities = queryable.ToArray();
        foreach (var eventEntity in entities)
        {
            eventEntity.Key = "Modified";
        }
        session.SaveChanges();

        queryable = session.Query<EventEntity>().Customize(x => x.WaitForNonStaleResultsAsOfLastWrite()).Skip(0).Take(1024);
        entities = queryable.ToArray();
        foreach (var eventEntity in entities)
        {
            Assert.AreEqual(eventEntity.Key, "Modified");
        }
    }
}
PopulateEventsToRaven simply adds 2048 very simple documents to the database.
The first skip/take combination gets the first 1024 documents, modifies them, and then commits the changes.
The next skip/take combination again asks for the first 1024 documents, but this time it gets documents 1024 to 2048 and hence fails the test. Why is this? I would expect the first 1024 again.
Edit: I have verified that if I don't modify the documents, the behaviour is fine.
The problem is that you don't specify an order by, which means RavenDB is free to choose which items to return; they aren't necessarily the same items it returned in the previous call.
Use an OrderBy and it will be consistent.
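For example, ordering by a stable property makes the pages repeatable; the Id property below is only an assumption about EventEntity, so substitute whatever stable field your documents actually have:

var queryable = session.Query<EventEntity>()
                       .Customize(x => x.WaitForNonStaleResultsAsOfLastWrite())
                       .OrderBy(e => e.Id) // assumed property; any stable, indexed field works
                       .Skip(0)
                       .Take(1024);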