RavenDB Consistency - WaitForIndexesAfterSaveChanges() / WaitForNonStaleResultsAsOfNow

I am using RavenDB 3.5.
I know that querying entities is not ACID, but loading by ID is.
Apparently writing to the DB is also ACID.
So far so good.
Now a question:
I've found some code:
session.Advanced.WaitForIndexesAfterSaveChanges();
entity = session.Load<T>(id);
session.Delete(entity);
session.SaveChanges();
// Func<T, T> command
command?.Invoke(entity);
What would be the purpose of calling WaitForIndexesAfterSaveChanges() here?
Is it because of the command that gets executed afterwards?
Or is it rather because dependent/consuming queries are supposed to pick up those changes immediately?
If that is the case, couldn't I remove WaitForIndexesAfterSaveChanges() from this code block and just add WaitForNonStaleResultsAsOfNow() to the queries?
When would I use WaitForIndexesAfterSaveChanges() in the first place if my critical queries are already flagged with WaitForNonStaleResultsAsOfNow()?

The most likely reason is that this operation wants to wait for the indexes to catch up before continuing.
A good example of why you would want that is when you create a new item and the next operation shows a list of items: WaitForIndexesAfterSaveChanges makes the save itself wait until the indexes have been updated.
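To make the trade-off concrete, here is a minimal sketch of both options (RavenDB 3.5 client; the store, Order, id and customerId names are placeholders, not from the original code):
// Option 1: make the write itself wait until the indexes have caught up,
// so any query issued afterwards already sees the deletion.
using (var session = store.OpenSession())
{
    session.Advanced.WaitForIndexesAfterSaveChanges();
    var entity = session.Load<Order>(id);
    session.Delete(entity);
    session.SaveChanges(); // blocks here until the relevant indexes are up to date
}

// Option 2: let the write return immediately and make the critical query wait instead.
using (var session = store.OpenSession())
{
    var orders = session.Query<Order>()
        .Customize(x => x.WaitForNonStaleResultsAsOfNow())
        .Where(o => o.CustomerId == customerId)
        .ToList();
}
The first option pays the waiting cost once, at save time, and keeps the readers simple; the second keeps saves fast but makes every critical reader wait.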

Related

Ravendb memory leak on query

I'm having a hard problem solving an issue with RavenDB.
At my work we have a process that tries to identify potential duplicates in our database in a specific collection (let's call it the users collection).
That means, I'm iterating through the collection and for each document there is a query that is trying to find similar entities. So just imagine, it's quite a long task to run.
My problem is that when the task starts running, the memory consumption of RavenDB goes higher and higher; it just keeps growing, and it seems to continue until it reaches the maximum memory of the system.
But it doesn't really make sense, since I'm only running queries, I'm using one single index, and I take the default page size when querying (128).
Has anybody met a similar problem like this? I really have no idea what is going on in RavenDB, but it seems like a memory leak.
RavenDB version: 3.0.179
When I need to do massive operations on large collections, I work through these steps to prevent memory-usage problems (a sketch of the pattern follows below):
I use query streaming to extract all the ids of the documents that I want to process (with a dedicated session).
I open a new session for each id, load the document, and then do what I need.
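A minimal sketch of that pattern with the RavenDB 3.x client (the User type and the store variable are assumptions for illustration):
var ids = new List<string>();

// 1) Stream only the document ids, using a dedicated session.
using (var session = store.OpenSession())
{
    // If dynamic queries cannot be streamed on your version, pass the name of
    // your static index instead: session.Query<User>("YourIndexName").
    var query = session.Query<User>();
    using (var enumerator = session.Advanced.Stream(query))
    {
        while (enumerator.MoveNext())
            ids.Add(enumerator.Current.Key); // the document id
    }
}

// 2) Process each document in its own short-lived session.
foreach (var id in ids)
{
    using (var session = store.OpenSession())
    {
        var user = session.Load<User>(id);
        // ... run the duplicate check for this document ...
        session.SaveChanges();
    }
}
Keeping each session short-lived stops the session cache from accumulating every document you touch during the long-running task.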
First, a recommendation: if you don't want duplicates, store them with a well-known ID. For example, suppose you don't want duplicate User objects. You'd store them with an ID that makes them unique:
var user = new User() { Email = "foo@bar.com" };
var id = "Users/" + user.Email; // A well-known ID
dbSession.Store(user, id);
Then, when you want to check for duplicates, just check against the well-known ID:
public string RegisterNewUser(string email)
{
    // Unlike .Query, the .Load call is ACID and never stale.
    var existingUser = dbSession.Load<User>("Users/" + email);
    if (existingUser != null)
    {
        return "Sorry, that email is already taken.";
    }

    // Otherwise, store the new user under its well-known ID.
    dbSession.Store(new User { Email = email }, "Users/" + email);
    return "Thanks for registering!";
}
If you follow this pattern, you won't have to worry about running complex queries nor worry about stale indexes.
If this scenario can't work for you for some reason, then we can help diagnose your memory issues. But to diagnose that, we'll need to see your code.

Doctrine2: fill up related entities after loading main

There are HotelComment and CommentPhoto entities (1:n) - a user can add some photos to their own comment. I'm loading a slice of comments with one query and want to load the photos for those comments using another query (using WHERE IN).
$comments = $commentsRepo->findByHotel($hotel);
$comments->loadPhotos(); // of course, comments is just a plain array at this point
The photos need to be loaded on demand, not on a PostLoad event.
So the question is: how can I associate the loaded photos with the HotelComment objects? Using ReflectionProperty: setAccessible() + setValue()? Is there a simpler solution? And I'm afraid that the UoW will detect the HotelComment entities as modified and send updates to the DB.
If you want to hydrate the related objects this one time only, and not every time the object is loaded, you need to use DQL:
$em->createQuery("SELECT comments, photos FROM HotelComment comments JOIN comments.photos photos");
You can put this in a method on the repository.
This will issue a single SELECT statement, with an INNER JOIN to the comment photos table.
You have to configure your relation as "LAZY". See doctrine documentation:
ManyToOne
ManyToMany
OneToOne
Then you'll be able to load it lazily with $comments->loadPhotos(), at least the documentation says so.
UPDATE: I think you don't have to do anything special to avoid your entities being flushed to the DB. In fact, when you query your entries with DQL, they have managed state, so attaching them to another managed entity's collection does not change their state, so they are not flushed unless you have modified them.
However, that doesn't help at all, because associations are fetched before first usage, so adding an entity to the collection with the following code will result in an implicit database query:
$comment->addPhoto($photo);
//in Comment class
function addPhoto(Photo $photo){
    //var_dump(count($this->photos)); //if you have any - they are already here
    $this->photos->add($photo);
}
Maybe declaring your collection as public (or the trick with ReflectionProperty) will help fool Doctrine, but that's a dirty hack, so I haven't even tried it.
Detaching the parent entity doesn't help either. I've run out of ideas for now...

groovy sql eachRow and rows method

I am new to grails and groovy.
Can anyone please explain to me the difference between these two groovy sql methods
sql.eachRow
sql.rows
Also, which is more efficient?
I am working on an application that retrieves data from the database (the result set is very huge) and writes it to a CSV file or returns a JSON format.
I was wondering which of the two methods mentioned above to use to have the process done faster and more efficiently.
Can anyone please explain to me the difference between these two groovy sql methods sql.eachRow sql.rows
It's difficult to tell exactly which two methods you're referring to because there are a large number of overloaded versions of each method. However, in all cases, eachRow returns nothing
void eachRow(String sql, Closure closure)
whereas rows returns a list of rows
List rows(String sql)
So if you use eachRow, the closure passed in as the second parameter should handle each row, e.g.
sql.eachRow("select * from PERSON where lastname = 'murphy'") { row ->
println "$row.firstname"
}
whereas if you use rows the rows are returned, and therefore should be handled by the caller, e.g.
rows("select * from PERSON where lastname = 'murphy'").each {row ->
println "$row.firstname"
}
Also, which is more efficient?
This question is almost unanswerable. Even if I had implemented these methods myself there's no way of knowing which one will perform better for you because I don't know
what hardware you're using
what JVM you're targeting
what version of Groovy you're using
what parameters you'll be passing
whether this method is a bottleneck for your application's performance
or any of the other factors that influence a method's performance that cannot be determined from the source code alone. The only way you can get a useful answer to the question of which method is more efficient for you is by measuring the performance of each.
Despite everything I've said above, I would be amazed if the performance difference between these two was in any way significant, so if I were you, I would choose whichever one you find more convenient. If you find later on that this method is a performance bottleneck, try using the other one instead (but I'll bet you a dollar to a dime it makes no difference).
If we set aside minor syntax differences, there is one difference that seems important. Let's consider
sql.rows("select * from my_table").each { row -> doIt(row) }
vs
sql.eachRow("select * from my_table") { row -> doIt(row) }
The first one opens a connection, retrieves the results, closes the connection, and returns them. Now you can iterate over the results while the connection is released. The drawback is that you now have the entire result list in memory, which in some cases might be a lot.
eachRow, on the other hand, opens a connection and, while keeping it open, executes your closure for each row. If your closure operates on the database and requires another connection, your code will consume two connections from the pool at the same time. The connection used by eachRow is released after it iterates through all the resulting rows. Also, if you don't perform any database operations but the closure takes a while to execute, you will be blocking one database connection until eachRow completes.
I am not 100% sure, but eachRow possibly allows you to avoid keeping all the resulting rows in memory and instead access them through a cursor - this may depend on the database driver.
If you don't perform any database operations inside your closure, the closure executes quickly, and the result list is big enough to impact memory, then I'd go for eachRow. If you do perform DB operations inside the closure, or each closure call takes significant time while the result list is manageable, then go for rows.
They differ in signature only - both support result sets paging, so both will be efficient. Use whichever fits your code.

How to determine if nHibernate object changed

Probably a stupid question but I'm still trying to wrap my head around nHibernate.
As far as I can tell from using the software, nHibernate requires you to do a little bit of extra handling for saving changes properly.
Let's imagine I have an object X which can contain many of object Y. I'll create an X which has 2 Y's, each of which have their own properties. I then decide I want to update X. I'm going to add a new Y, and change one of the existing Y's.
So I load in my object X using its ID. I then iterate through the Y's that I'm adding, add them to the X, and save the lot using an update statement.
If you do this, you find the "old" Y's get orphaned in the database. Which, when I think about it, is exactly what I'd expect to happen - I haven't got rid of those objects after all, I've just created some new ones.
So there's two ways to look at this. Either I ought to be deleting all the Y data and then re-creating it, or I ought to be able to flag up to nHibernate that what I'm doing is a change and that it should be updating existing objects rather than creating new ones. Trouble is, I'm not sure which is the "right" approach or how best to do it - the former seems tremendously inefficient and the latter means setting a lot of "changed" flags and very fiddly code.
So I'm pretty sure there must be an easier solution that I'm missing in my stupidity. Can someone point me at the best approach and how best to handle it in nHibernate ... that is if the question makes any sense at all :)
Cheers,
Matt
You probably have a mapping or usage problem.
Correctly configured, your usage should be something like this:
using (var session = sessionFactory.OpenSession())
using (var tx = session.BeginTransaction())
{
    var x = session.Get<X>(theId);
    x.Ys[0].SomeProperty = theNewValue;
    x.Ys.Add(theNewY);
    tx.Commit();
}
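If the old Ys are still being orphaned, the collection mapping usually needs cascading. Here is a minimal Fluent NHibernate sketch of that idea (Fluent mappings are an assumption here, and X, Y and the Ys collection are the question's placeholder names):
public class XMap : ClassMap<X>
{
    public XMap()
    {
        Id(x => x.Id);
        HasMany(x => x.Ys)
            .Inverse()
            .Cascade.AllDeleteOrphan(); // saves new Ys and deletes removed ones along with their parent X
    }
}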
You should post more details about the actual classes, mappings and usage.
Also, I suggest that you read the docs in full: http://nhibernate.info/doc/nh/en/index.html. It's only a few hours, and it will save you many days of frustration.

Fastest way to query for object existence in NHibernate

I am looking for the fastest way to check for the existence of an object.
The scenario is pretty simple, assume a directory tool, which reads the current hard drive. When a directory is found, it should be either created, or, if already present, updated.
First, let's focus only on the creation part:
public static DatabaseDirectory Get(DirectoryInfo dI)
{
    var result = DatabaseController.Session
        .CreateCriteria(typeof (DatabaseDirectory))
        .Add(Restrictions.Eq("FullName", dI.FullName))
        .List<DatabaseDirectory>().FirstOrDefault();

    if (result == null)
    {
        result = new DatabaseDirectory
        {
            CreationTime = dI.CreationTime,
            Existing = dI.Exists,
            Extension = dI.Extension,
            FullName = dI.FullName,
            LastAccessTime = dI.LastAccessTime,
            LastWriteTime = dI.LastWriteTime,
            Name = dI.Name
        };
    }

    return result;
}
Is this the way to go regarding:
Speed
Separation of Concern
What comes to mind is the following: a scan will always be performed "as a whole". Meaning, during a scan of drive C, I know that nothing new gets added to the database (from some other process). So it MAY be a good idea to "cache" all existing directories prior to the scan and look them up this way. On the other hand, this may not be suitable for large sets of data, like files (which will be 600,000 or more)...
Perhaps some performance gain can be achieved using "index columns" or something like this, but I am not so familiar with this topic. If anybody has some references, just point me in the right direction...
Thanks,
Chris
PS: I am using NHibernate, Fluent Interface, Automapping and SQL Express (could switch to full SQL)
Note:
In the given problem, the path is not the ID in the database. The ID is an auto-increment, and I can't change this requirement (for other reasons). So the real question is: what is the fastest way to check for the existence of an object when the ID is not known, just a property of that object?
And batching might be possible, by selecting a big group with something like "starts with C:\Testfiles\", but the problem then remains: how do I know in advance how big this set will be? I can't select "max 1000" and check in this buffered dictionary, because I might "hit next to the searched dir"... I hope this problem is clear. The most important part is: is buffering really affecting performance this much? If so, does it make sense to load the whole DB into a dictionary containing only PATH and ID (which will be OK even if there are 1,000,000 objects, I think)?
First off, I highly recommend that you (anyone using NH, really) read Ayende's article about the differences between Get, Load, and query.
In your case, since you need to check for existence, I would use .Get(id) instead of a query for selecting a single object.
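For example (a minimal sketch; the id variable is assumed to be known here, and in NHibernate Get returns null rather than throwing when the row does not exist, so it doubles as an existence check):
// Cheapest possible check when the id is known: a single primary-key lookup,
// answered from the session cache if the entity was already loaded.
var existing = session.Get<DatabaseDirectory>(id);
if (existing == null)
{
    // not in the database yet - create it
}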
However, I wonder if you might improve performance by utilizing some knowledge of your problem domain. If you're going to scan the whole drive and check each directory for existence in the database, you might get better performance by doing bulk operations. Perhaps create a DTO object that only contains the PK of your DatabaseDirectory object to further minimize data transfer/processing. Something like:
// Keyed by FullName so the query results can be matched back to the scanned directories.
Dictionary<string, DirectoryInfo> directories;

var existingDtos = session
    .CreateQuery("select new DatabaseDirectoryDTO(dd.FullName) from DatabaseDirectory dd where dd.FullName in (:ids)")
    .SetParameterList("ids", directories.Keys)
    .List();
Then just remove those elements that match the returned ID values to get the directories that don't exist. You might have to break the process into smaller batches depending on how large your input set is (for the files, almost certainly).
As far as separation of concerns, just keep the operation at a repository level. Have a method like SyncDirectories that takes a collection (maybe a Dictionary if you follow something like the above) that handles the process for updating the database. That way your higher application logic doesn't have to worry about how it all works and won't be affected should you find an even faster way to do it in the future.
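One hedged sketch of what such a SyncDirectories method might look like (the session field, the batch size and the HQL string are assumptions for illustration, not the poster's actual code):
public void SyncDirectories(Dictionary<string, DirectoryInfo> directories)
{
    const int batchSize = 1000;
    var allPaths = directories.Keys.ToList();

    for (var i = 0; i < allPaths.Count; i += batchSize)
    {
        var batch = allPaths.Skip(i).Take(batchSize).ToList();

        // Which of these paths already exist in the database?
        var existing = session
            .CreateQuery("select dd.FullName from DatabaseDirectory dd where dd.FullName in (:ids)")
            .SetParameterList("ids", batch)
            .List<string>();

        // Anything not returned is new and needs to be created.
        foreach (var path in batch.Except(existing))
        {
            var dI = directories[path];
            session.Save(new DatabaseDirectory
            {
                FullName = dI.FullName,
                Name = dI.Name,
                CreationTime = dI.CreationTime,
                LastWriteTime = dI.LastWriteTime
                // ... remaining properties as in the question's Get method
            });
        }
    }

    session.Flush();
}
Batching the IN clause keeps each round trip bounded regardless of how many directories the scan produces, which also sidesteps the "how big will this set be" concern from the question.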