Why is Orchard so slow when executing a content item query?

Let's say I want to query all Orchard user IDs, including users that have been removed (i.e. soft deleted). The DB contains around 1000 users.
Option A - takes around 2 minutes
Orchard.ContentManagement.IContentManager lContentManager = ...;
lContentManager
.Query<Orchard.Users.Models.UserPart, Orchard.Users.Models.UserPartRecord>(Orchard.ContentManagement.VersionOptions.AllVersions)
.List()
.Select(u => u.Id)
.ToList();
Option B - executes with almost unnoticeable delay
Orchard.Data.IRepository<Orchard.Users.Models.UserPartRecord> UserRepository = ...;
UserRepository.Fetch(u => true).Select(u => u.Id).ToList();
I don't see any SQL queries being executed in SQL Profiler when using Option A. I guess it has something to do with NHibernate or caching.
Is there any way to optimize Option A?

Could it be because the IContentManager version accesses the data via the Infoset (basically an XML representation of the data), whereas the IRepository version uses the actual DB table itself?
I seem to remember reading that, though the Infoset is great in many cases, when you're dealing with larger datasets with sorting / filtering it is more efficient to go directly to the table, because using the Infoset requires each XML fragment to be parsed and its elements extracted before you get to the data.
Since 'the shift', Orchard uses both, so you can use whichever method best suits your needs. I can't find the article that explained it now, but this explains the shift and infosets quite nicely:
http://weblogs.asp.net/bleroy/the-shift-how-orchard-painlessly-shifted-to-document-storage-and-how-it-ll-affect-you
Hope that helps you?

Related

neo4j count nodes performance on 200K nodes and 450K relations

We're developing an application based on neo4j and PHP with about 200K nodes, where every node has a property like type='user' or type='company' to denote a specific entity of our application. We need to get the count of all nodes of a specific type in the graph.
We created an index for every entity type (users, companies) which holds the nodes of that type. The users index holds about 130K nodes, and the companies index holds the rest.
With Cypher we query like this:
START u=node:users('id:*')
RETURN count(u)
And the results are
Returned 1 row. Query took 4080ms
The server runs the default configuration with a few tweaks, but 4 seconds is too slow for our needs. The database will grow by about 20K nodes a month, so we need this query to perform much better.
Is there any other way to do this, maybe with Gremlin, or with some other server plugin?
I'll cache those results, but I want to know if it is possible to tweak this.
Thanks a lot and sorry for my poor English.
Finally, using Gremlin instead of Cypher, I found the solution.
g.getRawGraph().index().forNodes('NAME_OF_USERS_INDEX').query(
new org.neo4j.index.lucene.QueryContext('*')
).size()
This method uses the Lucene index to get an "approximate" row count.
Thanks again to all.
Hmm, this is really about the performance of that Lucene index. If this single query is what you need most of the time, why not store the total count as an integer on some node and update it together with the index insertions? For good measure, you could also run the query above every night to correct the stored value.
You could instead keep a property on a specific node up to date with the number of such nodes, where updates are done guarded by write locks:
Transaction tx = db.beginTx();
try {
...
...
tx.acquireWriteLock( countingNode );
countingNode.setProperty( "user_count",
((Integer)countingNode.getProperty( "user_count" ))+1 );
tx.success();
} finally {
tx.finish();
}
If you want the best performance, don't model your entity categories as properties on the node. Instead, do it like this:
company1-[:IS_ENTITY]->companyentity
Or if you are using 2.0
company1:COMPANY
The second would also let your index be updated automatically in a separate background thread, which is, in my opinion, one of the best new features of 2.0.
The first method should also prove more efficient, since making a "hop" generally takes less time than reading a property from a node. It does, however, require you to create a separate index for the entities.
Your queries would look like this:
v2.0
MATCH (company:COMPANY)
RETURN count(company)
v1.9
START entity=node:entityindex(value='company')
MATCH company-[:IS_ENTITY]->entity
RETURN count(company)

Memory leak in Rails 3.0.11 migration

A migration contains the following:
Service.find_by_sql("select
service_id,
registrations.regulator_given_id,
registrations.regulator_id
from
registrations
order by
service_id, updated_at desc").each do |s|
this_service_id = s["service_id"]
if this_service_id != last_service_id
Service.find(this_service_id).update_attributes!(:regulator_id => s["regulator_id"],
:regulator_given_id => s["regulator_given_id"])
last_service_id = this_service_id
end
end
and it is eating up memory, to the point where it will not run in the 512MB allowed in Heroku (the registrations table has 60,000 items). Is there a known problem? Workaround? Fix in a later version of Rails?
Thanks in advance
Edit following request to clarify:
That is all the relevant source - the rest of the migration creates the two new columns that are being populated. The situation is that I have data about services from multiple sources (regulators of the services) in the registrations table. I have decided to 'promote' some of the data ([prime]regulator_id and [prime]regulator_given_key) into the services table for the prime regulators to speed up certain queries.
This will load all 60000 items in one go and keep those 60000 AR objects around, which will consume a fair amount of memory. Rails does provide a find_each method for breaking down a query like that into chunks of 1000 objects at a time, but it doesn't allow you to specify an ordering as you do.
You're probably best off implementing your own paging scheme. Using limit/offset is a possibility; however, large OFFSET values are usually inefficient because the database server has to generate a bunch of results that it then discards.
An alternative is to add conditions to your query that ensure you don't return already processed items, for example requiring that service_id be greater than the last value already processed (the query sorts by service_id ascending). This is more complicated if some items compare as equal under that ordering. With both of these paging schemes you probably need to think about what happens if a row gets inserted into your registrations table while you are processing it (probably not a problem with migrations, assuming you run them with access to the site disabled).
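The condition-based paging described above can be sketched as follows. This is a minimal illustration in Python with the standard sqlite3 module, not Rails code. The table and column names follow the question; the function name, the batch size, and the assumption that service ids are non-negative integers are hypothetical. Because only the newest row per service is needed, paging on `service_id > last seen` sidesteps the tie problem: any rows skipped at a batch boundary belong to a service whose newest row was already returned.

```python
import sqlite3


def newest_registration_per_service(conn, batch_size=1000):
    """Yield (service_id, regulator_id, regulator_given_id) for the newest
    registration of each service, reading at most batch_size rows per query."""
    last_service_id = -1  # assumption: service ids are non-negative integers
    while True:
        rows = conn.execute(
            "SELECT service_id, regulator_id, regulator_given_id "
            "FROM registrations WHERE service_id > ? "
            "ORDER BY service_id, updated_at DESC LIMIT ?",
            (last_service_id, batch_size),
        ).fetchall()
        if not rows:
            return
        for service_id, regulator_id, regulator_given_id in rows:
            # Rows are sorted newest-first within a service, so the first
            # row seen for each service_id is the one to keep.
            if service_id != last_service_id:
                yield service_id, regulator_id, regulator_given_id
                last_service_id = service_id


# Small in-memory demo with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE registrations (service_id INTEGER, regulator_id INTEGER, "
    "regulator_given_id TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO registrations VALUES (?, ?, ?, ?)",
    [
        (1, 10, "A-old", "2011-01-01"),
        (1, 11, "A-new", "2011-06-01"),
        (2, 20, "B-only", "2011-03-01"),
        (3, 30, "C-old", "2011-02-01"),
        (3, 31, "C-new", "2011-09-01"),
    ],
)
result = list(newest_registration_per_service(conn, batch_size=2))
```

Only one batch of rows is held in memory at a time, which is the point of the exercise; the per-service update would go where the `yield` is consumed.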
(Note: OP reports this didn't work)
Try something like this:
previous = nil
Registration.select('service_id, regulator_id, regulator_given_id')
.order('service_id, updated_at DESC')
.each do |r|
if previous != r.service_id
service = Service.find r.service_id
service.update_attributes(:regulator_id => r.regulator_id, :regulator_given_id => r.regulator_given_id)
previous = r.service_id
end
end
This is a kind of hacky way of getting the most recent record from regulators -- there's undoubtedly a better way to do it with DISTINCT or GROUP BY in SQL all in a single query, which would not only be a lot faster, but also more elegant. But this is just a migration, right? And I didn't promise elegant. I also am not sure it will work and resolve the problem, but I think so :-)
The key change is that instead of using SQL, this uses AREL, meaning (I think) the update operation is performed once on each associated record as AREL returns them. With SQL, you return them all and store in an array, then update them all. I also don't think it's necessary to use the .select(...) clause.
Very interested in the result, so let me know if it works!
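For the single-query route mentioned in this answer, one possible shape is an UPDATE with correlated subqueries, so that no rows are loaded into the application at all. The sketch below uses Python's sqlite3 module purely to make it runnable; the schema is a guess based on the column names in the question, and the sample data is made up. The same statement should be close to what other databases accept, but it has only been reasoned through against SQLite here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE services (
        id INTEGER PRIMARY KEY, regulator_id INTEGER, regulator_given_id TEXT);
    CREATE TABLE registrations (
        service_id INTEGER, regulator_id INTEGER,
        regulator_given_id TEXT, updated_at TEXT);
    INSERT INTO services (id) VALUES (1), (2), (3);
    INSERT INTO registrations VALUES
        (1, 10, 'A-old',  '2011-01-01'),
        (1, 11, 'A-new',  '2011-06-01'),
        (2, 20, 'B-only', '2011-03-01');
    """
)

# One statement: for every service that has registrations, copy the values
# from its most recently updated registration.
conn.execute(
    """
    UPDATE services SET
        regulator_id = (
            SELECT r.regulator_id FROM registrations r
            WHERE r.service_id = services.id
            ORDER BY r.updated_at DESC LIMIT 1),
        regulator_given_id = (
            SELECT r.regulator_given_id FROM registrations r
            WHERE r.service_id = services.id
            ORDER BY r.updated_at DESC LIMIT 1)
    WHERE EXISTS (
        SELECT 1 FROM registrations r WHERE r.service_id = services.id)
    """
)

rows = conn.execute(
    "SELECT id, regulator_id, regulator_given_id FROM services ORDER BY id"
).fetchall()
```

In a Rails migration a statement like this would typically be sent through the connection's raw `execute`, which avoids instantiating any ActiveRecord objects.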

django objects...values() select only some fields

I'm optimizing the memory load (~2GB, offline accounting and analysis routine) of this line:
l2 = Photograph.objects.filter(**(movie.get_selectors())).values()
Is there a way to convince django to skip certain columns when fetching values()?
Specifically, the routine obtains all rows of the table matching certain criteria (db is optimized and performs it very quickly), but it is a bit too much for python to handle - there is a long string referenced in each row, storing the urls for thumbnails.
I only really need three fields from each row, but, if all the fields are included, it suddenly consumes about 5kB/row which sadly pushes the RAM to the limit.
The values(*fields) function allows you to specify which fields you want.
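Passing field names, as in `.values('id', 'width', 'height')`, makes the generated SELECT name only those columns, so the long thumbnail string is never fetched. The field names here are hypothetical, since the question does not list the model's fields. The effect can be illustrated without Django using plain sqlite3; this is a sketch of the underlying column projection, not of Django's internals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows convertible to dicts, like values()
conn.execute(
    "CREATE TABLE photograph (id INTEGER PRIMARY KEY, width INTEGER, "
    "height INTEGER, thumbnail_urls TEXT)"
)
conn.executemany(
    "INSERT INTO photograph (width, height, thumbnail_urls) VALUES (?, ?, ?)",
    [(640, 480, "http://example.com/t/" + "x" * 5000) for _ in range(3)],
)

# Comparable to .values(): every column, including the ~5 kB string per row.
all_columns = [dict(r) for r in conn.execute("SELECT * FROM photograph")]

# Comparable to .values('id', 'width', 'height'): three small columns only.
narrow = [dict(r) for r in conn.execute("SELECT id, width, height FROM photograph")]

wide_chars = sum(len(str(v)) for row in all_columns for v in row.values())
narrow_chars = sum(len(str(v)) for row in narrow for v in row.values())
```

If tuples are enough, `values_list` with the same field names avoids building a dict per row.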
Check out the QuerySet method, only. When you declare that you only want certain fields to be loaded immediately, the QuerySet manager will not pull in the other fields in your object, till you try to access them.
If you have to deal with ForeignKeys, that must also be pre-fetched, then also check out select_related
The two links above to the Django documentation have good examples, that should clarify their use.
Take a look at Django Debug Toolbar. It comes with a debugsqlshell management command that lets you see the SQL queries being generated, along with the time taken, as you play around with your models in a Django/Python shell.

Fastest way to query for object existence in NHibernate

I am looking for the fastest way to check for the existence of an object.
The scenario is pretty simple, assume a directory tool, which reads the current hard drive. When a directory is found, it should be either created, or, if already present, updated.
First, let's focus only on the creation part:
public static DatabaseDirectory Get(DirectoryInfo dI)
{
var result = DatabaseController.Session
.CreateCriteria(typeof (DatabaseDirectory))
.Add(Restrictions.Eq("FullName", dI.FullName))
.List<DatabaseDirectory>().FirstOrDefault();
if (result == null)
{
result = new DatabaseDirectory
{
CreationTime = dI.CreationTime,
Existing = dI.Exists,
Extension = dI.Extension,
FullName = dI.FullName,
LastAccessTime = dI.LastAccessTime,
LastWriteTime = dI.LastWriteTime,
Name = dI.Name
};
}
return result;
}
Is this the way to go regarding:
Speed
Separation of Concern
What comes to mind is the following: A scan will always be performed "as a whole". Meaning, during a scan of drive C, I know that nothing new gets added to the database (from some other process). So it MAY be a good idea to "cache" all existing directories prior to the scan, and look them up this way. On the other hand, this may be not suitable for large sets of data, like files (which will be 600.000 or more)...
Perhaps some performance gain can be achieved using "index columns" or something like this, but I am not so familiar with this topic. If anybody has some references, just point me in the right direction...
Thanks,
Chris
PS: I am using NHibernate, Fluent Interface, Automapping and SQL Express (could switch to full SQL)
Note:
In the given problem, the path is not the ID in the database. The ID is an auto-increment, and I can't change this requirement (other reasons). So the real question is: what is the fastest way to "check for the existence of an object, where the ID is not known, just a property of that object"?
Batching might be possible by selecting a big group with something like "starts with C:\Testfiles\", but the problem then remains: how do I know in advance how big this set will be? I can't select "max 1000" and check in this buffered dictionary, because I might "hit next to the searched dir"... I hope this problem is clear. The most important question is whether buffering really affects performance this much. If so, does it make sense to load the whole DB into a dictionary containing only PATH and ID (which will be OK even if there are 1.000.000 objects, I think)?
First off, I highly recommend that you (anyone using NH, really) read Ayende's article about the differences between Get, Load, and query.
In your case, since you need to check for existence, I would use .Get(id) instead of a query for selecting a single object.
However, I wonder if you might improve performance by utilizing some knowledge of your problem domain. If you're going to scan the whole drive and check each directory for existence in the database, you might get better performance by doing bulk operations. Perhaps create a DTO object that only contains the PK of your DatabaseDirectory object to further minimize data transfer/processing. Something like:
Dictionary<string, DirectoryInfo> directories;
session.CreateQuery("select new DatabaseDirectoryDTO(dd.FullName) from DatabaseDirectory dd where dd.FullName in (:ids)")
.SetParameterList("ids", directories.Keys)
.List();
Then just remove those elements that match the returned ID values to get the directories that don't exist. You might have to break the process into smaller batches depending on how large your input set is (for the files, almost certainly).
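The batching idea above can be sketched independently of NHibernate. The example below uses Python's sqlite3 module; the table and column names (`database_directory`, `full_name`), the function name, and the batch size are assumptions made for illustration. It asks the database which of the scanned paths already exist, a few hundred at a time, and returns the ones that still have to be created.

```python
import sqlite3


def find_missing(conn, paths, batch_size=500):
    """Return the paths that have no row in database_directory yet."""
    paths = list(paths)
    existing = set()
    for start in range(0, len(paths), batch_size):
        batch = paths[start:start + batch_size]
        placeholders = ",".join("?" * len(batch))
        rows = conn.execute(
            "SELECT full_name FROM database_directory "
            f"WHERE full_name IN ({placeholders})",
            batch,
        )
        existing.update(name for (name,) in rows)
    # Preserve the scan order of the input.
    return [p for p in paths if p not in existing]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE database_directory (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "full_name TEXT NOT NULL)"
)
# An index on the lookup column keeps each IN (...) probe cheap.
conn.execute("CREATE UNIQUE INDEX ix_full_name ON database_directory (full_name)")
conn.executemany(
    "INSERT INTO database_directory (full_name) VALUES (?)",
    [("C:\\a",), ("C:\\b",)],
)

missing = find_missing(conn, ["C:\\a", "C:\\c", "C:\\b", "C:\\d"], batch_size=2)
```

Because each batch is defined by the scanned paths rather than by a row limit on the database side, there is no risk of "hitting next to" the directory being searched for.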
As far as separation of concerns, just keep the operation at a repository level. Have a method like SyncDirectories that takes a collection (maybe a Dictionary if you follow something like the above) that handles the process for updating the database. That way your higher application logic doesn't have to worry about how it all works and won't be affected should you find an even faster way to do it in the future.

Batch Update in NHibernate

Does batch update command exist in NHibernate? As far as I am aware it doesn't. So what's the best way to handle this situation? I would like to do the following:
Fetch a list of objects (let's call them a list of users, List<User>) from the database.
Change the properties of those objects ( Users.Foreach(User=>User.Country="Antarctica") ).
Update each item back individually ( Users.Foreach(User=>NHibernate.Session.Update(User)) ).
Call Session.Flush to update the database.
Is this a good approach? Will this result in a lot of round trips between my code and the database?
What do you think? Or is there a more elegant solution?
I know I'm late to the party on this, but thought you may like to know this is now possible using HQL in NHibernate 2.1+
session.CreateQuery(@"update Users set Country = 'Antarctica'")
.ExecuteUpdate();
Starting with NHibernate 3.2, batch jobs have improvements which minimize database roundtrips. More information can be found on the HunabKu blog.
Here is an example from it. These batch updates do only 6 roundtrips:
using (ISession s = OpenSession())
using (s.BeginTransaction())
{
for (int i = 0; i < 12; i++)
{
var user = new User {UserName = "user-" + i};
var group = new Group {Name = "group-" + i};
s.Save(user);
s.Save(group);
user.AddMembership(group);
}
s.Transaction.Commit();
}
You can set the batch size for updates in the nhibernate config file.
<property name="hibernate.adonet.batch_size">16</property>
And you don't need to call Session.Update(User) there - just flush or commit a transaction and NHibernate will handle things for you.
EDIT: I was going to post a link to the relevant section of the nhibernate docs but the site is down - here's an old post from Ayende on the subject:
As to whether the use of NHibernate (or any ORM) here is a good approach, it depends on the context. If you are doing a one-off update of every row in a large table with a single value (like setting all users to the country 'Antarctica', which is a continent, not a country, by the way!), then you should probably use a SQL UPDATE statement. If you are going to be updating several records at once with a country as part of your business logic in the general usage of your application, then using an ORM could be a more sensible method. This depends on the number of rows you are updating each time.
Perhaps the most sensible option here if you are not sure is to tweak the batch_size option in NHibernate and see how that works out. If the performance of the system is not acceptable then you might look at implementing a straight sql UPDATE statement in your code.
Starting with NHibernate 5.0 it is possible to make bulk operations using LINQ.
session.Query<Cat>()
.Where(c => c.BodyWeight > 20)
.Update(c => new { BodyWeight = c.BodyWeight / 2 });
NHibernate will generate a single "update" sql query.
See Updating entities
You don't need to update, nor flush:
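IList<User> users = session.CreateQuery (...).List<User>();
// Modify the loaded entities; NHibernate tracks the changes.
foreach (var u in users) u.Country = "Antarctica";
// Committing flushes the pending updates to the database.
session.Transaction.Commit();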
I think NHibernate writes a batch for all the changes.
The problem is that your users need to be loaded into memory. If that becomes a problem, you can still use native SQL through NHibernate. But until you have shown that it is a performance problem, stick with the nice solution.
No it's not a good approach!
Native SQL is many times better for this sort of update.
UPDATE USERS SET COUNTRY = 'Antarctica';
It just could not be simpler, and the database engine will process this far more efficiently than row-at-a-time code in the application.
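To make the contrast in this thread concrete, the sketch below counts the statements each approach sends. It uses Python's sqlite3 module with a made-up `users` table, so it only illustrates the shape of the two approaches; it says nothing about NHibernate's actual batching or about timings on a real server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT)")
conn.executemany("INSERT INTO users (country) VALUES (?)", [("Norway",)] * 1000)

# Row at a time: load the ids, then send one UPDATE per row.
ids = [row[0] for row in conn.execute("SELECT id FROM users")]
per_row_statements = 0
for user_id in ids:
    conn.execute(
        "UPDATE users SET country = ? WHERE id = ?", ("Antarctica", user_id)
    )
    per_row_statements += 1

# Set-based: one statement, nothing loaded into the application.
cursor = conn.execute("UPDATE users SET country = ?", ("Antarctica",))
rows_touched = cursor.rowcount

distinct = conn.execute("SELECT DISTINCT country FROM users").fetchall()
```

Both paths leave the table in the same state; the first needs one statement per user, while the second needs one statement in total. ORM-side batching reduces the round trips of the first approach but still requires the rows to be loaded.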