Symfony2, Doctrine, add\insert\update best solution for big count of queries - optimization

Let's imagine we have this code:
while (true)
{
foreach($array as $row)
{
$item = $em->getRepository('reponame')->findOneBy(array('filter'));
if (!$item)
{
$needPersist = true;
$item = new Item();
}
$item->setItemName()
// and so on ...
if ($needPersist)
{
$em->persist();
}
}
$em->flush();
}
So, the point is that code will be executed a lot of times (while server won't die :) ). And we want to optimize it. Every time we:
Select already entry from repository.
If entry not exists, create it.
Set new (update) vars to it.
Apply actions (flush).
So question is - how to avoid unnecessary queries and optimize "check if entry is exist"? Because when there are 100-500 queries it's not so scary... But when it comes up to 1000-10000 for one while loop - it's too much.
PS: Each entry in DB is unique by several columns (not only by ID).

Instead of fetching results one-by-one, load all results with one query.
Eg.
let's say your filter wants to load ids 1, 2, 10. So QB would be something like:
$allResults = ...
->where("o.id IN (:ids)")->setParameter("ids", $ids)
->getQuery()
->getResults() ;
"foreach" of these results, do your job of updating them and flushing
While doing that loop, save ids of those fetched objects in new array
Compare that array with original one using array_diff. Now you have ids that were not fetched the first time
Rinse and repeat :)
And don't forget $em->clear() to free memory
While this can still be slow when working with 10.000 records (dunno, never tested), it will be much faster to have 2 big queries than 10.000 small ones.

Regardless if you need them to persist or not after the update, retrieving 10k+ and up entries from the database and hydrating them to php objects is going to need too much memory. In such cases you should better fallback to the Doctrine DBAL Layer and fire pure SQL queries.

Related

Should I use unique tables for every user?

I'm working on an web app that collects traffic information for websites that use my service. Think google analytics but far more visual. I'm using SQL Server 2012 for the backbone of my app and am considering using MongoDB as the data gathering analytic side of the site.
If I have 100 users with an average of 20,000 hits a month on their site, that's 2,000,000 records in a single collection that will be getting queried.
Should I use MongoDB to store this information (I'm new to it and new things are intimidating)?
Should I dynamically create new collections/tables for every new user?
Thanks!
With MongoDB the collection (aka sql table) can get quite big without much issue. That is largely what it is designed for. The Mongo is part HuMONGOus (pretty clever eh). This is a great use for mongodb which is great at storing point in time information.
Options :
1. New Collection for each Client
very easy to do I use a GetCollectionSafe Method for this
public class MongoStuff
private static MongoDatabase GetDatabase()
{
var databaseName = "dbName";
var connectionString = "connStr";
var client = new MongoClient(connectionString);
var server = client.GetServer();
return server.GetDatabase(databaseName);
}
public static MongoCollection<T> GetCollection<T>(string collectionName)
{
return GetDatabase().GetCollection<T>(collectionName);
}
public static MongoCollection<T> GetCollectionSafe<T>(string collectionName)
{
//var db = GetDatabase();
var db = GetDatabase();
if (!db.CollectionExists(collectionName)) {
db.CreateCollection(collectionName);
}
return db.GetCollection<T>(collectionName);
}
}
then you can call with :
var collection = MongoStuff.GetCollectionSafe<Record>("ClientName");
Running this script
static void Main(string[] args)
{
var times = new List<long>();
for (int i = 0; i < 1000; i++)
{
Stopwatch watch = new Stopwatch();
watch.Start();
MongoStuff.GetCollectionSafe<Person>(String.Format("Mark{0:000}", i));
watch.Stop();
Console.WriteLine(watch.ElapsedMilliseconds);
times.Add(watch.ElapsedMilliseconds);
}
Console.WriteLine(String.Format("Max : {0} \nMin : {1} \nAvg : {2}", times.Max(f=>f), times.Min(f=> f), times.Average(f=> f)));
Console.ReadKey();
}
Gave me (on my laptop)
Max : 180
Min : 1
Avg : 6.635
Benefits :
Ease of splitting data if one client needs to go on their own
Might match your brain map of the problem
Cons :
Almost impossible to do aggregate data over all collections
Hard to find collections in Management studios (like robomongo)
2. One Large Collection
Use one collection for it all access it this way
var coll = MongoStuff.GetCollection<Record>("Records");
Put an index on the table (the index will make reads orders of magnitude quicker)
coll.EnsureIndex(new IndexKeysBuilder().Ascending("ClientId"));
needs to only be run once (per collection, per index )
Benefits :
One Simple place to find data
Aggregate over all clients possible
More traditional Mongodb setup
Cons :
All Clients Data is intermingled
May not mentally map as well
Just as a reference the mongodb limits for sizes are here :
[http://docs.mongodb.org/manual/reference/limits/][1]
3. Store only aggregated data
If you are never intending to break down to an individual record just save the aggregates themselves.
Page Loads :
# Page Total Time Average Time
15 Default.html 1545 103
I will let someone else tackle the MongoDB side of your question as I don't feel I'm the best person to comment on it, I would point out that MongoDB is a very different animal and you'll lose a lot of the RI you enjoy in SQL.
In terms of SQL design I would not use a different schema for each customer approach. Your database schema and backups could grow uncontrollably, maintaining a dynamically growing schema will be a nightmare.
I would suggest one of two approaches:
Either you can create a new database for each customer:
This is more secure as users cannot access each other's data (just use different credentials) and users are easier to manage/migrate and separate.
However many hosting providers charge per database, it will cost more to run and maintain and should you wish to compare data across users it gets much more challenging.
Your second approach is to simply host all users in a single DB, your tables will grow large (although 2 million rows is not over the top for a well maintained SQL DB). You would simply use a UserID column to discriminate.
The emphasis will be on you to get the performance you need through proper indexing
Users' data will exist in the same system and there's no SQL defense against users accessing each other's data - your code will have to be good!

Nhibernate QueryOver don't get latest database changes

I am trying get a record updated from database with QueryOver.
My code initially creates an entity and saves in database, then the same record is updated on database externally( from other program, manually or the same program running in other machine), and when I call queryOver filtering by the field changed, the query gets the record but without latest changes.
This is my code:
//create the entity and save in database
MyEntity myEntity = CreateDummyEntity();
myEntity.Name = "new_name";
MyService.SaveEntity(myEntity);
// now the entity is updated externally changing the name property with the
// "modified_name" value (for example manually in TOAD, SQL Server,etc..)
//get the entity with QueryOver
var result = NhibernateHelper.Session
.QueryOver<MyEntity>()
.Where(param => param.Name == "modified_name")
.List<T>();
The previous statement gets a collection with only one record(good), BUT with the name property established with the old value instead of "modified_name".
How I can fix this behaviour? First Level cache is disturbing me? The same problem occurs with
CreateCriteria<T>();
The session in my NhibernateHelper is not being closed in any moment due application framework requirements, only are created transactions for each commit associated to a session.Save().
If I open a new session to execute the query evidently I get the latest changes from database, but this approach is not allowed by design requirement.
Also I have checked in the NHibernate SQL output that a select with a WHERE clause is being executed (therefore Nhibernate hits the database) but don´t updates the returned object!!!!
UPDATE
Here's the code in SaveEntity after to call session.Save: A call to Commit method is done
public virtual void Commit()
{
try
{
this.session.Flush();
this.transaction.Commit();
}
catch
{
this.transaction.Rollback();
throw;
}
finally
{
this.transaction = this.session.BeginTransaction();
}
}
The SQL generated by NHibernate for SaveEntity:
NHibernate: INSERT INTO MYCOMPANY.MYENTITY (NAME) VALUES (:p0);:p0 = 'new_name'.
The SQL generated by NHibernate for QueryOver:
NHibernate: SELECT this_.NAME as NAME26_0_
FROM MYCOMPANY.MYENTITY this_
WHERE this_.NAME = :p0;:p0 = 'modified_name' [Type: String (0)].
Queries has been modified due to company confidential policies.
Help very appreciated.
As far as I know, you have several options :
have your Session as a IStatelessSession, by calling sessionFactory.OpenStatelesSession() instead of sessionFactory.OpenSession()
perform Session.Evict(myEntity) after persisting an entity in DB
perform Session.Clear() before your QueryOver
set the CacheMode of your Session to Ignore, Put or Refresh before your QueryOver (never tested that)
I guess the choice will depend on the usage you have of your long running sessions ( which, IMHO, seem to bring more problems than solutions )
Calling session.Save(myEntity) does not cause the changes to be persisted to the DB immediately*. These changes are persisted when session.Flush() is called either by the framework itself or by yourself. More information about flushing and when it is invoked can be found on this question and the nhibernate documentation about flushing.
Also performing a query will not cause the first level cache to be hit. This is because the first level cache only works with Get and Load, i.e. session.Get<MyEntity>(1) would hit the first level cache if MyEntity with an id of 1 had already been previously loaded, whereas session.QueryOver<MyEntity>().Where(x => x.id == 1) would not.
Further information about NHibernate's caching functionality can be found in this post by Ayende Rahien.
In summary you have two options:
Use a transaction within the SaveEntity method, i.e.
using (var transaction = Helper.Session.BeginTransaction())
{
Helper.Session.Save(myEntity);
transaction.Commit();
}
Call session.Flush() within the SaveEntity method, i.e.
Helper.Session.Save(myEntity);
Helper.Session.Flush();
The first option is the best in pretty much all scenarios.
*The only exception I know to this rule is when using Identity as the id generator type.
try changing your last query to:
var result = NhibernateHelper.Session
.QueryOver<MyEntity>()
.CacheMode(CacheMode.Refresh)
.Where(param => param.Name == "modified_name")
if that still doesn't work, try add this after the query:
NhibernateHelper.Session.Refresh(result);
After search and search and think and think.... I´ve found the solution.
The fix: It consist in open a new session, call QueryOver<T>() in this session and the data is succesfully refreshed. If you get child collections not initialized you can call HibernateUtil.Initialize(entity) or sets lazy="false" in your mappings. Take special care about lazy="false" in large collections, because you can get a poor performance. To fix this problem(performance problem loading large collections), set lazy="true" in your collection mappings and call the mentioned method HibernateUtil.Initialize(entity) of the affected collection to get child records from database; for example, you can get all records from a table, and if you need access to all child records of a specific entity, call HibernateUtil.Initialize(collection) only for the interested objects.
Note: as #martin ernst says, the update problem can be a bug in hibernate and my solution is only a temporal fix, and must be solved in hibernate.
People here do not want to call Session.Clear() since it is too strong.
On the other hand, Session.Evict() may seem un-applicable when the objects are not known beforehand.
Actually it is still usable.
You need to first retrieve the cached objects using the query, then call Evict() on them. And then again retrieve fresh objects calling the same query again.
This approach is slightly inefficient in case the object was not cached to begin with - since then there would be actually two "fresh" queries - but there seems to be not much to do about that shortcoming...
By the way, Evict() accepts null argument too without exceptions - this is useful in case the queried object is actually not present in the DB.
var cachedObjects = NhibernateHelper.Session
.QueryOver<MyEntity>()
.Where(param => param.Name == "modified_name")
.List<T>();
foreach (var obj in cachedObjects)
NhibernateHelper.Session.Evict(obj);
var freshObjects = NhibernateHelper.Session
.QueryOver<MyEntity>()
.Where(param => param.Name == "modified_name")
.List<T>()
I'm getting something very similar, and have tried debugging NHibernate.
In my scenario, the session creates an object with a couple children in a related collection (cascade:all), and then calls ISession.Flush().
The records are written into the DB, and the session needs to continue without closing. Meanwhile, another two child records are written into the DB and committed.
Once the original session then attempts to re-load the graph using QueryOver with JoinAlias, the SQL statement generated looks perfectly fine, and the rows are being returned correctly, however the collection that should receive these new children is found to have already been initialized within the session (as it should be), and based on that NH decides for some reason to completely ignore the respective rows.
I think NH makes an invalid assumption here that if the collection is already marked "Initialized" it does not need to be re-loaded from the query.
It would be great if someone more familiar with NHibernate internals could chime in on this.

How can I speed my Entity Framework code?

My SQL and Entity Framework knowledge is a somewhat limited. In one Entity Framework (4) application, I notice it takes forever (about 2 minutes) to complete one of my method calls. The first queries do not take much time, but when I loop through the Entity Framework objects returned by the queries, even though I am only reading (not modifying) the data I supposedly got, it takes forever to complete the nested loops, even though there are only dozens of entries in each list and a few levels of looping.
I expect the example below could be re-written with a fancier query that could probably include all of the filtering I am doing in my loops with some SQL words I don't really know how to use, so if someone could show me what the equivalent SQL expression would be, that would be extremely educational to me and probably solve my current performance problem.
Moreover, since other parts of this and other applications I develop often want to do more complex computations on SQL data, I would also like to know a good way to retrieve data from Entity Framework to local memory objects that do not have huge delays in reading them. In my LINQ-to-SQL project there was a similar performance problem, and I solved it by refactoring the whole application to load all SQL data into parallel objects in RAM, which I had to write myself, and I wonder if there isn't a better way to either tell Entity Framework to not keep doing whatever high-latency communication it is doing, or to load into local RAM objects.
In the example below, the code gets a list of food menu items for a member (i.e. a person) on a certain date via a SQL query, and then I use other queries and loops to filter out the menu items on two criteria: 1) If the member has a rating of zero for any group id which the recipe is a member of (a many-to-many relationship) and 2) If the member has a rating of zero for the recipe itself.
Example:
List<PFW_Member_MenuItem> MemberMenuForCookDate =
(from item in _myPfwEntities.PFW_Member_MenuItem
where item.MemberID == forMemberId
where item.CookDate == onCookDate
select item).ToList();
// Now filter out recipes in recipe groups rated zero by the member:
List<PFW_Member_Rating_RecipeGroup> ExcludedGroups =
(from grpRating in _myPfwEntities.PFW_Member_Rating_RecipeGroup
where grpRating.MemberID == forMemberId
where grpRating.Rating == 0
select grpRating).ToList();
foreach (PFW_Member_Rating_RecipeGroup grpToExclude in ExcludedGroups)
{
List<PFW_Member_MenuItem> rcpsToRemove = new List<PFW_Member_MenuItem>();
foreach (PFW_Member_MenuItem rcpOnMenu in MemberMenuForCookDate)
{
PFW_Recipe rcp = GetRecipeById(rcpOnMenu.RecipeID);
foreach (PFW_RecipeGroup group in rcp.PFW_RecipeGroup)
{
if (group.RecipeGroupID == grpToExclude.RecipeGroupID)
{
rcpsToRemove.Add(rcpOnMenu);
break;
}
}
}
foreach (PFW_Member_MenuItem rcpToRemove in rcpsToRemove)
MemberMenuForCookDate.Remove(rcpToRemove);
}
// Now filter out recipes rated zero by the member:
List<PFW_Member_Rating_Recipe> ExcludedRecipes =
(from rcpRating in _myPfwEntities.PFW_Member_Rating_Recipe
where rcpRating.MemberID == forMemberId
where rcpRating.Rating == 0
select rcpRating).ToList();
foreach (PFW_Member_Rating_Recipe rcpToExclude in ExcludedRecipes)
{
List<PFW_Member_MenuItem> rcpsToRemove = new List<PFW_Member_MenuItem>();
foreach (PFW_Member_MenuItem rcpOnMenu in MemberMenuForCookDate)
{
if (rcpOnMenu.RecipeID == rcpToExclude.RecipeID)
rcpsToRemove.Add(rcpOnMenu);
}
foreach (PFW_Member_MenuItem rcpToRemove in rcpsToRemove)
MemberMenuForCookDate.Remove(rcpToRemove);
}
You can use EFProf http://www.hibernatingrhinos.com/products/EFProf to track see exactly what EF is sending to SQL. It can also show you how many queries you are sending and how many unique queries. It also provides you some analysis of each query (e.g. is it unbound etc). Entity Framework with its navigation properties, it is quite easy to not realize you are making a db request. When you are in a loop, and have a navigation property, you get in to the N + 1 problem.
You could use the Keyword Virtual on your List parts of your model if you are using code first to enable proxying, that way you will not have to get all the data back at once, only as you need it.
Also consider NoTracking for read only data
context.bigTable.MergeOption = MergeOption.NoTracking;

How to delete data in DB efficiently using LinQ to NHibernate (one-shot-delete)

Producing software for customers, mostly using MS SQL but some Oracle, a decision was made to plunge into Nhibernate (and C#).
The task is to delete efficiently e.g. 10 000 rows from 100 000 and still stay sticked to ORM.
I've tried named queries - link already,
IQuery sql = s.GetNamedQuery("native-delete-car").SetString(0, "Kirsten");
sql.ExecuteUpdate();
but the best I have ever found seems to be:
using (ITransaction tx = _session.BeginTransaction())
{
try
{
string cmd = "delete from Customer where Id < GetSomeId()";
var count = _session.CreateSQLQuery(cmd).ExecuteUpdate();
...
Since it may not get into dB to get all complete rows before deleting them.
My questions are:
If there is a better way for this kind of delete.
If there is a possibility to get the Where condition for Delete like this:
Having a select statement (using LinQ to NHibernate) => which will generate appropriate SQL for DB => we get that Where condition and use it for Delete.
If there is a better way for this kind of delete.
Yes, you could use HQL instead of SQL.
If there is a possibility to get the Where condition for Delete [using Expressions]:
No, AFAIK that's not implemented. Since NHibernate is an open source project, I encourage you to find out if anyone has proposed this, and/or discuss it on the mailing list.
Thanks for your quick reply. Now I've probably got the difference.
session.CreateSQLQuery(cmd).ExecuteUpdate();
must have cmd with Delete From DbTable. On the contrary the HQL way
session.CreateQuery(cmd).ExecuteUpdate();
needs cmd with Delete From MappedCollectionOfObjects.
In that case it possibly solves my other question as well.
There now is a better way with NHibernate 5.0:
var biggestId = GetSomeId();
session.Query<Customer>()
.Where(c => c.Id < biggestId)
.Delete();
Documentation:
//
// Summary:
// Delete all entities selected by the specified query. The delete operation is
// performed in the database without reading the entities out of it.
//
// Parameters:
// source:
// The query matching the entities to delete.
//
// Type parameters:
// TSource:
// The type of the elements of source.
//
// Returns:
// The number of deleted entities.
public static int Delete<TSource>(this IQueryable<TSource> source);

EF: How to do effective lazy-loading (not 1+N selects)?

Starting with a List of entities and needing all dependent entities through an association, is there a way to use the corresponding navigation-propertiy to load all child-entities with one db-round-trip? Ie. generate a single WHERE fkId IN (...) statement via navigation property?
More details
I've found these ways to load the children:
Keep the set of parent-entities as IQueriable<T>
Not good since the db will have to find the main set every time and join to get the requested data.
Put the parent-objects into an array or list, then get related data through navigation properties.
var children = parentArray.Select(p => p.Children).Distinct()
This is slow since it will generate a select for every main-entity.
Creates duplicate objects since each set of children is created independetly.
Put the foreign keys from the main entities into an array then filter the entire dependent-ObjectSet
var foreignKeyIds = parentArray.Select(p => p.Id).ToArray();
var children = Children.Where(d => foreignKeyIds.Contains(d.Id))
Linq then generates the desired "WHERE foreignKeyId IN (...)"-clause.
This is fast but only possible for 1:*-relations since linking-tables are mapped away.
Removes the readablity advantage of EF by using Ids after all
The navigation-properties of type EntityCollection<T> are not populated
Eager loading though the .Include()-methods, included for completeness (asking for lazy-loading)
Alledgedly joins everything included together and returns one giant flat result.
Have to decide up front which data to use
It there some way to get the simplicity of 2 with the performance of 3?
You could attach the parent object to your context and get the children when needed.
foreach (T parent in parents) {
_context.Attach(parent);
}
var children = parents.Select(p => p.Children);
Edit: for attaching multiple, just iterate.
I think finding a good answer is not possible or at least not worth the trouble. Instead a micro ORM like Dapper give the big benefit of removing the need to map between sql-columns and object-properties and does it without the need to create a model first. Also one simply writes the desired sql instead of understanding what linq to write to have it generated. IQueryable<T> will be missed though.