NHibernate Criteria queries produce duplicates when using paging - nhibernate

Version: NHibernate 2.1
As can be seen from the vast array of similar questions - we're not alone in experiencing problems with paging generating duplicates. We thought it was just happening with HQL queries but one of our clients has reported seeing it where the query is a Criteria query.
So far we've only seen it on the reporting side - where we tend to collect bits of information from various 'associated' entities and use the AliasToBeanResultTransformer to put them into a DTO (Data Transfer Object):
.SetResultTransformer(new AliasToBeanResultTransformer(typeof(OurDTO)));
We're not new to NHibernate, but we're certainly not aware of all of its subtleties, and as a result weren't aware of
new NHibernate.Transform.DistinctRootEntityResultTransformer()
which could potentially eliminate our duplicates, but I'm struggling to see how we could use it when the result is not a mapped entity but a DTO.
We've also tried creating a custom dialect, an approach which seems to have served some people well enough for them to be confident of consistent behaviour.
I realise there's no such thing as a silver bullet and context is always the kicker, but has anyone managed to come up with a solution for this?
The code we use to handle the collation of the pages is as follows:
query.SetMaxResults(50);
for (int i = 0; ; ++i)
{
    query.SetFirstResult(i * 50);
    IList results = query.List();
    cumulativeResults.AddRange(results);
    OnRecordsLoaded(results.Count);
    if (results.Count < 50)
    { break; }
}
Many thanks for any input on this.
With kind regards
Colin

NHibernate does not produce duplicates. The relational database does. And you cannot prevent that.
If your query involves a one-to-many join - say you have customer and order tables with a one-to-many relation between them, and you query the customers filtering by order - you will get multiple customers (of the same identity).
The way to prevent it is to use HashedSets in memory, assuming you properly overrode Equals and GetHashCode for your entities, which you should. If you put the results into a HashedSet (from Iesi, or HashSet from .NET 4), it will eliminate the duplicates.
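To illustrate, here is a minimal sketch with hypothetical entity and variable names (not taken from the question), assuming Equals and GetHashCode compare the identifier:
// Hypothetical entity: Equals/GetHashCode are based on the identifier,
// so two rows returned for the same Customer compare as equal.
public class Customer
{
    public virtual int Id { get; set; }
    public virtual string Name { get; set; }

    public override bool Equals(object obj)
    {
        var other = obj as Customer;
        return other != null && other.Id == Id;
    }

    public override int GetHashCode()
    {
        return Id.GetHashCode();
    }
}

// Collate the paged results into a set instead of a list; adding an entity
// that is already present is a no-op, so duplicates disappear.
// (System.Collections.Generic.HashSet<T> and Iesi's HashedSet<T> behave the same here.)
var distinctCustomers = new HashSet<Customer>();
foreach (Customer c in query.List<Customer>())
{
    distinctCustomers.Add(c);
}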
That's one of the gotchas of ORMs.

Related

Should I use Eloquent ORM or create big joins by Fluent?

Well, I'm using Eloquent ORM for a project that I'm developing, but its performance is bugging me. When I use only its own methods, I can see from its query log that it creates a lot of queries.
I'm trying to fetch data from a main table with 4 other tables, one related to it one-to-one and the others many-to-many. Eloquent creates about 6-7 queries for that, which makes me afraid of performance issues. Then I remove Eloquent's methods and create jumbo queries with Fluent, using lots of joins, which makes me lose code readability and practicality.
What I really need to know is: Does Eloquent sacrifice performance? Should I stick to it, or use just Fluent? And what is better, a few big joined queries or more small ones?
I'm going to extend Sebastian's answer.
I too have many-to-many and one-to-many relationships.
I have actually melded Eloquent's style of programming (it's easier on the eyes) with a bit of a joint hack with Fluent. Please be reminded that Eloquent is an extension of Fluent, so you're not sacrificing anything unless you are writing bad queries.
If you have a User and a Phone model with a one-to-one or one-to-many relation (a user can have many phone numbers), and you simply where()->get() and then access $users->phone, Eloquent will run a select * for each ID. This is where eager loading (referenced by Sebastian, but too briefly to actually explain it) comes in: it prefetches all the required IDs and eager loads the related records (you can verify this by running a query log profiler).
The added bonus is that you can eager load many relationships this way.
So "does Eloquent cause a performance hit" is not a cut-and-dried question; it depends on whether you use it the right way.
Now here is a small example to how I put both eloquent and fluent to use:
Within the Book model I have defined a scope function that performs the relationship join:
public function scopeLicensorStatus($query, $licensor_status)
{
    return $query->select('book.*')
        ->leftJoin('licensors as l', 'l.id', '=', 'book.licensor_id')
        ->where('l.status', '=', $licensor_status);
}
$bookData = Book::licensorStatus('active')
    ->where('book.status', '=', 'active')
    ->whereIn('book.id', $recommendedIds)
    ->take($limit)
    ->skip($offset)
    ->get();
What this does is perform the join for me as a function and let me chain the commands from the outside. In the end (if you call toSql() instead of get()) you will get a single query that matches the raw SQL; however, as you can see, a) the code is reusable if you foresee reusing the join with other constraints, b) you're not sacrificing speed since the final query is a single one (you just need to write it properly), and c) it looks nicer and more readable, which is why we like Eloquent.
Hope this answer helps you dive a bit more into Eloquent.

NHibernate QueryCache in Multiuser-Environment

For our web application (ASP.NET) we're using Fluent NHibernate (2.1.2) with 2nd-level caching not only for entities, but also for queries (generating queries with the criteria API). We're using the session-per-request pattern and one SessionFactory application-wide, so the cache serves all NHibernate sessions.
Problem:
We have to deal with different access rights per user on the data objects in our legacy database (Oracle) - that is, views constrain the returned data according to the user's rights.
So there is the situation where, for example, the same view is queried by our criteria with the exact same query, but returns a different result set depending on the user's rights.
Now, to gain performance, the mentioned query is cached. But this gives us a problem: when the query is first fired from an action of user A, it caches the resulting IDs, which are the IDs to which user A has access rights. Shortly after, the same query is fired from an action of user B, and NHibernate then picks up the cached IDs from the first call (from user A) and tries to load the corresponding entities, to which user B doesn't have access rights (or maybe not to all of them). We're checking the rights with event listeners, so our application throws an access-rights exception in that case.
Thoughts:
Not caching the queries would avoid this, but performance is clearly an issue in our application, so it would be really desirable to have cached queries per user.
We even thought about a SessionFactory per user, to have a cache per user, sort of. But this clearly has an impact on resources, is somewhat of an overkill, and honestly isn't an option, because there are entities which have to be accessed, and are manipulated, by multiple users (think of a user group), creating issues with stale data in the "individual caches" and so on. So that's a no-go.
What would be a valid solution for this? Is there something like "best practice" for such a situation?
Idea:
As I was stuck with this yesterday, seeing no way out, I slept on it, and today I came up with some sort of a "hack".
As NHibernate caches the query by query text and parameters ("clauses"), I thought about a way to "smuggle" something user-dependent into that signature, so that every query would be cached per user without altering the query itself (as far as its result is concerned).
So "creativity" guided me to this (example-code):
string userName = GetCurrentUser();
ICriteria criteria = session.CreateCriteria(typeof (EntityType))
.SetCacheable(true)
.SetCacheMode(CacheMode.Normal)
.Add(Expression.Eq("PropertyA", 1))
.Add(Expression.IsNotNull("PropertyB"))
.Add(Expression.Sql(string.Format("'{0}' = '{0}'", userName)));
return criteria.List();
This line:
.Add(Expression.Sql(string.Format("'{0}' = '{0}'", userName)))
results in a where clause which always evaluates to true, but "changes" the query from NHibernate's viewpoint, so it is cached separately per userName.
I know, it's kind of ugly and I'm not really pleased with it.
Does anybody knows any alternative approach?
Thanks in advance.

Fetching multiple nested associations eagerly using NHibernate (and QueryOver)

I have a database which has multiple nested associations. Basically, the structure is as follows:
Order -> OrderItem -> OrderItemPlaylist -> OrderPlaylistItem -> Track -> Artist
I need to generate a report based on all orders sold on a certain date, which needs to traverse ALL the mentioned associations in order to gather the required information.
Trying to join all the tables together would be overkill, as it would result in an extremely large Cartesian product with much redundant data, considering it would be joining 6 tables together. Code below:
// q is assumed to be an IQueryOver<Order, Order>, e.g. session.QueryOver<Order>()
q.Left.JoinQueryOver<OrderItem>(order => order.OrderItems)
    .Left.JoinQueryOver<OrderItemPlaylist>(orderItem => orderItem.Playlist)
    .Left.JoinQueryOver<OrderItemPlaylistItem>(orderItemPlaylist => orderItemPlaylist.PlaylistItems)
    .Left.JoinQueryOver<Track>(orderItemPlaylistItem => orderItemPlaylistItem.Track)
    .Left.JoinQueryOver<Artist>(track => track.Artist);
The above works, but with even a few orders, each with a few order items and a playlist consisting of multiple tracks, the results explode into thousands of records, growing rapidly with each extra order.
Any idea what would be the best and most efficient approach? I've currently tried enabling batch loading, which greatly reduces the number of database queries, but it still seems to me more like an easy workaround than a good approach.
There is no need for all the data to be loaded in just one SQL query, given the huge amount of data. One SQL query per association would be perfect, I guess. Ideally it would be something where first you get all the orders, then all the order items for those orders and load them into the associated collections, then the playlists for each order item, and so on and so forth.
Also, this doesn't have to be specifically in QueryOver, as I can access the .RootCriteria and use the Criteria API.
Any help would be greatly appreciated !
I believe this is what you are looking for
http://ayende.com/blog/4367/eagerly-loading-entity-associations-efficiently-with-nhibernate
If you prefer one SQL query, what SQL syntax would you expect this to produce? I guess you can't avoid a long sequence of JOINs if you're going for one SQL query.
I guess what I would do is get the entities level by level, using several queries.
You should probably start off by defining the query as best you can in SQL, and looking at the execution plans to find the very best method (and whether your indexes are sufficient).
At that point you know what you're shooting for, and then it's reasonably easy to try and code the query in HQL or QueryOver or even LINQ and check the results using the SQL writer in NHibernate, or the excellent NHProfiler http://www.nhprof.com.
You are probably right about ending up with several queries. Speed them up by batching as many as you can (that do not depend on each other) into single trips by using the "Future" command in Criteria or QueryOver. You can read more about that here: http://ayende.com/blog/3979/nhibernate-futures
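For what it's worth, a rough sketch of that Future approach against the entities named in the question (the date filter and exact property names are assumptions, so treat this as an outline rather than a drop-in answer):
// All of the futures below go to the database in a single round trip when the
// first result is enumerated. Each query eagerly initializes one level of the
// graph, and the session's identity map merges them onto the same instances.
var orders = session.QueryOver<Order>()
    .Where(o => o.OrderDate >= fromDate && o.OrderDate < toDate)   // assumed filter
    .Future();

session.QueryOver<Order>()
    .Where(o => o.OrderDate >= fromDate && o.OrderDate < toDate)
    .Fetch(o => o.OrderItems).Eager
    .Future();

// In practice you would also restrict this query to the orders selected above.
session.QueryOver<OrderItem>()
    .Fetch(oi => oi.Playlist).Eager
    .Future();

foreach (var order in orders)   // enumerating triggers the single batched round trip
{
    // build the report from order.OrderItems, their playlists, and so on
}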

EntitySet Querying

I'm trying to run a query similar to
var results = MyItem.MyEntitySet.Where(x => x.PropertyB == 0);
MyEntitySet has one association, PropertyA, with MyItem.
Ideally, the underlying SQL query should be
SELECT .. FROM .. WHERE ([t0].[PropertyA] = @p0) AND ([t0].[PropertyB] = @p1)
since PropertyA and PropertyB together make up the composite primary key of the table I'm querying.
But my traces seem to indicate that the program queries with PropertyA first to return MyEntitySet, then queries with PropertyB to return var results.
Is there any way I can force LINQ to apply these two conditions in a single SQL statement?
Maybe, maybe not. The generated SQL matches the way you're writing the LINQ query, so it isn't a surprise. If you started with the entity represented by "MyEntitySet", then maybe the generated SQL would change.
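For example (a sketch with hypothetical names, assuming LINQ to SQL and a DataContext that exposes the child table directly), filtering the table itself by both keys gives the provider a single expression to translate:
// Hypothetical names: ChildItems is the table behind MyEntitySet, and
// PropertyA is the foreign key back to MyItem. Both conditions live in one
// expression, so they can be emitted as a single WHERE clause.
var results = dataContext.ChildItems
    .Where(x => x.PropertyA == myItem.Id && x.PropertyB == 0)
    .ToList();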
It's not immediately clear whether you're using LINQ to SQL or Entity Framework. LINQ to SQL does represent one-to-many relationships as an "entity set", while Entity Framework treats relationships as first-class objects, so that a one-to-many relationship is a set of relationship objects with related entities, rather than simply an entity set. It does affect the generated SQL.
Two other thoughts...
If you want that much control over the generated SQL, you probably won't be happy with LINQ. It doesn't always generate optimal SQL (although it can sometimes surprise you). On the other hand, one of the major benefits of LINQ is that you start writing code that expresses the real relationships in your data. The downfall of classic ADO.NET is that you write code about manipulating SQL and processing DataSet and DataTable collections. LINQ is infinitely cleaner, safer, more robust, and more maintainable code to write. Everything is a trade-off.
Second, the query generation is likely to get better over time (especially in Entity Framework).

Why is ORM considered good but "select *" considered bad?

Doesn't an ORM usually involve doing something like a select *?
If I have a table, MyThing, with columns A, B, C, D, etc., then there typically would be an object, MyThing, with properties A, B, C, D.
It would be evil if that object were incompletely instantiated by a select statement that looked like this, fetching only A and B, not C and D:
select A, B from MyThing /* don't get C and D, because we don't need them */
but it would also be evil to always do this:
select A, B, C, D /* get all the columns so that we can completely instantiate the MyThing object */
Does ORM make an assumption that database access is so fast now you don't have to worry about it and so you can always fetch all the columns?
Or, do you have different MyThing objects, one for each combo of columns that might happen to be in a select statement?
EDIT: Before you answer the question, please read Nicholas Piasecki's and Bill Karwin's answers. I guess I asked my question poorly because many misunderstood it, but Nicholas understood it 100%. Like him, I'm interested in other answers.
EDIT #2: Links that relate to this question:
Why do we need entity objects?
http://blogs.tedneward.com/2006/06/26/The+Vietnam+Of+Computer+Science.aspx, especially the section "The Partial-Object Problem and the Load-Time Paradox"
http://groups.google.com/group/comp.object/browse_thread/thread/853fca22ded31c00/99f41d57f195f48b?
http://www.martinfowler.com/bliki/AnemicDomainModel.html
http://database-programmer.blogspot.com/2008/06/why-i-do-not-use-orm.html
In my limited experience, things are as you describe--it's a messy situation and the usual cop-out "it depends" answer applies.
A good example would be the online store that I work for. It has a Brand object, and on the main page of the Web site, all of the brands that the store sells are listed on the left side. To display this menu of brands, all the site needs is the integer BrandId and the string BrandName. But the Brand object contains a whole boatload of other properties, most notably a Description property that can contain a substantially large amount of text about the Brand. No two ways about it, loading all of that extra information about the brand just to spit out its name in an unordered list is (1) measurably and significantly slow, usually because of the large text fields and (2) pretty inefficient when it comes to memory usage, building up large strings and not even looking at them before throwing them away.
One option provided by many ORMs is to lazy load a property. So we could have a Brand object returned to us, but that time-consuming and memory-wasting Description field is not loaded until we try to invoke its get accessor. At that point, the proxy object will intercept our call and suck down the description from the database just in time. This is sometimes good enough but has burned me enough times that I personally don't recommend it:
It's easy to forget that the property is lazy-loaded, introducing a SELECT N+1 problem just by writing a foreach loop (see the sketch after these two points). Who knows what happens when LINQ gets involved.
What if the just-in-time database call fails because the transport got flummoxed or the network went out? I can almost guarantee that any code that is doing something as innocuous as string desc = brand.Description was not expecting that simple call to toss a DataAccessException. Now you've just crashed in a nasty and unexpected way. (Yes, I've watched my app go down hard because of just that. Learned the hard way!)
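To make that first gotcha concrete, here is a minimal sketch using the Brand and Description names from above (the mapping is assumed to declare Description as a lazy property):
// One SELECT fetches the brands, then each access to the lazily-mapped
// Description property triggers another SELECT: 1 + N queries in total.
var brands = session.CreateCriteria(typeof(Brand)).List<Brand>();
foreach (Brand brand in brands)
{
    Console.WriteLine(brand.Description.Length);   // just-in-time database call
}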
So what I've ended up doing is that in scenarios that require performance or are prone to database deadlocks, I create a separate interface that the Web site or any other program can call to get access to specific chunks of data that have had their query plans carefully examined. The architecture ends up looking kind of like this (forgive the ASCII art):
Web Site:    Controller Classes
                     |
                     |---------------------------------+
                     |                                 |
App Server:  IDocumentService                  IOrderService, IInventoryService, etc
             (Arrays, DataSets)                (Regular OO objects, like Brand)
                     |                                 |
                     |                                 |
                     |                                 |
Data Layer:  (Raw ADO.NET returning arrays,    ("Full cream" ORM like NHibernate)
              DataSets, simple classes)
I used to think that this was cheating, subverting the OO object model. But in a practical sense, as long as you do this shortcut for displaying data, I think it's all right. The updates/inserts and what have you still go through the fully-hydrated, ORM-filled domain model, and that's something that happens far less frequently (in most of my cases) than displaying particular subsets of the data. ORMs like NHibernate will let you do projections, but by that point I just don't see the point of the ORM. This will probably be a stored procedure anyway, writing the ADO.NET takes two seconds.
This is just my two cents. I look forward to reading some of the other responses.
People use ORMs for greater development productivity, not for runtime performance optimization. It depends on the project whether it's more important to maximize development efficiency or runtime efficiency.
In practice, one could use the ORM for greatest productivity, and then profile the application to identify bottlenecks once you're finished. Replace ORM code with custom SQL queries only where you get the greatest bang for the buck.
SELECT * isn't bad if you typically need all the columns in a table. We can't generalize that the wildcard is always good or always bad.
edit: Re: doofledorfer's comment... Personally, I always name the columns in a query explicitly; I never use the wildcard in production code (though I use it when doing ad hoc queries). The original question is about ORMs -- in fact it's not uncommon that ORM frameworks issue a SELECT * uniformly, to populate all the fields in the corresponding object model.
Executing a SELECT * query may not necessarily indicate that you need all those columns, and it doesn't necessarily mean that you are neglectful about your code. It could be that the ORM framework is generating SQL queries to make sure all the fields are available in case you need them.
LINQ to SQL, or any implementation of IQueryable, uses a syntax which ultimately puts you in control of the selected data. The definition of a query is also the definition of its result set.
This neatly avoids the select * issue by removing data shape responsibilities from the ORM.
For example, to select all columns:
from c in data.Customers
select c
To select a subset:
from c in data.Customers
select new
{
c.FirstName,
c.LastName,
c.Email
}
To select a combination:
from c in data.Customers
join o in data.Orders on c.CustomerId equals o.CustomerId
select new
{
Name = c.FirstName + " " + c.LastName,
Email = c.Email,
Date = o.DateSubmitted
}
There are two separate issues to consider.
To begin, it is quite common when using an ORM for the table and the object to have quite different "shapes"; this is one reason why many ORM tools support quite complex mappings.
A good example is when a table is partially denormalised, with columns containing redundant information (often, this is done to improve query or reporting performance). When this occurs, it is more efficient for the ORM to request just the columns it requires, than to have all the extra columns brought back and ignored.
The question of why "Select *" is evil is separate, and the answer falls into two halves.
When executing "select *" the database server has no obligation to return the columns in any particular order, and in fact could reasonably return the columns in a different order every time, though almost no databases do this.
Problem is, when a typical developer observes that the columns returned seem to be in a consistent order, the assumption is made that the columns will always be in that order, and then you have code making unwarranted assumptions, just waiting to fail. Worse, that failure may not be fatal, but may simply involve, say, using Year of Birth in place of Account Balance.
The other issue with "Select *" revolves around table ownership - in many large companies, the DBA controls the schema, and makes changes as required by major systems. If your tool is executing "select *" then you only get the current columns - if the DBA has removed a redundant column that you need, you get no error, and your code may blunder ahead causing all sorts of damage. By explicitly requesting the fields you require, you ensure that your system will break rather than process the wrong information.
I am not sure why you would want a partially hydrated object. Given a Customer class with properties Name, Address, and Id, I would want them all in order to create a fully populated Customer object.
The Orders list hanging off of Customer can be lazily loaded on access through most ORMs. And NHibernate, anyway, allows you to do projections into other objects. So if you had, say, a simple customer list where you displayed the ID and Name, you could create an object of type CustomerListDisplay, project your HQL query into that object set, and obtain only the columns you need from the database.
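For example, here is a sketch of that kind of projection using the Criteria API (CustomerListDisplay and the property names are illustrative; the DTO just needs a parameterless constructor and matching settable properties):
// Select only the two columns and materialize them straight into the
// read-only display object instead of hydrating full Customer entities.
IList<CustomerListDisplay> digest = session.CreateCriteria(typeof(Customer))
    .SetProjection(Projections.ProjectionList()
        .Add(Projections.Property("Id"), "Id")
        .Add(Projections.Property("Name"), "Name"))
    .SetResultTransformer(new AliasToBeanResultTransformer(typeof(CustomerListDisplay)))
    .List<CustomerListDisplay>();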
Friends don't let friends prematurely optimize. Fully hydrate your object and lazy load its associations. Then profile your application looking for problems and optimize the problem areas.
Even ORMs need to avoid SELECT * to be effective, by using lazy loading etc.
And yes, SELECT * is generally a bad idea if you aren't consuming all the data.
So, do you have different kinds of MyThing objects, one for each column combo? – Corey Trager (Nov 15 at 0:37)
No, I have read-only digest objects (which only contain important information) for things like lookups and massive collections and convert these to fully hydrated objects on demand. – Cade Roux (Nov 15 at 1:22)
The case you describe is a great example of how ORM is not a panacea. Databases offer flexible, needs-based access to their data primarily through SQL. As a developer, I can easily and simply get all the data (SELECT *) or some of the data (SELECT COL1, COL2) as needed. My mechanism for doing this will be easily understood by any other developer taking over the project.
In order to get the same flexibility from ORM, a lot more work has to be done (either by you or the ORM developers) just to get you back to the place under the hood where you're either getting all or some of the columns from the database as needed (see the excellent answers above to get a sense of some of the problems). And all this extra stuff is just more stuff that can fail, making an ORM system intrinsically less reliable than straight SQL calls.
This is not to say that you shouldn't use ORM (my standard disclaimer is that all design choices have costs and benefits, and the choice of one or the other just depends) - knock yourself out if it works for you. I will say that I truly don't understand the popularity of ORM, given the amount of extra un-fun work it seems to create for its users. I'll stick with using SELECT * when (wait for it) I need to get every column from a table.
ORMs in general do not rely on SELECT *, but rely on better methods to find columns like defined data map files (Hibernate, variants of Hibernate, and Apache iBATIS do this). Something a bit more automatic could be set up by querying the database schema to get a list of columns and their data types for a table. How the data gets populated is specific to the particular ORM you are using, and it should be well-documented there.
It is never a good idea to select data that you do not use at all, as it can create a needless code dependency that can be obnoxious to maintain later. For dealing with data internal to the class, things are a bit more complicated.
A short rule would be to always fetch all the data that the class stores by default. In most cases, a small amount of overhead won't make a huge difference, so your main goal is to reduce maintenance overhead. Later, when you profile the code's performance and have reason to believe that it may benefit from adjusting the behavior, that is the time to do it.
If I saw an ORM make SELECT * statements, either visibly or under its covers, then I would look elsewhere to fulfill my database integration needs.
SELECT * is not bad. Did you ask whoever considered it to be bad "why?".
SELECT * is a strong indication you don't have design control over the scope of your application and its modules. One of the major difficulties in cleaning up someone else's work is when there is stuff in there that is for no purpose, but no indication what is needed and used, and what isn't.
Every piece of data and code in your application should be there for a purpose, and the purpose should be specified, or easily detected.
We all know, and despise, programmers who don't worry too much about why things work; they just try stuff until the expected things happen and close it up for the next guy. SELECT * is a really good way to do that.
If you feel the need to encapsulate everything within an object, but need something with only a small subset of what is contained within a table, define your own class. Write straight SQL (within or outside the ORM - most allow straight SQL to circumvent limitations) and populate your object with the results.
However, I'd just use the ORM's representation of a table in most situations, unless profiling told me not to.
If you're using query caching, SELECT * can be good. If you were selecting a different assortment of columns every time you hit a table, each of those queries would be cached separately, whereas with SELECT * they can all be served from the one cached result.
I think you're confusing the purpose of ORM. ORM is meant to map a domain model or similar to a table in a database or some other data storage convention. It's not meant to make your application more computationally efficient, nor is it expected to.