How to query three related tables efficiently (JPA-QL) - optimization

Let's say I have entities A, B, C and each A has many B and C entities. I want to query a load of A entities based on some criteria, and I know I will be accessing all B and C entities for each A I return.
Something like select a from A as a join fetch a.b join fetch a.c would seem to make sense at first, but this creates a huge product if the numbers of B and C entities are large. Extending this to further associated entities makes the query totally unreasonable.
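To make the size of that product concrete, here is a minimal sketch in plain SQL driven from Python's sqlite3 module rather than JPA. The tables a, b, c and the row counts are made up for illustration; the point is only that fetching both collections in one join multiplies the row counts, while one query per collection adds them.

```python
import sqlite3

# Hypothetical schema: each A row owns several B rows and several C rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER REFERENCES a(id));
    CREATE TABLE c (id INTEGER PRIMARY KEY, a_id INTEGER REFERENCES a(id));
""")

NUM_A, B_PER_A, C_PER_A = 3, 4, 5  # illustrative sizes
for a_id in range(1, NUM_A + 1):
    conn.execute("INSERT INTO a (id) VALUES (?)", (a_id,))
    conn.executemany("INSERT INTO b (a_id) VALUES (?)", [(a_id,)] * B_PER_A)
    conn.executemany("INSERT INTO c (a_id) VALUES (?)", [(a_id,)] * C_PER_A)

# One statement joining both collections: every B of an A pairs with every C of that A.
joined_rows = conn.execute("""
    SELECT a.id, b.id, c.id
    FROM a
    JOIN b ON b.a_id = a.id
    JOIN c ON c.a_id = a.id
""").fetchall()

# One statement per collection: the row counts add instead of multiplying.
b_rows = conn.execute("SELECT a.id, b.id FROM a JOIN b ON b.a_id = a.id").fetchall()
c_rows = conn.execute("SELECT a.id, c.id FROM a JOIN c ON c.a_id = a.id").fetchall()

print(len(joined_rows))           # NUM_A * B_PER_A * C_PER_A
print(len(b_rows) + len(c_rows))  # NUM_A * (B_PER_A + C_PER_A)
```

With these sizes the single join returns 60 rows against 27 for the two separate queries, and the gap widens with every extra collection joined in.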
If I leave JPA to its own devices, I end up with n+1 selects when it wants to access the B and C entities.
What I thought I'd do was query A join fetch B, then A join fetch C, but this doesn't work as it gives me two List<A> results each with only half the information.
This is a pretty simple query in SQL terms, and I'm disappointed there isn't an obvious way to handle this. Am I missing something?
The provider is TopLink Essentials.

JPA should at least mention objects. The fact that you don't suggests to me that you're not going to be leveraging JPA to its fullest extent.
If you've got a legacy schema, and an object model doesn't make sense, perhaps you shouldn't be using JPA.
JPA isn't intended to be a substitute for SQL. It addresses that object-relational mismatch. If you don't have objects, just drop down to JDBC and SQL.
I don't know what your tables represent, but if you're thinking about objects you should be talking about 1:m and m:n relationships. Once you have those you can use caching, lazy and eager fetching to optimize populating the objects.
UPDATE: Write the query so each product has its options and prices lists as 1:m relationships and do eager fetching. That will avoid the (n+1) problem.
How can you say that relationships and eager fetching don't help here?
Try expressing the relationships in objects and have JPA show you the SQL it generates and compare it to what you'd write. If it's satisfactory, go for it. If not, drop down to JDBC and see if you can do better.

I wonder why you say this is pretty simple in SQL terms. Wouldn't you also have the cartesian product?
Using the Hibernate provider for JPA, an option you mention works:
query A join fetch B, then A join fetch C
You get two lists containing the same A instances; you use only one of them and it is fine (you just need to use a LEFT join).
In Hibernate, you can also ask to fetch the missing data in a second query.
Use fetch="subselect".
See https://www.hibernate.org/315.html
UPDATED after comment of the Original Poster:
In java, you could also do this by hand.
Fetch the As with their collections of Bs, in a list called entityAs.
Fetch the As with their collections of Cs (reusing part of the query, or using ids).
Create a Map data structure from the second query's results, keyed by A, with each A's set of Cs as the value (for performance, to avoid an inner loop).
Loop on list entityAs, using the Map to set the set Cs for each instance A.
This would have a good performance also.
If you run several times into this need, you could write a parameterized method to do this for you, so you only code it once.
As commented by the Original Poster, you need to detach all A entities in entityAs before modifying them, to be sure that no update will be sent to the database...
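The by-hand steps above can be sketched roughly as follows. This is plain Python rather than JPA, and the A class, the stitch function, and the sample data are all hypothetical stand-ins: entity_as plays the role of the first query's result (As with their Bs), and the (a_id, c) pairs play the role of the second query's result.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class A:
    """Hypothetical stand-in for a detached A entity."""
    id: int
    bs: set = field(default_factory=set)
    cs: set = field(default_factory=set)


def stitch(entity_as, a_id_c_pairs):
    """Attach each C to its A via a dict lookup, avoiding an inner loop."""
    cs_by_a_id = defaultdict(set)
    for a_id, c in a_id_c_pairs:
        cs_by_a_id[a_id].add(c)
    for a in entity_as:
        # An A with no Cs gets an empty set.
        a.cs = cs_by_a_id.get(a.id, set())
    return entity_as


# Result of the first query: As with their Bs already populated.
entity_as = [A(1, bs={"b1", "b2"}), A(2, bs={"b3"}), A(3)]
# Result of the second query, flattened to (a_id, c) pairs.
pairs = [(1, "c1"), (1, "c2"), (2, "c3")]

stitch(entity_as, pairs)
for a in entity_as:
    print(a.id, sorted(a.bs), sorted(a.cs))
```

The work is one pass over the second result plus one pass over entityAs, rather than a scan of all Cs for every A.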

Related

Does JOIN in a database lead to code duplication in the application?

For example, we have a web application that uses PostgreSQL. The application has AuthorService that implements CRUD operations for Author entity. AuthorService uses "authors" table in database.
Now we need to implement BookService, which should fetch data from "books" table. BookService must join the Author entity.
If we use SQL JOIN in the BookService, then we need to repeat some logic (code) from the AuthorService in the BookService, since the AuthorService contains the access control logic for the Author entity and logic for generating the URLs of the author's photos (S3 signed URL)
OR we can use the AuthorService inside the BookService to fetch the data and after we can join this data in the application instead of PostgreSQL (we can write a loop that join entities), but in this case we may have performance problems.
Which option is better?
I feel the right place to do the JOIN is in the database, even if it might mean some extra code needed from the application side as you have said so.
Joining inside the application layer would forgo any optimizations the database optimizer could have applied had the join been done inside the db. The optimizer chooses how to return records on the basis of statistics on the tables, columns, and histogram values, along with a whole lot of other optimizations.
Take looping logic, for example. Suppose we have a small table called dept and a large table called emp, and we join the two in the db. It is most likely going to use a nested loop, which might be more efficient since the large table needs to be traversed just once to get all matching records. And if the dept table is wide (many columns), the optimizer can choose to use an index and get the same output in an efficient manner.
In case both of the tables are large, the optimizer may choose a hash join or a sort-merge join.
Consider the alternative: if you were to join in your application, you would be using just the looping logic all the time (mostly a nested loop), or, if you were to implement a sophisticated algorithm for doing the "join", you would be duplicating all of the effort which has gone into making the database.
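As a rough illustration of what "joining in the application" means, here is a sketch of the two strategies hand-written in Python. The dept and emp rows are invented sample data, and the functions are simplified stand-ins, not what any database actually runs. A database picks between strategies like these per query from its statistics; application code is stuck with whichever one you wrote.

```python
def nested_loop_join(depts, emps):
    """Compare every emp with every dept: len(emps) * len(depts) comparisons."""
    return [
        (e["name"], d["name"])
        for e in emps
        for d in depts
        if e["dept_id"] == d["id"]
    ]


def hash_join(depts, emps):
    """Build a lookup on the smaller input once, then probe it once per emp."""
    dept_by_id = {d["id"]: d for d in depts}
    return [
        (e["name"], dept_by_id[e["dept_id"]]["name"])
        for e in emps
        if e["dept_id"] in dept_by_id
    ]


# Invented sample rows; "Dee" points at a dept that does not exist.
depts = [{"id": 1, "name": "Sales"}, {"id": 2, "name": "Ops"}]
emps = [
    {"name": "Ann", "dept_id": 1},
    {"name": "Bob", "dept_id": 2},
    {"name": "Cy", "dept_id": 1},
    {"name": "Dee", "dept_id": 9},
]

print(nested_loop_join(depts, emps))
print(hash_join(depts, emps))
```

Both return the same inner-join result; they differ only in how much work they do as the inputs grow.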
So best option in my humble opinion - Use db for any SET related operations (JOIN,FILTER,AGGREGATION)

fetching multiple nested associations eagerly using nhibernate (and queryover)

I have a database which has multiple nested associates. Basically, the structure is as follows:
Order -> OrderItem -> OrderItemPlaylist -> OrderPlaylistItem -> Track -> Artist
I need to generate a report based on all orders sold in a certain date, which needs to traverse into ALL the mentioned associations in order to generate the required information.
Trying to join all tables together would be overkill, as joining 6 tables would result in an extremely large cartesian product with a lot of redundant data. Code below:
q.Left.JoinQueryOver<OrderItem>(order => order.OrderItems)
.Left.JoinQueryOver<OrderItemPlaylist>(orderItem => orderItem.Playlist)
.Left.JoinQueryOver<OrderItemPlaylistItem>(orderItemPlaylist => orderItemPlaylist.PlaylistItems)
.Left.JoinQueryOver<Track>(orderItemPlaylistItem => orderItemPlaylistItem.Track)
.Left.JoinQueryOver<Artist>(track => track.Artist)
The above works, but with even a few orders, each with a few order items, and a playlist each consisting of multiple tracks, the results would explode to thousands of records, growing exponentially with each extra order.
Any idea what would be the best and most efficient approach? I've currently tried enabling batch-loading, which greatly scales down the number of database queries but still does not seem to me like a good approach, but more like an 'easy-workaround'.
There is no need for all the data to be loaded in just one SQL query, given the huge amount of data. One SQL query for each association would be perfect I guess. Ideally it would be something where first you get all orders, then you get all the order items for the order and load them in the associated collections, then the playlists for each order item, so on and so forth.
Also, this doesn't have to be specifically in QueryOver, as I can access the .RootCriteria and use the Criteria API.
Any help would be greatly appreciated !
I believe this is what you are looking for
http://ayende.com/blog/4367/eagerly-loading-entity-associations-efficiently-with-nhibernate
If you prefer one SQL query, what SQL syntax would you expect this to produce? I guess you can't avoid a long sequence of JOINs if you're going for one SQL query.
I guess what I would do is get the entities level by level, using several queries.
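A minimal sketch of that level-by-level idea, written against sqlite3 in Python rather than NHibernate. The three tables are a cut-down, hypothetical version of the Order -> OrderItem -> playlist item chain, and load_children is an illustrative helper, not an NHibernate API. Each level costs one query with an IN list built from the ids of the level above.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE order_item (id INTEGER PRIMARY KEY, order_id INTEGER);
    CREATE TABLE playlist_item (id INTEGER PRIMARY KEY, order_item_id INTEGER);
    INSERT INTO orders (id) VALUES (1), (2);
    INSERT INTO order_item (id, order_id) VALUES (10, 1), (11, 1), (12, 2);
    INSERT INTO playlist_item (id, order_item_id) VALUES (100, 10), (101, 10), (102, 12);
""")


def load_children(table, fk_column, parent_ids):
    """One query per level: fetch all children of the given parents, grouped by parent id.

    table and fk_column are trusted constants here; only the ids are bound as parameters.
    """
    grouped = defaultdict(list)
    if not parent_ids:
        return grouped
    placeholders = ",".join("?" * len(parent_ids))
    sql = (
        f"SELECT id, {fk_column} FROM {table} "
        f"WHERE {fk_column} IN ({placeholders}) ORDER BY id"
    )
    for child_id, parent_id in conn.execute(sql, list(parent_ids)):
        grouped[parent_id].append(child_id)
    return grouped


# Level 1: the orders themselves.
order_ids = [row[0] for row in conn.execute("SELECT id FROM orders ORDER BY id")]
# Level 2: all items of those orders in a single query.
items_by_order = load_children("order_item", "order_id", order_ids)
# Level 3: all playlist items of those order items in a single query.
item_ids = [item_id for ids in items_by_order.values() for item_id in ids]
playlist_by_item = load_children("playlist_item", "order_item_id", item_ids)

print(dict(items_by_order))
print(dict(playlist_by_item))
```

The number of queries equals the number of levels, regardless of how many orders there are. For very large id lists the IN clause would need chunking, since databases cap the number of bound parameters.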
You should probably start off by defining the query as best you can in SQL, and looking at the execution plans to find the very best method (and whether your indexes are sufficient).
At that point you know what you're shooting for, and then it's reasonably easy to try and code the query in HQL or QueryOver or even LINQ and check the results using the SQL writer in NHibernate, or the excellent NHProfiler http://www.nhprof.com.
You are probably right about ending up with several queries. Speed them up by batching as many as you can (that do not depend on each other) into single trips by using the "Future" command in Criteria or QueryOver. You can read more about that here: http://ayende.com/blog/3979/nhibernate-futures

Hibernate Performance Tweaks

In your experience what are some good Hibernate performance tweaks? I mean this in terms of Inserts/Updates and Querying.
Some Hibernate-specific performance tuning tips:
Avoid join duplicates caused by parallel to-many association fetch-joins (hence avoid duplicate object instantiations)
Use lazy loading with fetch="subselect" (prevents N+1 select problem)
On huge read-only resultsets, don't fetch into mapped objects, but into flat DTOs (with Projections and AliasToBean-ResultTransformer)
Apply HQL Bulk Update, Bulk Delete and Insert-By-Select
Use FlushMode.Never where appropriate
Taken from http://arnosoftwaredev.blogspot.com/2011/01/hibernate-performance-tips.html
I'm not sure this is a tweak, but join fetch can be useful if you have a many-to-one that you know you're going to need. For example, if a Person can be a member of a single Department and you know you're going to need both in one particular place you can use something like from Person p left join fetch p.department and Hibernate will do a single query instead of one query for Person followed by n queries for Department.
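To show the difference in round trips, here is a small sketch using sqlite3 from Python instead of Hibernate. The person and department tables follow the example above, the data is invented, and run is a hypothetical helper that counts each statement as one trip. The first half mimics lazy loading (one query for the persons, then one per person); the second half is the single-join equivalent of the join fetch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, department_id INTEGER);
    INSERT INTO department VALUES (1, 'Sales'), (2, 'Ops');
    INSERT INTO person VALUES (1, 'Ann', 1), (2, 'Bob', 2), (3, 'Cy', 1);
""")

counter = {"queries": 0}


def run(sql, params=()):
    """Execute one statement and count it as one database round trip."""
    counter["queries"] += 1
    return conn.execute(sql, params).fetchall()


# Lazy style: one query for the persons, then one more per person for its department.
people = run("SELECT name, department_id FROM person ORDER BY id")
lazy = [
    (name, run("SELECT name FROM department WHERE id = ?", (dept_id,))[0][0])
    for name, dept_id in people
]
lazy_queries = counter["queries"]

# Join style: persons and departments come back together in a single query.
counter["queries"] = 0
joined = run("""
    SELECT p.name, d.name
    FROM person p
    LEFT JOIN department d ON d.id = p.department_id
    ORDER BY p.id
""")
join_queries = counter["queries"]

print(lazy_queries, join_queries)
```

Both approaches produce the same (person, department) pairs; the lazy version just needs one extra query per person to get there.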
When doing a lot of inserts/updates, call flush periodically instead of after each save or at the end - Hibernate will batch those statements and send them to the database together which will reduce network overhead.
Finally, be careful with the second level cache. If you know the majority of the objects you read by id will be in the cache, it can make things really fast, but if you count on them being there and don't have it configured well, you'll end up doing a lot of single row database queries when you could have brought back a large result set with only one network/database trip.
Using caching, cascades and lazy loading appropriately.
Tweaks? Hibernate generates SQL for you, based on the mappings you give. If you don't like the SQL, then maybe Hibernate isn't the correct tool.
The rest of performance has to do with the database design: normalization, indexes, etc.

Has anyone written a higher level query language (than sql) that generates sql for common tasks, on limited schemas

Sql is the standard in query languages, however it is sometimes a bit verbose. I am currently writing a limited query language that will make my common queries quicker to write and with a bit less mental overhead.
If you write a query over a good database schema, essentially you will be always joining over the primary key, foreign key fields so I think it should be unnecessary to have to state them each time.
So a query could look like.
select s.name, region.description from shop s
where monthly_sales.amount > 4000 and s.staff < 10
The relations would be
shop -- many to one -- region,
shop -- one to many -- monthly_sales
The sql that would be equivalent to would be
select distinct s.name, r.description
from shop s
join region r on s.region_id = r.region_id
join monthly_sales ms on ms.shop_id = s.shop_id
where ms.amount > 4000 and s.staff < 10
(the distinct is there as you are joining to a one to many table (monthly_sales) and you are not selecting off fields from that table)
I understand that the original query above may be ambiguous for certain schemas, i.e. if there are two relationship routes between two of the tables. However there are ways around (most of) these, especially if you limit the schema allowed. Most possible schemas are not worth considering anyway.
I was just wondering if there have been any attempts to do something like this?
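For what it's worth, the core of the idea (filling in join conditions from declared relationships) can be sketched in a few lines of Python. Everything here is hypothetical: the FOREIGN_KEYS catalogue, the join_clause and expand functions, and the simplification that the caller lists the referenced tables instead of the tool parsing them out of the query. It also always emits distinct, which a real tool would only do when a one-to-many table is joined but not selected from.

```python
# Hypothetical relation catalogue: (child table, parent table) -> (child column, parent column).
FOREIGN_KEYS = {
    ("shop", "region"): ("region_id", "region_id"),
    ("monthly_sales", "shop"): ("shop_id", "shop_id"),
}


def join_clause(base, other):
    """Return the JOIN clause linking `other` to `base`, in whichever direction the key is declared."""
    if (base, other) in FOREIGN_KEYS:
        base_col, other_col = FOREIGN_KEYS[(base, other)]
    elif (other, base) in FOREIGN_KEYS:
        other_col, base_col = FOREIGN_KEYS[(other, base)]
    else:
        raise ValueError(f"no declared relationship between {base} and {other}")
    return f"join {other} on {other}.{other_col} = {base}.{base_col}"


def expand(select_list, base, referenced_tables, where):
    """Assemble full SQL from the short form by adding one join per referenced table."""
    joins = [join_clause(base, table) for table in referenced_tables]
    return "\n".join(
        [f"select distinct {select_list}", f"from {base}", *joins, f"where {where}"]
    )


sql = expand(
    "shop.name, region.description",
    "shop",
    ["region", "monthly_sales"],
    "monthly_sales.amount > 4000 and shop.staff < 10",
)
print(sql)
```

The ambiguity mentioned above shows up here as the catalogue allowing only one key per pair of tables; two routes between the same tables would need some way to name which one is meant.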
(I have seen most orm solutions to making some queries easier)
EDIT: I actually really like sql. I have used orm solutions and looked at linq. The best I have seen so far is SQLAlchemy (for python). However, as far as I have seen they do not offer what I am after.
Hibernate and LinqToSQL do exactly what you want
I think you'd be better off spending your time just writing more SQL and becoming more comfortable with it. Most developers I know have gone through just this progression, where their initial exposure to SQL inspires them to bypass it entirely by writing their own ORM or set of helper classes that auto-generates the SQL for them. Usually they continue adding to it and refining it until it's just as complex (if not more so) than SQL. The results are sometimes fairly comical - I inherited one application that had classes named "And.cs" and "Or.cs", whose main functions were to add the words " AND " and " OR ", respectively, to a string.
SQL is designed to handle a wide variety of complexity. If your application's data design is simple, then the SQL to manipulate that data will be simple as well. It doesn't make much sense to use a different sort of query language for simple things, and then use SQL for the complex things, when SQL can handle both kinds of thing well.
I believe that any (decent) ORM would be of help here..
Entity SQL is slightly higher level (in places) than Transact SQL. Other than that, HQL, etc. For object-model approaches, LINQ (IQueryable<T>) is much higher level, allowing simple navigation:
var qry = from cust in db.Customers
select cust.Orders.Sum(o => o.OrderValue);
etc
Martin Fowler plumbed a whole load of energy into this and produced the Active Record pattern. I think this is what you're looking for?
Not sure if this falls in what you are looking for but I've been generating SQL dynamically from the definition of the Data Access Objects; the idea is to reflect on the class and by default assume that its name is the table name and all properties are columns. I also have search criteria objects to build the where part. The DAOs may contain lists of other DAO classes and that directs the joins.
Since you asked for something to take care of most of the repetitive SQL, this approach does it. And when it doesn't, I just fall back on handwritten SQL or stored procedures.
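A minimal sketch of that reflection approach, in Python for brevity. The Shop class and the select_for function are invented for illustration and are not the answerer's actual code; the conventions (table name equals lower-cased class name, columns equal fields, criteria are equality tests) are assumptions.

```python
from dataclasses import dataclass, fields


@dataclass
class Shop:
    """Hypothetical data access object; its fields are assumed to mirror the table's columns."""
    shop_id: int
    name: str
    staff: int


def select_for(cls, **criteria):
    """Build a parameterized SELECT, assuming table name == class name and columns == fields."""
    columns = [f.name for f in fields(cls)]
    unknown = set(criteria) - set(columns)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    sql = f"SELECT {', '.join(columns)} FROM {cls.__name__.lower()}"
    if criteria:
        sql += " WHERE " + " AND ".join(f"{column} = ?" for column in criteria)
    return sql, list(criteria.values())


print(select_for(Shop))
print(select_for(Shop, staff=5))
```

Values are returned separately as bind parameters rather than spliced into the string, so the generated SQL stays safe to execute as-is.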

Why is ORM considered good but "select *" considered bad?

Doesn't an ORM usually involve doing something like a select *?
If I have a table, MyThing, with column A, B, C, D, etc, then there typically would be an object, MyThing with properties A, B, C, D.
It would be evil if that object were incompletely instantiated by a select statement that looked like this, only fetching the A, B, not the C, D:
select A, B from MyThing /* don't get C and D, because we don't need them */
but it would also be evil to always do this:
select A, B, C, D /* get all the columns so that we can completely instantiate the MyThing object */
Does ORM make an assumption that database access is so fast now you don't have to worry about it and so you can always fetch all the columns?
Or, do you have different MyThing objects, one for each combo of columns that might happen to be in a select statement?
EDIT: Before you answer the question, please read Nicholas Piasecki's and Bill Karwin's answers. I guess I asked my question poorly because many misunderstood it, but Nicholas understood it 100%. Like him, I'm interested in other answers.
EDIT #2: Links that relate to this question:
Why do we need entity objects?
http://blogs.tedneward.com/2006/06/26/The+Vietnam+Of+Computer+Science.aspx, especially the section "The Partial-Object Problem and the Load-Time Paradox"
http://groups.google.com/group/comp.object/browse_thread/thread/853fca22ded31c00/99f41d57f195f48b?
http://www.martinfowler.com/bliki/AnemicDomainModel.html
http://database-programmer.blogspot.com/2008/06/why-i-do-not-use-orm.html
In my limited experience, things are as you describe--it's a messy situation and the usual cop-out "it depends" answer applies.
A good example would be the online store that I work for. It has a Brand object, and on the main page of the Web site, all of the brands that the store sells are listed on the left side. To display this menu of brands, all the site needs is the integer BrandId and the string BrandName. But the Brand object contains a whole boatload of other properties, most notably a Description property that can contain a substantially large amount of text about the Brand. No two ways about it, loading all of that extra information about the brand just to spit out its name in an unordered list is (1) measurably and significantly slow, usually because of the large text fields and (2) pretty inefficient when it comes to memory usage, building up large strings and not even looking at them before throwing them away.
One option provided by many ORMs is to lazy load a property. So we could have a Brand object returned to us, but that time-consuming and memory-wasting Description field is not loaded until we try to invoke its get accessor. At that point, the proxy object will intercept our call and suck down the description from the database just in time. This is sometimes good enough but has burned me enough times that I personally don't recommend it:
It's easy to forget that the property is lazy-loaded, introducing a SELECT N+1 problem just by writing a foreach loop. Who knows what happens when LINQ gets involved.
What if the just-in-time database call fails because the transport got flummoxed or the network went out? I can almost guarantee that any code that is doing something as innocuous as string desc = brand.Description was not expecting that simple call to toss a DataAccessException. Now you've just crashed in a nasty and unexpected way. (Yes, I've watched my app go down hard because of just that. Learned the hard way!)
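Both pitfalls come from the same thing: a property read that is secretly a database call. A rough sketch in Python, where BrandStore is a hypothetical in-memory stand-in for the database and Brand.description imitates what an ORM proxy does:

```python
class BrandStore:
    """Stand-in for the database; counts how often a description is fetched."""

    def __init__(self, descriptions):
        self._descriptions = descriptions
        self.description_loads = 0

    def load_description(self, brand_id):
        self.description_loads += 1
        return self._descriptions[brand_id]


class Brand:
    """Brand whose large description is fetched only on first access."""

    def __init__(self, brand_id, name, store):
        self.brand_id = brand_id
        self.name = name
        self._store = store
        self._description = None
        self._loaded = False

    @property
    def description(self):
        # Looks like a plain field read, but the first access is a round trip
        # to the store, and any failure of that trip surfaces right here.
        if not self._loaded:
            self._description = self._store.load_description(self.brand_id)
            self._loaded = True
        return self._description


store = BrandStore({1: "long text about Acme", 2: "long text about Bolt", 3: "long text about Cog"})
brands = [Brand(1, "Acme", store), Brand(2, "Bolt", store), Brand(3, "Cog", store)]

names = [b.name for b in brands]  # the menu case: no descriptions fetched
loads_after_names = store.description_loads

descriptions = [b.description for b in brands]  # innocent-looking loop: one fetch per brand
loads_after_descriptions = store.description_loads

print(loads_after_names, loads_after_descriptions)
```

Listing names costs nothing extra, but the loop over description performs one load per brand, which is the SELECT N+1 shape described above.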
So what I've ended up doing is that in scenarios that require performance or are prone to database deadlocks, I create a separate interface that the Web site or any other program can call to get access to specific chunks of data that have had their query plans carefully examined. The architecture ends up looking kind of like this (forgive the ASCII art):
Web Site:        Controller Classes
                         |
                         |---------------------------------+
                         |                                 |
App Server:      IDocumentService                  IOrderService, IInventoryService, etc
                 (Arrays, DataSets)                (Regular OO objects, like Brand)
                         |                                 |
                         |                                 |
                         |                                 |
Data Layer:      (Raw ADO.NET returning arrays,    ("Full cream" ORM like NHibernate)
                  DataSets, simple classes)
I used to think that this was cheating, subverting the OO object model. But in a practical sense, as long as you do this shortcut for displaying data, I think it's all right. The updates/inserts and what have you still go through the fully-hydrated, ORM-filled domain model, and that's something that happens far less frequently (in most of my cases) than displaying particular subsets of the data. ORMs like NHibernate will let you do projections, but by that point I just don't see the point of the ORM. This will probably be a stored procedure anyway, writing the ADO.NET takes two seconds.
This is just my two cents. I look forward to reading some of the other responses.
People use ORM's for greater development productivity, not for runtime performance optimization. It depends on the project whether it's more important to maximize development efficiency or runtime efficiency.
In practice, one could use the ORM for greatest productivity, and then profile the application to identify bottlenecks once you're finished. Replace ORM code with custom SQL queries only where you get the greatest bang for the buck.
SELECT * isn't bad if you typically need all the columns in a table. We can't generalize that the wildcard is always good or always bad.
edit: Re: doofledorfer's comment... Personally, I always name the columns in a query explicitly; I never use the wildcard in production code (though I use it when doing ad hoc queries). The original question is about ORMs -- in fact it's not uncommon that ORM frameworks issue a SELECT * uniformly, to populate all the fields in the corresponding object model.
Executing a SELECT * query may not necessarily indicate that you need all those columns, and it doesn't necessarily mean that you are neglectful about your code. It could be that the ORM framework is generating SQL queries to make sure all the fields are available in case you need them.
Linq to Sql, or any implementation of IQueryable, uses a syntax which ultimately puts you in control of the selected data. The definition of a query is also the definition of its result set.
This neatly avoids the select * issue by removing data shape responsibilities from the ORM.
For example, to select all columns:
from c in data.Customers
select c
To select a subset:
from c in data.Customers
select new
{
c.FirstName,
c.LastName,
c.Email
}
To select a combination:
from c in data.Customers
join o in data.Orders on c.CustomerId equals o.CustomerId
select new
{
Name = c.FirstName + " " + c.LastName,
Email = c.Email,
Date = o.DateSubmitted
}
There are two separate issues to consider.
To begin, it is quite common when using an ORM for the table and the object to have quite different "shapes", this is one reason why many ORM tools support quite complex mappings.
A good example is when a table is partially denormalised, with columns containing redundant information (often, this is done to improve query or reporting performance). When this occurs, it is more efficient for the ORM to request just the columns it requires, than to have all the extra columns brought back and ignored.
The question of why "Select *" is evil is separate, and the answer falls into two halves.
When executing "select *" the database server has no obligation to return the columns in any particular order, and in fact could reasonably return the columns in a different order every time, though almost no databases do this.
Problem is, when a typical developer observes that the columns returned seem to be in a consistent order, the assumption is made that the columns will always be in that order, and then you have code making unwarranted assumptions, just waiting to fail. Worse, that failure may not be fatal, but may simply involve, say, using Year of Birth in place of Account Balance.
The other issue with "Select *" revolves around table ownership - in many large companies, the DBA controls the schema, and makes changes as required by major systems. If your tool is executing "select *" then you only get the current columns - if the DBA has removed a redundant column that you need, you get no error, and your code may blunder ahead causing all sorts of damage. By explicitly requesting the fields you require, you ensure that your system will break rather than process the wrong information.
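Here is a small sketch of that failure mode using sqlite3 from Python. The account table and its columns are invented, and the scenario assumes the code was written against an older schema (id, year_of_birth, balance) and reads the wildcard result by position. After the column is dropped, the wildcard query keeps "working" with the wrong value, while the explicit column list fails immediately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Schema after a hypothetical change: the year_of_birth column the code relied on is gone.
conn.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL);
    INSERT INTO account VALUES (1, 250.0);
""")

# Code written against the old schema (id, year_of_birth, balance) reads by position.
row = conn.execute("SELECT * FROM account").fetchone()
value_read_as_year_of_birth = row[1]  # no error, but this is now the balance

# The same code with explicit column names breaks loudly instead.
try:
    conn.execute("SELECT id, year_of_birth, balance FROM account")
    explicit_query_failed = False
except sqlite3.OperationalError as error:
    explicit_query_failed = True
    print("explicit query failed:", error)

print(value_read_as_year_of_birth, explicit_query_failed)
```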
I am not sure why you would want a partially hydrated object. Given a class of Customer with properties of Name, Address, Id. I would want them all to create a fully populated Customer object.
The list hanging off of Customers called Orders can be lazily loaded when accessed through most ORMs. And NHibernate anyway allows you to do projections into other objects. So if you had, say, a simple customer list where you displayed the ID and Name, you can create an object of type CustomerListDisplay and project your HQL query into that object set and only obtain the columns you need from the database.
Friends don't let friends prematurely optimize. Fully hydrate your object, lazy load its associations. And then profile your application looking for problems and optimize the problem areas.
Even ORMs need to avoid SELECT * to be effective, by using lazy loading etc.
And yes, SELECT * is generally a bad idea if you aren't consuming all the data.
So, do you have different kinds of MyThing objects, one for each column combo? – Corey Trager (Nov 15 at 0:37)
No, I have read-only digest objects (which only contain important information) for things like lookups and massive collections and convert these to fully hydrated objects on demand. – Cade Roux (Nov 15 at 1:22)
The case you describe is a great example of how ORM is not a panacea. Databases offer flexible, needs-based access to their data primarily through SQL. As a developer, I can easily and simply get all the data (SELECT *) or some of the data (SELECT COL1, COL2) as needed. My mechanism for doing this will be easily understood by any other developer taking over the project.
In order to get the same flexibility from ORM, a lot more work has to be done (either by you or the ORM developers) just to get you back to the place under the hood where you're either getting all or some of the columns from the database as needed (see the excellent answers above to get a sense of some of the problems). And all this extra stuff is just more stuff that can fail, making an ORM system intrinsically less reliable than straight SQL calls.
This is not to say that you shouldn't use ORM (my standard disclaimer is that all design choices have costs and benefits, and the choice of one or the other just depends) - knock yourself out if it works for you. I will say that I truly don't understand the popularity of ORM, given the amount of extra un-fun work it seems to create for its users. I'll stick with using SELECT * when (wait for it) I need to get every column from a table.
ORMs in general do not rely on SELECT *, but rely on better methods to find columns like defined data map files (Hibernate, variants of Hibernate, and Apache iBATIS do this). Something a bit more automatic could be set up by querying the database schema to get a list of columns and their data types for a table. How the data gets populated is specific to the particular ORM you are using, and it should be well-documented there.
It is never a good idea to select data that you do not use at all, as it can create a needless code dependency that can be obnoxious to maintain later. For dealing with data internal to the class, things are a bit more complicated.
A short rule would be to always fetch all the data that the class stores by default. In most cases, a small amount of overhead won't make a huge difference, so your main goal is to reduce maintenance overhead. Later, when you do performance profiling of the code, and have reason to believe that it may benefit from adjusting the behavior, that is the time to do it.
If I saw an ORM make SELECT * statements, either visibly or under its covers, then I would look elsewhere to fulfill my database integration needs.
SELECT * is not bad. Did you ask whoever considered it to be bad "why?".
SELECT * is a strong indication you don't have design control over the scope of your application and its modules. One of the major difficulties in cleaning up someone else's work is when there is stuff in there that is for no purpose, but no indication what is needed and used, and what isn't.
Every piece of data and code in your application should be there for a purpose, and the purpose should be specified, or easily detected.
We all know, and despise, programmers who don't worry too much about why things work, they just like to try stuff until the expected things happen and close it up for the next guy. SELECT * is a really good way to do that.
If you feel the need to encapsulate everything within an object, but need something with a small subset of what is contained within a table - define your own class. Write straight sql (within or without the ORM - most allow straight sql to circumvent limitations) and populate your object with the results.
However, I'd just use the ORMs representation of a table in most situations unless profiling told me not to.
If you're using query caching select * can be good. If you're selecting a different assortment of columns every time you hit a table, it could just be getting the cached select * for all of those queries.
I think you're confusing the purpose of ORM. ORM is meant to map a domain model or similar to a table in a database or some data storage convention. It's not meant to make your application more computationally efficient or even expected to.