Fetching multiple nested associations eagerly using NHibernate (and QueryOver)

I have a database which has multiple nested associations. Basically, the structure is as follows:
Order -> OrderItem -> OrderItemPlaylist -> OrderItemPlaylistItem -> Track -> Artist
I need to generate a report based on all orders sold on a certain date, which needs to traverse ALL of the mentioned associations in order to generate the required information.
Trying to join all the tables together would be overkill, as it would result in an extremely large cartesian product with much redundant data, considering it would be joining 6 tables together. Code below:
q.Left.JoinQueryOver<OrderItem>(order => order.OrderItems)
.Left.JoinQueryOver<OrderItemPlaylist>(orderItem => orderItem.Playlist)
.Left.JoinQueryOver<OrderItemPlaylistItem>(orderItemPlaylist => orderItemPlaylist.PlaylistItems)
.Left.JoinQueryOver<Track>(orderItemPlaylistItem => orderItemPlaylistItem.Track)
.Left.JoinQueryOver<Artist>(track => track.Artist)
The above works, but with even a few orders, each with a few order items, and a playlist each consisting of multiple tracks, the results explode into thousands of records, since the row count multiplies with every joined collection.
Any idea what would be the best and most efficient approach? I've currently tried enabling batch loading, which greatly scales down the number of database queries, but it still seems to me more like an easy workaround than a good approach.
There is no need for all the data to be loaded in just one SQL query, given the huge amount of data. One SQL query per association would be perfect, I guess. Ideally it would be something where first you get all the orders, then you get all the order items for those orders and load them into the associated collections, then the playlists for each order item, and so on.
Also, this doesn't have to be specifically in QueryOver, as I can access the .RootCriteria and use the Criteria API.
Any help would be greatly appreciated!

I believe this is what you are looking for:
http://ayende.com/blog/4367/eagerly-loading-entity-associations-efficiently-with-nhibernate

If you prefer one SQL query, what SQL syntax would you expect this to produce? I guess you can't avoid a long sequence of JOINs if you're going for one SQL query.
I guess what I would do is get the entities level by level, using several queries.

You should probably start off by defining the query as best you can in SQL, and looking at the execution plans to find the very best method (and whether your indexes are sufficient).
At that point you know what you're shooting for, and then it's reasonably easy to try coding the query in HQL, QueryOver, or even LINQ and check the generated SQL using NHibernate's SQL logging, or the excellent NHProfiler (http://www.nhprof.com).
You are probably right about ending up with several queries. Speed them up by batching as many as you can (those that do not depend on each other) into single round trips using the "Future" feature in Criteria or QueryOver. You can read more about that here: http://ayende.com/blog/3979/nhibernate-futures
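To make that concrete, here is a rough sketch of the Futures approach (hedged: the OrderDate property and the start/end variables are guesses, and this assumes collection mappings that let NHibernate wire each fetched collection back onto the orders already in the session):

// Each QueryOver below becomes one SELECT; all of them go to the
// database in a single round trip when the first future is enumerated,
// and there is no cartesian product between the collections.
var orders = session.QueryOver<Order>()
    .Where(o => o.OrderDate >= start && o.OrderDate < end)
    .Future();

// Same filter, but eagerly fetch the OrderItems collection.
session.QueryOver<Order>()
    .Where(o => o.OrderDate >= start && o.OrderDate < end)
    .Fetch(o => o.OrderItems).Eager
    .Future();

// One more query per association level: order items with their playlists.
session.QueryOver<OrderItem>()
    .Fetch(oi => oi.Playlist).Eager
    .JoinQueryOver(oi => oi.Order)
    .Where(o => o.OrderDate >= start && o.OrderDate < end)
    .Future();

// ...and so on down to Track and Artist.
var result = orders.ToList(); // executes the whole batch here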

Related

What is the performance of a Postgres recursive query with a large depth on millions of rows? Should I use a graph database instead?

I am trying to decide whether I should use a graph database or a relational one. I am fairly new to SQL, so I am not sure how to test this, as I don't have the millions of rows of data yet, but I figured this could become a problem down the line, so I am looking for some anecdotal experiences from people if possible.
Does anyone know the performance of a recursive query on a Postgres database with millions of rows in each table? Is the performance usually good enough to be able to query data through a web API in under 100-200 ms, with a potentially large depth of, let's say, 30? The depth is unknown; it could be 1 or all the way up to something large like 30 or even more.
My data currently maps very well to the relational model, as it is relational in nature, which is why I am leaning towards Postgres, but there is one specific recursive query that is making me lean towards a graph database. The tables (let's say 5-6) will most likely be millions of rows each in production, and the depth of the query is essentially unknown. An example of such a query would be:
Depth 1:
EntityA transacts with -> EntityB
Depth 2:
EntityB transacts with -> EntityC
EntityB transacts with -> EntityD
...
Depth n:
until the last entities (leaves) don't transact with any other entities. At each depth, an entity can transact with many entities, and the overall depth is unknown.
Should I use a graph database instead just because of this one specific query, or can Postgres handle it in reasonable time using a recursive CTE (WITH RECURSIVE)?
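For illustration, the kind of query I have in mind would look roughly like this (a sketch only, with a made-up schema transactions(from_entity, to_entity); Npgsql is used here purely as an example client):

// The depth cap keeps the query bounded and also guards against
// cycles in the transaction graph.
const string sql = @"
    WITH RECURSIVE chain AS (
        SELECT from_entity, to_entity, 1 AS depth
        FROM transactions
        WHERE from_entity = @root
      UNION ALL
        SELECT t.from_entity, t.to_entity, c.depth + 1
        FROM transactions t
        JOIN chain c ON t.from_entity = c.to_entity
        WHERE c.depth < 30
    )
    SELECT from_entity, to_entity, depth FROM chain;";

await using var conn = new NpgsqlConnection("Host=localhost;Database=graph;Username=app");
await conn.OpenAsync();
await using var cmd = new NpgsqlCommand(sql, conn);
cmd.Parameters.AddWithValue("root", 42L); // entity to start from
await using var reader = await cmd.ExecuteReaderAsync();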
Whenever you're dealing with recursive JOINs in a SQL system, that's a dead giveaway that you're actually dealing with a graph and trying to figure out how to navigate it with SQL.
Using a graph database is a better option in this case.
For a much more complete answer, see this other answer on a related question: Performance of arbitrary queries with Neo4j

Are joins the proper way to do cross table queries?

I have a few tables in third normal form and I need to do some cross-table queries to get the information I need.
I looked at joins, but it seems like they create a new table. Is this the proper way to perform such queries, or should I just do nested queries? I guess it might make sense if I have to do these queries a lot? I'm really not sure how well optimized these operations are. I'm using the Sequelize ORM and I'm not sure I see any clear solution.
It seems to me you are asking about joins vs subqueries. These are to some extent different. But let's start with a couple of points.
A join creates a new relvar, not a new table. A relvar is a variable standing in for the relation output by the join operation. It is transient (as opposed to a view which would be persistent).
Joins and subqueries are not always perfect substitutes. Sometimes you will need both.
Your query output is also a relvar.
The above being said, generally, where possible, I think joins are preferable. The major reason is that a SQL query that can be written using the structure below is far easier (as you master the language) to both understand and debug than most alternatives; in addition, subqueries in column lists tend to perform badly:
SELECT [column_list]
FROM [initial_table]
[join list]
WHERE [filters]
GROUP BY [grouping list]
HAVING [post-aggregation filters]
ORDER BY [ordering list]
LIMIT [limit and offset]
If your query fits the above structure then you can usually expect that specific kinds of problems will occur in logic in specific parts of the query. On the other hand, with subqueries, you have to check these independently.
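To make the column-list point concrete, here is the same lookup written both ways, as a sketch over hypothetical orders/customers tables (held in C# strings purely for illustration):

// The scalar subquery in the column list may be executed once per
// output row; the join version lets the planner pick one join strategy.
const string subqueryVersion = @"
    SELECT o.id,
           (SELECT c.name FROM customers c WHERE c.id = o.customer_id) AS customer
    FROM orders o;";

const string joinVersion = @"
    SELECT o.id, c.name AS customer
    FROM orders o
    JOIN customers c ON c.id = o.customer_id;";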

SQL: multiple queries vs joins (specific case)

This question seems to have been asked a lot, and the answer seems to be "it depends on the details". So I am asking for my specific case: is it better for me to have multiple queries or to use joins?
The details are as follows:
"products" table -- probably around 2000 rows, 15 or so columns
"tags" table -- probably around 10 rows, 3 columns
"types" table -- probably around 10 rows, 3 columns
I need the "tags" and "types" table to get the tag/type-id that is in the product table.
My gut says that if I join the tables I end up searching a much, much larger set, so it's better to do multiple queries, but I am not really sure...
Thoughts?
No, a join will probably outperform multiple queries here. Your tables are extremely small.
Ask yourself what extra work would be involved in doing multiple queries... I don't know what you need this data for, but I assume you would, at some point, need to correlate the results - match Tags and Types to Products, wouldn't you? If you don't do that with a join, you just have to do it elsewhere with some other mechanism.
Further, your conception of this overlooks the fact that databases are designed for join scenarios. If you perform three isolated queries, the database has no opportunity to optimize its querying behavior across the results you're looking for. If you do it in one query with a join, it does have that opportunity.
Leave the problem of producing a resultset of ~2000 * 10 * 10 records and then filtering it up to the database, in my opinion - that's what it's good at. :)
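For what it's worth, the single-query version would just be something like this (a sketch; the tag_id/type_id column names are guesses based on the question):

const string sql = @"
    SELECT p.*, t.name AS tag, ty.name AS type
    FROM products p
    JOIN tags  t  ON t.id  = p.tag_id
    JOIN types ty ON ty.id = p.type_id;";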
The amount of data is too small to demonstrate one approach over the other, but multiple separate queries will transfer more over the wire than a single query: there is per-request packet overhead, and separate result sets risk being inconsistent if the data changes between queries and they are not run in the same transaction.
JOINs specifically might not be necessary; EXISTS or IN can be used if the supporting tables don't contribute columns to the resultset. A JOIN between parent and child tables, where a parent can have more than one child, will inflate the rows searched, though not necessarily the rows returned.
Assuming that everything has indexes on the primary keys (you should be doing that), then joins will be very efficient. The only case where joins would be worse is if you had some kind of external caching of query results (as some ORMs will do for you), your products table was much bigger, and you were querying at a sufficient rate to keep the results of the two smaller queries (but not the third) in cache. In that scenario multiple queries becomes faster because you're only making one of the three queries. But the difference is going to be hard to measure.
If the database is not on localhost but accessed over a network it's better to send one request, let the database do the work and retrieve the data at once. This will give you less network delay. So joins are preferred.

How (and where) should I combine one-to-many relationships?

I have a user table, and then a number of dependent tables with a one-to-many relationship,
e.g. an email table, an address table, and a groups table (i.e. one user can have multiple email addresses and physical addresses, and can be a member of many groups).
Is it better to:
Join all these tables, and process the heap of data in code,
Use something like GROUP_CONCAT and return one row, and split apart the fields in code,
Or query each table independently?
Thanks.
It really depends on how much data you have in the related tables and on how many users you're querying at a time.
Option 1 tends to be messy to deal with in code.
Option 2 tends to be messy to deal with as well, in addition to the fact that grouping tends to be slow, especially on large datasets.
Option 3 is the easiest to deal with but generates more queries overall. If your data set is small and you're not planning to scale much beyond your current needs, it's probably the best option. It's definitely the best option if you're only trying to display one record.
There is a fourth option, however, a middle-of-the-road approach which I use in my job, where we deal with a very similar situation. Instead of getting the related records for each row one at a time, use IN() to get all of the related records for your whole result set, then loop in your code to match them to the appropriate record for display. If you cache search queries, you can cache that second query as well. It's only two queries and only one loop in the code (no parsing; use hashes to relate things by their key).
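A rough sketch of that fourth option, shown here with C# and Dapper purely as an illustration (hypothetical users/emails tables where emails carries a user_id foreign key, an already-open connection, an already-loaded users list, and an Email POCO):

// 1) You already fetched a page of users.
// 2) Fetch all of their emails in one query.
// 3) Group in code and attach each list to the right user.
var userIds = users.Select(u => u.Id).ToList();
var emails = connection.Query<Email>(
    "SELECT * FROM emails WHERE user_id IN @ids", // Dapper expands the list
    new { ids = userIds });
var byUser = emails.ToLookup(e => e.UserId);
foreach (var user in users)
    user.Emails = byUser[user.Id].ToList();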
Personally, assuming my table indexes were up to scratch, I'd go with a table join, get all the data out in one go, and then process it into a nested data structure. This way you're playing to each system's strengths.
Generally speaking, do the most efficient query for the situation you're in. So don't create a mega query that you use in all cases. Create case specific queries that return just the information you need.
In terms of processing the results, if you use GROUP_CONCAT you have to split all the resulting values during processing. If there are extra delimiter characters in your GROUP_CONCAT'd values, this can be problematic. My preferred method is to put the GROUPed BY field into a $holder variable during the output loop: compare that field to $holder each time through and change your output accordingly.
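In C# terms, the holder technique amounts to something like this (a sketch; StartNewGroup and AppendToCurrentGroup are hypothetical placeholders for your own output code):

// Assumes the rows come back ORDERed BY the grouped field (UserId here).
object holder = null;
foreach (var row in rows)
{
    if (!Equals(row.UserId, holder))
    {
        holder = row.UserId;
        StartNewGroup(row);      // hypothetical: open a new output group
    }
    AppendToCurrentGroup(row);   // hypothetical: add this row's value
}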

Has anyone written a higher-level query language (than SQL) that generates SQL for common tasks, on limited schemas?

SQL is the standard in query languages; however, it is sometimes a bit verbose. I am currently writing a limited query language that will make my common queries quicker to write, with a bit less mental overhead.
If you write a query over a good database schema, you will essentially always be joining on primary key / foreign key fields, so I think it should be unnecessary to have to state them each time.
So a query could look like:
select s.name, region.description from shop s
where monthly_sales.amount > 4000 and s.staff < 10
The relations would be
shop -- many to one -- region,
shop -- one to many -- monthly_sales
The SQL this would be equivalent to would be:
select distinct s.name, r.description
from shop s
join region r on s.region_id = r.region_id
join monthly_sales ms on ms.shop_id = s.shop_id
where ms.amount > 4000 and s.staff < 10
(the distinct is there because you are joining to a one-to-many table (monthly_sales) without selecting any fields from it)
I understand that the original query above may be ambiguous for certain schemas, i.e. if there are two relationship routes between two of the tables. However, there are ways around most of these, especially if you limit the allowed schemas. Most possible schemas are not worth considering anyway.
I was just wondering if there any attempts to do something like this?
(I have seen most orm solutions to making some queries easier)
EDIT: I actually really like SQL. I have used ORM solutions and looked at LINQ. The best I have seen so far is SQLAlchemy (for Python). However, as far as I have seen, they do not offer what I am after.
Hibernate and LINQ to SQL do exactly what you want.
I think you'd be better off spending your time just writing more SQL and becoming more comfortable with it. Most developers I know have gone through just this progression, where their initial exposure to SQL inspires them to bypass it entirely by writing their own ORM or set of helper classes that auto-generates the SQL for them. Usually they continue adding to it and refining it until it's just as complex as (if not more complex than) SQL. The results are sometimes fairly comical - I inherited one application that had classes named "And.cs" and "Or.cs", whose main functions were to add the words " AND " and " OR ", respectively, to a string.
SQL is designed to handle a wide variety of complexity. If your application's data design is simple, then the SQL to manipulate that data will be simple as well. It doesn't make much sense to use a different sort of query language for simple things, and then use SQL for the complex things, when SQL can handle both kinds of thing well.
I believe that any (decent) ORM would be of help here.
Entity SQL is slightly higher-level (in places) than Transact-SQL; the same goes for HQL, etc. For object-model approaches, LINQ (IQueryable<T>) is much higher-level, allowing simple navigation:
var qry = from cust in db.Customers
select cust.Orders.Sum(o => o.OrderValue);
etc
Martin Fowler poured a whole load of energy into this and produced the Active Record pattern. I think this is what you're looking for?
Not sure if this falls in what you are looking for but I've been generating SQL dynamically from the definition of the Data Access Objects; the idea is to reflect on the class and by default assume that its name is the table name and all properties are columns. I also have search criteria objects to build the where part. The DAOs may contain lists of other DAO classes and that directs the joins.
Since you asked for something to take care of most of the repetitive SQL, this approach does it. And when it doesn't, I just fall back on handwritten SQL or stored procedures.
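A minimal sketch of that convention (class name as table name, public properties as columns; criteria, joins, and identifier quoting omitted):

using System;
using System.Linq;

static string BuildSelect(Type daoType)
{
    // Convention: class name == table name, public properties == columns.
    var columns = string.Join(", ",
        daoType.GetProperties().Select(p => p.Name));
    return $"SELECT {columns} FROM {daoType.Name}";
}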