I'm curious about how the result set of an SQL query is transported from the server to the client.
Most O/R mappers support both eager and lazy loading; both have their pros and cons.
For example, Entity Framework 4 (.NET) has excellent eager-load support.
However, let's assume we have a model like this:
public class BlogPost
{
    public string Body { get; set; }
    public ICollection<Comment> Comments { get; set; }
}
...
and a query like this:
var post = context
    .Posts
    .Include(p => p.Comments)
    .Where(p => p.Id == 1)
    .First();
This will result in a single SQL query, where all the data for the "Post" is repeated on each row for every "Comment".
Let's say we have 100 comments on a specific post and Post.Body is a massive piece of text; this can't be good, can it?
Or is the data somehow compressed when sent to the client, thus minimizing the overhead of repeating data on each row?
What is the best way to determine if one such query is more efficient than just two simple queries (one for getting the post and one for getting its comments)?
Benchmarking this in a dev environment seems pretty pointless; there are multiple factors here:
CPU load on the SQL server
Network load
CPU load on the app server (materializing objects)
Ideas on this?
[Edit]
Clarification:
Two queries would be something like this:
SQL:
select * from post where postid = 123

Result:
id, topic, body, etc...

SQL:
select * from comment where postid = 123

Result:
id, postid, commenttext, etc...
The first query would yield one row and the second query would yield as many rows as there are comments.
With a single query there would be as many rows as there are comments on the specific post, but with all the post data repeated on each row.

Result:
p.id, p.topic, p.body, c.id, c.postid, c.commenttext

p.body would be repeated on each row, thus making the result set extremely large (assuming that p.body contains a lot of data, that is ;-)).
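For reference, the single-query form boils down to roughly this (a simplified sketch; the exact SQL an ORM emits will differ):

-- p.body travels once per comment row, even though it never changes
select p.id, p.topic, p.body, c.id, c.postid, c.commenttext
from post p
left join comment c on c.postid = p.id
where p.id = 123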
I think it really comes down to the following:
How many posts are there?
How complex is it to get the comments of a post?
If you have several million posts, it will be better to use a single query, even if each post has several comments, because the aggregated round-trip time of many queries will be much worse than the time needed to transfer the additional data.
So, I think you need to have a sharp eye ;-)
Also, I don't think that benchmarking in the dev environment is pointless, because it can at least show the relative performance of the two approaches.
Having a single query that returns a lot of rows is almost always faster than a lot of queries returning just a single row.
In your case, though, retrieving the post first and then all its comments (with a single query) is probably more efficient than getting everything in one query.
Related
We query a relational database using standardized SQL. The result of a query is a two-dimensional table: rows and columns.
I really like the clear structure of an RDBMS (I honestly have never worked professionally with other DB systems). But the query language, or more exactly the result set SQL produces, is quite a limitation that affects performance in general.
Let's create a simple example: Customer - Order (1 - n)
I want to query all customers whose names start with the letter "A" and who have an order this year, and display each one with all of his/her orders.
I have two options to query this data.
Option 1
Load data with a single query with a join between both tables.
Downside: the result transferred to the client contains duplicated customer data, which is an overhead.
Option 2
Query the customers and start a second query to load their orders.
Downsides: two queries, which means twice the network latency; the WHERE ... IN clause of the second query can potentially be very big, which could violate query length limits; and performance is not optimal because both queries perform a join/filter against the orders table.
There would, of course, be an option three where we start the query from the orders table.
So in general we have the problem of estimating, based on the specific situation, which is the better trade-off: a single query with data overhead, or multiple queries with worse execution time. Both strategies can be bad in complex situations where a lot of well-normalized data has to be queried. (Both options are sketched below.)
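A rough sketch of both options in plain SQL (table and column names are invented, and "this year" is written as a fixed date for brevity):

-- Option 1: one round trip, but the customer columns repeat on every order row
SELECT c.*, o.*
FROM Customers c
JOIN Orders o ON o.customer_id = c.id
WHERE c.name LIKE 'A%'
  AND o.order_date >= '2024-01-01';

-- Option 2: no duplication, but two round trips and a potentially huge IN list;
-- the first query still has to touch Orders to find customers with an order this year
SELECT c.*
FROM Customers c
WHERE c.name LIKE 'A%'
  AND EXISTS (SELECT 1 FROM Orders o
              WHERE o.customer_id = c.id
                AND o.order_date >= '2024-01-01');

SELECT o.*
FROM Orders o
WHERE o.customer_id IN (/* ids returned by the first query */)
  AND o.order_date >= '2024-01-01';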
So ideally SQL would be able to specify the result of a query in the form of an object structure. Imagine the result of the query being structured as XML or JSON instead of a table. If you have ever worked with an ORM like Entity Framework, you may know the "Include" command. With support for an "Include"-like command in SQL, returning the result not as a join but structured like an object, the world would be a better place. Another scenario would be an Include-like query but without duplicates, so basically two tables in one result. To visualize it, the results could look like:
{
  { customer 1 { order 1, order 2 } }
  { customer 2 { order 3, order 4 } }
}

or

{
  { customer1, customer2 }
  { order1, order2, order3, order4 }
}
MS SQL Server has a feature called "Multiple Result Sets" which I think comes quite close, but it is not part of standard SQL. I am also unsure whether ORM mappers really use such a feature, and I assume it is still two queries being executed (but in one client-to-server request), instead of something like "select customers include orders from customers join orders where customer starts with 'A' and orders ...".
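For what it's worth, an ordinary SQL Server batch can already return two result sets from one client-to-server request; a rough sketch (same invented names as above):

-- One batch, one round trip, two result sets: customers once, orders once, no duplication
SELECT c.id, c.name
FROM Customers c
WHERE c.name LIKE 'A%'
  AND EXISTS (SELECT 1 FROM Orders o
              WHERE o.customer_id = c.id AND o.order_date >= '2024-01-01');

SELECT o.*
FROM Orders o
JOIN Customers c ON c.id = o.customer_id
WHERE c.name LIKE 'A%'
  AND o.order_date >= '2024-01-01';

The trade-off is that the client (or the mapper) still has to stitch the two result sets back together.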
Do you generally face the same problem? If so, how do you solve it? Do you know of a database query language that can do this, perhaps even with an existing ORM mapper supporting it (probably not)? I have no real working experience with other database systems, but I don't think the newer database systems address this problem either (they address other problems, of course). What is interesting is that in graph databases, joins are basically free, as far as I understand.
I think you can alter your application workflow to solve this issue.
New application workflow (rough SQL below):
Query the Customer table for customers whose names start with the letter 'A'. Send the result to the client for display.
The user selects a customer on the client, which sends the customer id back to the server.
Query the Order table by that customer id and send the result to the client for display.
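A rough sketch of the two server calls (same invented names and columns as above):

-- Step 1: customers starting with 'A' that have an order this year; only customer columns travel
SELECT c.id, c.name
FROM Customers c
WHERE c.name LIKE 'A%'
  AND EXISTS (SELECT 1 FROM Orders o
              WHERE o.customer_id = c.id AND o.order_date >= '2024-01-01');

-- Step 3: orders for the single customer the user picked
SELECT o.*
FROM Orders o
WHERE o.customer_id = @selected_customer_id;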
Some versions of SQL Server can return JSON. If you have a table A related to a table B, and every entry in table B points to at most one entry in table A, then you can reduce the traffic overhead as you described. One example could be an address and its contacts.
SELECT * FROM Address
JOIN Contact ON Address.AddressId = Contact.AddressId
FOR JSON AUTO
The returned result would be smaller:
"AddressId": "3396B2F8",
"Contact": [{
"ContactId": "05E41746",
... some other information
}, {
"ContactId": "025417A5",
... some other information
}, {
"ContactId": "15E417D5",
... some other information
}
}
]
But actually, I don't know of any ORM which processes JSON for traffic reduction.
If you had some contacts for different addresses, it could be counterproductive.
Don't forget that JSON also has some overhead of its own and needs to be serialized and deserialized.
The optimum for traffic reduction would be if SQL Server split the joined result into multiple result sets and the client, or rather the object-relational mapper, mapped them back together. I would be interested to hear if you find a solution to your problem.
Another train of thought would be to use a graph database.
I have a database with multiple nested associations. Basically, the structure is as follows:
Order -> OrderItem -> OrderItemPlaylist -> OrderItemPlaylistItem -> Track -> Artist
I need to generate a report based on all orders sold in a certain date, which needs to traverse into ALL the mentioned associations in order to generate the required information.
Trying to join all the tables together would be overkill, as it would result in an extremely large Cartesian product with a lot of redundant data, considering it would be joining 6 tables together. Code below:
q.Left.JoinQueryOver<OrderItem>(order => order.OrderItems)
.Left.JoinQueryOver<OrderItemPlaylist>(orderItem => orderItem.Playlist)
.Left.JoinQueryOver<OrderItemPlaylistItem>(orderItemPlaylist => orderItemPlaylist.PlaylistItems)
.Left.JoinQueryOver<Track>(orderItemPlaylistItem => orderItemPlaylistItem.Track)
.Left.JoinQueryOver<Artist>(track => track.Artist)
The above works, but with even a few orders, each with a few order items and playlists consisting of multiple tracks, the results explode into thousands of records, growing rapidly with each extra order.
Any idea what the best and most efficient approach would be? I've currently tried enabling batch loading, which greatly reduces the number of database queries, but it still seems to me more like an 'easy workaround' than a good approach.
There is no need for all the data to be loaded in just one SQL query, given the huge amount of data. One SQL query for each association would be perfect, I guess. Ideally it would be something where you first get all the orders, then get all the order items for those orders and load them into the associated collections, then the playlists for each order item, and so on and so forth (roughly as sketched below).
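In plain SQL, that per-association loading boils down to something like the following (table and column names are guesses based on the entity names):

-- One query per level; each level filters on the keys loaded by the previous one
SELECT * FROM Orders             WHERE OrderDate BETWEEN @from AND @to;
SELECT * FROM OrderItems         WHERE OrderId IN (/* ids from query 1 */);
SELECT * FROM OrderItemPlaylists WHERE OrderItemId IN (/* ids from query 2 */);
SELECT * FROM PlaylistItems      WHERE PlaylistId IN (/* ids from query 3 */);
SELECT * FROM Tracks             WHERE TrackId IN (/* ids from query 4 */);
SELECT * FROM Artists            WHERE ArtistId IN (/* ids from query 5 */);

The total row count then stays roughly the sum of the association sizes instead of their product.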
Also, this doesn't have to be specifically in QueryOver, as I can access the .RootCriteria and use the Criteria API.
Any help would be greatly appreciated!
I believe this is what you are looking for
http://ayende.com/blog/4367/eagerly-loading-entity-associations-efficiently-with-nhibernate
If you prefer one SQL query, what SQL syntax would you expect this to produce? I guess you can't avoid a long sequence of JOINs if you're going for one SQL query.
I guess what I would do is get the entities level by level, using several queries.
You should probably start off by defining the query as best you can in SQL, and looking at the execution plans to find the very best method (and whether your indexes are sufficient).
At that point you know what you're shooting for, and then it's reasonably easy to try and code the query in HQL or QueryOver or even LINQ and check the results using the SQL writer in NHibernate, or the excellent NHProfiler http://www.nhprof.com.
You are probably right about ending up with several queries. Speed them up by batching as many as you can (that do not depend on each other) into single trips by using the "Future" command in Criteria or QueryOver. You can read more about that here: http://ayende.com/blog/3979/nhibernate-futures
I have a table with books, a table with authors, and a table relating books to authors. A book can have more than one author, so when I do my big query for these results I might get more than one row per book, if the book has more than one author. I then merge the results together in PHP, but the thing is that if I LIMIT/OFFSET the query for pagination, I might get fewer than the desired 25 unique books per page.
Can anyone think of a (or is there a built-in) way to have the LIMIT affect a grouped-by query but still get all the results? I'd rather not do one grouped-by query and then do other queries to get each author because I lose the benefit of cached results.
If not, I'll probably do a pre-pass saving the cached results and then query each author separately.
I had exactly this same problem in a different use case (theater reservation system) and after some research and testing, I've used the pre-pass approach. It's fast and clean and works very well even with a large number of rows (in my case, over 600k). Hope it helps! :)
There are two approaches you could use:
Using n+1 queries.
Emulate ROW_NUMBER() OVER (PARTITION BY your_group) in MySQL using variables and select only the rows with row number 25 or less.
The second is quite difficult to write correctly; a rough sketch follows.
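The idea, sketched with hypothetical book_author(book_id, author_id) names, is to number the distinct books with user variables (strictly speaking closer to DENSE_RANK() than ROW_NUMBER()) and keep only the rows belonging to the first 25 books. Newer MySQL/MariaDB versions can use window functions directly instead.

-- Rank each distinct book as it appears, then keep all rows of the first 25 books
SELECT book_id, author_id
FROM (
    SELECT
        t.book_id,
        t.author_id,
        @grp  := IF(@prev = t.book_id, @grp, @grp + 1) AS book_rank,
        @prev := t.book_id AS prev_book
    FROM (
        SELECT ba.book_id, ba.author_id
        FROM book_author ba
        ORDER BY ba.book_id
    ) AS t
    CROSS JOIN (SELECT @grp := 0, @prev := NULL) AS vars
) AS ranked
WHERE book_rank <= 25;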
There's already an accepted answer, but I think this may be useful.
GROUP_CONCAT allows you to merge multiple rows into a single row in a MySQL query. Using this, you could concatenate the authors into a list as one field.
SELECT GROUP_CONCAT(author) FROM books GROUP BY book_id;
http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html#function_group-concat
Suppose I have a database containing blog posts that have tags, and I want to retrieve a blog post with its tags.
I can do two queries:
SELECT title, text, date
FROM posts
WHERE id = 1

and

SELECT tag
FROM tags
WHERE post_id = 1

However, I can also do it in a single query:

SELECT title, text, date, tag
FROM posts, tags
WHERE posts.id = 1
  AND tags.post_id = posts.id
The latter is a bit wasteful, because it transfers the same title, text, and date columns as many times as the blog post has tags, but it may be faster because there is only one round trip to the server.
Is there some alternative for the latter query which avoids transferring duplicate data? Or is it not a big deal and I should use it anyway, because transferring a few hundred unused extra bytes is cheaper than making two separate queries?
MySQL optimization isn't quite as straightforward as this. You'll find that sometimes multiple queries (possibly with a temp table in the middle) are much faster than a single query, especially with complex joins/aggregations going on. So don't assume that a single query will always be faster, because it won't always be.
However, a single query is often just as fast or faster than multiple queries, and it expresses what you're doing much more succinctly. Don't worry about a handful of bytes across the wire, they are trivial in comparison to everything else that's going on.
Rather make a single trip to the server than a trip per posts entry.
Trust me, you will see the performance gain.
A single query to retrieve the data is more often than not a better solution than multiple round trips from the client to the server.
How would you make a news-feed-friendly database design, so that it wouldn't be extremely expensive to get all of the items (query) to put in the news feed? The only way I can think of would involve UNIONing nearly every table (representing groups, notes, friends, etc.) and getting the dates and such, but that seems like it'd be a really expensive query to run for each user, and it'd be pretty hard to cache something like that with everyone's feed being different.
Firstly, consider doing a performance prototype to check your hunch that the union would be too expensive. You may be prematurely optimizing something that is not an issue.
If it is a real issue, consider a table designed purely to hold the event feed data, which must be updated in parallel with the other tables.
E.g. when you create a Note record, also create an event record in the Event table with the date, description, and user involved.
Consider indexing the Event table on UserId (or UserId and Date). Also consider clearing out old data when it is no longer required.
This isn't a normalised schema, but it may be faster if getting an event feed is a frequent operation.
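A minimal sketch of such an Event table, in MySQL-flavoured syntax (all names and columns are invented for illustration):

-- One row per feed item, written whenever a note/group/friend record is created
CREATE TABLE Event (
    EventId     BIGINT AUTO_INCREMENT PRIMARY KEY,
    UserId      BIGINT       NOT NULL,
    EventDate   DATETIME     NOT NULL,
    Description VARCHAR(255) NOT NULL,
    INDEX idx_event_user_date (UserId, EventDate)
);

-- Fetching a user's feed then becomes a single indexed range scan
SELECT EventDate, Description
FROM Event
WHERE UserId = 123
ORDER BY EventDate DESC
LIMIT 20;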
It's hard to answer this question without a schema, but my hunch is that a UNION involving 10 or more properly indexed tables is nothing:
A typical LAMP application like wordpress or PHPBB runs more than 10 queries per pageview without problems. So don't worry.
UNION = expensive, because the complete result set is subject to a DISTINCT operation.
UNION ALL = cheaper, because it is effectively multiple queries for which the results of each are appended together.
It depends on the data volume, of course.
The main driver of efficiency would be the individual queries that are unioned together, but there's no reason why selecting the most recent (say) 10 records from each of 10 tables should take more than a small fraction of a second.
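For illustration, a minimal MySQL-flavoured sketch of that kind of feed query (table and column names are invented; each source table is assumed to have an indexed user id and date column):

-- Newest 10 rows from each source table, merged and re-sorted; UNION ALL avoids the DISTINCT pass
(SELECT 'note' AS item_type, id, created_at
 FROM notes
 WHERE user_id = 123
 ORDER BY created_at DESC
 LIMIT 10)
UNION ALL
(SELECT 'group_post' AS item_type, id, created_at
 FROM group_posts
 WHERE user_id = 123
 ORDER BY created_at DESC
 LIMIT 10)
ORDER BY created_at DESC
LIMIT 10;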