We query a relational database using standardized SQL. The result of a query is a two-dimensional table: rows and columns.
I really like the rigid structure of an RDBMS (honestly, I have never worked professionally with other database systems). But the query language, or more precisely the result set SQL produces, is quite a limitation that affects performance in general.
Let's create a simple example: Customer - Order (1 - n)
I want to query all customers whose name starts with the letter "A" and who have an order this year, and display each one with all of his/her orders.
I have two options to query this data.
Option 1
Load data with a single query with a join between both tables.
Downside: the result that is transferred to the client contains duplicated customer data, which is pure overhead.
Option 2
Query the customers, then run a second query to load their orders.
Downsides: two queries mean twice the network latency; the WHERE ... IN clause of the second query can potentially become very large, which could violate query-length limits; and performance is not optimal because both queries perform a join/filter against the orders table.
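The query-length concern for the WHERE ... IN clause can at least be worked around by batching the id list; a minimal sketch (the limit of 1000 ids per batch is an assumption, real limits vary by server):

```python
def chunk_ids(ids, size=1000):
    # Split a long id list into batches so each WHERE ... IN (...) clause
    # stays safely under the server's query-length limit.
    return [ids[i:i + size] for i in range(0, len(ids), size)]

batches = chunk_ids(list(range(2500)))
print([len(b) for b in batches])  # [1000, 1000, 500]
```

This trades one oversized query for a few bounded ones, but of course it adds round trips, which is exactly the trade-off discussed here.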
There would of course be an option three, where we start the query from the orders table.
So in general we have to estimate, based on the specific situation, which trade-off is better: a single query with data overhead, or multiple queries with worse execution time. Both strategies can perform badly in complex situations where a lot of data in well-normalized form has to be queried.
So ideally SQL would be able to return the result of a query in the form of an object structure. Imagine the result of the query were structured as XML or JSON instead of a table. If you have ever worked with an ORM like Entity Framework, you may know the "Include" command. With support for an "Include"-like command in SQL, returning the result not as a join but structured like an object, the world would be a better place. Another scenario would be an include-like query but without duplicates, so basically two tables in one result. To visualize it, the results could look like:
{
  { customer 1 { order 1, order 2 } }
  { customer 2 { order 3, order 4 } }
}
or
{
  { customer 1, customer 2 }
  { order 1, order 2, order 3, order 4 }
}
MS SQL Server has a feature called "Multiple Result Sets" which I think comes quite close. But it is not part of standard SQL, and I am unsure whether ORMs really use such a feature. I also assume it still executes two queries (though in one client-to-server request), instead of something like "select customers include orders From customers join orders where customers starts with 'A' and orders..."
Do you generally face the same problem? How do you solve it, if so? Do you know a database query language that can do this, maybe even with an existing ORM supporting it (probably not)? I have no real working experience with other database systems, but I don't think the newer database systems address this problem (they address other problems, of course). What is interesting is that in graph databases, joins are basically free, as far as I understand.
I think you can alter your application workflow to solve this issue.
New application workflow:
Query the Customer table for customers whose name starts with the letter 'A' and send the result to the client for display.
The user selects a customer on the client, which sends that customer's id back to the server.
Query the Order table by that customer id and send the result to the client for display.
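A minimal sketch of that workflow, with SQLite standing in for the server and the user's selection simulated (schema and names are assumptions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customer_order (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
INSERT INTO customer VALUES (1, 'Alice'), (2, 'Arthur');
INSERT INTO customer_order VALUES (1, 1, 'book'), (2, 2, 'lamp');
""")

# Step 1: fetch only the matching customers for display.
customers = con.execute(
    "SELECT id, name FROM customer WHERE name LIKE 'A%' ORDER BY id").fetchall()

# Step 2: the user picks one; only then are that customer's orders loaded.
chosen_id = customers[0][0]   # pretend the user clicked the first entry
orders = con.execute(
    "SELECT item FROM customer_order WHERE customer_id = ?",
    (chosen_id,)).fetchall()
print(customers, orders)
```

The point of this shape is that the orders of unselected customers are never transferred at all.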
Some SQL servers can return JSON. If a table B is related to a table A and every entry in B points to at most one entry in A, then you can reduce the traffic overhead as you described. An example could be an address and its contacts.
SELECT * FROM Address
JOIN Contact ON Address.AddressId = Contact.AddressId
FOR JSON AUTO
The returned result would be smaller:
[{
    "AddressId": "3396B2F8",
    "Contact": [{
        "ContactId": "05E41746",
        ... some other information
    }, {
        "ContactId": "025417A5",
        ... some other information
    }, {
        "ContactId": "15E417D5",
        ... some other information
    }]
}]
But actually, I don't know of any ORM that processes JSON to reduce traffic.
If the same contacts belonged to several different addresses, it could even be counterproductive.
Don't forget that JSON also has some overhead of its own and needs to be serialized and deserialized.
The optimum for traffic reduction would be if the SQL server split the joined result into multiple result sets and the client, respectively the object-relational mapper, mapped them back together. I would be interested to hear whether you find a solution to your problem.
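A toy illustration of both points: nesting removes the repeated parent columns, but JSON repeats its key names in every object either way (the field names and sizes here are invented):

```python
import json

# The same address/contact data serialized two ways.
address = {"AddressId": "3396B2F8", "Street": "Main Street 1"}
contacts = [{"ContactId": f"C{i:03d}", "Phone": "555-0100"} for i in range(50)]

flat = [dict(address, **c) for c in contacts]   # joined rows: address repeated
nested = dict(address, Contact=contacts)        # address sent only once

flat_size = len(json.dumps(flat))
nested_size = len(json.dumps(nested))
# Nesting wins here, but note that both forms repeat the key names for
# every contact -- that is the serialization overhead JSON itself adds.
print(flat_size, nested_size)
```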
Another train of thought would be to use a graph database.
Related
I am trying to find a way to determine whether an SQL SELECT query A is bound to return a subset of the results returned by another query B. Furthermore, this needs to be accomplished from the queries alone, without having access to the respective result sets.
For example, the query SELECT * from employee WHERE salary >= 1000 will return a subset of the results of query SELECT * from employee. I need to find an automated way to perform this validation for any two queries A and B, without accessing the database that stores the data.
If it is unfeasible to achieve this without the aid of an RDBMS, we can assume that I have access to a local but empty RDBMS, with the data stored somewhere else. In addition, this check must be done in code, either with an algorithm or with a library. The language I am using is Java, but other languages will also do.
Many thanks in advance.
I don't know how deep you want to get into parsing queries, but basically you can say that there are two general ways of producing a subset of a query (given that the source table and the projection (SELECT) stay the same):
using where clause to add condition to row values
using having clause to add conditions to aggregated values
So you can say that if you have two objects representing queries, and they look something close to this:
{
'select': { ... },
'from': {},
'where': {},
'orderby': {}
}
and they have the same select, from and orderby, but one has an extra condition in the where clause, then you have a subset.
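Sticking with that representation, a minimal subset check might look like this in Python; the dict shape and the condition strings are assumptions, not a real SQL parser, and the check is purely syntactic (it cannot prove subsets that need real reasoning, e.g. "salary >= 1000" vs "salary >= 500"):

```python
def is_subset_query(a, b):
    # a is (provably) a subset of b when select/from/orderby are identical
    # and a's WHERE conditions include all of b's, plus possibly extra ones.
    same_shape = all(a.get(k) == b.get(k) for k in ("select", "from", "orderby"))
    return same_shape and set(b.get("where", ())) <= set(a.get("where", ()))

q_all  = {"select": "*", "from": "employee", "where": set()}
q_some = {"select": "*", "from": "employee", "where": {"salary >= 1000"}}
print(is_subset_query(q_some, q_all))   # True: extra condition -> subset
print(is_subset_query(q_all, q_some))   # False
```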
One way you might be able to determine whether a query is a subset of another is by examining their source tables. If you don't have access to the data itself, this can be tricky. This question references using Snowflake Joins to generate database diagrams from a query without having access to the data itself:
Generate table relationship diagram from existing schema (SQL Server)
If your query is 800 characters or less, the tool is free to use: https://snowflakejoins.com/index.html
I tested it out using the AdventureWorks database and these two queries:
SELECT * FROM HumanResources.Employee
SELECT * FROM HumanResources.Employee WHERE EmployeeID < 200
When I plugged both of them into the Snowflake Joins text editor, this is what was generated:
SnowflakeJoins DB Diagram example
Hope that helps.
I have this type of model (schema):
1 product hasOffer (many) offers
1 offer hasRules (many) shipment rules
like
Product(1)--->Offer(N)----->Rules(M)
How can I query one product with all its offers and all shipping rules?
In simple words: how can I query one-to-many related records?
This can be achieved using a simple SPARQL query with multiple triple patterns in the WHERE clause. This works because the graph model over which SPARQL queries run naturally joins the data together.
eg.
SELECT ?product ?offer ?rules
WHERE { {?product ns:hasOffer ?offer} {?offer ns:hasRules ?rules} }
Source: running the same example in Protégé with this query.
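To see why the join falls out naturally, here is a tiny plain-Python imitation of matching the two triple patterns (this is not a SPARQL engine, just the same join written by hand over an invented triple set):

```python
# Each triple is (subject, predicate, object), mirroring the schema above.
triples = [
    ("product1", "hasOffer", "offer1"),
    ("product1", "hasOffer", "offer2"),
    ("offer1", "hasRules", "ruleA"),
    ("offer2", "hasRules", "ruleB"),
]

# Matching the pattern {?product hasOffer ?offer . ?offer hasRules ?rules}
# is just a join on the shared ?offer variable:
matches = [(p, o, r)
           for (p, pred1, o) in triples if pred1 == "hasOffer"
           for (o2, pred2, r) in triples if pred2 == "hasRules" and o2 == o]
print(matches)
```

Real triple stores index the triples instead of scanning, which is why this kind of join is so cheap there.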
EDIT:
Question: is there any way to construct as product{ offers:{ rules:{} }, productAttributes } ?
Answer: Yes, there is: you can add ?attribute to the SELECT clause and the triple ?product ns:productAttribute ?attribute to the WHERE clause (or multiple such triples if the attributes are spread across multiple data properties). However, I would strongly recommend that you refrain from doing so. The query is going to return a potentially large set of data, similar to an SQL result set; you are going to see multiple rows with the same product, and having a product attribute there will make your life unnecessarily difficult, since that attribute will appear in every row with that product. Instead, run a separate query in which you only get the products and their attributes.
I have a database with multiple nested associations. Basically, the structure is as follows:
Order -> OrderItem -> OrderItemPlaylist -> OrderPlaylistItem -> Track -> Artist
I need to generate a report based on all orders sold in a certain date, which needs to traverse into ALL the mentioned associations in order to generate the required information.
Trying to join all the tables together would be overkill, as it would result in an extremely large Cartesian product with a lot of redundant data, considering it would join six tables together. Code below:
q.Left.JoinQueryOver<OrderItem>(order => order.OrderItems)
.Left.JoinQueryOver<OrderItemPlaylist>(orderItem => orderItem.Playlist)
.Left.JoinQueryOver<OrderItemPlaylistItem>(orderItemPlaylist => orderItemPlaylist.PlaylistItems)
.Left.JoinQueryOver<Track>(orderItemPlaylistItem => orderItemPlaylistItem.Track)
.Left.JoinQueryOver<Artist>(track => track.Artist)
The above works, but with even a few orders, each with a few order items and a playlist consisting of multiple tracks, the results explode into thousands of records, growing exponentially with each extra order.
Any idea what the best and most efficient approach would be? I've currently tried enabling batch loading, which greatly reduces the number of database queries, but it still seems less like a good approach and more like an easy workaround.
Given the huge amount of data, there is no need for it all to be loaded in just one SQL query. One SQL query per association would be perfect, I guess. Ideally it would be something where you first get all orders, then all the order items for those orders and load them into the associated collections, then the playlists for each order item, and so on and so forth.
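The level-by-level idea can be sketched independently of NHibernate: one query per association, with the children stitched onto their parents in memory. SQLite and the two-level schema here are stand-ins for the real six-table model:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY);
CREATE TABLE order_item (id INTEGER PRIMARY KEY, order_id INTEGER, name TEXT);
INSERT INTO orders VALUES (1), (2);
INSERT INTO order_item VALUES (10, 1, 'a'), (11, 1, 'b'), (12, 2, 'c');
""")

# Level 1: load the roots.
orders = {oid: {"id": oid, "items": []}
          for (oid,) in con.execute("SELECT id FROM orders")}

# Level 2: ONE query for all children of those roots, joined to their
# parents in application memory instead of on the server.
placeholders = ",".join("?" * len(orders))
for _item_id, oid, name in con.execute(
        f"SELECT id, order_id, name FROM order_item "
        f"WHERE order_id IN ({placeholders}) ORDER BY id",
        list(orders)):
    orders[oid]["items"].append(name)

print(orders[1]["items"], orders[2]["items"])
```

Each deeper level repeats the same pattern, so the query count grows with the number of associations, not with the number of rows.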
Also, this doesn't have to be specifically in QueryOver, as I can access the .RootCriteria and use the Criteria API.
Any help would be greatly appreciated !
I believe this is what you are looking for
http://ayende.com/blog/4367/eagerly-loading-entity-associations-efficiently-with-nhibernate
If you prefer one SQL query, what SQL syntax would you expect this to produce? I guess you can't avoid a long sequence of JOINs if you're going for one SQL query.
I guess what I would do is get the entities level by level, using several queries.
You should probably start off by defining the query as best you can in SQL and looking at the execution plans to find the very best method (and whether your indexes are sufficient).
At that point you know what you're shooting for, and then it's reasonably easy to try and code the query in HQL or QueryOver or even LINQ and check the results using the SQL writer in NHibernate, or the excellent NHProfiler http://www.nhprof.com.
You are probably right about ending up with several queries. Speed them up by batching as many as you can (those that do not depend on each other) into single round trips using the "Future" command in Criteria or QueryOver. You can read more about that here: http://ayende.com/blog/3979/nhibernate-futures
I'm curious how the result set of an SQL query is transported from the server to the client.
Most O/R mappers support both eager and lazy load, both have their pros and cons.
e.g. Entity Framework 4 (.NET) has wonderful eager-load support.
However, let's assume we have a model like this:
public class BlogPost
{
    public string Body { get; set; }
    public ICollection<Comment> Comments { get; set; }
}
...
and a query like this:
var posts = context
.Posts
.Include(post => post.Comments)
.Where(post => post.Id == 1)
.First();
This will result in a single SQL query, where all the data for the "Post" is repeated on each row for every "Comment".
Let's say we have 100 comments on a specific post and Post.Body is a massive piece of text; this can't be good?
Or is the data somehow compressed when sent to the client, minimizing the overhead of the data repeated on each row?
What is the best way to determine whether one such query is more efficient than just two simple queries (one for getting the post and one for getting its comments)?
Benchmarking this in a dev environment is pretty pointless; there are multiple factors here:
CPU load on the SQL server
Network load
CPU load on the app server (materializing objects)
Ideas on this?
[Edit]
Clarification:
Two queries would be something like this:
sql:
select * from post where postid = 123
result:
id, topic, body, etc...
sql:
select * from comment where postid = 123
result:
id, postid, commenttext, etc...
The first query would yield one row, and the second query would yield as many rows as there are comments.
With a single query there would be as many rows as there are comments for the specific post, but with all the post data repeated on each row.
result:
p.id, p.topic, p.body, c.id, c.postid, c.commenttext
p.body would be repeated on each row, thus making the result set extremely large
(assuming that p.body contains a lot of data, that is ;-)
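A back-of-envelope estimate of that payload difference (all sizes below are assumptions):

```python
# Uncompressed payload estimate for the two strategies.
body_bytes = 100_000      # a massive Post.Body
comment_bytes = 200       # average bytes per comment row
n_comments = 100

single_query = n_comments * (body_bytes + comment_bytes)  # body on every row
two_queries = body_bytes + n_comments * comment_bytes     # body sent once

print(single_query, two_queries)  # 10020000 120000
```

Under these assumptions the joined result is almost 100x larger on the wire, which is exactly the effect the question worries about; whether compression on the connection absorbs that is a separate measurement.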
I think it really comes down to the following:
How many posts are there?
How complex is it to get the comments of a post?
If you have several million posts, it will be better to use a single query, even if you have several comments for each post, because the aggregated round-trip time will be much worse than the time needed to transfer the additional data.
So, I think you need to keep a sharp eye on it ;-)
And also, I think that benchmarking in the dev environment is not pointless, because it can at least show the relation between the two ways of doing it.
Having a single query that returns a lot of rows is almost always faster than a lot of queries each returning just a single row.
In your case, though, retrieving the post first and then all its comments (with a single query) is probably more efficient than getting everything in one query.
Suppose I have two tables:
Group
(
id integer primary key,
someData1 text,
someData2 text
)
GroupMember
(
id integer primary key,
group_id foreign key to Group.id,
someData text
)
I'm aware that my SQL syntax is not correct :) Hopefully it is clear enough. My problem is this: I want to load a group record and all the GroupMember records associated with that group. As I see it, there are two options.
A single query:
SELECT Group.id, Group.someData1, Group.someData2, GroupMember.id, GroupMember.someData
FROM Group INNER JOIN GroupMember ...
WHERE Group.id = 4;
Two queries:
SELECT id, someData1, someData2
FROM Group
WHERE id = 4;
SELECT id, someData
FROM GroupMember
WHERE group_id = 4;
The first solution has the advantage of needing only one database round trip, but the disadvantage of returning redundant data (all the group data is duplicated for every group member).
The second solution returns no duplicate data but involves two round trips to the database.
What is preferable here? I suppose there is some threshold such that, once group sizes become sufficiently large, the cost of returning all the redundant data becomes greater than the overhead of an additional database call. What other things should I be thinking about here?
Thanks,
Jordan
If you actually want the results joined, I believe it is always more efficient to do the joining at the server level. The SQL processor is designed to match sets of data.
If you really want the results of 2 sql statements, you can always send two statements in one batch separated by a semicolon, and get two resultsets back with one round trip to the DB.
How the data is finally used is an important and unknown factor.
I suggest the single query method for most applications. Proper indexing will keep the query more efficient than the two query method.
The single query method also has the benefit of remaining valid if you need to select more than one group.
If you are only ever going to retrieve a single group record with each request to the database, then I would go with the second option. If you are retrieving multiple group records and their associated group member records, go with the join, as it will be much quicker.
In general, it depends on what type of data you are trying to display.
If you are showing a single group and all its members, performance differences between the two options would be negligible.
If you are showing many groups and all of their members, the overhead of making a round trip to the database for each successive group will quickly outweigh any benefit you got from receiving a little less data.
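One way to reason about the threshold is a rough break-even model between round-trip latency and the redundant bytes the join transfers; every number below is an assumption to be replaced by measurements on your own setup:

```python
# Rough break-even sketch for join-vs-two-queries.
round_trip_s = 0.001            # cost of one extra round trip
bandwidth_bps = 100e6 / 8       # 100 Mbit/s link, in bytes/second
group_row_bytes = 500           # Group columns repeated on each member row

# The join transfers roughly group_row_bytes * (members - 1) redundant
# bytes; it stays cheaper while sending those takes less time than one
# extra round trip:
break_even_members = 1 + round_trip_s * bandwidth_bps / group_row_bytes
print(break_even_members)  # 26.0
```

Under these made-up numbers the join wins for groups of up to ~26 members; with a slower link or fatter group rows, the threshold drops accordingly.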
Some other things you might want to consider in your reasoning:
Result set size - For many groups and members, your result set size may become a limiting factor as the cost of retrieving it and keeping it in memory increases. This is likely to occur with the second option. You may want to consider paging the data, so that you only retrieve a certain subset at a time.
Lazy Loading - If you are only getting the members of some groups, or a user is requesting the members one group at a time, consider Lazy Loading. This means only making the additional query to get the group's members when needed. This makes sense only in certain use cases, but it can be much more effective than retrieving all data up front.
Depending on the type of database and your frontend application, you can return the results of two SQL statements on one trip (A stored procedure in SQL Server 2005 for example).
If you are creating a report that requires many fields from the Group table, you may not want the increased amount of data with the first query.
If this is some type of data entry app, you've probably already presented the Group data to the user, so they could fill in the group id on the where clause (or preferably via some parameter) and now they need the member results.
It really, really, really depends on what use you will make of the data.
For instance, if you were assembling a list of group members for a mail shot, and you needed the group name for each letter you're going to send to a member, and you had no use for the Group level itself, then the single joined query makes a lot of sense.
But if, say, you're coding a master-detail screen or report, with a page for each group and information displayed at both the Group and the Member level, then the two separate queries are probably more useful.
Unless you are retrieving quite large amounts of data (tens of thousands of groups with hundreds of members per group, or similar orders of magnitude), you are unlikely to see much difference between the performance of the two approaches.
On a simple query like this I would try to perform it in one query. The overhead of two database calls will probably exceed the additional SQL processing time of the combined query.
A UNION clause will do this for you:
SELECT id, someData1, someData2
FROM Group
WHERE id = 4
UNION
SELECT id, someData, null
FROM GroupMember
WHERE group_id = 4;
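A sketch of consuming such a batched result, with SQLite standing in for the server and an extra discriminator column (my addition) to tell the two row shapes apart; UNION ALL is used here because plain UNION would add an unnecessary duplicate-elimination step:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# "grp" instead of "Group" because GROUP is a reserved word in SQLite.
con.executescript("""
CREATE TABLE grp (id INTEGER PRIMARY KEY, someData1 TEXT, someData2 TEXT);
CREATE TABLE group_member (id INTEGER PRIMARY KEY, group_id INTEGER, someData TEXT);
INSERT INTO grp VALUES (4, 'alpha', 'beta');
INSERT INTO group_member VALUES (1, 4, 'm1'), (2, 4, 'm2');
""")

rows = con.execute("""
SELECT 'G' AS kind, id, someData1, someData2 FROM grp WHERE id = 4
UNION ALL
SELECT 'M', id, someData, NULL FROM group_member WHERE group_id = 4
ORDER BY kind, id
""").fetchall()

# One round trip; the client splits the rows by the discriminator.
group = next(r for r in rows if r[0] == "G")
members = [r[2] for r in rows if r[0] == "M"]
print(group, members)
```

Note the trade-off: the column lists of both branches must be padded to the same width (the NULL above), so this only pays off when the two row shapes are similar.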