Performance concerns with Joining to ad-hock sub query - sql

When creating reports I find myself frequently joining to ad-hock sub queries like this:
SELECT * FROM Customer c
JOIN (SELECT CustomerId,ContractId FROM Contract WHERE StartDate > GetDate()) co ON co.CustomerId = c.CustomerId
Obviously this is just a qucik demo sql, but it illustrates the point.
I am wondering if there are some performance concerns with these types of queries. Are there other good alternatives? Typically the sub queries are more complicated with aggregates etc..
I know in this case it would make more sense to just join to the contact table directly, but I am just trying to illustrate the general idea without having to provide a convoluted example.
Any feedback would be appreciated :-)
Is this a common practice?

Not to be pedantic, but that's not technically an ad-hoc query, which would be defined as a query created by the end-users of the system. Instead, your example is really just a sub-query.
To address your question, if you find yourself joining to the same sub-query more than once, you should create a view, and then join to that view instead. This would allow you to simplify your code, and make any future changes to the sub-query in a single location.
If you only use these sub-queries in a single report, and there's nothing common between them across reports that could be placed in a view, then it's really just another way of writing your query. There's nothing necessarily bad about it, and it's not uncommon, but you might consider alternate approaches, such as replacing your subquery with additional joins - this may offer you a better execution plan, but that will vary on several other factors, including which RBDMS this is for.

Related

Is there a reason (performance or otherwise) to JOIN a derived table of only the needed columns in a table, and not just JOINing the table itself?

I'm refactoring and reformatting a long query, and I've noticed a pattern to the JOINs that doesn't seem to make sense. In the code I'm working on, many of the JOINs follow the same pattern as the JOIN involving the customers_to_accounts table in the example below:
SELECT
customers.name,
accounts.balance
FROM customers
INNER JOIN (
SELECT
customer_id,
account_id
FROM customers_to_accounts
) AS x ON x.customer_id = customers.customer_id
INNER JOIN accounts ON accounts.account_id = x.account_id
I'm really confused as to why the code wasn't written as:
SELECT
customers.name,
accounts.balance
FROM customers
INNER JOIN customers_to_accounts ON customers_to_accounts.customer_id = customers.customer_id
INNER JOIN accounts ON accounts.account_id = customers_to_accounts.account_id
Adding the derived table here makes it seem like the code is doing something a lot more complicated than it actually is, at first glance. The only benefit I can think of is some kind of performance boost - but I would think that creating the derived table would be detrimental to performance, if anything. If the goal was to assign a shorter alias to the table than customers_to_accounts, you could of course do that without creating a derived table.
I don't know whether the code was written by hand, or whether it was somehow generated - is this something that many ORM libraries like to do? What other reasons might there be for this?
The subquery is NOT going to give a performance boost -- at least not in any database that I'm familiar with. On the other hand, there are database that will tend to materialize the subquery, which can have an adverse effect on performance.
Why would the code be written this way? Probably just someone who overcomplicated the query, possibly due to an allergy of too many adjacent joins.
But there might be other reasons. For instance, perhaps once upon a time the logic was much more complicated -- and the junction table was only recently created. The person who originally wrote the code may have wanted to separate this particular logic.
Or, it might have been written this way to facilitate testing. It is easy to put in a limit 10 or top (2) into the subquery to validate that the code is doing what is expected.

Are joins the proper way to do cross table queries?

I have a few tables in their third normal form and I need to do some cross table queries to get the information I need.
I looked at joins but it seems like it will create a new table. Is this the proper way to perform such queries? Or should I just do nested queries ? I guess it might make sense if I have to do these queries alot? I'm really not sure how well optimize these operations are. I'm using the sequelize ORM and I'm not sure I see any clear solution.
It seems to me you are asking about joins vs subqueries. These are to some extent different. But let's start with a couple of points.
A join creates a new relvar, not a new table. A relvar is a variable standing in for the relation output by the join operation. It is transient (as opposed to a view which would be persistent).
Joins and subqueries are not always perfect substitutes. Sometimes you will need both.
Your query output is also a relvar.
The above being said, generally where possible I think joins are preferable. The major reason is that a SQL query that can be written using the structure below is far easier (as you master the language) to both understand and debug than most alternatives, and also subqueries in column lists necessarily perform badly:
SELECT [column_list]
FROM [initial_table]
[join list]
WHERE [filters]
GROUP BY [grouping list]
HAVING [post-aggregation filters]
LIMIT [limit and offset]
If your query fits the above structure then you can usually expect that specific kinds of problems will occur in logic in specific parts of the query. On the other hand, with subqueries, you have to check these independently.

SQL - Join Aggregated query or Aggregate/Sum after join?

I have a hard time figuring out what is best, or if there is difference at all,
however i have not found any material to help my understanding of this,
so i will ask this question, if not for me, then for others who might end up in the same situation.
Aggregating a sub-query before or after a join, in my specific situation the sub-query is rather slow due to fragmented data and bad normalization procedure,
I got a main query that is highly complex and a sub-query that is built from 3 small queries that is combined using union (will remove duplicate records)
i only need a single value from this sub-query (for each line), so at some point i will end up summing this value, (together with grouping the necessary control data with it so i can join)
what will have the greatest impact?
To sum sub-query before the join and then join with the aggregated version
To leave the data raw, and then sum the value together with the rest of the main query
remember there are thousands of records that will be summed for each line,
and the data is not native but built, and therefore may reside in memory,
(that is just a guess from the query optimizers perspective)
Usually I keep the group-by inside the subquery (referred as "inline view" in Oracle lingo).
This way the query is much more simple and clear.
Also I believe the execution plan is more efficient, because the data set to be aggregated is smaller and the resulting set of join keys is also smaller.
This is not a definitive answer though. If the row source that you are joining to the inline view has few matching rows, you may find that a early join reduces the aggregation effort.
The right anwer is: benchmark the queries for your particular data set.
I think in such a general way there is no right or wrong way to do it. The performance from a query like the one that you describe depends on many different factors:
what kind of join are you actually doing (and what algorithm is used in the background)
is the data to be joined small enough to fit into the memory of the machine joining it?
what query optimizations are you using, i.e. what DBMS (Oracle, MsSQL, MySQL, ...)
...
For your case I simply suggest benchmarking. I'm sorry if that does not seem like a satisfactory answer, but it is the way to go in many performance questions...
So set up a simple test using both your approaches and some test data, then pick whatever is faster.

LEFT JOIN vs. multiple SELECT statements

I am working on someone else's PHP code and seeing this pattern over and over:
(pseudocode)
result = SELECT blah1, blah2, foreign_key FROM foo WHERE key=bar
if foreign_key > 0
other_result = SELECT something FROM foo2 WHERE key=foreign_key
end
The code needs to branch if there is no related row in the other table, but couldn't this be done better by doing a LEFT JOIN in a single SELECT statement? Am I missing some performance benefit? Portability issue? Or am I just nitpicking?
This is definitely wrong. You are going over the wire a second time for no reason. DBs are very fast at their problem space. Joining tables is one of those and you'll see more of a performance degradation from the second query then the join. Unless your tablespace is hundreds of millions of records, this is not a good idea.
There is not enough information to really answer the question. I've worked on applications where decreasing the query count for one reason and increasing the query count for another reason both gave performance improvements. In the same application!
For certain combinations of table size, database configuration and how often the foreign table would be queried, doing the two queries can be much faster than a LEFT JOIN. But experience and testing is the only thing that will tell you that. MySQL with moderately large tables seems to be susceptable to this, IME. Performing three queries on one table can often be much faster than one query JOINing the three. I've seen speedups of an order of magnitude.
I'm with you - a single SQL would be better
There's a danger of treating your SQL DBMS as if it was a ISAM file system, selecting from a single table at a time. It might be cleaner to use a single SELECT with the outer join. On the other hand, detecting null in the application code and deciding what to do based on null vs non-null is also not completely clean.
One advantage of a single statement - you have fewer round trips to the server - especially if the SQL is prepared dynamically each time the other result is needed.
On average, then, a single SELECT statement is better. It gives the optimizer something to do and saves it getting too bored as well.
It seems to me that what you're saying is fairly valid - why fire off two calls to the database when one will do - unless both records are needed independently as objects(?)
Of course while it might not be as simple code wise to pull it all back in one call from the database and separate out the fields into the two separate objects, it does mean that you're only dependent on the database for one call rather than two...
This would be nicer to read as a query:
Select a.blah1, a.blah2, b.something From foo a Left Join foo2 b On a.foreign_key = b.key Where a.Key = bar;
And this way you can check you got a result in one go and have the database do all the heavy lifting in one query rather than two...
Yeah, I think it seems like what you're saying is correct.
The most likely explanation is that the developer simply doesn't know how outer joins work. This is very common, even among developers who are quite experienced in their own specialty.
There's also a widespread myth that "queries with joins are slow." So many developers blindly avoid joins at all costs, even to the extreme of running multiple queries where one would be better.
The myth of avoiding joins is like saying we should avoid writing loops in our application code, because running a line of code multiple times is obviously slower than running it once. To say nothing of the "overhead" of ++i and testing i<20 during every iteration!
You are completely correct that the single query is the way to go. To add some value to the other answers offered let me add this axiom: "Use the right tool for the job, the Database server should handle the querying work, the code should handle the procedural work."
The key idea behind this concept is that the compiler/query optimizers can do a better job if they know the entire problem domain instead of half of it.
Considering that in one database hit you have all the data you need having one single SQL statement would be better performance 99% of the time. Not sure if the connections is being creating dynamically in this case or not but if so doing so is expensive. Even if the process if reusing existing connections the DBMS is not getting optimize the queries be best way and not really making use of the relationships.
The only way I could ever see doing the calls like this for performance reasons is if the data being retrieved by the foreign key is a large amount and it is only needed in some cases. But in the sample you describe it just grabs it if it exists so this is not the case and therefore not gaining any performance.
The only "gotcha" to all of this is if the result set to work with contains a lot of joins, or even nested joins.
I've had two or three instances now where the original query I was inheriting consisted of a single query that had so a lot of joins in it and it would take the SQL a good minute to prepare the statement.
I went back into the procedure, leveraged some table variables (or temporary tables) and broke the query down into a lot of the smaller single select type statements and constructed the final result set in this manner.
This update dramatically fixed the response time, down to a few seconds, because it was easier to do a lot of simple "one shots" to retrieve the necessary data.
I'm not trying to object for objections sake here, but just to point out that the code may have been broken down to such a granular level to address a similar issue.
A single SQL query would lead in more performance as the SQL server (Which sometimes doesn't share the same location) just needs to handle one request, if you would use multiple SQL queries then you introduce a lot of overhead:
Executing more CPU instructions,
sending a second query to the server,
create a second thread on the server,
execute possible more CPU instructions
on the sever, destroy a second thread
on the server, send the second results
back.
There might be exceptional cases where the performance could be better, but for simple things you can't reach better performance by doing a bit more work.
Doing a simple two table join is usually the best way to go after this problem domain, however depending on the state of the tables and indexing, there are certain cases where it may be better to do the two select statements, but typically I haven't run into this problem until I started approaching 3-5 joined tables, not just 2.
Just make sure you have covering indexes on both tables to ensure you aren't scanning the disk for all records, that is the biggest performance hit a database gets (in my limited experience)
You should always try to minimize the number of query to the database when you can. Your example is perfect for only 1 query. This way you will be able later to cache more easily or to handle more request in same time because instead of always using 2-3 query that require a connexion, you will have only 1 each time.
There are many cases that will require different solutions and it isn't possible to explain all together.
Join scans both the tables and loops to match the first table record in second table. Simple select query will work faster in many cases as It only take cares for the primary/unique key(if exists) to search the data internally.

Transact-SQL - sub query or left-join?

I have two tables containing Tasks and Notes, and want to retrieve a list of tasks with the number of associated notes for each one. These two queries do the job:
select t.TaskId,
(select count(n.TaskNoteId) from TaskNote n where n.TaskId = t.TaskId) 'Notes'
from Task t
-- or
select t.TaskId,
count(n.TaskNoteId) 'Notes'
from Task t
left join
TaskNote n
on t.TaskId = n.TaskId
group by t.TaskId
Is there a difference between them and should I be using one over the other, or are they just two ways of doing the same job? Thanks.
On small datasets they are wash when it comes to performance. When indexed, the LOJ is a little better.
I've found on large datasets that an inner join (an inner join will work too.) will outperform the subquery by a very large factor (sorry, no numbers).
In most cases, the optimizer will treat them the same.
I tend to prefer the second, because it has less nesting, which makes it easier to read and easier to maintain. I have started to use SQL Server's common table expressions to reduce nesting as well for the same reason.
In addition, the second syntax is more flexible if there are further aggregates which may be added in the future in addition to COUNT, like MIN(some_scalar), MAX(), AVG() etc.
The subquery will be slower as it is being executed for every row in the outer query. The join will be faster as it is done once. I believe that the query optimiser will not rewrite this query plan as it can't recognize the equivalence.
Normally you would do a join and group by for this sort of count. Correlated subqueries of the sort you show are mainly of interest if they have to do some grouping or more complex predicate on a table that is not participating in another join.
If you're using SQL Server Management Studio, you can enter both versions into the Query Editor and then right-click and choose Display Estimated Execution Plan. It will give you two percentage costs relative to the batch. If they're expected to take the same time, they'll both show as 50% - in which case, choose whichever you prefer for other reasons (easier to read, easier to maintain, better fit with your coding standards etc). Otherwise, you can pick the one with the lower percentage cost relative to the batch.
You can use the same technique to look at changing any query to improve performance by comparing two versions that do the same thing.
Of course, because it's a cost relative to the batch, it doesn't mean that either query is as fast as it could be - it just tells you how they compare to each other, not to some notional optimum query to get the same results.
There's no clear-cut answer on this. You should view the SQL Plan. In terms of relational algebra, they are essentially equivalent.
I make it a point to avoid subqueries wherever possible. The join will generally be more efficient.
You can use either, and they are semantically identical. In general, the rule of thumb is to use whichever form is easier for you to read, unless performance is an issue.
If performance is an issue, then experiment with rewriting the query using the other form. Sometimes the optimizer will use an index for one form, and not the other.