Join Subquery result Performance - sql

I'm looking to optimize my SQL query and would like to know, in terms of optimization performance, if
SELECT A.this, B.another FROM A
JOIN B
ON A.this = B.that
WHERE B.another > 6
AND A.something < 3;
is better than:
SELECT A.this, B.another
FROM (SELECT this FROM A WHERE A.something < 3) AS A
JOIN (SELECT another FROM B WHERE B.another > 6) AS B
ON A.this = B.that;

The two queries will be identical when run. Try running the two queries preceded by explain analyze, and you should get the exact same query plan.
In a nutshell, Postgres will parse the query and come up with a query tree.
It'll then rewrite the query tree when appropriate, to remove redundant things such as an 1 = 1 where clause, replacing a small IN () clause or with its equivalent using ANY, or, in your case, collapsing your two subqueries into the parent query. The reason it'll do so is that it'll basically view them as select (select ...) with the inner select not being subjected to an aggregate, group by or limit clause.
Only then does it spend some time analyzing the query tree itself in search for a query plan it deems sufficiently optimal. And finally execute the query plan to return the rows you've asked for.
For the gory details you'll want to check out the Postgres manual on performance tips, as well as the explain and explain analyze syntax:
http://www.postgresql.org/docs/current/static/sql-explain.html
http://www.postgresql.org/docs/current/static/performance-tips.html
http://www.postgresql.org/docs/current/static/planner-stats-details.html
http://www.postgresql.org/docs/current/static/geqo.html
https://wiki.postgresql.org/wiki/Performance_Optimization
Also note the Postgres performance mailing list, whose archives you'll find here:
http://www.postgresql.org/list/pgsql-performance/
Studying them is well worth the effort, in the sense that doing so will give you a wealth of insights as to what is going on under the hood. Give attention to messages written by Tom Lane in particular -- he occasionally gives a first hand account of what's going on in the query rewriter and planner.

Related

How can this change be making my query slow (OR vs UNION) and can I fix it?

I've just been debugging a slow SQL query.
It's a join between 2 tables, with a WHERE clause conditioning on either a property of 1 table OR the other.
If I re-write it as a UNION then it's suddenly 2 orders of magnitude faster, even though those 2 queries produce identical outputs:
DECLARE #UserId UNIQUEIDENTIFIER = '0019813D-4379-400D-9423-56E1B98002CB'
SELECT *
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (BookingPricings.[Owner] in (#UserId) OR Bookings.MixedDealBroker in (#UserId))
--Execution time: ~4000ms
SELECT *
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (BookingPricings.[Owner] in (#UserId))
UNION
SELECT *
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (Bookings.MixedDealBroker in (#UserId))
--Execution time: ~70ms
This seems rather surprising to me! I would have expected the SQL compiler to be entirely capable of identifying that the 2nd form was equivalent and would have used that compilation approach if it were available.
Some context notes:
I've checked and IN (#UserId) vs = #UserId makes no difference.
Nor does JOIN vs LEFT JOIN.
Those tables each have 100,000s records, and the filter cuts it down to ~100.
In the slow version it seems to be reading every row of both tables.
So:
Does anyone have any ideas for how this comes about.
What (if anything) can I do to fix the performance without just re-writing the query as a series of UNIONs (not viable for a variety of reasons.)
=-=-=-=-=-=-=
Execution Plans:
This is a common limitation of SQL engines, not just in SQL Server, but also other database systems as well. The OR complicates the predicate enough that the execution plan selected isn't always ideal. This probably relates to the fact that only one index can be seeked into per instance of a table object at a time (for the most part), or in your specific case, your OR predicate is across two different tables, and other factors with how SQL engines are designed.
By using a UNION clause, you now have two instances of the Bookings table referenced, which can individually be seeked on separately in the most efficient way possible. That allows the SQL Engine to pick a better execution plan to serve you query.
This is pretty much just one of those things that are the way they are because that's just the way it is, and you need to remember the UNION clause workaround for future encounters of this kind of performance issue.
Also, in response to your comment:
I don't understand how the difference can affect the EP, given that the 2 different "phrasings" of the query are identical?
A new execution plan is generated every time one doesn't exist in the plan cache for a given query, essentially. The way the Engine determines if a plan for a query is already cached is based on the exact hashing of that query statement, so even an extra space character at the end of the query can result in a new plan being generated. Theoretically that plan can be different. So a different written query (despite being logically the same) can surely result in a different execution plan.
There are other reasons a plan can change on re-generation too, such as different data and statistics of that data, in the tables referenced in the query between executions. But these reasons don't really apply to your question above.
As already stated, the OR condition prevents the database engine from efficiently using the indexes in a single query. Because the OR condition spans tables, I doubt that the Tuning Advisor will come up with anything useful.
If you have a case where the query you have posted is part of a larger query, or the results are complex and you do not want to repeat code, you can wrap your initial query in a Common Table Expression (CTE) or a subquery and then feed the combined results into the remainder of your query. Sometimes just selecting one or more PKs in your initial query will be sufficient.
Something like:
SELECT <complex select list>
FROM (
SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (BookingPricings.[Owner] in (#UserId))
UNION
SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
FROM Bookings B
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (Bookings.MixedDealBroker in (#UserId))
) PRE
JOIN Bookings B ON B.ID = PRE.BookingsID
JOIN BookingPricings BP ON BP.ID = PRE.BookingPricingsID
<more joins>
WHERE <more conditions>
Having just the IDs in your initial select make the UNION more efficient. The UNION can also be changed to a yet more-efficient UNION ALL with careful use of additional conditions, such as AND Bookings.MixedDealBroker <> #UserId in the second part, to avoid overlapping results.

Why isn't there any difference between filter scope on those nested subqueries

Given the two sub-queries below, I have been told today that query (1) would be inefficient because it would fetch all data from the view before filtering them as the where clause is outside the nested query.
(1) from ORM (where outside)
select * from (select * from VW_LOURD) as q where q.Expr5 = 'SYNC_FLAG'
(2) from self (where inside)
select * from (select * from VW_LOURD where Expr5 = 'SYNC_FLAG') as q
I rewrote the query as shown on (2) in order to have the filter inside sub-query. I did not found any noticeable performance differences. To be sure I compared both execution plan and they are exactly the same.
I came to the conclusion that both queries would fetch similar amount of data and the same way, whether the filter is inside or outside sub-query however, I am not sure if my conclusion is 100% correct, also I am not able to explain why both queries are similar.
Both query are functionaly identical, meaning that they will always produce the same results.
I understand that your intent is to push the filter to the subquery to increase efficientcy. Your database knows better anyway - as a matter of fact, it most likely rewrites both queries as:
select * from vw_lourd where expr5 = 'SYNC_FLAG'
... which is what they actually are, really. As a consequence, both execution plans are identical.

SQL query efficiency of subquery in select vs inner join

I have a query with the following structure:
SELECT
Id,
(SELECT COUNT(1) AS [A1]
FROM [dbo].Table2 AS [Extent4]
WHERE (Table1.Id = [Extent4].Id2)) AS [C1]
FROM TPO_User
This query structure is usually used by LINQ as opposed to the following structure:
SELECT Id
FROM Table1
LEFT OUTER JOIN
(SELECT COUNT(1) AS [A1], [Extent4].Id2
FROM [dbo].Table2 AS [Extent4]
GROUP BY [Extent4].Id2) AS [C1] ON C1.Id2 = Table1.Id
When I compare them, the second query has a shorter duration. Could someone explain the exact difference in execution of such a query?
And is it worth it to ever have a subquery in your select statement instead of an inner join?
I would expect both queries to have similar performance characteristics. When doing performance comparisons, you have to be sure you do them correctly. For instance, running two queries in a row is not a good comparison, because the table data has been loaded in to memory.
To really compare the queries, you need a quiescent server and cold caches. That said, the execution plan can be a big help in understanding what is happening.
I would expect the correlated subquery to have good performance with the right indexes. For your example, you want an index on Table2(Id2).
Which has better performance in general? Well, it is simple to devise scenarios where the correlated subquery is better. For instance, if TPO_User has 1 row and Table2 has 1,000,000 rows, then the correlated subquery will be better under almost any circumstances.
In my understanding:
the FROM clause is the definition of the target.
the SELECT clause is the projection (line-by-line) definition.
So the FROM clause load the data you need in memory and after that the projection is made on each line of your select statement.
So if you do a query (or call a function...) in the SELECT clause, you say that you want this sub-job to be done for each line of your projection. Seems quite heavy ;)
A little source about the running order of an SQL request : https://www.periscopedata.com/blog/sql-query-order-of-operations
Hope this helps (and do not hesitate people to correct me if I am wrong)
(And if I remember well there is now an automatic feature to optimize queries in sql server. I think it will do the correction by itself, should it not?)

Trying to understand the difference of behavior between two inner join queries

I am trying to optimize a query that uses an inner join, and I am puzzled by the difference of performance between two very similar queries.
I hope to shed some light on this.
The tables look like this:
Aggregates:
+-recid(key)-+-avg---+
+------------+-------+
History:
+-recid(key)-+-value-+
+------------+-------+
The aim is to get, for a given key (let's assume 1234), avg and value.
I have tried two queries who seem very similar to me:
SELECT a.avg, b.value FROM aggregates a, history b
WHERE a.recid = b.recid
AND a.recid = 1234
Takes 5 seconds to run
But,
SELECT a.avg, b.value FROM aggregates a, history b
WHERE a.recid = 1234
AND b.recid = 1234
runs in less than a second.
Those two queries give the very same result. I would like to understand the huge difference in performance (the end game being a better understanding, to achieve a better performance for this query!)
First, learn to use proper explicit JOIN syntax:
SELECT a.avg, h.value
FROM aggregates a JOIN
history h
ON a.recid = h.recid
WHERE a.recid = 1234;
This doesn't affect performance, but it is the correct, modern syntax.
Assuming you have indexes on aggregates(recid) and history(recid), then the two versions should have very similar execution plans in almost any database I can think of. Those two indexes would be recommended for queries like this.
One possibility is cold versus warm cache. The first time you run a query, the data needs to be loaded into memory. That can take longer. For proper timing, you need to take this into account.
Finally, if you really want to understand the difference, then you need to look at the execution plan. Most databases provide a simple way to "explain" how the query is run.
Not sure absolutely but it could be that your second query execution plan is already cached and thus DB optimizier didn't needed to bring one. BTW, your first query should be changes as below to use ANSI style JOIN syntax
SELECT a.avg, b.value FROM aggregates a
JOIN history b ON a.recid = b.recid
WHERE a.recid = 1234
The second query might be performing a cross join then filtering on the results though it would have to be a very old version of Oracle to be that dumb. But you need to look at the query plan to find out. If they consistently exhibit different performance then I guarantee the query plans will be different.

Question about SQL Server Optmization Sub Query vs. Join

Which of these queries is more efficient, and would a modern DBMS (like SQL Server) make the changes under the hood to make them equal?
SELECT DISTINCT S#
FROM shipments
WHERE P# IN (SELECT P#
FROM parts
WHERE color = ‘Red’)
vs.
SELECT DISTINCT S#
FROM shipments, parts
WHERE shipments.P# = parts.P#
AND parts.color = ‘Red’
The best way to satiate your curiosity about this kind of thing is to fire up Management Studio and look at the Execution Plan. You'll also want to look at SQL Profiler as well. As one of my professors said: "the compiler is the final authority." A similar ethos holds when you want to know the performance profile of your queries in SQL Server - just look.
Starting here, this answer has been updated
The actual comparison might be very revealing. For example, in testing that I just did, I found that either approach might yield the fastest time depending on the nature of the query. For example, a query of the form:
Select F1, F2, F3 From Table1 Where F4='X' And UID in (Select UID From Table2)
yielded a table scan on Table1 and a mere index scan on table 2 followed by a right semi join.
A query of the form:
Select A.F1, A.F2, A.F3 From Table1 A inner join Table2 B on (A.UID=B.UID)
Where A.Gender='M'
yielded the same execution plan with one caveat: the hash match was a simple right join this time. So that is the first thing to note: the execution plans were not dramatically different.
These are not duplicate queries though since the second one may return multiple, identical records (one for each record in table 2). The surprising thing here was the performance: the subquery was far faster than the inner join. With datasets in the low thousands (thank you Red Gate SQL Data Generator) the inner join was 40 times slower. I was fairly stunned.
Ok, how about a real apples to apples? This is the matching inner join - note the extra step to winnow out the duplicates:
Select Distinct A.F1, A.F2, A.F3 From Table1 A inner join Table2 B
on (A.UID=B.UID)
Where A.Gender='M'
The execution plan does change in that there is an extra step - a sort after the inner join. Oddly enough, though, the time drops dramatically such that the two queries are almost identical (on two out of five trials the inner join is very slightly faster). Now, I can imagine the first inner join (without the "distinct") being somewhat longer just due to the fact that more data is being forwarded to the query window - but it was only twice as much (two Table2 records for every Table1 record). I have no good explanation why the first inner join was so much slower.
When you add a predicate to the search on table 2 using a subquery:
Select F1, F2, F3 From Table1 Where F4='X' And UID in
(Select UID From Table2 Where F1='Y')
then the Index Scan is changed to a Clustered Index Scan (which makes sense since the UID field has its own index in the tables I am using) and the percentage of time it takes goes up. A Stream Aggregate operation is also added. Sure enough, this does slow the query down. However, plan caching obviously kicks in as the first run of the query shows a much greater effect than subsequent runs.
When you add a predicate using the inner join, the entire plan changes pretty dramatically (left as an exercise to the reader - this post is long enough). The performance, again, is pretty much the same as that of the subquery - as long as the "Distinct" is included. Similar to the first example, omitting distinct led to a significant increase in time to completion.
One last thing: someone suggested (and your question now includes) a query of the form:
Select Distinct F1, F2, F3 From table1, table2
Where (table1.UID=table2.UID) AND table1.F4='X' And table2.F1='Y'
The execution plan for this query is similar to that of the inner join (there is a sort after the original table scan on table2 and a merge join rather than a hash join of the two tables). The performance of the two is comparable as well. I may need a larger dataset to tease out difference but, so far, I'm not seeing any advantage to this construct or the "Exists" construct.
With all of this being said - your results may vary. I came nowhere near covering the full range of queries that you may run into when I was doing the above tests. As I said at the beginning, the tools included with SQL Server are your friends: use them.
So: why choose one over the other? It really comes down to your personal preferences since there appears to be no advantage for an inner join to a subquery in terms of time complexity across the range of examples I tests.
In most classic query cases I use an inner join just because I "grew up" with them. I do use subqueries, however, in two situations. First, some queries are simply easier to understand using a subquery: the relationship between the tables is manifest. The second and most important reason, though, is that I am often in a position of dynamically generating SQL from within my application and subqueries are almost always easier to generate automatically from within code.
So, the takeaway is simply that the best solution is the one that makes your development the most efficient.
Using IN is more readable, and I recommend using ANSI-92 over ANSI-89 join syntax:
SELECT DISTINCT S#
FROM SHIPMENTS s
JOIN PARTS p ON p.p# = s.p#
AND p.color = 'Red'
Check your explain plans to see which is better, because it depends on data and table setup.
If you aren't selecting anything from the table I would use an EXISTS clause.
SELECT DISTINCT S#
FROM shipments a
WHERE EXISTS (SELECT 1
FROM parts b
WHERE b.color = ‘Red’
AND a.P# = b.P#)
This will optimize out to be the same as the second one you posted.
SELECT DISTINCT S#
FROM shipments,parts
WHERE shipments.P# = parts.P# and parts.color = ‘Red’;
Using IN forces SQL Server to not use indexing on that column, and subqueries are usually slower