I was given someone else's code that joins 9 (!) tables. I've used it with no problem in the past, but the tables have all grown over time and now I'm getting weird space errors.
I was advised to break up the join and do multiple pairwise joins instead. That should be simple, since all the joins are inner joins and everything I read says order should make no difference in that case - but I'm getting a different number of cases than I should. Without getting into my specific (very complicated) example, what are some possible reasons for this?
Thanks
To me, joining 9 tables in a single statement is a lot! "Pairwise" may have been imprecise - I mean joining two tables, then joining that result to another table, then that result to another table, and so on. Obviously they are ordered to the degree that the necessary key is available at each point.
This is not obvious to me. In fact, it is not true. Most SQL platforms (and you still have not said which one you are using) compile SQL statements into an execution plan. That plan decides when joins are executed and is free to reorder them. On many systems that run in parallel, some of the joins may even be executed at the same time.
The way to understand the "order" of the statements is to look at the execution plan.
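For example - hedging here, since you haven't said which platform you are on - most engines will show you the plan directly:

-- PostgreSQL / MySQL: prefix the statement with EXPLAIN
EXPLAIN
SELECT *
FROM table1
JOIN table2 ...;

-- SQL Server: request the estimated plan as XML
SET SHOWPLAN_XML ON;
GO
SELECT *
FROM table1
JOIN table2 ...;
GO
SET SHOWPLAN_XML OFF;
GO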
The way to control the order (on many systems) is to use a CTE. Something like this:
WITH subsetofbigtable AS
(
    SELECT *
    FROM reallybigtable
    WHERE date = '2014-01-01'
)
SELECT *
FROM subsetofbigtable
JOIN anothertable1 ...
JOIN anothertable2 ...
JOIN anothertable3 ...
JOIN anothertable4 ...
JOIN anothertable5 ...
You can also chain CTEs to "order" joins:
WITH subsetofbigtable AS
(
    SELECT *
    FROM reallybigtable
    WHERE date = '2014-01-01'
), chain1 AS
(
    SELECT *
    FROM subsetofbigtable
    JOIN anothertable1 ...
), chain2 AS
(
    SELECT *
    FROM chain1
    JOIN anothertable2 ...
)
SELECT *
FROM chain2
JOIN anothertable3 ...
JOIN anothertable4 ...
JOIN anothertable5 ...
I've just been debugging a slow SQL query.
It's a join between 2 tables, with a WHERE clause that filters on a property of one table OR a property of the other.
If I re-write it as a UNION then it's suddenly 2 orders of magnitude faster, even though those 2 queries produce identical outputs:
DECLARE @UserId UNIQUEIDENTIFIER = '0019813D-4379-400D-9423-56E1B98002CB'

SELECT *
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (BookingPricings.[Owner] IN (@UserId) OR Bookings.MixedDealBroker IN (@UserId))
--Execution time: ~4000ms

SELECT *
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (BookingPricings.[Owner] IN (@UserId))
UNION
SELECT *
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (Bookings.MixedDealBroker IN (@UserId))
--Execution time: ~70ms
This seems rather surprising to me! I would have expected the SQL compiler to be entirely capable of identifying that the two forms were equivalent, and to have used the faster approach if it were available.
Some context notes:
I've checked, and IN (@UserId) vs = @UserId makes no difference.
Nor does JOIN vs LEFT JOIN.
Those tables each have 100,000s of records, and the filter cuts that down to ~100.
In the slow version it seems to be reading every row of both tables.
So:
Does anyone have any ideas how this comes about?
What (if anything) can I do to improve the performance without just re-writing the query as a series of UNIONs (not viable for a variety of reasons)?
=-=-=-=-=-=-=
Execution plans: (omitted)
This is a common limitation of SQL engines - not just SQL Server, but other database systems as well. The OR complicates the predicate enough that the execution plan selected isn't always ideal. This probably relates to the fact that (for the most part) only one index can be seeked per instance of a table object, and in your specific case the OR predicate spans two different tables, along with other factors in how SQL engines are designed.
By using a UNION clause, you now have two instances of the Bookings table referenced, each of which can be seeked separately in the most efficient way possible. That allows the SQL engine to pick a better execution plan to serve your query.
This is pretty much just one of those optimizer quirks that is the way it is; remember the UNION workaround for future encounters with this kind of performance issue.
Also, in response to your comment:
I don't understand how the difference can affect the EP, given that the 2 different "phrasings" of the query are identical?
Essentially, a new execution plan is generated whenever one doesn't already exist in the plan cache for a given query. The way the engine determines whether a plan for a query is already cached is an exact hash of the query text, so even an extra space character at the end of the query can result in a new plan being generated. Theoretically that plan can be different, so a differently written query (despite being logically the same) can certainly result in a different execution plan.
There are other reasons a plan can change on re-generation too, such as changes to the data (and the statistics on that data) in the referenced tables between executions. But those reasons don't really apply to your question above.
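If you want to see this for yourself on SQL Server, you can compare what's in the plan cache for the two phrasings - a rough sketch (these DMVs are real, but the LIKE filter is just for illustration):

-- Each distinct query text hashes to its own cache entry, so the two
-- phrasings show up as separate rows, potentially with different plan hashes.
SELECT st.text, qs.query_hash, qs.query_plan_hash
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE '%Bookings%';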
As already stated, the OR condition prevents the database engine from efficiently using the indexes in a single query. Because the OR condition spans tables, I doubt that the Tuning Advisor will come up with anything useful.
If you have a case where the query you have posted is part of a larger query, or the results are complex and you do not want to repeat code, you can wrap your initial query in a Common Table Expression (CTE) or a subquery and then feed the combined results into the remainder of your query. Sometimes just selecting one or more PKs in your initial query will be sufficient.
Something like:
SELECT <complex select list>
FROM (
    SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
    FROM Bookings
    LEFT JOIN BookingPricings ON Booking = Bookings.ID
    WHERE (BookingPricings.[Owner] IN (@UserId))
    UNION
    SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
    FROM Bookings
    LEFT JOIN BookingPricings ON Booking = Bookings.ID
    WHERE (Bookings.MixedDealBroker IN (@UserId))
) PRE
JOIN Bookings B ON B.ID = PRE.BookingsID
JOIN BookingPricings BP ON BP.ID = PRE.BookingPricingsID
<more joins>
WHERE <more conditions>
Having just the IDs in your initial select makes the UNION more efficient. The UNION can also be changed to a yet more efficient UNION ALL with careful use of additional conditions, such as AND Bookings.MixedDealBroker <> @UserId in the second part, to avoid overlapping results.
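A sketch of that UNION ALL variant (the disjointness predicate is an assumption about your data - verify the two branches really can't overlap before relying on it):

SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE BookingPricings.[Owner] = @UserId
UNION ALL
SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE Bookings.MixedDealBroker = @UserId
  -- exclude rows the first branch already returned (NULL-safe)
  AND (BookingPricings.[Owner] <> @UserId OR BookingPricings.[Owner] IS NULL)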
I have a query like this:
SELECT *
FROM my_table_1
INNER JOIN my_table_2
    USING (column_name)
WHERE my_table_1.date = <my_date>
my_table_1 has millions of rows, but I only want the entries with date = <my_date>.
How does PostgreSQL process this query? Is it worth doing the inner join against only the part of my_table_1 I want, like this:
SELECT *
FROM (
    SELECT *
    FROM my_table_1
    WHERE my_table_1.date = <my_date>
) A
INNER JOIN my_table_2
    USING (column_name)
Or is the query processed in such a way that where I put the WHERE clause doesn't matter after all?
No, there is no reason to do so.
To the compiler, these two queries will look exactly the same after optimization. It will use a technique called "predicate pushdown", along with other techniques such as switching the join order, to transform the query into its most efficient form. Good indexing and up-to-date statistics are what really help here.
In very rare circumstances, where the compiler has estimated badly, it can be necessary to force the order of joins and predicates. But this is not the way to do it, as the compiler sees straight through it.
You can see the execution plan the compiler has chosen with EXPLAIN (or EXPLAIN ANALYZE, which also runs the query).
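For example, with a concrete date standing in for <my_date>:

EXPLAIN ANALYZE
SELECT *
FROM my_table_1
INNER JOIN my_table_2 USING (column_name)
WHERE my_table_1.date = '2014-01-01';

Both phrasings of the query should produce the same plan, with the date filter applied directly to the scan of my_table_1.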
I am writing a query that involves using several subqueries using a WITH clause.
i.e.
WITH z AS
(
    WITH x AS
    (
        SELECT col1, col2
        FROM foo
        LEFT JOIN bar ON foo.col1 = bar.col1
    )
    SELECT foo, bar
    FROM x
    INNER JOIN table2
        ON x.col2 = table2.col2
)
SELECT *
FROM z
LEFT JOIN table3
    ON z.col1 = table3.col2
In reality, there are a few more subqueries and a lot more columns. Are there any performance issues with using the SELECT * on the subquery table (in this case, x or z)?
I want to avoid re-typing the same column names multiple times within one query but also need to optimize performance.
The answer depends on the database. CTEs can be handled by:
materializing an intermediate table and storing the results
merging the CTE code with the rest of the query
combining these two approaches
In the first approach, additional columns could have a small effect on performance. In the second, there should be no effect.
That said, what usually dominates query performance is the work done for joins and group bys. Assuming the columns are not unreasonably large, I wouldn't worry about the performance implications of using SELECT * in a CTE.
I would question how you write the CTEs. There is no need for nested CTEs, because they can be defined sequentially.
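For example, the nested query above could be written sequentially like this (keeping the column names from the question's sketch):

WITH x AS
(
    SELECT col1, col2
    FROM foo
    LEFT JOIN bar ON foo.col1 = bar.col1
), z AS
(
    SELECT x.col1, x.col2
    FROM x
    INNER JOIN table2 ON x.col2 = table2.col2
)
SELECT *
FROM z
LEFT JOIN table3 ON z.col1 = table3.col2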
I'm a few weeks into learning SQL, and just finished a problem on a homework assignment about using IN and NOT IN. I managed to get the correct answer, however, I used the EXCEPT clause, which we aren't really allowed to use yet. From what I can tell, EXCEPT and NOT IN are very similar statements in SQL, but I can't understand what the difference is. Here's the general format of my query:
SELECT *
FROM table
WHERE x IN (
    SELECT x
    /* ... some subquery */
    EXCEPT
    SELECT x
    /* ... some other subquery */
)
Is there a way to rewrite this general query without using the EXCEPT statement? How do EXCEPT and NOT IN differ from each other in general?
Edit: This other post seems to have some good information, but it seems to focus on EXISTS rather than IN - those have different purposes, don't they?
This might help you to understand the difference between EXCEPT and NOT IN.
The EXCEPT operator returns all distinct rows from the left-hand side query that do not exist in the right-hand side query.
NOT IN, on the other hand, will return all rows from the left-hand side that are not present in the right-hand side, but it will not remove duplicate rows from the result.
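A tiny illustration of the duplicate handling (hypothetical data):

-- Suppose t1.x holds the values 1, 1, 2 and t2.x holds the value 2.

SELECT x FROM t1
EXCEPT
SELECT x FROM t2;
-- one row: 1 (EXCEPT removes duplicates)

SELECT x FROM t1
WHERE x NOT IN (SELECT x FROM t2);
-- two rows: 1, 1 (duplicates are kept)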
SQL is a declarative language. As a result, there are many constructs that translate to the same relational algebra tree / execution plan. For example, you can specify an inner join using INNER JOIN, CROSS JOIN + WHERE, a comma-separated FROM list + WHERE, or LEFT/RIGHT JOIN + WHERE.
EXCEPT/INTERSECT are set operators, while IN/NOT IN are predicates. Consider your example:
SELECT *
FROM table
WHERE x IN (
    SELECT x
    /* ... some subquery */
    EXCEPT
    SELECT x
    /* ... some other subquery */
)
This can be written using IN / NOT IN like:
SELECT *
FROM table AS t1
WHERE t1.x IN (
    SELECT t2.x
    FROM t2
    WHERE t2.x NOT IN (
        SELECT t3.x
        FROM t3
    )
)
Now, in intent these mean the same thing, but semantically they may not produce the same results. For example, if column x is nullable, NOT IN will produce different results: if the inner subquery returns even one NULL, x NOT IN (...) evaluates to UNKNOWN for every row, and no rows come back at all.
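A quick demonstration of that NULL trap (hypothetical tables):

CREATE TABLE t_left (x INT);
CREATE TABLE t_right (x INT);
INSERT INTO t_left VALUES (1), (2);
INSERT INTO t_right VALUES (2), (NULL);

-- The NULL in t_right makes "1 NOT IN (2, NULL)" evaluate to UNKNOWN,
-- so this returns no rows at all.
SELECT x FROM t_left
WHERE x NOT IN (SELECT x FROM t_right);

-- EXCEPT treats NULLs as not distinct when comparing rows,
-- so this still returns the row with x = 1.
SELECT x FROM t_left
EXCEPT
SELECT x FROM t_right;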
Lastly, for the simple case: if you look at the execution plans for these two queries in SQL Server Management Studio, you will find that they use similar plans / join strategies. This is because SQL Server translates the syntax into a relational tree that is then optimized to produce an execution plan; after this translation, the final plan may be identical for differently written queries.
I am currently working with a query in MSSQL that looks like:
SELECT
...
FROM
(SELECT
...
)T1
JOIN
(SELECT
...
)T2
GROUP BY
...
The inner selects are relatively fast, but the outer select aggregates the inner selects and takes an incredibly long time to execute, often timing out. Removing the group by makes it run somewhat faster and changing the join to a LEFT OUTER JOIN speeds things up a bit as well.
Why would doing a group by on a select which aggregates two inner selects cause the query to run so slow? Why does an INNER JOIN run slower than a LEFT OUTER JOIN? What can I do to troubleshoot this further?
EDIT: What makes this even more perplexing is that the two inner queries are date-limited, and the overall query only runs slow when looking at date ranges between the start of July and any other day in July; if the date range is anything from before July 1 through today, it runs fine.
Without some more detail of your query it's impossible to offer any hints as to what may speed it up. A possible guess is that the two inner queries are blocking access to any indexes which might have been used to perform the join, resulting in large scans, but there are probably many other possible reasons.
To check where the time is spent in the query, look at the execution plan; there is a detailed explanation here:
http://www.sql-server-performance.com/tips/query_execution_plan_analysis_p1.aspx
The basic rundown is: run the query, display the execution plan, then look for any large percentages - they are what is slowing your query down.
Try rewriting your query without the nested SELECTs, which are rarely necessary. When using nested SELECTs - except in trivial cases - the inner SELECT result sets are not indexed, which makes joining them to anything slow.
As Tetraneutron said, post details of your query -- we may help you rewrite it in a straight-through way.
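One way to act on that advice without a full rewrite is to materialize one of the inner SELECTs into an indexed temp table first - a sketch with hypothetical names, since the real query is elided:

-- Materialize the first inner SELECT, then index the join column.
SELECT s.joinfield, s.amount
INTO #T1
FROM SomeInnerQuerySource s;

CREATE INDEX IX_T1_joinfield ON #T1 (joinfield);

-- The join can now seek the temp table's index instead of scanning.
SELECT t1.joinfield, SUM(t1.amount) AS total
FROM #T1 t1
JOIN OtherTable t2 ON t1.joinfield = t2.joinfield
GROUP BY t1.joinfield;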
Have you given a join predicate? I.e. JOIN tableB ON tableA.ColA = tableB.ColB. If you don't give a predicate, then SQL may be forced to use nested loops, so if you have a lot of rows in that range it would explain the slowdown.
Have a look at the plan in SQL Server Management Studio if you have MS SQL Server to play with.
After your T2 statement, add a join condition: ON t1.joinfield = t2.joinfield.
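Applied to the skeleton from the question (joinfield is a stand-in for whatever column actually relates the two subqueries):

SELECT
    ...
FROM
    (SELECT
     ...
    ) T1
JOIN
    (SELECT
     ...
    ) T2
    ON T1.joinfield = T2.joinfield  -- the previously missing predicate
GROUP BY
    ...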
The issue was with fragmented data. After the data was defragmented the query started running within reasonable time constraints.
A JOIN with no predicate = Cartesian product. Every row from one input is paired with every row from the other, in all permutations. It is slow because the inner queries each run quickly on their own, but once they hit the unconstrained join the intermediate result explodes and becomes far more expensive to manage. This happens at the outer SELECT statement.
Have a look at INNER JOINs as Tetraneutron recommended.