WHERE and JOIN clause order and performance - SQL

I have a query like this:
SELECT *
FROM my_table_1
INNER JOIN my_table_2 USING (column_name)
WHERE my_table_1.date = <my_date>
my_table_1 has millions of rows, but I only want the entries with date = <my_date>.
How does PostgreSQL process this query? Is it worth doing the inner join on only the part of my_table_1 I want, like this:
SELECT *
FROM (
    SELECT *
    FROM my_table_1
    WHERE my_table_1.date = <my_date>
) A
INNER JOIN my_table_2 USING (column_name)
Or is the query processed in such a way that where I put the WHERE clause doesn't matter after all?

No, there is no reason to do so.
To the optimizer, these two queries look exactly the same after rewriting. It will use a technique called "predicate pushdown", along with others such as join reordering, to transform the query into its most efficient form. Good indexing and up-to-date statistics help it do this well.
In very rare circumstances, where the optimizer has costed things incorrectly, it can be necessary to force the order of joins and predicates. But a derived table is not the way to do it, as the optimizer sees straight through it.
You can see the execution plan the optimizer has chosen with EXPLAIN (PostgreSQL's equivalent of other databases' EXPLAIN PLAN).
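For example, a quick sanity check might look like this (a minimal sketch; the date literal stands in for <my_date>):
EXPLAIN ANALYZE
SELECT *
FROM my_table_1
INNER JOIN my_table_2 USING (column_name)
WHERE my_table_1.date = DATE '2014-01-01';  -- illustrative literal in place of <my_date>
Running the same command against the subquery form should show an identical plan, with the date filter applied while scanning my_table_1.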

Related

SQL query efficiency of subquery in select vs inner join

I have a query with the following structure:
SELECT
    Id,
    (SELECT COUNT(1) AS [A1]
     FROM [dbo].Table2 AS [Extent4]
     WHERE Table1.Id = [Extent4].Id2) AS [C1]
FROM TPO_User
This query structure is typically generated by LINQ, as opposed to the following structure:
SELECT Id
FROM Table1
LEFT OUTER JOIN (
    SELECT COUNT(1) AS [A1], [Extent4].Id2
    FROM [dbo].Table2 AS [Extent4]
    GROUP BY [Extent4].Id2
) AS [C1] ON C1.Id2 = Table1.Id
When I compare them, the second query has a shorter duration. Could someone explain the exact difference in how such queries are executed?
And is it ever worth having a subquery in your SELECT statement instead of an inner join?
I would expect both queries to have similar performance characteristics. When doing performance comparisons, you have to be sure you do them correctly. For instance, running two queries in a row is not a good comparison, because the table data will already have been loaded into memory for the second one.
To really compare the queries, you need a quiescent server and cold caches. That said, the execution plan can be a big help in understanding what is happening.
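On SQL Server, for instance, you might cold the caches between test runs like this (a sketch for a dev/test box only, since these commands flush server-wide caches):
CHECKPOINT;              -- flush dirty pages to disk first
DBCC DROPCLEANBUFFERS;   -- empty the data (buffer) cache
DBCC FREEPROCCACHE;      -- empty the plan cache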
I would expect the correlated subquery to have good performance with the right indexes. For your example, you want an index on Table2(Id2).
Which has better performance in general? Well, it is simple to devise scenarios where the correlated subquery is better. For instance, if TPO_User has 1 row and Table2 has 1,000,000 rows, then the correlated subquery will be better under almost any circumstances.
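A sketch of the index suggested above (the index name is illustrative):
CREATE INDEX IX_Table2_Id2 ON [dbo].Table2 (Id2);
With it in place, each outer row's COUNT(1) becomes a quick index seek rather than a scan of Table2.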
In my understanding:
the FROM clause defines the target set of rows.
the SELECT clause is the projection (row-by-row) definition.
So the FROM clause loads the data you need into memory, and after that the projection is applied to each row of your SELECT statement.
So if you put a query (or a function call...) in the SELECT clause, you are saying that you want this sub-job to be done for each row of your projection. Seems quite heavy ;)
A little source about the order in which an SQL query is processed: https://www.periscopedata.com/blog/sql-query-order-of-operations
Hope this helps (and please do not hesitate to correct me if I am wrong).
(And if I remember correctly, SQL Server now has an automatic query tuning feature. I would think it makes this kind of correction by itself, should it not?)

Multiple inner joins in a single statement versus pairwise joins

I was given someone else's code that joins 9 (!) tables - I've used it with no problem in the past, but now all the tables have grown over time so that I'm getting weird space errors.
I got advice to break up the joins and do multiple pairwise joins. That should be simple, since all the joins are inner and everything I read says order should make no difference in this case - but I'm getting a different number of cases than I should. Without giving a specific (very complicated) example, what are some possible reasons for this?
Thanks
To me, joining 9 tables in a single statement is a lot! Pairwise may have been imprecise - I mean joining two tables, then joining that result to another table, then that result to another table. Obviously they are ordered to the degree that the necessary key is available at each point.
This is not obvious to me. In fact, it is not true. MOST SQL platforms (and you still have not said which one you are using) compile SQL statements and form an execution plan. That plan will optimize and reorder when joins are executed. On many systems that run in parallel, some joins will even be executed at the same time.
The way to understand the "order" of the statements is to look at the execution plan.
The way to control the order (on many systems) is to use a CTE. Something like this:
WITH subsetofbigtable AS
(
    SELECT *
    FROM reallybigtable
    WHERE date = '2014-01-01'
)
SELECT *
FROM subsetofbigtable
JOIN anothertable1 ...
JOIN anothertable2 ...
JOIN anothertable3 ...
JOIN anothertable4 ...
JOIN anothertable5 ...
You can also chain CTEs to "order" joins:
WITH subsetofbigtable AS
(
    SELECT *
    FROM reallybigtable
    WHERE date = '2014-01-01'
), chain1 AS
(
    SELECT *
    FROM subsetofbigtable
    JOIN anothertable1 ...
), chain2 AS
(
    SELECT *
    FROM chain1
    JOIN anothertable2 ...
)
SELECT *
FROM chain2
JOIN anothertable3 ...
JOIN anothertable4 ...
JOIN anothertable5 ...

Filter table before inner join condition

There's a similar question here, but mine is slightly different:
select *
from process a
inner join subprocess b
    on a.id = b.id and a.field = true and b.field = true
So, when using inner join, which operation comes first: the join or the a.field=true condition?
As the two tables are very big, my goal is to filter the process table first, and only then join the filtered rows to the subprocess table.
Which is the best approach?
First things first:
which operation comes first: the join or the a.field=true condition?
Your INNER JOIN includes this (a.field=true) as part of the condition for the join. So it will prevent rows from being added during the JOIN process.
A part of an RDBMS is the "query optimizer" which will typically find the most efficient way to execute the query - there is no guarantee on the order of evaluation for the INNER JOIN conditions.
Lastly, I would recommend rewriting your query this way:
SELECT *
FROM process AS a
INNER JOIN subprocess AS b ON a.id = b.id
WHERE a.field = true AND b.field = true
This will effectively do the same thing as your original query, but it is widely seen as much more readable by SQL programmers. The optimizer can rearrange INNER JOIN and WHERE predicates as it sees fit.
You are thinking about SQL in terms of a procedural language, which it is not. SQL is a declarative language, and the engine is free to pick the execution plan that works best for a given situation. So there is no way to predict whether a join or a where will be executed first.
A better way to think about SQL is in terms of optimizing queries: things like making sure your joins and WHERE predicates are covered by indexes. Also, at least in MS SQL Server, you can preview an estimated or actual execution plan. There is nothing stopping you from doing that and seeing for yourself.
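For example, in SQL Server you can ask for the estimated plan without actually running the query (a sketch; field is assumed here to be a bit column, hence = 1 rather than = true):
SET SHOWPLAN_XML ON;
GO
-- the query under test
SELECT *
FROM process AS a
INNER JOIN subprocess AS b ON a.id = b.id
WHERE a.field = 1 AND b.field = 1;
GO
SET SHOWPLAN_XML OFF;
GO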

WHERE and JOIN order of operation

My question is similar to this SQL order of operations but with a little twist, so I think it's fair to ask.
I'm using Teradata, and I have 2 tables: table1 and table2.
table1 has only an id column.
table2 has the following columns: id, val
I might be wrong but I think these two statements give the same results.
Statement 1.
SELECT table1.id, table2.val
FROM table1
INNER JOIN table2
    ON table1.id = table2.id
WHERE table2.val < 100
Statement 2.
SELECT table1.id, table3.val
FROM table1
INNER JOIN (
    SELECT *
    FROM table2
    WHERE val < 100
) table3
    ON table1.id = table3.id
My question is, will the query optimizer be smart enough to
- execute the WHERE clause first and the JOIN later in Statement 1?
- know that table3 isn't actually needed in Statement 2?
I'm pretty new to SQL, so please educate me if I'm misunderstanding anything.
This depends on many things (table size, indexes, key distribution, etc.), so you should just check the execution plan. Each database has its own way of showing it:
MySQL: EXPLAIN
SQL Server: SET SHOWPLAN_ALL (Transact-SQL)
Oracle: EXPLAIN PLAN
Teradata: EXPLAIN, plus Visual Explain and XML plan logging for capturing and comparing plans
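In Teradata, for example, you simply prefix the statement (a sketch using the tables above):
EXPLAIN
SELECT table1.id, table2.val
FROM table1
INNER JOIN table2 ON table1.id = table2.id
WHERE table2.val < 100;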
Depending on the availability of statistics and indexes for the tables in question, the query rewrite mechanism in the optimizer may or may not opt to scan Table2 for records where val < 100 before scanning Table1.
In certain situations, based on data demographics, joins, indexing and statistics, you may find that the optimizer is not eliminating records in the query plan when you feel that it should, even if you have a derived table such as the one in your example. You can force the optimizer to process a derived table by simply placing a GROUP BY in it. The optimizer is then obligated to resolve the GROUP BY aggregate before it can consider resolving the join between the two tables in your example.
SELECT table1.id, table3.val
FROM table1
INNER JOIN (
    SELECT table2.id, table2.val
    FROM table2
    WHERE val < 100
    GROUP BY 1, 2
) table3
    ON table1.id = table3.id
This is not to say that your standard approach should be to use this throughout your code. It is typically one of my last resorts, for when a query plan simply doesn't eliminate extraneous records early enough and too much data ends up being scanned and carried around through the various spool files. It is simply a technique you can keep in your toolkit for when you encounter such a situation.
The query rewrite mechanism is continually being updated from one release to the next and the details about how it works can be found in the SQL Transaction Processing Manual for Teradata 13.0.
Unless I'm missing something, why do you even need table1? Just query table2:
SELECT id, val
FROM table2
WHERE val < 100
Or are you using the rows in table1 as a filter? I.e., does table1 contain only a subset of the ids in table2?
If so, then this will work as well ...
SELECT id, val
FROM table2
WHERE val < 100
  AND id IN (SELECT id FROM table1)
But to answer your question: yes, the query optimizer should be intelligent enough to figure out the best order in which to execute the steps necessary to translate your logical instructions into a physical result. It uses the stored statistics that the database maintains on each table to determine what to do (what type of join logic to use, for example), as well as what order to perform the operations in, so as to minimize disk I/Os and processing costs.
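In Teradata those statistics are collected explicitly; a sketch of keeping them fresh for the columns involved here:
COLLECT STATISTICS ON table1 COLUMN (id);
COLLECT STATISTICS ON table2 COLUMN (id);
COLLECT STATISTICS ON table2 COLUMN (val);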
Q1. execute the WHERE clause first then JOIN later in Statement 1
The thing is, if you switch the order of the inner join, i.e. table2 INNER JOIN table1, then I guess the WHERE clause can be processed before the JOIN operation, during the preparation phase. However, even if you don't change the original query, the optimizer should be able to switch the order itself: if it thinks the join operation will be too expensive when fetching whole rows, it will apply the WHERE first. Just my guess.
Q2. know that table 3 isn't actually needed in Statement 2
Teradata will interpret your second query in such a way that the derived table is treated as necessary, so it will keep the operations involving table3 in the plan.

Is derived table executed once or three times?

Every time you make use of a derived table, that query is going to be executed. When using a CTE, that result set is pulled back once and only once within a single query.
Does the quote suggest that the following query will cause the derived table to be executed three times (once for each aggregate function call):
SELECT
    AVG(OrdersPlaced), MAX(OrdersPlaced), MIN(OrdersPlaced)
FROM (
    SELECT
        v.VendorID,
        v.[Name] AS VendorName,
        COUNT(*) AS OrdersPlaced
    FROM Purchasing.PurchaseOrderHeader AS poh
    INNER JOIN Purchasing.Vendor AS v ON poh.VendorID = v.VendorID
    GROUP BY v.VendorID, v.[Name]
) AS x
Thanks
No, that should be one pass; take a look at the execution plan.
Here is an example where something will run for every row in table2:
select *, (select COUNT(*) from table1 t1 where t1.id <= t2.id) as Bla
from table2 t2
Stuff like this, with a running count, will fire once for each row in table2.
A CTE and a nested (uncorrelated) subquery will generally produce no different execution plan. Whether a CTE or a subquery is used has never had an effect on whether my intermediate queries get spooled.
With regard to the Tony Rogerson link: the explicit temp table performs better than the self-join to the CTE because it is indexed better. Many times, when you go beyond declarative SQL and start to anticipate the work process of the engine, you can get better results.
Sometimes the benefit of a simpler, more maintainable query with many layered CTEs outweighs the performance benefits of a complex multi-temp-table process. A CTE-based approach is a single SQL statement, which cannot be as quietly broken by a step being accidentally commented out or a schema change.
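A sketch of the temp-table alternative being described, reusing the derived table from the question (the temp table and index names are illustrative):
SELECT v.VendorID, v.[Name] AS VendorName, COUNT(*) AS OrdersPlaced
INTO #VendorOrders
FROM Purchasing.PurchaseOrderHeader AS poh
INNER JOIN Purchasing.Vendor AS v ON poh.VendorID = v.VendorID
GROUP BY v.VendorID, v.[Name];

CREATE INDEX IX_VendorOrders ON #VendorOrders (OrdersPlaced);

SELECT AVG(OrdersPlaced), MAX(OrdersPlaced), MIN(OrdersPlaced)
FROM #VendorOrders;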
Probably not, but it may spool the derived results so that it only needs to compute them once.
In this case, there should be no difference between a CTE and derived table.
Where is the quote from?