Join table vs Join Subquery - which is more efficient?

I was recently going through a lot of SQL code in which the JOIN sections were filled with complex subqueries, and started wondering whether there is any benefit to joining a subquery with a limited column selection versus joining the entire table and selecting only the necessary columns.
To illustrate:
Let's say we have 2 tables, Table1 and Table2, each with columns (PK, FK, a, b, c, d, e, f).
I want to join Table1 with Table2, but retrieve only a few fields from Table2.
Which is more efficient, what are the benefits of each?
SELECT
Table1.*,
Table2.a,
Table2.b
FROM Table1
LEFT JOIN Table2 ON Table1.PK = Table2.FK
OR
SELECT
Table1.*,
Table2sub.*
FROM Table1
LEFT JOIN (SELECT FK, a, b FROM Table2) AS Table2sub ON Table1.PK = Table2sub.FK

SQL is a descriptive language, not a procedural language. That is, a SQL query describes what the result set looks like, not how the result is produced.
In fact, what the engine runs is called a directed acyclic graph (DAG) -- and that looks nothing like a query. The SQL engine first parses the query, then compiles it, then optimizes it to produce the DAG.
SQL Server has a good optimizer. It is not going to be confused by subqueries. Some SQL compilers are not quite as smart and will materialize the subquery -- which could have a big impact on performance.
If you look at the execution plans, you will see that they are the same in this case.
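If you want to verify this yourself, a minimal sketch for SQL Server (reusing the question's tables) is to turn on I/O statistics and run both forms back to back, then compare the reads and the graphical plans:
SET STATISTICS IO ON;
-- form 1: join the whole table, select only the needed columns
SELECT Table1.*, Table2.a, Table2.b
FROM Table1
LEFT JOIN Table2 ON Table1.PK = Table2.FK;
-- form 2: join a column-limited subquery
SELECT Table1.*, Table2sub.*
FROM Table1
LEFT JOIN (SELECT FK, a, b FROM Table2) AS Table2sub ON Table1.PK = Table2sub.FK;
Matching logical reads and matching plans mean the optimizer has flattened the subquery.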

Related

Is there ALWAYS a "base table" in any database query?

Ok this is slightly theoretical so it would be great if an unbiased database enthusiast gave an opinion.
For the sake of argument let's agree that there is such a concept as a "base table" w.r.t. a query, where one table is driving the majority of information in the result set. Imagine a query with three relations - TableA, TableB, and TableC.
Let's say TableA has a cardinality of 1 million records, TableB has 500, and TableC has 10,000.
Let's say the query is like so -
SELECT A.Col1
, A.Col2
, A.Col3
, A.Col4
, A.Col5
FROM TableA A
LEFT JOIN TableB B ON B.ID = A.TableBID
LEFT JOIN TableC C ON C.ID = A.TableCID
Ok, clearly TableA is the base relation above. It is the biggest table, it is driving the result set by being joined "from", and visually its columns are even on the "left side" of the result set. (The left-side thing actually was a criterion for my colleague.)
Now, let's assume that TableA has 1 million rows again, TableB is a "junction" or "bridge" table and has like 500,000 rows and TableC has 1,000,000 rows. So assume the query is just an outer join to get all columns in TableA and TableC where a relationship exists like below...
SELECT A.*
, C.*
FROM TableC C
FULL OUTER JOIN TableB B ON C.ID = B.TableCID
FULL OUTER JOIN TableA A ON A.ID = B.TableAID
Ok so given the last query, can anyone tell me what the "base relation" is? I don't think there is one, but was hoping for another database person's opinion.
The term "base table" has a definition and it has nothing to do with what you describe. A "base table" is pretty much just a "table". That is, it is not a view, it is not a table valued function, it is not the result of a query. It is what gets stored in the database as an explicit table.
What you seem to be grasping for is more related to optimization strategies. I have used similar terminology -- in the context of optimization -- to describe the "driving table" being accessed by the optimizer. The purpose of this is to distinguish between different execution plans.
Consider the query:
SELECT * FROM t1 JOIN t2 USING (col)
There are multiple different execution plans. Here are some methods and what might be considered the "driving table" (if any) for them:
for each row in t1
for each row in t2
compare col
--> t1 is the "driving table"
for each row in t2
for each row in t1
compare col
--> t2 is the "driving table"
for each row in t1
look up t2 value using index on t2(col)
--> t1 is the "driving table"
sort t1 by col
sort t2 by col
compare the rows in the two sorted sets
--> no "driving table"
hash t1 by col
hash t2 by col
compare the hash maps
--> no "driving table"
In other words, the "driving" table has little to do with the query structure. It is based on the optimization strategies used for the query. That said, left joins and right joins limit the optimization paths. So, in a nested loop or index-lookup situation, the "first" (or "last") table would be the driving table.
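For example, on SQL Server you can pin the physical join strategy with a hint and see in the plan which input gets picked as the outer (driving) table -- a sketch, assuming tables t1 and t2 with a shared column col:
SELECT t1.*, t2.*
FROM t1
INNER JOIN t2 ON t1.col = t2.col
OPTION (LOOP JOIN); -- force nested loops; the plan's outer input is the "driving table"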
The concept of a "driving" table is really an assumption about how the DBMS is expected to execute a query internally. A rule-based query optimizer, in the absence of any index-related preferences, may treat the ordering of tables and joins in a query as significant when it comes to choosing the execution plan. Under a cost-based optimizer, there is no significance to the order of tables and joins so nothing about the structure of the query itself will tell you which table gets read first or in what order the join conditions get evaluated.
When conceptualizing a query it may help to have a mental image of one table being the starting point for the query but I think the answer to the question here must be no. Logically speaking there is no such thing as a driving table.
A base table is a given named table-valued variable--a database table. That's it. In a query expression its name is a leaf expression denoting its value. "Given table variable" would be more descriptive. A query can use literal notation for a table. It would be reasonable for a given named table-valued constant to also be called "base". It's nothing about some kind of "main" table.
The relational model is founded on a table holding the rows that make a true proposition (statement) from its (characteristic) predicate (statement template parameterized by column names). We give base table rows & get query expression rows.
A query expression that is a base table name comes with a predicate given by the designer.
/* (liker, liked) rows where [liker] likes [liked] */
/* (liker, liked) rows where Likes(liker, liked) */
SELECT * FROM Likes
A query expression that is a table literal has a certain predicate in terms of columns being equal to values.
/* (person) rows where
person = 'Bob'
*/
SELECT * FROM (VALUES ('Bob')) dummy (person)
Otherwise a query expression has a predicate built from its constituent table expression predicates according to its relation operator.
Every algebra operator corresponds to a certain logic operator.
NATURAL JOIN & AND
RESTRICT_theta & AND_theta
UNION & OR
MINUS & AND NOT
PROJECT_{all but C} & EXISTS C
etc
/* (person) rows where
(FOR SOME liked, Likes(person, liked))
OR person = 'Bob'
*/
SELECT liker AS person
FROM Likes
UNION
VALUES ('Bob')
/* (person, liked) rows where
FOR SOME [values for] l1.*, l2.*,
person = l1.liker AND liked = l2.liked
AND Likes(l1.liker, l1.liked)
AND Likes(l2.liker, l2.liked)
AND l1.liked = l2.liker
AND person = 'Bob'
AND NOT Likes(l1.liked, 'Ed')
*/
SELECT l1.liker AS person, l2.liked
FROM Likes l1 INNER JOIN Likes l2
ON l1.liked = l2.liker
WHERE l1.liker = 'Bob'
AND NOT (l1.liked, 'Ed') IN (SELECT * FROM Likes)
There's no difference in how a base, literal or operator-call query expression is used when determining a containing query expression's predicate.
Is there any rule of thumb to construct SQL query from a human-readable description?
Relational algebra - recode column values
Let me suggest a perspective where the base table is the first one in the FROM clause (i.e. not a JOINed table). In the case where a statement can be equally written with either one table or another as the base table, we would say that there are two (or more) base tables.
In your first query, the base table is TableA. If you invert TableA and TableC in the query, you are not guaranteed to get the same results, because of the LEFT JOIN.
In the second query, as you are using FULL JOINs, all 3 tables could be inverted without changing the result, so this is indeed a use case of a query where all tables are base tables.
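To see why the LEFT JOIN makes the choice of base table matter, compare these two sketches built on the question's tables:
SELECT A.Col1
FROM TableA A
LEFT JOIN TableC C ON C.ID = A.TableCID
-- every TableA row is kept; unmatched rows just get NULLs from TableC
SELECT A.Col1
FROM TableC C
LEFT JOIN TableA A ON C.ID = A.TableCID
-- now every TableC row is kept instead, and TableA rows without a match disappear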

Improving SQL cartesian product performance by reducing columns

I have an SQL query which uses a Cartesian product on a large table. However, I only need one column from one of the tables. Would it actually perform better if I selected only that one column before applying the Cartesian product?
So, in other words, would this:
SELECT A.Id, B.Id
FROM (SELECT Id FROM Table1) AS A , Table2 AS B;
be faster than this, given that Table1 has more columns than just Id?
SELECT A.Id, B.Id
FROM Table1 AS A , Table2 AS B;
Or does the number of columns not matter?
On most databases, the two forms would have the same execution plan.
The first could be worse on a database (such as MySQL) that materializes subqueries.
The second should be better with indexes on the two tables, table1(id) and table2(id): the index would be used to get the value rather than reading the base data.
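A minimal sketch of those indexes (the names are made up), which would let the engine read the narrow index instead of the wider base rows:
CREATE INDEX ix_table1_id ON Table1(Id);
CREATE INDEX ix_table2_id ON Table2(Id);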
Try it out yourself! But generally speaking, having a subquery reduce the number of rows will help improve the performance. Your query should, however, be written differently:
select a.id aid, b.id bid
from (select id from table1 where id = <specific_id>) a, table2 b

Performance of join vs pre-select on MsSQL

I can write the same query in the following two ways; will #1 be more efficient, since it doesn't have a join?
1
select table1.* from table1
inner join table2 on table1.key = table2.key
where table2.id = 1
2
select * from table1
where key = (select key from table2 where id=1)
These are doing two different things. The second will return an error if more than one row is returned by the subquery.
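If the subquery can return more than one key, a safe rewrite of the second form is to switch = to IN, which tolerates multiple rows (a sketch):
select * from table1
where key in (select key from table2 where id = 1)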
In practice, my guess is that you have an index on table2(id) or table2(id, key), and that id is unique in table2. In that case, both should be doing index lookups and the performance should be very comparable.
And, the general answer to performance questions is: try them on your servers with your data. That is really the only way to know if the performance difference makes a difference in your environment.
I executed these two statements after running set statistics io on (on SQL Server 2008 R2 Enterprise, which supposedly has better optimization than Standard edition).
select top 5 * from x2 inner join ##prices on
x2.LIST_PRICE = ##prices.i1
and
select top 5 * from x2 where LIST_PRICE in (select i1 from ##prices)
and the statistics matched exactly. I have always preferred the first type of join, but the second allows me to select just that part and see what rows are being returned.
I was taught that joins vs subqueries are mostly equivalent when it comes to performance. I would also look at the resulting query plans to see if one is better than the other. The query plans matched exactly.
MS SQL Server is smart enough to understand that it is the same action in such a simple query.
However, if you have more than 1 record in the subquery, then you'll probably use IN. IN is a slow operation and it will never work faster than a JOIN. It can be the same, but never faster.
The best option for your case is to use EXISTS. It will always be at least as fast as the JOIN or IN operations. Example:
select * from table1 t1
where EXISTS (select * from table2 t2 where t2.id = 1 AND t1.key = t2.key)

WHERE and JOIN order of operation

My question is similar to "SQL order of operations" but with a little twist, so I think it's fair to ask.
I'm using Teradata. And I have 2 tables: table1, table2.
table1 has only an id column.
table2 has the following columns: id, val
I might be wrong but I think these two statements give the same results.
Statement 1.
SELECT table1.id, table2.val
FROM table1
INNER JOIN table2
ON table1.id = table2.id
WHERE table2.val<100
Statement 2.
SELECT table1.id, table3.val
FROM table1
INNER JOIN (
SELECT *
FROM table2
WHERE val<100
) table3
ON table1.id=table3.id
My question is, will the query optimizer be smart enough to
- execute the WHERE clause first, then JOIN later, in Statement 1
- know that table3 isn't actually needed in Statement 2
I'm pretty new to SQL, so please educate me if I'm misunderstanding anything.
This would depend on many things (table size, indexes, key distribution, etc.); you should just check the execution plan.
You don't say which database, but here are some ways:
MySQL: EXPLAIN
SQL Server: SET SHOWPLAN_ALL (Transact-SQL)
Oracle: EXPLAIN PLAN
Teradata: "What is explain in teradata?" and "Capture and compare plans faster with Visual Explain and XML plan logging"
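In Teradata specifically, the EXPLAIN request modifier is just prefixed to the query, so you can compare both statements directly -- a sketch against the question's tables:
EXPLAIN
SELECT table1.id, table2.val
FROM table1
INNER JOIN table2 ON table1.id = table2.id
WHERE table2.val < 100;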
Depending on the availability of statistics and indexes for the tables in question, the query rewrite mechanism in the optimizer may or may not opt to scan Table2 for records where val < 100 before scanning Table1.
In certain situations, based on data demographics, joins, indexing and statistics, you may find that the optimizer is not eliminating records in the query plan when you feel that it should, even if you have a derived table such as the one in your example. You can force the optimizer to process a derived table by simply placing a GROUP BY in the derived table. The optimizer is then obligated to resolve the GROUP BY aggregate before it can consider resolving the join between the two tables in your example.
SELECT table1.id, table3.val
FROM table1
INNER JOIN (
SELECT table2.id, table2.val
FROM table2
WHERE val<100
GROUP BY 1,2
) table3
ON table1.id=table3.id
This is not to say that your standard approach should be to run with this throughout your code. It is typically one of my last resorts, for when I have a query plan that simply doesn't eliminate extraneous records early enough in the plan and results in too much data being scanned and carried around through the various SPOOL files. It is simply a technique you can put in your toolkit for when you encounter such a situation.
The query rewrite mechanism is continually being updated from one release to the next and the details about how it works can be found in the SQL Transaction Processing Manual for Teradata 13.0.
Unless I'm missing something, why do you even need Table1?
Just query Table2
Select id, val
From table2
WHERE val<100
Or are you using the rows in table1 as a filter? I.e., does table1 only contain a subset of the IDs in Table2?
If so, then this will work as well ...
Select id, val
From table2
Where val < 100
And id In (Select id From table1)
But to answer your question: yes, the query optimizer should be intelligent enough to figure out the best order in which to execute the steps necessary to translate your logical instructions into a physical result. It uses the stored statistics that the database maintains on each table to determine what to do (what type of join logic to use, for example), as well as what order to perform the operations in, in order to minimize disk IOs and processing costs.
Q1. execute the WHERE clause first then JOIN later in Statement 1
The thing is, if you switch the order of the inner join, i.e. table2 INNER JOIN table1, then I guess the WHERE clause can be processed before the JOIN operation, during the preparation phase. However, even if you don't change the original query, the optimizer should be able to switch the order if it thinks the join would be too expensive when fetching whole rows, and apply the WHERE first. Just my guess.
Q2. know that table 3 isn't actually needed in Statement 2
Teradata will interpret your second query in such a way that the derived table is necessary, so it will keep processing the operations involving table3.

IN vs. JOIN with large rowsets

I want to select rows in a table where the primary key is in another table. I'm not sure if I should use a JOIN or the IN operator in SQL Server 2005. Is there any significant performance difference between these two SQL queries with a large dataset (i.e. millions of rows)?
SELECT *
FROM a
WHERE a.c IN (SELECT d FROM b)
SELECT a.*
FROM a JOIN b ON a.c = b.d
Update:
This article on my blog summarizes both my answer and my comments on other answers, and shows actual execution plans:
IN vs. JOIN vs. EXISTS
SELECT *
FROM a
WHERE a.c IN (SELECT d FROM b)
SELECT a.*
FROM a
JOIN b
ON a.c = b.d
These queries are not equivalent. They can yield different results if your table b is not key-preserved (i.e. if the values of b.d are not unique).
The equivalent of the first query is the following:
SELECT a.*
FROM a
JOIN (
SELECT DISTINCT d
FROM b
) bo
ON a.c = bo.d
If b.d is UNIQUE and marked as such (with a UNIQUE INDEX or UNIQUE CONSTRAINT), then these queries are identical and most probably will use identical plans, since SQL Server is smart enough to take this into account.
SQL Server can employ one of the following methods to run this query:
If there is an index on a.c, d is UNIQUE and b is relatively small compared to a, then the condition is propagated into the subquery and the plain INNER JOIN is used (with b leading)
If there is an index on b.d and d is not UNIQUE, then the condition is also propagated and LEFT SEMI JOIN is used. It can also be used for the condition above.
If there is an index on both b.d and a.c and they are large, then MERGE SEMI JOIN is used
If there is no index on any table, then a hash table is built on b and HASH SEMI JOIN is used.
None of these methods reevaluates the whole subquery each time.
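For instance, if b.d really is unique, declaring it hands the optimizer the proof it needs to treat the IN and JOIN forms as identical (a sketch; the index name is invented):
CREATE UNIQUE INDEX UX_b_d ON b(d); -- with this in place, IN (SELECT d FROM b) and a plain JOIN on b.d are equivalent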
See this entry in my blog for more detail on how this works:
Counting missing rows: SQL Server
There are links for all of the big four RDBMSs.
Neither. Use an ANSI-92 JOIN:
SELECT a.*
FROM a JOIN b ON a.c = b.d
However, it's best as an EXISTS
SELECT a.*
FROM a
WHERE EXISTS (SELECT * FROM b WHERE a.c = b.d)
This removes the duplicates that could be generated by the JOIN, but runs just as fast, if not faster.
Speaking from experience on a table with 49,000,000 rows, I would recommend a LEFT OUTER JOIN.
Using IN or EXISTS took 5 minutes to complete, whereas the LEFT OUTER JOIN finishes in 1 second.
SELECT a.*
FROM a LEFT OUTER JOIN b ON a.c = b.d
WHERE b.d is not null -- given b.d is a primary key with an index
Actually in my query I do this across 9 tables.
The IN is evaluated (and the select from b re-run) for each row in a, whereas the JOIN is optimized to use indices and other neat paging tricks...
In most cases, though, the optimizer would likely be able to construct a JOIN out of a correlated subquery and end up with the same execution plan anyway.
Edit: Kindly read the comments below for further... discussion about the validity of this answer, and the actual answer to the OP's question. =)
Aside from going and actually testing it out on a big swath of test data for yourself, I would say use the JOINs. I've always had better performance using them in most cases compared to an IN subquery, and you have a lot more customization options as far as how to join, what is selected, what isn't, etc.
They are different queries with different results. With the IN query you will get 1 row from table 'a' whenever the predicate matches. With the INNER JOIN query you will get a*b rows whenever the join condition matches.
So with values in a of {1,2,3} and b of {1,2,2,3} you will get 1,2,2,3 from the JOIN and 1,2,3 from the IN.
EDIT - I think you may come across a few answers in here that will give you a misconception. Go test it yourself and you will see these are all fine query plans:
create table t1 (t1id int primary key clustered)
create table t2 (
    t2id int identity primary key clustered,
    t1id int references t1(t1id) -- allows duplicate t1id values, so the join can multiply rows
)
insert t1 values (1)
insert t1 values (2)
insert t1 values (3)
insert t1 values (4)
insert t1 values (5)
insert t2 values (1)
insert t2 values (2)
insert t2 values (2)
insert t2 values (3)
insert t2 values (4)
select * from t1 where t1id in (select t1id from t2)
select * from t1 where exists (select 1 from t2 where t2.t1id = t1.t1id)
select t1.* from t1 join t2 on t1.t1id = t2.t1id
The first two plans are identical. The last plan is a nested loop; this difference is expected because, as I mentioned above, the join has different semantics.
From the MSDN documentation on Subquery Fundamentals:
Many Transact-SQL statements that include subqueries can be alternatively formulated as joins. Other questions can be posed only with subqueries. In Transact-SQL, there is usually no performance difference between a statement that includes a subquery and a semantically equivalent version that does not. However, in some cases where existence must be checked, a join yields better performance. Otherwise, the nested query must be processed for each result of the outer query to ensure elimination of duplicates. In such cases, a join approach would yield better results.
In the example you've provided, the nested query need only be processed a single time for each of the outer query results, so there should be no performance difference. Checking the execution plans for both queries should confirm this.
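One way to check in SQL Server (a sketch; SHOWPLAN has to be toggled in its own batch, hence the GO separators):
SET SHOWPLAN_TEXT ON;
GO
SELECT * FROM a WHERE a.c IN (SELECT d FROM b);
SELECT a.* FROM a JOIN b ON a.c = b.d;
GO
SET SHOWPLAN_TEXT OFF;
GO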
Note: Though the question itself didn't specify SQL Server 2005, I answered with that assumption based on the question tags. Other database engines (even different SQL Server versions) may not optimize in the same way.
Observe the execution plan for both types and draw your conclusions. Unless the number of records returned by the subquery in the "IN" statement is very small, the IN variant is almost certainly slower.
I would use a join, betting that it'll be a heck of a lot faster than IN. This presumes that there are primary keys defined, of course, thus letting indexing speed things up tremendously.
It's generally held that a join would be more efficient than the IN subquery; however, the SQL Server optimizer normally results in no noticeable performance difference. Even so, it's probably best to code using the join condition to keep your standards consistent. Also, if your data and code ever need to be migrated in the future, the database engine may not be so forgiving (for example, using a join instead of an IN subquery makes a huge difference in MySQL).
Theory will only get you so far on questions like this. At the end of the day, you'll want to test both queries and see which actually runs faster. I've had cases where the JOIN version took over a minute and the IN version took less than a second. I've also had cases where JOIN was actually faster.
Personally, I tend to start off with the IN version if I know I won't need any fields from the subquery table. If that starts running slow, I'll optimize. Fortunately, for large datasets, rewriting the query makes such a noticeable difference that you can simply time it from Query Analyzer and know you're making progress.
Good luck!
I've always been a supporter of the IN methodology. This link contains details of a test conducted in PostgreSQL.
http://archives.postgresql.org/pgsql-performance/2005-02/msg00327.php