I have to use 5 tables of my database to obtain data and I wrote a SQL query like this:
SELECT *
FROM A
INNER JOIN B ON A.id_b = B.id_b
INNER JOIN C ON B.id_c = C.id_c
INNER JOIN D ON D.id_d = C.id_d
INNER JOIN E ON E.id_e = D.id_e
WHERE A.column1 = somevalue
The columns I select doesn't matter for my explanation, but I require some columns of all tables to do the operation.
I'd like to know: If set A is empty according to the WHERE clause requirements, will it progress on the successive inner joins?
Yes and no and maybe. In all likelihood, the optimizer is going to choose an execution plan that starts with A, because you have filtering conditions in the WHERE clause.
Then, the subsequent JOINs are going to be really fast, because the SQL engine doesn't have to do much work to JOIN an empty set to anything else.
That said, there is no guarantee the the optimizer will start with the first table. So, this is really a happenstance, but a reasonable expectation given your filtering conditions. Also, the subsequent JOINs will be in the execution plan, but they will be fast, because each one will have one set that is empty.
You can try the following methods to optimize your query performance, however the actual result depends on many factors, and you have to test them to see whether they really work for your scenario or not, or which one works best:
Use stored procedure and parametrized somevalue instead of raw query;
Make sure there is an index on A.column1 and rebuild it regularly (clustered index is better if possible)
Use "with (Forceseek)" on table a; (assuming that the index on column1 is already created)
Use "OPTION (FAST 1)";
Store the result of table A to a temporary table first, then use the temporary table for the following script. (This method should be the most effective one in
most cases and table A is guaranteed to be executed first.)
Scripts will be like this:
SELECT *
INTO #A
FROM A
WHERE A.column1 = somevalue
SELECT *
FROM #A
INNER JOIN B ON A.id_b = B.id_b
INNER JOIN C ON B.id_c = C.id_c
INNER JOIN D ON D.id_d = C.id_d
INNER JOIN E ON E.id_e = D.id_e
Related
What will happen in an Oracle SQL join if I don't use all the tables in the WHERE clause that were mentioned in the FROM clause?
Example:
SELECT A.*
FROM A, B, C, D
WHERE A.col1 = B.col1;
Here I didn't use the C and D tables in the WHERE clause, even though I mentioned them in FROM. Is this OK? Are there any adverse performance issues?
It is poor practice to use that syntax at all. The FROM A,B,C,D syntax has been obsolete since 1992... more than 30 YEARS now. There's no excuse anymore. Instead, every join should always use the JOIN keyword, and specify any join conditions in the ON clause. The better way to write the query looks like this:
SELECT A.*
FROM A
INNER JOIN B ON A.col1 = B.col1
CROSS JOIN C
CROSS JOIN D;
Now we can also see what happens in the question. The query will still run if you fail to specify any conditions for certain tables, but it has the effect of using a CROSS JOIN: the results will include every possible combination of rows from every included relation (where the "A,B" part counts as one relation). If each of the three parts of those joins (A&B, C, D) have just 100 rows, the result set will have 1,000,000 rows (100 * 100 * 100). This is rarely going to give the results you expect or intend, and it's especially suspect when the SELECT clause isn't looking at any of the fields from the uncorrelated tables.
Any table lacking join definition will result in a Cartesian product - every row in the intermediate rowset before the join will match every row in the target table. So if you have 10,000 rows and it joins without any join predicate to a table of 10,000 rows, you will get 100,000,000 rows as a result. There are only a few rare circumstances where this is what you want. At very large volumes it can cause havoc for the database, and DBAs are likely to lock your account.
If you don't want to use a table, exclude it entirely from your SQL. If you can't for reason due to some constraint we don't know about, then include the proper join predicates to every table in your WHERE clause and simply don't list any of their columns in your SELECT clause. If there's a cost to the join and you don't need anything from it and again for some very strange reason can't leave the table out completely from your SQL (this does occasionally happen in reusable code), then you can disable the joins by making the predicates always false. Remember to use outer joins if you do this.
Native Oracle method:
WITH data AS (SELECT ROWNUM col FROM dual CONNECT BY LEVEL < 10) -- test data
SELECT A.*
FROM data a,
data b,
data c,
data d
WHERE a.col = b.col
AND DECODE('Y','Y',NULL,a.col) = c.col(+)
AND DECODE('Y','Y',NULL,a.col) = d.col(+)
ANSI style:
WITH data AS (SELECT ROWNUM col FROM dual CONNECT BY LEVEL < 10)
SELECT A.*
FROM data a
INNER JOIN data b ON a.col = b.col
LEFT OUTER JOIN data c ON DECODE('Y','Y',NULL,a.col) = b.col
LEFT OUTER JOIN data d ON DECODE('Y','Y',NULL,a.col) = d.col
You can plug in a variable for the first Y that you set to Y or N (e.g. var_disable_join). This will bypass the join and avoid both the associated performance penalty and the Cartesian product effect. But again, I want to reiterate, this is an advanced hack and is probably NOT what you need. Simply leaving out the unwanted tables it the right approach 95% of the time.
Say that someone came up to you and said we're going to cut down the amount of SQL that we write by replacing equals with IN. The use would be both for single scalar values and lists of numbers.
SELECT *
FROM table
WHERE id = 1
OR
SELECT *
FROM table
WHERE id IN (1)
Are these statement equivalent to what the optimizer produces?
This looks really simple on the surface, but it leads to simplification for two reasons:
large blocks of SQL don't need to be duplicated
we don't overuse dynamic SQL.
Example
This is a contrived example, but consider the following:
SELECT a.*
FROM tablea a
JOIN tableb b ON a.id = b.id
JOIN tablec c ON b.id2 = c.id2
LEFT JOIN tabled d ON c.id3 = c.id3
WHERE d.type = 1
... and the same again for the more than one case:
SELECT a.*
FROM tablea a
JOIN tableb b ON a.id = b.id
JOIN tablec c ON b.id2 = c.id2
LEFT JOIN tabled d ON c.id3 = c.id3
WHERE d.type IN (1, 2, 3, 4)
(this isn't even a large statement)
Conceivably you could do string concatenation, but this isn't desirable in light of ORM usage, and dynamic SQL string concatenation always starts off with good intentions (at least in these parts).
The two will produce the same execution plan - either a table scan, index scan, or index seek, depending on if/how you have your table indexed.
You can see for yourself - Displaying Graphical Execution Plans (SQL Server Management Studio) - See the section called "Using the Execution Plan Options".
Those two specific statements are equivalent to the optimizer (did you compare the execution plans?), but I think the more important benefit you get out of the latter is that
WHERE id = 1 OR id = 2 OR id = 3 OR id = 4 OR id = 5
Can be expressed as the following, much more concise and readable (but semantically equivalent to the optimizer) version:
WHERE id IN (1,2,3,4,5)
The problem is that when expressed as the latter most people think they can pass a string, like #list = '1,2,3,4,5' and then say:
WHERE id IN (#list)
This does not work because #list is a single scalar string, and not an array of integers.
For cases where you have a single value, I don't see how that "optimization" helps anything. You haven't written less SQL, you've actually written more. Can you outline in more detail how this is going to lead to less SQL?
Okay, I know there are a few posts that discuss this, but my problem cannot be solved by a conditional where statement on a join (the common solution).
I have three join statements, and depending on the query parameters, I may need to run any combination of the three. My Join statement is quite expensive, so I want to only do the join when the query needs it, and I'm not prepared to write a 7 combination IF..ELSE.. statement to fulfill those combinations.
Here is what I've used for solutions thus far, but all of these have been less than ideal:
LEFT JOIN joinedTable jt
ON jt.someCol = someCol
WHERE jt.someCol = conditions
OR #neededJoin is null
(This is just too expensive, because I'm performing the join even when I don't need it, just not evaluating the join)
OUTER APPLY
(SELECT TOP(1) * FROM joinedTable jt
WHERE jt.someCol = someCol
AND #neededjoin is null)
(this is even more expensive than always left joining)
SELECT #sql = #sql + ' INNER JOIN joinedTable jt ' +
' ON jt.someCol = someCol ' +
' WHERE (conditions...) '
(this one is IDEAL, and how it is written now, but I'm trying to convert it away from dynamic SQL).
Any thoughts or help would be great!
EDIT:
If I take the dynamic SQL approach, I'm trying to figure out what would be most efficient with regards to structuring my query. Given that I have three optional conditions, and I need the results from all of them my current query does something like this:
IF condition one
SELECT from db
INNER JOIN condition one
UNION
IF condition two
SELECT from db
INNER JOIN condition two
UNION
IF condition three
SELECT from db
INNER JOIN condition three
My non-dynamic query does this task by performing left joins:
SELECT from db
LEFT JOIN condition one
LEFT JOIN condition two
LEFT JOIN condition three
WHERE condition one is true
OR condition two is true
OR condition three is true
Which makes more sense to do? since all of the code from the "SELECT from db" statement is the same? It appears that the union condition is more efficient, but my query is VERY long because of it....
Thanks!
LEFT JOIN
joinedTable jt ON jt.someCol = someCol AND jt.someCol = conditions AND #neededjoin ...
...
OR
LEFT JOIN
(
SELECT col1, someCol, col2 FROM joinedTable WHERE someCol = conditions AND #neededjoin ...
) jt ON jt.someCol = someCol
...
OR
;WITH jtCTE AS
(SELECT col1, someCol, col2 FROM joinedTable WHERE someCol = conditions AND #neededjoin ...)
SELECT
...
LEFT JOIN
jtCTE ON jtCTE.someCol = someCol
...
To be honest, there is no such construct as a conditional JOIN unless you use literals.
If it's in the SQL statement it's evaluated... so don't have it in the SQL statement by using dynamic SQL or IF ELSE
the dynamic sql solution is usually the best for these situations, but if you really need to get away from that a series of if statments in a stroed porc will do the job. It's a pain and you have to write much more code but it will be faster than trying to make joins conditional in the statement itself.
I would go for a simple and straightforward approach like this:
DECLARE #ret TABLE(...) ;
IF <coondition one> BEGIN ;
INSERT INTO #ret() SELECT ...
END ;
IF <coondition two> BEGIN ;
INSERT INTO #ret() SELECT ...
END ;
IF <coondition three> BEGIN ;
INSERT INTO #ret() SELECT ...
END ;
SELECT DISTINCT ... FROM #ret ;
Edit: I am suggesting a table variable, not a temporary table, so that the procedure will not recompile every time it runs. Generally speaking, three simpler inserts have a better chance of getting better execution plans than one big huge monster query combining all three.
However, we can not guess-timate performance. we must benchmark to determine it. Yet simpler code chunks are better for readability and maintainability.
Try this:
LEFT JOIN joinedTable jt
ON jt.someCol = someCol
AND jt.someCol = conditions
AND #neededJoin = 1 -- or whatever indicates join is needed
I think you'll find it is good performance and does what you need.
Update
If this doesn't give the performance I claimed, then perhaps that's because the last time I did this using joins to a table. The value I needed could come from one of 3 tables, based on 2 columns, so I built a 'join-map' table like so:
Col1 Col2 TableCode
1 2 A
1 4 A
1 3 B
1 5 B
2 2 C
2 5 C
1 11 C
Then,
SELECT
V.*,
LookedUpValue =
CASE M.TableCode
WHEN 'A' THEN A.Value
WHEN 'B' THEN B.Value
WHEN 'C' THEN C.Value
END
FROM
ValueMaster V
INNER JOIN JoinMap M ON V.Col1 = M.oOl1 AND V.Col2 = M.Col2
LEFT JOIN TableA A ON M.TableCode = 'A'
LEFT JOIN TableB B ON M.TableCode = 'B'
LEFT JOIN TableC C ON M.TableCode = 'C'
This gave me a huge performance improvement querying these tables (most of them dozens or hundreds of million-row tables).
This is why I'm asking if you actually get improved performance. Of course it's going to throw a join into the execution plan and assign it some cost, but overall it's going to do a lot less work than some plan that just indiscriminately joins all 3 tables and then Coalesce()s to find the right value.
If you find that compared to dynamic SQL it's only 5% more expensive to do the joins this way, but with the indiscriminate joins is 100% more expensive, it might be worth it to you to do this because of the correctness, clarity, and simplicity over dynamic SQL, all of which are probably more valuable than a small improvement (depending on what you're doing, of course).
Whether the cost scales with the number of rows is also another factor to consider. If even with a huge amount of data you only save 200ms of CPU on a query that isn't run dozens of times a second, it's a no-brainer to use it.
The reason I keep hammering on the fact that I think it's going to perform well is that even with a hash match, it wouldn't have any rows to probe with, or it wouldn't have any rows to create a hash of. The hash operation is going to stop a lot earlier compared to using the WHERE clause OR-style query of your initial post.
The dynamic SQL solution is best in most respects; you are trying to run different queries with different numbers of joins without rewriting the query to do different numbers of joins - and that doesn't work very well in terms of performance.
When I was doing this sort of stuff an æon or so ago (say the early 90s), the language I used was I4GL and the queries were built using its CONSTRUCT statement. This was used to generate part of a WHERE clause, so (based on the user input), the filter criteria it generated might look like:
a.column1 BETWEEN 1 AND 50 AND
b.column2 = 'ABCD' AND
c.column3 > 10
In those days, we didn't have the modern JOIN notations; I'm going to have to improvise a bit as we go. Typically there is a core table (or a set of core tables) that are always part of the query; there are also some tables that are optionally part of the query. In the example above, I assume that 'c' is the alias for the main table. The way the code worked would be:
Note that table 'a' was referenced in the query:
Add 'FullTableName AS a' to the FROM clause
Add a join condition 'AND a.join1 = c.join1' to the WHERE clause
Note that table 'b' was referenced...
Add bits to the FROM clause and WHERE clause.
Assemble the SELECT statement from the select-list (usually fixed), the FROM clause and the WHERE clause (occasionally with decorations such as GROUP BY, HAVING or ORDER BY too).
The same basic technique should be applied here - but the details are slightly different.
First of all, you don't have the string to analyze; you know from other circumstances which tables you need to add to your query. So, you still need to design things so that they can be assembled, but...
The SELECT clause with its select-list is probably fixed. It will identify the tables that must be present in the query because values are pulled from those tables.
The FROM clause will probably consist of a series of joins.
One part will be the core query:
FROM CoreTable1 AS C1
JOIN CoreTable2 AS C2
ON C1.JoinColumn = C2.JoinColumn
JOIN CoreTable3 AS M
ON M.PrimaryKey = C1.ForeignKey
Other tables can be added as necessary:
JOIN AuxilliaryTable1 AS A
ON M.ForeignKey1 = A.PrimaryKey
Or you can specify a full query:
JOIN (SELECT RelevantColumn1, RelevantColumn2
FROM AuxilliaryTable1
WHERE Column1 BETWEEN 1 AND 50) AS A
In the first case, you have to remember to add the WHERE criterion to the main WHERE clause, and trust the DBMS Optimizer to move the condition into the JOIN table as shown. A good optimizer will do that automatically; a poor one might not. Use query plans to help you determine how able your DBMS is.
Add the WHERE clause for any inter-table criteria not covered in the joining operations, and any filter criteria based on the core tables. Note that I'm thinking primarily in terms of extra criteria (AND operations) rather than alternative criteria (OR operations), but you can deal with OR too as long as you are careful to parenthesize the expressions sufficiently.
Occasionally, you may have to add a couple of JOIN conditions to connect a table to the core of the query - that is not dreadfully unusual.
Add any GROUP BY, HAVING or ORDER BY clauses (or limits, or any other decorations).
Note that you need a good understanding of the database schema and the join conditions. Basically, this is coding in your programming language the way you have to think about constructing the query. As long as you understand this and your schema, there aren't any insuperable problems.
Good luck...
Just because no one else mentioned this, here's something that you could use (not dynamic). If the syntax looks weird, it's because I tested it in Oracle.
Basically, you turn your joined tables into sub-selects that have a where clause that returns nothing if your condition does not match. If the condition does match, then the sub-select returns data for that table. The Case statement lets you pick which column is returned in the overall select.
with m as (select 1 Num, 'One' Txt from dual union select 2, 'Two' from dual union select 3, 'Three' from dual),
t1 as (select 1 Num from dual union select 11 from dual),
t2 as (select 2 Num from dual union select 22 from dual),
t3 as (select 3 Num from dual union select 33 from dual)
SELECT m.*
,CASE 1
WHEN 1 THEN
t1.Num
WHEN 2 THEN
t2.Num
WHEN 3 THEN
t3.Num
END SelectedNum
FROM m
LEFT JOIN (SELECT * FROM t1 WHERE 1 = 1) t1 ON m.Num = t1.Num
LEFT JOIN (SELECT * FROM t2 WHERE 1 = 2) t2 ON m.Num = t2.Num
LEFT JOIN (SELECT * FROM t3 WHERE 1 = 3) t3 ON m.Num = t3.Num
Does it matter which way I order the criteria in the ON clause for a JOIN?
select a.Name, b.Status from a
inner join b
on a.StatusID = b.ID
versus
select a.Name, b.Status from a
inner join b
on b.ID = a.StatusID
Is there any impact on performance? What if I had multiple criteria?
Is one order more maintainable than another?
JOIN order can be forced by putting the tables in the right order in the FROM clause:
MySQL has a special clause called STRAIGHT_JOIN which makes the order matter.
This will use an index on b.id:
SELECT a.Name, b.Status
FROM a
STRAIGHT_JOIN
b
ON b.ID = a.StatusID
And this will use an index on a.StatusID:
SELECT a.Name, b.Status
FROM b
STRAIGHT_JOIN
a
ON b.ID = a.StatusID
Oracle has a special hint ORDERED to enforce the JOIN order:
This will use an index on b.id or build a hash table on b:
SELECT /*+ ORDERED */
*
FROM a
JOIN b
ON b.ID = a.StatusID
And this will use an index on a.StatusID or build a hash table on a:
SELECT /*+ ORDERED */
*
FROM b
JOIN a
ON b.ID = a.StatusID
SQL Server has a hint called FORCE ORDER to do the same:
This will use an index on b.id or build a hash table on b:
SELECT *
FROM a
JOIN b
ON b.ID = a.StatusID
OPTION (FORCE ORDER)
And this will use an index on a.StatusID or build a hash table on a:
SELECT *
FROM b
JOIN a
ON b.ID = a.StatusID
OPTION (FORCE ORDER)
PostgreSQL guys, sorry. Your TODO list says:
Optimizer hints (not wanted)
Optimizer hints are used to work around problems in the optimizer. We would rather have the problems reported and fixed.
As for the order in the comparison, it doesn't matter in any RDBMS, AFAIK.
Though I personally always try to estimate which column will be searched for and put this column in the left (for it to seem like an lvalue).
See this answer for more detail.
No it does not.
What i do (for readability) is your 2nd example.
No. The database should be determining the best execution plan based on the entire criteria, not creating it by looking at each item in sequence. You can confirm this by requesting the execution plan for both queries, you'll see they are the same (you'll find that even vastly different queries, as long as they ultimately specify the same logic, are often compiled into the same execution plan).
No there is not. At the end of the day, you are really just evaluating whether a=b.
And as the symmetric property of equality states:
For any quantities a and b, if a = b, then b = a.
so whether you check for (12)*=12 or 12=(12)* makes logically no difference.
If values are equal, join, if not, don't. And whether you specify it as in your first example or the second, makes no difference.
As many have said: The order does not make a difference in result or performance.
What I want to point out though is that LINQ to SQL only allows the first case!
Eg, following example works well, ...
var result = from a in db.a
join b in db.b on a.StatusID equals b.ID
select new { Name = a.Name, Status = b.Status }
... while this will throw errors in Visual Studio:
var result = from a in db.a
join b in db.b on b.ID equals a.StatusID
select new { Name = a.Name, Status = b.Status }
Which throws these compiler errors:
CS1937: The name 'name' is not in scope on the left side of 'equals'. Consider swapping the expressions on either side of 'equals'.
CS1938: The name 'name' is not in scope on the right side of 'equals'. Consider swapping the expressions on either side of 'equals'.
Though not relevant in standard SQL coding, this might be a point to consider, when accustoming oneself to either one of those.
Read this
SqlServer contains an optimisation for situations far more complex than this.
If you have multiple criteria stuff is usually lazy evaluated (but I need to do a bit of research around edge cases if any.)
For readability I usually prefer
SELECT Name, Status FROM a
JOIN b
ON a.StatusID = b.ID
I think it makes better sense to reference the variable in the same order they were declared but its really a personal taste thing.
The only reason I wouldn't use your second example:
select a.Name, b.Status
from a
inner join b
on b.ID = a.StatusID
Your user is more likely to come back and say 'Can I see all the a.name's even if they have no status records?' rather than 'Can I see all of b.status even if they don't have a name record?', so just to plan ahead for this example, I would use On a.StatusID = b.ID in anticipation of a LEFT Outer Join. This assumes you could have table 'a' record without 'b'.
Correction: It won't change the result.
This is probably a moot point since users never want to change their requirements.
nope, doesn't matter. but here's an example to help make your queries more readable (at least to me)
select a.*, b.*
from tableA a
inner join tableB b
on a.name=b.name
and a.type=b.type
each table reference is on a separate line, and each join criteria is on a separate line. the tabbing helps keep what belongs to what straight.
another thing i like to do is make my criteria in my on statements flow the same order as the table. so if a is first and then b, a will be on the left side and b on the right.
ERROR: ON clause references tables to its right (php sqlite 3.2)
Replace this
LEFT JOIN itm08 i8 ON i8.id= **cdd01.idcmdds** and i8.itm like '%ormit%'
LEFT JOIN **comodidades cdd01** ON cdd01.id_registro = u.id_registro
For this
LEFT JOIN **comodidades cdd01** ON cdd01.id_registro = u.id_registro
LEFT JOIN itm08 i8 ON i8.id= **cdd01.idcmdds** and i8.itm like '%ormit%'
I have a particularly slow query due to the vast amount of information being joined together. However I needed to add a where clause in the shape of id in (select id from table).
I want to know if there is any gain from the following, and more pressing, will it even give the desired results.
select a.* from a where a.id in (select id from b where b.id = a.id)
as an alternative to:
select a.* from a where a.id in (select id from b)
Update:
MySQL
Can't be more specific sorry
table a is effectively a join between 7 different tables.
use of * is for examples
Edit, b doesn't get selected
Your question was about the difference between these two:
select a.* from a where a.id in (select id from b where b.id = a.id)
select a.* from a where a.id in (select id from b)
The former is a correlated subquery. It may cause MySQL to execute the subquery for each row of a.
The latter is a non-correlated subquery. MySQL should be able to execute it once and cache the results for comparison against each row of a.
I would use the latter.
Both queries you list are the equivalent of:
select a.*
from a
inner join b on b.id = a.id
Almost all optimizers will execute them in the same way.
You could post a real execution plan, and someone here might give you a way to speed it up. It helps if you specify what database server you are using.
YMMV, but I've often found using EXISTS instead of IN makes queries run faster.
SELECT a.* FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
Of course, without seeing the rest of the query and the context, this may not make the query any faster.
JOINing may be a more preferable option, but if a.id appears more than once in the id column of b, you would have to throw a DISTINCT in there, and you more than likely go backwards in terms of optimization.
I would never use a subquery like this. A join would be much faster.
select a.*
from a
join b on a.id = b.id
Of course don't use select * either (especially never use it when doing a join as at least one field is repeated) and it wastes network resources to send unnneeded data.
Have you looked at the execution plan?
How about
select a.*
from a
inner join b
on a.id = b.id
presumably the id fields are primary keys?
Select a.* from a
inner join (Select distinct id from b) c
on a.ID = c.AssetID
I tried all 3 versions and they ran about the same. The execution plan was the same (inner join, IN (with and without where clause in subquery), Exists)
Since you are not selecting any other fields from B, I prefer to use the Where IN(Select...) Anyone would look at the query and know what you are trying to do (Only show in a if in b.).
your problem is most likely in the seven tables within "a"
make the FROM table contain the "a.id"
make the next join: inner join b on a.id = b.id
then join in the other six tables.
you really need to show the entire query, list all indexes, and approximate row counts of each table if you want real help