SQL (any) Request for insight on a query optimization

I have a particularly slow query because of the vast amount of information being joined together. However, I needed to add a where clause of the form id in (select id from table).
I want to know whether there is any gain from the following and, more pressingly, whether it will even give the desired results.
select a.* from a where a.id in (select id from b where b.id = a.id)
as an alternative to:
select a.* from a where a.id in (select id from b)
Update:
MySQL
Can't be more specific sorry
table a is effectively a join between 7 different tables.
use of * is for examples
Edit, b doesn't get selected

Your question was about the difference between these two:
select a.* from a where a.id in (select id from b where b.id = a.id)
select a.* from a where a.id in (select id from b)
The former is a correlated subquery. It may cause MySQL to execute the subquery for each row of a.
The latter is a non-correlated subquery. MySQL should be able to execute it once and cache the results for comparison against each row of a.
I would use the latter.

Both queries you list are the equivalent of:
select a.*
from a
inner join b on b.id = a.id
Almost all optimizers will execute them in the same way.
You could post a real execution plan, and someone here might give you a way to speed it up. It helps if you specify what database server you are using.

YMMV, but I've often found using EXISTS instead of IN makes queries run faster.
SELECT a.* FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
Of course, without seeing the rest of the query and the context, this may not make the query any faster.
JOINing may be a preferable option, but if a.id appears more than once in the id column of b, you would have to throw a DISTINCT in there, and you would more than likely go backwards in terms of optimization.
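To check the "will it even give the desired results" part, here is a minimal sketch that runs the forms discussed above side by side. It uses SQLite through Python's sqlite3 module rather than MySQL, and the column `label` and the sample rows are made up for illustration, so it only says something about results, not about MySQL performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, label TEXT);   -- label is a made-up column
    CREATE TABLE b (id INTEGER);
    INSERT INTO a VALUES (1, 'one'), (2, 'two'), (3, 'three');
    INSERT INTO b VALUES (1), (1), (3);        -- id 1 appears twice in b
""")

def ids(sql):
    """Run a query and return the a.id values it yields, sorted."""
    return sorted(row[0] for row in conn.execute(sql))

in_plain = ids("SELECT a.id FROM a WHERE a.id IN (SELECT id FROM b)")
in_correlated = ids(
    "SELECT a.id FROM a WHERE a.id IN (SELECT id FROM b WHERE b.id = a.id)")
exists_form = ids(
    "SELECT a.id FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id)")
plain_join = ids("SELECT a.id FROM a JOIN b ON b.id = a.id")
distinct_join = ids("SELECT DISTINCT a.id FROM a JOIN b ON b.id = a.id")

print(in_plain, in_correlated, exists_form)  # the three subquery forms
print(plain_join)                            # the join repeats id 1
print(distinct_join)                         # DISTINCT removes the repeat
```

With this data the two IN forms and the EXISTS form return the same rows, while the bare join repeats a row of a for every matching row of b, which is the case the DISTINCT caveat is about.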

I would never use a subquery like this. A join would be much faster.
select a.*
from a
join b on a.id = b.id
Of course, don't use select * either (especially never use it when doing a join, as at least one field is repeated); it wastes network resources to send unneeded data.

Have you looked at the execution plan?
How about
select a.*
from a
inner join b
on a.id = b.id
presumably the id fields are primary keys?

Select a.* from a
inner join (Select distinct id from b) c
on a.ID = c.id
I tried all 3 versions and they ran about the same. The execution plan was the same (inner join, IN (with and without where clause in subquery), Exists)
Since you are not selecting any other fields from B, I prefer to use the Where IN (Select...). Anyone would look at the query and know what you are trying to do (only show in a if in b).

your problem is most likely in the seven tables within "a"
make the FROM table contain the "a.id"
make the next join: inner join b on a.id = b.id
then join in the other six tables.
you really need to show the entire query, list all indexes, and approximate row counts of each table if you want real help

Related

Oracle: Use only few tables in WHERE clause but mentioned more tables in 'FROM' in a join SQL

What will happen in an Oracle SQL join if I don't use all the tables in the WHERE clause that were mentioned in the FROM clause?
Example:
SELECT A.*
FROM A, B, C, D
WHERE A.col1 = B.col1;
Here I didn't use the C and D tables in the WHERE clause, even though I mentioned them in FROM. Is this OK? Are there any adverse performance issues?
It is poor practice to use that syntax at all. The FROM A,B,C,D syntax has been obsolete since 1992... more than 30 YEARS now. There's no excuse anymore. Instead, every join should always use the JOIN keyword, and specify any join conditions in the ON clause. The better way to write the query looks like this:
SELECT A.*
FROM A
INNER JOIN B ON A.col1 = B.col1
CROSS JOIN C
CROSS JOIN D;
Now we can also see what happens in the question. The query will still run if you fail to specify any conditions for certain tables, but it has the effect of using a CROSS JOIN: the results will include every possible combination of rows from every included relation (where the "A,B" part counts as one relation). If each of the three parts of those joins (A&B, C, D) have just 100 rows, the result set will have 1,000,000 rows (100 * 100 * 100). This is rarely going to give the results you expect or intend, and it's especially suspect when the SELECT clause isn't looking at any of the fields from the uncorrelated tables.
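The row-count arithmetic is easy to confirm on a small scale. This is a sketch only: it uses SQLite via Python's sqlite3 instead of Oracle, and fills each of the four tables with 10 made-up rows rather than 100.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Four tables named as in the question, each holding the values 0..9.
for name in "ABCD":
    conn.execute(f"CREATE TABLE {name} (col1 INTEGER)")
    conn.executemany(f"INSERT INTO {name} VALUES (?)",
                     [(i,) for i in range(10)])

# A joined to B only: one row per matching col1 value.
matched = conn.execute(
    "SELECT COUNT(*) FROM A JOIN B ON A.col1 = B.col1").fetchone()[0]

# The query from the question: C and D have no join condition at all.
unconstrained = conn.execute(
    "SELECT COUNT(*) FROM A, B, C, D WHERE A.col1 = B.col1").fetchone()[0]

print(matched, unconstrained)
```

Here the A-B join yields 10 rows, and listing C and D without conditions multiplies that by 10 twice over.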
Any table lacking join definition will result in a Cartesian product - every row in the intermediate rowset before the join will match every row in the target table. So if you have 10,000 rows and it joins without any join predicate to a table of 10,000 rows, you will get 100,000,000 rows as a result. There are only a few rare circumstances where this is what you want. At very large volumes it can cause havoc for the database, and DBAs are likely to lock your account.
If you don't want to use a table, exclude it entirely from your SQL. If you can't, because of some constraint we don't know about, then include the proper join predicates to every table in your WHERE clause and simply don't list any of their columns in your SELECT clause. If there's a cost to the join and you don't need anything from it and again for some very strange reason can't leave the table out completely from your SQL (this does occasionally happen in reusable code), then you can disable the joins by making the predicates always false. Remember to use outer joins if you do this.
Native Oracle method:
WITH data AS (SELECT ROWNUM col FROM dual CONNECT BY LEVEL < 10) -- test data
SELECT A.*
FROM data a,
data b,
data c,
data d
WHERE a.col = b.col
AND DECODE('Y','Y',NULL,a.col) = c.col(+)
AND DECODE('Y','Y',NULL,a.col) = d.col(+)
ANSI style:
WITH data AS (SELECT ROWNUM col FROM dual CONNECT BY LEVEL < 10)
SELECT A.*
FROM data a
INNER JOIN data b ON a.col = b.col
LEFT OUTER JOIN data c ON DECODE('Y','Y',NULL,a.col) = c.col
LEFT OUTER JOIN data d ON DECODE('Y','Y',NULL,a.col) = d.col
You can plug in a variable for the first Y that you set to Y or N (e.g. var_disable_join). This will bypass the join and avoid both the associated performance penalty and the Cartesian product effect. But again, I want to reiterate, this is an advanced hack and is probably NOT what you need. Simply leaving out the unwanted tables is the right approach 95% of the time.
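For readers without an Oracle instance, the shape of the trick can be sketched in SQLite via Python's sqlite3. This is an assumption-laden stand-in: CASE replaces DECODE, the bind parameter `:disable` plays the role of the switch variable, and the table contents and the `extra` column are invented. It shows only the effect on the result set, not on performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (col INTEGER);
    CREATE TABLE c (col INTEGER, extra TEXT);   -- extra is a made-up column
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO c VALUES (1, 'x'), (1, 'y'), (2, 'z');
""")

# When :disable is 'Y' the join key becomes NULL, so no row of c can match
# and the outer join keeps exactly one row per row of a.
SQL = """
    SELECT a.col, c.extra
    FROM a
    LEFT JOIN c
      ON (CASE WHEN :disable = 'Y' THEN NULL ELSE a.col END) = c.col
    ORDER BY a.col, c.extra
"""

enabled = conn.execute(SQL, {"disable": "N"}).fetchall()
disabled = conn.execute(SQL, {"disable": "Y"}).fetchall()

print(enabled)   # rows of a, fanned out by their matches in c
print(disabled)  # one row per row of a, with NULL for c.extra
```

The outer join matters: with an inner join and an always-false predicate the query would return no rows at all.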

NOT EXISTS query doesn't work on Informix while same query with NOT IN works

select id from license where not exists(
select a.id from license a,version b, mediapart c where
c.version_id = b.id and b.cntnt_id = a.cntnt_id and c.is_highdef=1);
This query does not give any rows in the result set, even when using different aliases for the outer license table and the inner one.
However, this one using NOT IN works fine:
select id from license where id not in(
select a.id from license a,version b, mediapart c where
c.version_id = b.id and b.cntnt_id = a.cntnt_id and c.is_highdef=1);
Any suggestions on how can I achieve similar query with NOT EXISTS?
It's a home grown framework that I have to achieve this on and it won't be possible to write a query which is like following
select id from license a where not exists(
select a.id from version b, mediapart c where
c.version_id = b.id and b.cntnt_id = a.cntnt_id and c.is_highdef=1);
The above query works but with the framework I am using, I will have to use all three table names and aliases in inner query.
The NOT EXISTS query is correlated, but NOT IN is not. In other words the result of the subquery in the latter version is independent of the result of the main query.
I suspect there are records in license that have no children in version or mediapart, so they fall out of the correlated version of the query.
Without knowing more about the data design, I suggest you probably need to look at using an OUTER JOIN to ensure you get all the license records.
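It may help to reproduce the behaviour on a tiny data set. The sketch below uses SQLite via Python's sqlite3 rather than Informix; the table and column names come from the question, while the sample rows and the outer alias `l` are made up. In the first NOT EXISTS query of the question, nothing inside the subquery refers to the outer license row (the inner `license a` is a separate instance of the table), so the subquery returns the same rows for every outer row, and NOT EXISTS is false everywhere as soon as any high-definition license exists. One possible rewrite that still lists all three tables in the inner query is to tie the inner license to the outer one by id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE license   (id INTEGER, cntnt_id INTEGER);
    CREATE TABLE version   (id INTEGER, cntnt_id INTEGER);
    CREATE TABLE mediapart (id INTEGER, version_id INTEGER, is_highdef INTEGER);
    INSERT INTO license   VALUES (1, 10), (2, 20);
    INSERT INTO version   VALUES (100, 10), (200, 20);
    INSERT INTO mediapart VALUES (1000, 100, 1), (2000, 200, 0);
""")

# Inner query exactly as in the question: licenses with a high-def part.
INNER = """
    SELECT a.id FROM license a, version b, mediapart c
    WHERE c.version_id = b.id AND b.cntnt_id = a.cntnt_id AND c.is_highdef = 1
"""

def ids(sql):
    return sorted(row[0] for row in conn.execute(sql))

not_in = ids(f"SELECT id FROM license WHERE id NOT IN ({INNER})")

# Same subquery under NOT EXISTS: it never looks at the outer row.
not_exists_unlinked = ids(
    f"SELECT id FROM license WHERE NOT EXISTS ({INNER})")

# Hypothetical fix: keep all three tables, add a link to the outer alias l.
not_exists_linked = ids(
    f"SELECT l.id FROM license l WHERE NOT EXISTS ({INNER} AND a.id = l.id)")

print(not_in, not_exists_unlinked, not_exists_linked)
```

With this data only license 1 has a high-definition part, so NOT IN and the linked NOT EXISTS both return license 2, and the unlinked NOT EXISTS returns nothing.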

SQL Server : does order of full outer join matter?

I have 4 full outer joins in my query and it's really slow. Does the order of FULL OUTER JOINs make a difference in performance / result?
FULL OUTER JOIN = ⋈
Then,
I have a situation : A ⋈ B ⋈ C ⋈ D
All joins are on a common key k that is contained in all of A, B, C, D
Then:
Will changing the order of ⋈ joins make a difference to performance ?
Will changing the order of ⋈ change the result ?
I feel that it should not affect the result, but will it affect the performance or not I am not sure !
Update:
Will SQL Server automatically rearrange the joins for better performance assuming the result set will be independent of the order ?
No, rearranging the JOIN orders should not affect the performance. MSSQL (as with other DBMS) has a query optimizer whose job it is to find the most efficient query plan for any given query. Generally, these do a pretty good job - so you're unlikely to beat the optimizer easily.
That said, they do get it wrong occasionally. That's where reading an execution plan comes into play. You can add JOIN hints to tell MSSQL how to join your tables (at which point, ordering does matter). You'd generally order from smallest to largest table (though, with a FULL JOIN, it's not likely to matter very much) and follow the rules of thumb for join types.
Since you're doing FULL JOINS, you're basically reading the entirety of 4 tables off disk. That's likely to be very expensive. You may want to re-examine the problem, and see if it can be accomplished in a different way.
Will changing the order of ⋈ change the result ?
No, the order of the FULL JOIN does not matter, the result will be the same. Notice however, that you can't use something like this (the following may give different results depending on the order of joins):
SELECT
COALESCE(a.id, b.id, c.id, d.id) AS id, --- Key columns used in FULL JOIN
a.*, b.*, c.*, d.* --- other columns
FROM a
FULL JOIN b
ON b.id = a.id
FULL JOIN c
ON c.id = a.id
FULL JOIN d
ON d.id = a.id ;
You have to use something like this (no difference in results whatever the order of joins):
SELECT
COALESCE(a.id, b.id, c.id, d.id) AS id,
a.*, b.*, c.*, d.*
FROM a
FULL JOIN b
ON b.id = a.id
FULL JOIN c
ON c.id = COALESCE(a.id, b.id)
FULL JOIN d
ON d.id = COALESCE(a.id, b.id, c.id) ;
Will changing the order of ⋈ joins make a difference to performance?
Taking into consideration that the second and third joins have to be done on the COALESCE() of the columns and not the columns themselves, I think only testing with large enough tables will show if the indexes can be used effectively.
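Why the COALESCE() in the ON clauses matters can be seen with three one-column tables. The following is a small pure-Python simulation of FULL JOIN, not SQL Server itself; the helper names `full_join` and `coalesce` and the sample ids are made up for the illustration.

```python
def full_join(left_rows, right_ids, right_name, key):
    """FULL JOIN left_rows with a one-column table called right_name.

    key(row) gives the value compared with the right-hand id. A None key
    never matches, mirroring how NULL = x is not true in SQL.
    """
    columns = list(left_rows[0].keys()) if left_rows else []
    out, matched = [], set()
    for row in left_rows:
        k = key(row)
        hits = [r for r in right_ids if k is not None and r == k]
        for r in hits:
            out.append({**row, right_name: r})
            matched.add(r)
        if not hits:                      # keep unmatched left rows
            out.append({**row, right_name: None})
    for r in right_ids:                   # keep unmatched right rows
        if r not in matched:
            out.append({**{c: None for c in columns}, right_name: r})
    return out

def coalesce(*values):
    """First non-None argument, like SQL COALESCE."""
    return next((v for v in values if v is not None), None)

a_rows = [{"a": 1}]          # table a holds id 1
b_ids, c_ids = [2], [2]      # tables b and c both hold id 2

ab = full_join(a_rows, b_ids, "b", key=lambda r: r["a"])

# ON c.id = a.id: the row that came only from b has a NULL a.id.
naive = full_join(ab, c_ids, "c", key=lambda r: r["a"])

# ON c.id = COALESCE(a.id, b.id): the b-only row can now meet c.
fixed = full_join(ab, c_ids, "c", key=lambda r: coalesce(r["a"], r["b"]))

print([coalesce(r["a"], r["b"], r["c"]) for r in naive])
print([coalesce(r["a"], r["b"], r["c"]) for r in fixed])
```

In the naive version id 2 ends up on two separate rows, because the row contributed by b has no a.id for c to match against. With the COALESCE() key it lands on a single row.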
Changing the order of a Full outer join shouldn't affect performance or results. The only thing that will be affected based on order of a Full Outer Join is the default order of the columns produced if using a SELECT *. You may be having performance issues simply from trying to do multiple joins with large tables. If there is no where clause to limit the tables, you could be going through hundreds of thousands of results.

SQL Method of checking that INNER / LEFT join doesn't duplicate rows

Is there a good or standard SQL method of asserting that a join does not duplicate any rows (produces 0 or 1 copies of the source table row)? Assert as in causes the query to fail or otherwise indicate that there are duplicate rows.
A common problem in a lot of queries is when a table is expected to be 1:1 with another table, but there might exist 2 rows that match the join criteria. This can cause errors that are hard to track down, especially for people not necessarily entirely familiar with the tables.
It seems like there should be something simple and elegant - this would be very easy for the SQL engine to detect (have I already joined this source row to a row in the other table? ok, error out) but I can't seem to find anything on this. I'm aware that there are long / intrusive solutions to this problem, but for many ad hoc queries those just aren't very fun to work out.
EDIT / CLARIFICATION: I'm looking for a one-step query-level fix. Not a verification step on the results of that query.
If you are only testing for linked rows rather than requiring output, then you'd use EXISTS.
More correctly, you need a "semi-join", but most RDBMS only support this in the form of EXISTS
SELECT a.*
FROM TableA a
WHERE EXISTS (SELECT * FROM TableB b WHERE a.id = b.id)
Also see:
Using 'IN' with a sub-query in SQL Statements
EXISTS vs JOIN and use of EXISTS clause
SELECT JoinField
FROM MyJoinTable
GROUP BY JoinField
HAVING COUNT(*) > 1
LIMIT 1
Is that simple enough? Don't have Postgres but I think it's valid syntax.
Something along the lines of
SELECT a.id, COUNT(b.id)
FROM TableA a
JOIN TableB b ON a.id = b.id
GROUP BY a.id
HAVING COUNT(b.id) > 1
Should return rows in TableA that have more than one associated row in TableB.
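A quick way to try the GROUP BY / HAVING check is against an in-memory SQLite database from Python. This is a sketch with invented sample rows and a made-up `note` column, and it is a separate verification query, not the one-step assertion the question asks for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (id INTEGER PRIMARY KEY);
    CREATE TABLE TableB (id INTEGER, note TEXT);   -- note is a made-up column
    INSERT INTO TableA VALUES (1), (2), (3);
    INSERT INTO TableB VALUES (1, 'only'), (2, 'first'), (2, 'second');
""")

# Rows of TableA that the join would duplicate, with their match counts.
duplicated = conn.execute("""
    SELECT a.id, COUNT(b.id)
    FROM TableA a
    JOIN TableB b ON a.id = b.id
    GROUP BY a.id
    HAVING COUNT(b.id) > 1
""").fetchall()

print(duplicated)
```

An empty result means the join is at most 1:1 for the current data; here id 2 is reported with two matches.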

Does the order of tables referenced in the ON clause of the JOIN matter?

Does it matter which way I order the criteria in the ON clause for a JOIN?
select a.Name, b.Status from a
inner join b
on a.StatusID = b.ID
versus
select a.Name, b.Status from a
inner join b
on b.ID = a.StatusID
Is there any impact on performance? What if I had multiple criteria?
Is one order more maintainable than another?
JOIN order can be forced by putting the tables in the right order in the FROM clause:
MySQL has a special clause called STRAIGHT_JOIN which makes the order matter.
This will use an index on b.id:
SELECT a.Name, b.Status
FROM a
STRAIGHT_JOIN
b
ON b.ID = a.StatusID
And this will use an index on a.StatusID:
SELECT a.Name, b.Status
FROM b
STRAIGHT_JOIN
a
ON b.ID = a.StatusID
Oracle has a special hint ORDERED to enforce the JOIN order:
This will use an index on b.id or build a hash table on b:
SELECT /*+ ORDERED */
*
FROM a
JOIN b
ON b.ID = a.StatusID
And this will use an index on a.StatusID or build a hash table on a:
SELECT /*+ ORDERED */
*
FROM b
JOIN a
ON b.ID = a.StatusID
SQL Server has a hint called FORCE ORDER to do the same:
This will use an index on b.id or build a hash table on b:
SELECT *
FROM a
JOIN b
ON b.ID = a.StatusID
OPTION (FORCE ORDER)
And this will use an index on a.StatusID or build a hash table on a:
SELECT *
FROM b
JOIN a
ON b.ID = a.StatusID
OPTION (FORCE ORDER)
PostgreSQL guys, sorry. Your TODO list says:
Optimizer hints (not wanted)
Optimizer hints are used to work around problems in the optimizer. We would rather have the problems reported and fixed.
As for the order in the comparison, it doesn't matter in any RDBMS, AFAIK.
Though I personally always try to estimate which column will be searched for and put this column in the left (for it to seem like an lvalue).
See this answer for more detail.
No it does not.
What I do (for readability) is your 2nd example.
No. The database should be determining the best execution plan based on the entire criteria, not creating it by looking at each item in sequence. You can confirm this by requesting the execution plan for both queries, you'll see they are the same (you'll find that even vastly different queries, as long as they ultimately specify the same logic, are often compiled into the same execution plan).
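As a rough illustration of that check, the sketch below asks SQLite (through Python's sqlite3) for the plan of both queries. SQLite is only a convenient stand-in here, and the sample rows and the choice of b.ID as primary key are assumptions; on another engine you would use its own EXPLAIN facility.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (Name TEXT, StatusID INTEGER);
    CREATE TABLE b (ID INTEGER PRIMARY KEY, Status TEXT);
    INSERT INTO b VALUES (1, 'open'), (2, 'closed');
    INSERT INTO a VALUES ('x', 1), ('y', 2), ('z', 1);
""")

q1 = "SELECT a.Name, b.Status FROM a INNER JOIN b ON a.StatusID = b.ID"
q2 = "SELECT a.Name, b.Status FROM a INNER JOIN b ON b.ID = a.StatusID"

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plan1, plan2 = plan(q1), plan(q2)
rows1 = sorted(conn.execute(q1))
rows2 = sorted(conn.execute(q2))

print(plan1)
print(plan2)
print(rows1 == rows2)
```

The exact plan text varies between SQLite versions, so the thing to compare is whether the two lists match each other, not what they say.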
No there is not. At the end of the day, you are really just evaluating whether a=b.
And as the symmetric property of equality states:
For any quantities a and b, if a = b, then b = a.
so whether you check for (12)*=12 or 12=(12)* makes logically no difference.
If values are equal, join, if not, don't. And whether you specify it as in your first example or the second, makes no difference.
As many have said: The order does not make a difference in result or performance.
What I want to point out though is that LINQ to SQL only allows the first case!
Eg, following example works well, ...
var result = from a in db.a
join b in db.b on a.StatusID equals b.ID
select new { Name = a.Name, Status = b.Status }
... while this will throw errors in Visual Studio:
var result = from a in db.a
join b in db.b on b.ID equals a.StatusID
select new { Name = a.Name, Status = b.Status }
Which throws these compiler errors:
CS1937: The name 'name' is not in scope on the left side of 'equals'. Consider swapping the expressions on either side of 'equals'.
CS1938: The name 'name' is not in scope on the right side of 'equals'. Consider swapping the expressions on either side of 'equals'.
Though not relevant in standard SQL coding, this might be a point to consider, when accustoming oneself to either one of those.
Read this
SqlServer contains an optimisation for situations far more complex than this.
If you have multiple criteria stuff is usually lazy evaluated (but I need to do a bit of research around edge cases if any.)
For readability I usually prefer
SELECT Name, Status FROM a
JOIN b
ON a.StatusID = b.ID
I think it makes better sense to reference the variables in the same order they were declared, but it's really a personal taste thing.
The only reason I wouldn't use your second example:
select a.Name, b.Status
from a
inner join b
on b.ID = a.StatusID
Your user is more likely to come back and say 'Can I see all the a.name's even if they have no status records?' rather than 'Can I see all of b.status even if they don't have a name record?', so just to plan ahead for this example, I would use On a.StatusID = b.ID in anticipation of a LEFT Outer Join. This assumes you could have table 'a' record without 'b'.
Correction: It won't change the result.
This is probably a moot point since users never want to change their requirements.
nope, doesn't matter. but here's an example to help make your queries more readable (at least to me)
select a.*, b.*
from tableA a
    inner join tableB b
        on a.name=b.name
        and a.type=b.type
each table reference is on a separate line, and each join criteria is on a separate line. the tabbing helps keep what belongs to what straight.
another thing i like to do is make my criteria in my on statements flow the same order as the table. so if a is first and then b, a will be on the left side and b on the right.
ERROR: ON clause references tables to its right (php sqlite 3.2)
Replace this
LEFT JOIN itm08 i8 ON i8.id = cdd01.idcmdds and i8.itm like '%ormit%'
LEFT JOIN comodidades cdd01 ON cdd01.id_registro = u.id_registro
With this
LEFT JOIN comodidades cdd01 ON cdd01.id_registro = u.id_registro
LEFT JOIN itm08 i8 ON i8.id = cdd01.idcmdds and i8.itm like '%ormit%'