INNER JOIN keywords | with and without using them - sql

SELECT * FROM TableA
INNER JOIN TableB
ON TableA.name = TableB.name

SELECT * FROM TableA, TableB
WHERE TableA.name = TableB.name
Which is the preferred way, and why?
Will there be any performance difference when keywords like JOIN are used?
Thanks

The second way is the classical way of doing it, from before the join keyword existed.
Normally the query processor generates the same database operations from the two queries, so there would be no difference in performance.
Using JOIN better describes what you are doing in the query. If you have many joins, it's also better because each joined table and its condition sit next to each other, instead of all tables being in one place and all conditions in another.
Another aspect is that with the second way it's easier to do an unbounded join by mistake, resulting in a cross join containing all combinations of rows from the two tables.
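This risk is easy to demonstrate. Below is a minimal sketch using Python's built-in sqlite3 module (the table names follow the question; the data is made up): leaving out the join condition in the comma form silently yields every combination of rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE TableA (name TEXT)")
cur.execute("CREATE TABLE TableB (name TEXT)")
cur.executemany("INSERT INTO TableA VALUES (?)", [("ann",), ("bob",), ("cat",)])
cur.executemany("INSERT INTO TableB VALUES (?)", [("ann",), ("bob",)])

# Explicit join: only the 2 matching rows come back.
joined = cur.execute(
    "SELECT * FROM TableA INNER JOIN TableB ON TableA.name = TableB.name"
).fetchall()

# Comma join with the WHERE clause accidentally left off: every
# combination of rows from the two tables (3 * 2 = 6 rows).
unbounded = cur.execute("SELECT * FROM TableA, TableB").fetchall()

print(len(joined), len(unbounded))  # 2 6
```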

Use the first one, as it is:
More explicit
The standard way
As for performance - there should be no difference.

Find out by using EXPLAIN SELECT …
It depends on the engine used, on the query optimizer, on the keys, on the table; on pretty much everything.
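As a concrete illustration, SQLite's EXPLAIN QUERY PLAN output can be compared for the two forms. This is a sketch using Python's built-in sqlite3 module; other engines have their own EXPLAIN variants, and their output will look different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE TableA (name TEXT)")
cur.execute("CREATE TABLE TableB (name TEXT)")

# Ask the planner how it would run each form of the query.
plan_join = cur.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM TableA INNER JOIN TableB ON TableA.name = TableB.name"
).fetchall()
plan_comma = cur.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM TableA, TableB WHERE TableA.name = TableB.name"
).fetchall()

# SQLite plans both forms identically.
print(plan_join == plan_comma)  # True
```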

In some SQL engines the second form (associative joins) is deprecated. Use the first form.
The second is less explicit and causes beginners to SQL to pause when writing code. It is much more difficult to manage in complex SQL, because the sequence of the joins has to match the sequence of the conditions in the WHERE clause; if the sequences in the code do not match, the data set returned will change, which really goes against the idea that sequence should not change the results when elements at the same level are considered.
When joins containing multiple tables are created, the second form gets REALLY difficult to code quite fast.
EDIT: Performance: I consider ease of coding and debugging part of personal performance, so ease of editing, debugging, and maintenance is better with the first form; it simply takes me less time to do and understand things during the development and maintenance cycles.

Most current databases will optimize both of those queries into the exact same execution plan. However, use the first syntax, as it is the current standard. Learning and using this join syntax will also help when you write queries with LEFT OUTER JOIN and RIGHT OUTER JOIN, which become tricky and problematic using the older syntax with the joins in the WHERE clause.

Filtering joins solely using WHERE can be extremely inefficient in some common scenarios. For example:
SELECT * FROM people p, companies c WHERE p.companyID = c.id AND p.firstName = 'Daniel'
Some databases may execute this query quite literally, first taking the Cartesian product of the people and companies tables and then filtering down to the rows with matching companyID and id fields. While the fully unconstrained product does not exist anywhere but in memory, and then only for a moment, its calculation does take some time.
A better approach is to group the constraints with the JOINs where relevant. This is not only subjectively easier to read but also far more efficient. Thus:
SELECT * FROM people p JOIN companies c ON p.companyID = c.id
WHERE p.firstName = 'Daniel'
It's a little longer, but the database is able to look at the ON clause and use it to compute the fully-constrained JOIN directly, rather than starting with everything and then limiting down. This is faster to compute (especially with large data sets and/or many-table joins) and requires less memory.
I change every query I see which uses the "comma JOIN" syntax. In my opinion, the only purpose for its existence is conciseness. Considering the performance impact, I don't think this is a compelling reason.
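For what it's worth, on SQLite the two forms are planned identically, so the performance gap described above may not materialize on every engine. The sketch below (hypothetical people/companies data, using Python's built-in sqlite3 module) at least verifies that both forms return exactly the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, companyID INTEGER, firstName TEXT)"
)
cur.executemany("INSERT INTO companies VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
cur.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(1, 1, "Daniel"), (2, 2, "Daniel"), (3, 1, "Eve")],
)

# Comma join, everything in the WHERE clause.
comma_form = cur.execute(
    "SELECT p.id, c.name FROM people p, companies c "
    "WHERE p.companyID = c.id AND p.firstName = 'Daniel' ORDER BY p.id"
).fetchall()

# Explicit JOIN with the join condition in ON.
join_form = cur.execute(
    "SELECT p.id, c.name FROM people p JOIN companies c ON p.companyID = c.id "
    "WHERE p.firstName = 'Daniel' ORDER BY p.id"
).fetchall()

print(comma_form == join_form)  # True; both return [(1, 'Acme'), (2, 'Globex')]
```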

Related

Does my previous SQL query/ies affect my current query?

I have multiple SQL queries that I run one after the other to get a set of data. In each query, there are a bunch of tables joined that are exactly the same with the other queries. For example:
Query1
SELECT * FROM
Product1TableA A1
INNER JOIN Product1TableB B on A1.BId = B.Id
INNER JOIN CommonTable1 C on C.Id = B.CId
INNER JOIN CommonTable2 D on D.Id = B.DId
...
Query2
SELECT * FROM Product2TableA A2
INNER JOIN Product2TableB B on A2.BId = B.Id
INNER JOIN CommonTable1 C on C.Id = B.CId
INNER JOIN CommonTable2 D on D.Id = B.DId
...
I am playing around re-ordering the joins (around 2 dozen tables joined per query) and I read here that they should not really affect query execution unless SQL "gives up" during optimization because of how big the query is...
What I am wondering is if bunching up common table joins at the start of all my queries actually helps...
In theory, the order of the joins in the FROM clause makes no difference to query performance. For a small number of tables there should be no difference; the optimizer should find the best execution path.
For a larger number of tables, the optimizer may have to short-circuit its search regarding join order. It would then be using heuristics -- and these could be affected by join order.
Earlier queries would have no effect on a particular execution plan.
If you are having problems with performance, I am guessing that join order is not the root cause. The most common problem that I have in SQL Server are inappropriate nested-loop joins -- and these can be handled with an optimizer hint.
I think I understood what he was trying to say/to do:
What I am wondering is if bunching up common table joins at the start
of all my queries actually helps...
Imagine that you have some queries and every query has more than 3 inner joins. The queries are different but always have (for example) 3 tables in common that are joined on the same fields. Now the question is:
what will happen if every query will start with these 3 tables in join, and all the other tables are joined after?
The answer is that it will change nothing; the optimizer will rearrange the tables in whatever way it thinks will lead to the optimal execution.
Things may change if, for example, you save the result of these 3 joins into a temporary table and then use this saved result to join with the other tables. But this depends on the filters that your queries use. If you have appropriate indexes and your query filters are selective enough (so that your query returns very few rows), there is no need to cache the intermediate unfiltered result, which has too many rows, because the optimizer can choose to first filter every table and only then join them.
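The temporary-table idea above can be sketched as follows, using SQLite via Python's sqlite3 module (table names are borrowed from the question; the columns and data are invented for illustration): the shared join is materialized once, and each product query then joins against the cached result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE CommonTable1 (Id INTEGER PRIMARY KEY, val TEXT);
CREATE TABLE CommonTable2 (Id INTEGER PRIMARY KEY, c1_id INTEGER);
CREATE TABLE Product1TableA (Id INTEGER PRIMARY KEY, c2_id INTEGER);
INSERT INTO CommonTable1 VALUES (1, 'x'), (2, 'y');
INSERT INTO CommonTable2 VALUES (10, 1), (20, 2);
INSERT INTO Product1TableA VALUES (100, 10), (200, 20);

-- Materialize the shared join once ...
CREATE TEMP TABLE common AS
SELECT c2.Id AS c2_id, c1.val
FROM CommonTable2 c2 JOIN CommonTable1 c1 ON c1.Id = c2.c1_id;
""")

# ... then each product query joins against the cached result
# instead of repeating the common joins.
rows = cur.execute(
    "SELECT a.Id, c.val FROM Product1TableA a "
    "JOIN common c ON c.c2_id = a.c2_id ORDER BY a.Id"
).fetchall()
print(rows)  # [(100, 'x'), (200, 'y')]
```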
Gordon's answer is a good explanation, but this answer explains the JOIN's behavior and also specifies that SQL Server's version is relevant:
Although the join order is changed in optimisation, the optimiser
doesn't try all possible join orders. It stops when it finds what it
considers a workable solution as the very act of optimisation uses
precious resources.
While the optimizer tries its best to choose a good order for the JOINs, having many JOINs creates a bigger chance of ending up with a not-so-good plan.
Personally, I have seen many JOINs in some views within an ERP and they usually ran ok. However, from time to time (based on client's data volume, instance configuration etc.), some selects from these views took much more than expected.
If this data reaches an actual application (.NET, Java etc.), one option is to cache the information from all the small tables, store it as dictionaries (hashes) and perform O(1) lookups based on the keys.
This provides the advantages of reducing the JOIN count and not performing reads from the database for these tables (except once when caching data). However, this increases the complexity of the application (cache management).
Another solution is to use temporary tables and populate them across multiple queries to avoid many JOINs per single query. This solution usually performs better and also improves debuggability (if the query does not return the correct data, or no data at all, which of the 10-15 JOINs is the problem?).
So, my answer to your question is: you might get some benefit from reordering the JOIN clauses, but I recommend avoiding lots of JOINs in the first place.

correct query design? cross joins driving ad-hoc reporting interface

I'm hoping some of the more experienced database/dwh developers or DBAs can weigh in on this one:
My team is using OBIEE as a front-end tool to drive ad-hoc reporting being done by our business units.
There is a lot of latency when generating sets that are relatively small. We are facing ~1 hour to produce ~50k records.
I looked into one of the queries that is behaving this way, and I was surprised to find that all of the tables being referenced are being cross-joined, and then filters are being applied in the WHERE clause.
So, to illustrate, the queries tend to look like this:
SELECT ...
FROM tbl1
,tbl2
,tbl3
,tbl4
WHERE tbl1.col1 = tbl2.col1
and tbl3.col2 = tbl2.col2
and tbl4.col3 = tbl3.col3
instead of like this:
SELECT ...
FROM tbl1
INNER JOIN tbl2
ON tbl1.col1 = tbl2.col1
INNER JOIN tbl3
ON tbl3.col2 = tbl2.col2
INNER JOIN tbl4
ON tbl4.col3 = tbl3.col3
Now, from what I know about the order of query operations, the FROM clause is evaluated before the WHERE clause, so the first example would perform much more slowly than the latter. Am I correct (please answer only if you know the answer in the context of Oracle DB)? Unfortunately, I don't have the admin rights to run a trace against the two different versions of the query.
Is there a reason to set up the query the first way, related to how the OBIEE interface works? Remember, the query is the result of a user drag-and-dropping attributes into a sandbox, from a 'bank' of attributes. Selecting any combination of the attributes is supposed to generate output (if the data exists). The attributes come from many different tables. I don't have any experience in designing the mechanism that generates the SQL based on this kind of ad-hoc attribute selection, so I don't know whether the query design in the first example is required to service this kind of reporting tool.
Don't worry: historically Oracle used the first notation for inner joins, but it later adopted the ANSI SQL standard.
The results in terms of performance and returned record sets are exactly the same; the implicit 'comma' joins are not producing a cross product but effectively applying the WHERE filters as join conditions. If you doubt it, run an EXPLAIN PLAN for both queries and you will see that the predicted execution plans are identical.
Expanding on this answer: you may in the future notice the analogous (+) notation used in place of ANSI outer joins. This answer also stands correct in that context.
The real issue comes when both notations (implicit and explicit joins) are mixed in the same query. That would be asking for trouble big time, but I doubt you'll find such a case in OBIEE.
Those are inner joins, not cross joins, they just use the old syntax for doing it rather than ANSI as you were expecting.
Most join queries contain at least one join condition, either in the FROM clause or in the WHERE clause. (Oracle Documentation)
For a simple query such as in your example the execution should be exactly the same.
Where you have set up outer joins (in the business model join) you will see OBI produce a query where the inner joins are made in the WHERE clause and the outer joins are done ANSI-style in the FROM clause, just to make things really hard to debug!
SELECT ...
FROM tbl1
,tbl2
,tbl3 LEFT OUTER JOIN tbl4 ON tbl3.col1 = tbl4.col2
WHERE tbl1.col1 = tbl2.col1
and tbl3.col2 = tbl2.col2
and tbl4.col3 = tbl3.col3

Optimizing INNER JOIN query performance

I'm using a database that requires optimized queries, and I'm wondering which of these queries is the optimized one. I used a timer, but the results are too close to tell, so I have no clue which one to use.
QUERY 1:
SELECT A.MIG_ID_ACTEUR, A.FL_FACTURE_FSL , B.VAL_NOM,
B.VAL_PRENOM, C.VAL_CDPOSTAL, C.VAL_NOM_COMMUNE, D.CCB_ID_ACTEUR
FROM MIG_FACTURE A
INNER JOIN MIG_ACTEUR B
ON A.MIG_ID_ACTEUR= B.MIG_ID_ACTEUR
INNER JOIN MIG_ADRESSE C
ON C.MIG_ID_ADRESSE = B.MIG_ID_ADRESSE
INNER JOIN MIG_CORR_REF_ACTEUR D
ON A.MIG_ID_ACTEUR= D.MIG_ID_ACTEUR;
QUERY 2:
SELECT A.MIG_ID_ACTEUR, A.FL_FACTURE_FSL , B.VAL_NOM, B.VAL_PRENOM,
C.VAL_CDPOSTAL, C.VAL_NOM_COMMUNE, D.CCB_ID_ACTEUR
FROM MIG_FACTURE A , MIG_ACTEUR B, MIG_ADRESSE C, MIG_CORR_REF_ACTEUR D
WHERE A.MIG_ID_ACTEUR= B.MIG_ID_ACTEUR
AND C.MIG_ID_ADRESSE = B.MIG_ID_ADRESSE
AND A.MIG_ID_ACTEUR= D.MIG_ID_ACTEUR;
If you are asking whether it is more efficient to use the SQL 99 join syntax (a inner join b) or whether it is more efficient to use the older join syntax of listing the join predicates in the WHERE clause, it shouldn't matter. I'd expect that the query plans for the two queries would be identical. If the query plans are identical, performance will be identical. If the plans are not identical, that would generally imply that you had encountered a bug in the database's query parsing engine.
Personally, I'd use the SQL 99 syntax (query 1) both because it is more portable when you want to do an outer join and because it generally makes the query more readable and decreases the probability that you'll accidentally leave out a join condition. That's solely a readability and maintainability consideration, though, not a performance consideration.
First things first:
"I used a timer but the result are too close" -- This is actually not a good way to test performance. Databases have caches. The results you get back won't be comparable with a stopwatch. You have system load to contend with, caching, and a million other things that make that particular comparison worthless. Instead of that, try using EXPLAIN to figure out the execution plan. Use SHOW PROFILES and SHOW STATUS to see where and how the queries are spending time. Check last_query_cost. But don't check your stopwatch. That won't tell you anything.
Second: this question can't be answered with the info you provided. In point of fact the queries are identical (verify that with EXPLAIN) and simply boil down to implicit vs explicit joins. That doesn't make either one of them optimized, though. Again, you need to dig into the join itself and see if it's making use of indices, for example, or if it's creating a lot of temp tables or doing file sorts.
Optimizing the query is a good thing... but these two are the same. A stopwatch won't help you. Use EXPLAIN, SHOW PROFILES, SHOW STATUS... not a stopwatch :-)

SQL "in" vs Join for delete [duplicate]

This question already has answers here:
SQL JOIN vs IN performance?
(6 answers)
Closed 8 years ago.
I have another programmer who wrote a bunch of delete statements that look like this:
DELETE dbo.Test WHERE TestId IN (SELECT TestId FROM #Tests )
(This one is simple but there are others with sub and sub-sub in statements like this)
I always write those kinds of statements as a join. It seems to me that this is like having an in-line function that will be called over and over.
However, I know the optimizer is capable of some serious magic, and new things are added all the time. I have not researched the difference between Join vs In for a while and I thought I would ask if it is still something that should be a join.
Does it matter if you use "join" or "in"?
Most modern SQL optimizers will figure out a join from a clause like this, but it's not guaranteed, and the more complex the query gets, the less likely the optimizer will choose the proper action.
As a general rule, using IN in this sort of scenario is not a good practice. (personal opinion warning) It's really not meant to be used that way.
A good rule of thumb (again, this is debatable but not wrong) is, for using IN, stick to finite lists. For example:
SELECT DISTINCT * FROM foo WHERE id IN (1, 2, 3, ...);
When going against another table, one of these is preferable:
SELECT DISTINCT f.* FROM foo AS f
INNER JOIN bar as b on b.foo_id = f.id;
SELECT DISTINCT * FROM foo AS f
WHERE EXISTS (SELECT NULL FROM bar AS b WHERE b.foo_id = f.id);
Depending on what you are doing, and the nature of your data, your mileage will vary with these.
Note that in this simple example, the IN, the JOIN, and the EXISTS will very likely produce exactly the same query plan. When you start getting into some serious business logic against multiple tables, however, you may find the query plans significantly diverge.
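A quick sanity check of that claim, using Python's sqlite3 module with made-up foo/bar data: all three formulations return the same rows. Note the DISTINCT in the JOIN version, which is needed because a foo row can match several bar rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE bar (id INTEGER PRIMARY KEY, foo_id INTEGER)")
cur.executemany("INSERT INTO foo VALUES (?)", [(1,), (2,), (3,)])
# foo id 1 matches twice, id 3 once, id 2 never.
cur.executemany("INSERT INTO bar VALUES (?, ?)", [(10, 1), (11, 1), (12, 3)])

via_in = cur.execute(
    "SELECT id FROM foo WHERE id IN (SELECT foo_id FROM bar) ORDER BY id"
).fetchall()
via_join = cur.execute(
    "SELECT DISTINCT f.id FROM foo f JOIN bar b ON b.foo_id = f.id ORDER BY f.id"
).fetchall()
via_exists = cur.execute(
    "SELECT id FROM foo f WHERE EXISTS "
    "(SELECT NULL FROM bar b WHERE b.foo_id = f.id) ORDER BY id"
).fetchall()

print(via_in == via_join == via_exists)  # True; all return [(1,), (3,)]
```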
There are three ways we can look at code. Does it functionally work? Does it provide good code maintenance/read-ability? And does it perform well?
Functionally speaking, there is no difference between writing the IN clause and using the join, if both perform the same operation.
From a maintenance/read-ability aspect, one could argue that in the simple cases the join syntax would be straightforward. However, if the sub-query used within the IN clause was a complex multi-join operation, then that may be more descriptive and easier to debug at a later time (put yourself in the shoes of the person who has to look at the code with limited context.)
Finally, from a performance perspective, this would depend on the number of rows in the tables, indexes available (including their statistics), and how the cost based optimizer handles the query ( which may vary depending on the SQL version) as to which would perform better.
So as with most decisions in the IT field, the real answer is … it depends.
The most effective route will be:
DELETE t1
FROM table1 t1
INNER JOIN table2 t2 ON t1.col1 = t2.col2
For table2 you can use the temp table (#Tests), which will be much faster.
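Note that the DELETE ... FROM ... JOIN form above is T-SQL (SQL Server); not every engine accepts it. SQLite, for instance, does not, but the same effect can be had with a correlated EXISTS, as in this sketch using Python's sqlite3 module (hypothetical data, with a plain Tests table standing in for the #Tests temp table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Test (TestId INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE Tests (TestId INTEGER)")  # stand-in for #Tests
cur.executemany("INSERT INTO Test VALUES (?)", [(1,), (2,), (3,), (4,)])
cur.executemany("INSERT INTO Tests VALUES (?)", [(2,), (4,)])

# Correlated EXISTS: delete only the rows that have a match in Tests.
cur.execute(
    "DELETE FROM Test WHERE EXISTS "
    "(SELECT 1 FROM Tests t WHERE t.TestId = Test.TestId)"
)
remaining = cur.execute("SELECT TestId FROM Test ORDER BY TestId").fetchall()
print(remaining)  # [(1,), (3,)]
```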

SQL Joins Vs SQL Subqueries (Performance)?

I wish to know if I have a join query something like this -
Select E.Id,E.Name from Employee E join Dept D on E.DeptId=D.Id
and a subquery something like this -
Select E.Id,E.Name from Employee E where E.DeptId in (Select Id from Dept)
When I consider performance which of the two queries would be faster and why ?
Also is there a time when I should prefer one over the other?
Sorry if this is too trivial and has been asked before, but I am confused about it. Also, it would be great if you could suggest tools I should use to measure the performance of the two queries. Thanks a lot!
Well, I believe it's an "Old but Gold" question. The answer is: "It depends!".
Performance is such a delicate subject that it would be foolish to say: "Never use subqueries, always join".
In the following links, you'll find some basic best practices that I have found to be very helpful:
Optimizing Subqueries
Optimizing Subqueries with Semijoin Transformations
Rewriting Subqueries as Joins
I have a table with 50,000 elements; the result I was looking for was 739 elements.
My query at first was this:
SELECT p.id,
p.fixedId,
p.azienda_id,
p.categoria_id,
p.linea,
p.tipo,
p.nome
FROM prodotto p
WHERE p.azienda_id = 2699 AND p.anno = (
SELECT MAX(p2.anno)
FROM prodotto p2
WHERE p2.fixedId = p.fixedId
)
and it took 7.9s to execute.
My query at last is this:
SELECT p.id,
p.fixedId,
p.azienda_id,
p.categoria_id,
p.linea,
p.tipo,
p.nome
FROM prodotto p
WHERE p.azienda_id = 2699 AND (p.fixedId, p.anno) IN
(
SELECT p2.fixedId, MAX(p2.anno)
FROM prodotto p2
WHERE p.azienda_id = p2.azienda_id
GROUP BY p2.fixedId
)
and it took 0.0256s
Good SQL, good.
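The rewrite can be reproduced in miniature with Python's sqlite3 module (invented data; the row-value IN syntax requires SQLite 3.15+). The point is that both forms return the same rows, while the second replaces a per-row correlated MAX() with a single grouped subquery:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE prodotto (id INTEGER, fixedId INTEGER, "
    "azienda_id INTEGER, anno INTEGER)"
)
cur.executemany(
    "INSERT INTO prodotto VALUES (?, ?, ?, ?)",
    [(1, 7, 2699, 2019), (2, 7, 2699, 2021), (3, 8, 2699, 2020)],
)

# Original form: the inner MAX() is re-evaluated for each outer row.
slow = cur.execute(
    "SELECT p.id FROM prodotto p WHERE p.azienda_id = 2699 AND p.anno = "
    "(SELECT MAX(p2.anno) FROM prodotto p2 WHERE p2.fixedId = p.fixedId) "
    "ORDER BY p.id"
).fetchall()

# Rewritten form: one grouped subquery, matched with a row-value IN.
fast = cur.execute(
    "SELECT p.id FROM prodotto p WHERE p.azienda_id = 2699 AND "
    "(p.fixedId, p.anno) IN (SELECT p2.fixedId, MAX(p2.anno) "
    "FROM prodotto p2 WHERE p2.azienda_id = p.azienda_id GROUP BY p2.fixedId) "
    "ORDER BY p.id"
).fetchall()

print(slow == fast)  # True; both return [(2,), (3,)]
```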
I would EXPECT the first query to be quicker, mainly because you have an equivalence and an explicit JOIN. In my experience IN is a very slow operator, since SQL normally evaluates it as a series of WHERE clauses separated by "OR" (WHERE x=Y OR x=Z OR...).
As with ALL THINGS SQL though, your mileage may vary. The speed will depend a lot on indexes (do you have indexes on both ID columns? That will help a lot...) among other things.
The only REAL way to tell with 100% certainty which is faster is to turn on performance tracking (IO Statistics is especially useful) and run them both. Make sure to clear your cache between runs!
Performance depends on the amount of data you are executing on...
If it is less data, around 20k rows, JOIN works better.
If the data is more, like 100k+ rows, IN works better.
If you do not need the data from the other table, IN is good, but it is always better to go for EXISTS.
I tested all these criteria, and the tables had proper indexes.
Start by looking at the execution plans to see the differences in how SQL Server will interpret them. You can also use Profiler to actually run the queries multiple times and get the difference.
I would not expect these to be so horribly different; where you can get real, large performance gains by using joins instead of subqueries is when you use correlated subqueries.
EXISTS is often better than either of these two, and when you are talking about left joins where you want all the records not in the left-join table, then NOT EXISTS is often a much better choice.
The performance should be the same; it's much more important to have the correct indexes and clustering applied on your tables (there exist some good resources on that topic).
(Edited to reflect the updated question)
I know this is an old post, but I think this is a very important topic, especially nowadays when we have 10M+ records and talk about terabytes of data.
I will also weigh in with the following observations. I have about 45M records in my table ([data]), and about 300 records in my [cats] table. I have extensive indexing for all of the queries I am about to talk about.
Consider Example 1:
UPDATE d set category = c.categoryname
FROM [data] d
JOIN [cats] c on c.id = d.catid
versus Example 2:
UPDATE d set category = (SELECT TOP(1) c.categoryname FROM [cats] c where c.id = d.catid)
FROM [data] d
Example 1 took about 23 mins to run. Example 2 took around 5 mins.
So I would conclude that the sub-query in this case is much faster. Of course, keep in mind that I am using M.2 SSD drives capable of I/O at 1 GB/sec (that's bytes, not bits), so my indexes are really fast too. This may affect the speeds in your circumstances as well.
If it's a one-off data cleansing job, it's probably best to just let it run and finish. I use TOP(10000) and see how long it takes, then multiply by the number of records before I run the big query.
If you are optimizing production databases, I would strongly suggest pre-processing data, i.e. use triggers or job-broker to async update records, so that real-time access retrieves static data.
The two queries may not be semantically equivalent. If an employee works for more than one department (possible in the enterprise I work for; admittedly, this would imply your table is not fully normalized), then the first query would return duplicate rows, whereas the second query would not. To make the queries equivalent in this case, the DISTINCT keyword would have to be added to the SELECT clause, which may have an impact on performance.
Note there is a design rule of thumb that states a table should model an entity/class or a relationship between entities/classes but not both. Therefore, I suggest you create a third table, say OrgChart, to model the relationship between employees and departments.
You can use an Explain Plan to get an objective answer.
For your problem, an Exists filter would probably perform the fastest.