This question already has answers here:
SQL JOIN vs IN performance?
(6 answers)
Closed 8 years ago.
I have another programmer who wrote a bunch of delete statements that look like this:
DELETE dbo.Test WHERE TestId IN (SELECT TestId FROM #Tests )
(This one is simple but there are others with sub and sub-sub in statements like this)
I always write those kinds of statements as a join. It seems to me that this is like having an in-line function that will be called over and over.
However, I know the optimizer is capable of some serious magic, and new things are added all the time. I have not researched the difference between Join vs In for a while and I thought I would ask if it is still something that should be a join.
Does it matter if you use "join" or "in"?
Most modern SQL optimizers will figure out a join from a clause like this, but it's not guaranteed, and the more complex the query gets, the less likely the optimizer will choose the proper action.
As a general rule, using IN in this sort of scenario is not a good practice. (personal opinion warning) It's really not meant to be used that way.
A good rule of thumb (again, this is debatable but not wrong) is, for using IN, stick to finite lists. For example:
SELECT DISTINCT * FROM foo WHERE id IN (1, 2, 3, ...);
When going against another table, one of these is preferable:
SELECT DISTINCT f.* FROM foo AS f
INNER JOIN bar as b on b.foo_id = f.id;
SELECT DISTINCT * FROM foo AS f
WHERE EXISTS (SELECT NULL FROM bar AS b WHERE b.foo_id = f.id);
Depending on what you are doing, and the nature of your data, your mileage will vary with these.
Note that in this simple example, the IN, the JOIN, and the EXISTS will very likely produce exactly the same query plan. When you start getting into some serious business logic against multiple tables, however, you may find the query plans significantly diverge.
There are three ways we can look at code. Does it functionally work? Does it provide good code maintenance/read-ability? And does it perform well?
Functionally speaking, there is no difference between writing the IN clause or using the join, if both preform the same operation.
From a maintenance/read-ability aspect, one could argue that in the simple cases the join syntax would be straightforward. However, if the sub-query used within the IN clause was a complex multi-join operation, then that may be more descriptive and easier to debug at a later time (put yourself in the shoes of the person who has to look at the code with limited context.)
Finally, from a performance perspective, this would depend on the number of rows in the tables, indexes available (including their statistics), and how the cost based optimizer handles the query ( which may vary depending on the SQL version) as to which would perform better.
So as with most decisions in the IT field, the real answer is … it depends.
The most effective route will be
Delete t1
From table1 t1
Inner Join table2 t2 on t1.col1=t2.col2
In table2 you can assign the temp table (#Tests) which will be much faster.
Related
I have a case where using a JOIN or an IN will give me the correct results... Which typically has better performance and why? How much does it depend on what database server you are running? (FYI I am using MSSQL)
Generally speaking, IN and JOIN are different queries that can yield different results.
SELECT a.*
FROM a
JOIN b
ON a.col = b.col
is not the same as
SELECT a.*
FROM a
WHERE col IN
(
SELECT col
FROM b
)
, unless b.col is unique.
However, this is the synonym for the first query:
SELECT a.*
FROM a
JOIN (
SELECT DISTINCT col
FROM b
)
ON b.col = a.col
If the joining column is UNIQUE and marked as such, both these queries yield the same plan in SQL Server.
If it's not, then IN is faster than JOIN on DISTINCT.
See this article in my blog for performance details:
IN vs. JOIN vs. EXISTS
This Thread is pretty old but still mentioned often. For my personal taste it is a bit incomplete, because there is another way to ask the database with the EXISTS keyword which I found to be faster more often than not.
So if you are only interested in values from table a you can use this query:
SELECT a.*
FROM a
WHERE EXISTS (
SELECT *
FROM b
WHERE b.col = a.col
)
The difference might be huge if col is not indexed, because the db does not have to find all records in b which have the same value in col, it only has to find the very first one. If there is no index on b.col and a lot of records in b a table scan might be the consequence. With IN or a JOIN this would be a full table scan, with EXISTS this would be only a partial table scan (until the first matching record is found).
If there a lots of records in b which have the same col value you will also waste a lot of memory for reading all these records into a temporary space just to find that your condition is satisfied. With exists this can be usually avoided.
I have often found EXISTS faster then IN even if there is an index. It depends on the database system (the optimizer), the data and last not least on the type of index which is used.
That's rather hard to say - in order to really find out which one works better, you'd need to actually profile the execution times.
As a general rule of thumb, I think if you have indices on your foreign key columns, and if you're using only (or mostly) INNER JOIN conditions, then the JOIN will be slightly faster.
But as soon as you start using OUTER JOIN, or if you're lacking foreign key indexes, the IN might be quicker.
Marc
A interesting writeup on the logical differences: SQL Server: JOIN vs IN vs EXISTS - the logical difference
I am pretty sure that assuming that the relations and indexes are maintained a Join will perform better overall (more effort goes into working with that operation then others). If you think about it conceptually then its the difference between 2 queries and 1 query.
You need to hook it up to the Query Analyzer and try it and see the difference. Also look at the Query Execution Plan and try to minimize steps.
Each database's implementation but you can probably guess that they all solve common problems in more or less the same way. If you are using MSSQL have a look at the execution plan that is generated. You can do this by turning on the profiler and executions plans. This will give you a text version when you run the command.
I am not sure what version of MSSQL you are using but you can get a graphical one in SQL Server 2000 in the query analyzer. I am sure that this functionality is lurking some where in SQL Server Studio Manager in later versions.
Have a look at the exeuction plan. As far as possible avoid table scans unless of course your table is small in which case a table scan is faster than using an index. Read up on the different join operations that each different scenario produces.
The optimizer should be smart enough to give you the same result either way for normal queries. Check the execution plan and they should give you the same thing. If they don't, I would normally consider the JOIN to be faster. All systems are different, though, so you should profile the code on your system to be sure.
I'm using a database that requires optimized queries and I'm wondering which one of those queries are the optimized one, I used a timer but the result are too close. so I do not have to clue which one to use.
QUERY 1:
SELECT A.MIG_ID_ACTEUR, A.FL_FACTURE_FSL , B.VAL_NOM,
B.VAL_PRENOM, C.VAL_CDPOSTAL, C.VAL_NOM_COMMUNE, D.CCB_ID_ACTEUR
FROM MIG_FACTURE A
INNER JOIN MIG_ACTEUR B
ON A.MIG_ID_ACTEUR= B.MIG_ID_ACTEUR
INNER JOIN MIG_ADRESSE C
ON C.MIG_ID_ADRESSE = B.MIG_ID_ADRESSE
INNER JOIN MIG_CORR_REF_ACTEUR D
ON A.MIG_ID_ACTEUR= D.MIG_ID_ACTEUR;
QUERY 2:
SELECT A.MIG_ID_ACTEUR, A.FL_FACTURE_FSL , B.VAL_NOM, B.VAL_PRENOM,
C.VAL_CDPOSTAL, C.VAL_NOM_COMMUNE, D.CCB_ID_ACTEUR
FROM MIG_FACTURE A , MIG_ACTEUR B, MIG_ADRESSE C, MIG_CORR_REF_ACTEUR D
WHERE A.MIG_ID_ACTEUR= B.MIG_ID_ACTEUR
AND C.MIG_ID_ADRESSE = B.MIG_ID_ADRESSE
AND A.MIG_ID_ACTEUR= D.MIG_ID_ACTEUR;
If you are asking whether it is more efficient to use the SQL 99 join syntax (a inner join b) or whether it is more efficient to use the older join syntax of listing the join predicates in the WHERE clause, it shouldn't matter. I'd expect that the query plans for the two queries would be identical. If the query plans are identical, performance will be identical. If the plans are not identical, that would generally imply that you had encountered a bug in the database's query parsing engine.
Personally, I'd use the SQL 99 syntax (query 1) both because it is more portable when you want to do an outer join and because it generally makes the query more readable and decreases the probability that you'll accidentally leave out a join condition. That's solely a readability and maintainability consideration, though, not a performance consideration.
First things first:
"I used a timer but the result are too close" -- This is actually not a good way to test performance. Databases have caches. The results you get back won't be comparable with a stopwatch. You have system load to contend with, caching, and a million other things that make that particular comparison worthless. Instead of that, try using EXPLAIN to figure out the execution plan. Use SHOW PROFILES and SHOW STATUS to see where and how the queries are spending time. Check last_query_cost. But don't check your stopwatch. That won't tell you anything.
Second: this question can't be answered with the info your provided. In point of fact the queries are identical (verify that with Explain) and simply boil down to implicit vs explicit joins. Doesn't make either one of them optimized though. Again, you need to dig into the join itself and see if it's making use of indices, for example, or if it's doing a lot temp tables or file sorts.
Optimizing the query is a good thing... but these two are the same. A stop watch won't help you. Use explain, show profiles, show status.... not a stop watch :-)
Of course this is not much of a factor for most small transactions, but with larger data sets, assuming there are useful indexes in place, is it faster to do an INSERT using a JOIN instead of using CASE?
Example using CASE (in T-SQL):
INSERT into Foo
(CarID, CarMake)
(SELECT CarID,
CASE WHEN CarModel = 'Odyssey' THEN 'Honda'
WHEN CarModel = 'Sienna' THEN 'Toyota' END
as CarMake
FROM FooCar)
Example using JOIN:
INSERT into Foo
(CarID, CarMakeID)
(SELECT A.CarID, B.CarMake
FROM FooCar as A
JOIN FooMake as B on (A.CarModel = B.CarModel)
)
Which of those two statements would meet the objective of more speed? Please look past the example data sample which does not represent the problem very well in terms of magnitude.
I can't find another question like this, but it seems like something that should be asked and answered already.
The case statement would, in general, be faster. The join is better in most other respects. The purpose of having reference tables is to avoid having to hard-code specific values in the code. It is better to use the reference table.
The performance difference would generally be very small, especially if FooMake is well designed and has an index on CarModel. You should not be looking for such micro-optimizations on queries. Strive first to get the query to work correctly, using reasonable SQL, and then move toward performance.
The two versions are only equivalent if all makes are present in and appear exactly once in FooMake.
Hard coded query values (like those string literals in your CASE statement) will get translated into in-memory machine language commands and will always be faster than the disk I/O required to perform the join.
The reason you have the table instead of CASE statements is maintainability not speed. What if you have the CASE statement act on two fields instead of one? Or add some branching logic? Hard-coding your queries for speed is like racing across the frozen tundra on a snowmobile which flips over, trapping you underneath. At night, the ice-weasels come.
I wish to know if I have a join query something like this -
Select E.Id,E.Name from Employee E join Dept D on E.DeptId=D.Id
and a subquery something like this -
Select E.Id,E.Name from Employee Where DeptId in (Select Id from Dept)
When I consider performance which of the two queries would be faster and why ?
Also is there a time when I should prefer one over the other?
Sorry if this is too trivial and asked before but I am confused about it. Also, it would be great if you guys can suggest me tools i should use to measure performance of two queries. Thanks a lot!
Well, I believe it's an "Old but Gold" question. The answer is: "It depends!".
The performances are such a delicate subject that it would be too much silly to say: "Never use subqueries, always join".
In the following links, you'll find some basic best practices that I have found to be very helpful:
Optimizing Subqueries
Optimizing Subqueries with Semijoin Transformations
Rewriting Subqueries as Joins
I have a table with 50000 elements, the result i was looking for was 739 elements.
My query at first was this:
SELECT p.id,
p.fixedId,
p.azienda_id,
p.categoria_id,
p.linea,
p.tipo,
p.nome
FROM prodotto p
WHERE p.azienda_id = 2699 AND p.anno = (
SELECT MAX(p2.anno)
FROM prodotto p2
WHERE p2.fixedId = p.fixedId
)
and it took 7.9s to execute.
My query at last is this:
SELECT p.id,
p.fixedId,
p.azienda_id,
p.categoria_id,
p.linea,
p.tipo,
p.nome
FROM prodotto p
WHERE p.azienda_id = 2699 AND (p.fixedId, p.anno) IN
(
SELECT p2.fixedId, MAX(p2.anno)
FROM prodotto p2
WHERE p.azienda_id = p2.azienda_id
GROUP BY p2.fixedId
)
and it took 0.0256s
Good SQL, good.
I would EXPECT the first query to be quicker, mainly because you have an equivalence and an explicit JOIN. In my experience IN is a very slow operator, since SQL normally evaluates it as a series of WHERE clauses separated by "OR" (WHERE x=Y OR x=Z OR...).
As with ALL THINGS SQL though, your mileage may vary. The speed will depend a lot on indexes (do you have indexes on both ID columns? That will help a lot...) among other things.
The only REAL way to tell with 100% certainty which is faster is to turn on performance tracking (IO Statistics is especially useful) and run them both. Make sure to clear your cache between runs!
Performance is based on the amount of data you are executing on...
If it is less data around 20k. JOIN works better.
If the data is more like 100k+ then IN works better.
If you do not need the data from the other table, IN is good, But it is alwys better to go for EXISTS.
All these criterias I tested and the tables have proper indexes.
Start to look at the execution plans to see the differences in how the SQl Server will interpret them. You can also use Profiler to actually run the queries multiple times and get the differnce.
I would not expect these to be so horribly different, where you can get get real, large performance gains in using joins instead of subqueries is when you use correlated subqueries.
EXISTS is often better than either of these two and when you are talking left joins where you want to all records not in the left join table, then NOT EXISTS is often a much better choice.
The performance should be the same; it's much more important to have the correct indexes and clustering applied on your tables (there exist some good resources on that topic).
(Edited to reflect the updated question)
I know this is an old post, but I think this is a very important topic, especially nowadays where we have 10M+ records and talk about terabytes of data.
I will also weight in with the following observations. I have about 45M records in my table ([data]), and about 300 records in my [cats] table. I have extensive indexing for all of the queries I am about to talk about.
Consider Example 1:
UPDATE d set category = c.categoryname
FROM [data] d
JOIN [cats] c on c.id = d.catid
versus Example 2:
UPDATE d set category = (SELECT TOP(1) c.categoryname FROM [cats] c where c.id = d.catid)
FROM [data] d
Example 1 took about 23 mins to run. Example 2 took around 5 mins.
So I would conclude that sub-query in this case is much faster. Of course keep in mind that I am using M.2 SSD drives capable of i/o # 1GB/sec (thats bytes not bits), so my indexes are really fast too. So this may affect the speeds too in your circumstance
If its a one-off data cleansing, probably best to just leave it run and finish. I use TOP(10000) and see how long it takes and multiply by number of records before I hit the big query.
If you are optimizing production databases, I would strongly suggest pre-processing data, i.e. use triggers or job-broker to async update records, so that real-time access retrieves static data.
The two queries may not be semantically equivalent. If a employee works for more than one department (possible in the enterprise I work for; admittedly, this would imply your table is not fully normalized) then the first query would return duplicate rows whereas the second query would not. To make the queries equivalent in this case, the DISTINCT keyword would have to be added to the SELECT clause, which may have an impact on performance.
Note there is a design rule of thumb that states a table should model an entity/class or a relationship between entities/classes but not both. Therefore, I suggest you create a third table, say OrgChart, to model the relationship between employees and departments.
You can use an Explain Plan to get an objective answer.
For your problem, an Exists filter would probably perform the fastest.
SELECT * FROM TableA
INNER JOIN TableB
ON TableA.name = TableB.name
SELECT * FROM TableA, TableB
where TableA.name = TableB.name
Which is the preferred way and why?
Will there be any performance difference when keywords like JOIN is used?
Thanks
The second way is the classical way of doing it, from before the join keyword existed.
Normally the query processor generates the same database operations from the two queries, so there would be no difference in performance.
Using join better describes what you are doing in the query. If you have many joins, it's also better because the joined table and it's condition are beside each other, instead of putting all tables in one place and all conditions in another.
Another aspect is that it's easier to do an unbounded join by mistake using the second way, resulting in a cross join containing all combinations from the two tables.
Use the first one, as it is:
More explicit
Is the Standard way
As for performance - there should be no difference.
find out by using EXPLAIN SELECT …
it depends on the engine used, on the query optimizer, on the keys, on the table; on pretty much everything
In some SQL engines the second form (associative joins) is depreicated. Use the first form.
Second is less explicit, causes begginers to SQL to pause when writing code. Is much more difficult to manage in complex SQL due to the sequence of the join match requirement to match the WHERE clause sequence - they (squence in the code) must match or the results returned will change making the returned data set change which really goes against the thought that sequence should not change the results when elements at the same level are considered.
When joins containing multiple tables are created, it gets REALLY difficult to code, quite fast using the second form.
EDIT: Performance: I consider coding, debugging ease part of personal performance, thus ease of edit/debug/maintenance is better performant using the first form - it just takes me less time to do/understand stuff during the development and maintenance cycles.
Most current databases will optimize both of those queries into the exact same execution plan. However, use the first syntax, it is the current standard. By learning and using this join syntax, it will help when you do queries with LEFT OUTER JOIN and RIGHT OUTER JOIN. which become tricky and problematic using the older syntax with the joins in the WHERE clause.
Filtering joins solely using WHERE can be extremely inefficient in some common scenarios. For example:
SELECT * FROM people p, companies c WHERE p.companyID = c.id AND p.firstName = 'Daniel'
Most databases will execute this query quite literally, first taking the Cartesian product of the people and companies tables and then filtering by those which have matching companyID and id fields. While the fully-unconstrained product does not exist anywhere but in memory and then only for a moment, its calculation does take some time.
A better approach is to group the constraints with the JOINs where relevant. This is not only subjectively easier to read but also far more efficient. Thusly:
SELECT * FROM people p JOIN companies c ON p.companyID = c.id
WHERE p.firstName = 'Daniel'
It's a little longer, but the database is able to look at the ON clause and use it to compute the fully-constrained JOIN directly, rather than starting with everything and then limiting down. This is faster to compute (especially with large data sets and/or many-table joins) and requires less memory.
I change every query I see which uses the "comma JOIN" syntax. In my opinion, the only purpose for its existence is conciseness. Considering the performance impact, I don't think this is a compelling reason.