Update 11/2
After some additional troubleshooting, my team was able to tie this Oracle bug directly to a parameter change that was made on the 12c database the night before the query stopped working. After experiencing some performance issues from an application tied to this database, my team had our DBA change the OPTIMIZER_FEATURES_ENABLE parameter from 12.1.0.2 to 11.2.0.4. This fixed the performance issue for the problem application but caused the bug I have described above. To verify, I've been able to replicate this same issue in a separate environment by changing this parameter. My DBA has filed a ticket with Oracle to have this looked at.
As a workaround, I was able to make a slight change to my query in order to retrieve the expected results. Specifically, I combined Subquery1 with Subquery2, and I moved a few predicates in Subquery1 from the WHERE clause to the JOIN (where they more properly belonged). This change altered my execution plan (it is slightly less efficient than the one listed before) but was enough to address the original issue.
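For illustration only (the real tables and predicates are confidential), the general shape of that change looks something like this, with made-up column names (Status and Region are hypothetical):

--Hypothetical sketch: moving a correlation predicate from the WHERE clause
--into the JOIN condition where it logically belongs.
--Before:
SELECT t1.Group_No
FROM   Table1 t1
JOIN   Table2 t2 ON t2.Group_No = t1.Group_No
WHERE  t1.Status = 'A'
AND    t2.Region = t1.Region;

--After:
SELECT t1.Group_No
FROM   Table1 t1
JOIN   Table2 t2 ON  t2.Group_No = t1.Group_No
                 AND t2.Region = t1.Region
WHERE  t1.Status = 'A';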
Original Post
Firstly, let me apologize for any vagueness in this question but I'm dealing with a confidential financial system so I am forced to hide certain implementation details.
Background
I have an Oracle query that I put into production a long time ago and that recently stopped producing the expected results, coincident with an upgrade from 11g to 12c. To my (and my production support team's) knowledge, this query had been working fine for well over a year before that.
Details
The query is overly complicated and not very efficient but this is in large part because I am dealing with non-normalized tables (historically modeled after a Mainframe) and poor data input from upstream systems. In order to deal with a complicated business situation I leveraged multiple levels of Subquery Factoring (the WITH statement) and then my final statement joins together two Inline Views. The basic structure of the query without all of the complicated predicates is as follows:
I have 3 tables Table1, Table2, Table3. Table1 is a processing table made up of records from Table2.
--This grabs a subset from Table1
WITH Subquery1 as (
SELECT ... FROM Table1),
--This eliminates certain records from the first subset based on sister records
--from the original source table
Subquery2 as (
SELECT ... FROM Subquery1
WHERE NOT EXISTS (SELECT ... FROM Table2)),
--This ties the records from Subquery2 to Table3
Subquery3 as (
SELECT ... FROM Table3
JOIN (SELECT Max(Date) FROM Table3)
JOIN Subquery2)
--This final query evaluates Subquery3 in two different ways and
--only takes those records which fit the criteria from both sets
SELECT ... FROM
(SELECT ... FROM Subquery3) -- Call this Inline View A
JOIN (SELECT ... FROM Subquery3) -- Call this Inline View B
The final query is pretty basic:
SELECT A.Group_No, B.Sub_Group, B.Key, B.Lob
FROM (SELECT Group_No, Lob, COUNT(Sub_Group)
FROM Subquery3
GROUP BY Group_No, Lob
HAVING COUNT(Sub_Group) = 1) A
JOIN (SELECT Group_No, Sub_Group, Key, Lob
FROM Subquery3
WHERE Sub_Group LIKE '0000%') B
ON A.Group_No = B.Group_No
AND A.Lob = B.Lob
Problem
If I edit the final query to remove the second Inline View and evaluate the output of the A inline view alone, I come away with 0 returned rows. I've manually evaluated the records for each individual subquery and can confirm this is the expected result.
Likewise, if I edit the final query to produce the output of only the 'B' inline view, I come away with 6 returned rows. Again, I've manually evaluated the data and this is exactly as expected.
Now when joining these two subsets (Inline View A and Inline View B) together, I would expect that the final query result would be 0 rows (since an inner join between a full set and an empty set produces no matches). However, when I run the entire query with the inner join as described above, I am getting back 1158 rows!
I have reviewed the Execution Plan but nothing jumps out at me.
Questions
Clearly I have done something to confuse the Oracle optimizer, and the resulting query plan is answering a much different query than the one I submitted. My best guess is that with all of these temporary views floating around within the same query, I have confused Oracle into evaluating some set before another set that it depends upon.
To this day I've been unable to locate the official Oracle documentation on the WITH clause, so I've never been completely confident about the order in which the subqueries are evaluated. I did notice while searching SO (can't find it now) that someone mentioned a factored subquery cannot refer to another factored subquery. I've never known this to be true in practice, but the bizarre output above is making me wonder whether I had simply been lucky with this query before.
Can anyone explain the behavior I am seeing? Am I attempting to do something obviously incorrect with this query plan? Or alternatively, is there any chance that something changed between 11g and 12c that could explain why the behavior of this query might have changed?
This sounds like a "wrong results" bug in Oracle. These bugs are usually extremely specific to the version and the features you are using. There's nothing obviously wrong with the queries or execution plan you posted.
You have two ways of handling this:
Try to find the precise bug. What you're doing with common table expressions looks fine. There are some rare cases where your query is technically invalid, you get "lucky" in one version and it works, and when you upgrade it fails. But when that happens the new version usually throws an error rather than returning wrong results. There's probably some extremely weird, specific combination of features you're using that's causing the issue. To find the real issue you need to massively simplify the query until you can make the smallest possible change and see the problem appear and disappear. You'll also want to remove all objects and use only DUAL. This process can take hours. At the end, when you're left with only a few lines of code, either post them here, look on Oracle Support, or create a Service Request.
Avoid the bug. Even if you go through the above steps there may not be a fix anyway. Sometimes the best work-around is to do something differently. It's nice to get to the bottom of every problem but you don't always have time. Instead, try re-writing the query in syntactically different but logically equivalent ways. Remove some or all of the common table expressions, maybe even repeat some SQL. But be sure to leave a comment warning future programmers of why you're doing things in a weird way.
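As a hypothetical illustration of that last approach, a factored subquery can usually be inlined as a derived table; the column Proc_Date below is made up, but the two forms are logically equivalent:

--With a common table expression:
WITH recent AS (
    SELECT Group_No, MAX(Proc_Date) AS max_date
    FROM   Table3
    GROUP  BY Group_No)
SELECT t.Group_No, t.Sub_Group
FROM   Table3 t
JOIN   recent r ON r.Group_No = t.Group_No
               AND r.max_date = t.Proc_Date;

--The same logic with the CTE inlined as a derived table:
SELECT t.Group_No, t.Sub_Group
FROM   Table3 t
JOIN  (SELECT Group_No, MAX(Proc_Date) AS max_date
       FROM   Table3
       GROUP  BY Group_No) r
  ON   r.Group_No = t.Group_No
 AND   r.max_date = t.Proc_Date;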
Related
So, I have an SQL query for MSSQL looking like this (simplified for readability):
SELECT ...
FROM (
SELECT ..., ROUND(SUM(TOTAL_TIME)/86400.0,2) ...
FROM MY_DATA
WHERE STATUS NOT IN (107)
GROUP BY ...
) q
WHERE q.Tdays > 0
GROUP BY ...
It works fine, but I need a comparison against another table in the inner query, so I added a left join and said comparison:
SELECT ...
FROM (
SELECT ..., ROUND(SUM(TOTAL_TIME)/86400.0,2) ...
FROM MY_DATA
LEFT JOIN OTHER_TABLE ON MY_DATA.ID=OTHER_TABLE.ID -- new JOIN
WHERE STATUS NOT IN (107) AND (DEPARTMENT_ID='SP' OR DEPARTMENT_ID='BL') -- new AND branch
GROUP BY ...
) q
WHERE q.Tdays > 0
GROUP BY ...
This query works, but is A LOT slower than the previous one. The weird thing is, commenting out the new AND branch of the WHERE clause while leaving the JOIN as it is makes it faster again. As if it's not joining another table that is slowing the query down, but the actual string comparisons... I am lost as to why this is so slow, or how I could speed it up... any advice would be appreciated!
Use an INNER JOIN. The outer join is being undone by the WHERE clause anyway:
SELECT ..., ROUND(SUM(TOTAL_TIME)/86400.0,2) ...
FROM MY_DATA d INNER JOIN
OTHER_TABLE ot
     ON d.ID = ot.ID -- new JOIN
WHERE d.STATUS NOT IN (107) AND DEPARTMENT_ID IN ('SP', 'BL') -- new AND branch
GROUP BY ...
(The IN shouldn't make a difference; it is just easier to write.)
Next, if this still has slow performance, then look at the execution plans. It means that SQL Server is making a poor decision, probably on the JOIN algorithm. Normally, I fix this by forbidding nested loop joins, but there might be other solutions as well.
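For reference, here is a sketch of that approach applied to the query above; the OPTION clause rules out nested loop joins, leaving hash and merge joins available (the GROUP BY column and select list are stand-ins):

SELECT d.ID, ROUND(SUM(d.TOTAL_TIME) / 86400.0, 2) AS Tdays
FROM MY_DATA d INNER JOIN
     OTHER_TABLE ot
     ON d.ID = ot.ID
WHERE d.STATUS NOT IN (107) AND ot.DEPARTMENT_ID IN ('SP', 'BL')
GROUP BY d.ID
OPTION (HASH JOIN, MERGE JOIN); -- no nested loop joins for this query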
It's hard to say definitively what will or won't speed things up without seeing the execution plan. Also, understanding how fast you need it to be affects what steps you might want to (or not want to) consider taking.
What follows is admittedly somewhat vague, but these are a few things that came to mind when I thought about this. Take a look at the execution plan as Philip Couling suggested in that good link to get an idea where the pain points are, and of course, take these suggestions with a grain of salt.
You might consider adding some indexes to either or both of the tables. The execution plan might even give you suggestions on what could be useful, but off the top of my head, something on OTHER_TABLE.DEPARTMENT_ID probably wouldn't hurt.
You might be able to build potential new indexes as filtered indexes if you know those hard-coded search terms (like STATUS and DEPARTMENT_ID) are always going to be the same.
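A sketch of what such a filtered index might look like here, assuming DEPARTMENT_ID lives on OTHER_TABLE:

--Filtered index covering the hard-coded department filter; adjust the key and
--included columns to match the actual query.
CREATE NONCLUSTERED INDEX IX_OtherTable_Id_Dept_Filtered
    ON OTHER_TABLE (ID)
    INCLUDE (DEPARTMENT_ID)
    WHERE DEPARTMENT_ID IN ('SP', 'BL');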
You could pre-calculate some of this information if it's not changing so rapidly that you need to query it fresh on every call. This comes back to how fast you need it to go, because for just about any query you can add columns or pre-populated lookup tables to avoid doing work at run time. For example, you could add a new bit field like IsNewOrBranch or IsStatusNot107 (both somewhat egregious steps, but things which could work), or you might pre-aggregate the data from the inner query ahead of time.
I know you simplified the query for our benefit, but that also makes it a little hard to know what's going on with the subquery, and the subsequent GROUP BY being performed against that subquery. There might be a way to avoid having to do two group bys.
Along the same vein, you might also look into splitting what you're doing into separate statements if SQL is having a difficult time figuring out how best to return the data. For example, you might populate a temp table or table variable with the results of your inner query, then perform your subsequent GROUP BY on that. While this approach isn't always useful, there are many times where trying to cram all the work into a single query will actually end up being worse than several individual, simple, optimized steps would be.
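A rough sketch of that split, with stand-in column names, might look like:

--Materialize the inner query into a temp table, then aggregate from it.
SELECT d.ID,
       ROUND(SUM(d.TOTAL_TIME) / 86400.0, 2) AS Tdays
INTO   #inner_results
FROM   MY_DATA d
JOIN   OTHER_TABLE ot ON ot.ID = d.ID
WHERE  d.STATUS NOT IN (107)
  AND  ot.DEPARTMENT_ID IN ('SP', 'BL')
GROUP  BY d.ID;

SELECT ID, SUM(Tdays) AS TotalDays
FROM   #inner_results
WHERE  Tdays > 0
GROUP  BY ID;

DROP TABLE #inner_results;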
And as Gordon Linoff suggested, there are a number of query hints which could be used to coax the execution plan into doing things a specific way. But be careful, often that way lies madness.
Your SQL is fine, and restricting your data with an additional AND clause should usually not make it slower.
As it happens, choosing a fast execution path is a hard problem, and SQL Server sometimes (albeit seldom) gets it wrong.
What you can do to help SQL Server find the best execution path is to:
make sure the statistics on your tables are up-to-date and
make sure that there is an "obviously suitable" index that SQL Server can use. SQL Server Management Studio will usually give you suggestions on missing indexes when selecting the "show actual execution plan" option (see the sketch below).
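For instance (only the table names are taken from the query above; the index definition itself is a sketch):

--Refresh statistics on the tables involved.
UPDATE STATISTICS MY_DATA;
UPDATE STATISTICS OTHER_TABLE;

--An "obviously suitable" index supporting the join and the department filter.
CREATE INDEX IX_OtherTable_Id_Dept ON OTHER_TABLE (ID, DEPARTMENT_ID);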
Where I am working, I have recently been told that using DISTINCT in your queries is a bad sign in a programmer. So I am wondering whether the only way to avoid it is to use a GROUP BY.
It was my understanding that DISTINCT works very similarly to a GROUP BY, differing mainly in how it is processed: DISTINCT checks each individual selected row against the others, while a GROUP BY does the same thing on the result as a whole.
Keep in mind I only do reporting; I do not create or alter the data. So my question is: as a best practice, should I be using DISTINCT or GROUP BY? If neither, is there an alternative? Maybe GROUP BY should be used in more complex queries than my non-real example here, but you get the idea. I could not find an answer that really explained why I should or should not use DISTINCT in my queries.
select distinct
spriden_user_id as "ID",
spriden_last_name as "last",
spriden_first_name as "first",
spriden_mi_name as "MI",
spraddr_street_line1 as "Street",
spraddr_street_line2 as "Street2",
spraddr_city as "city",
spraddr_stat_code as "State",
spraddr_zip as "zip"
from spriden, spraddr
where spriden_user_id = spraddr_id
and spraddr_mail_type = 'MA'
VS
select
spriden_user_id as "ID",
spriden_last_name as "last",
spriden_first_name as "first",
spriden_mi_name as "MI",
spraddr_street_line1 as "Street",
spraddr_street_line2 as "Street2",
spraddr_city as "city",
spraddr_stat_code as "State",
spraddr_zip as "zip"
from spriden, spraddr
where spriden_user_id = spraddr_id
and spraddr_mail_type = 'MA'
group by "ID","last","first","MI","Street","Street2","city","State","zip"
Databases are smart enough to recognize what you mean. I expect both of your queries to perform equally well. It is important for someone else maintaining your query to know what you meant: if you really meant to retrieve distinct records, use DISTINCT; if your intention was to do aggregation, use GROUP BY.
Take a look at this question. There are some nice answers that might help.
The answer provided by @zedfoxus is useful to understand the context.
However, I don't believe your query should require distinct records if the data is designed correctly.
It appears you are selecting the primary key of table spriden, so all that data should be unique. You're also joining onto the spraddr table; does that table really contain valid duplicate data? Or is there perhaps an additional join criterion that's required to filter out those duplicates?
This is why I get nervous about use of "distinct" - the spraddr table may include additional columns which you should use to filter out data, and "distinct" may be hiding that.
Also, you may be generating a massive result set which then has to be filtered down by the "distinct" clause, which can cause performance issues. For instance, there could be a million rows in spraddr for each row in spriden, when you should really be using an "is_current" flag to find the two or three "real" ones.
Finally, I get nervous when I see "group by" used as a substitute for distinct, not because it's "wrong", but because stylistically, I believe group by should be used for aggregate functions. That's just a personal preference.
In your example, DISTINCT and GROUP BY do the same thing. I think your colleagues mean that your query should not return duplicates in the first place and that you should be able to write it without a DISTINCT or GROUP BY clause. You may be able to reduce the duplicates by extending your join conditions.
Ask them why it is a bad practice. A lot of people make up rules, or come up with things they consider bad practice, from reading the first page of a book or the first result of a Google search. If it does the job and doesn't cause any issues, there is no reason to create more work by finding alternatives. Of the two options you have posted, I would use DISTINCT too, because it's shorter and easier to read and maintain.
Whoever told you using DISTINCT is a bad sign in itself is wrong. In reality, it all depends on what problem you are trying to solve by using DISTINCT in the first place.
If you're querying a table that is expected to have repeated values of some field or combination of fields, and you're reporting a list of the values or combinations of values (and not performing any aggregations on them), then DISTINCT is the most sensible thing to use. It doesn't really make sense in my mind to use GROUP BY instead just because somebody thinks DISTINCT shouldn't be used. Indeed, I think this is the kind of thing DISTINCT is designed for.
If OTOH you've found that your query has a bug meaning that repeated values are being returned, you shouldn't use either DISTINCT or GROUP BY to cancel out this bug. Rather, you should figure out the cause of the bug and fix it.
Using DISTINCT as a safety net is also a poor practice, as it potentially hides problems, and furthermore it can be computationally expensive (typically O(n log n) or O(n^2)). In this scenario, I can't see that using GROUP BY instead would help you.
Yes, DISTINCT tends to raise a little alarm in my head when I come across it in someone's query. It is required in some cases, of course, but most data models should not require it; it tends to be a last resort, or an outlier case. It may also be symptomatic of a bad application sitting on top of the database, allowing duplicate entries to be inserted, or updates that create duplicates (and likewise no corresponding database-level constraints to prevent such actions). So the first thing to check is the data. It could be a sign of bad data-model design, but most likely the query should never reach a stage in the select where duplicate rows are lingering.
In constructing a large query, normally I would start with the nugget of a subquery which specifies the unique fields; any subquery after that must inner join or left join onto it, but never add to or reduce the number of rows already defined by the nugget query, remembering to handle the possible NULLs of the left joins.
So for example, the nugget query could select the right rows also by using Partitions to, for example, select the most recent row of a joined table, or to do some other grouping at that stage.
In your example, I would not expect duplicates. If a person can have historical addresses, fine, but then do you need to see all addresses or only the most recent? And if there were duplicate addresses for the same person, does that mean incorrectly duplicated data, or does it mean the person left that address but returned to it later? In that case the partition select would fix it with much better control than a distinct, especially when fields are added to the query by someone else later and break the distinct-ness.
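A sketch of that kind of partitioned select against the tables from the question; spraddr_activity_date is an assumed column used only to order by recency:

--Pick the most recent 'MA' address per person with a window function
--instead of relying on DISTINCT to hide duplicates.
select spriden_user_id, spriden_last_name, spriden_first_name,
       spraddr_street_line1, spraddr_city, spraddr_stat_code, spraddr_zip
from (
    select s.spriden_user_id, s.spriden_last_name, s.spriden_first_name,
           a.spraddr_street_line1, a.spraddr_city, a.spraddr_stat_code, a.spraddr_zip,
           row_number() over (partition by s.spriden_user_id
                              order by a.spraddr_activity_date desc) as rn
    from spriden s
    join spraddr a on a.spraddr_id = s.spriden_user_id
    where a.spraddr_mail_type = 'MA'
)
where rn = 1;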
This means that all other data hangs off this nugget of a sub query.. you stick the other possible fields onto the right of the core set of fields.
If Distincts are a last resort, then they are typically reserved for cases where the data is known to have duplicate entries in that table for that set of fields, and it's perfectly normal. In my head though a distinct is a slow, post-select process in the plan especially when it's a large result set being returned. I ought to verify that one of these days.
Provided your queries are correct, DISTINCT and GROUP BY provide the same result set, but your colleagues are correct in stating that DISTINCT hides problems. If you are missing a join and using a GROUP BY, you'll get back more information than you're expecting. If you are missing a join and using DISTINCT the SQL engine will perform an unbounded (or partially bounded) join, narrow the results down, and then come up with the expected answer.
Beyond the obvious performance degradation of generating more data than is necessary, you also run the risk of filling your tempdb (i.e.: running out of room on the hard drive where your tempdb lives).
Use GROUP BY in production.
Thought this would be a good place to ask for some "brainstorming." Apologies if it's a little broad/off subject.
I was wondering if anyone here had any ideas on how to approach the following problem:
First assume that I have a select statement stored somewhere as an object (this can be the tree form of the query). For example (for simplicity):
SELECT A, B FROM table_A WHERE A > 10;
It's easy to determine the below would change the result of the above query:
INSERT INTO table_A (A,B) VALUES (12,15);
But, given any possible Insert/Update/Whatever statement, as well as any possible starting Select (but we know the Selects and can analyze them all day) I'd like to determine if it would affect the result of the Select Statement.
It's fine to assume that there won't be any "outside" queries, and that we know about all the queries being sent to the DB. It is also assumed we know the DB schema.
No, this isn't for homework. Just a brain teaser I've been thinking about and started to get stuck on (obviously, SQL can get very complicated.)
Based on the reply to the comment, I'd say that without additional criteria, this ranges between very hard and impossible.
Very hard (leastways, it would be for me) because you'd have to write something to parse and interpret your SQL statements into a workable frame of reference for your goals. Doable, but can it be worth the effort?
Impossible because some queries transcend phrases like "Byzantinely complex". (Think nested queries, correlated subqueries, views, common table expressions, triggers, outer joins, and who knows what all.) Without setting criteria such as "no subqueries, no views or triggers, no more than X joins" and so forth, the problem becomes open-ended enough to warrant an NP Complete answer.
My first thought would be to put a trigger on table_A, where if any of the columns you're affecting (col A in this case) changes to meet (or no longer meet) the condition (> 10 here), then the trigger records that an "affecting" change has taken place.
E.g. have another little table to record a "last update timestamp", which the trigger could pop a getdate() into when it detects such a change.
Then, you could check that table to see if the timestamp has changed since the last time you ran the select query - if it has, then you know you need to re-run it, if it hasn't, then you know the results would be the same.
The table could hold many such timestamps (one per row, perhaps with the table/trigger name as a key value in another column) to service many such triggers.
Advantage? Being done in a trigger on the table means no risk of a change that could affect the select statement being missed.
Disadvantage? I guess depending on how your select statements come into existence, you might have an undesirable/unmanageable overhead in creating the trigger(s).
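A minimal sketch of this trigger-plus-timestamp approach (all object names are hypothetical, and the trigger watches only the A > 10 predicate from the example select):

CREATE TABLE QueryChangeLog (
    QueryName  varchar(100) NOT NULL PRIMARY KEY,
    LastChange datetime     NOT NULL
);
GO

CREATE TRIGGER trg_tableA_watch ON table_A
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    --Record a change only when an affected row meets (or used to meet) A > 10.
    IF EXISTS (SELECT 1 FROM inserted WHERE A > 10)
       OR EXISTS (SELECT 1 FROM deleted WHERE A > 10)
    BEGIN
        UPDATE QueryChangeLog
        SET    LastChange = GETDATE()
        WHERE  QueryName = 'select_A_gt_10';

        IF @@ROWCOUNT = 0
            INSERT INTO QueryChangeLog (QueryName, LastChange)
            VALUES ('select_A_gt_10', GETDATE());
    END
END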
I am re-iterating the question asked by Mongus Pong Why would using a temp table be faster than a nested query? which doesn't have an answer that works for me.
Most of us at some point find that when a nested query reaches a certain complexity it needs to be broken into temp tables to keep it performant. It is absurd that this could ever be the most practical way forward, and it means these processes can no longer be made into a view. And often third-party BI apps will only play nicely with views, so this is crucial.
I am convinced there must be a simple queryplan setting to make the engine just spool each subquery in turn, working from the inside out. No second guessing how it can make the subquery more selective (which it sometimes does very successfully) and no possibility of correlated subqueries. Just the stack of data the programmer intended to be returned by the self-contained code between the brackets.
It is common for me to find that simply changing from a subquery to a #table takes the time from 120 seconds to 5. Essentially the optimiser is making a major mistake somewhere. Sure, there may be very time consuming ways I could coax the optimiser to look at tables in the right order but even this offers no guarantees. I'm not asking for the ideal 2 second execute time here, just the speed that temp tabling offers me within the flexibility of a view.
I've never posted on here before but I have been writing SQL for years and have read the comments of other experienced people who've also just come to accept this problem and now I would just like the appropriate genius to step forward and say the special hint is X...
There are a few possible explanations as to why you see this behavior. Some common ones are
The subquery or CTE may be being repeatedly re-evaluated.
Materialising partial results into a #temp table may force a more optimum join order for that part of the plan by removing some possible options from the equation.
Materialising partial results into a #temp table may improve the rest of the plan by correcting poor cardinality estimates.
The most reliable method is simply to use a #temp table and materialize it yourself.
Failing that, regarding point 1, see "Provide a hint to force intermediate materialization of CTEs or derived tables". The use of TOP (large_number) ... ORDER BY can often encourage the result to be spooled rather than repeatedly re-evaluated.
Even if that works however there are no statistics on the spool.
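For illustration, the trick looks roughly like this (hypothetical tables; the oversized TOP paired with ORDER BY is what encourages the spool):

SELECT c.CustomerName, o.Total
FROM   dbo.Customers AS c
JOIN  (SELECT TOP (2147483647)
              CustomerId, SUM(Amount) AS Total
       FROM   dbo.Orders
       GROUP  BY CustomerId
       ORDER  BY CustomerId) AS o
  ON   o.CustomerId = c.CustomerId;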
For points 2 and 3 you would need to analyse why you weren't getting the desired plan. Possibly rewriting the query to use sargable predicates, or updating statistics might get a better plan. Failing that you could try using query hints to get the desired plan.
I do not believe there is a query hint that instructs the engine to spool each subquery in turn.
There is the OPTION (FORCE ORDER) query hint, which forces the engine to perform the JOINs in the order specified and could potentially coax it into achieving that result in some instances. This hint will sometimes produce a more efficient plan for a complex query when the engine keeps insisting on a sub-optimal one. Of course, the optimizer should usually be trusted to determine the best plan.
Ideally there would be a query hint that would allow you to designate a CTE or subquery as "materialized" or "anonymous temp table", but there is not.
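For reference, a sketch of the hint in use (hypothetical tables); FORCE ORDER makes the optimizer join the tables in the order they are written:

SELECT o.OrderId, c.CustomerName, p.ProductName
FROM   dbo.Orders o
JOIN   dbo.Customers c ON c.CustomerId = o.CustomerId
JOIN   dbo.Products  p ON p.ProductId  = o.ProductId
OPTION (FORCE ORDER);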
Another option (for future readers of this article) is to use a user-defined function. Multi-statement table-valued functions (as described in How to Share Data between Stored Procedures) appear to force SQL Server to materialize the results of your subquery. In addition, they allow you to specify primary keys and indexes on the resulting table to help the query optimizer. This function can then be used in a select statement as part of your view. For example:
CREATE FUNCTION SalesByStore (@storeid varchar(30))
RETURNS @t TABLE (title varchar(80) NOT NULL PRIMARY KEY,
                  qty smallint NOT NULL) AS
BEGIN
    INSERT @t (title, qty)
    SELECT t.title, s.qty
    FROM sales s
    JOIN titles t ON t.title_id = s.title_id
    WHERE s.stor_id = @storeid
    RETURN
END
GO
CREATE VIEW SalesData AS
SELECT * FROM SalesByStore('6380')
Having run into this problem, I found out that (in my case) SQL Server was evaluating the conditions in the wrong order, because I had an index that could be used (IDX_CreatedOn on TableFoo).
SELECT bar.*
FROM
(SELECT * FROM TableFoo WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
I managed to work around it by forcing the subquery to use another index (i.e. one that would be used when the subquery was executed without the parent query). In my case I switched to PK, which was meaningless for the query, but allowed the conditions from the subquery to be evaluated first.
SELECT bar.*
FROM
(SELECT * FROM TableFoo WITH (INDEX([PK_Id])) WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
Filtering by the Deleted column was really simple and filtering the few results by CreatedOn afterwards was even easier. I was able to figure it out by comparing the Actual Execution Plan of the subquery and the parent query.
A more hacky solution (and not really recommended) is to force the subquery to get executed first by limiting the results using TOP, however this could lead to weird problems in the future if the results of the subquery exceed the limit (you could always set the limit to something ridiculous). Unfortunately TOP 100 PERCENT can't be used for this purpose since SQL Server just ignores it.
Let's say I have following query:
SELECT Id, Name, ForeignKeyId,
(SELECT TOP (1) FtName FROM ForeignTable WHERE FtId = ForeignKeyId)
FROM Table
Would that query execute faster if it is written with JOIN:
SELECT Id, Name, ForeignKeyId, FtName
FROM Table t
LEFT OUTER JOIN ForeignTable ft
ON ft.FtId = t.ForeignKeyId
Just curious... also, if JOINs are faster, will it be faster in all cases (tables with lots of columns, large number of rows)?
EDIT: The queries I wrote are just to illustrate the concept of TOP (1) vs JOIN. Yes, I know about the Query Execution Plan in SQL Server, but I'm not looking to optimize a single query; I'm trying to understand whether there is a certain theory behind SELECT TOP (1) vs JOIN and whether a certain approach is preferred because of speed (not because of personal preference or readability).
EDIT2: I would like to thank Aaron for his detailed answer and encourage people to check out his company's SQL Sentry Plan Explorer, the free tool he mentioned in his answer.
Originally, I wrote:
The first version of the query is MUCH less readable to me. Especially
since you don't bother aliasing the matched column inside the
correlated subquery. JOINs are much clearer.
I still believe and stand by those statements, but I'd like to add to my original response based on the new information added to the question. You asked: are there general rules or theories about what performs better, a TOP (1) or a JOIN (leaving readability and preference aside)? I will re-state, as I commented, that no, there are no general rules or theories. When you have a specific example, it is very easy to prove what works better. Let's take these two queries, similar to yours but which run against system objects that we can all verify:
-- query 1:
SELECT name,
(SELECT TOP (1) [object_id]
FROM sys.all_sql_modules
WHERE [object_id] = o.[object_id]
)
FROM sys.all_objects AS o;
-- query 2:
SELECT o.name, m.[object_id]
FROM sys.all_objects AS o
LEFT OUTER JOIN sys.all_sql_modules AS m
ON o.[object_id] = m.[object_id];
These return the exact same results (3,179 rows on my system), but by that I mean the same data and the same number of rows. One clue that they're not really the same query (or at least not following the same execution plan) is that the results come back in a different order. While I wouldn't expect a certain order to be maintained or obeyed, because I didn't include an ORDER BY anywhere, I would expect SQL Server to choose the same ordering if they were, in fact, using the same plan.
But they're not. We can see this by inspecting the plans and comparing them. In this case I'll be using SQL Sentry Plan Explorer, a free execution plan analysis tool from my company - you can get some of this information from Management Studio, but other parts are much more readily available in Plan Explorer (such as actual duration and CPU). The top plan is the subquery version, the bottom one is the join. Again, the subquery is on the top, the join is on the bottom:
[execution plan screenshots omitted]
The actual execution plans: 85% of the overall cost of running the two queries is in the subquery version. This means it is more than 5 times as expensive as the join. Both CPU and I/O are much higher with the subquery version - look at all those reads! 6,600+ pages to return ~3,000 rows, whereas the join version returns the data using much less I/O - only 110 pages.
But why? Because the subquery version works essentially like a scalar function, where you're going and grabbing the TOP matching row from the other table, but doing it for every row in the original query. We can see that the operation occurs 3,179 times by looking at the Top Operations tab, which shows number of executions for each operation. Once again, the more expensive subquery version is on top, and the join version follows:
I'll spare you more thorough analysis, but by and large, the optimizer knows what it's doing. State your intent (a join of this type between these tables) and 99% of the time it will work out on its own what is the best underlying way to do this (e.g. execution plan). If you try to out-smart the optimizer, keep in mind that you're venturing into quite advanced territory.
There are exceptions to every rule, but in this specific case, the subquery is definitely a bad idea. Does that mean the proposed syntax in the first query is always a bad idea? Absolutely not. There may be obscure cases where the subquery version works just as well as the join. I can't think that there are many where the subquery will work better. So I would err on the side of the one that is more likely to be as good or better and the one that is more readable. I see no advantages to the subquery version, even if you find it more readable, because it is most likely going to result in worse performance.
In general, I highly advise you to stick to the more readable, self-documenting syntax unless you find a case where the optimizer is not doing it right (and I would bet in 99% of those cases the issue is bad statistics or parameter sniffing, not a query syntax issue). I would suspect that, outside of those cases, repros where a convoluted query works better than its more direct and logical equivalent would be quite rare. Your motivation for trying to find those cases should be about the same as your preference for the unintuitive syntax over generally accepted "best practice" syntax.
Your queries do different things. The first is more akin to a LEFT OUTER JOIN.
It depends how your indexes are setup for performance. But JOINs are more clear.
I agree with statements above (Rick). Run this in Execution Plan...you'll get a clear answer. No speculation needed.
I agree with Daniel and Davide that these are two different SQL statements. If ForeignTable has multiple records with the same FtId value, then you'll get duplication of data. Assuming the 1st SQL statement is correct, you'll have to rewrite the 2nd with some GROUP BY clause.