Am I overusing cross apply to create new columns? - sql

I am using T-SQL on various SQL Server versions.
From the moment I learned about CROSS APPLY, I have been using it quite frequently. The specific kind of use I want to discuss is a single
cross apply (select really_long_statement where correlated_join_condition) as tablename(fieldname)
statement. I am using this to make code clearer, as I read in the "Introduce new columns!" section here:
http://bradsruminations.blogspot.gr/2011/04/t-sql-tuesday-017-it-slices-it-dices-it.html
It is obvious to me that the exact same function of the statement I mentioned earlier can be achieved by this:
inner join (select correlated_join_condition_fields,really_long_statement) as tablename(fieldname) on correlated_join_condition
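To make the two patterns concrete, here is a minimal sketch of both forms; the tables and columns (Orders, OrderDetails, and so on) are hypothetical stand-ins:

-- CROSS APPLY version: the correlation lives inside the APPLY,
-- and the join fields are written only once
SELECT o.OrderId, x.LineTotal
FROM Orders AS o
CROSS APPLY (
    SELECT SUM(d.Qty * d.Price)   -- stand-in for the really long statement
    FROM OrderDetails AS d
    WHERE d.OrderId = o.OrderId   -- correlated join condition
) AS x(LineTotal);

-- Derived-table version: the join fields must be selected inside
-- the subquery and then repeated in the ON clause
SELECT o.OrderId, x.LineTotal
FROM Orders AS o
INNER JOIN (
    SELECT d.OrderId, SUM(d.Qty * d.Price)
    FROM OrderDetails AS d
    GROUP BY d.OrderId
) AS x(OrderId, LineTotal)
    ON x.OrderId = o.OrderId;

(Strictly speaking, the scalar aggregate makes the APPLY return a row even for orders with no details, while the INNER JOIN drops them; for a sketch of the syntax that difference doesn't matter.)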
Being a little lazy, I initially preferred the first solution, because I do not have to type the join condition fields twice.
Then I thought: CROSS APPLY is less well known and less widely used than INNER JOIN, so maybe it would be a good idea to stick to INNER JOIN for the sake of writing easier-to-maintain code.
But then I peeked at the query plans for an instance of those two scenarios, and noticed the CROSS APPLY was actually faster! I did not expect this; from all the posts I've read about plans, it seems that most of the time, when two queries do the same thing and are not hugely different, the optimizer will figure this out and produce the same plan.
I am not able to upload the plans at the time of writing this, but I will be able to in a couple of days.
So, the question is, why does cross apply seem faster? Is it because it does not have to make the join condition fields available?
Plus, any opinions on the balance between performance, readability, and brevity would be welcome.

Related

Optimizing INNER JOIN query performance

I'm using a database that requires optimized queries, and I'm wondering which of these queries is the optimized one. I used a timer, but the results are too close, so I have no clue which one to use.
QUERY 1:
SELECT A.MIG_ID_ACTEUR, A.FL_FACTURE_FSL , B.VAL_NOM,
B.VAL_PRENOM, C.VAL_CDPOSTAL, C.VAL_NOM_COMMUNE, D.CCB_ID_ACTEUR
FROM MIG_FACTURE A
INNER JOIN MIG_ACTEUR B
ON A.MIG_ID_ACTEUR= B.MIG_ID_ACTEUR
INNER JOIN MIG_ADRESSE C
ON C.MIG_ID_ADRESSE = B.MIG_ID_ADRESSE
INNER JOIN MIG_CORR_REF_ACTEUR D
ON A.MIG_ID_ACTEUR= D.MIG_ID_ACTEUR;
QUERY 2:
SELECT A.MIG_ID_ACTEUR, A.FL_FACTURE_FSL , B.VAL_NOM, B.VAL_PRENOM,
C.VAL_CDPOSTAL, C.VAL_NOM_COMMUNE, D.CCB_ID_ACTEUR
FROM MIG_FACTURE A , MIG_ACTEUR B, MIG_ADRESSE C, MIG_CORR_REF_ACTEUR D
WHERE A.MIG_ID_ACTEUR= B.MIG_ID_ACTEUR
AND C.MIG_ID_ADRESSE = B.MIG_ID_ADRESSE
AND A.MIG_ID_ACTEUR= D.MIG_ID_ACTEUR;
If you are asking whether it is more efficient to use the SQL 99 join syntax (a inner join b) or whether it is more efficient to use the older join syntax of listing the join predicates in the WHERE clause, it shouldn't matter. I'd expect that the query plans for the two queries would be identical. If the query plans are identical, performance will be identical. If the plans are not identical, that would generally imply that you had encountered a bug in the database's query parsing engine.
Personally, I'd use the SQL 99 syntax (query 1) both because it is more portable when you want to do an outer join and because it generally makes the query more readable and decreases the probability that you'll accidentally leave out a join condition. That's solely a readability and maintainability consideration, though, not a performance consideration.
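For example, an outer join in the SQL 99 syntax reads the same everywhere, while the older syntax needs vendor-specific operators; a hedged sketch reusing the tables from the question (the (+) form is Oracle's, and SQL Server's old *= form has long been removed):

-- SQL 99: portable
SELECT A.MIG_ID_ACTEUR, B.VAL_NOM
FROM MIG_FACTURE A
LEFT OUTER JOIN MIG_ACTEUR B
ON A.MIG_ID_ACTEUR = B.MIG_ID_ACTEUR;

-- Old-style outer join: Oracle only
SELECT A.MIG_ID_ACTEUR, B.VAL_NOM
FROM MIG_FACTURE A, MIG_ACTEUR B
WHERE A.MIG_ID_ACTEUR = B.MIG_ID_ACTEUR (+);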
First things first:
"I used a timer but the result are too close" -- This is actually not a good way to test performance. Databases have caches. The results you get back won't be comparable with a stopwatch. You have system load to contend with, caching, and a million other things that make that particular comparison worthless. Instead of that, try using EXPLAIN to figure out the execution plan. Use SHOW PROFILES and SHOW STATUS to see where and how the queries are spending time. Check last_query_cost. But don't check your stopwatch. That won't tell you anything.
Second: this question can't be answered with the info you provided. In point of fact the queries are identical (verify that with EXPLAIN) and simply boil down to implicit vs. explicit joins. That doesn't make either one of them optimized, though. Again, you need to dig into the join itself and see whether it's making use of indices, for example, or whether it's producing a lot of temp tables or file sorts.
Optimizing the query is a good thing... but these two are the same. A stopwatch won't help you. Use EXPLAIN, SHOW PROFILES, SHOW STATUS... not a stopwatch :-)
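For instance, a minimal sketch of that workflow, assuming MySQL (which is where SHOW PROFILES and Last_query_cost live), using the first query from the question:

EXPLAIN
SELECT A.MIG_ID_ACTEUR, B.VAL_NOM
FROM MIG_FACTURE A
INNER JOIN MIG_ACTEUR B ON A.MIG_ID_ACTEUR = B.MIG_ID_ACTEUR;

SET profiling = 1;
-- ... run each candidate query here ...
SHOW PROFILES;                        -- per-query time breakdown
SHOW STATUS LIKE 'Last_query_cost';   -- optimizer's cost estimate for the last query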

Cross apply vs. cursor - syntax and differences

I'm using SQL Server.
Can anyone help me sort out the syntax of, and the difference between,
cursor and cross apply?
Update: my intention is:
I have one user stored procedure, SP_1, which takes a varchar "id" as a parameter. I have built another, SP_2, which gets many "ids" and parses them; then I want to send them to SP_1 in a loop.
Cursors allow you to loop through data one record at a time. They are generally discouraged because they are ridiculously slow compared to set-based operations, and I would venture to say that the need for them is most often caused by poor database design. They are nonetheless necessary sometimes.
Here is a good SO page on Cross Apply: When should I use Cross Apply over Inner Join?
I suspect that one of these two things isn't quite what you are thinking it is though since as marc_s said they are completely different things. So let us know what you're trying to accomplish if you need more help.
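For the update you added, a minimal sketch of the cursor approach (SP_1 and the ids parameter are from your description; STRING_SPLIT assumes SQL Server 2016+, on older versions you would need your own split function). A set-based rewrite of SP_1's logic would usually still be preferable:

DECLARE @ids varchar(max) = '1,2,3';   -- the varchar param SP_2 receives
DECLARE @id varchar(50);

DECLARE id_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT value FROM STRING_SPLIT(@ids, ',');
OPEN id_cursor;
FETCH NEXT FROM id_cursor INTO @id;
WHILE @@FETCH_STATUS = 0
BEGIN
    EXEC SP_1 @id;                     -- one call per parsed id
    FETCH NEXT FROM id_cursor INTO @id;
END
CLOSE id_cursor;
DEALLOCATE id_cursor;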

What is better - SELECT TOP (1) or INNER JOIN?

Let's say I have following query:
SELECT Id, Name, ForeignKeyId,
(SELECT TOP (1) FtName FROM ForeignTable WHERE FtId = ForeignKeyId)
FROM Table
Would that query execute faster if it is written with JOIN:
SELECT Id, Name, ForeignKeyId, FtName
FROM Table t
LEFT OUTER JOIN ForeignTable ft
ON ft.FtId = t.ForeignKeyId
Just curious... also, if JOINs are faster, will it be faster in all cases (tables with lots of columns, large number of rows)?
EDIT: The queries I wrote are just to illustrate the concept of TOP (1) vs JOIN. Yes - I know about the Query Execution Plan in SQL Server, but I'm not looking to optimize a single query - I'm trying to understand whether there is a theory behind SELECT TOP (1) vs JOIN and whether one approach is preferred because of speed (not because of personal preference or readability).
EDIT2: I would like to thank Aaron for his detailed answer and encourage people to check out SQL Sentry Plan Explorer, his company's free tool that he mentioned in his answer.
Originally, I wrote:
The first version of the query is MUCH less readable to me. Especially
since you don't bother aliasing the matched column inside the
correlated subquery. JOINs are much clearer.
I still believe and stand by those statements, but I'd like to add to my original response based on the new information added to the question. You asked: are there general rules or theories about what performs better, a TOP (1) or a JOIN, leaving readability and preference aside? I will restate, as I commented, that no, there are no general rules or theories. When you have a specific example, it is very easy to prove what works better. Let's take these two queries, similar to yours but which run against system objects that we can all verify:
-- query 1:
SELECT name,
(SELECT TOP (1) [object_id]
FROM sys.all_sql_modules
WHERE [object_id] = o.[object_id]
)
FROM sys.all_objects AS o;
-- query 2:
SELECT o.name, m.[object_id]
FROM sys.all_objects AS o
LEFT OUTER JOIN sys.all_sql_modules AS m
ON o.[object_id] = m.[object_id];
These return the exact same results (3,179 rows on my system), but by that I mean the same data and the same number of rows. One clue that they're not really the same query (or at least not following the same execution plan) is that the results come back in a different order. While I wouldn't expect a certain order to be maintained or obeyed, because I didn't include an ORDER BY anywhere, I would expect SQL Server to choose the same ordering if they were, in fact, using the same plan.
But they're not. We can see this by inspecting the plans and comparing them. In this case I'll be using SQL Sentry Plan Explorer, a free execution plan analysis tool from my company - you can get some of this information from Management Studio, but other parts are much more readily available in Plan Explorer (such as actual duration and CPU). The top plan is the subquery version, the bottom one is the join. Again, the subquery is on the top, the join is on the bottom:
[screenshots: the two actual execution plans, subquery version on top, join version below]
The actual execution plans: 85% of the overall cost of running the two queries is in the subquery version. This means it is more than 5 times as expensive as the join. Both CPU and I/O are much higher with the subquery version - look at all those reads! 6,600+ pages to return ~3,000 rows, whereas the join version returns the data using much less I/O - only 110 pages.
But why? Because the subquery version works essentially like a scalar function, where you're going and grabbing the TOP matching row from the other table, but doing it for every row in the original query. We can see that the operation occurs 3,179 times by looking at the Top Operations tab, which shows the number of executions for each operation. Once again, the more expensive subquery version is on top, and the join version follows.
I'll spare you more thorough analysis, but by and large, the optimizer knows what it's doing. State your intent (a join of this type between these tables) and 99% of the time it will work out on its own what is the best underlying way to do this (e.g. execution plan). If you try to out-smart the optimizer, keep in mind that you're venturing into quite advanced territory.
There are exceptions to every rule, but in this specific case, the subquery is definitely a bad idea. Does that mean the proposed syntax in the first query is always a bad idea? Absolutely not. There may be obscure cases where the subquery version works just as well as the join. I can't think of many where the subquery will work better. So I would err on the side of the one that is more likely to be as good or better and the one that is more readable. I see no advantages to the subquery version, even if you find it more readable, because it is most likely going to result in worse performance.
In general, I highly advise you to stick to the more readable, self-documenting syntax unless you find a case where the optimizer is not doing it right (and I would bet that in 99% of those cases the issue is bad statistics or parameter sniffing, not a query syntax issue). I would suspect that, outside of those cases, the repros you could produce where convoluted queries work better than their more direct and logical equivalents would be quite rare. Your motivation for trying to find those cases should be about the same as your preference for the unintuitive syntax over generally accepted "best practice" syntax.
Your queries do different things. The first is more akin to a LEFT OUTER JOIN.
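If you want the first query's semantics (at most one matched value per row) in a join-like form, OUTER APPLY expresses it directly; a sketch against the question's tables:

SELECT t.Id, t.Name, t.ForeignKeyId, ft.FtName
FROM [Table] t                 -- brackets because Table is a reserved word
OUTER APPLY (
    SELECT TOP (1) FtName
    FROM ForeignTable
    WHERE FtId = t.ForeignKeyId
) AS ft;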
It depends on how your indexes are set up, performance-wise. But JOINs are clearer.
I agree with the statements above (Rick). Run this through the Execution Plan... you'll get a clear answer. No speculation needed.
I agree with Daniel and Davide that these are two different SQL statements. If ForeignTable has multiple records with the same FtId value, you'll get duplication of data. Assuming the 1st SQL statement is correct, you'll have to rewrite the 2nd with some GROUP BY clause.
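A hedged sketch of that rewrite, collapsing duplicates by picking one FtName per key with MIN (any aggregate that reduces each group to one row would do):

SELECT t.Id, t.Name, t.ForeignKeyId, MIN(ft.FtName) AS FtName
FROM [Table] t
LEFT OUTER JOIN ForeignTable ft
    ON ft.FtId = t.ForeignKeyId
GROUP BY t.Id, t.Name, t.ForeignKeyId;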

INNER JOIN keywords | with and without using them

SELECT * FROM TableA
INNER JOIN TableB
ON TableA.name = TableB.name
SELECT * FROM TableA, TableB
where TableA.name = TableB.name
Which is the preferred way and why?
Will there be any performance difference when keywords like JOIN are used?
Thanks
The second way is the classical way of doing it, from before the join keyword existed.
Normally the query processor generates the same database operations from the two queries, so there would be no difference in performance.
Using JOIN better describes what you are doing in the query. If you have many joins, it's also better because the joined table and its condition sit beside each other, instead of all the tables being listed in one place and all the conditions in another.
Another aspect is that it's easier to do an unbounded join by mistake using the second way, resulting in a cross join containing all combinations from the two tables.
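For example, forgetting a predicate in the comma form still parses and silently returns a Cartesian product, while the JOIN form forces you to write an ON clause; a sketch with the question's tables:

-- Comma form with the WHERE clause forgotten: valid SQL, and it
-- returns every row of TableA paired with every row of TableB
SELECT * FROM TableA, TableB;

-- Explicit form: most engines reject this outright if the ON clause is missing
SELECT * FROM TableA
INNER JOIN TableB
ON TableA.name = TableB.name;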
Use the first one, as it is:
More explicit
The standard way
As for performance - there should be no difference.
Find out by using EXPLAIN SELECT …
It depends on the engine used, on the query optimizer, on the keys, on the tables; on pretty much everything.
In some SQL engines the second form (associative joins) is deprecated. Use the first form.
The second form is less explicit and causes beginners to SQL to pause when writing code. It is much more difficult to manage in complex SQL, because the join conditions in the WHERE clause have to be kept in step with the sequence of tables in the FROM list; if the two sequences in the code drift apart, the query becomes very hard to reason about, which goes against the idea that ordering should not matter for elements at the same level.
When joins containing multiple tables are created, it gets REALLY difficult to code, quite fast, using the second form.
EDIT: Performance: I consider ease of coding and debugging part of personal performance, so the first form performs better on the edit/debug/maintenance axis - it simply takes me less time to do and understand things during the development and maintenance cycles.
Most current databases will optimize both of those queries into the exact same execution plan. However, use the first syntax; it is the current standard. Learning and using this join syntax will also help when you write queries with LEFT OUTER JOIN and RIGHT OUTER JOIN, which become tricky and problematic with the older syntax that puts the joins in the WHERE clause.
Filtering joins solely using WHERE can be extremely inefficient in some common scenarios. For example:
SELECT * FROM people p, companies c WHERE p.companyID = c.id AND p.firstName = 'Daniel'
Most databases will execute this query quite literally, first taking the Cartesian product of the people and companies tables and then filtering by those which have matching companyID and id fields. While the fully-unconstrained product does not exist anywhere but in memory and then only for a moment, its calculation does take some time.
A better approach is to group the constraints with the JOINs where relevant. This is not only subjectively easier to read but also far more efficient. Thusly:
SELECT * FROM people p JOIN companies c ON p.companyID = c.id
WHERE p.firstName = 'Daniel'
It's a little longer, but the database is able to look at the ON clause and use it to compute the fully-constrained JOIN directly, rather than starting with everything and then limiting down. This is faster to compute (especially with large data sets and/or many-table joins) and requires less memory.
I change every query I see which uses the "comma JOIN" syntax. In my opinion, the only purpose for its existence is conciseness. Considering the performance impact, I don't think this is a compelling reason.

Can an INNER JOIN offer better performance than EXISTS

I've been investigating making performance improvements on a series of procedures, and recently a colleague mentioned that he had achieved significant performance improvements when utilising an INNER JOIN in place of EXISTS.
As part of the investigation as to why this might be I thought I would ask the question here.
So:
Can an INNER JOIN offer better performance than EXISTS?
What circumstances would this happen?
How might I set up a test case as proof?
Do you have any useful links to further documentation?
And really, any other experience people can bring to bear on this question.
I would appreciate it if any answers could address this question specifically, without suggesting other possible performance improvements. We've had quite a degree of success already, and I was just interested in this one item.
Any help would be much appreciated.
Generally speaking, INNER JOIN and EXISTS are different things.
The former can return duplicates and returns columns from both tables; the latter, being a predicate, returns records from only one table, and each at most once.
If you do an inner join on a UNIQUE column, they exhibit the same performance.
If you do an inner join on a recordset with DISTINCT applied (to get rid of the duplicates), EXISTS is usually faster.
IN and EXISTS clauses (with an equijoin correlation) usually employ one of several SEMI JOIN algorithms, which are usually more efficient than a DISTINCT on one of the tables.
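To see the difference, compare these two (a sketch with hypothetical Customers/Orders tables):

-- Semi-join: each customer appears once; probing can stop at the first matching order
SELECT c.CustomerId, c.Name
FROM Customers c
WHERE EXISTS (SELECT 1 FROM Orders o WHERE o.CustomerId = c.CustomerId);

-- Join version: customers with many orders come back many times,
-- so a DISTINCT is needed to match the output
SELECT DISTINCT c.CustomerId, c.Name
FROM Customers c
INNER JOIN Orders o ON o.CustomerId = c.CustomerId;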
See this article in my blog:
IN vs. JOIN vs. EXISTS
Maybe, maybe not.
The same plan will be generated most likely
An INNER JOIN may require a DISTINCT to get the same output
EXISTS deals with NULL
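The NULL point shows up most clearly in the negated forms (a hedged sketch, same hypothetical tables as above):

-- Returns no rows at all if Orders.CustomerId contains even one NULL
SELECT c.CustomerId FROM Customers c
WHERE c.CustomerId NOT IN (SELECT o.CustomerId FROM Orders o);

-- NOT EXISTS is unaffected by those NULLs and behaves as expected
SELECT c.CustomerId FROM Customers c
WHERE NOT EXISTS (SELECT 1 FROM Orders o WHERE o.CustomerId = c.CustomerId);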
In SQL Server 2019, queries with IN, EXISTS, and JOIN get different plans (if the correct indexes are added), so their performance also differs. The article https://www.mssqltips.com/sqlservertip/6659/sql-exists-vs-in-vs-join-performance-comparison/ shows that JOIN is somewhat faster.
P.S. I understand that the question was about SQL Server 2005 (per the tags), but people mostly find this answer by its title.