Why is SQL Server's query optimizer joining a table against itself?

I'm writing a report that needs to pull data from a view that I'm not authorized to modify. The view is missing a column that I need, so I attempted to join it against one of its source tables. However, this causes the query to take twice as long to execute.
A look at the execution plan shows that it performs two scans of the table and merge joins them together. Is there a hint I can use to convince the query optimizer to visit the table only once?
Abstracted fiddle: http://sqlfiddle.com/#!3/4a44d/1/0

Because the optimizer will never eliminate a table access specified in the query unless nothing is actually referenced from that table.
There is no way to access a table fewer times than it is referenced in the query (as far as I know from 13 years' experience). There may be a few other cases, but the only case I know of where the query optimizer can do fewer accesses than the number of object references is when it can optimize away a left or right outer join: nothing is referenced from the outer table, and constraints guarantee that excluding the work will change neither the number of rows nor which rows are returned in the result.
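Here is a minimal sketch of that elimination case; the tables and constraints are hypothetical, invented only to illustrate the rule. Nothing is selected from the outer table, and its unique key proves the join can neither add nor remove rows, so the plan touches Orders only once:

-- Hypothetical schema for illustration.
CREATE TABLE dbo.Customers (CustomerId INT PRIMARY KEY, Name NVARCHAR(100));
CREATE TABLE dbo.Orders (
    OrderId INT PRIMARY KEY,
    CustomerId INT NOT NULL REFERENCES dbo.Customers (CustomerId)
);

-- No column from Customers is referenced, and Customers.CustomerId is unique,
-- so the outer join cannot change the result: the optimizer drops it and
-- the plan scans Orders alone.
SELECT o.OrderId
FROM dbo.Orders AS o
LEFT JOIN dbo.Customers AS c ON c.CustomerId = o.CustomerId;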

Related

JOIN on concatenated column performance

I have a view that needs to join on a concatenated column. For example:
dbo.View1 INNER JOIN
dbo.table2 ON dbo.View1.combinedcode = dbo.table2.code
Inside 'View1' there is a column which is composed like so:
dbo.tableA.details + dbo.tableB.code AS combinedcode
Performing a join on this column is extremely slow. However, the actual 'View1' runs extremely quickly. The poor performance comes with the join, and there aren't even many rows in any of the tables or views. Does anyone know why this might be?
Thanks for any insight!
Since there's no index on combinedcode, the JOIN will most likely result in a full "table scan" of the view to calculate the code for every row.
If you want to speed things up, try making the view into an indexed view with an index on combinedcode to help the join.
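A hedged sketch of what that could look like; the base-table key columns and the join between tableA and tableB are assumptions, since the question only shows the concatenation:

-- Indexed views must be schema-bound; the join condition here is invented.
CREATE VIEW dbo.View1Indexed
WITH SCHEMABINDING
AS
SELECT a.id,
       a.details + b.code AS combinedcode
FROM dbo.tableA AS a
INNER JOIN dbo.tableB AS b ON b.id = a.id;
GO

-- The unique clustered index materializes combinedcode so the join can seek
-- on it instead of concatenating per row (id is added to make the key unique).
CREATE UNIQUE CLUSTERED INDEX IX_View1Indexed
    ON dbo.View1Indexed (combinedcode, id);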
Another alternative, depending on your SQL Server version, is to (as Parado answers) create a temporary table for the join, although it's usually less performant, at least for single-shot queries.
Try this way:
SELECT *
INTO #TemTap
FROM dbo.View1
/* WHERE conditions on View1 */
After that you could create an index on #TemTap.combinedcode and then join against the temporary table in place of the view:
CREATE INDEX IX_TemTap_combinedcode ON #TemTap (combinedcode);

SELECT *
FROM #TemTap AS View1
INNER JOIN dbo.table2 ON View1.combinedcode = dbo.table2.code;
It often works for me.
The reason is because the optimizer has no information about the concatenated column, so it cannot choose a reasonable join path. My guess, if you look at the execution plan, is that the join is using a "nested loop" join. (I'm tempted to add "dreaded" to that.)
You might be able to fix this by putting an index on table2(code). The optimizer should decide to use this index, getting around the bad join optimization.
You can also use query hints to force the use of a "hash join" or "merge join". I am finding myself doing this more often for complex queries, where changes to the data might affect the query plan. (Such hints go in when a query that has been taking 2 minutes for a year decides to take hours, fill the temporary database, and die when it runs out of space.) You can do this by adding OPTION (MERGE JOIN, HASH JOIN) to the end of the query, or you can choose the algorithm for a single join explicitly in the join clause itself, as sketched below.
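Both hint styles, sketched against the tables from the question (the column lists are placeholders):

-- Query-level hint: every join in the query must be a merge or hash join.
SELECT v.combinedcode, t.code
FROM dbo.View1 AS v
INNER JOIN dbo.table2 AS t ON t.code = v.combinedcode
OPTION (MERGE JOIN, HASH JOIN);

-- Join-level hint: forces the algorithm for this one join only.
-- Note that a join-level hint also fixes the join order as written.
SELECT v.combinedcode, t.code
FROM dbo.View1 AS v
INNER HASH JOIN dbo.table2 AS t ON t.code = v.combinedcode;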
Finally, storing the intermediate results in a temporary table (as proposed by Parado) should give the optimizer enough information to choose the best join algorithm.
Using SQL functions in a WHERE condition is not advised, and here you are using concatenation in the join condition (indirectly, but yes), so it is performing the concatenation for every row and then comparing the result with the other table.
The solution would be to use an intermediate table, rather than this view, to hold the concatenated value.
If that is not possible, try using an indexed view; I know it's a hell of a task.
I would still prefer creating an intermediate table.
See the link for indexed views:
http://msdn.microsoft.com/en-us/library/ms191432.aspx#Restrictions

If you join two tables in a SELECT statement, can the indexes on the table columns no longer be used?

Let's say we have:
SELECT *
FROM Pictures
JOIN Categories ON Categories.CategoryId = Pictures.CategoryId
WHERE Pictures.UserId = @UserId
ORDER BY Pictures.UploadDate DESC
In this case, the database first joins the two tables and then works on the derived table, which I think would mean the indexes on the individual tables would be of no use, unless you can come up with an index that is bound to some column in the derived table?
You have a fundamental misunderstanding of how SQL works. The SQL language specifies what result set should be returned. It says nothing about how the database should achieve those results.
It is up to the database engine to parse the statement and come up with an execution plan (hopefully an efficient one) that will produce the correct results. Many modern relational databases have sophisticated query optimizers that completely pull apart the statement and derive execution plans that seem to have no relationship with the original query. (At least not to the untrained eye)
The execution plan for the same query can even change over time if the engine uses a cost based optimizer. A cost based optimizer makes decisions based on statistics that have been gathered about data and indexes. As the statistics change, the execution plan can also change.
With your simple query you assume that the database has to join the tables and create a temporary result set before it applies the where clause. That might be how you think about the problem, but the database is free to implement it entirely differently. I doubt there are many (if any) databases that would create a temporary result set for your simple query.
This is not to say that you cannot ever predict when an index may or may not be used. But it takes practice and experience to get a feel for how a database might execute a query.
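For instance, here is one shape of index (a sketch, not a prescription; the index name and key order are assumptions) that the optimizer could use for the query above:

-- Allows a seek on UserId and returns the rows already ordered by
-- UploadDate descending, so both the WHERE and the ORDER BY are
-- satisfied without a sort.
CREATE INDEX IX_Pictures_UserId_UploadDate
    ON Pictures (UserId, UploadDate DESC);
-- The join to Categories can then be a nested-loop seek against the
-- primary key on Categories.CategoryId for each qualifying picture.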
This will join the tables, giving you all the category information when a picture's CategoryId is in the Categories table's CategoryId field (and no result for a particular Picture if there is no such category).
This query will likely return several rows of data. The indexes of either table will be useful no matter which table you would like to access.
Normally your program would loop through the result set.
CategoryId will give you the row in Categories with all the relevant fields in that category, and Picture.Id (assuming there is such a field) will give you a reference to that exact picture row in the database.
You can then manipulate either table by using the relevant index
"UPDATE Categories SET .... WHERE CategoryId = " +
"UPDATE Pictures ..... WHERE PictureId =" +
or some such depending on your programming environment.
Indexes are up to the optimizer for use, which depends on what is occurring in the query. For the query posted, there's nothing obvious to stop an index from being used. However, not all databases operate the same: MySQL, for example, traditionally uses at most one index per table in a SELECT (check the query plan, because the optimizer might interpret the JOIN so that another index may be used).
The stuff that is likely to ensure that an index cannot be used is any function or operation that transforms the column value, e.g. extracting the month out of a date, or wildcarding the left side of a LIKE clause. A couple of contrasting examples follow below.
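Each pair asks for the same kind of rows, but only the second form leaves the column untouched so an index can be used (the table names are illustrative):

-- Non-sargable: the function hides OrderDate from its index (May of any year).
SELECT * FROM Orders WHERE MONTH(OrderDate) = 5;
-- Sargable: a range on the bare column can seek (May 2013 shown).
SELECT * FROM Orders
WHERE OrderDate >= '20130501' AND OrderDate < '20130601';

-- Non-sargable: a leading wildcard defeats an index on Name.
SELECT * FROM Customers WHERE Name LIKE '%smith';
-- Sargable: a fixed prefix can seek.
SELECT * FROM Customers WHERE Name LIKE 'smith%';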

Methods of visualizing joins

Just wondering if anyone has any tricks (or tools) they use to visualize joins. You know, you write the perfect query, hit run, and after it's been running for 20 minutes, you realize you've probably created a cartesian join.
I sometimes have difficulty visualizing what's going to happen when I add another join statement and wondered if folks have different techniques they use when trying to put together lots of joins.
Always keep the end in mind.
Ascertain which columns you need.
Try to figure out the minimum number of tables which will be needed to get them.
Write your FROM part with the table which will give the maximum number of columns, e.g. FROM Teams T.
Add each join one by one on a new line. Ensure whether you'll need OUTER, INNER, LEFT, RIGHT JOIN at each step.
Usually works for me; a sketch of the step-by-step style follows below. Keep in mind that it is a Structured Query Language: always break your query into logical lines and it's much easier.
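A hedged illustration of those steps (Players, Coaches, and the Season column are hypothetical; only Teams comes from the answer above):

SELECT T.TeamName, P.PlayerName, C.CoachName
FROM Teams T                                      -- the table giving most columns
INNER JOIN Players P ON P.TeamId = T.TeamId       -- every player has a team
LEFT JOIN Coaches C ON C.TeamId = T.TeamId        -- a team may lack a coach
WHERE T.Season = 2013;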
Every join combines two resultsets into one. Each may be from a single database table or a temporary resultset which is the result of previous join(s) or of a subquery.
Always know the order that joins are processed, and, for each join, know the nature of the two temporary result sets that you are joining together. Know what logical entity each row in that resultset represents, and what attributes in that resultset uniquely identify that entity. If your join is intended to always join one row to one row, these key attributes are the ones you need to use (in join conditions) to implement the join. If your join is intended to generate some kind of cartesian product, then it is critical to understand the above to understand how the join conditions (whatever they are) will affect the cardinality of the new joined resultset.
Try to be consistent in the use of outer join directions. I try to always use Left Joins when I need an outer join, as I "think" of each join as "joining" the new table (to the right) to whatever I have already joined together (on the left) of the Left Join statement...
Run an explain plan.
These are always hierarchical trees (to do this, first I must do that). Many tools exist to make these plans into graphical trees, some in SQL browsers (e.g. Oracle SQL Developer, whatever SQL Server's GUI client is called). If you don't have a tool, most plan text output includes a "depth" column, which you can use to indent the lines.
What you want to look for is the cost of each row. (Note that for Oracle, though, higher costs can mean less time, if it allows Oracle to do a hash join rather than nested loops, and if the final result set has high cardinality (many, many rows).)
I have never found a better tool than thinking it through and using my own mind.
If the query is so complicated that you cannot do that, you may want to use either CTE's, views, or some other carefully organized subqueries to break it into logical pieces so you can easily understand and visualize each piece even if you cannot manage the whole.
Also, if your concern is efficiency, then SQL Server Management Studio 2005 or later lets you get estimated query execution plans without actually executing the query. This can give you a very good idea of where problems lie, if you are using MS SQL Server.
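In T-SQL that looks like the sketch below (SomeTable is a placeholder); it is the same thing SSMS does when you request an estimated plan:

-- Ask the server to return the plan as XML instead of running the query.
SET SHOWPLAN_XML ON;
GO
SELECT * FROM dbo.SomeTable;  -- compiled and planned, but never executed
GO
SET SHOWPLAN_XML OFF;
GO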

Estimated Subtree Cost Wildly Off, Terrible Optimization

I am joining a table that has two record ID fields (record1, record2) to a view twice, once on each record, and selecting the top 1000. The view consists of several rather large tables, and its ID field is a string concatenation of their respective IDs (this was necessary for some third-party software that requires a unique ID for the view; row numbering was abysmally slow). There is also a WHERE clause in the view calling a function that compares dates.
The estimated execution plan produces a "No Join Predicate" warning unless I use OPTION (FORCE ORDER). With forced ordering, the execution plan has multiple nodes displaying 100% cost. In both cases, the estimated subtree cost at the endpoint is thirteen orders of magnitude smaller than that of just one of its nodes (it's doing a lot of nested loop joins, with CPU costs as high as 35927400000000).
What is going on here with the numbers in the execution plan? And why is SQL Server having such a hard time optimizing the query?
Simply adding an index to the view on the concatenated string and using the NOEXPAND table hint fixed the problem entirely. It ran in all of 12 seconds. But why did SQL Server stumble so badly (even requiring the NOEXPAND hint after I added the index)?
Running SQL Server 2008 SP1 with CU 8.
The View:
SELECT
dbo.fnGetCombinedTwoPartKey(N.NameID,A.AddressID) AS NameAddressKey,
[other fields]
FROM
[7 joined tables]
WHERE dbo.fnDatesAreOverlapping(N.dtmValidStartDate,N.dtmValidEndDate,A.dtmValidStartDate,A.dtmValidEndDate) = 1
The Query:
SELECT TOP 1000
vw1.strFullName,
vw1.strAddress1,
vw1.strCity,
vw2.strFullName,
vw2.strAddress1,
vw2.strCity
FROM tblMatches M
JOIN vwImportNameAddress vw1 ON vw1.NameAddressKey = M.Record1
JOIN vwImportNameAddress vw2 ON vw2.NameAddressKey = M.Record2
Looks like you're already pretty close to the explanation. It's because of this:
The view consists of several rather large tables, and its ID field is a string concatenation of their respective IDs...
This creates a non-sargable join predicate condition, and prevents SQL server from using any of the indexes on the base tables. Thus, the engine has to perform a full scan of all the underlying tables for each join (two in your case).
Perhaps in order to avoid doing several full table scans (one for each table, multiplied by the number of joins), SQL Server has decided that it will be faster to simply use the cartesian product and filter afterward (hence the "no join predicate" warning). When you FORCE ORDER, it dutifully performs all of the full scans and nested loops that you originally asked it for.
I do agree with some of the comments that this view sits on top of a problematic data model, but the short-term workaround, as you've discovered, is to index the computed ID column in the view, which makes the join sargable again because the generated IDs are materialized in the index instead of being computed per row.
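The fix described in the question looks roughly like this (the object names come from the question; treating NameAddressKey as unique is an assumption, and the view must be created WITH SCHEMABINDING before it can be indexed):

-- Materialize the view: NameAddressKey becomes a stored, indexed column.
CREATE UNIQUE CLUSTERED INDEX IX_vwImportNameAddress
    ON dbo.vwImportNameAddress (NameAddressKey);
GO

-- NOEXPAND tells the engine to use the indexed view as stored rather than
-- expanding it back into its base tables.
SELECT TOP 1000 vw1.strFullName, vw2.strFullName
FROM tblMatches AS M
JOIN dbo.vwImportNameAddress AS vw1 WITH (NOEXPAND)
    ON vw1.NameAddressKey = M.Record1
JOIN dbo.vwImportNameAddress AS vw2 WITH (NOEXPAND)
    ON vw2.NameAddressKey = M.Record2;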
Edit: I also missed this on the first read-through:
WHERE dbo.fnDatesAreOverlapping(N.dtmValidStartDate,N.dtmValidEndDate,A.dtmValidStartDate,A.dtmValidEndDate) = 1
This, again, is a non-sargable predicate which will lead to poor performance. Wrapping any columns in a UDF will cause this behaviour. Indexing the view also materializes it, which may factor into the speed of the query; without the index, this predicate has to be evaluated every time, and it forces a full scan on the base tables even without the composite ID. One way out, sketched below, is to inline the date logic.
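Assuming fnDatesAreOverlapping returns 1 when the two date ranges intersect, the UDF call could be replaced with plain comparisons the optimizer can reason about (the standard interval-overlap test):

-- Two ranges overlap exactly when each starts before the other ends.
WHERE N.dtmValidStartDate <= A.dtmValidEndDate
  AND A.dtmValidStartDate <= N.dtmValidEndDate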
It would have to parse your function (fnGetCombinedTwoPartKey) to determine which columns are fetched to create the result column. It can't, so it's going to assume all columns are necessary. If your indexes are covering indexes, then your estimate is going to be wrong.

SQL Server 2005 - Order of Inner Joins

I have a query containing three inner join statements in the Where clause. The query takes roughly 2 minutes to execute. If I simply change the order of two of the inner joins, performance drops to 40 seconds.
How can doing nothing but changing the order of the inner joins have such a drastic impact of query performance? I would have thought the optimizer would figure all this out.
SQL is declarative, that is, the JOIN order should not matter.
However it can in practice, say, if it's a complex query when the optimiser does not explore all options (which in theory could take months).
Another possibility is that it's a very different query if you reorder it and you get different results, but this usually happens with OUTER JOINs.
And it could also be the way the ON clause is specified: it has to change if you reorder the FROM clause, unless you are using the older (and bad) JOIN-in-the-WHERE-clause style.
Finally, if it's a concern, you could use parentheses to change the evaluation order and make your intentions clear, say, filtering a large table first to generate a derived table, as sketched below.
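A minimal sketch of that idea (Sales and Customers are hypothetical tables):

-- Filter the large table inside a derived table, then join the smaller result.
SELECT s.OrderId, c.CustomerName
FROM (
    SELECT OrderId, CustomerId
    FROM Sales
    WHERE OrderDate >= '20130101'   -- cuts Sales down before any join happens
) AS s
INNER JOIN Customers AS c ON c.CustomerId = s.CustomerId;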
Because by changing the order of the joins, SQL Server is coming up with a different execution plan for your query (chances are it's changing the way it's filtering the tables based on your joins).
In this case, I'm guessing you have several large tables...one of which performs the majority of the filtering.
In one query, your joins are joining several of the large tables together and then filtering the records at the end.
In the other, you are filtering the first table down to a much smaller sub-set of the data...and then joining the rest of the tables in. Since that initial table got filtered before joining the other large recordsets, performance is much better.
You could always verify by running the query with the 'Show query plan' option enabled and seeing what the query plan is for the two different join orders.
I would have thought it was smart enough to do that as well, but clearly it's still performing the joins in the order you explicitly list them... As to why that affects the performance, if the first join produces an intermediate result set of only 100 records in one ordering scheme, then the second join will be from that 100-record set to the third table.
If putting the other join first produces a first intermediate result set of one million records, then the second join will be from a one million row result set to the third table...