Estimated Subtree Cost Wildly Off, Terrible Optimization - sql

I am joining a table that has two record ID fields (record1, record2) to a view twice--once on each record--and selecting the top 1000. The view consists of several rather large tables, and its ID field is a string concatenation of their respective IDs (this was necessary for some third-party software that requires a unique ID for the view; row numbering was abysmally slow). There is also a WHERE clause in the view calling a function that compares dates.
The estimated execution plan produces a "No Join Predicate" warning unless I use OPTION(FORCE ORDER). With forced ordering, the execution plan has multiple nodes displaying 100% cost. In both cases, the estimated subtree cost at the endpoint is thirteen orders of magnitude smaller than just one of its nodes (it's doing a lot of nested loop joins with CPU costs as high as 35,927,400,000,000).
What is going on here with the numbers in the execution plan? And why is SQL Server having such a hard time optimizing the query?
Simply adding an index to the view on the concatenated string and using the NOEXPAND table hint fixed the problem entirely. It ran in all of 12 seconds. But why did SQL Server stumble so badly (even requiring the NOEXPAND hint after I added the index)?
Running SQL Server 2008 SP1 with CU 8.
The View:
SELECT
dbo.fnGetCombinedTwoPartKey(N.NameID,A.AddressID) AS NameAddressKey,
[other fields]
FROM
[7 joined tables]
WHERE dbo.fnDatesAreOverlapping(N.dtmValidStartDate,N.dtmValidEndDate,A.dtmValidStartDate,A.dtmValidEndDate) = 1
The Query:
SELECT TOP 1000
vw1.strFullName,
vw1.strAddress1,
vw1.strCity,
vw2.strFullName,
vw2.strAddress1,
vw2.strCity
FROM tblMatches M
JOIN vwImportNameAddress vw1 ON vw1.NameAddressKey = M.Record1
JOIN vwImportNameAddress vw2 ON vw2.NameAddressKey = M.Record2

Looks like you're already pretty close to the explanation. It's because of this:
The view consists of several rather large tables, and its ID field is a string concatenation of their respective IDs...
This creates a non-sargable join predicate and prevents SQL Server from using any of the indexes on the base tables. Thus, the engine has to perform a full scan of all the underlying tables for each join (two in your case).
Perhaps in order to avoid doing several full table scans (one for each table, multiplied by the number of joins), SQL Server has decided that it will be faster to simply use the cartesian product and filter afterward (hence the "no join predicate" warning). When you FORCE ORDER, it dutifully performs all of the full scans and nested loops that you originally asked it for.
I do agree with some of the comments that this view sits on top of a problematic data model, but the short-term workaround, as you've discovered, is to index the computed ID column in the view, which makes the join sargable again because the generated key values are materialized and indexed rather than computed at query time.
Edit: I also missed this on the first read-through:
WHERE dbo.fnDatesAreOverlapping(N.dtmValidStartDate,N.dtmValidEndDate,A.dtmValidStartDate,A.dtmValidEndDate) = 1
This, again, is a non-sargable predicate, which will lead to poor performance; wrapping columns in a UDF will cause this behaviour. Indexing the view also materializes it, which may also factor into the speed of the query; without the index, this predicate has to be evaluated for every candidate row and forces a full scan of the base tables, even without the composite ID.
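For reference, the workaround described in the question would look roughly like the sketch below. This is an illustration only: it assumes the view was created WITH SCHEMABINDING (a prerequisite for indexing a view) and that the key function is deterministic; the object and column names come from the question, while the index name is made up.
-- Materialize the view by giving it a unique clustered index on the concatenated key
-- (index name is hypothetical):
CREATE UNIQUE CLUSTERED INDEX IX_vwImportNameAddress_NameAddressKey
    ON dbo.vwImportNameAddress (NameAddressKey);

-- Query the materialized rows directly; WITH (NOEXPAND) stops the optimizer from
-- expanding the view back into its seven base tables:
SELECT TOP 1000
    vw1.strFullName, vw1.strAddress1, vw1.strCity,
    vw2.strFullName, vw2.strAddress1, vw2.strCity
FROM tblMatches M
JOIN vwImportNameAddress vw1 WITH (NOEXPAND) ON vw1.NameAddressKey = M.Record1
JOIN vwImportNameAddress vw2 WITH (NOEXPAND) ON vw2.NameAddressKey = M.Record2;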

It would have to parse your function (fnGetCombinedTwoPartKey) to determine which columns are fetched to create the result column. It can't, so it's going to assume all columns are necessary. If your indexes are covering indexes, then your estimate is going to be wrong.


Self-Joins: is there a way to improve the performance of this query?

The purpose of all this is to create a lookup table to avoid a self-join down the road, which would involve joins over the same data against much bigger data sets.
In this instance a sales order may have one or both of a bill-to and a ship-to customer ID.
The tables here are aggregates of data from 5 different servers, differentiated by the box_id. The customer table is ~1.7M rows, and sales_order is ~55M. The end result is ~52M records and takes on average about 80 minutes to run.
The query:
SELECT DISTINCT sog.box_id ,
sog.sales_order_id ,
cb.cust_id AS bill_to_customer_id ,
cb.customer_name AS bill_to_customer_name ,
cs.cust_id AS ship_to_customer_id ,
cs.customer_name AS ship_to_customer_name
FROM sales_order sog
LEFT JOIN customer cb ON cb.cust_id = sog.bill_to_id AND cb.box_id = sog.box_id
LEFT JOIN customer cs ON cs.cust_id = sog.ship_to_id AND cs.box_id = sog.box_id
The execution plan:
https://www.brentozar.com/pastetheplan/?id=SkjhXspEs
All of this is happening on SQL Server.
I've tried reproducing the bill to and ship to customer sets as CTEs and joining to those, but found no performance benefit.
The only indexes on these tables are the primary keys (which are synthetic IDs). Somewhat curiously the execution plan analyzer is not recommending adding any indexes to either table; it usually wants me to slap indexes on almost everything.
I don't know that there necessarily IS a way to make this run faster, but I am trying to improve my query optimization and have hit the limit of my knowledge. Any insight is much appreciated.
When you run queries like yours -- queries with no WHERE filters -- the DBMS often decides it has to scan entire tables. (In SQL Server execution plans, a "clustered index scan" means it is scanning the whole table.) It certainly has to wrangle all the data in both tables. The lookup table you want to create is often called a "materialized view." (The online version of SQL Server has built-in support for materialized views, but other versions still don't.)
Depending on how you will use your data, you may be better off avoiding this materialized lookup table. If all your uses of the proposed lookup table involve filtering out a small subset of rows with WHERE clauses, an ordinary non-materialized view may be a good choice. When you run queries against ordinary views, the query planner folds the view into the query, and may recommend helpful indexes.
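To illustrate the non-materialized alternative, a plain view over the same query might look like the sketch below; the view name is invented, and the tables and columns come from the question.
CREATE VIEW dbo.vw_order_customers
AS
SELECT DISTINCT
       sog.box_id,
       sog.sales_order_id,
       cb.cust_id        AS bill_to_customer_id,
       cb.customer_name  AS bill_to_customer_name,
       cs.cust_id        AS ship_to_customer_id,
       cs.customer_name  AS ship_to_customer_name
FROM sales_order sog
LEFT JOIN customer cb ON cb.cust_id = sog.bill_to_id AND cb.box_id = sog.box_id
LEFT JOIN customer cs ON cs.cust_id = sog.ship_to_id AND cs.box_id = sog.box_id;
Downstream queries then filter through the view; the planner folds the view definition into each query and can use whatever indexes exist on sales_order and customer. For example:
SELECT bill_to_customer_name, ship_to_customer_name
FROM dbo.vw_order_customers
WHERE box_id = 3 AND sales_order_id = 12345;   -- example filter values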

SQL adding Order By clause causes the query to run significantly faster. Explanation needed

So I have this query:
SELECT *
FROM ViewTechnicianStatus
WHERE NotificationClass = 2
AND MachineID IN (SELECT ID FROM MachinesTable WHERE DepartmentID = 1 AND IsMachineActive <> 0)
--ORDER BY ResponseDate DESC
The view is huge and complex, with a lot of joins and subqueries. When I run this query it takes forever to finish; however, if I add the ORDER BY it finishes instantly and returns 20 rows as intended. I don't understand how adding the ORDER BY could have such a huge positive impact on the performance. I would love it if somebody could explain the phenomenon to me.
EDIT:
Here is the rundown with the SET STATISTICS TIME, IO ON; flags on. Sorry for the hidden table names, but I don't think I can expose those.
Without ORDER BY
With ORDER BY
To answer your question: the reason your query runs faster when you add the ORDER BY is most likely indexing. The clients you tested probably all had indexes on those specific fields/tables, and the ORDER BY let the optimizer take advantage of them, which improved performance.
Summary
OK, I've thought about this for a while, as I think it's an interesting issue. I believe it's very much an edge case - which is part of what makes it interesting.
I'm taking an educated guess based on the info provided - obviously, without being able to see it/play with it, I cannot be certain. But I think this explanation matches the evidence based on the info you provide and the statistics.
I think the main issue is a poor query plan. In the version without the sort, it uses an inappropriate nested loop; in the version with the sort, it does (say) a hash match or merge join.
I have found that SQL Server often has issues with query plans within complex views that reference other views, and especially if those sub-views have group bys/sorts/etc.
To demonstrate the difference that could occur, I'll simplify your complex view into two subgroups I'll call 'view groups' (each of which may be one or several views and tables - not a technical term, just a term to summarise them).
The first view group contains most of the tables.
The second view group contains the views getting data from tables 6 and 7.
For both approaches, how SQL Server uses the data within each view group is probably the same (e.g., it uses the same indexes, etc.). However, there's a difference in its approach to how it performs the join between the two view groups.
Example - query planner underestimates cost of view group 2 and doesn't care which method is used
I'm guessing that:
The first view group, at the point of the join, is dealing with about 3000 rows (it hasn't been filtered down yet), and
The query planner thinks view group 2 is cheap to run.
In the version without the ORDER BY, the query plan uses a nested loop join. That is, it gets each value in view group 1, and then for each value it runs view group 2 to get the relevant data. This means view group 2 is run roughly 3000 times (once for each value in view group 1).
In the version with the ORDER BY, it decides to do (say) a hash match between view group 1 and view group 2. This means it only has to run view group 2 once, but it spends a bit more time sorting. However, because you asked for the results to be sorted anyway, it chooses the hash match.
However, because the query planner underestimated the cost of view group 2, it turns out that the hash match is a much, much better query plan for the circumstances.
Example - query planner use of cached plans
I believe (but may be wrong!) that when you reference views within views, SQL Server can often just reuse cached plans for the sub-views rather than trying to get the best plan possible for your current situation.
It may be that one of your views uses the "cached plan" whereas the other one tries to optimise the query plan including the sub-views.
Ironically, it may be that the query version with the order by is more complex, and in this case it uses the cached plans for view group 2. However, as it knows it hasn't optimised the plan for view group 2, it simply gets the data once for view group 2, then keeps all results in memory and uses it in a hash match.
In contrast, in the version without the order by, it takes a shot at optimising the query plan (including optimising how it uses the views), and makes a mess of it.
Possible solutions
These are all possibilities - they may make things better or they may make them worse! Note that SQL is a declarative language (you tell the computer what to do/what you want, but not how to do it).
This is not a comprehensive list of possibilities, but here are some things you can try:
Pre-calculate all or part(s) of the views (e.g., put the pre-calculated stuff from tables 6 and 7 into a temporary table, then use the temporary tables in the views)
Simplify the SQL and/or move all the SQL into a single view that doesn't call other views
Use join hints, e.g., instead of INNER JOIN, use INNER HASH JOIN in the appropriate position (see the sketch after this list)
Use OPTION(RECOMPILE)
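As a hedged illustration of the last two suggestions, and assuming MachinesTable.ID is unique so the IN subquery can be rewritten as a join, the original query might be tried like this (the hash hint and RECOMPILE are things to experiment with, not a guaranteed fix):
SELECT vts.*
FROM ViewTechnicianStatus vts
INNER HASH JOIN MachinesTable m            -- force a hash join instead of a nested loop
    ON m.ID = vts.MachineID
WHERE vts.NotificationClass = 2
  AND m.DepartmentID = 1
  AND m.IsMachineActive <> 0
ORDER BY vts.ResponseDate DESC
OPTION (RECOMPILE);                        -- discard any cached plan for this statement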

SQL Server Query running slow when changing constant in where clause

Hello all, and thanks in advance. I have a view that, when queried with no WHERE clause, takes just over 0 seconds to return ~8600 rows. However, when I query it with a WHERE clause such as:
SELECT * FROM myView WHERE myID = 123
depending on what constant I put in place of 123 the query execution time changes considerably.
Now, "considerably" in this case means the difference between just above 0 seconds and 3 to 4 seconds. But the view is called frequently and repeatedly for certain tasks which makes 3 seconds turn into 30 or more seconds.
While I cannot give the code for the view itself, what I can confirm is that:
The view is composed of joins across 6 standard tables (no special qualities).
While there may not always be records in table A that link up with table B, thus creating null columns in the results, I have confirmed that such instances are not consistently resulting in the longer or shorter query times.
The view itself has no clauses beyond the standard Select, From, and Left Outer Join clauses.
Certain IDs always result in long query times and the others always result in short query times
I have dropped and created the view in between queries on the off chance that there was a cached execution plan that was sub-optimal.
If these known variables are not enough to narrow the possibilities down to 2 or 3 likely causes, I would still like to know what THEORETICAL problems might be causing this issue, just to expand my understanding.
Thanks Again,
ProtoNoob
I would assume that the statistics for the tables are outdated and do not match the real content of the tables. This would mean that the optimizer, relying on the statistics, assumes e.g. that a value you use in the WHERE clause does not occur in the data at all, and hence that the result set will be rather small, while in reality it contains many rows. Or the other way round: relying on the statistics, the optimizer could assume that, say, 20% of the rows of the table have this value, and hence decide it is better to do a full table scan than to first access index pages to evaluate the WHERE condition, then jump to a data page for almost every index entry, and in the end read nearly all pages anyway - when in reality the value is not contained in the table at all. Or it could access the tables in the wrong order, or ... Either way, the outdated statistics lead to a wrong plan.
One hint pointing to outdated statistics would be if the query plan shows a huge difference between estimated and actual number of rows.
Which DBMS are you using? If SQL Server, you can inspect the current statistics using DBCC SHOW_STATISTICS and refresh the statistics for selected columns and tables using the UPDATE STATISTICS statement. There are more views and procedures around this subject; most of them are linked from the documentation for those two statements.
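If it is SQL Server, a minimal sketch of those two statements follows; the table, index, and column names are placeholders, since the view's underlying tables aren't shown.
-- Compare the histogram the optimizer sees with the actual data distribution
-- (placeholder table and statistics names):
DBCC SHOW_STATISTICS ('dbo.TableA', 'IX_TableA_myID');

-- Refresh the statistics for one table with a full scan ...
UPDATE STATISTICS dbo.TableA WITH FULLSCAN;

-- ... or let SQL Server refresh everything it considers stale:
EXEC sp_updatestats;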

effect of number of projections on query performance

I am looking to improve the performance of a query which selects several columns from a table. I was wondering if limiting the number of columns would have any effect on the performance of the query.
Reducing the number of columns would, I think, have only a very limited effect on the speed of the query, but it could have a potentially larger effect on the transfer speed of the data. The less data you select, the less data needs to be transferred over the wire to your application.
I might be misunderstanding the question, but here goes anyway:
The absolute number of columns you select doesn't make a huge difference. However, which columns you select can make a significant difference depending on how the table is indexed.
If you are selecting only columns that are covered by the index, then the DB engine can use just the index for the query without ever fetching table data. If you use even one column that's not covered, though, it has to fetch the entire row (key lookup) and this will degrade performance significantly. Sometimes it will kill performance so much that the DB engine opts to do a full scan instead of even bothering with the index; it depends on the number of rows being selected.
So, if by removing columns you are able to turn this into a covering query, then yes, it can improve performance. Otherwise, probably not. Not noticeably anyway.
Quick example for SQL Server 2005+ - let's say this is your table:
CREATE TABLE MyTable (
    ID int NOT NULL IDENTITY PRIMARY KEY CLUSTERED,
    Name varchar(50) NOT NULL,
    Status tinyint NOT NULL
);
If we create this index:
CREATE INDEX IX_MyTable
ON MyTable (Name)
Then this query will be fast:
SELECT ID
FROM MyTable
WHERE Name = 'Aaron'
But this query will be slow(er):
SELECT ID, Name, Status
FROM MyTable
WHERE Name = 'Aaron'
If we change the index to a covering index, i.e.
CREATE INDEX IX_MyTable
ON MyTable (Name)
INCLUDE (Status)
WITH (DROP_EXISTING = ON)  -- replace the earlier IX_MyTable in place
Then the second query becomes fast again because the DB engine never needs to read the row.
Limiting the number of columns has no measurable effect on the query. Almost universally, an entire row is fetched to cache. The projection happens last in the SQL pipeline.
The projection part of the processing must happen last (after GROUP BY, for instance) because it may involve creating aggregates. Also, many columns may be required for JOIN, WHERE and ORDER BY processing. More columns than are finally returned in the result set. It's hardly worth adding a step to the query plan to do projections to somehow save a little I/O.
Check your query plan documentation. There's no "project" node in the query plan. It's a small part of formulating the result set.
To get away from "whole row fetch", you have to go to a columnar (column-oriented) database.
It can depend on the server you're dealing with (and, in the case of MySQL, the storage engine). Just for example, there's at least one MySQL storage engine that does column-wise storage instead of row-wise storage, and in this case more columns really can take more time.
The other major possibility would be if you had segmented your table so some columns were stored on one server and other columns on another (aka vertical partitioning). In this case, retrieving more columns might involve retrieving data from different servers, and it's always possible that the load is imbalanced so different servers have different response times. Of course, you usually try to keep the load reasonably balanced, so that should be fairly unusual, but it's still possible (especially if, for example, one of the servers handles some other data whose usage might vary independently from the rest).
Yes, if your query can be covered by a non-clustered index it will be faster, since all the data is already in the index and the base table (if you have a heap) or clustered index does not need to be touched at all.
To demonstrate what tvanfosson has already written, namely that there is a "transfer" cost, I ran the following two statements against an MSSQL 2000 database from Query Analyzer.
SELECT datalength(text) FROM syscomments
SELECT text FROM syscomments
Both queries returned 947 rows, but the first one took 5 ms and the second 973 ms.
Also, because the fields involved are the same, I would not expect indexing to be a factor here.

Avoiding full table scan

I have the following query, which obtains transactions from the transactions table and the transaction detail table. Both tables have a large number of entries, so this query takes a while to return results.
SELECT * FROM transactions t LEFT JOIN transac_detail tidts ON (tidts.id_transac = t.id);
However, I'm more worried by the fact that Oracle does a full table scan on both tables, according to the explain plan, even though t.id and tidts.id_transac have indexes.
Is there any way to optimize this without touching table structure?
I don't think it's true that this SQL would necessarily be best served by full table scans. That would most obviously be true where there is a foreign key between the two, but even then there could be exceptions.
I think the key question is this: "what proportion of the rows from each table are expected to be included in the result set?". If the answer is "100% from each" then you have a clear case for full table scans (and a hash join).
However consider the case where tables A and B are joined, with table A containing 5 rows with a foreign key to table B (the parent) containing a million rows. Obviously here you'd look for a full scan of A with a nested loop join to B (the join column on table B would be indexed because it would have to be a primary or unique key).
In the OP's case, though, it looks like you'd expect 100% of the rows to be returned from each table. I'd expect to see a full scan of both tables, and a hash join with TRANSACTIONS (probably the smaller table) being accessed first and built into the hash table. This would be the optimum join method, and I'd just be looking out for a situation where TRANSACTIONS is too big for a single-pass hash join. If the join spills to disk, that could be a performance worry and you'd have to look at increasing the memory allocation or equi-partitioning the two tables to reduce the memory requirement.
Since the query as given just returns everything anyway, a full table scan may actually be the fastest way of arriving at the final result. Since I/O is so expensive compared to CPU time, it may be more efficient to pull everything into memory and do the final join there than to perform an index seek in a loop over one table.
To determine whether the query could actually run any faster, you can try the following approaches:
look at query plan on only a subset of the data (for example, a range of id's)
try the query on subsets of varying sizes, see what kind of curve you're plotting here
You don't have a WHERE clause, so Oracle decides that, since it has to return all records from both tables, a full table scan will be most efficient.
If you add a WHERE clause that uses an index, I think you will find that EXPLAIN PLAN no longer shows a full table scan.
Full table scans are not necessarily bad - it appears that you're not limiting the result set in any way and this might be the most efficient way to execute the query. You can always verify this by using an index hint and determining the resulting performance change.
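To make that comparison concrete, here is a hedged sketch of both suggestions; the bind variable and the index name are made up, while the tables and columns come from the question.
-- 1) With a selective WHERE clause the optimizer can drive the join from the indexes:
SELECT *
FROM transactions t
LEFT JOIN transac_detail tidts ON tidts.id_transac = t.id
WHERE t.id = :some_transaction_id;

-- 2) An index hint (hypothetical index name) to force index access and compare the
--    resulting plan and timings against the full-scan version:
SELECT /*+ INDEX(tidts TRANSAC_DETAIL_IDX) */ *
FROM transactions t
LEFT JOIN transac_detail tidts ON tidts.id_transac = t.id;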