MSSQL: Different execution plan for same query/data - sql

I have a query that runs "fast" on production but is very slow (1 hour) on the test servers.
This is the query in question:
select z.PrimaryKeyColumn
FROM [table1] z
inner join Table2 p on p.PrimaryKeyColumn=z.PrimaryKeyColumn
left outer join table3 pz on z.PrimaryKeyColumn = RTRIM(RTRIM(pz.column2) + LTRIM(pz.column3))
I analyzed the query execution plans and found that production uses a hash match for the first join while test uses nested loops, hence the slowness.
I have rebuilt the indexes and updated the statistics, but the results are the same.
Additionally, on the TEST server, where the query is slow, I duplicated Table2 with its indexes and data, and when I use that copy the query is as fast as it is on the production server.
These are the query execution plans:
TEST server:
TEST server but using duplicate of Table2 in INNER JOIN:
PRODUCTION server:

The two servers are probably not identical, or they are configured differently. As a side note, not essential to the problem: you can remove the outer RTRIM() from the comparison and get the same result, because SQL Server ignores trailing spaces when comparing strings with =.

The fact that the copied table produces the same query plan as production indicates that the cardinality estimates differ between the original table and the copy.
There must be some difference in the statistics created for the tables, so check that the original and the copy have the same statistics objects.
Also review the histograms of those statistics, especially the ones related to the different index choices observed between the plans: do the steps look the same?
Also, possibly obvious and something you've already confirmed: do all the same indexes exist on the tables?
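A minimal sketch of how the statistics on the original and the copy could be compared. The name Table2_copy, the dbo schema and the statistics name PK_Table2 are assumptions, not taken from the question; sys.dm_db_stats_properties requires SQL Server 2008 R2 SP2 / 2012 SP1 or later.

```sql
-- Table2_copy is a hypothetical name for the duplicated table.
-- List every statistics object on both tables with its last update,
-- row count, sample size and modifications since the last update.
SELECT OBJECT_NAME(s.object_id) AS table_name,
       s.name                   AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id IN (OBJECT_ID('dbo.Table2'), OBJECT_ID('dbo.Table2_copy'))
ORDER BY table_name, stats_name;

-- Histogram steps for one statistics object. PK_Table2 is an assumed name;
-- substitute a stats_name returned by the query above.
DBCC SHOW_STATISTICS ('dbo.Table2', 'PK_Table2') WITH HISTOGRAM;
```

Running the DBCC command for the matching statistics object on the copy lets you compare the histogram steps side by side.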

Related

Does my previous SQL query/ies affect my current query?

I have multiple SQL queries that I run one after the other to get a set of data. In each query, a bunch of the joined tables are exactly the same as in the other queries. For example:
Query1
SELECT * FROM
Product1TableA A1
INNER JOIN Product1TableB B on A1.BId = B.Id
INNER JOIN CommonTable1 C on C.Id = B.CId
INNER JOIN CommonTable2 D on D.Id = B.DId
...
Query2
SELECT * FROM Product2TableA A2
INNER JOIN Product2TableB B on A2.BId = B.Id
INNER JOIN CommonTable1 C on C.Id = B.CId
INNER JOIN CommonTable2 D on D.Id = B.DId
...
I am playing around with re-ordering the joins (around 2 dozen tables joined per query), and I read here that the order should not really affect query execution unless SQL Server "gives up" during optimization because of how big the query is...
What I am wondering is if bunching up common table joins at the start of all my queries actually helps...
In theory, the order of the joins in the from clause doesn't make a difference on query performance. For a small number of tables, there should be no difference. The optimizer should find the best execution path.
For a larger number of tables, the optimizer may have to short-circuit its search regarding join order. It would then be using heuristics -- and these could be affected by join order.
Earlier queries would have no effect on a particular execution plan.
If you are having problems with performance, I am guessing that join order is not the root cause. The most common problem that I have in SQL Server is inappropriate nested-loop joins -- and these can be handled with an optimizer hint.
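As a rough illustration of such a hint, the sketch below rules out nested-loop joins for Query1. The tables and join columns come from the question; the selected columns are chosen only for illustration.

```sql
-- Column list is illustrative; tables and join columns are from Query1.
SELECT A1.BId, C.Id AS CId, D.Id AS DId
FROM Product1TableA A1
INNER JOIN Product1TableB B ON A1.BId = B.Id
INNER JOIN CommonTable1 C ON C.Id = B.CId
INNER JOIN CommonTable2 D ON D.Id = B.DId
OPTION (HASH JOIN, MERGE JOIN);  -- every join must be hash or merge, so no nested loops
```

If the suspicion is about join order rather than join type, OPTION (FORCE ORDER) makes the optimizer keep the order written in the FROM clause. Hints of either kind are best treated as a diagnostic step, since they stay in force even when the data changes.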
I think I understood what he was trying to say/to do:
What I am wondering is if bunching up common table joins at the start
of all my queries actually helps...
Imagine that you have some queries and every query has more than 3 inner joins. The queries are different but always have (for example) 3 tables in common that are joined on the same fields. Now the question is:
what happens if every query starts with these 3 tables in the join, and all the other tables are joined after them?
The answer is that it changes nothing, i.e. the optimizer will rearrange the tables in the way it thinks will lead to optimal execution.
Things may change if, for example, you save the result of these 3 joins into a temporary table and then join this saved result with the other tables. But this depends on the filters that your queries use. If you have appropriate indexes and your query filters are selective enough (so that your query returns very few rows), there is no need to cache an unfiltered intermediate result with too many rows, because the optimizer can choose to filter every table first and only then join them.
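A minimal sketch of the temporary-table variant, staged for Query1. The temp table name #Stage1 and the selected columns are assumptions; only the tables and join keys come from the question.

```sql
-- Stage 1: materialize part of the join once.
SELECT B.Id AS BId, B.CId, B.DId
INTO #Stage1
FROM Product1TableB B
INNER JOIN CommonTable1 C ON C.Id = B.CId
INNER JOIN CommonTable2 D ON D.Id = B.DId;

-- Stage 2: join the remaining tables against the intermediate result.
SELECT A1.BId, s.CId, s.DId
FROM Product1TableA A1
INNER JOIN #Stage1 s ON s.BId = A1.BId;

DROP TABLE #Stage1;
```

Whether this helps depends on how many rows stage 1 produces: with selective filters, a single query is often just as good.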
Gordon's answer is a good explanation, but this answer explains the JOIN's behavior and also specifies that SQL Server's version is relevant:
Although the join order is changed in optimisation, the optimiser doesn't try all possible join orders. It stops when it finds what it considers a workable solution as the very act of optimisation uses precious resources.
While the optimizer tries its best in choosing a good order for the JOINs, having many JOINs creates a bigger chance of obtaining a not so good plan.
Personally, I have seen many JOINs in some views within an ERP and they usually ran ok. However, from time to time (based on client's data volume, instance configuration etc.), some selects from these views took much more than expected.
If this data reaches an actual application (.NET, Java etc.), one approach is to cache the contents of all small tables, store them as dictionaries (hashes) and perform O(1) lookups based on the keys.
This reduces the JOIN count and avoids reading these tables from the database (except once, when the data is cached). However, it increases the complexity of the application (cache management).
Another solution is to use temporary tables and populate them in multiple queries to avoid many JOINs per single query. This solution usually performs better and also makes debugging easier (if the query returns incorrect data or no data at all, which of the 10-15 JOINs is the problem?).
So, my answer to your question is: you might get some benefit from reordering the JOIN clauses, but I recommend avoiding lots of JOINs in the first place.

Performance for big query in SQL Server view

I have a big query for a view that takes a couple of hours to run, and I feel like it may be possible to work on its performance "a bit".
The problem is that I am not sure what I should do. The query SELECTs 39 values and LEFT OUTER JOINs 25 tables, and each table could have up to a couple of million rows.
Any tip is good. Is there any good way to attack this problem? I tried to look at the actual execution plan on a test with less data (it took about 10 min to run), but it's crazy big. Are there any general things I could do to make this faster? Do I have to tackle one small part at a time?
Maybe there is just one join that slows down everything? How do I detect it? In short, how do I work on a query like this?
As I said, all feedback is good. If there is more information I need to show, tell me!
The query looks something like this:
SELECT DISTINCT
A.something,
A.somethingElse,
B.something,
C.somethingElse,
ISNULL(C.somethingElseElse, ''),
C.somethingElseElseElse,
CASE WHEN *** THEN D.something ELSE 0 END,
E.something,
...
U.something
FROM
TableA A
JOIN
TableB B on ...
JOIN
TableC C on ...
JOIN
TableD D on ...
JOIN
TableE E on ...
JOIN
TableF F on ...
JOIN
TableG G on ...
...
JOIN
TableU U on ...
Break your problem into manageable pieces. If the execution plan is too large for you to analyze, start with a smaller part of the query, check its execution plan and optimize it.
There is no general answer on how to optimize a query, since there is a whole bunch of possible reasons why a query can be slow. You have to check the execution plan.
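One way to work piece by piece is to run a small part of the query with I/O and timing statistics switched on, then add joins back one at a time and watch where the numbers jump. In the sketch below the join columns (BId, CId, Id) are hypothetical, because the original query elides its ON clauses.

```sql
SET STATISTICS IO ON;    -- reports logical/physical reads per table
SET STATISTICS TIME ON;  -- reports CPU and elapsed time

-- Start with the first few joins only; join columns are assumed.
SELECT A.something, B.something AS bSomething, C.somethingElse
FROM TableA A
LEFT OUTER JOIN TableB B ON B.Id = A.BId
LEFT OUTER JOIN TableC C ON C.Id = A.CId;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The output appears on the Messages tab in SQL Server Management Studio.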
Generally the most promising ways to improve performance are:
Indexing:
When you see a Clustered Index Scan or - even worse (because then you don't have a clustered index) - a Table Scan in your query plan for a table that you join, you need an index for your JOIN predicate. This is especially true if you have tables with millions of entries and you select only a small subset of those entries. Also check the index suggestions in the execution plan.
You can see that the index works when your Clustered Index Scan turns into an Index Seek.
Index includes:
You are probably displaying columns from your joined tables that are different from the fields you use to join (otherwise, why would you need to join at all?). SQL Server needs to get the fields that you need from the table, which you see in the execution plan as a Key Lookup.
Since you are taking 39 values from 25 tables, there will be very few fields per table that you need to get (mostly one or two). SQL Server needs to load entire pages of the respective table and get the values from them.
In this case, you should INCLUDE the column(s) you want to display in your index to avoid the key lookups. This comes at the cost of an increased index size, but considering you only include a few columns, that cost should be negligible compared to the size of your tables.
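A sketch of such an index, assuming (hypothetically) that TableB is joined on a column AId and contributes only the column something to the SELECT list; the index name and dbo schema are also assumptions.

```sql
-- Hypothetical columns: TableB is joined on AId and only "something" is selected.
CREATE NONCLUSTERED INDEX IX_TableB_AId
ON dbo.TableB (AId)      -- key column: the JOIN predicate
INCLUDE (something);     -- stored at the leaf level, so no Key Lookup is needed
```

After creating it, the plan for that join should show an Index Seek on IX_TableB_AId with no accompanying Key Lookup.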
Checking views that you join:
When you join VIEWs you should be aware that it basically means an extension to your query (which means also of the execution plan). Do the same performance optimizations for the view as you do for your main query. Also, check if you join tables in the view that you already join in the main query. These joins might be unnecessary.
Indexed views (maybe):
In general, you can add indexes to views you are joining to your query or create one or more indexed views for parts of your query. There are some caveats though:
Indexed views take storage space in your DB, because you store parts of the data multiple times.
There are a lot of restrictions to indexed views, most notably in your case that OUTER JOINs are forbidden. If you can transform at least some of your OUTER JOINs to INNER JOINs this might be an option.
When you join indexed views, don't forget to use WITH(NOEXPAND) in your join, otherwise they might be ignored.
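A minimal sketch of an indexed view and its use with NOEXPAND. The view name, index name and all column names are assumptions; the required SET options (for example QUOTED_IDENTIFIER ON and ANSI_NULLS ON) are taken to be at their defaults.

```sql
-- Hypothetical view over TableA and TableB; column names are assumptions.
CREATE VIEW dbo.vTableAB
WITH SCHEMABINDING                        -- required for an indexed view
AS
SELECT A.Id AS AId, B.Id AS BId,
       A.something AS aSomething, B.something AS bSomething
FROM dbo.TableA A
INNER JOIN dbo.TableB B ON B.AId = A.Id;  -- OUTER JOIN is not allowed here
GO

-- The first index on a view must be unique and clustered.
CREATE UNIQUE CLUSTERED INDEX IX_vTableAB ON dbo.vTableAB (AId, BId);
GO

SELECT v.aSomething, v.bSomething
FROM dbo.vTableAB v WITH (NOEXPAND);      -- read the view's index directly
```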
Partitioned tables (maybe):
If you are running on the Enterprise Edition of SQL Server, you can partition your tables. That can be useful if the rows you join are always selected from a small subset of the available rows. You can make a partition for this subset and increase performance.
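A rough sketch of what partitioning could look like. All names (pfByYear, psByYear, TableA_Partitioned, OrderYear) are hypothetical, and the boundary values are arbitrary. Note that from SQL Server 2016 SP1 onward, partitioning is no longer limited to the Enterprise Edition.

```sql
-- All object and column names below are hypothetical.
CREATE PARTITION FUNCTION pfByYear (int)
AS RANGE RIGHT FOR VALUES (2022, 2023);   -- 3 partitions: < 2022, 2022, >= 2023

CREATE PARTITION SCHEME psByYear
AS PARTITION pfByYear ALL TO ([PRIMARY]);

CREATE TABLE dbo.TableA_Partitioned
(
    Id        int         NOT NULL,
    OrderYear int         NOT NULL,
    something varchar(50) NULL,
    -- the partitioning column must be part of the clustered key
    CONSTRAINT PK_TableA_Partitioned PRIMARY KEY CLUSTERED (Id, OrderYear)
) ON psByYear (OrderYear);
```

A query that filters on OrderYear can then skip the partitions that cannot contain matching rows.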
Summary:
Divide and conquer. Analyze your query bit by bit to optimize it. The most promising options are indexes and index includes. If you still have trouble, go from there.

`in` keyword vs inner `join` which one is better for performance? [duplicate]

I have a case where using a JOIN or an IN will give me the correct results... Which typically has better performance and why? How much does it depend on what database server you are running? (FYI I am using MSSQL)
Generally speaking, IN and JOIN are different queries that can yield different results.
SELECT a.*
FROM a
JOIN b
ON a.col = b.col
is not the same as
SELECT a.*
FROM a
WHERE col IN
(
SELECT col
FROM b
)
, unless b.col is unique.
However, this is equivalent to the IN query:
SELECT a.*
FROM a
JOIN (
SELECT DISTINCT col
FROM b
) bo
ON bo.col = a.col
If the joining column is UNIQUE and marked as such, both these queries yield the same plan in SQL Server.
If it's not, then IN is faster than JOIN on DISTINCT.
See this article in my blog for performance details:
IN vs. JOIN vs. EXISTS
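The difference in results is easy to reproduce. The sketch below uses table variables with made-up data (the values are purely illustrative) in which b.col is not unique:

```sql
DECLARE @a TABLE (col int);
DECLARE @b TABLE (col int);
INSERT INTO @a VALUES (1), (2);
INSERT INTO @b VALUES (1), (1), (3);   -- col = 1 appears twice

-- Plain JOIN: the row with col = 1 is returned twice (2 rows).
SELECT a.col FROM @a a JOIN @b b ON a.col = b.col;

-- IN: each row of @a is returned at most once (1 row).
SELECT a.col FROM @a a WHERE a.col IN (SELECT col FROM @b);

-- JOIN on DISTINCT: same result as IN (1 row).
SELECT a.col
FROM @a a
JOIN (SELECT DISTINCT col FROM @b) bo ON bo.col = a.col;
```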
This thread is pretty old but still mentioned often. For my taste it is a bit incomplete, because there is another way to query the database, the EXISTS keyword, which I have found to be faster more often than not.
So if you are only interested in values from table a you can use this query:
SELECT a.*
FROM a
WHERE EXISTS (
SELECT *
FROM b
WHERE b.col = a.col
)
The difference can be huge if col is not indexed, because the database does not have to find all records in b with the same value in col; it only has to find the first one. If there is no index on b.col and b has many records, a table scan may result. With IN or a JOIN this would be a full table scan; with EXISTS it would be only a partial table scan (until the first matching record is found).
If there are lots of records in b with the same col value, you also waste a lot of memory reading all of them into temporary space just to find that your condition is satisfied. With EXISTS this can usually be avoided.
I have often found EXISTS faster than IN even when there is an index. It depends on the database system (the optimizer), the data and, last but not least, the type of index used.
That's rather hard to say - in order to really find out which one works better, you'd need to actually profile the execution times.
As a general rule of thumb, I think if you have indices on your foreign key columns, and if you're using only (or mostly) INNER JOIN conditions, then the JOIN will be slightly faster.
But as soon as you start using OUTER JOIN, or if you're lacking foreign key indexes, the IN might be quicker.
Marc
An interesting writeup on the logical differences: SQL Server: JOIN vs IN vs EXISTS - the logical difference
I am pretty sure that, assuming the relations and indexes are maintained, a JOIN will perform better overall (more effort goes into optimizing that operation than others). Conceptually, it's the difference between 2 queries and 1 query.
You need to hook it up to the Query Analyzer, try it and see the difference. Also look at the query execution plan and try to minimize steps.
Each database's implementation is different, but you can probably guess that they all solve common problems in more or less the same way. If you are using MSSQL, have a look at the execution plan that is generated. You can do this by turning on the profiler and execution plans. This will give you a text version when you run the command.
I am not sure which version of MSSQL you are using, but you can get a graphical one in SQL Server 2000 in the Query Analyzer. I am sure this functionality is lurking somewhere in SQL Server Management Studio in later versions.
Have a look at the execution plan. As far as possible, avoid table scans, unless of course your table is small, in which case a table scan is faster than using an index. Read up on the join operations that each scenario produces.
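One way to get the text version of a plan mentioned above is SET SHOWPLAN_TEXT. A minimal sketch, using the tables a and b from the earlier examples:

```sql
SET SHOWPLAN_TEXT ON;   -- must be the only statement in its batch
GO
-- The query is compiled but not executed; the estimated plan is returned as text.
SELECT a.*
FROM a
WHERE a.col IN (SELECT col FROM b);
GO
SET SHOWPLAN_TEXT OFF;
GO
```

Running the JOIN and EXISTS variants the same way makes it easy to see whether the optimizer produces the same plan for all of them.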
The optimizer should be smart enough to give you the same result either way for normal queries. Check the execution plan and they should give you the same thing. If they don't, I would normally consider the JOIN to be faster. All systems are different, though, so you should profile the code on your system to be sure.

Why is there a HUGE performance difference between temp table and subselect

This is a question about SQL Server 2008 R2
I'm not a DBA, by far. I'm a Java developer who has to write SQL from time to time (mostly embedded in code). I want to know if I did something wrong here, and if so, what I can do to avoid it happening again.
Q1:
SELECT something FROM (SELECT * FROM T1 WHERE condition1) JOIN ...
Q1 features 14 joins
Q2 is the same as Q1, with one exception. (SELECT * FROM T1 WHERE condition1) is executed before, and stored in a temp table.
This is not a correlated sub-query.
Q2:
SELECT * INTO #tempTable FROM T1 WHERE condition1
SELECT something FROM #tempTable JOIN ...
again, 14 joins.
The thing that puzzles me now is that Q1 took > 2 min (I tried it a few times to rule out caching), while Q2 (both queries combined) took 2 sec! What gives?
Why is it not recommended to use subqueries?
The database optimizer (regardless of which database you are using) cannot always properly optimize a query with subqueries. In this case, the optimizer's problem is choosing the right way to join result sets. There are several algorithms for joining two result sets, and the choice depends on the number of records in each one. If you join two physical tables (a subquery is not a physical table), the database can easily determine the amount of data in both result sets from the available statistics. If one of the result sets is a subquery, it is much harder to estimate how many records it returns. The database can then choose the wrong join plan, which leads to a dramatic drop in query performance.
Rewriting the query to use temporary tables is intended to make the optimizer's job simpler. In the rewritten query all result sets participating in joins are physical tables, so the database can easily determine the size of each one. That makes it much more likely to choose a good query plan, largely independent of the conditions in the query. The rewritten query with temporary tables should work well on any database, which is especially important when developing portable solutions. In addition, the rewritten query is easier to read, understand and debug.
Of course, rewriting the query with temporary tables adds some overhead for creating the temporary tables. If the database does not pick the wrong plan, it will run the original query faster than the rewritten one. However, this slowdown is usually negligible: creating a temporary table typically takes a few milliseconds, so the delay has no significant impact on system performance and can usually be ignored.
Important! Do not forget to create indexes for temporary tables. The index fields should include all fields that are used in join conditions.
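A self-contained sketch of the Q2 pattern with an index on the temporary table. #T1 stands in for T1 from the question; its columns, the sample rows and the filter that plays the role of condition1 are all made up for illustration.

```sql
-- Stand-in for T1 from the question; columns and data are hypothetical.
CREATE TABLE #T1 (Id int NOT NULL, JoinKey int NOT NULL, Status char(1) NOT NULL);
INSERT INTO #T1 VALUES (1, 10, 'A'), (2, 20, 'B'), (3, 10, 'A');

-- Step 1: materialize the filtered rows (the former subquery).
SELECT Id, JoinKey
INTO #tempTable
FROM #T1
WHERE Status = 'A';          -- plays the role of condition1

-- Step 2: index the column(s) used in the join conditions.
CREATE CLUSTERED INDEX IX_tempTable_JoinKey ON #tempTable (JoinKey);

-- Step 3: join against the temp table as in Q2 (the other 14 tables would go here).
SELECT t.Id
FROM #tempTable t
INNER JOIN #T1 o ON o.JoinKey = t.JoinKey;

DROP TABLE #tempTable;
DROP TABLE #T1;
```

Creating the index after the SELECT ... INTO means it is built once over the final data, and the statistics created with it reflect the actual row count.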
There are a lot of things to tackle here: indexes, execution plans, etc. Testing and comparing results is the way to go.
You could take a look at the usual suspects, the indexes. Look at the execution plans and compare them. Make sure the WHERE clause is using the correct indexes, and ensure your JOINs use indexes as well.
These answers sure will help you a lot.
Performance: Subquery or Joining
Is there a speed difference between CTE , SubQuery and Temp tables?

Sql Server Join query

I have two tables: one is small and the other is large. When joining the two tables, which one should I put on the left and which on the right so that the query optimizer works faster, or does it not matter where I put each table?
for example :
--1
SELECT smalltable.column1,
largetable.column1
FROM smalltable
INNER JOIN largetable
ON smalltable.column1 = largetable.column1 ;
--2
SELECT smalltable.column1,
largetable.column1
FROM smalltable
INNER JOIN largetable
ON largetable.column1 = smalltable.column1 ;
Which query will be faster, or does it not matter?
If you're talking about Microsoft SQL Server, both queries are equivalent to the query optimizer. In fact, they'll be equivalent to almost any cost-based query optimizer. You can check this by looking at the execution plan (see http://www.simple-talk.com/sql/performance/execution-plan-basics/ for details).
The query optimizer of most decent SQL servers will handle that. Some primitive ones don't (have a query optimizer; older MySQL and Access come to mind). Some may get overloaded by complex decisions (this one is simple).
But in general - trust the query optimizer first.
It should not matter which order you use, as your SQL Server should optimise the query execution for you. However, (if you are using Microsoft SQL Server) you could use SQL Server Profiler (found under the Tools menu of SQL Server Management Studio) to check the execution plans of both options.
If one of the tables is smaller than the other, place the smaller table first and then the larger table, as the query will have less work to do; moreover, this helps the query optimizer choose a plan that uses a hash join.
Then run the query profiler and check that the hash join is used, because it is the best and fastest option in this scenario.
If there are no indexes on the joined tables, the optimizer will select a hash join.
You can force a hash join by adding OPTION (HASH JOIN) at the end of the query.
From MSDN, http://blogs.msdn.com/b/irenak/archive/2006/03/24/559855.aspx:
The column name that joins the table is called a hash key. In the example above, it’ll be au_id. SQL Server examines the two tables being joined, chooses the smaller table (so called build input), and builds a hash table applying a hash algorithm to the values of a hash key. Each row is inserted into a hash bucket depending on the hash value computed for the hash key. If build input is done completely in-memory, the hash join is called an “in-memory hash join”. If SQL Server doesn’t have enough memory to hold the entire build input, the process will be done in chunks, and is called “grace hash join”.
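For reference, a sketch of the two ways to request a hash join for the tables in the question: a query-level hint and a join-level hint.

```sql
-- Query-level hint: every join in the statement must be a hash join.
SELECT smalltable.column1, largetable.column1
FROM smalltable
INNER JOIN largetable ON smalltable.column1 = largetable.column1
OPTION (HASH JOIN);

-- Join-level hint: only this join is forced. A local join hint also
-- makes SQL Server keep the join order as written.
SELECT smalltable.column1, largetable.column1
FROM smalltable
INNER HASH JOIN largetable ON smalltable.column1 = largetable.column1;
```

Comparing the plans with and without the hint shows whether the optimizer would have chosen the hash join on its own.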
Before running both queries, select 'Include Actual Execution Plan' from the menu and then run them. SQL Server will show the execution plan, which is the best tool for writing optimized queries. See more about Execution Plan here.
The order of the join columns does matter. See this post for more detail. Also there has been no discussion of indexing in this thread. It is the combination of optimal join table order AND useful indexing that results in the fastest executing queries.