A query's performance changes depending on whether a certain column is included in the SELECT list, and the strange thing is that the effect is positive (execution time drops) when the column is included.
The query joins a view, some tables, and a table-valued function, like this:
SELECT
v1.field1, t2.field2
FROM
view1 v1 WITH (NOLOCK)
INNER JOIN
table1 t1 WITH (NOLOCK) ON v1.field1 = t1.field1
INNER JOIN
table2 t2 WITH (NOLOCK) ON t2.field2 = t1.field2
INNER JOIN
function1(@param) f1 ON f1.field3 = t2.field3
WHERE
(v1.date1 = @param OR v1.date2 = @param)
The thing is, if I include in the SELECT a varchar(200) NOT NULL column that is part of the view (it is not indexed in the underlying table or the view, and it is not part of a constraint), the query takes X seconds, but if I leave it out the time goes up to 4X seconds, which is a big difference just for including a column. So the query with the best performance looks like this:
SELECT
v1.field1, t2.field2, v1.fieldWhichAffectsPerformance
FROM
view1 v1 WITH (NOLOCK)
INNER JOIN
table1 t1 WITH (NOLOCK) ON v1.field1 = t1.field1
INNER JOIN
table2 t2 WITH (NOLOCK) ON t2.field2 = t1.field2
INNER JOIN
function1(@param) f1 ON f1.field3 = t2.field3
WHERE
(v1.date1 = @param OR v1.date2 = @param)
I'm required to remove the column that improves the query performance, but without negatively affecting the current performance. Any ideas?
EDIT: as suggested I've reviewed the execution plans. The query without the column runs an extra hash match (left outer join) and uses index scans, which cost a lot of CPU, instead of the index seeks that appear in the plan for the query with the column included. How can I remove the column without affecting performance? Any ideas?
Optimizers are complicated. Without query plans, there is only speculation.
You need to look at the query plans to get a real answer.
One possibility is the order of processing. The select could equivalently be written as:
SELECT t1.field1, t2.field2
because the ON condition specifies that the columns in the two tables are the same. The optimizer may recognize that the OR prevents the use of indexes on the view (which is probably not applicable anyway). So, instead of scanning the view, it decides to scan table1 and then bring in the view.
By including an additional column in the select, you are pushing the optimizer to scan the view -- and this might be the better execution plan.
This is all hypothetical, but it gives a mechanism on how your observed timings could happen.
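If your EDIT is accurate and the fast plan's advantage is the index seek, one thing worth experimenting with (a speculative sketch; ix_table1_field1 is a hypothetical index name, not something taken from your schema) is hinting that index explicitly and comparing the plans with and without the column:
SELECT v1.field1, t2.field2
FROM view1 v1
INNER JOIN table1 t1 WITH (INDEX (ix_table1_field1)) ON v1.field1 = t1.field1
INNER JOIN table2 t2 ON t2.field2 = t1.field2
INNER JOIN function1(@param) f1 ON f1.field3 = t2.field3
WHERE (v1.date1 = @param OR v1.date2 = @param)
OPTION (RECOMPILE)   -- hypothetical: forces a fresh plan so the hint's effect is visible
If the hint does not reproduce the seek, the speculation above is probably wrong and the plans need a closer look.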
I have a somewhat complex question.
Let's say I have 7 tables (20 million+ rows each) (Table1, Table2, ...) with corresponding primary keys (pk1, pk2, ...); the cardinality among all the tables is 1:1.
I want to get my final table (using hash join) as:
Create table final_table as select
t1.column1,
t2.column2,
t3.column3,
t4.column4,
t5.column5,
t6.column6,
t7.column7
from table1 t1
join table2 t2 on t1.pk1 = t2.pk2
join table3 t3 on t1.pk1 = t3.pk3
join table4 t4 on t1.pk1 = t4.pk4
join table5 t5 on t1.pk1 = t5.pk5
join table6 t6 on t1.pk1 = t6.pk6
join table7 t7 on t1.pk1 = t7.pk7
I would like to know whether it would be faster to create partial tables first and then the final table, like this:
Create table partial_table1 as select
t1.pk1,   -- keep the key so the later joins are possible
t1.column1,
t2.column2
from table1 t1
join table2 t2 on t1.pk1 = t2.pk2
create table partial_table2 as select
t1.pk1,
t1.column1, t1.column2,
t3.column3
from partial_table1 t1
join table3 t3 on t1.pk1 = t3.pk3
create table partial_table3 as select
t1.pk1,
t1.column1, t1.column2, t1.column3,
t4.column4
from partial_table2 t1
join table4 t4 on t1.pk1 = t4.pk4
...
...
...
I know it depends on RAM (because I want to use hash joins), actual server load, etc. I am not looking for a specific answer; I am looking for some explanation of why and in what situations it would be better to use partial results, and why it might be better to use all 7 joins in one SELECT.
Thanks, I hope that my question is easy to understand.
In general, it is not better to create temporary tables. SQL engines have an optimization phase, and this optimization phase should do well at figuring out the best query plan.
In the case of a bunch of joins, this is mostly about join order, use of indexes, and the optimal algorithm.
This is a good default attitude. Does it mean that temporary tables are never useful for performance optimization? Not at all. Here are some exceptions:
The optimizer generates a suboptimal query plan. In this case, query hints can push the optimizer in the right direction. And, temporary tables can help.
Indexing the temporary tables. Sometimes an index on the temporary tables can be a big win for performance. The optimizer might not pick this up.
Re-use of temporary tables across queries.
For your particular goal of using hash joins, you can use a query hint to ensure that the optimizer does what you would like. I should note that if the joins are on primary keys, then a hash join might not be the optimal algorithm.
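For example (a sketch in SQL Server syntax, since that is where the hint is spelled this way; other engines have their own mechanisms for this), a query-level hint asks for hash joins throughout:
select t1.column1, t2.column2, t3.column3, t4.column4,
t5.column5, t6.column6, t7.column7
into final_table   -- SELECT ... INTO is SQL Server's equivalent of CREATE TABLE ... AS SELECT
from table1 t1
join table2 t2 on t1.pk1 = t2.pk2
join table3 t3 on t1.pk1 = t3.pk3
join table4 t4 on t1.pk1 = t4.pk4
join table5 t5 on t1.pk1 = t5.pk5
join table6 t6 on t1.pk1 = t6.pk6
join table7 t7 on t1.pk1 = t7.pk7
option (hash join)   -- every join in this statement is performed as a hash join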
It is not a good idea to create temporary tables in your database. To optimize your query for reporting purposes or faster results, try using views; that can lead to much better results.
For your specific case, you want to use a hash join; can you explain a bit more why you want that in particular? The optimizer will determine the best plan by itself, and you don't normally need to worry about the type of join it performs.
I have a number of tables, around four, that I wish to join together. To make my code cleaner and more readable (to me), I want to join everything at once and then filter at the end:
SELECT f1, f2, ..., fn
FROM t1 INNER JOIN t2 ON t1.field = t2.field
INNER JOIN t3 ON t2.field = t3.field
INNER JOIN t4 ON t3.field = t4.field
WHERE // filters here
But I suspect that placing each table in subqueries and filtering in each scope would make performance better.
SELECT f1, f2, ..., fn
FROM (SELECT t1_f1, t1_f2, ..., t1_fi FROM t1 WHERE // filter here) AS a
INNER JOIN
(SELECT t2_f1, t2_f2, ..., t2_fj FROM t2 WHERE // filter here) AS b
ON // and so on
Kindly advise which would lead to better performance and/or whether my hunch is correct. I am willing to sacrifice performance for readability.
If filtering in each subquery is indeed more efficient, does the architecture of the database platform make any difference, or does this hold true for all RDBMS SQL flavors?
I'm using both SQL Server and Postgres.
The query optimizer will always attempt to find the most efficient plan for your SQL.
You should concentrate more on writing readable, maintainable code, and then, by analyzing the execution plan, find the inefficient parts of your query and (more likely) the inefficient parts of your database and indexing design.
Moving your filtering around from the where clause to the join clause without any meaningful analysis is likely to be wasted effort.
Your first approach will always be better, as the SQL engine will evaluate the WHERE conditions first and then perform the joins. So while evaluating the WHERE clause, it will filter out records when conditions are available.
SELECT f1, f2, ..., fn
FROM t1 INNER JOIN t2 ON t1.field = t2.field
INNER JOIN t3 ON t2.field = t3.field
INNER JOIN t4 ON t3.field = t4.field
WHERE // filters here
Joins will always perform better if you have indexed properly.
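For example (a sketch with made-up index names; the column names come from the query above), indexes on the join columns are what lets the engine seek instead of scan:
CREATE INDEX ix_t2_field ON t2 (field);   -- supports the t1-t2 join
CREATE INDEX ix_t3_field ON t3 (field);   -- supports the t2-t3 join
CREATE INDEX ix_t4_field ON t4 (field);   -- supports the t3-t4 join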
I have 3 tables: Table1 (with 1,020,690 records), Table2 (with 289,425 records), and Table3 (with 83,692 records). I have something like this:
SELECT * FROM Table1 T1 /* OK fine select * is bad when not all columns are needed, this is just an example*/
LEFT JOIN Table2 T2 ON T1.id=T2.id
LEFT JOIN Table3 T3 ON T1.id=T3.id
and a query like this
SELECT * FROM Table1 T1
LEFT JOIN Table3 T3 ON T1.id=T3.id
LEFT JOIN Table2 T2 ON T1.id=T2.id
The query plan shows me that it uses 2 Merge Join for both the joins. For the first query, the first merge is with T1 and T2 and then with T3. For the second query, the first merge is with T1 and T3 and then with T2.
Both queries take about the same time (approximately 40 seconds), or sometimes Query1 takes a couple of seconds longer.
So my question is, does the join order matter?
The join order for a simple query like this should not matter. If there's a way to reorder the joins to improve performance, that's the job of the query optimizer.
In theory, you shouldn't worry about it -- that's the point of SQL. Trying to outthink the query optimizer is generally not going to give better results. Especially in MS SQL Server, which has a very good query optimizer.
I wouldn't expect this query to take 40 seconds. You might not have the right indexes defined. You should use tools like SQL Server Profiler or the SQL Server Database Engine Tuning Advisor to see if they can recommend any new indexes.
The query optimizer will use a combination of the constraints, indexes, and statistics collected on the tables to build an execution plan. In most cases this works well. However, I do occasionally encounter scenarios where the execution plan is chosen poorly. Often, tweaking the query can effectively coerce the optimizer into choosing a better plan. I can offer no general rules for doing this, though. When all else fails you could resort to the FORCE ORDER query hint.
And yes, the join order can have a significant impact on the execution time of your query. The idea is that joining the tables that yield the smallest results first makes the next join cheaper to compute. Edit: It is important to note, however, that in the absence of FORCE ORDER, and with all other things being equal, the order you specify in the query may have no correlation with the way the optimizer builds the execution plan.
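For reference (a sketch using the tables from the question; treat it as something to experiment with rather than a recommendation), FORCE ORDER tells SQL Server to join in exactly the order the query is written:
SELECT *
FROM Table1 T1
LEFT JOIN Table3 T3 ON T1.id = T3.id   -- smallest table joined first
LEFT JOIN Table2 T2 ON T1.id = T2.id
OPTION (FORCE ORDER)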
In general, SQL Server is smart enough to pick out the best way to join and it will not only use the order you wrote in the query. That said, I find it easier to understand a complex query if all the inner joins are first and then the left joins.
I want to understand how nested join clauses are processed in SQL queries. Can you explain this example with pseudocode? (What is the order in which the tables are joined?)
FROM
table1 AS t1 (nolock)
INNER JOIN table2 AS t2 (nolock)
INNER JOIN table3 as t3 (nolock)
ON t2.id = t3.id
ON t1.mainId = t2.mainId
In SQL, there are basically three ways to join two tables:
Nested Loop (good if one table has a small number of rows),
Hash Join (good if both tables have a very large number of rows; it builds an expensive hash table in memory),
Merge Join (good when the data to join is already sorted).
From your question it seems that you are asking about the Nested Loop.
Let us say t1 has 20 rows, t2 has 500 rows.
Now it will work like this:
For each row in t1
    find the rows in t2 where t1.mainId = t2.mainId
The output of that is then joined to t3.
The order of joining depends on the optimizer, expected row counts, etc.
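If you want to experiment with a specific algorithm, SQL Server accepts query-level join hints (a sketch only; hints are usually best left off in production):
SELECT t1.mainId, t3.id
FROM table1 AS t1
INNER JOIN table2 AS t2 ON t1.mainId = t2.mainId
INNER JOIN table3 AS t3 ON t2.id = t3.id
OPTION (LOOP JOIN)   -- or HASH JOIN / MERGE JOIN, to force that algorithm for the whole statement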
Try running EXPLAIN on the query.
It tells you exactly what's going on. :)
Of course that doesn't work in SQL Server. For that you can try Razor SQLServer Explain Plan
Or even SET SHOWPLAN_ALL
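For example, SHOWPLAN_ALL makes SQL Server return the estimated plan as rows instead of executing the statement (the SELECT in the middle is just a placeholder for your own query):
SET SHOWPLAN_ALL ON;
GO
SELECT t1.mainId
FROM table1 AS t1
INNER JOIN table2 AS t2 ON t1.mainId = t2.mainId;   -- not executed, only its plan is returned
GO
SET SHOWPLAN_ALL OFF;
GO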
If you're using SQL Server Query Analyzer, look for "Show Execution Plan" under the "Query" menu, and enable it.
Given these two queries:
Select t1.id, t2.companyName
from table1 t1
INNER JOIN table2 t2 on t2.id = t1.fkId
WHERE t2.aField <> 'C'
OR:
Select t1.id, t2.companyName
from table1 t1
INNER JOIN table2 t2 on t2.id = t1.fkId and t2.aField <> 'C'
Is there a demonstrable difference between the two? Seems to me that the clause "t2.aField <> 'C'" will run on every row in t2 that meets the join criteria regardless. Am I incorrect?
Update: I did an "Include Actual Execution Plan" in SQL Server. The plans for the two queries were identical.
I prefer to use the Join criteria for explaining how the tables are joined together.
So I would place the additional clause in the where section.
I hope (although I have no stats) that SQL Server is clever enough to find the optimal query plan regardless of the syntax you use.
HOWEVER, if you have indexes that contain both id and aField, I would suggest placing the two conditions together in the inner join criteria.
It would be interesting to see the query plans in these 2 (or 3) scenarios and see what happens. Nice question.
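For instance (a sketch; the index name is made up), a composite index covering both columns lets the engine check the filter from the same index it uses for the join:
CREATE INDEX ix_table2_id_aField ON table2 (id, aField);

SELECT t1.id, t2.companyName
FROM table1 t1
INNER JOIN table2 t2 ON t2.id = t1.fkId AND t2.aField <> 'C'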
There is a difference. You should do an EXPLAIN PLAN for both of the selects and see it in detail.
As for a simpler explanation:
The WHERE clause is executed only after the two tables have been joined, so it runs for each row returned from the join and not necessarily for every row in table2.
Performance-wise, it's best to eliminate unwanted results early on, so there are fewer rows for joins, WHERE clauses, or other operations to deal with later on.
In the second example, there are two conditions that have to be satisfied for the rows to be joined together, so it will usually give different results than the first one.
It depends.
SELECT
t1.foo,
t2.bar
FROM
table1 t1
LEFT JOIN table2 t2 ON t1.SomeId = t2.SomeId
WHERE
t2.SomeValue IS NULL
is different from
SELECT
t1.foo,
t2.bar
FROM
table1 t1
LEFT JOIN table2 t2 ON t1.SomeId = t2.SomeId AND t2.SomeValue IS NULL
It is different because the former keeps only the rows where t2.SomeValue is NULL, whether because there is no matching record in t2 or because the matching record's SomeValue is NULL; every row whose match has a non-NULL SomeValue is crossed out. The latter keeps every record from t1 and only restricts which t2 records get attached to them.
Just use the ON clause for the join condition and the WHERE clause for the filter.
Unless moving the join condition to the where clause changes the meaning of the query (like in the left join example above), then it doesn't matter where you put them. SQL will re-arrange them, and as long as they are provably equivalent, you'll get the same query.
That being said, I think it's more of a logical / readability thing. I usually put anything that relates two tables in the join, and anything that filters in the where.
I'd prefer the first query. SQL Server will use the best join type for your query based on the indexes you have, and after that it will apply the WHERE clause. But you can run both queries, look at the execution plans, compare, and choose the fastest (and optimize by adding indexes as well).
Unless you are working on a single-user app or something similarly small that creates trivial load, the only consideration that means anything is how the server will process your query.
The answers that mention query plans give good advice.
In addition, set STATISTICS IO on to get an idea of how many reads your query will generate (I especially love Azder's post).
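For example (this is the standard SQL Server session option; run the query you want to measure between the two SET statements):
SET STATISTICS IO ON;
-- run either version of the query here; the Messages tab then reports
-- logical and physical reads per table
SET STATISTICS IO OFF;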
Think of every DB server as a pump of data from disk to client. That pump goes faster if it performs only the I/O needed to get the job done. If the data is in cache it will be even faster. But you don't want to read more than you need from disk - that will crowd useful data out of your cache for no good reason.