I have a view that needs to join on a concatenated column. For example:
dbo.View1 INNER JOIN
dbo.table2 ON dbo.View1.combinedcode = dbo.table2.code
Inside 'View1' there is a column which is computed like so:
dbo.tableA.details + dbo.tableB.code AS combinedcode
Performing a join on this column is extremely slow. However, 'View1' by itself runs extremely quickly. The poor performance comes only with the join, and there aren't even many rows in any of the tables or views. Does anyone know why this might be?
Thanks for any insight!
Since there's no index on combinedcode, the JOIN will most likely result in a full "table scan" of the view to calculate the code for every row.
If you want to speed things up, try making the view into an indexed view with an index on combinedcode to help the join.
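As a minimal sketch of what that could look like (the real definition of View1 isn't shown, so the table layout, join condition, and uniqueness of id are all assumptions):

CREATE VIEW dbo.View1
WITH SCHEMABINDING  -- required before a view can be indexed
AS
SELECT a.id,
       a.details + b.code AS combinedcode
FROM dbo.tableA a
INNER JOIN dbo.tableB b ON b.id = a.id;  -- hypothetical join condition
GO
-- The first index on a view must be unique and clustered.
CREATE UNIQUE CLUSTERED INDEX IX_View1 ON dbo.View1 (id);
-- Then index the concatenated column so the join can seek instead of scan.
CREATE NONCLUSTERED INDEX IX_View1_combinedcode ON dbo.View1 (combinedcode);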
Another alternative, depending on your SQL Server version, is (as Parado answers) to create a temporary table for the join, although it's usually less performant, at least for single-shot queries.
Try this way:
SELECT *
INTO #TemTap
FROM View1
/* WHERE conditions on View1 */
After that, you can create an index on #TemTap.combinedcode and then join:
#TemTap AS View1 INNER JOIN dbo.table2 ON View1.combinedcode =
dbo.table2.code
It often works for me.
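Spelled out end to end, the pattern looks like this (the index name is made up):

SELECT *
INTO #TemTap
FROM View1
/* WHERE conditions on View1 */

CREATE INDEX IX_TemTap_combinedcode ON #TemTap (combinedcode);

SELECT *
FROM #TemTap AS View1
INNER JOIN dbo.table2 ON View1.combinedcode = dbo.table2.code;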
The reason is because the optimizer has no information about the concatenated column, so it cannot choose a reasonable join path. My guess, if you look at the execution plan, is that the join is using a "nested loop" join. (I'm tempted to add "dreaded" to that.)
You might be able to fix this by putting an index on table2(code). The optimizer should decide to use this index, getting around the bad join optimization.
You can also use query hints to force the use of a "hash join" or "merge join". I am finding myself doing this more often for complex queries, where changes to the data might affect the query plan. (Such hints go in when a query that has been taking 2 minutes for a year decides to take hours, fill tempdb, and die when it runs out of space.) You can do this by adding OPTION (MERGE JOIN, HASH JOIN) to the end of the query. You can also explicitly choose the type of join in the join clause itself.
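For example (a sketch; the table names come from the question, the column list is a placeholder):

-- Query-level hint: restrict the optimizer to merge or hash, ruling out nested loops.
SELECT v.combinedcode, t.code
FROM dbo.View1 v
INNER JOIN dbo.table2 t ON v.combinedcode = t.code
OPTION (MERGE JOIN, HASH JOIN);

-- Or force the algorithm for one specific join.
SELECT v.combinedcode, t.code
FROM dbo.View1 v
INNER HASH JOIN dbo.table2 t ON v.combinedcode = t.code;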
Finally, storing the intermediate results in a temporary table (as proposed by Parado) should give the optimizer enough information to choose the best join algorithm.
Using SQL functions in a join or WHERE condition is not advised. Here you are (indirectly) using concatenation in the join condition, so the concatenation is performed for every row before it can be compared against the other table.
The solution would be to use an intermediate table, rather than this view, to hold the concatenated value.
If that's not possible, try an indexed view, though I know it's a hell of a task. I would still prefer creating the intermediate table.
See this link for the restrictions on indexed views:
http://msdn.microsoft.com/en-us/library/ms191432.aspx#Restrictions
I have a big query for a view that takes a couple of hours to run, and I feel like it should be possible to improve its performance "a bit"...
The problem is that I am not sure what I should do. The query SELECTs 39 values, LEFT OUTER JOINs 25 tables, and each table could have up to a couple of million rows.
Any tip is good. Is there a good way to attack this problem? I tried to look at the actual execution plan on a test with less data (it took about 10 minutes to run), but it's crazy big. Are there any general things I could do to make this faster? Do I have to tackle one small part at a time?
Maybe there is just one join that slows down everything? How do I detect it? In short, how do I work on a query like this?
As I said, all feedback is good. If there is more information I should show, tell me!
The query looks something like this:
SELECT DISTINCT
A.something,
A.somethingElse,
B.something,
C.somethingElse,
ISNULL(C.somethingElseElse, ''),
C.somethingElseElseElse,
CASE WHEN *** THEN D.something ELSE 0 END,
E.something,
...
U.something
FROM
TableA A
JOIN
TableB B on ...
JOIN
TableC C on ...
JOIN
TableD D on ...
JOIN
TableE E on ...
JOIN
TableF F on ...
JOIN
TableG G on ...
...
JOIN
TableU U on ...
Break your problem into manageable pieces. If the execution plan is too large for you to analyze, start with a smaller part of the query, check its execution plan and optimize it.
There is no general answer on how to optimize a query, since there is a whole bunch of possible reasons why a query can be slow. You have to check the execution plan.
Generally the most promising ways to improve performance are:
Indexing:
When you see a Clustered Index Scan or - even worse (because then you don't have a clustered index) - a Table Scan in your query plan for a table that you join, you need an index for your JOIN predicate. This is especially true if you have tables with millions of entries and you select only a small subset of those entries. Check also the index suggestions in the execution plan.
You see that the index works when your Clustered Index Scan turns into an Index Seek.
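For example, assuming the plan shows a scan on TableB and the join predicate uses a hypothetical column b_key:

CREATE NONCLUSTERED INDEX IX_TableB_b_key ON TableB (b_key);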
Index includes:
You are probably displaying columns from your joined tables that are different from the fields you use to join (otherwise, why would you need the join?). SQL Server needs to get those fields from the table, which you see in the execution plan as a Key Lookup.
Since you are taking 39 values from 25 tables, there will be very few fields per table that you need to get (mostly one or two), yet SQL Server has to load entire pages of the respective table just to get those values from them.
In this case, you should INCLUDE the column(s) you want to display in your index to avoid the key lookups. This comes at the cost of increased index size, but considering you only include a few columns, that cost should be negligible compared to the size of your tables.
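A sketch of such an index, reusing the hypothetical names from above: the join column is the key, and the displayed column rides along in the leaf pages so no Key Lookup is needed:

CREATE NONCLUSTERED INDEX IX_TableB_b_key
ON TableB (b_key)
INCLUDE (something);  -- the column(s) your SELECT displays from TableB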
Checking views that you join:
When you join VIEWs, you should be aware that it basically means extending your query (and therefore the execution plan). Do the same performance optimizations for the view as you do for your main query. Also check whether you join tables in the view that you already join in the main query; those joins might be unnecessary.
Indexed views (maybe):
In general, you can add indexes to views you are joining to your query or create one or more indexed views for parts of your query. There are some caveats though:
Indexed views take storage space in your DB, because you store parts of the data multiple times.
There are a lot of restrictions to indexed views, most notably in your case that OUTER JOINs are forbidden. If you can transform at least some of your OUTER JOINs to INNER JOINs this might be an option.
When you join indexed views, don't forget to use WITH(NOEXPAND) in your join, otherwise they might be ignored.
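For example (hypothetical names):

SELECT v.col1, a.col2
FROM dbo.MyIndexedView v WITH (NOEXPAND)
INNER JOIN TableA a ON a.id = v.id;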
Partitioned tables (maybe):
If you are running on the Enterprise Edition of SQL Server, you can partition your tables. That can be useful if the rows you join are always selected from a small subset of the available rows. You can make a partition for this subset and increase performance.
Summary:
Divide and conquer. Analyze your query bit by bit to optimize it. The most promising options are indexes and index includes; start there and work through the rest if you still have trouble.
I have a query that joins 15 tables or even more. I need to optimize the response time. I created indexes on some columns and changed some conditions from NOT IN to NOT EXISTS, but I found myself wondering about this:
Does the order of these joins affect the response time?
The order of JOINs definitely does affect performance, as well as the type of join. INNER JOINs, generally, will yield quicker results than RIGHT or LEFT OUTER JOINs due to the selectivity of the join.
SQL Server also tries its best to optimize every query, but at 15 joins it may have a hard time. Consider breaking the statement up into smaller chunks with fewer joins. A strategy I've used in the past to resolve things like this is to create a temporary table to store the results, then INSERT into it and UPDATE it accordingly through several different statements, with the 15 tables spread out into the appropriate insert/update spots, as sketched below.
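A rough sketch of that strategy, with entirely hypothetical names:

-- Collect the keys and the first columns from a small cluster of joins.
CREATE TABLE #results (id INT PRIMARY KEY, col1 INT, col2 INT);

INSERT INTO #results (id, col1)
SELECT a.id, a.col1
FROM TableA a
INNER JOIN TableB b ON b.a_id = a.id;

-- Fill in the remaining columns in separate passes over the other tables.
UPDATE r
SET r.col2 = c.col2
FROM #results r
INNER JOIN TableC c ON c.id = r.id;

SELECT * FROM #results;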
The order of the joins definitely matters. The question is: can you change the order by rewriting the SQL? The query optimizer doesn't necessarily care what order you write the joins in, depending on the type of join. The query optimizer does its best to find the most efficient execution plan based on the SQL you've written, but it is nowhere close to perfect. If you notice, after looking at the execution plan, that it could be done more efficiently another way, you can trick it into doing it your way and see if that helps.
One way to trick it is to use temp tables to pare down the result set before joining to large tables. Selecting fewer rows up front reduces I/O.
Another way, as demonstrated by Adam Machanic, is to use TOP in the SELECT clause together with an ORDER BY.
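The TOP trick looks roughly like this (a sketch, not Machanic's exact formulation; all names are placeholders). The intent is that TOP combined with ORDER BY encourages the optimizer to fully evaluate the derived table before joining it:

SELECT f.id, b.some_col
FROM (
    SELECT TOP (2147483647) id
    FROM FilterTable
    WHERE some_flag = 1
    ORDER BY id
) f
INNER JOIN BigTable b ON b.id = f.id;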
I use OUTER JOIN to get values stored in rows and show them as columns. When there is no value, I show NULL in column.
Source table:
Id|Name|Value
01|ABCG|,,,,,
01|ZXCB|.....
02|GHJK|;;;;;
View:
Id|ABCG|ZXCB|GHJK
01|,,,,|....|NULL
02|NULL|NULL|;;;;
The query looks like:
SELECT DISTINCT
b.Id,
bABCG.Value AS "ABCG",
bZXCB.Value AS "ZXCB",
bGHJK.Value AS "GHJK"
FROM
Bars b
LEFT JOIN Bars bABCG ON b.Id = bABCG.Id AND bABCG.Name = 'ABCG'
LEFT JOIN Bars bZXCB ON b.Id = bZXCB.Id AND bZXCB.Name = 'ZXCB'
LEFT JOIN Bars bGHJK ON b.Id = bGHJK.Id AND bGHJK.Name = 'GHJK'
I want to remove the LEFT JOINs because they're not allowed in an indexed view. I tried replacing them with an inner SELECT, but inner SELECTs aren't allowed either, and neither is UNION. I can't use INNER JOIN because I want the view to show the NULLs. What should I use?
You may be able to implement something similar using an actual table to store the results, and a set of triggers against the base tables to maintain the internal data.
I believe that, under the covers, this is what SQL Server does (in spirit, if not in actual implementation) when you create an indexed view. However, by examining the rules for indexed views, it's clear that the triggers should only use the inserted and deleted tables, and should not be required to scan the base tables to perform the maintenance - otherwise, for large tables, maintaining this indexed view would impose a serious performance penalty.
As an example of the above, whilst you can easily write a trigger for insert to maintain a MAX(column) column in the view, deletion would be more problematic - if you're deleting the current max value, you'd need to scan the table to determine the new maximum. For many of the other restrictions, try writing the triggers by hand, and most times there'll come a point where you need to scan the base table.
Now, in your particular case, I believe it could be reasonably efficient for these triggers to perform the maintenance - but you need to carefully consider all of the insert/update/delete scenarios, and make sure that your triggers actually faithfully maintain this data - e.g. if you update any ids, you may need to perform a mixture of updates, inserts and deletes.
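As a very rough sketch of the insert case only (BarsPivot is a hypothetical materialized table; this assumes at most one inserted row per Id, and updates and deletes would need their own careful handling):

CREATE TRIGGER trg_Bars_Insert ON dbo.Bars
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Update pivot rows that already exist for the inserted Ids...
    UPDATE p
    SET ABCG = CASE i.Name WHEN 'ABCG' THEN i.Value ELSE p.ABCG END,
        ZXCB = CASE i.Name WHEN 'ZXCB' THEN i.Value ELSE p.ZXCB END,
        GHJK = CASE i.Name WHEN 'GHJK' THEN i.Value ELSE p.GHJK END
    FROM dbo.BarsPivot p
    INNER JOIN inserted i ON i.Id = p.Id;

    -- ...and create pivot rows for Ids seen for the first time.
    INSERT INTO dbo.BarsPivot (Id, ABCG, ZXCB, GHJK)
    SELECT i.Id,
           MAX(CASE WHEN i.Name = 'ABCG' THEN i.Value END),
           MAX(CASE WHEN i.Name = 'ZXCB' THEN i.Value END),
           MAX(CASE WHEN i.Name = 'GHJK' THEN i.Value END)
    FROM inserted i
    WHERE NOT EXISTS (SELECT 1 FROM dbo.BarsPivot p WHERE p.Id = i.Id)
    GROUP BY i.Id;
END;

Note that it touches only the inserted table and the materialized table itself, which is the property you want to preserve.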
The best you are going to be able to do is use inner joins to get the matches, then UNION with the left joins filtered to return only the NULL rows. This probably won't solve your problem.
I don't know the specifics of your system but I am assuming that you are dealing with performance issues, which is why you want to use the indexed view. There are a few alternatives, but I think the following is the most appropriate.
Since you commented that this is for a DW, I am going to assume that your system is more read-intensive than write-intensive and that data is loaded into it on a schedule by an ETL process. In this kind of high read/low write* situation I would recommend you "materialize" this view: when the ETL process runs, generate the table with your initial SELECT statement that includes the left joins. You take the hit on the write, and then all your reads are on par with the performance of the indexed view (you would be doing the same thing the indexed view does, except in a batch instead of on a row-by-row basis). If your source DB and DW are on the same instance, this is a better choice than an indexed view because it won't affect the performance of the source system (indexed views slow down inserts). It is the same concept as the indexed view in that you take a performance hit on the insert to speed up the select.
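In ETL terms, the materialization step is just the original query written into a real table, e.g. (BarsPivot and the index name are made up; the SELECT is the one from the question):

-- Rebuild the materialized pivot during the scheduled load.
IF OBJECT_ID('dbo.BarsPivot') IS NOT NULL
    DROP TABLE dbo.BarsPivot;

SELECT DISTINCT
    b.Id,
    bABCG.Value AS ABCG,
    bZXCB.Value AS ZXCB,
    bGHJK.Value AS GHJK
INTO dbo.BarsPivot
FROM Bars b
LEFT JOIN Bars bABCG ON b.Id = bABCG.Id AND bABCG.Name = 'ABCG'
LEFT JOIN Bars bZXCB ON b.Id = bZXCB.Id AND bZXCB.Name = 'ZXCB'
LEFT JOIN Bars bGHJK ON b.Id = bGHJK.Id AND bGHJK.Name = 'GHJK';

CREATE CLUSTERED INDEX IX_BarsPivot ON dbo.BarsPivot (Id);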
I've been down this path before and come to the following conclusion:
An indexed view is more likely to be part of the solution than the entire solution.
*When I said "high read/low write" above, you can also think of it as "high read/scheduled write".
SELECT DISTINCT
b.Id,
(SELECT bABCG.Value FROM Bars bABCG WHERE bABCG.Id = b.Id AND bABCG.Name = 'ABCG') AS "ABCG",
...
FROM
Bars b
You may have to add an aggregation on the value; I'm not sure how your data is organized.
Just wondering if anyone has any tricks (or tools) they use to visualize joins. You know, you write the perfect query, hit run, and after it's been running for 20 minutes, you realize you've probably created a cartesian join.
I sometimes have difficulty visualizing what's going to happen when I add another join statement and wondered if folks have different techniques they use when trying to put together lots of joins.
Always keep the end in mind.
Ascertain which columns you need.
Figure out the minimum number of tables needed to get them.
Write your FROM clause with the table that supplies the most columns, e.g. FROM Teams T.
Add each join one by one on a new line. At each step, decide whether you need an INNER, LEFT, or RIGHT OUTER JOIN.
This usually works for me. Keep in mind that it is a Structured Query Language: always break your query into logical lines and it becomes much easier.
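Following those steps, the query grows one readable line at a time. Teams comes from the step above; the other names are made up:

SELECT T.TeamName,
       p.PlayerName,
       s.Points
FROM Teams T                                     -- the table giving the most columns
INNER JOIN Players p ON p.TeamId = T.TeamId      -- one join per line
LEFT JOIN Scores s ON s.PlayerId = p.PlayerId    -- OUTER only where it's needed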
Every join combines two resultsets into one. Each may be from a single database table or a temporary resultset which is the result of previous join(s) or of a subquery.
Always know the order that joins are processed, and, for each join, know the nature of the two temporary result sets that you are joining together. Know what logical entity each row in that resultset represents, and what attributes in that resultset uniquely identify that entity. If your join is intended to always join one row to one row, these key attributes are the ones you need to use (in join conditions) to implement the join. If your join is intended to generate some kind of cartesian product, then it is critical to understand the above to understand how the join conditions (whatever they are) will affect the cardinality of the new joined resultset.
Try to be consistent in the use of outer join directions. I try to always use Left Joins when I need an outer join, as I "think" of each join as "joining" the new table (to the right) to whatever I have already joined together (on the left) of the Left Join statement...
Run an explain plan.
These are always hierarchical trees (to do this, first I must do that). Many tools exist to render these plans as graphical trees, some built into SQL browsers (e.g., Oracle SQL Developer or SQL Server Management Studio). If you don't have a tool, most plan text output includes a "depth" column, which you can use to indent the lines.
What you want to look for is the cost of each row. (Note that for Oracle, though, higher costs can mean less time, if it allows Oracle to do a hash join rather than nested loops, and if the final result set has high cardinality (many, many rows).)
I have never found a better tool than thinking it through and using my own mind.
If the query is so complicated that you cannot do that, you may want to use CTEs, views, or other carefully organized subqueries to break it into logical pieces, so you can easily understand and visualize each piece even if you cannot manage the whole.
Also, if your concern is efficiency, SQL Server Management Studio 2005 or later lets you get estimated query execution plans without actually executing the query. This can give you a very good idea of where problems lie, if you are using MS SQL Server.
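In SSMS that's Ctrl+L ("Display Estimated Execution Plan"), or you can request it in T-SQL (the query here is just a placeholder):

SET SHOWPLAN_XML ON;
GO
SELECT a.id
FROM TableA a
INNER JOIN TableB b ON b.a_id = a.id;
GO
SET SHOWPLAN_XML OFF;
GO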
I have a query that looks like this:
SELECT *
FROM employees e
LEFT JOIN
(
SELECT *
FROM timereports
WHERE date = '2009-05-04'
) t
ON e.id = t.employee_id
As you can see, my LEFT JOIN's second table parameter is generated by a subquery.
Does the db evaluate this subquery only once, or multiple times?
thanks.
matti
This depends on the RDBMS.
In most of them, a HASH OUTER JOIN will be employed, in which case the subquery will be evaluated once.
MySQL, on the other hand, isn't capable of HASH JOINs (at least not before version 8.0.18), so it will most probably push the predicate into the subquery and issue this query:
SELECT *
FROM timereports t
WHERE t.employee_id = e.id
AND date = '2009-05-04'
in a nested loop. If you have an index on timereports (employee_id, date), this will also be efficient.
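That would be an ordinary composite index, e.g.:

CREATE INDEX ix_timereports_employee_date
    ON timereports (employee_id, date);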
If you are using SQL Server, you can take a look at the execution plan of the query. The SQL Server query optimizer will optimize the query so that it takes the least time in execution. The best plan will depend on conditions such as indexing and the like.
You have to ask the database to show its plan. The join algorithm is chosen dynamically (at query time) based on many factors. Some databases use statistics of the key distribution to decide which algorithm to use. Other databases have relatively fixed rules.
Further, each database has a menu of different algorithms. The database could use a sort-merge algorithm, or nested loops. In this case, there may be a query flattening strategy.
You need to use your database's unique "Explain Plan" feature to look at the query execution plan.
You also need to know if your database uses hints (usually comments embedded in the SQL) to pick an algorithm.
You also need to know if your database uses statistics (sometimes called a "cost-based query optimizer") to pick an algorithm.
Once you know all that, you'll know how your query is executed and whether an inner query is evaluated multiple times, flattened into the parent query, or evaluated once to create a temporary result that's used by the parent query.
What do you mean by evaluated?
The database has a couple of different options how to perform a join, the two most common ones being
Nested loops, in which case each row in one table will be looped through and the corresponding row in the other table will be looked up, and
Hash join, which means that both tables will be scanned once; a hash table is built from one of them and probed with rows from the other.
Which of those two options is chosen depends on the database, the size of the table and the available indexes (and perhaps other things as well).