Performance for big query in SQL Server view - sql

I have a big query for a view that takes a couple of hours to run, and I feel it may be possible to improve its performance "a bit".
The problem is that I am not sure what I should do. The query SELECTs 39 values and LEFT OUTER JOINs 25 tables, and each table could have up to a couple of million rows.
Any tip is welcome. Is there a good way to attack this problem? I tried to look at the actual execution plan on a test with less data (it took about 10 minutes to run), but it's crazy big. Are there any general things I could do to make this faster? Do I have to tackle one small part at a time?
Maybe there is just one join that slows everything down? How do I detect it? In short: how do I work on a query like this?
As I said, all feedback is welcome. If there is more information I should show, tell me!
The query looks something like this:
SELECT DISTINCT
A.something,
A.somethingElse,
B.something,
C.somethingElse,
ISNULL(C.somethingElseElse, ''),
C.somethingElseElseElse,
CASE WHEN *** THEN D.something ELSE 0 END,
E.something,
...
U.something
FROM
TableA A
JOIN
TableB B on ...
JOIN
TableC C on ...
JOIN
TableD D on ...
JOIN
TableE E on ...
JOIN
TableF F on ...
JOIN
TableG G on ...
...
JOIN
Table U on ...

Break your problem into manageable pieces. If the execution plan is too large for you to analyze, start with a smaller part of the query, check its execution plan and optimize it.
There is no general answer on how to optimize a query, since there is a whole bunch of possible reasons why a query can be slow. You have to check the execution plan.
Generally the most promising ways to improve performance are:
Indexing:
When you see a Clustered Index Scan or - even worse (because then you don't have a clustered index) - a Table Scan in your query plan for a table that you join, you need an index for your JOIN predicate. This is especially true if you have tables with millions of entries and you select only a small subset of those entries. Also check the index suggestions in the execution plan.
You see that the index works when your Clustered Index Scan turns into an Index Seek.
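For example, a minimal sketch (the table and column names are hypothetical, since the actual JOIN predicates are elided above): if TableB is joined to TableA on a column AId, an index on that column gives the optimizer something to seek on:
CREATE NONCLUSTERED INDEX IX_TableB_AId
    ON TableB (AId);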
Index includes:
You are probably displaying columns from your joined tables that are different from the fields you use to join (otherwise, why would you need the join at all?). SQL Server needs to get those fields from the table, which shows up in the execution plan as a Key Lookup.
Since you are taking 39 values from 25 tables, you will need very few fields per table (mostly one or two), yet SQL Server has to load entire pages of the respective table to get the values from them.
In this case, you should INCLUDE the column(s) you want to display in your index to avoid the key lookups. This comes at the cost of increased index size, but considering you only include a few columns, that cost should be negligible compared to the size of your tables.
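Building on the hypothetical index above (the included column name is equally made up), the INCLUDE variant would look like this:
CREATE NONCLUSTERED INDEX IX_TableB_AId
    ON TableB (AId)
    INCLUDE (something);   -- the displayed column, so the Key Lookup disappears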
Checking views that you join:
When you join VIEWs, be aware that a view is basically an extension of your query (and therefore of the execution plan). Do the same performance optimizations for the view as you do for your main query. Also check whether the view joins tables that you already join in the main query; those joins might be unnecessary.
Indexed views (maybe):
In general, you can add indexes to views you are joining to your query or create one or more indexed views for parts of your query. There are some caveats though:
Indexed views take storage space in your DB, because you store parts of the data multiple times.
There are a lot of restrictions to indexed views, most notably in your case that OUTER JOINs are forbidden. If you can transform at least some of your OUTER JOINs to INNER JOINs this might be an option.
When you join indexed views, don't forget to use WITH (NOEXPAND) in your join, otherwise they might be ignored.
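A minimal sketch of what that can look like (all names are hypothetical; indexed views require SCHEMABINDING, two-part table names, and a unique clustered index first):
CREATE VIEW dbo.vwAB
WITH SCHEMABINDING
AS
SELECT A.Id, A.something, B.something AS somethingB
FROM dbo.TableA A
INNER JOIN dbo.TableB B ON B.AId = A.Id;   -- INNER JOINs only; OUTER JOINs are not allowed
GO
CREATE UNIQUE CLUSTERED INDEX IX_vwAB ON dbo.vwAB (Id);   -- requires the view rows to be unique on Id
GO
SELECT C.somethingElse, v.somethingB
FROM TableC C
JOIN dbo.vwAB v WITH (NOEXPAND) ON v.Id = C.AId;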
Partitioned tables (maybe):
If you are running on the Enterprise Edition of SQL Server, you can partition your tables. That can be useful if the rows you join are always selected from a small subset of the available rows. You can make a partition for this subset and increase performance.
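A hedged sketch of table partitioning (the column name and date boundaries are made up; the real boundaries depend on how your rows are actually filtered):
CREATE PARTITION FUNCTION pfOrderYear (date)
    AS RANGE RIGHT FOR VALUES ('2015-01-01', '2016-01-01', '2017-01-01');
CREATE PARTITION SCHEME psOrderYear
    AS PARTITION pfOrderYear ALL TO ([PRIMARY]);
CREATE CLUSTERED INDEX CIX_TableD_OrderDate
    ON TableD (OrderDate) ON psOrderYear (OrderDate);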
Summary:
Divide and conquer. Analyze your query bit by bit to optimize it. The most promising options are indexes and index includes; start there, and if you still have trouble, look into indexed views and partitioning.

Related

Self-Joins: is there a way to improve the performance of this query?

The purpose of all this is to create a lookup table to avoid a self join down the road, which would involve joins for the same data against much bigger data sets.
In this instance a sales order may have one or both of bill to and ship to customer ID.
The tables here are aggregates of data from 5 different servers, differentiated by the box_id. The customer table is ~1.7M rows, and sales_order is ~55M. The end result is ~52M records and takes on average about 80 minutes to run.
The query:
SELECT DISTINCT sog.box_id ,
sog.sales_order_id ,
cb.cust_id AS bill_to_customer_id ,
cb.customer_name AS bill_to_customer_name ,
cs.cust_id AS ship_to_customer_id ,
cs.customer_name AS ship_to_customer_name
FROM sales_order sog
LEFT JOIN customer cb ON cb.cust_id = sog.bill_to_id AND cb.box_id = sog.box_id
LEFT JOIN customer cs ON cs.cust_id = sog.ship_to_id AND cs.box_id = sog.box_id
The execution plan:
https://www.brentozar.com/pastetheplan/?id=SkjhXspEs
All of this is happening on SQL Server.
I've tried reproducing the bill to and ship to customer sets as CTEs and joining to those, but found no performance benefit.
The only indexes on these tables are the primary keys (which are synthetic IDs). Somewhat curiously the execution plan analyzer is not recommending adding any indexes to either table; it usually wants me to slap indexes on almost everything.
I don't know that there necessarily IS a way to make this run faster, but I am trying to improve my query optimization and have hit the limit of my knowledge. Any insight is much appreciated.
When you run queries like yours -- queries with no WHERE filters -- the DBMS often decides it has to scan entire tables. (In SQL Server execution plans, "clustered index scan" means it is scanning the whole table.) It certainly has to wrangle all the data in the tables. The lookup table you want to create is often called a "materialized view." (An online version of SQL Server has built-in support for materialized views, but other versions still don't.)
Depending on how you will use your data, you may be better off avoiding this materialized lookup table. If all your uses of your proposed lookup table involve filtering out a small subset of rows using WHERE clauses, an ordinary non-materialized view may be a good choice. When you run queries involving ordinary views, the query planner folds those views into the query, and may recommend helpful indexes.
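A minimal sketch of that alternative, reusing the columns from the query above (the view name is made up):
CREATE VIEW dbo.vw_sales_order_customers AS
SELECT DISTINCT sog.box_id,
       sog.sales_order_id,
       cb.cust_id       AS bill_to_customer_id,
       cb.customer_name AS bill_to_customer_name,
       cs.cust_id       AS ship_to_customer_id,
       cs.customer_name AS ship_to_customer_name
FROM sales_order sog
LEFT JOIN customer cb ON cb.cust_id = sog.bill_to_id AND cb.box_id = sog.box_id
LEFT JOIN customer cs ON cs.cust_id = sog.ship_to_id AND cs.box_id = sog.box_id;

-- Downstream queries filter through the view; the optimizer folds the view into the query.
SELECT *
FROM dbo.vw_sales_order_customers
WHERE box_id = 3 AND sales_order_id = 12345;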

Does my previous SQL query/ies affect my current query?

I have multiple SQL queries that I run one after the other to get a set of data. In each query, there are a bunch of joined tables that are exactly the same as in the other queries. For example:
Query1
SELECT * FROM
Product1TableA A1
INNER JOIN Product1TableB B on A1.BId = B.Id
INNER JOIN CommonTable1 C on C.Id = B.CId
INNER JOIN CommonTable2 D on D.Id = B.DId
...
Query2
SELECT * FROM Product2TableA A2
INNER JOIN Product2TableB B on A2.BId = B.Id
INNER JOIN CommonTable1 C on C.Id = B.CId
INNER JOIN CommonTable2 D on D.Id = B.DId
...
I am playing around with re-ordering the joins (around two dozen tables joined per query) and I read here that the order should not really affect query execution unless SQL Server "gives up" during optimization because of how big the query is...
What I am wondering is if bunching up common table joins at the start of all my queries actually helps...
In theory, the order of the joins in the FROM clause doesn't make a difference to query performance. For a small number of tables there should be no difference; the optimizer should find the best execution path.
For a larger number of tables, the optimizer may have to short-circuit its search regarding join order. It would then be using heuristics -- and these could be affected by join order.
Earlier queries would have no effect on a particular execution plan.
If you are having problems with performance, I am guessing that join order is not the root cause. The most common problem that I see in SQL Server is inappropriate nested-loop joins -- and these can be handled with an optimizer hint.
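For example, a hedged sketch of such a hint, using the tables from the question (this forces hash joins for the whole query, so treat it as a diagnostic tool rather than a permanent fix):
SELECT *
FROM Product1TableA A1
INNER JOIN Product1TableB B ON A1.BId = B.Id
INNER JOIN CommonTable1 C ON C.Id = B.CId
INNER JOIN CommonTable2 D ON D.Id = B.DId
OPTION (HASH JOIN);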
I think I understood what the OP was trying to say/do:
What I am wondering is if bunching up common table joins at the start
of all my queries actually helps...
Imagine that you have some queries and every query has more than 3 inner joins. The queries are different but always have (for example) 3 tables in common that are joined on the same fields. Now the question is:
what will happen if every query will start with these 3 tables in join, and all the other tables are joined after?
The answer is that it will change nothing, i.e. the optimizer will rearrange the tables in whatever way it thinks will lead to the optimal execution.
Things may change if, for example, you save the result of these 3 joins into a temporary table and then use this saved result to join with other tables. But this depends on the filters that your queries use. If you have appropriate indexes and your query filters are selective enough (so that your query returns very few rows), there is no need to cache an unfiltered intermediate result that has too many rows, because the optimizer can choose to first filter every table and only then join them.
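A minimal sketch of that idea (the join columns between the common tables are hypothetical, since they are not shown in the question):
SELECT C.Id AS CId, D.Id AS DId, C.SomeColumn, D.OtherColumn
INTO #Common
FROM CommonTable1 C
INNER JOIN CommonTable2 D ON D.SomeKey = C.Id;

SELECT *
FROM Product1TableA A1
INNER JOIN Product1TableB B ON A1.BId = B.Id
INNER JOIN #Common cm ON cm.CId = B.CId AND cm.DId = B.DId;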
Gordon's answer is a good explanation, but this answer explains the JOIN's behavior and also specifies that SQL Server's version is relevant:
Although the join order is changed in optimisation, the optimiser
doesn't try all possible join orders. It stops when it finds what it
considers a workable solution as the very act of optimisation uses
precious resources.
While the optimizer tries its best in choosing a good order for the JOINs, having many JOINs creates a bigger chance of obtaining a not so good plan.
Personally, I have seen many JOINs in some views within an ERP and they usually ran ok. However, from time to time (based on the client's data volume, instance configuration, etc.), some selects from these views took much longer than expected.
If this data reaches an actual application (.NET, Java, etc.), one option is to cache the information from all the small tables, store it as dictionaries (hashes), and perform O(1) lookups based on the keys.
This provides the advantages of reducing the JOIN count and not performing reads from the database for these tables (except once, when caching the data). However, it increases the complexity of the application (cache management).
Another solution is to use temporary tables and populate them in multiple queries to avoid many JOINs per single query. This usually performs better and also improves debuggability (if the query does not return the correct data, or no data at all, which of the 10-15 JOINs is the problem?).
So, my answer to your question is: you might get some benefit from reordering the JOIN clauses, but I recommend avoiding lots of JOINs in the first place.

What is the better way than joining 3 tables?

I have to join 3 tables somewhere in my project.
Here is example tables and columns:
Table-1 : posts
Columns1: id,owner,title,post,keywords
Table-2 : sites
Columns2: id,name,url
Table-3 : views
Columns3: id,post,view
When I join all these tables, I end up with quite a big query:
SELECT title,post,keywords,name,url,view
FROM posts
LEFT JOIN sites ON sites.id=posts.owner
LEFT JOIN views ON views.post = posts.id
WHERE posts.date BETWEEN '2010-10-10 00:00:00' AND '2010-11-11 00:00:00'
ORDER BY views.view DESC
LIMIT 0,10
Is this the only way, or could I do something else to get better performance?
This is my current query's EXPLAIN; the query above is just an example.
That's not a huge query by any stretch of the imagination.
How slow is it really?
Maybe, if views doesn't contain any more information than what you've shown, you should just make the view count a field of posts. There is no need for it to be its own separate table unless you're actually storing some information about the views themselves, like user-agent strings or time.
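A hedged sketch of that approach (the column name and post id are made up; MySQL-style syntax to match the query above):
ALTER TABLE posts ADD COLUMN view_count INT NOT NULL DEFAULT 0;

-- increment once per page view instead of inserting a row into views
UPDATE posts SET view_count = view_count + 1 WHERE id = 123;

-- the original query then needs one join less
SELECT title, post, keywords, name, url, view_count
FROM posts
LEFT JOIN sites ON sites.id = posts.owner
WHERE posts.date BETWEEN '2010-10-10 00:00:00' AND '2010-11-11 00:00:00'
ORDER BY view_count DESC
LIMIT 0,10;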
That's not a particularly "huge" query. Have you run the query analyzer, checked where the slow point is, and then checked your indexes?
Re: Analyzer - Microsoft keeps moving it, but in 2008 Management Studio there are several options for showing the execution plan. Once you see the execution plan you can see where the problems are. Look for a single action taking 80+% of your time and focus on that. Things like Table Scans are an indication that you could speed it up by tweaking indexes. (There are downsides to indexes as well, but worry about that later.)
If your relations are guaranteed, in other words non-nullable foreign keys, then the query will perform better if it uses inner joins instead of left joins. Although this query does not appear to be large or complex enough to seriously be suffering from performance issues.
If your POSTS table is particularly large (>100K rows?) then one thing you could do is to load the time-filtered posts into a temporary table and join on that temp table.
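A minimal sketch of that idea (MySQL-style syntax; the temp table name is made up):
CREATE TEMPORARY TABLE tmp_recent_posts AS
SELECT id, owner, title, post, keywords
FROM posts
WHERE date BETWEEN '2010-10-10 00:00:00' AND '2010-11-11 00:00:00';

SELECT p.title, p.post, p.keywords, s.name, s.url, v.view
FROM tmp_recent_posts p
LEFT JOIN sites s ON s.id = p.owner
LEFT JOIN views v ON v.post = p.id
ORDER BY v.view DESC
LIMIT 0,10;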

Estimated Subtree Cost Wildly Off, Terrible Optimization

I am joining a table that has two record id fields (record1, record2) to a view twice -- once on each record -- and selecting the top 1000. The view consists of several rather large tables, and its id field is a string concatenation of their respective Ids (this was necessary for some third-party software that requires a unique ID for the view; row numbering was abysmally slow). There is also a where clause in the view calling a function that compares dates.
The estimated execution plan produces a "No Join Predicate" warning unless I use OPTION (FORCE ORDER). With forced ordering, the execution plan has multiple nodes displaying 100% cost. In both cases, the estimated subtree cost at the endpoint is thirteen orders of magnitude smaller than just one of its nodes (it's doing a lot of nested loop joins, with CPU costs as high as 35927400000000).
What is going on here with the numbers in the execution plan? And why is SQL Server having such a hard time optimizing the query?
Simply adding an index to the view on the concatenated string and using the NOEXPAND table hint fixed the problem entirely. It ran in all of 12 seconds. But why did SQL Server stumble so badly (even requiring the NOEXPAND hint after I added the index)?
Running SQL Server 2008 SP1 with CU 8.
The View:
SELECT
dbo.fnGetCombinedTwoPartKey(N.NameID,A.AddressID) AS NameAddressKey,
[other fields]
FROM
[7 joined tables]
WHERE dbo.fnDatesAreOverlapping(N.dtmValidStartDate,N.dtmValidEndDate,A.dtmValidStartDate,A.dtmValidEndDate) = 1
The Query
SELECT TOP 1000
vw1.strFullName,
vw1.strAddress1,
vw1.strCity,
vw2.strFullName,
vw2.strAddress1,
vw2.strCity
FROM tblMatches M
JOIN vwImportNameAddress vw1 ON vw1.NameAddressKey = M.Record1
JOIN vwImportNameAddress vw2 ON vw2.DetailAddressKey = M.Record2
Looks like you're already pretty close to the explanation. It's because of this:
The view consists of several rather large tables, and its id field is a string concatenation of their respective Ids...
This creates a non-sargable join predicate, and prevents SQL Server from using any of the indexes on the base tables. Thus, the engine has to perform a full scan of all the underlying tables for each join (two in your case).
Perhaps in order to avoid doing several full table scans (one for each table, multiplied by the number of joins), SQL Server has decided that it will be faster to simply use the cartesian product and filter afterward (hence the "no join predicate" warning). When you FORCE ORDER, it dutifully performs all of the full scans and nested loops that you originally asked it for.
I do agree with some of the comments that this view sits on a problematic data model, but the short-term workaround, as you've discovered, is to index the computed ID column in the view, which makes the join sargable again because the concatenated ID values are now materialized and indexed.
Edit: I also missed this on the first read-through:
WHERE dbo.fnDatesAreOverlapping(N.dtmValidStartDate,N.dtmValidEndDate,A.dtmValidStartDate,A.dtmValidEndDate) = 1
This, again, is a non-sargable predicate which will lead to poor performance. Wrapping any columns in a UDF will cause this behaviour. Indexing the view also materializes it, which may also factor into the speed of the query; without the index, this predicate has to be evaluated every time and forces a full scan on the base tables, even without the composite ID.
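For illustration, a hedged sketch of how the overlap test could be inlined in the view definition so the optimizer can see the raw columns (this assumes fnDatesAreOverlapping implements a standard inclusive-range overlap check, which is not shown in the question):
WHERE N.dtmValidStartDate <= A.dtmValidEndDate
  AND A.dtmValidStartDate <= N.dtmValidEndDate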
SQL Server would have to parse your function (fnGetCombinedTwoPartKey) to determine which columns are fetched to create the result column. It can't, so it's going to assume all columns are necessary. If your indexes are covering indexes, then your estimate is going to be wrong.

Avoiding full table scan

I have the following query, which obtains transactions from the transactions table and the transaction detail table. Both tables have a large number of entries, so this query takes a while to return results.
SELECT * FROM transactions t LEFT JOIN transac_detail tidts ON (tidts.id_transac = t.id);
However, I'm more worried by the fact that Oracle does a full table scan on both tables according to the explain plan, even though t.id and tidts.id_transac have indexes.
Is there any way to optimize this without touching table structure?
I don't think it's true that the SQL would necessarily be best served by full table scans. This would be most obviously true where there is a foreign key between the two, but even then there could be exceptions.
I think the key question is this: "what proportion of the rows from each table are expected to be included in the result set?". If the answer is "100% from each" then you have a clear case for full table scans (and a hash join).
However, consider the case where tables A and B are joined, with table A containing 5 rows and a foreign key to table B (the parent) containing a million rows. Obviously here you'd look for a full scan of A with a nested loop join to B (the join column on table B would be indexed because it would have to be a primary or unique key).
In the OP's case, though, it looks like you'd expect 100% of the rows to be returned from each table. I'd expect to see a full scan of both tables and a hash join, with TRANSACTIONS (probably the smaller table) being accessed first and built into the hash table. This would be the optimum join method, and I'd just be looking out for a situation where TRANSACTIONS is too big for a single-pass hash join. If the join spills to disk then that could be a performance worry, and you'd have to look at increasing the memory allocation or equi-partitioning the two tables to reduce the memory requirement.
Since the query as given just returns everything anyway, a full table scan may actually be the fastest way of arriving at the final result. Since I/O is so expensive compared to CPU time, it may be more efficient to just pull everything into memory and make the final join there than to do an index seek in a loop over one table.
To determine whether the query could actually run any faster, you can try the following approaches:
look at query plan on only a subset of the data (for example, a range of id's)
try the query on subsets of varying sizes, see what kind of curve you're plotting here
You don't have a WHERE clause, therefore Oracle is thinking that, since it has to return all records from both tables, a full table scan will be most efficient.
If you add a WHERE clause that uses the index, I think you will find that EXPLAIN PLAN no longer shows a full table scan.
Full table scans are not necessarily bad - it appears that you're not limiting the result set in any way and this might be the most efficient way to execute the query. You can always verify this by using an index hint and determining the resulting performance change.
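For example, a hedged sketch of such an Oracle index hint (the index names are made up; they would be the actual index names on the join columns):
SELECT /*+ INDEX(t transactions_pk) INDEX(tidts ix_transac_detail_id_transac) */ *
FROM transactions t
LEFT JOIN transac_detail tidts ON tidts.id_transac = t.id;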