SQL Optimization for DB2

If you have a situation where you are doing a UNION ALL on two result sets, and each result set is derived from an inner join with the same filtered subset of a master table, does the query engine "hit" the master table once, or twice?
example:
SELECT m.col4, st1.col2
FROM master m
INNER JOIN subTable1 st1
on st1.col1 = m.col1
WHERE m.col1 = 'a' and m.col2 = 123 and m.col3 = 'a1b2'
UNION ALL
SELECT m.col4, st2.col2
FROM master m
INNER JOIN subTable2 st2
on st2.col1 = m.col1
WHERE m.col1 = 'a' and m.col2 = 123 and m.col3 = 'a1b2'
I am trying to determine if it would be beneficial to create a temp table to hold the filtered results of the master table, so the UNION ALL would work with a small subset of the master records instead of filtering the master table twice, as it might be doing in the example above.
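For reference, a sketch of the temp-table variant I have in mind, as a DB2 declared global temporary table (the column types are made up for illustration, and a user temporary tablespace must exist):
DECLARE GLOBAL TEMPORARY TABLE session.small_master (
    col1 VARCHAR(10),
    col4 VARCHAR(10)
) ON COMMIT PRESERVE ROWS NOT LOGGED;
-- filter master once, then reuse the small result set twice
INSERT INTO session.small_master (col1, col4)
SELECT m.col1, m.col4
FROM master m
WHERE m.col1 = 'a' AND m.col2 = 123 AND m.col3 = 'a1b2';
Both halves of the UNION ALL would then join to session.small_master instead of master.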
thank you, in advance, for whatever advice you can give.

Maybe a common table expression helps:
with small_master as (
select m.col4,
m.col1
from master m
where m.col1 = 'a'
and m.col2 = 123
and m.col3 = 'a1b2'
)
SELECT m.col4, st1.col2
FROM small_master m
INNER JOIN subTable1 st1
on st1.col1 = m.col1
UNION ALL
SELECT m.col4, st2.col2
FROM small_master m
INNER JOIN subTable2 st2
on st2.col1 = m.col1;
In my experience (not with DB2 though) this helps if the CTE is reducing the number of rows drastically (say from "millions" to "thousands").
If the intermediate result of the CTE is (still) quite large (several millions) then this will probably not help.
But only the execution plan can shed light on this.

The easiest way to answer this kind of "what if" question is to look at the query plan. You can easily generate one with the command db2expln -d <your db> -f <your query file> -z <your query delimiter> -gi
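For example, if the UNION ALL statement above is saved in query.sql and terminated with a semicolon, the invocation could look like this (the database name mydb is just a placeholder):
db2expln -d mydb -f query.sql -z ';' -gi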
Generally speaking, if a task can be done with a single SQL statement that will be the fastest way to accomplish the task, so it is unlikely that creating a temporary table will benefit performance.

This depends a lot on the database and the statistics of the tables involved. I am not intimately familiar with DB2.
However, if the issue is performance, then consider putting an index on master(col1, col2, col3). This would speed up both parts of the query.
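A sketch of that index (the index name is illustrative):
CREATE INDEX master_filter_ix ON master (col1, col2, col3);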
The use of a CTE as a temp table is highly database-specific. Postgres always materializes CTEs, so the code is only run once; SQL Server never does. I do not know DB2's behavior in this regard. However, I would prefer to add indexes to explicitly improve performance rather than fiddling with the query: a rewritten query may produce unexpected query plans when table statistics change, new software is released, or hardware is upgraded.

Related

Tuning Oracle Query for slow select

I'm working on an Oracle query that does a select on a huge table; the joins with the other tables seem to be costing a lot of processing time.
I'm looking for tips on how to improve the working of this query.
I'm attaching a version of the query and the explain plan of it.
Query
SELECT
l.gl_date,
l.REST_OF_TABLES,
(
SELECT
MAX(tt.task_id)
FROM
bbb.jeg_pa_tasks tt
WHERE
l.project_id = tt.project_id
AND l.task_number = tt.task_number
) task_id
FROM
aaa.jeg_labor_history l,
bbb.jeg_pa_projects_all p
WHERE
p.org_id = 2165
AND l.project_id = p.project_id
AND p.project_status_code = '1000'
Something to mention:
This query takes data from Oracle to send it to a SQL Server database, so I need it to be this big; I can't narrow the scope of the query.
The purpose is to set it up as a SQL Server job with SSIS so it runs periodically.
One obvious suggestion is not to use a subquery in the SELECT clause.
Instead, you can try joining the tables:
SELECT
l.gl_date,
l.REST_OF_TABLES,
t.task_id
FROM
aaa.jeg_labor_history l
Join bbb.jeg_pa_projects_all p
On (l.project_id = p.project_id)
Left join (SELECT
tt.project_id,
tt.task_number,
MAX(tt.task_id) task_id
FROM
bbb.jeg_pa_tasks tt
Group by tt.project_id, tt.task_number) t
On (l.project_id = t.project_id
AND l.task_number = t.task_number)
WHERE
p.org_id = 2165
AND p.project_status_code = '1000';
Cheers!!
I don't know exactly how many rows this query returns or how many rows this table/view has, so I can only provide a few simple tips which might help you get better query performance:
Check indexes. There should be indexes on all fields used in the WHERE and JOIN portions of the SQL statement (see the sketch after this list).
Limit the size of your working data set.
Only select columns you need.
Remove unnecessary tables.
Remove calculated columns in JOIN and WHERE clauses.
Use inner join, instead of outer join if possible.
Your view contains a lot of data, so you can also break it down and pull only the information you need from it.
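As a sketch of the indexing tip, using names from the query above (the index name and column order are only a guess at what would support the join and the MAX() lookup):
CREATE INDEX jeg_pa_tasks_ix1
    ON bbb.jeg_pa_tasks (project_id, task_number, task_id);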

Multiple Joins in Teradata SQL - Faster to Use Subqueries or Temp Tables?

I am writing SQL for Teradata. I need to use joins to connect data from multiple tables. Is it typically faster to use subqueries or create temporary tables and append columns one join at a time? I'm trying to test it myself but network traffic makes it hard for me to tell which is faster.
Example A:
SELECT a.ID, a.Date, b.Gender, c.Age
FROM mainTable AS a
LEFT JOIN (subquery 1) AS b ON b.ID = a.ID
LEFT JOIN (subquery 2) AS c ON c.ID = a.ID
Or I could...
Example B:
CREATE TABLE a AS (
SELECT mainTable.ID, mainTable.Date, sq.Gender
FROM mainTable
LEFT JOIN (subquery 1) AS sq ON sq.id = mainTable.ID
) WITH DATA;
CREATE TABLE b AS (
SELECT a.ID, a.Date, a.Gender, sq.Age
FROM a
LEFT JOIN (subquery 2) AS sq ON sq.id = a.ID
) WITH DATA;
Assuming I clean everything up afterward, is one approach preferable to another? Again, I would like to just test this myself but the network traffic is kind of messing me up.
EDIT: The main table has anywhere from 100k to 5 million rows. The subqueries return a 1:1 relationship to the main table's IDs, but require WHERE clauses to filter dates. The subquery SQL isn't trivial, I guess is what I'm trying to convey.
Of course it's recommended to write joins; that's why there's an optimizer :-)
If you create temporary tables you force a specific order of processing instead of letting the optimizer decide which is the best plan.
Creating temporary tables might be useful in some rare cases: when you have a really complex query with dozens of joins and need to break it into more easily maintainable parts, or when you would like to get a specific PI for further processing.
Regarding testing different approaches:
Runtime should never be used for that; it can vary greatly based on the load on the server. You need to access Teradata's Query Log (DBQL: dbc.QryLogV, etc.) to get details about actual CPU/IO/spool usage. If you don't have access to it you might ask your DBA to grant it to you.
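A minimal sketch of pulling those metrics from DBQL, assuming you have SELECT access to dbc.QryLogV (column names can vary slightly between Teradata releases):
SELECT QueryID, StartTime, AMPCPUTime, TotalIOCount, SpoolUsage, QueryText
FROM dbc.QryLogV
WHERE UserName = USER
ORDER BY StartTime DESC;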
Btw, instead of real tables you should create VOLATILE TABLES, which are automatically dropped when you log off.
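For example, the first step of Example B rewritten as a volatile table (a sketch; the primary index choice is just an assumption):
CREATE VOLATILE TABLE a AS (
    SELECT mainTable.ID, mainTable.Date, sq.Gender
    FROM mainTable
    LEFT JOIN (subquery 1) AS sq ON sq.ID = mainTable.ID
) WITH DATA
PRIMARY INDEX (ID)
ON COMMIT PRESERVE ROWS;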

Can this SQL Query be optimized to run faster?

I have an SQL Query (For SQL Server 2008 R2) that takes a very long time to complete. I was wondering if there was a better way of doing it?
SELECT @count = COUNT(Name)
FROM Table1 t
WHERE t.Name = @name AND t.Code NOT IN (SELECT Code FROM ExcludedCodes)
Table1 has around 90Million rows in it and is indexed by Name and Code.
ExcludedCodes only has around 30 rows in it.
This query is in a stored procedure and gets called around 40k times; the total time it takes the procedure to finish is 27 minutes. I believe this is my biggest bottleneck because of the massive number of rows it queries against and the number of times it does it.
So if you know of a good way to optimize this it would be greatly appreciated! If it cannot be optimized then I guess I'm stuck with 27 min...
EDIT
I changed the NOT IN to NOT EXISTS and it cut the time down to 10:59, so that alone is a massive gain on my part. I am still going to attempt the GROUP BY approach suggested below, but that will require a complete rewrite of the stored procedure and might take some time... (as I said before, I'm not the best at SQL but it is starting to grow on me. ^^)
In addition to workarounds to get the query itself to respond faster, have you considered maintaining a column in the table that tells whether it is in this set or not? It requires a lot of maintenance but if the ExcludedCodes table does not change often, it might be better to do that maintenance. For example you could add a BIT column:
ALTER TABLE dbo.Table1 ADD IsExcluded BIT;
Make it NOT NULL and default to 0. Then you could create a filtered index:
CREATE INDEX n ON dbo.Table1(name)
WHERE IsExcluded = 0;
Now you just have to update the table once:
UPDATE t
SET IsExcluded = 1
FROM dbo.Table1 AS t
INNER JOIN dbo.ExcludedCodes AS x
ON t.Code = x.Code;
And ongoing you'd have to maintain this with triggers on both tables. With this in place, your query becomes:
SELECT @Count = COUNT(Name)
FROM dbo.Table1 WHERE IsExcluded = 0;
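One of those maintenance triggers might look like this sketch (it covers only inserts into ExcludedCodes; deletes, and changes to Table1, would need similar handling):
CREATE TRIGGER dbo.trExcludedCodes_AfterInsert
ON dbo.ExcludedCodes
AFTER INSERT
AS
BEGIN
    -- flag any rows in Table1 whose code was just added to the exclusion list
    UPDATE t
    SET IsExcluded = 1
    FROM dbo.Table1 AS t
    INNER JOIN inserted AS i
    ON t.Code = i.Code;
END;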
EDIT
I'm not sure why this query wouldn't do what you're after, and be far more efficient than your 40K loop:
SELECT src.Name, COUNT(*)
FROM dbo.Table1 AS src
INNER JOIN #temptable AS t
ON src.Name = t.Name
WHERE src.Code NOT IN (SELECT Code FROM dbo.ExcludedCodes)
GROUP BY src.Name;
Or the LEFT JOIN equivalent:
SELECT src.Name, COUNT(*)
FROM dbo.Table1 AS src
INNER JOIN #temptable AS t
ON src.Name = t.Name
LEFT OUTER JOIN dbo.ExcludedCodes AS x
ON src.Code = x.Code
WHERE x.Code IS NULL
GROUP BY src.Name;
I would put money on either of those queries taking less than 27 minutes. I would even suggest that running both queries sequentially will be far faster than your one query that takes 27 minutes.
Finally, you might consider an indexed view. I don't know your table structure or whether you violate any of the restrictions, but it is worth investigating IMHO.
You say this gets called around 40K times. Why? Is it in a cursor? If so, do you really need a cursor? Couldn't you put the values you want for @name in a temp table, index it, and then join to it?
select t.name, count(t.name)
from table t
join #name n on t.name = n.name
where NOT EXISTS (SELECT Code FROM ExcludedCodes WHERE Code = t.code)
group by t.name
That might get you all your results in one query, and it is almost certainly faster than 40K separate queries. Of course if you need the count of all the names, it's even simpler:
select t.name, count(t.name)
from table t
where NOT EXISTS (SELECT Code FROM ExcludedCodes WHERE Code = t.code)
group by t.name
NOT EXISTS typically performs better than NOT IN, but you should test it on your system.
SELECT @count = COUNT(Name)
FROM Table1 t
WHERE t.Name = @name AND NOT EXISTS (SELECT 1 FROM ExcludedCodes e WHERE e.Code = t.Code)
Without knowing more about your query it's tough to supply concrete optimization suggestions (i.e. code suitable for copy/paste). Does it really need to run 40,000 times? Sounds like your stored procedure needs reworking, if that's feasible. You could exec the above once at the start of the proc and insert the results in a temp table, which can keep the indexes from Table1, and then join on that instead of running this query.
This particular bit might not even be the bottleneck that makes your query run 27 minutes. For example, are you using a cursor over those 90 million rows, or scalar valued UDFs in your WHERE clauses?
Have you thought about doing the query once and populating the data in a table variable or temp table? Something like
insert into #temp (name, NameCount)
select name, count(name)
from table1
where code not in (select code from excludedcodes)
group by name
And don't forget that you could possibly use a filtered index as long as the excluded codes table is somewhat static.
Start evaluating the execution plan. Which is the heaviest part to compute?
Regarding the relation between the two tables, use a JOIN on indexed columns: indexes will optimize query execution.

Why is this SQL statement hanging when group by, sum, or where clause is included?

I have a SQL statement:
select
t3.item1,
t3.item2,
sum(t1.moneys)
from
table t1
inner join table t2 on t1.key = t2.key
inner join table t3 on t1.key2 = t3.key2
where
t2.type = 'thistype'
and t3.type2 = 'thistype'
group by
t3.item1, t3.item2
If I remove the group by, sum, or where clause it runs fine, but if I add back any of those it hangs forever... any ideas? This is on SQL Server Management Studio 2008 R2.
Thanks
Further Testing
so I created a view:
select
t3.item1,
t3.item2,
t1.moneys,
t2.type,
t3.type2
from
table t1
inner join table t2 on t1.key = t2.key
inner join table t3 on t1.key2 = t3.key2
and I can select the top 1000 from the view fine and see the type I want to look at in the data, but when I add WHERE type2 = 'thistype' it hangs again...
You're joining three tables together with millions of records; it is normal for this to take a while to run.
To answer your question about statistics: they are what the optimizer uses to decide how to retrieve records from your tables efficiently, including whether to use your indexes. Without accurate or up-to-date statistics, indexes can actually slow your queries down.
http://blogs.technet.com/b/rob/archive/2008/05/16/sql-server-statistics.aspx
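If stale statistics are suspected, refreshing them is a cheap first test (the table names here are placeholders for the real ones):
UPDATE STATISTICS dbo.t1;
UPDATE STATISTICS dbo.t2;
UPDATE STATISTICS dbo.t3;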
I think we'd need to see some table structure and know some more things about your DB before we can give a solid answer. First thing, though, is to run a trace on it and see what that tells you.
At first blush, I have found that issues with aggregate functions (sum, group by, etc) tend to stem from a) overly large data sets (that is: you're just trying to pull back too much data) or b) from overly-complicated structure or relationships on the joined tables.
However, those are just my general rules-of-thumb, and may not apply in a specific situation: run a trace and any other form of profiling you can and see what that tells you.
Have you looked at the execution plan you're getting? That will tell you where the problem is. Do you have covering indices on the columns on which you're joining and grouping?
Is it possible that the execution plan is corrupted?
http://msdn.microsoft.com/en-us/library/aa175244(v=sql.80).aspx
Try recompiling the plan using sp_recompile
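For example (the object name is a placeholder):
EXEC sp_recompile N'dbo.t1';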

Is derived table executed once or three times?

Every time you make use of a derived table, that query is going to be executed. When using a CTE, that result set is pulled back once and only once within a single query.
Does the quote suggest that the following query will cause the derived table to be executed three times (once for each aggregate function call)?
SELECT
AVG(OrdersPlaced),MAX(OrdersPlaced),MIN(OrdersPlaced)
FROM (
SELECT
v.VendorID,
v.[Name] AS VendorName,
COUNT(*) AS OrdersPlaced
FROM Purchasing.PurchaseOrderHeader AS poh
INNER JOIN Purchasing.Vendor AS v ON poh.VendorID = v.VendorID
GROUP BY v.VendorID, v.[Name]
) AS x
thanx
No, that should be one pass; take a look at the execution plan.
Here is an example where something will run for every row in table2:
select *,(select COUNT(*) from table1 t1 where t1.id <= t2.id) as Bla
from table2 t2
Stuff like this with running counts will fire for each row in table2.
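For comparison, when the running count is over the table itself, engines that support running window aggregates (e.g., SQL Server 2012+) can compute it in a single pass; a sketch:
SELECT t2.*,
       COUNT(*) OVER (ORDER BY t2.id
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Bla
FROM table2 t2;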
A CTE or a nested (uncorrelated) subquery will generally produce the same execution plan. Whether a CTE or a subquery is used has never had an effect on whether my intermediate queries were spooled.
With regard to the Tony Rogerson link: the explicit temp table performs better than the self-join to the CTE because it's indexed better. Many times, when you go beyond declarative SQL and start to anticipate the work the engine will do, you can get better results.
Sometimes, the benefit of a simpler and more maintainable query with many layered CTEs outweighs the performance benefit of a multi-temp-table process. A CTE-based approach is a single SQL statement, which cannot be quietly broken by a step being accidentally commented out or by a schema change.
Probably not, but it may spool the derived results so it only needs to access it once.
In this case, there should be no difference between a CTE and derived table.
Where is the quote from?