New Analyst SQL Optimization - Multiple Spools

New Analyst SQL Optimization - Multiple Spools - sql

Unfortunately my ability to Query has outgrown my knowledge of SQL Optimization, so i am hoping someone would help a young analyst by looking at this atrocious execution plan and provide some wisdom as to how i could speed it up. I've read a few threads about spooling, but they were all mostly a discussion about weather an Eager Table spool is good or bad, and the answer is always "it depends".
My execution plan looks like it's Spooling and Sorting the same #Temp Table multiple times, and it's eating up a lot of execution cost.
My understanding of a Table Spool is that it is temporary storage to be used later, but if the data is already stored for later use, why would it spool the same data over and over again? My query doesn't require any ordering so why would it need to sort the same #TempTable/Spool multiple times?
I'm so new to this, i can't figure out how to attach the entire execution plan.... so i attached an image of the bottom half of it...
Help me experienced analysts. You're my only hope.
A Little Context.
I currently have a transaction table that tracks all changes made to a lead in my CRM, and i am attempting to create a new table from this data to speed up reporting.
I am pulling data from this transaction table and flagging the first action, first user, and other firsts of a lead by using Row_Number(). I am then inserting every "first" into a #Temp Table, as i know i am going to utilize this data multiple times.
SELECT
ID,
Action,
ROW_NUMBER() OVER (PARTITION BY ID, Action ORDER BY DATE) AS ActionNum,
ROW_NUMBER() OVER (PARTITION BY ID, Actor ORDER BY DATE) AS USERNUM
INTO #Temp
FROM Table
;
I am then Left joining this #Temp Table many times (10 times actually). I have tried multiple other ways of solving this issue but using Row_Number multiple times seems like the best solution.
SELECT
*
FROM #temp T1
LEFT JOIN #Temp T2
ON T2.ID = T1.ID AND T2.Action = A2 AND T2.ActionNum = 1
LEFT JOIN #Temp T3
On T3.ID = T1.ID AND T3.Action = A3 AND T3.ActionNum = 1
LEFT JOIN #Temp T4
ON T4.ID = T1.ID AND T4.UserNum = 1
WHERE
T1.Action = A1
AND
T1.ActionNum = 1
I've looked into creating a clustered index on the #TempTable, but i must not be doing it right because it didn't change anything about my execution.
Thanks in advance for all your help! Any good reading material is also greatly apprecaited!
Best,
Austin

Related

SQL - LEFT JOIN takes an extremely long time to execute

I'm trying to see if there are any rows in table A which I've missed in table B.
For this, I'm using the following query:
SELECT t1.cusa
FROM patch t1
LEFT JOIN trophy t2
ON t2.titleid = t1.titleid
WHERE t2.titleid IS NULL
And the query worked before, but now that the trophy table has nearly 200.000 rows, it's extremely slow. I've waited 5 minutes for it to execute but it was still loading and timed out eventually.
Is there any way to speed this query up?

Adding Indexes to titleId on both tables (but especially t2) is the quickest way to get better performance. 200K records is nothing for SQL Server.

Try this and it might perform a bit better!
SELECT t1.cusa
FROM patch t1
WHERE NOT EXISTS (SELECT 1
FROM trophy t2
WHERE t2.titleid = t1.titleid );

SQL Optimization for DB2

If you have a situation where you are doing a Union All on two result sets, and the each result set is derived from an inner join with the same filtered subset of a master table does the query engine "hit" the master table once, or twice?
example:
SELECT m.col4, st1.col2
FROM master m
INNER JOIN subTable1 st1
on st1.col1 = m.col1
WHERE m.col1 = 'a' and m.col2 = 123 and m.col3 = "a1b2"
UNION ALL
SELECT m.col4, st2.col2
FROM master m
INNER JOIN subTable2 st2
on st2.col1 = m.col1
WHERE m.col1 = 'a' and m.col2 = 123 and m.col3 = "a1b2"
I am trying to determine if it would be beneficial to create a temp table to hold the filtered results of the master table so the UNION ALL statement would be working with a small subset of the master records, instead of having to perform the filtering of the master table twice, like it might be doing in the example above.
thank you, in advance, for whatever advice you can give.

Maybe a common table expression helps:
with small_master as (
select m.col4,
m.col1
from master
where m.col1 = 'a'
and m.col2 = 123
and m.col3 = 'a1b2'
)
SELECT m.col4, st1.col2
FROM small_master m
INNER JOIN subTable1 st1
on st1.col1 = m.col1
UNION ALL
SELECT m.col4, st2.col2
FROM small_master m
INNER JOIN subTable2 st2
on st2.col1 = m.col1;
In my experience (not with DB2 though) this helps if the CTE is reducing the number of rows drastically (say from "millions" to "thousands").
If the intermediate result of the CTE is (still) quite large (several millions) then this will probably not help.
But only the execution plan can shed light on this.

The easiest way to answer this kind of "what if" questions is to look at the query plan. You can easily generate one with the command db2expln -d <your db> -f <your query file> -z <your query delimiter> -gi
Generally speaking, if a task can be done with a single SQL statement that will be the fastest way to accomplish the task, so it is unlikely that creating a temporary table will benefit performance.

This depends a lot on the database and the statistics of the tables involved. I am not intimately familiar with DB2.
However, if the issue is performance, then consider putting an index on master(col, col2, col3). This would speed up both parts of the query.
The use of a CTE as a temp table is highly database specific. Postgres always instantiates CTEs, so the code is only run once. SQL Server never does. I do not know the behavior of DB2 in this regards. However, I would prefer to add indexes to explicitly improve performance, rather than fiddling with the query -- your new query may result in unexpected query plans based when table statistics change, new software is released, or hardware is upgraded.
As references for SQL Server behavior, you might be interested in this one or this one or this discussion.

Can this SQL Query be optimized to run faster?

I have an SQL Query (For SQL Server 2008 R2) that takes a very long time to complete. I was wondering if there was a better way of doing it?
SELECT #count = COUNT(Name)
FROM Table1 t
WHERE t.Name = #name AND t.Code NOT IN (SELECT Code FROM ExcludedCodes)
Table1 has around 90Million rows in it and is indexed by Name and Code.
ExcludedCodes only has around 30 rows in it.
This query is in a stored procedure and gets called around 40k times, the total time it takes the procedure to finish is 27 minutes.. I believe this is my biggest bottleneck because of the massive amount of rows it queries against and the number of times it does it.
So if you know of a good way to optimize this it would be greatly appreciated! If it cannot be optimized then I guess im stuck with 27 min...
EDIT
I changed the NOT IN to NOT EXISTS and it cut the time down to 10:59, so that alone is a massive gain on my part. I am still going to attempt to do the group by statement as suggested below but that will require a complete rewrite of the stored procedure and might take some time... (as I said before, im not the best at SQL but it is starting to grow on me. ^^)

In addition to workarounds to get the query itself to respond faster, have you considered maintaining a column in the table that tells whether it is in this set or not? It requires a lot of maintenance but if the ExcludedCodes table does not change often, it might be better to do that maintenance. For example you could add a BIT column:
ALTER TABLE dbo.Table1 ADD IsExcluded BIT;
Make it NOT NULL and default to 0. Then you could create a filtered index:
CREATE INDEX n ON dbo.Table1(name)
WHERE IsExcluded = 0;
Now you just have to update the table once:
UPDATE t
SET IsExcluded = 1
FROM dbo.Table1 AS t
INNER JOIN dbo.ExcludedCodes AS x
ON t.Code = x.Code;
And ongoing you'd have to maintain this with triggers on both tables. With this in place, your query becomes:
SELECT #Count = COUNT(Name)
FROM dbo.Table1 WHERE IsExcluded = 0;
EDIT
As for "NOT IN being slower than LEFT JOIN" here is a simple test I performed on only a few thousand rows:
EDIT 2
I'm not sure why this query wouldn't do what you're after, and be far more efficient than your 40K loop:
SELECT src.Name, COUNT(src.*)
FROM dbo.Table1 AS src
INNER JOIN #temptable AS t
ON src.Name = t.Name
WHERE src.Code NOT IN (SELECT Code FROM dbo.ExcludedCodes)
GROUP BY src.Name;
Or the LEFT JOIN equivalent:
SELECT src.Name, COUNT(src.*)
FROM dbo.Table1 AS src
INNER JOIN #temptable AS t
ON src.Name = t.Name
LEFT OUTER JOIN dbo.ExcludedCodes AS x
ON src.Code = x.Code
WHERE x.Code IS NULL
GROUP BY src.Name;
I would put money on either of those queries taking less than 27 minutes. I would even suggest that running both queries sequentially will be far faster than your one query that takes 27 minutes.
Finally, you might consider an indexed view. I don't know your table structure and whether your violate any of the restrictions but it is worth investigating IMHO.

You say this gets called around 40K times. WHy? Is it in a cursor? If so do you really need a cursor. Couldn't you put the values you want for #name in a temp table and index it and then join to it?
select t.name, count(t.name)
from table t
join #name n on t.name = n.name
where NOT EXISTS (SELECT Code FROM ExcludedCodes WHERE Code = t.code)
group by t.name
That might get you all your results in one query and is almost certainly faster than 40K separate queries. Of course if you need the count of all the names, it's even simpleer
select t.name, count(t.name)
from table t
NOT EXISTS (SELECT Code FROM ExcludedCodes WHERE Code = t
group by t.name

NOT EXISTS typically performs better than NOT IN, but you should test it on your system.
SELECT #count = COUNT(Name)
FROM Table1 t
WHERE t.Name = #name AND NOT EXISTS (SELECT 1 FROM ExcludedCodes e WHERE e.Code = t.Code)
Without knowing more about your query it's tough to supply concrete optimization suggestions (i.e. code suitable for copy/paste). Does it really need to run 40,000 times? Sounds like your stored procedure needs reworking, if that's feasible. You could exec the above once at the start of the proc and insert the results in a temp table, which can keep the indexes from Table1, and then join on that instead of running this query.
This particular bit might not even be the bottleneck that makes your query run 27 minutes. For example, are you using a cursor over those 90 million rows, or scalar valued UDFs in your WHERE clauses?

Have you thought about doing the query once and populating the data in a table variable or temp table? Something like
insert into #temp (name, Namecount)
values Name, Count(name)
from table1
where name not in(select code from excludedcodes)
group by name
And don't forget that you could possibly use a filtered index as long as the excluded codes table is somewhat static.

Start evaluating the execution plan. Which is the heaviest part to compute?
Regarding the relation between the two tables, use a JOIN on indexed columns: indexes will optimize query execution.

Why is this SQL statement hanging when group by, sum, or where clause is included?

I have a SQL statement:
select
t3.item1,
t3.item2,
sum(t1.moneys)
from
table t1
inner join table t2 on t1.key = t2.key
inner join table t3 on t1.key2 = t3.key2
where
t2.type = 'thistype'
and t3.type2 = 'thistype'
group by
t3.item1, t3.item2
If I remove the group by, sum, or where clause it runs fine - but if I add back any of those it hangs forever... any ideas... this is on SQL Server Management Studio 2008 R2
Thanks
Further Testing
so I created a view:
select
t3.item1,
t3.item2,
t1.moneys,
t2.type,
t3.type2
from
table t1
inner join table t2 on t1.key = t2.key
inner join table t3 on t1.key2 = t3.key2
and I can select top 1000 from the view fine and see the type I want to specifically look at in the data, but when I add the 'where type2 = 'thistype'' it hangs again...

Your joining three tables together with millions of records, this is normal for it take a bit to run.
To answer your question about statistics, they are what the indices utilize to retrieve records faster from your tables. Without accurate or up to date statistics, indices can actually slow your queries down.
http://blogs.technet.com/b/rob/archive/2008/05/16/sql-server-statistics.aspx

I think we'd need to see some table structure and know some more things about your DB before we can give a solid answer. First thing, though, is to run a trace on it and see what that tells you.
At first blush, I have found that issues with aggregate functions (sum, group by, etc) tend to stem from a) overly large data sets (that is: you're just trying to pull back too much data) or b) from overly-complicated structure or relationships on the joined tables.
However, those are just my general rules-of-thumb, and may not apply in a specific situation: run a trace and any other form of profiling you can and see what that tells you.

Has you looked at the execution plan you're getting? That will tell you where the problem is. Do you have covering indices on the columns on which you're joining and grouping?

Is it possible that the execution plan is corrupted?
http://msdn.microsoft.com/en-us/library/aa175244(v=sql.80).aspx
Try recompiling the plan using sp_recompile

T-SQL Query Optimization

I'm working on some upgrades to an internal web analytics system we provide for our clients (in the absence of a preferred vendor or Google Analytics), and I'm working on the following query:
select
path as EntryPage,
count(Path) as [Count]
from
(
/* Sub-query 1 */
select
pv2.path
from
pageviews pv2
inner join
(
/* Sub-query 2 */
select
pv1.sessionid,
min(pv1.created) as created
from
pageviews pv1
inner join Sessions s1 on pv1.SessionID = s1.SessionID
inner join Visitors v1 on s1.VisitorID = v1.VisitorID
where
pv1.Domain = isnull(#Domain, pv1.Domain) and
v1.Campaign = #Campaign
group by
pv1.sessionid
) t1 on pv2.sessionid = t1.sessionid and pv2.created = t1.created
) t2
group by
Path;
I've tested this query with 2 million rows in the PageViews table and it takes about 20 seconds to run. I'm noticing a clustered index scan twice in the execution plan, both times it hits the PageViews table. There is a clustered index on the Created column in that table.
The problem is that in both cases it appears to iterate over all 2 million rows, which I believe is the performance bottleneck. Is there anything I can do to prevent this, or am I pretty much maxed out as far as optimization goes?
For reference, the purpose of the query is to find the first page view for each session.
EDIT: After much frustration, despite the help received here, I could not make this query work. Therefore, I decided to simply store a reference to the entry page (and now exit page) in the sessions table, which allows me to do the following:
select
pv.Path,
count(*)
from
PageViews pv
inner join Sessions s on pv.SessionID = s.SessionID
and pv.PageViewID = s.ExitPage
inner join Visitors v on s.VisitorID = v.VisitorID
where
(
#Domain is null or
pv.Domain = #Domain
) and
v.Campaign = #Campaign
group by pv.Path;
This query runs in 3 seconds or less. Now I either have to update the entry/exit page in real time as the page views are recorded (the optimal solution) or run a batch update at some interval. Either way, it solves the problem, but not like I'd intended.
Edit Edit: Adding a missing index (after cleaning up from last night) reduced the query to mere milliseconds). Woo hoo!

For starters,
where pv1.Domain = isnull(#Domain, pv1.Domain)
won't SARG. You can't optimize a match on a function, as I remember.

I'm back. To answer your first question, you could probably just do a union on the two conditions, since they are obviously disjoint.
Actually, you're trying to cover both the case where you provide a domain, and where you don't. You want two queries. They may optimize entirely differently.

What's the nature of the data in these tables? Do you find most of the data is inserted/deleted regularly?
Is that the full schema for the tables? The query plan shows different indexing..
Edit: Sorry, just read the last line of text. I'd suggest if the tables are routinely cleared/insertsed, you could think about ditching the clustered index and using the tables as heap tables.. just a thought
Definately should put non-clustered index(es) on Campaign, Domain as John suggested

Your inner query (pv1) will require a nonclustered index on (Domain).
The second query (pv2) can already find the rows it needs due to the clustered index on Created, but pv1 might be returning so many rows that SQL Server decides that a table scan is quicker than all the locks it would need to take. As pv1 groups on SessionID (and hence has to order by SessionID), a nonclustered index on SessionID, Created, and including path should permit a MERGE join to occur. If not, you can force a merge join with "SELECT .. FROM pageviews pv2 INNER MERGE JOIN ..."
The two indexes listed above will be:
CREATE NONCLUSTERED INDEX ncixcampaigndomain ON PageViews (Domain)
CREATE NONCLUSTERED INDEX ncixsessionidcreated ON PageViews(SessionID, Created) INCLUDE (path)

SELECT
sessionid,
MIN(created) AS created
FROM
pageviews pv
JOIN
visitors v ON pv.visitorid = v.visitorid
WHERE
v.campaign = #Campaign
GROUP BY
sessionid
so that gives you the sessions for a campaign. Now let's see what you're doing with that.
OK, this gets rid of your grouping:
SELECT
campaignid,
sessionid,
pv.path
FROM
pageviews pv
JOIN
visitors v ON pv.visitorid = v.visitorid
WHERE
v.campaign = #Campaign
AND NOT EXISTS (
SELECT 1 FROM pageviews
WHERE sessionid = pv.sessionid
AND created < pv.created
)

To continue from doofledorf.
Try this:
where
(#Domain is null or pv1.Domain = #Domain) and
v1.Campaign = #Campaign
Ok, I have a couple of suggestions
Create this covered index:
create index idx2 on [PageViews]([SessionID], Domain, Created, Path)
If you can amend the Sessions table so that it stores the entry page, eg. EntryPageViewID you will be able to heavily optimise this.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas