Fast calculation of partial sums on a large SQL Server table - sql

I need to calculate a total of a column up to a specified date on a table that currently has over 400k rows and is poised to grow further. I found the SUM() aggregate function to be too slow for my purpose, as I couldn't get it faster than about 1500ms for a sum over 50k rows.
Please note that the code below is the fastest implementation I have found so far. Notably, filtering the data from CustRapport into a temporary table first brought me a 3x performance increase. I also experimented with indexes, but they usually made things slower.
I would however like the function to be at least an order of magnitude faster. Any idea on how to achieve that? I have stumbled upon http://en.wikipedia.org/wiki/Fenwick_tree. However, I would rather have the storage and calculation processed within SQL Server.
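For what it's worth, the Fenwick tree linked above is simple to implement; here is a minimal sketch in Python (illustration only, since the goal is to keep storage and calculation inside SQL Server) showing O(log n) point updates and prefix sums. The day-indexing scheme at the bottom is a made-up example, not part of the question's schema:

```python
class FenwickTree:
    """Binary indexed tree: O(log n) point updates and prefix sums."""

    def __init__(self, size):
        self.size = size
        self.tree = [0.0] * (size + 1)  # 1-based internal array

    def add(self, i, delta):
        """Add delta at index i (1-based)."""
        while i <= self.size:
            self.tree[i] += delta
            i += i & -i  # jump to the next node whose range covers i

    def prefix_sum(self, i):
        """Sum of values at indices 1..i."""
        total = 0.0
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # strip the lowest set bit
        return total

# Hypothetical usage: index days by ordinal, add hours per day,
# then query the running total up to a given day in O(log n).
ft = FenwickTree(365)
ft.add(10, 8.0)   # 8 hours on day 10
ft.add(42, 5.5)   # 5.5 hours on day 42
ft.add(200, 3.0)
print(ft.prefix_sum(42))  # 13.5
```

The prefix sum up to a given day is exactly the "total up to a specified date" above, computed in about log2(n) steps instead of a scan.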
CustRapport and CustLeistung are Views with the following definition:
ALTER VIEW [dbo].[CustLeistung] AS
SELECT TblLeistung.* FROM TblLeistung
WHERE WebKundeID IN (SELECT WebID FROM XBauAdmin.dbo.CustKunde)
ALTER VIEW [dbo].[CustRapport] AS
SELECT MainRapport.* FROM MainRapport
WHERE WebKundeID IN (SELECT WebID FROM XBauAdmin.dbo.CustKunde)
Thanks for any help or advice!
ALTER FUNCTION [dbo].[getBaustellenstunden]
(
    @baustelleID int,
    @datum date
)
RETURNS
@ret TABLE
(
    Summe float
)
AS
BEGIN
    DECLARE @rapport TABLE
    (
        id int null
    )
    INSERT INTO @rapport SELECT WebSourceID FROM CustRapport
    WHERE RapportBaustelleID = @baustelleID AND RapportDatum <= @datum
    INSERT INTO @ret
    SELECT SUM(LeistungArbeit)
    FROM CustLeistung INNER JOIN @rapport AS r ON LeistungRapportID = r.id
    WHERE LeistungArbeit IS NOT NULL
    AND LeistungInventarID IS NULL AND LeistungArbeit > 0
    RETURN
END
Execution plan:
http://s23.postimg.org/mxq9ktudn/execplan1.png
http://s23.postimg.org/doo3aplhn/execplan2.png

Here is some general advice I can provide until you post more information.
I updated your query to pull straight from the tables instead of going through the views.
INSERT INTO @ret
SELECT
SUM(LeistungArbeit)
FROM (
SELECT DISTINCT WebID FROM XBauAdmin.dbo.CustKunde
) Web
INNER JOIN dbo.TblLeistung ON TblLeistung.WebKundeID = Web.WebID
INNER JOIN dbo.MainRapport ON MainRapport.WebKundeID = Web.WebID
AND TblLeistung.LeistungRapportID = MainRapport.WebSourceID
AND MainRapport.RapportBaustelleID = @baustelleID
AND MainRapport.RapportDatum <= @datum
WHERE TblLeistung.LeistungArbeit IS NOT NULL
AND TblLeistung.LeistungInventarID IS NULL
AND TblLeistung.LeistungArbeit > 0
Get rid of the table variable. They have their uses, but I switch to temp tables once I get over 100 records; indexed temp tables simply perform better in my experience.
Update your select to the above query and retest performance
Check and ensure there are indexes on every column referenced in the query. If you use the actual execution plan, SQL Server will help identify where indexes would be useful.
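To make the temp-table advice concrete, here is a small sketch using Python's built-in sqlite3 (table names are borrowed from the question, but the data, dates, and dialect are stand-ins; T-SQL temp tables differ in detail): filter into a temp table, index it, then join and sum.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE MainRapport (WebSourceID INTEGER, WebKundeID INTEGER,
                          RapportBaustelleID INTEGER, RapportDatum TEXT);
CREATE TABLE TblLeistung (LeistungRapportID INTEGER, WebKundeID INTEGER,
                          LeistungInventarID INTEGER, LeistungArbeit REAL);
""")
con.executemany("INSERT INTO MainRapport VALUES (?,?,?,?)",
                [(1, 7, 99, "2013-01-10"), (2, 7, 99, "2013-02-10"),
                 (3, 7, 50, "2013-01-05")])
con.executemany("INSERT INTO TblLeistung VALUES (?,?,?,?)",
                [(1, 7, None, 8.0), (2, 7, None, 5.5), (2, 7, 4, 2.0),
                 (3, 7, None, 1.0)])

# Materialise the filtered rapport IDs into a temp table and index it,
# mirroring the "indexed temp table instead of table variable" advice.
con.executescript("""
CREATE TEMP TABLE rapport AS
    SELECT WebSourceID AS id FROM MainRapport
    WHERE RapportBaustelleID = 99 AND RapportDatum <= '2013-02-28';
CREATE INDEX temp.ix_rapport ON rapport(id);
""")
(total,) = con.execute("""
    SELECT SUM(LeistungArbeit) FROM TblLeistung
    JOIN rapport ON LeistungRapportID = rapport.id
    WHERE LeistungArbeit IS NOT NULL
      AND LeistungInventarID IS NULL AND LeistungArbeit > 0
""").fetchone()
print(total)  # 13.5
```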

Related

SQL Spatial Indexing including category field(s)

I have a spatial index set up on a geography field in my 2012 SQL database that stores item locations. There are about 15,000 Items.
I need to return a total of Items within a radius of N kilometres of a given Lat/Lng.
I can do this and it's fast.
DECLARE @radius GEOGRAPHY = GEOGRAPHY::Point(@Lat, @Lng, 4326).STBuffer(@RadiusInMetres*1000)
SELECT
COUNT(*) AS Total
FROM dbo.Items i
WHERE
i.LatLngGeo.STIntersects(@radius) = 1
However, what I now need to do is filter by several fields, to get items that match a given Category and Price.
DECLARE @radius GEOGRAPHY = GEOGRAPHY::Point(@Lat, @Lng, 4326).STBuffer(@RadiusInMetres*1000)
SELECT
COUNT(*) AS Total
FROM dbo.Items i
WHERE
i.LatLngGeo.STIntersects(@radius) = 1 AND
(i.Category = @Category OR @Category is null) AND
(i.Price < @Price OR @Price is null)
This grinds away for about 10+ seconds, and I can find no way of adding varchar or number fields to a spatial index.
What can I do to speed this up?
I would start with something like this:
--Query 1 - use a CTE to split the two filters
DECLARE @radius GEOGRAPHY = GEOGRAPHY::Point(@Lat, @Lng, 4326).STBuffer(@RadiusInMetres * 1000);
WITH InRadius AS (
SELECT * FROM dbo.Items WHERE LatLngGeo.STIntersects(@radius) = 1)
SELECT
COUNT(*)
FROM
InRadius
WHERE
ISNULL(@Category, Category) = Category
AND ISNULL(@Price, Price) = Price;
GO
--Query 2 - use a temp table to *definitely* split the two filters
DECLARE @radius GEOGRAPHY = GEOGRAPHY::Point(@Lat, @Lng, 4326).STBuffer(@RadiusInMetres * 1000);
IF OBJECT_ID('tempdb..#temp') IS NOT NULL
DROP TABLE #temp;
WITH InRadius AS (
SELECT * FROM dbo.Items WHERE LatLngGeo.STIntersects(@radius) = 1)
SELECT * INTO #temp FROM InRadius;
SELECT
COUNT(*)
FROM
#temp
WHERE
ISNULL(@Category, Category) = Category
AND ISNULL(@Price, Price) = Price;
Run those queries a few times, then benchmark them, compared to your original script.
Another trick is to copy your original query in as well, then view the execution plan. What you are looking for is the percentage split per query, which would ideally be something like 98%:1%:1%, i.e. the original query will take 98% of the work, but it will probably look very different indeed.
If this doesn't help, and you are okay with temp tables, then try adding an index on the temp table that matches the criteria you are filtering on. However, with only 15,000 rows the effect of an index should be almost imperceptible.
And finally, you could limit the data being loaded into the temp table to only be the items you are going to filter on, as all you seem to want is a count at the end?
Let's just take a quick recap:
extract the data matching your spatial query (which you say is quick already);
discard anything GEOGRAPHY based from the results, storing them in a temp table;
index the temp table to speed up any filters;
now your COUNT(*) should be just on a sub-set of the data (with no spatial data), and the optimiser will not be able to try and combine it with the proximity filter;
profit!
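The recap above can be sketched outside SQL as well (plain Python, with a Euclidean-distance stand-in for the geography test and made-up items): a cheap proximity pass first, then attribute filters and the count run only on the small remainder.

```python
from math import dist  # Euclidean distance stand-in for the geography test

# (lat, lng, category, price) -- toy items; real code would query the DB.
items = [
    (10.0, 10.0, "cafe", 5.0),
    (10.1, 10.1, "cafe", 12.0),
    (10.2, 10.0, "bar", 4.0),
    (50.0, 50.0, "cafe", 3.0),   # far outside the radius
]
centre, radius = (10.0, 10.0), 0.5

# Phase 1: the (cheap, indexed) proximity filter, keeping only the
# columns needed later -- the "temp table without geography" step.
in_radius = [(cat, price) for lat, lng, cat, price in items
             if dist((lat, lng), centre) <= radius]

# Phase 2: attribute filters and the final count on the small subset.
category, max_price = "cafe", 10.0  # None would mean "no filter"
total = sum(1 for cat, price in in_radius
            if (category is None or cat == category)
            and (max_price is None or price < max_price))
print(total)  # 1
```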

SQL Server query with extremely large IN clause results in numerous queries in activity monitor

SQL Server 2014 database. Table with 200 million rows.
Very large query with HUGE IN clause.
I originally wrote this query for them, but they have grown the IN clause to over 700 entries. The CTE looks unnecessary because I have omitted all the select columns and their substring() transformations for simplicity.
The focus is on the IN clause. 700+ pairs of these.
WITH cte AS (
SELECT *
FROM [AODS-DB1B]
WHERE
Source+'-'+Target
IN
(
'ACY-DTW',
'ACY-ATL',
'ACY-ORD',
:
: 700+ of these pairs
:
'HTS-PGD',
'PIE-BMI',
'PGD-HTS'
)
)
SELECT *
FROM cte
order by Source, Target, YEAR, QUARTER
When running, this query shoots CPU to 100% for hours - not unexpectedly.
There are indexes on all columns involved.
Question 1: Is there a better or more efficient way to accomplish this query other than the huge IN clause? Would 700 UNION ALLs be better?
Question 2: When this query runs, it creates a Session_ID that contains 49 "threads" (49 processes that all have the same Session_ID), every one of them an instance of this query with its "Command" being this query text.
21 of them SUSPENDED,
14 of them RUNNING, and
14 of them RUNNABLE.
This changes rapidly as the task is running.
WHAT the heck is going on there? Is this SQL Server breaking the query up into pieces to work on it?
I recommend you store your 700+ strings in a permanent table, as it is generally perceived as bad practice to store that much metadata in a script. You can create the table like this:
CREATE TABLE dbo.Lookup(Source varchar(250), Target varchar(250))
CREATE INDEX IX_Lookup_Source_Target on dbo.Lookup(Source,Target)
INSERT INTO dbo.Lookup (Source,Target)
SELECT 'ACY','DTW'
UNION
SELECT 'ACY','ATL'
.......
and then you can simply join on this table:
SELECT * FROM [AODS-DB1B] a
INNER JOIN dbo.Lookup lt ON lt.Source = a.Source
AND lt.Target=a.Target
ORDER BY Source, Target, YEAR, QUARTER
However, even better would be to normalise the AODS-DB1B table and store SourceId and TargetId INT values instead, with the VARCHAR values stored in Source and Target tables. You can then write a query that only performs integer comparisons rather than string comparisons and this should be much faster.
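A sketch of that normalisation using Python's sqlite3 as a stand-in (table and column names are invented for illustration): the codes live once in a keyed table, and both the fact table and the pair lookup hold only integers, so the join compares ints rather than strings.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Airport (AirportId INTEGER PRIMARY KEY, Code TEXT UNIQUE);
CREATE TABLE Flights (SourceId INTEGER, TargetId INTEGER, Pax INTEGER);
CREATE TABLE Lookup  (SourceId INTEGER, TargetId INTEGER,
                      PRIMARY KEY (SourceId, TargetId));
""")
codes = ["ACY", "DTW", "ATL", "ORD"]
con.executemany("INSERT INTO Airport (Code) VALUES (?)", [(c,) for c in codes])
ids = dict(con.execute("SELECT Code, AirportId FROM Airport"))

con.executemany("INSERT INTO Flights VALUES (?,?,?)",
                [(ids["ACY"], ids["DTW"], 100),
                 (ids["ACY"], ids["ATL"], 200),
                 (ids["DTW"], ids["ORD"], 300)])
# The 700+ string pairs become integer rows in a keyed lookup table.
con.executemany("INSERT INTO Lookup VALUES (?,?)",
                [(ids["ACY"], ids["DTW"]), (ids["ACY"], ids["ATL"])])

(pax,) = con.execute("""
    SELECT SUM(Pax) FROM Flights f
    JOIN Lookup l ON l.SourceId = f.SourceId AND l.TargetId = f.TargetId
""").fetchone()
print(pax)  # 300
```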
Put all of your codes into a temporary table (or a permanent one if suitable).....
SELECT *
FROM [AODS-DB1B]
INNER JOIN NEW_TABLE ON Source+'-'+Target = NEW_TABLE.Code
WHERE
...
...
you can create a temp table with all those values and then JOIN to that table, it would make the process a lot faster
I like the answer from Jaco
Have an index on source, target
It may be worth giving this a try
where ( source = 'ACY' and target in ('DTW', 'ATL', 'ORD') )
or ( source = 'HTS' and target in ('PGD') )
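Generating that grouped predicate from the existing 'AAA-BBB' pairs is mechanical; here is a hypothetical Python helper (for a real system you would parameterise the values rather than interpolate strings):

```python
from collections import defaultdict

pairs = ["ACY-DTW", "ACY-ATL", "ACY-ORD", "HTS-PGD", "PIE-BMI", "PGD-HTS"]

# Group targets by source so each source is tested only once.
by_source = defaultdict(list)
for pair in pairs:
    source, target = pair.split("-")
    by_source[source].append(target)

# Emit one parenthesised predicate per source, as suggested above.
# NOTE: string interpolation is for illustration; use bind parameters
# in production to avoid SQL injection.
clauses = ["(source = '%s' and target in (%s))"
           % (src, ", ".join("'%s'" % t for t in targets))
           for src, targets in sorted(by_source.items())]
where = "\n  or ".join(clauses)
print(where)
```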

How can I improve the performance of this stored procedure?

Okay so I made some changes to a stored procedure that we have, and it now takes 3 hours to run (it used to only take 10 minutes before). I have a temp table called #tCustomersEmail. In it, is a column called OrderDate, which has a lot of null values in it. I want to replace those null values with data from another database on a different server. So here's what I have:
I create another temp table:
Create Table #tSouth
(
CustID char(10),
InvcDate nchar(10)
)
Which I populate with this data:
INSERT INTO #tSouth(CustID, InvcDate)
SELECT DISTINCT
[CustID],
max(InvcDate) as InvcDate
FROM D3.SouthW.dbo.uc_InvoiceLine I
where EXISTS (SELECT CustomerNumber FROM #tCustomersEmail H WHERE I.CustID = H.CustomerNumber)
group BY I.CustID
Then I take the data from #tSouth and update the OrderDate in the #tCustomersEmail table, as long as the CustomerNumber matches up, and the OrderDate is null:
UPDATE #tCustomersEmail
SET OrderDate = InvcDate
FROM #tCustomersEmail
INNER JOIN #tSouth ON #tCustomersEmail.CustomerNumber = [#tSouth].CustID
where #tCustomersEmail.OrderDate IS null
Making those changes caused the stored procedure to take FOR-EV-ER (Sandlot reference!)
So what am I doing wrong?
BTW I create indexes on my temp tables after I create them like so:
create clustered index idx_Customers ON #tCustomersEmail(CustomerNumber)
CREATE clustered index idx_CustSouthW ON #tSouth(CustID)
Try skipping the #tsouth table and use this query:
UPDATE a
SET OrderDate = (select max(InvcDate) from D3.SouthW.dbo.uc_InvoiceLine I
where a.customernumber = custid)
FROM #tCustomersEmail a
WHERE orderdate is null
I don't think the index will help you in this example
Maybe use a table variable instead of a temp table?
declare @temp table
(
CustID char(10),
InvcDate nchar(10)
)
insert into @temp
...
That definitely will increase the performance!
Distinct isn't needed if you have a GROUP BY. Given you are going across a linked server, I don't like the EXISTS; I would change that part to limit the number of rows at that point. Change to:
INSERT INTO #tSouth(CustID, InvcDate)
SELECT
[CustID],
max(InvcDate) as InvcDate
FROM D3.SouthW.dbo.uc_InvoiceLine I
where I.CustID in
(SELECT CustomerNumber
FROM #tCustomersEmail H
WHERE H.OrderDate IS null )
group BY I.CustID
EDIT: Looking closer, are you sure uc_InvoiceLine should be used? It looks like there should be a parent table to that one that would hold the date and have fewer rows.
Also, you can skip the one temp table by doing the update directly:
UPDATE #tCustomersEmail
SET OrderDate = InvcDate
FROM #tCustomersEmail
INNER JOIN (SELECT
[CustID],
max(InvcDate) as InvcDate
FROM D3.SouthW.dbo.uc_InvoiceLine I
where I.CustID in
(SELECT CustomerNumber
FROM #tCustomersEmail H
WHERE H.OrderDate IS null )
group BY I.CustID) Invoices
ON #tCustomersEmail.CustomerNumber = Invoices.CustID
It's difficult to predict the behaviour of complex queries involving tables on a linked server, because the local server has no access to statistics for the remote table, and can end up with a poor query plan because of this - it will work on the assumption that the remote table has either 1 or 100 rows.
If this wasn't bad enough, the result of the bad plan can be to pull the entire remote table over the wire into local temp space and work on it there. If the remote table is very large, this can be a major performance overhead.
It might be worth trying to simplify the linked server query to minimise the chances of the entire table being returned over the wire (as has already been mentioned, you don't need both DISTINCT and GROUP BY):
INSERT INTO #tSouth(CustID, InvcDate)
SELECT [CustID],
max(InvcDate) as InvcDate
FROM D3.SouthW.dbo.uc_InvoiceLine I
group BY I.CustID
leaving the rest of the query unchanged.
However, because of the aggregate this may still bring the whole table back to the local server - you'll need to test to find out. Your best bet may be to encapsulate this logic in a view in the SouthW database, if you're able to create objects in it, then reference that from your SP code.

sql server-query optimization with many columns

We have a "Profile" table with over 60 columns like (Id, fname, lname, gender, profilestate, city, state, degree, ...).
Users search for other people on the website. The query is like:
WITH TempResult as (
select ROW_NUMBER() OVER(ORDER BY @sortColumn DESC) as RowNum, profile.id from Profile
where
(@a is null or a = @a) and
(@b is null or b = @b) and
...(over 60 columns)
)
SELECT profile.* FROM TempResult join profile on TempResult.id = profile.id
WHERE
(RowNum >= @FirstRow)
AND
(RowNum <= @LastRow)
By default, SQL Server uses the clustered index to execute the query, but total execution time is over 300. We tested other solutions, such as a multi-column index over all the columns in the WHERE clause, but total execution time was over 400.
Do you have any solution to bring total execution time below 100?
We are using SQL Server 2008.
Unfortunately I don't think there is a pure SQL solution to your issue. Here are a couple alternatives:
Dynamic SQL - build up a query that only includes WHERE clause statements for values that are actually provided. Assuming the average search actually only fills in 2-3 fields, indexes could be added and utilized.
Full Text Search - go to something more like a Google keyword search. No individual options.
Lucene (or something else) - Search outside of SQL; This is a fairly significant change though.
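A minimal sketch of the dynamic-SQL idea in Python (hypothetical column names; in production the column list must come from a whitelist so user input never reaches the SQL text): only the filters actually supplied end up in the WHERE clause, so the optimiser sees a short, indexable predicate list.

```python
def build_search(filters):
    """Build a parameterised WHERE clause from only the filters provided.

    Column names are assumed to come from a trusted whitelist; the
    values themselves are passed as bind parameters, never interpolated.
    """
    clauses, params = [], []
    for column, value in filters.items():
        if value is not None:          # omit untouched search fields
            clauses.append("%s = ?" % column)
            params.append(value)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    return "SELECT id FROM Profile" + where, params

sql, params = build_search({"city": "grand rapids", "state": "MI",
                            "degree": None, "gender": None})
print(sql)     # SELECT id FROM Profile WHERE city = ? AND state = ?
print(params)  # ['grand rapids', 'MI']
```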
One other option that I just remembered implementing in a system once. Create a vertical table that includes all of the data you are searching on and build up a query for it. This is easiest to do with dynamic SQL, but could be done using Table Value Parameters or a temp table in a pinch.
The idea is to make a table that looks something like this:
Profile ID
Attribute Name
Attribute Value
The table should have a unique index on (Profile ID, Attribute Name) (unique to make the search work properly, index will make it perform well).
In this table you'd have rows of data like:
(1, 'city', 'grand rapids')
(1, 'state', 'MI')
(2, 'city', 'detroit')
(2, 'state', 'MI')
Then your SQL will be something like:
SELECT *
FROM Profile
JOIN (
SELECT ProfileID
FROM ProfileAttributes
WHERE (AttributeName = 'city' AND AttributeValue = 'grand rapids')
OR (AttributeName = 'state' AND AttributeValue = 'MI') -- OR, not AND: no single row can match both names; the HAVING below enforces that every attribute matched
GROUP BY ProfileID
HAVING COUNT(*) = 2
) SelectedProfiles ON Profile.ProfileID = SelectedProfiles.ProfileID
... -- Add your paging here
Like I said, you could use a temp table that has attribute name/values:
SELECT *
FROM Profile
JOIN (
SELECT ProfileID
FROM ProfileAttributes
JOIN PassedInAttributeTable ON ProfileAttributes.AttributeName = PassedInAttributeTable.AttributeName
AND ProfileAttributes.AttributeValue = PassedInAttributeTable.AttributeValue
GROUP BY ProfileID
HAVING COUNT(*) = CountOfRowsInPassedInAttributeTable -- calculate or pass in
) SelectedProfiles ON Profile.ProfileID = SelectedProfiles.ProfileID
... -- Add your paging here
As I recall, this ended up performing very well, even on fairly complicated queries (though I think we only had 12 or so columns).
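Here is the vertical-table idea end to end, using Python's sqlite3 as a stand-in for SQL Server (note the attribute predicates are ORed together, and the HAVING count enforces that all of them matched for a profile):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE ProfileAttributes (
    ProfileID INTEGER, AttributeName TEXT, AttributeValue TEXT,
    PRIMARY KEY (ProfileID, AttributeName))""")
con.executemany("INSERT INTO ProfileAttributes VALUES (?,?,?)", [
    (1, "city", "grand rapids"), (1, "state", "MI"),
    (2, "city", "detroit"),      (2, "state", "MI"),
])

# Match profiles satisfying ALL requested attribute pairs: OR the
# per-row predicates, then require one hit per requested attribute.
wanted = [("city", "grand rapids"), ("state", "MI")]
placeholders = " OR ".join("(AttributeName = ? AND AttributeValue = ?)"
                           for _ in wanted)
params = [v for pair in wanted for v in pair]
rows = con.execute("""
    SELECT ProfileID FROM ProfileAttributes
    WHERE %s
    GROUP BY ProfileID
    HAVING COUNT(*) = ?""" % placeholders, params + [len(wanted)]).fetchall()
print(rows)  # [(1,)]
```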
As a single query, I can't think of a clever way of optimising this.
Provided that each column's check is highly selective, however, the following (very long-winded) code might prove faster, assuming each individual column has its own separate index...
WITH
filter AS (
SELECT
[a].*
FROM
(SELECT * FROM Profile WHERE @a IS NULL OR a = @a) AS [a]
INNER JOIN
(SELECT id FROM Profile WHERE b = @b UNION ALL SELECT NULL WHERE @b IS NULL) AS [b]
ON ([a].id = [b].id) OR ([b].id IS NULL)
INNER JOIN
(SELECT id FROM Profile WHERE c = @c UNION ALL SELECT NULL WHERE @c IS NULL) AS [c]
ON ([a].id = [c].id) OR ([c].id IS NULL)
.
.
.
INNER JOIN
(SELECT id FROM Profile WHERE zz = @zz UNION ALL SELECT NULL WHERE @zz IS NULL) AS [zz]
ON ([a].id = [zz].id) OR ([zz].id IS NULL)
)
, TempResult as (
SELECT
ROW_NUMBER() OVER(ORDER BY @sortColumn DESC) as RowNum,
[filter].*
FROM
[filter]
)
SELECT
*
FROM
TempResult
WHERE
(RowNum >= @FirstRow)
AND (RowNum <= @LastRow)
EDIT
Also, thinking about it, you may even get the same result just by having the 60 individual indexes. SQL Server can do INDEX MERGING...
You've several issues, IMHO. One is that you're going to end up with a sequential scan no matter what you do.
But I think your more crucial issue here is that you've an unnecessary join:
SELECT profile.* FROM TempResult
WHERE
(RowNum >= @FirstRow)
AND
(RowNum <= @LastRow)
This is a classic "SQL filter" query problem. I've found that the typical approaches of (@b is null or b = @b) and its common derivatives all yield mediocre performance. The OR clause tends to be the cause.
Over the years I've done a lot of perf tuning & query optimisation. The approach I've found best is to generate dynamic SQL inside a stored proc. Most times you also need to add WITH RECOMPILE on the statement. The stored proc helps reduce the potential for SQL injection attacks. The recompile is needed to force the selection of indexes appropriate to the parameters you are searching on.
Generally it is at least an order of magnitude faster.
I agree you should also look at the points mentioned above, like:
If you commonly only refer to a small subset of the columns, you could create non-clustered "covering" indexes.
Highly selective columns (i.e. those with many unique values) will work best if they are the lead column in the index.
If many columns have a very small number of values, consider using the BIT datatype, or create your own bitmasked BIGINT to represent many columns, i.e. a form of "enumerated datatype". But be careful, as any function in the WHERE clause (like MOD or bitwise AND/OR) will prevent the optimiser from choosing an index. It works best if you know the value for each and can combine them to use an equality or range query.
While it is often good to find row IDs with a small query and then join to get all the other columns you want to retrieve (as you are doing above), this approach can sometimes backfire. If the first part of the query does a clustered index scan, then it is often faster to get the other columns you need in the select list and save the second table scan.
So it is always good to try it both ways and see what works best.
Remember to run SET STATISTICS IO ON and SET STATISTICS TIME ON before running your tests. Then you can see where the IO is, and it may help you with index selection for the most frequent combinations of parameters.
I hope this makes sense without long code samples. (it is on my other machine)
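The bitmask suggestion above can be illustrated in a few lines of Python (field names and bit widths are made up): several low-cardinality columns pack into one integer, so an equality search on all of them becomes a single comparison that one index on the packed column can satisfy.

```python
# Each field gets a (shift, width-mask) slot in the packed integer.
FIELDS = {"gender": (0, 0b1), "state_region": (1, 0b1111), "degree": (5, 0b111)}

def pack(values):
    """Combine several small enumerated values into one integer."""
    mask = 0
    for name, (shift, width) in FIELDS.items():
        mask |= (values[name] & width) << shift
    return mask

def unpack(mask, name):
    """Extract one field's value back out of the packed integer."""
    shift, width = FIELDS[name]
    return (mask >> shift) & width

profile = pack({"gender": 1, "state_region": 9, "degree": 3})
# An equality search on several fields becomes one integer comparison.
wanted = pack({"gender": 1, "state_region": 9, "degree": 3})
print(profile == wanted)          # True
print(unpack(profile, "degree"))  # 3
```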

SQL Server 2005 Full Text forum Search

I'm working on a search stored procedure for our existing forums.
I've written the following code which uses standard SQL full text indexes, however I'm sure there is a better way of doing it and would like a point in the right direction.
To give some info on how it needs to work, The page has 1 search text box which when clicked will search thread titles, thread descriptions and post text and should return the results with the title matches first, then descriptions then post data.
Below is what I've written so far which works but is not elegant or as fast as I would like. To give an example of performance with 20K threads and 80K posts it takes about 12 seconds to search for 5 random words.
ALTER PROCEDURE [dbo].[SearchForums]
(
--Input Params
@SearchText VARCHAR(200),
@GroupId INT = -1,
@ClientId INT,
--Paging Params
@CurrentPage INT,
@PageSize INT,
@OutTotalRecCount INT OUTPUT
)
AS
--Create Temp Table to Store Query Data
CREATE TABLE #SearchResults
(
Relevance INT IDENTITY,
ThreadID INT,
PostID INT,
[Description] VARCHAR(2000),
Author BIGINT
)
--Create and populate table of all GroupID's This search will return from
CREATE TABLE #GroupsToSearch
(
GroupId INT
)
IF @GroupId = -1
BEGIN
INSERT INTO #GroupsToSearch
SELECT GroupID FROM SNetwork_Groups WHERE ClientId = @ClientId
END
ELSE
BEGIN
INSERT INTO #GroupsToSearch
VALUES(@GroupId)
END
--Get Thread Titles
INSERT INTO #SearchResults
SELECT
SNetwork_Threads.[ThreadId],
(SELECT NULL) AS PostId,
SNetwork_Threads.[Description],
SNetwork_Threads.[OwnerUserId]
FROM
SNetwork_Threads
INNER JOIN SNetwork_Groups ON SNetwork_Groups.GroupId = SNetwork_Threads.GroupId
WHERE
FREETEXT(SNetwork_Threads.[Description], @SearchText) AND
Snetwork_Threads.GroupID IN (SELECT GroupID FROM #GroupsToSearch) AND
SNetwork_Groups.ClientId = @ClientId
--Get Thread Descriptions
INSERT INTO #SearchResults
SELECT
SNetwork_Threads.[ThreadId],
(SELECT NULL) AS PostId,
SNetwork_Threads.[Description],
SNetwork_Threads.[OwnerUserId]
FROM
SNetwork_Threads
INNER JOIN SNetwork_Groups ON SNetwork_Groups.GroupId = SNetwork_Threads.GroupId
WHERE
FREETEXT(SNetwork_Threads.[Name], @SearchText) AND
Snetwork_Threads.GroupID IN (SELECT GroupID FROM #GroupsToSearch) AND
SNetwork_Groups.ClientId = @ClientId
--Get Posts
INSERT INTO #SearchResults
SELECT
SNetwork_Threads.ThreadId,
SNetwork_Posts.PostId,
SNetwork_Posts.PostText,
SNetwork_Posts.[OwnerUserId]
FROM
SNetwork_Posts
INNER JOIN SNetwork_Threads ON SNetwork_Threads.ThreadId = SNetwork_Posts.ThreadId
INNER JOIN SNetwork_Groups ON SNetwork_Groups.GroupId = SNetwork_Threads.GroupId
WHERE
FREETEXT(SNetwork_Posts.PostText, @SearchText) AND
Snetwork_Threads.GroupID IN (SELECT GroupID FROM #GroupsToSearch) AND
SNetwork_Groups.ClientId = @ClientId
--Return Paged Result Sets
SELECT @OutTotalRecCount = COUNT(*) FROM #SearchResults
SELECT
#SearchResults.[ThreadID],
#SearchResults.[PostID],
#SearchResults.[Description],
#SearchResults.[Author]
FROM
#SearchResults
WHERE
#SearchResults.[Relevance] >= (@CurrentPage - 1) * @PageSize + 1 AND
#SearchResults.[Relevance] <= @CurrentPage * @PageSize
ORDER BY Relevance ASC
--Clean Up
DROP TABLE #SearchResults
DROP TABLE #GroupsToSearch
I know it's a bit long-winded, but just a nudge in the right direction would be well appreciated.
In case it helps, 80% of the query time is taken up when searching posts, and according to the query plan it is spent on a "Clustered Index Scan" on the posts table. I can't see any way around this.
Thanks
Gavin
I'd really have to see an explain plan to know where the slow parts were, as I don't see anything particularly nasty in your code. Very first thing - make sure all your indexes are in good shape, they are being used, statistics are up to date, etc.
One other idea would be to do the search on thread title first, then use the results from that to prune the searches on thread description and post text. Similarly, use the results from the thread description search to prune the post text search.
The basic idea here is that if you find the keywords in the thread title, why bother searching the description and posts? I realize this may not work depending on how you are presenting the search results to the user, and it may not make a huge difference, but it's something to think about.
80k records isn't that much. I'd recommend not inserting the resulting data into your temp table, and instead only inserting the IDs, then joining to that table afterward. This will save on writing to the temp table, as you may store 10,000 ints instead of 10,000 full posts (of which you discard all but one page). This may reduce the amount of time spent scanning posts, as well.
It looks like you would need two temp tables, one for threads and one for posts. You would union them in the final select.
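The IDs-only suggestion, sketched with Python's sqlite3 (a modulo predicate stands in for the FREETEXT match, and the schema is invented): write just the matching IDs, then join back to the posts table only for the one page actually displayed.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Posts (PostId INTEGER PRIMARY KEY, PostText TEXT)")
con.executemany("INSERT INTO Posts VALUES (?,?)",
                [(i, "body %d" % i) for i in range(1, 101)])

# Store only the matching IDs (cheap to write), not the full post text.
# "PostId % 7 = 0" is a stand-in for the real full-text predicate.
con.execute("""CREATE TEMP TABLE hits AS
               SELECT PostId FROM Posts WHERE PostId % 7 = 0""")

# Join back to Posts only for the single page being displayed.
page = con.execute("""
    SELECT p.PostId, p.PostText FROM hits h
    JOIN Posts p ON p.PostId = h.PostId
    ORDER BY p.PostId LIMIT 3 OFFSET 3""").fetchall()
print(page)  # [(28, 'body 28'), (35, 'body 35'), (42, 'body 42')]
```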