I have an SP that takes about 10 seconds to run 10 times (roughly a second each time it is run). The platform is ASP.NET, and the server is SQL Server 2005. I have indexed the table (not only on the PK), and that is not the issue. Some caveats:
usp_SaveKeyword is not the issue. I commented out that entire SP and it made no difference.
I set @SearchID to 1 and the time was significantly reduced, only taking about 15ms on average for the transaction.
I commented out the entire stored procedure except the insert into tblSearches and, strangely, it took more time to execute.
Any ideas of what could be going on?
set ANSI_NULLS ON
go
ALTER PROCEDURE [dbo].[usp_NewSearch]
    @Keyword VARCHAR(50),
    @SessionID UNIQUEIDENTIFIER,
    @time SMALLDATETIME = NULL,
    @CityID INT = NULL
AS
BEGIN
    SET NOCOUNT ON;
    IF @time IS NULL SET @time = GETDATE();

    DECLARE @KeywordID INT;
    EXEC @KeywordID = usp_SaveKeyword @Keyword;
    PRINT 'KeywordID : '
    PRINT @KeywordID

    DECLARE @SearchID BIGINT;
    SELECT TOP 1 @SearchID = SearchID
    FROM tblSearches
    WHERE SessionID = @SessionID
      AND KeywordID = @KeywordID;

    IF @SearchID IS NULL BEGIN
        INSERT INTO tblSearches
            (KeywordID, [time], SessionID, CityID)
        VALUES
            (@KeywordID, @time, @SessionID, @CityID)

        SELECT SCOPE_IDENTITY();
    END
    ELSE BEGIN
        SELECT @SearchID
    END
END
Why are you using TOP 1 @SearchID instead of MAX(SearchID) or WHERE EXISTS in this query? TOP requires you to run the query and retrieve the first row from the result set. If the result set is large, this can consume quite a lot of resources before you get the final result.
SELECT TOP 1 @SearchID = SearchID
FROM tblSearches
WHERE SessionID = @SessionID
  AND KeywordID = @KeywordID;
I don't see any obvious reason for this - either of the aforementioned constructs should get you something semantically equivalent with a very cheap index lookup. Unless I'm missing something, you should be able to do something like
select @SearchID = isnull(max(SearchID), -1)
from tblSearches
where SessionID = @SessionID
  and KeywordID = @KeywordID
This ought to be fairly efficient and (unless I'm missing something) semantically equivalent.
Enable "Display Estimated Execution Plan" in SQL Management Studio - where does the execution plan show you spending the time? It'll guide you on the heuristics being used to optimize the query (or not in this case). Generally the "fatter" lines are the ones to focus on - they're ones generating large amounts of I/O.
Unfortunately even if you tell us the table schema, only you will be able to see actually how SQL chose to optimize the query. One last thing - have you got a clustered index on tblSearches?
Triggers!
They are insidious indeed.
What is the clustered index on tblSearches? If the clustered index is not on primary key, the database may be spending a lot of time reordering.
How many other indexes do you have?
Do you have any triggers? (A quick way to check both is sketched after these questions.)
Where does the execution plan indicate the time is being spent?
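As a loose sketch (assuming the table lives in the dbo schema; adjust the name if not), you can list the indexes and triggers on tblSearches like this:

exec sp_helpindex 'dbo.tblSearches';

select t.name AS trigger_name, t.is_disabled
from sys.triggers t
where t.parent_id = object_id('dbo.tblSearches');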
Related
I'm trying to figure out why a query against a single table is taking far longer than I think it should. I am sure this question has a simple answer, but I've been scratching my head now for a while and may just not be seeing the forest for the trees.
I have a table, roughly 35 columns wide, with a standard assortment of columns (a few ints, a bunch of varchar()s of sizes ranging from 10 to 255, pretty basic), on which I have placed a Clustered Index on a column we'll call "PackageID" for the sake of explanation. There are a little north of a million records in this table, so there's a fair amount of data to comb through, and there may be one or more records with the same PackageID due to the nature of the records, but it's just a single 'flat' table.
Hitting the table I have a Stored Procedure that takes in a varchar(max) argument that could be a single PackageID or a comma-delimited list of 10, 50, 500, or more. The SProc calls a fairly standard, simple Split() function (found here and on other sites) that splits the list, returning the values as a table, which I then attempt to filter against my table for results. The IDs are int values, currently up to 5 digits in length; they will grow in the future, but only 5 right now.
I have tried a couple variations on the query inside the SProc (just the query here for brevity):
SELECT PackageID, Column01, Column02, Column03, ... , ColumnN
FROM MyTable
WHERE PackageID IN (SELECT SplitValue FROM dbo.Split(@ListOfIDs, ','))
and
;WITH cteIDs AS (
SELECT SplitValue
FROM dbo.Split(@ListOfIDs, ',')
)
SELECT PackageID, Column01, Column02, Column03, ... , ColumnN
FROM MyTable m
INNER JOIN cteIDs c ON m.PackageID = c.SplitValue
Running from SSMS, the Estimated Execution Plans for both show as identical, and both take roughly the same amount of time. When @ListOfIDs is short, the records return quickly, but as the ID list grows (and it can get to hundreds or more) the execution time can stretch to minutes or longer. There are no triggers, nothing else is using the table, and the query isn't being blocked or deadlocked by anything I can tell... it just runs slow.
I feel like I am missing something crazy simple here, but I am just not seeing it.
Appreciate any help, thanks!
UPDATE
This is the Split() function I am using; it's something I pulled from here, I don't know how long ago, and have been using ever since. If there is a better one I am happy to switch, this one just worked so I never gave it another thought...
CREATE FUNCTION [dbo].[Split]
(
    @String VARCHAR(max),
    @Delimiter VARCHAR(5)
)
RETURNS @SplittedValues TABLE
(
    OccurenceId SMALLINT IDENTITY(1,1),
    SplitValue VARCHAR(max)
)
AS
BEGIN
    DECLARE @SplitLength INT

    WHILE LEN(@String) > 0
    BEGIN
        SELECT @SplitLength = (CASE CHARINDEX(@Delimiter, @String)
                                   WHEN 0 THEN LEN(@String)
                                   ELSE CHARINDEX(@Delimiter, @String) - 1
                               END)

        INSERT INTO @SplittedValues
        SELECT SUBSTRING(@String, 1, @SplitLength)

        SELECT @String = (CASE (LEN(@String) - @SplitLength)
                              WHEN 0 THEN ''
                              ELSE RIGHT(@String, LEN(@String) - @SplitLength - 1)
                          END)
    END

    RETURN
END
GO
UPDATE - Testing Comment Suggestions
I have attempted to try out the suggestions in the comments, and here is what I have found out...
Table size: 1,081,154 records
Unique "PackageID" count: 16008 ID
List test size: 500 random IDs (comma delimited string arg input)
When I run (in SSMS) the query using just the Split() function it takes on average 309 seconds to return 373,761 records.
When I run the query but first dump the Split() results into a #TempTable (with Primary Key Index) and join that against the table, it takes on average 111 seconds to return the same 373,761 records.
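For reference, here is roughly what that temp-table variant looks like (columns abbreviated, and the CAST assumes PackageID is an int):

-- split the incoming list once, into an indexed temp table
CREATE TABLE #IDs (PackageID INT PRIMARY KEY);

INSERT INTO #IDs (PackageID)
SELECT DISTINCT CAST(SplitValue AS INT)
FROM dbo.Split(@ListOfIDs, ',');

-- then join against the temp table instead of calling Split() inside the main query
SELECT m.PackageID, m.Column01, m.Column02 -- , ... , ColumnN
FROM MyTable m
INNER JOIN #IDs i ON m.PackageID = i.PackageID;

DROP TABLE #IDs;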
I understand this is a lot of records, but this is a flat table with a Clustered Index on PackageID. The query is a very basic select just asking for the records matching on the ID. There are no calculations, no processing, no JOINs to other tables, no CASE statements, groupings, havings, etc. I am failing to understand why it is taking so long to execute. I've seen other queries with massive logic involved return thousands of records sub-second, so why does this "simple" looking thing get bogged down?
UPDATE - Adding Exec Plan
As requested, here is the Execution Plan for the query I am running. After dumping split values of the incoming delimited list of ID's into a #TempTable, the query is simply asking for all records out of Table A ("MyTable") with matching ID's found in Table B (the #TempTable). That's it.
Update - Order By
In the attached Execution Plan, as noted in the comments, there was an ORDER BY that appeared to be consuming a fair amount of overhead. I removed it from my query and re-ran my tests, which resulted in a minimal improvement in execution time. A test run that previously took 7 minutes would complete in 6:30 to 6:45 without the ORDER BY.
At this stage of the game, I am about to chalk this up to a case of data volume rather than anything to do with the query itself. It could be something on our network, the number of hops the data has to flow through between the SQL Server and the destination, the connection speed of the end user, and/or any number of other factors outside my control or ability to do anything about.
Thank you to all who have responded and provided suggestions. Many of which I will use going forward, and keep in mind as I work with the database.
Assuming you are not falling into the trap of using a different datatype for the index seek on your main table (e.g. your PackageID is varchar but you compare it against nvarchar or numeric values), the table join itself should be very fast.
To confirm this, you can split the process into 2 steps: insert into a temp table, then use the temp table to join. If the first step is super slow and the second step is super fast, that confirms the assumption above.
If step 1 is slow, it means the main cause of the slow performance is the Split function, which makes a lot of SUBSTRING calls.
Assume your list has 10,000 items of 20 bytes each. That means the variable is about 200KB.
With your current SUBSTRING approach, every iteration copies the remaining string into a new string. The string gradually shrinks from 200KB to 0KB, but you have already copied a 100+KB string about 5,000 times, roughly 1GB of data movement in total.
Below are 3 functions.
[Split$60769735$0] is your original function
[Split$60769735$1] is using XML
[Split$60769735$2] uses a binary split, but also makes use of your original function
[Split$60769735$1] is fast because it utilizes the specialized parser for XML that already can handle split very well.
[Split$60769735$2] is fast because it changes your O(n^2) complexity to O(n log n)
Run time is:
[Split$60769735$0] = 3 to 4 min
[Split$60769735$1] = 2 seconds
[Split$60769735$2] = 7 seconds
NOTE: as this is for demo purposes, some edge cases are not yet handled.
1. For [Split$60769735$1], if the values may contain < > &, some escaping is required
2. For [Split$60769735$2], if the delimiter may not be found in the second half of the string (i.e. one item can be as long as 5000 characters), you need to handle the case where CHARINDEX does not return a hit.
CREATE SCHEMA [TRY]
GO
CREATE FUNCTION [TRY].[Split$60769735$0]
(
    @String VARCHAR(max),
    @Delimiter VARCHAR(5)
)
RETURNS @SplittedValues TABLE
(
    OccurenceId INT IDENTITY(1,1),
    SplitValue VARCHAR(max)
)
AS
BEGIN
    DECLARE @SplitLength INT

    WHILE LEN(@String) > 0
    BEGIN
        SELECT @SplitLength = (CASE CHARINDEX(@Delimiter, @String)
                                   WHEN 0 THEN LEN(@String)
                                   ELSE CHARINDEX(@Delimiter, @String) - 1
                               END)

        INSERT INTO @SplittedValues
        SELECT SUBSTRING(@String, 1, @SplitLength)

        SELECT @String = (CASE (LEN(@String) - @SplitLength)
                              WHEN 0 THEN ''
                              ELSE RIGHT(@String, LEN(@String) - @SplitLength - 1)
                          END)
    END

    RETURN
END
GO
CREATE FUNCTION [TRY].[Split$60769735$1]
(
    @String VARCHAR(max),
    @Delimiter VARCHAR(5)
)
RETURNS @SplittedValues TABLE
(
    OccurenceId INT IDENTITY(1,1),
    SplitValue VARCHAR(max)
)
AS
BEGIN
    DECLARE @x XML = cast('<i>' + replace(@String, @Delimiter, '</i><i>') + '</i>' AS XML)

    INSERT INTO @SplittedValues
    SELECT v.value('.', 'varchar(100)') FROM @x.nodes('i') AS x(v)

    RETURN
END
GO
CREATE FUNCTION [TRY].[Split$60769735$2]
(
    @String VARCHAR(max),
    @Delimiter VARCHAR(5)
)
RETURNS @SplittedValues TABLE
(
    OccurenceId INT IDENTITY(1,1),
    SplitValue VARCHAR(max)
)
AS
BEGIN
    DECLARE @len int = len(@String);

    IF @len > 10000 BEGIN
        DECLARE @mid int = charindex(@Delimiter, @String, @len/2);

        INSERT INTO @SplittedValues
        SELECT SplitValue FROM TRY.[Split$60769735$2](substring(@String, 1, @mid - 1), @Delimiter);

        INSERT INTO @SplittedValues
        SELECT SplitValue FROM TRY.[Split$60769735$2](substring(@String, @mid + len(@Delimiter), @len - @mid - len(@Delimiter) + 1), @Delimiter);
    END ELSE BEGIN
        INSERT INTO @SplittedValues
        SELECT SplitValue FROM TRY.[Split$60769735$0](@String, @Delimiter);
    END

    RETURN
END
GO
NOTE:
- starting from SQL Server 2016 there is a built-in split function (STRING_SPLIT). But unfortunately you are on 2012.
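For comparison, on 2016+ the whole Split function collapses to a single call (using the asker's @ListOfIDs as input, and casting to int to match PackageID):

SELECT CAST(value AS int) AS PackageID
FROM STRING_SPLIT(@ListOfIDs, ',');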
If step 1 is fast but step 2 is slow, the likely issues are a datatype mismatch or a missing index. In that case, posting what your execution plan looks like will help most.
This is not much of an answer, but rather a bucket list. I see no obvious reason why this query performs poorly. The following are some unlikely, really unlikely, and ridiculous possibilities.
+1 on “make sure datatypes on either side of the join are identical”
+1 on loading the “split” data into its own temporary table.
I recommend a #temp table built with a primary key (as opposed to a @table variable), for obscure reasons having to do with statistics that I believe stopped being relevant in later versions of SQL Server (I started with 7.0, and easily lose track of when the myriad newfangleds got added).
What does the query plan show?
Try running it with "set statistics io on", to see how many page reads are actually involved (a minimal harness is sketched after this list).
During your testing, are you certain this is the ONLY query running against that database?
“MyTable” is a table, right? Not a view, synonym, linked server monstrosity, or other bizarre construct?
Any third-party tools installed that might be logging your every action on the DB and/or server?
Given that PackageId is not unique in MyTable, how much data is actually being returned? It may well be that it just takes that long for the data to be read and passed back to the calling system, though this really seems unlikely unless the server is flooded with other work to do as well.
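For the "set statistics io on" item, a minimal harness along these lines will do (the ID list is a made-up sample; substitute real values and the real column list):

DECLARE @ListOfIDs varchar(max);
SET @ListOfIDs = '101,205,309';   -- sample IDs, replace with a realistic list

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT m.PackageID, m.Column01 -- , ... , ColumnN
FROM MyTable m
WHERE m.PackageID IN (SELECT SplitValue FROM dbo.Split(@ListOfIDs, ','));

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

The Messages tab then shows logical versus physical reads per table, which tells you whether the time is going into I/O or into the split itself.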
Take the (simplified) stored procedure defined here:
create procedure get_some_stuffs
    @max_records int = null
as
begin
    set NOCOUNT on

    select top (@max_records) *
    from my_table
    order by mothers_maiden_name
end
I want to restrict the number of records selected only if #max_records is provided.
Problems:
The real query is nasty and large; I want to avoid having it duplicated similar to this:
if(@max_records is null)
begin
    select *
    from {massive query}
end
else
begin
    select top (@max_records) *
    from {massive query}
end
An arbitrary sentinel value doesn't feel right:
select top (ISNULL(@max_records, 2147483647)) *
from {massive query}
For example, if @max_records is null and {massive query} returns fewer than 2147483647 rows, would this be identical to:
select *
from {massive query}
or is there some kind of penalty for selecting top (2147483647) * from a table with only 50 rows?
Are there any other existing patterns that allow for an optionally count-restricted result set without duplicating queries or using sentinel values?
I was thinking about this, and although I like the explicitness of the IF statement in your Problem 1 statement, I understand the issue of duplication. As such, you could put the main query in a single CTE and use some trickery to query from it (the commented parts below are the highlight of this solution):
CREATE PROC get_some_stuffs
(
    @max_records int = NULL
)
AS
BEGIN
    SET NOCOUNT ON;

    WITH staged AS (
        -- Only write the main query one time
        SELECT * FROM {massive query}
    )
    -- This part below the main query never changes:
    SELECT *
    FROM (
        -- A little switcheroo based on the value of @max_records
        SELECT * FROM staged WHERE @max_records IS NULL
        UNION ALL
        SELECT TOP(ISNULL(@max_records, 0)) * FROM staged WHERE @max_records IS NOT NULL
    ) final
    -- Can't use ORDER BY in combination with a UNION, so move it out here
    ORDER BY mothers_maiden_name
END
I looked at the actual query plans for each and the optimizer is smart enough to completely avoid the part of the UNION ALL that doesn't need to run.
The ISNULL(@max_records, 0) is in there because TOP NULL isn't valid, and it will not compile.
You could use SET ROWCOUNT:
create procedure get_some_stuffs
    @max_records int = null
as
begin
    set NOCOUNT on

    IF @max_records IS NOT NULL
    BEGIN
        SET ROWCOUNT @max_records
    END

    select *
    from my_table
    order by mothers_maiden_name

    SET ROWCOUNT 0
end
There are a few methods, but as you probably notice these all look ugly or are unnecessarily complicated. Furthermore, do you really need that ORDER BY?
You could use TOP (100) PERCENT and a View, but the PERCENT only works if you do not really need that expensive ORDER BY, since SQL Server will ignore your ORDER BY if you try it.
I suggest taking advantage of stored procedures, but first let's explain the difference between the types of procs:
Hard Coded Parameter Sniffing
--Note the lack of a real parametrized column. See notes below.
IF OBJECT_ID('[dbo].[USP_TopQuery]', 'P') IS NULL
    EXECUTE('CREATE PROC dbo.USP_TopQuery AS ')
GO
ALTER PROC [dbo].[USP_TopQuery] @MaxRows NVARCHAR(50)
AS
BEGIN
    DECLARE @SQL NVARCHAR(4000) = N'SELECT * FROM dbo.ThisFile'
        , @Option NVARCHAR(50) = 'TOP (' + @MaxRows + ') *'

    IF ISNUMERIC(@MaxRows) = 0
        EXEC sp_executesql @SQL
    ELSE
    BEGIN
        SET @SQL = REPLACE(@SQL, '*', @Option)
        EXEC sp_executesql @SQL
    END
END
Local Variable Parameter Sniffing
IF OBJECT_ID('[dbo].[USP_TopQuery2]', 'P') IS NULL
    EXECUTE('CREATE PROC dbo.USP_TopQuery2 AS ')
GO
ALTER PROC [dbo].[USP_TopQuery2] @MaxRows INT = NULL
AS
BEGIN
    DECLARE @Rows INT;
    SET @Rows = @MaxRows;

    IF @MaxRows IS NULL
        SELECT *
        FROM dbo.THisFile
    ELSE
        SELECT TOP (@Rows) *
        FROM dbo.THisFile
END
No Parameter Sniffing, old method
IF OBJECT_ID('[dbo].[USP_TopQuery3]', 'P') IS NULL
    EXECUTE('CREATE PROC dbo.USP_TopQuery3 AS ')
GO
ALTER PROC [dbo].[USP_TopQuery3] @MaxRows INT = NULL
AS
BEGIN
    IF @MaxRows IS NULL
        SELECT *
        FROM dbo.THisFile
    ELSE
        SELECT TOP (@MaxRows) *
        FROM dbo.THisFile
END
PLEASE NOTE ABOUT PARAMETER SNIFFING:
SQL Server initializes variables in stored procs at compile time, not when it parses them.
This means that SQL Server will be unable to guess the value and will reuse the last valid execution plan for the query, regardless of whether it is even good for the current value.
There are two methods, hard coding and local variables, that affect how the optimizer guesses.
Hard Coding for Parameter Sniffing
Use sp_executesql to not only reuse the query plan, but also to prevent SQL injection.
However, this type of query will not always perform substantially better, since the TOP operator is not a column or table (so the statement effectively has no variables in the version I used).
Statistics at the time your compiled plan is created will dictate how effective the method is if you are not using a variable on a predicate (ON, WHERE, HAVING).
You can use the RECOMPILE option or hint to overcome this issue (a sketch follows below).
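For instance, using the table name from the proc above, either the statement-level hint or a per-call recompile will do (the TOP value here is arbitrary):

-- statement-level: recompile just this statement each time it runs
SELECT TOP (100) * FROM dbo.ThisFile OPTION (RECOMPILE);

-- call-level: force the whole proc to compile a fresh plan for this execution
EXEC dbo.USP_TopQuery @MaxRows = N'100' WITH RECOMPILE;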
Variable Parameter Sniffing
Variable parameter sniffing, on the other hand, is flexible enough to work with the statistics here, and in my own testing the variable-parameter version seemed to have the advantage of actually using the statistics (particularly after I updated the statistics).
Ultimately, the issue of performance is about which method takes the fewest steps to traverse the index leaf level. Statistics, the number of rows in your table, and the rules SQL Server uses to decide between a scan and a seek all affect performance.
Running different values will show performance change significantly, though typically both do better than USP_TopQuery3. So DO NOT ASSUME one method is necessarily better than the other.
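For what it's worth, a quick way to compare the variants on your own data (the values passed are arbitrary):

EXEC dbo.USP_TopQuery  @MaxRows = N'100';
EXEC dbo.USP_TopQuery2 @MaxRows = 100;
EXEC dbo.USP_TopQuery3 @MaxRows = 100;
EXEC dbo.USP_TopQuery2 @MaxRows = NULL;   -- the "no limit" case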
Also note you can use a table-valued function to do the same, but as Dave Pinal would say:
If you are going to answer that 'To avoid repeating code, you use Function' ‑ please think harder! Stored procedure can do the same... if you are going to answer with 'Function can be used in SELECT, whereas Stored Procedure cannot be used' ‑ again think harder!
SQL SERVER – Question to You – When to use Function and When to use Stored Procedure
You could do it like this (using your example):
create procedure get_some_stuffs
    @max_records int = null
as
begin
    set NOCOUNT on

    select top (ISNULL(@max_records, 1000)) *
    from my_table
    order by mothers_maiden_name
end
I know you don't like this (according to your point 2), but that's pretty much how it's done (in my experience).
How about something like this (you'd have to really look at the execution plans, and I didn't have time to set anything up)?
create procedure get_some_stuffs
    @max_records int = null
as
begin
    set NOCOUNT on

    select *
    from (
        select *, ROW_NUMBER() OVER (order by mothers_maiden_name) AS row_num
        from {massive query}
    ) numbered
    WHERE @max_records IS NULL OR row_num <= @max_records
end
Another thing you can do with {massive query} is make it a view or an inline table-valued function (if it's parameterized), which is generally a pretty good practice for anything big and repeatedly used.
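For example, a rough sketch of the inline TVF version of the same idea (the function name is made up; the optimizer expands an inline TVF like a view, so the big query only lives in one place):

CREATE FUNCTION dbo.fn_get_some_stuffs (@max_records int)
RETURNS TABLE
AS
RETURN
(
    SELECT *
    FROM (
        SELECT *, ROW_NUMBER() OVER (ORDER BY mothers_maiden_name) AS row_num
        FROM my_table   -- stand-in for {massive query}
    ) numbered
    WHERE @max_records IS NULL OR row_num <= @max_records
);
GO

-- callers pick the limit, or pass NULL for everything:
SELECT * FROM dbo.fn_get_some_stuffs(100);
SELECT * FROM dbo.fn_get_some_stuffs(NULL);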
I'm selecting available login infos from a DB randomly via the stored procedure below. But when multiple threads want to get available login infos at the same time, duplicate records are returned, even though I'm updating the timestamp field of the record.
How can I lock the rows here so that the record returned once won't be returned again?
Putting
WITH (HOLDLOCK, ROWLOCK)
didn't help!
SELECT TOP 1 @uid = [LoginInfoUid]
FROM [ZPer].[dbo].[LoginInfos]
WITH (HOLDLOCK, ROWLOCK)
WHERE ([Type] = @type)
...
...
...
ALTER PROCEDURE [dbo].[SelectRandomLoginInfo]
    -- Add the parameters for the stored procedure here
    @type int = 0,
    @expireTimeout int = 86400 -- 24 * 60 * 60 = 24h
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

    -- Insert statements for procedure here
    DECLARE @processTimeout int = 10 * 60
    DECLARE @uid uniqueidentifier

    BEGIN TRANSACTION

    -- SELECT [LoginInfos] which are currently not being processed ([Timestamp] is timed out) and which are not expired.
    SELECT TOP 1 @uid = [LoginInfoUid]
    FROM [MyDb].[dbo].[LoginInfos]
    WITH (HOLDLOCK, ROWLOCK)
    WHERE ([Type] = @type) AND ([Uid] IS NOT NULL) AND ([Key] IS NOT NULL) AND
          (
              ([Timestamp] IS NULL OR DATEDIFF(second, [Timestamp], GETDATE()) > @processTimeout) OR
              (
                  DATEDIFF(second, [UpdateDate], GETDATE()) <= @expireTimeout OR
                  ([UpdateDate] IS NULL AND DATEDIFF(second, [CreateDate], GETDATE()) <= @expireTimeout)
              )
          )
    ORDER BY NEWID()

    -- UPDATE the selected record so that it won't be re-selected.
    UPDATE [MyDb].[dbo].[LoginInfos] SET
        [UpdateDate] = GETDATE(), [Timestamp] = GETDATE()
    WHERE [LoginInfoUid] = @uid

    -- Return the full record data.
    SELECT *
    FROM [MyDb].[dbo].[LoginInfos]
    WHERE [LoginInfoUid] = @uid

    COMMIT TRANSACTION
END
Locking a row in shared mode doesn't help a bit in preventing multiple threads from reading the same row. You want to lock the row exclusively, with the XLOCK hint. Also, you are using a very low precision marker to determine candidate rows (GETDATE has roughly 3ms precision), so you will get a lot of false positives. You should use a precise field, like a bit (processing 0 or 1).
Ultimately you are treating LoginInfos as a queue, so I suggest you read Using tables as Queues. The way to achieve what you want is to use UPDATE ... WITH OUTPUT. But you have an additional requirement to select a random login, which throws everything haywire. Are you really, really, 100% convinced that you need randomness? It is an extremely unusual requirement and you will have a heck of a hard time coming up with a solution that is correct and performant. You'll get duplicates and you're going to deadlock till the day after.
A first attempt would go something like:
with cte as (
select top 1 ...
from [LoginInfos] with (readpast)
where processing = 0 and ...
order by newid())
update cte
set processing = 1
output cte...
But because the NEWID ordering requires a full table scan and sort to pick the 1 lucky winner row, you will be 1) extremely slow and 2) deadlocking constantly.
Now you may take this as a random forum rant, but it so happens I've been working with SQL Server backed queues for some years now and I know what you want will not work. You must modify your requirement, specifically the randomness, and then you can go back to the article linked above and use one of the tried and tested schemes.
Edit
If you don't need randomness then things are somewhat simpler. The gist of the tables-as-queues issue is that you must seek your output row; you absolutely cannot scan for it. Scanning over a queue is not only slow, it is a guaranteed deadlock because of the way queues are used (highly concurrent dequeue operations where all threads want the same row). To achieve this, your WHERE clause must be SARG-able, which depends on 1) the expressions in the WHERE clause and 2) the clustered index key. Your expression cannot contain OR conditions, so lose all the IS NULL OR ..., make the fields non-nullable and always populate them. Second, you must compare in an index-friendly manner: not DATEDIFF(..., field, ...) < @variable, but always field < DATEADD(..., @variable, ...), because the second form is SARG-able. And you must settle for one of the two fields, [Timestamp] or [UpdateDate]; you cannot seek on both. All of this, of course, calls for a much stricter and tighter state machine in your application, but that is a good thing; the lax conditions and OR clauses are only an indication of poor data input.
declare @now datetime, @expired datetime;
select @now = getdate();
select @expired = dateadd(second, -@processTimeout, @now);

with cte as (
    select *
    from [MyDb].[dbo].[LoginInfos] with (readpast, xlock)
    where [Type] = @type
      and [Timestamp] < @expired)
update cte
set [Timestamp] = @now
output INSERTED.*;
For this to work, the clustered index of the table must be on ([Type], [Timestamp]) (which implies making the primary key LoginInfoUid a non-clustered index). A sketch of that change follows.
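Roughly (the existing primary key constraint name is assumed; check yours with sp_helpconstraint first):

-- drop the existing clustered PK and recreate it as non-clustered (constraint name assumed)
ALTER TABLE [dbo].[LoginInfos] DROP CONSTRAINT [PK_LoginInfos];
ALTER TABLE [dbo].[LoginInfos] ADD CONSTRAINT [PK_LoginInfos]
    PRIMARY KEY NONCLUSTERED ([LoginInfoUid]);

-- cluster the table on the dequeue predicate
CREATE CLUSTERED INDEX [CIX_LoginInfos_Type_Timestamp]
    ON [dbo].[LoginInfos] ([Type], [Timestamp]);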
I'm running a procedure which takes around 1 minute the first time it executes, but the next time it drops to around 9-10 seconds. And after some time it again takes around 1 minute.
My procedure works against a single table which has 6 non-clustered indexes and 1 clustered index; the unique id column is of the uniqueidentifier data type, and the table has 1,218,833 rows.
Can you guide me on where the problem is and what performance improvements are possible?
Thanks in advance.
Here is the procedure.
PROCEDURE [dbo].[Proc] (
    @HLevel NVARCHAR(100),
    @HLevelValue INT,
    @Date DATE,
    @Numbers NVARCHAR(MAX) = NULL
)
AS
    DECLARE @LoopCount INT, @DateLastYear DATE

    DECLARE @Table1 TABLE ( list of columns )
    DECLARE @Table2 TABLE ( list of columns )

    -- LOOP FOR 12 MONTH DATA
    SET @LoopCount = 12
    WHILE (@LoopCount > 0)
    BEGIN
        SET @LoopCount = @LoopCount - 1

        -- LAST YEAR DATA
        DECLARE @LastDate DATE;
        SET @LastDate = DATEADD(D, -1, DATEADD(yy, -1, DATEADD(D, 1, @Date)))

        INSERT INTO @Table1
        SELECT list of columns
        FROM Table3
        WHERE Date = @Date
          AND CASE
                  WHEN @HLevel = 'crieteria1' THEN col1
                  WHEN @HLevel = 'crieteria2' THEN col2
                  WHEN @HLevel = 'crieteria3' THEN col3
              END = @HLevelValue

        INSERT INTO @Table2
        SELECT list of columns
        FROM table4
        WHERE Date = @LastDate
          AND ( @Numbers IS NULL OR columnNumber IN ( SELECT * FROM dbo.ConvertNumbersToTable(@Numbers) ) )

        INSERT INTO @Table1
        SELECT list of columns
        FROM @Table2 Prf2
        WHERE Prf2.col1 IN (SELECT col2 FROM @Table1) AND Year(Date) = Year(@Date)

        SET @Date = DATEADD(D, -1, DATEADD(m, -1, DATEADD(D, 1, @Date)));
    END

    SELECT list of columns FROM @Table1
The first time the query runs, the data is not in the data cache and so has to be retrieved from disk. Also, it has to prepare an execution plan. Subsequent times you run the query, the data will be in the cache and so it will not have to go to disk to read it. It can also reuse the execution plan generated originally. This means execution time can be much quicker and why an ideal situation is to have large amounts of RAM in order to be able to cache as much data in memory as possible (it's the data cache that offers the biggest performance improvements).
If execution times subsequently increase again, it's possible that the data is being removed from the cache (and execution plans can be removed from the cache too) - depends on how much pressure there is for RAM. If SQL Server needs to free some up, it will remove stuff from the cache. Data/execution plans that are used most often/have the highest value will remain cached for longer.
There are, of course, other things that could be a factor, such as the load on the server at the time, or whether your query is being blocked by other processes, etc.
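One way to see whether a slow run is going to disk is to compare logical vs. physical reads around a call (the parameter values below are placeholders; use ones that match your data):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- a fast (cached) run shows mostly logical reads; a slow run shows physical / read-ahead reads
EXEC [dbo].[Proc] @HLevel = N'crieteria1', @HLevelValue = 1, @Date = '20130101', @Numbers = NULL;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;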
It seems that the stored procedure is being recompiled repeatedly after some time. To reduce recompilation, please check this article:
http://blog.sqlauthority.com/2010/02/18/sql-server-plan-recompilation-and-reduce-recompilation-performance-tuning/
Database : SQL Server 2005
Problem : Copy values from one column to another column in the same table, with a billion+ rows.
test_table (int id, bigint bigid)
Tried 1 - an update query:
update test_table set bigid = id
fills up the transaction log and rolls back due to lack of transaction log space.
Tried 2 - a procedure along the following lines:
set nocount on

declare @rowcount int, @rowsupdated bigint
set @rowcount = 1
set @rowsupdated = 0
set rowcount 500000

while @rowcount > 0
begin
    update test_table set bigid = id where bigid is null
    set @rowcount = @@rowcount
    set @rowsupdated = @rowsupdated + @rowcount
end

set rowcount 0
print @rowsupdated
The above procedure starts slowing down as it proceeds.
Tried 3 - creating a cursor for the update.
This is generally discouraged in the SQL Server documentation, and this approach updates one row at a time, which is too time consuming.
Is there an approach that can speed up the copying of values from one column to another? Basically I am looking for some 'magic' keyword or logic that will allow the update query to rip through the billion rows, half a million at a time, sequentially.
Any hints, pointers will be much appreciated.
I'm going to guess that you are closing in on the 2.1 billion limit of an INT datatype on an artificial key column. Yes, that's a pain. Much easier to fix before the fact than after you've actually hit that limit and production is shut down while you are trying to fix it :)
Anyway, several of the ideas here will work. Let's talk about speed, efficiency, indexes, and log size, though.
Log Growth
The log blew up originally because it was trying to commit all 2b rows at once. The suggestions in other posts for "chunking it up" will work, but that may not totally resolve the log issue.
If the database is in SIMPLE mode, you'll be fine (the log will re-use itself after each batch). If the database is in FULL or BULK_LOGGED recovery mode, you'll have to run log backups frequently during the running of your operation so that SQL can re-use the log space. This might mean increasing the frequency of the backups during this time, or just monitoring the log usage while running.
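If you are in FULL or BULK_LOGGED, the between-batch log backup is just the usual command; a sketch (database name and backup path are placeholders):

BACKUP LOG [YourDatabase]
TO DISK = N'D:\Backups\YourDatabase_log.trn';   -- adjust database name and path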
Indexes and Speed
ALL of the where bigid is null answers will slow down as the table is populated, because there is (presumably) no index on the new BIGID field. You could, (of course) just add an index on BIGID, but I'm not convinced that is the right answer.
The key (pun intended) is my assumption that the original ID field is probably the primary key, or the clustered index, or both. In that case, let's take advantage of that fact and do a variation of Jess' idea:
declare @counter int
set @counter = 1

while @counter < 2000000000 --or whatever
begin
    update test_table set bigid = id
    where id between @counter and (@counter + 499999)   --BETWEEN is inclusive
    set @counter = @counter + 500000
end
This should be extremely fast, because of the existing indexes on ID.
The ISNULL check really wasn't necessary anyway, neither is my (-1) on the interval. If we duplicate some rows between calls, that's not a big deal.
Use TOP in the UPDATE statement:
DECLARE @row_limit int
SET @row_limit = 500000   -- batch size

UPDATE TOP (@row_limit) dbo.test_table
SET bigid = id
WHERE bigid IS NULL
You could try to use something like SET ROWCOUNT and do batch updates:
SET ROWCOUNT 5000;
UPDATE dbo.test_table
SET bigid = id
WHERE bigid IS NULL
GO
and then repeat this as many times as you need to.
This way, you're avoiding the RBAR (row-by-agonizing-row) symptoms of cursors and while loops, and yet, you don't unnecessarily fill up your transaction log.
Of course, in between runs, you'd have to do backups (especially of your log) to keep its size within reasonable limits.
Is this a one time thing? If so, just do it by ranges:
declare @counter bigint
set @counter = 500000

while @counter < 2000000000 --or whatever your max id
begin
    update test_table set bigid = id
    where id between (@counter - 500000) and @counter and bigid is null
    set @counter = @counter + 500000
end
I didn't run this to try it, but if you can get it to update 500k at a time I think you're moving in the right direction.
set rowcount 500000

update tt1
set bigid = (SELECT tt2.id FROM test_table tt2 WHERE tt1.id = tt2.id)
from test_table tt1
where bigid IS NULL
You can also try changing the recovery model so the transaction log doesn't fill up (SIMPLE recovery still logs each transaction, but the log space is reused after checkpoints):
ALTER DATABASE db1
SET RECOVERY SIMPLE
GO
update test_table
set bigid = id
GO
ALTER DATABASE db1
SET RECOVERY FULL
GO
The first step, if there are any indexes, would be to drop them before the operation. That is probably what is causing the speed degradation over time.
The other option, a little outside-the-box thinking... can you express the update in such a way that you could materialize the column values in a select? If you can do this, then you could create what amounts to a NEW table using SELECT INTO, which is a minimally logged operation (assuming in 2005 that you are set to a recovery model of SIMPLE or BULK_LOGGED). This would be pretty fast, and then you can drop the old table, rename this table to the old table name, and recreate any indexes.
select id, CAST(id as bigint) bigid into test_table_temp from test_table
drop table test_table
exec sp_rename 'test_table_temp', 'test_table'
I second the UPDATE TOP(X) statement.
Also, if you're in a loop, add in a WAITFOR delay or a COMMIT between batches, to allow other processes some time to use the table if needed, instead of blocking forever until all the updates are completed (a minimal loop is sketched below).
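For example, the kind of loop I have in mind (batch size and delay are arbitrary):

DECLARE @rows int
SET @rows = 1

WHILE @rows > 0
BEGIN
    UPDATE TOP (500000) dbo.test_table
    SET bigid = id
    WHERE bigid IS NULL

    SET @rows = @@ROWCOUNT

    WAITFOR DELAY '00:00:02'   -- give other sessions a couple of seconds between batches
END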