Avoiding while loops in SQL when a counter is required

I feel like this is a common problem, but none of the answers that I have found on SO or other sites seem to address the issue of a while loop with a counter.
Let's say that I am trying to write a stored procedure in SQL that will populate a user's timesheet by inserting a row for each day for the remainder of the month. If the @endMonth variable holds the last day of the month, then I know that I could easily write a while loop and do something along these lines:
WHILE @date <= @endMonth
BEGIN
    -- Do some action with the date, like an insert
    SET @date = DATEADD(d, 1, @date) -- increment the date by one day
END
However, looking at answers here and on other sites leads me to believe that it would be best to avoid using a while loop if at all possible.
So my question is this: is there a way I can implement a loop with a counter in SQL without using the WHILE structure? What technique would I use to go about converting a loop similar to the one I posted? Or with something like this, do I have to bite the bullet and just use a while loop?
As an aside, some of the following questions come close, but none of them seem to quite address the issue of needing a counter as a loop condition. Most of the answers seem to condemn using WHILE loops, but I can't seem to find a general purpose solution as to an alternative.
sql while loop with date counter
SQL Server 2008 Insert with WHILE LOOP (this one was close, but unfortunately for me it only works with an auto increment column)

There are many examples of populating a date range. First you generate the dates from the start date to the end date in a CTE, and then you can insert them into a table. One approach uses a recursive CTE:
DECLARE @StartDate DATETIME = '2014-06-01'
DECLARE @EndDate DATETIME = '2014-06-29'
;WITH populateDates (dates) AS (
    SELECT @StartDate AS dates
    UNION ALL
    SELECT DATEADD(d, 1, dates)
    FROM populateDates
    WHERE DATEADD(d, 1, dates) <= @EndDate
)
SELECT *
INTO dbo.SomeTable
FROM populateDates
OPTION (MAXRECURSION 0) -- lift the default 100-level recursion limit for ranges longer than 100 days
Searching the web for how to populate a date table in SQL will turn up many more variations on this technique.
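A set-based alternative that avoids both the loop and the recursion is a numbers (tally) approach. A minimal sketch, assuming the same @StartDate and @EndDate as above; sys.objects is used here only as a convenient row source, and any table with enough rows would do:
DECLARE @StartDate DATETIME = '2014-06-01'
DECLARE @EndDate DATETIME = '2014-06-29'
-- Generate one integer per day in the range, then offset the start date by it.
;WITH Numbers AS (
    SELECT TOP (DATEDIFF(d, @StartDate, @EndDate) + 1)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
    FROM sys.objects AS a
    CROSS JOIN sys.objects AS b -- any sufficiently large row source
)
SELECT DATEADD(d, n, @StartDate) AS dates
FROM Numbers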

As a general case, you can increment values without using cursors by assigning to the column and incrementing the variable in the same statement (sometimes called a "quirky update"), like this:
DECLARE @i INT = 0
DECLARE @table TABLE
    (
      ID INT ,
      testfield VARCHAR(5)
    )
INSERT INTO @table
        ( testfield )
VALUES  ( 'abcd' ),
        ( 'efgh' ),
        ( 'ijkl' ),
        ( 'mnop' )
UPDATE @table
SET @i = ID = @i + 1
SELECT *
FROM @table
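One caveat: the order in which this quirky update assigns values is not guaranteed by the engine. On SQL Server 2005 and later, ROW_NUMBER() produces the same numbering with documented semantics; a minimal sketch against the same table variable:
SELECT ROW_NUMBER() OVER (ORDER BY testfield) AS ID ,
       testfield
FROM @table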

I used a sequence, created temporarily. I needed to do my updates outside of a script context, in plain SQL, and a sequence was the only "counter" I could come up with.
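For reference, sequences exist from SQL Server 2012 onward; a minimal sketch (the sequence name is just an example):
CREATE SEQUENCE dbo.TempCounter AS INT START WITH 1 INCREMENT BY 1;
-- Each call returns the next value of the counter.
SELECT NEXT VALUE FOR dbo.TempCounter;
DROP SEQUENCE dbo.TempCounter; -- "temporary": drop it when finished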

Related

Iterative Union ALL's

I have a large SQL Server 2012 database, which I am querying across 3 tables to create a result set of 5 fields.
I want to repeat this query in a WHILE loop and UNION ALL the result sets obtained in each iteration. The iteration is driven by a variable, @the_date, which will increment over the past 6 years and stop at today's date.
At each iteration a different result set will be obtained by the SELECT.
So I am trying to code the stored procedure as follows:
Declare @the_date as Date,
        @to_date as Date
-- I setup the above dates, @the_date being 6 years behind @to_date
-- Want to loop for each day over the 6-year period
WHILE (@the_date <= @to_date)
BEGIN
    -- the basic select query looks like this
    Select Table1.Field-1, Table2.Field-2 ...
    FROM Table1
    Inner Join Table2 ...
        On ( ..etc.. )
    -- the JOIN conditions are based on table attributes which are compared with
    -- @the_date to get a different result set each time
    -- now move the date up by 1
    DateAdd(Day, +1, @the_date)
    -- want to concatenate the result sets
    UNION ALL
END
The above gives me a syntax error:
Incorrect syntax near the keyword 'Union'.
Any ideas on a solution to my problem would be welcome - thanks.
Don't use a UNION. You can't in a loop anyway. Instead store the results of each iteration in a temp table or a table variable and select from the temp table / table variable instead.
DECLARE @the_date as Date,
        @to_date as Date
CREATE TABLE #t (Col1 VARCHAR(100))
WHILE (@the_date <= @to_date)
BEGIN
    INSERT #t (Col1) SELECT ... etc
    SET @the_date = DATEADD(Day, 1, @the_date) -- the result must be assigned back, or the loop never ends
END
SELECT Col1 FROM #t
That said, if you provide some sample data and expected results we might be able to help you with a more efficient set-based solution. You should avoid iterative looping in RDBMS whenever possible.
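To illustrate, a set-based version of this pattern would build the date list once with a recursive CTE and join it to the tables in a single query instead of looping. A sketch with hypothetical table and column names:
DECLARE @the_date DATE = DATEADD(YEAR, -6, GETDATE()),
        @to_date  DATE = GETDATE();
WITH AllDates AS (
    SELECT @the_date AS d
    UNION ALL
    SELECT DATEADD(DAY, 1, d) FROM AllDates WHERE d < @to_date
)
SELECT t1.SomeField, t2.OtherField, ad.d
FROM AllDates AS ad
JOIN Table1 AS t1 ON t1.EffectiveDate = ad.d -- hypothetical join condition
JOIN Table2 AS t2 ON t2.ID = t1.ID
OPTION (MAXRECURSION 0); -- the 6-year range exceeds the default 100-level limit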

Fake a long running SQL statement

I want to fake a long running SQL statement so I can experiment with sys.dm_exec_requests
"Fake" isn't the best way to describe it, but does anyone have a good tip on perhaps selecting autogenerated records? Perhaps using a CTE?
Here's a long-running SQL statement:
WAITFOR DELAY '0:05';
It will take five minutes to execute ('0:05' is hours:minutes).
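While it runs, you can observe it from another session; for example, a quick look at sys.dm_exec_requests:
SELECT session_id, status, command, wait_type, wait_time, total_elapsed_time
FROM sys.dm_exec_requests
WHERE session_id > 50; -- filter out most system sessions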
In one query window, execute the following:
BEGIN TRANSACTION
SELECT * FROM TableY WITH (XLOCK)
Then, in another window, execute any(*) query that attempts to access TableY. It will run for as long as you like, until you close the first window or execute a ROLLBACK or COMMIT in it.
(*) assuming you don't add a NOLOCK hint to the reference to TableY.
Just as I was writing "CTE"... it made me think. A quick search later turned up a variation on
http://smehrozalam.wordpress.com/2009/06/09/t-sql-using-common-table-expressions-cte-to-generate-sequences/
-- define start and end limits
Declare @start bigint, @end bigint
Select @start = 1, @end = 99999
;With NumberSequence( Number ) as
(
    Select @start as Number
    union all
    Select Number + 1
    from NumberSequence
    where Number < @end
)
-- select result
Select * From NumberSequence Option (MaxRecursion 0)
I really like Alex KeySmith's CTE answer https://stackoverflow.com/a/14138219/318411 because you can apply it to your own queries; this means you can return valid data in a long running form.
So for example if you have a test table with a couple of rows in but you want to quickly know how your application code performs with large result sets, you can do the following:
declare @i int, @c int
select @i = 1, @c = 10;
with X as (
    select @i as N union all select N + 1 from X where N < @c
)
select
    T.*
from
    X,
    (
        select *
        from MySmallTestTable
    ) AS T
option (MaxRecursion 0)
This will repeat the test data @c times.
I have also used it to test query cancellation code.

Using while loop in T-SQL function

Non-database programmer here. As it happens, I need to create a function in T-SQL that returns the workdays count between given dates. I believe the easiest way to do this is with a while loop. The problem is that as soon as I write something like
while @date < @endDate
begin
end
the statement won't execute, claiming "incorrect syntax near the keyword 'return'" (not very helpful). Where's the problem?
P.S. Full code:
ALTER FUNCTION [dbo].[GetNormalWorkdaysCount] (
    @startDate DATETIME,
    @endDate DATETIME
)
RETURNS INT
AS
BEGIN
    declare @Count INT,
            @CurrDate DATETIME
    set @CurrDate = @startDate
    while (@CurrDate < @endDate)
    begin
    end
    return @Count
END
GO
Unlike some languages, the BEGIN/END pair in SQL Server cannot be empty - they must contain at least one statement.
As to your actual problem - you've said you're not a DB programmer. Most beginners to SQL tend to go down the same route - trying to write procedural code to solve the problem.
SQL, though, is a set-based language, so it's usually better to find a set-based solution rather than using loops.
In this instance, a calendar table would be a real help. Such a table contains one row for each date, and additional columns indicating useful information for your business (e.g. what you consider to be a working day). It then makes your query for working days look like:
SELECT COUNT(*) from Calendar
where BaseDate >= @StartDate and BaseDate < @EndDate and IsWorkingDay = 1
Populating the Calendar table becomes a one off exercise, and you can populate it with e.g. 30 years worth of dates easily.
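For illustration, that one-off population can itself be done without a loop. A minimal sketch, assuming a Calendar table with BaseDate and IsWorkingDay columns, and treating only Saturday and Sunday as non-working (holidays would be updated separately):
;WITH Dates AS (
    SELECT CAST('2000-01-01' AS DATETIME) AS d
    UNION ALL
    SELECT DATEADD(d, 1, d) FROM Dates WHERE d < '2029-12-31'
)
INSERT INTO Calendar (BaseDate, IsWorkingDay)
SELECT d,
       -- DATENAME is language-dependent; assumes an English session language
       CASE WHEN DATENAME(WEEKDAY, d) IN ('Saturday', 'Sunday') THEN 0 ELSE 1 END
FROM Dates
OPTION (MAXRECURSION 0); -- 30 years of rows exceeds the default recursion limit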
Using any loop within SQL Server is never a good idea :)
There are better solutions; see, for example, the ones already presented on Stack Overflow.

de-duplicating rows in a sql server 2005 table

I have a table with ~17 million rows in it. I need to de-duplicate the rows in the table. Under normal circumstances this wouldn't be a challenge; however, this isn't a normal circumstance. Normally, 'duplicate rows' is defined as two or more rows containing the exact same values for all columns. In this case, 'duplicate rows' is defined as two or more rows that have the exact same values, but are also within 20 seconds of each other. I wrote a script that is still running after 19.5 hours; this isn't acceptable, but I'm not sure how else to do it. Here's the script:
begin
    create table ##dupes (ID int)
    declare curOriginals cursor for
    select ID, AssociatedEntityID, AssociatedEntityType, [Timestamp] from tblTable
    declare @ID int
    declare @AssocEntity int
    declare @AssocType int
    declare @Timestamp datetime
    declare @Count int
    open curOriginals
    fetch next from curOriginals into @ID, @AssocEntity, @AssocType, @Timestamp
    while @@FETCH_STATUS = 0
    begin
        select @Count = COUNT(*) from tblTable where AssociatedEntityID = @AssocEntity and AssociatedEntityType = @AssocType
            and [Timestamp] >= DATEADD(ss, -20, @Timestamp)
            and [Timestamp] <= DATEADD(ss, 20, @Timestamp)
            and ID <> @ID
        if (@Count > 0)
        begin
            insert into ##dupes (ID)
            (select ID from tblHBMLog where AssociatedEntityID = @AssocEntity and AssociatedEntityType = @AssocType
                and [Timestamp] >= DATEADD(ss, -20, @Timestamp)
                and [Timestamp] <= DATEADD(ss, 20, @Timestamp)
                and ID <> @ID)
            print @ID
        end
        delete from tblHBMLog where ID = @ID or ID in (select ID from ##dupes)
        fetch next from curOriginals into @ID, @AssocEntity, @AssocType, @Timestamp
    end
    close curOriginals
    deallocate curOriginals
    select * from ##dupes
    drop table ##dupes
end
Any help would be greatly appreciated.
A quick tweak that should gain some speed would be to replace the nasty COUNT section with some EXISTS stuff:
IF EXISTS(SELECT 1 FROM tblTable WHERE AssociatedEntityID = @AssocEntity
          AND AssociatedEntityType = @AssocType AND [Timestamp] >= DATEADD(ss, -20, @Timestamp)
          AND [Timestamp] <= DATEADD(ss, 20, @Timestamp)
          AND ID <> @ID) -- if there are any matching rows...
BEGIN
    DELETE FROM tblHBMLog
    OUTPUT deleted.ID INTO ##dupes
    WHERE AssociatedEntityID = @AssocEntity AND AssociatedEntityType = @AssocType
          AND [Timestamp] >= DATEADD(ss, -20, @Timestamp)
          AND [Timestamp] <= DATEADD(ss, 20, @Timestamp) -- I think this is supposed to be within the block, not outside it
END
I've also now replaced the double references of ##dupes with the OUTPUT clause which will mean you're not scanning a growing ##dupes every time you delete a row. As far as the deletion goes, as you're deleting the ID and its matches in one go you don't need such an elaborate deletion clause. You've already checked that there are entries that need removing, and you seem to want to remove all the entries including the original.
Once you answer Paul's question, we can take a look at completely removing the cursor.
Basically, I agree with Bob.
First of all, you have way too many things being done in your code that are repeated 17 million times.
Second, you could crop your set down to the absolute duplicates.
Third, it would be nicer if you had enough memory (which you should) and tried to solve this in your programming language of choice.
Anyway, for the sake of a hardcoded answer, and because your query might still be running, I will try to give a working script which I think (?) does what you want.
First of all you should have an Index.
I would recommend an index on the AssociatedEntityID field.
If you already have one, but your table has been populated with lots of data after you created the index, then drop it and recreate it, in order to have fresh statistics.
Then see the script below, which does the following:
- dumps all duplicates into ##dupes, ignoring the 20-secs rule
- sorts them (by AssociatedEntityID, Timestamp) and starts the simplest, most straightforward loop it can do
- checks for a duplicate AssociatedEntityID and a timestamp inside the 20-sec interval
- if all true, inserts the id into the ##dupes_to_be_deleted table
There is the assumption that if you have a set of more than two duplicates, in sequence, then the script eliminates every duplicate in the range of 20 secs from the first one. Then, from the next remaining, if any, it resets and goes for another 20 secs, and so on...
Here is the script; it may be useful to you, though I did not have the time to test it:
CREATE TABLE ##dupes
    (
      ID INT ,
      AssociatedEntityID INT ,
      [Timestamp] DATETIME
    )
CREATE TABLE ##dupes_to_be_deleted
    (
      ID INT
    )
-- collect all dupes, ignoring for now the rule of 20 secs
INSERT INTO ##dupes
SELECT ID ,
       AssociatedEntityID ,
       [Timestamp]
FROM tblTable
WHERE AssociatedEntityID IN
      ( SELECT AssociatedEntityID
        FROM tblTable
        GROUP BY AssociatedEntityID
        HAVING COUNT(*) > 1
      )
-- then sort and loop on all of them using a cursor
DECLARE c CURSOR FOR
SELECT ID ,
       AssociatedEntityID ,
       [Timestamp]
FROM ##dupes
ORDER BY AssociatedEntityID ,
         [Timestamp]
-- declarations
DECLARE @id INT,
        @AssociatedEntityID INT,
        @ts DATETIME,
        @old_AssociatedEntityID INT,
        @old_ts DATETIME
-- initialisation
SELECT @old_AssociatedEntityID = 0,
       @old_ts = '1900-01-01'
-- start loop
OPEN c
FETCH NEXT FROM c INTO @id, @AssociatedEntityID, @ts
WHILE @@FETCH_STATUS = 0
BEGIN
    -- check for dupe AssociatedEntityID
    IF @AssociatedEntityID = @old_AssociatedEntityID
    BEGIN
        -- check for time interval
        IF @ts <= DATEADD(ss, 20, @old_ts)
        BEGIN
            -- yes! it is a duplicate: store it in ##dupes_to_be_deleted
            INSERT INTO ##dupes_to_be_deleted ( ID )
            VALUES ( @id )
        END
        ELSE
        BEGIN
            -- IS THIS OK?:
            -- put last timestamp for comparison with the next timestamp,
            -- but only if the previous one is not going to be deleted.
            -- This way we delete all duplicates 20 secs away from the first
            -- of the set of duplicates, and the next one remaining will be
            -- a duplicate, but after the 20-sec interval. And so on...
            SET @old_ts = @ts
        END
    END
    ELSE
    BEGIN
        -- new AssociatedEntityID group: reset the comparison timestamp,
        -- otherwise the old timestamp from the previous group leaks through
        SET @old_ts = @ts
    END
    -- prepare vars for next iteration
    SELECT @old_AssociatedEntityID = @AssociatedEntityID
    FETCH NEXT FROM c INTO @id, @AssociatedEntityID, @ts
END
CLOSE c
DEALLOCATE c
-- now ##dupes_to_be_deleted holds all the ids that are duplicates within
-- the 20-sec interval of the first duplicate of each set
DELETE FROM <wherever> -- replace <wherever> with tblHBMLog?
WHERE ID IN
      ( SELECT ID
        FROM ##dupes_to_be_deleted
      )
DROP TABLE ##dupes_to_be_deleted
DROP TABLE ##dupes
Give it a try and leave it running for a couple of hours. Hope it helps.
If you have enough memory and storage, it may be faster this way:
1. Create a new table with a similar structure
2. Copy all data with a SELECT DISTINCT into this temp table
3. Clear the original table (you should drop some constraints before this)
4. Copy the data back to the original table
Instead of steps 3 and 4 you can drop the original table and rename the temp table.
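A minimal sketch of that approach with a hypothetical staging-table name; note that SELECT DISTINCT removes only exact duplicates and does not apply the 20-second rule:
SELECT DISTINCT *
INTO tblHBMLog_dedup -- hypothetical staging table
FROM tblHBMLog;
TRUNCATE TABLE tblHBMLog; -- constraints permitting
-- if ID is an identity column, SET IDENTITY_INSERT tblHBMLog ON
-- and list the columns explicitly before copying back
INSERT INTO tblHBMLog
SELECT * FROM tblHBMLog_dedup;
DROP TABLE tblHBMLog_dedup;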
Putting the time differentiator aside, the first thing I would do is knock this list down to a much smaller subset of potential duplicates. For example, if you have 17 million rows, but only, say, 10 million have every field matching but the time, then you have just chopped a large portion of your processing off.
To do this I'd just whip up a query to dump the unique ID's of the potential duplicates into a temp table, then use this as an inner join on your cursor (again, this would be a first step).
In looking at the cursor, I see a lot of relatively heavy function calls which would explain your slowdowns. There's also a lot of data activity, and I would not be surprised if you weren't being crushed by an I/O bottleneck.
One thing you could do then is rather than use the cursor, dump it into your programming language of choice. Assuming we've already limited all of our fields except for the timestamp down to a manageable set, grab each subset in turn (i.e. ones that match the remaining fields), since any dups would necessarily have all of their other fields matched. Then just snuff out the duplicates you find in these smaller atomic subsets.
So assuming you have 10 million potentials, and each time range has about 20 records or so that need to be worked through with the date logic, you're down to a much smaller number of database calls and some quick code - and from experience, knocking out the datetime comparisons, etc. outside of SQL is generally a lot faster.
The bottom line is to figure out ways to partition your data down into manageable subsets as quickly as possible.
Hope that helps!
-Bob
In answer to Paul's question:
What happens when you have three entries, a, b, c: a = 00 secs, b = 19 secs, c = 39 secs. Are these all considered to be the same time? (a is within 20 secs of b, b is within 20 secs of c)
If the other comparisons are equal (AssociatedEntityID and AssociatedEntityType) then yes, they are considered the same thing; otherwise no.
I would add this to the original question, except that I used a different account to post it and now can't remember my password. It was a very old account, and I didn't realize that I had connected to the site with it.
I have been working with some of the answers you guys have given me, and there is one problem: you're using only one key column (AssociatedEntityID) when there are two (AssociatedEntityID and AssociatedEntityType). Your suggestions would work great for a single key column.
What I have done so far is:
Step 1: Determine which AssociatedEntityID and AssociatedEntityType pairs have duplicates and insert them into a temp table:
create table ##stage1 (ID int, AssociatedEntityID int, AssociatedEntityType int, [Timestamp] datetime)
insert into ##stage1 (AssociatedEntityID, AssociatedEntityType)
(select AssociatedEntityID, AssociatedEntityType from tblHBMLog group by AssociatedEntityID, AssociatedEntityType having COUNT(*) > 1)
Step 2: Retrieve the ID of the earliest occurring row with a given AssociatedEntityID and AssociatedEntityType pair:
declare @ID int, @Timestamp datetime, @AssocEntity int, @AssocType int
declare curStage1 cursor for
select AssociatedEntityID, AssociatedEntityType from ##stage1
open curStage1
fetch next from curStage1 into @AssocEntity, @AssocType
while @@FETCH_STATUS = 0
begin
    select top 1 @ID = ID, @Timestamp = [Timestamp] from tblHBMLog where AssociatedEntityID = @AssocEntity and AssociatedEntityType = @AssocType order by [Timestamp] asc
    update ##stage1 set ID = @ID, [Timestamp] = @Timestamp where AssociatedEntityID = @AssocEntity and AssociatedEntityType = @AssocType
    fetch next from curStage1 into @AssocEntity, @AssocType -- without this the loop never advances
end
close curStage1
deallocate curStage1
And this is where things slow down again. Now, granted, the result set has been pared down from ~17 million to just under 400,000, but it is still taking quite a long time to run through.
I guess another question that I should ask is this; If I continue to write this in SQL will it just have to take quite a long time? Should I write this in C# instead? Or am I just stupid and not seeing the forest for the trees of this solution?
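For what it's worth, Step 2 can be done without a cursor at all. A sketch using CROSS APPLY (available from SQL Server 2005), assuming the same tables and columns as above:
UPDATE s
SET s.ID = x.ID,
    s.[Timestamp] = x.[Timestamp]
FROM ##stage1 AS s
CROSS APPLY ( SELECT TOP 1 t.ID, t.[Timestamp]
              FROM tblHBMLog AS t
              WHERE t.AssociatedEntityID = s.AssociatedEntityID
                AND t.AssociatedEntityType = s.AssociatedEntityType
              ORDER BY t.[Timestamp] ASC
            ) AS x;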
Well, after much stomping of feet and gnashing of teeth, I have come up with a solution. It's just a simple, quick and dirty C# command line app, but it's faster than the sql script and it does the job.
I thank you all for your help, in the end the sql script was just taking too much time to execute and C# is much better suited for looping.
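For readers on SQL Server 2012 or later (the question targets 2005, where this is unavailable): the chained 20-second rule can be expressed set-based with LAG, partitioning by the two key columns named above. An untested sketch that keeps the first row of each chain and deletes the rest:
;WITH ordered AS (
    SELECT ID,
           -- seconds since the previous row with the same keys (NULL for the first row)
           DATEDIFF(ss,
                    LAG([Timestamp]) OVER (PARTITION BY AssociatedEntityID, AssociatedEntityType
                                           ORDER BY [Timestamp]),
                    [Timestamp]) AS gap_secs
    FROM tblHBMLog
)
DELETE FROM tblHBMLog
WHERE ID IN (SELECT ID FROM ordered WHERE gap_secs <= 20);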

SQL Server 2000 (creating all dates given a daterange)

I was wondering if there is a way in SQL Server 2000 to create all dates given a start and end date as a result set. I know I can achieve this with T-SQL looping; I am looking for a non-looping solution. In 2005 you can use the recursive WITH clause, and the solution could also join against a T table that has numbers in it. Again, I am looking for a SQL Server 2000 solution that avoids looping (a numbers table is fine). Is there one?
SELECT
    DATEADD(dy, T.number, @start_date)
FROM
    T
WHERE
    number BETWEEN 0 AND DATEDIFF(dy, @start_date, @end_date)
A Calendar table can also be useful for these kind of queries and you can add some date-specific information to it, such as whether a day is a holiday, counts as a "business" day, etc.
try this:
create a numbers table; you only need to do this one time in your DB:
CREATE TABLE Numbers (Number int NOT NULL)
GO
ALTER TABLE Numbers ADD CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (Number)
GO
DECLARE @x int
SET @x = 0
WHILE @x < 8000
BEGIN
    SET @x = @x + 1
    INSERT INTO Numbers VALUES (@x)
END
--run your query:
DECLARE @StartDate datetime
DECLARE @EndDate datetime
SET @StartDate = '05/03/2009'
SET @EndDate = '05/12/2009'
SELECT
    @StartDate + Number - 1
FROM Numbers
WHERE Number <= DATEDIFF(day, @StartDate, @EndDate) + 1
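If you would rather avoid the loop even for the one-off fill, a set-based alternative that works on SQL Server 2000 is to cross join a derived digits table against itself; this sketch inserts the numbers 1 through 10000:
-- Four crossed copies of the digits 0-9 yield 10,000 distinct numbers without a WHILE loop.
INSERT INTO Numbers (Number)
SELECT d1.d + d2.d * 10 + d3.d * 100 + d4.d * 1000 + 1
FROM (SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
      UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) AS d1
CROSS JOIN (SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
      UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) AS d2
CROSS JOIN (SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
      UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) AS d3
CROSS JOIN (SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
      UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) AS d4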