How to keep a rolling checksum in SQL?

I am trying to keep a rolling checksum to account for order, so take the previous 'checksum' and xor it with the current one and generate a new checksum.
Name   Checksum    Rolling Checksum
------ ----------- ----------------------------------------
foo    11829231    11829231
bar    27380135    checksum(27380135 ^ 11829231) = 93291803
baz    96326587    checksum(96326587 ^ 93291803) = 67361090
How would I accomplish something like this?
(Note that the calculations are completely made up and are for illustration only)

This is basically the running total problem.
Edit:
My original claim was that this is one of the few places where a cursor-based solution actually performs best. The problem with the triangular self join solution is that it repeatedly recalculates the same cumulative checksum as a subcalculation for the next step, so it is not very scalable: the work required grows quadratically with the number of rows.
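For illustration, a triangular self join for a plain running total might look like the sketch below. The table variable @t and its columns are made up for this example (they are not part of the test setup further down), and the multi-row VALUES syntax assumes SQL Server 2008 or later. For each outer row the subquery re-reads every earlier row, so n rows cost 1 + 2 + ... + n = n(n + 1)/2 inner row reads.

```sql
-- Hypothetical three-row table, used only to illustrate the pattern.
DECLARE @t TABLE (PK int PRIMARY KEY, Val int NOT NULL);
INSERT INTO @t (PK, Val) VALUES (1, 10), (2, 20), (3, 30);

-- Triangular join: the correlated subquery re-sums all rows up to the
-- current one, so earlier rows are read again for every later row.
SELECT o.PK,
       o.Val,
       (SELECT SUM(i.Val) FROM @t AS i WHERE i.PK <= o.PK) AS RunningTotal
FROM @t AS o
ORDER BY o.PK;
-- RunningTotal per row: 10, 30, 60
```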
Corina's answer uses the "quirky update" approach. I've adjusted it to do the checksum and in my test found that it took 3 seconds rather than 26 seconds for the cursor solution. Both produced the same results. Unfortunately, however, it relies on an undocumented aspect of UPDATE behaviour. I would definitely read the discussion here before deciding whether to rely on this in production code.
There is a third possibility described here (using the CLR) which I didn't have time to test. But from the discussion here it seems to be a good possibility for calculating running-total type things at display time, but outperformed by the cursor when the result of the calculation must be saved back.
CREATE TABLE TestTable
(
PK int identity(1,1) primary key clustered,
[Name] varchar(50),
[CheckSum] AS CHECKSUM([Name]),
RollingCheckSum1 int NULL,
RollingCheckSum2 int NULL
)
/*Insert some random records (753,571 on my machine)*/
INSERT INTO TestTable ([Name])
SELECT newid() FROM sys.objects s1, sys.objects s2, sys.objects s3
Approach One: Based on the Jeff Moden Article
DECLARE @RCS int
UPDATE TestTable
SET @RCS = RollingCheckSum1 =
        CASE WHEN @RCS IS NULL THEN
            [CheckSum]
        ELSE
            CHECKSUM([CheckSum] ^ @RCS)
        END
FROM TestTable WITH (TABLOCKX)
OPTION (MAXDOP 1)
Approach Two - Using the same cursor options as Hugo Kornelis advocates in the discussion for that article.
SET NOCOUNT ON
BEGIN TRAN
DECLARE @RCS2 INT
DECLARE @PK INT, @CheckSum INT
DECLARE curRollingCheckSum CURSOR LOCAL STATIC READ_ONLY
FOR
SELECT PK, [CheckSum]
FROM TestTable
ORDER BY PK
OPEN curRollingCheckSum
FETCH NEXT FROM curRollingCheckSum
INTO @PK, @CheckSum
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @RCS2 = CASE WHEN @RCS2 IS NULL THEN @CheckSum ELSE CHECKSUM(@CheckSum ^ @RCS2) END
    UPDATE dbo.TestTable
    SET RollingCheckSum2 = @RCS2
    WHERE @PK = PK
    FETCH NEXT FROM curRollingCheckSum
    INTO @PK, @CheckSum
END
COMMIT
Test they are the same
SELECT * FROM TestTable
WHERE RollingCheckSum1<> RollingCheckSum2
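A further option, not benchmarked here, is a recursive CTE that carries the previous rolling value forward one row at a time. The sketch below is only an illustration under stated assumptions: it uses a small hypothetical table variable @T rather than TestTable, and it assumes the PK values are contiguous and start at 1, which an identity column does not guarantee in general. Since the recursion handles one row per step, it is unlikely to compete with the approaches above on large tables, but it does not depend on undocumented UPDATE behaviour.

```sql
-- Hypothetical sample rows; PK is assumed to be contiguous from 1.
DECLARE @T TABLE (PK int PRIMARY KEY, Name varchar(50) NOT NULL);
INSERT INTO @T (PK, Name) VALUES (1, 'foo'), (2, 'bar'), (3, 'baz');

WITH Rolling AS
(
    -- Anchor: the first row's rolling checksum is its own checksum.
    SELECT PK, Name, CHECKSUM(Name) AS CS, CHECKSUM(Name) AS RollingCS
    FROM @T
    WHERE PK = 1
    UNION ALL
    -- Recursive step: combine this row's checksum with the previous rolling value.
    SELECT t.PK, t.Name, CHECKSUM(t.Name), CHECKSUM(CHECKSUM(t.Name) ^ r.RollingCS)
    FROM @T AS t
    INNER JOIN Rolling AS r ON t.PK = r.PK + 1
)
SELECT PK, Name, CS, RollingCS
FROM Rolling
ORDER BY PK
OPTION (MAXRECURSION 0);  -- lift the default limit of 100 recursion levels
```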

I'm not sure about a rolling checksum, but for a rolling sum for instance, you can do this using the UPDATE command:
declare @a table (name varchar(2), value int, rollingvalue int)
insert into @a
select 'a', 1, 0 union all select 'b', 2, 0 union all select 'c', 3, 0
select * from @a
declare @sum int
set @sum = 0
update @a
set @sum = rollingvalue = value + @sum
select * from @a

Select T.Name, T.[Checksum]
    , (Select Checksum_Agg(T1.[Checksum])
        From [Table] As T1
        Where T1.Name <= T.Name) As RollingChecksum
From [Table] As T
Order By T.Name
To do a rolling anything, you need some semblance of an order to the rows. That can be by name, an integer key, a date or whatever. In my example, I used name (even though the order in your sample data isn't alphabetical). In addition, I'm using the Checksum_Agg function in SQL Server.
In addition, you would ideally have a unique value on which you compare the inner and outer query. E.g., Where T1.PK <= T.PK for an integer key or even a string key would work well. In my solution, if Name had a unique constraint, it would also work well enough.

Related

How to loop with a table values when column value is not incremental in SQL Server

I need to insert the values from table T1 into another table T2. T1 is populated by truncate-and-load, so any values can appear after a load. How can I use a loop to insert the data into T2? It should happen automatically, with no manual intervention required, so I can't use a table-valued parameter.
Suppose table1 has column Id
Id
---
4
7
15
I have to insert the data into table 2.
I have used this code:
DECLARE @counter INT = (SELECT MIN(CAST(ID AS INT)) FROM Table1);
WHILE @counter <= (SELECT COUNT(CAST(ID AS INT)) FROM Table1)
BEGIN
    INSERT INTO TABLE2 (ID)
    VALUES (@counter)
    SET @counter = (SELECT ID FROM table1 WHERE @counter = ID)
END
How do I set the counter, or pick up the next value from Table1, given that the Id values can be different every time?
Please help.
In the absence of any further detail, it seems you could simply rewrite your query as the below:
INSERT INTO Table2 (ID)
SELECT ID
FROM Table1;
There is no need for a loop (WHILE/CURSOR) for what you have here. SQL is a query language, and it excels at set-based operations. What SQL isn't good at is iterative ones, and whenever a CURSOR or WHILE is used, I would suggest it is almost always being misused; this certainly appears to be one of those times. A WHILE or CURSOR, for an even slightly larger dataset, would be significantly slower than the simple statement above, probably by a factor of thousands.
I'm not sure of your logic, and it's almost always better to use set-based solutions, but here is a T-SQL loop solution. I left out the casting, which you may have to add:
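declare @curid int
declare @previd int
select @curid = min([ID]) from Table1;
-- min() yields NULL once no larger ID remains, which ends the loop
-- (@@rowcount is always 1 after an aggregate assignment, so it cannot be used here)
while @curid is not null
begin
    INSERT INTO TABLE2 (ID) VALUES (@curid)
    set @previd = @curid
    select @curid = min([ID])
    from Table1
    where [ID] > @previd;
end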

SQL Server 2008 Is there a more efficient way to do this update loop?

First posted question, I apologize in advance for any blunders.
The table contains records that are assigned to a team, the initial assignments are done with another process. Frequently, we have to reassign an agent's records and spread them out equally to the rest of the team. We have been doing this by hand, one by one, which was cumbersome. So I came up with this solution:
DECLARE @UpdtAgt TABLE (ID INT, Name varchar(25))
INSERT INTO @UpdtAgt
VALUES (1, 'Gandalf')
,(2,'Hank')
,(3,'Icarus')
CREATE TABLE #UpdtQry (TblID varchar(25))
INSERT INTO #UpdtQry
SELECT ShtID
FROM TestUpdate
DECLARE @RowID INT
DECLARE @AgtID INT
DECLARE @Agt varchar(25)
DECLARE @MaxID INT
SET @MaxID = (SELECT COUNT(*) FROM @UpdtAgt)
SET @AgtID = 1
--WHILE ((SELECT COUNT(*) FROM #UpdtQry) > 0)
WHILE EXISTS (SELECT TblID FROM #UpdtQry)
BEGIN
    SET @RowID = (SELECT TOP 1 TblID FROM #UpdtQry)
    SET @Agt = (SELECT Name FROM @UpdtAgt WHERE ID = @AgtID)
    UPDATE TestUpdate
    SET Assignment = @Agt
    WHERE ShtID = @RowID
    DELETE #UpdtQry WHERE TblID = @RowID
    IF @AgtID < @MaxID
        SET @AgtID = @AgtID + 1
    ELSE
        SET @AgtID = 1
END
DROP TABLE #UpdtQry
This is really my first attempt at doing something this in-depth. An update of 100 rows takes about 30 seconds to do. The UPDATE table, TestUpdate, has only the CLUSTERED index. How can I make this more efficient?
EDIT: I didn't define the @UpdtAgt and #UpdtQry tables very well in my explanation. @UpdtAgt will hold the agents that are being assigned the records, and will likely change each time this is used. #UpdtQry will have a WHERE clause to define which agent's records will be reassigned; again, this will change with each use. I hope that makes this a little more clear. Again, apologies for not getting it right the first time.
EDIT 2: I commented out the old WHILE clause and inserted the one that HABO suggested. Thank you again HABO.
I think this is what you're looking for:
DECLARE @UpdtAgt TABLE
(
    ID INT,
    Name VARCHAR(25)
)
INSERT @UpdtAgt
VALUES (1, 'Gandalf')
,(2, 'Hank')
,(3, 'Icarus')
UPDATE t
SET t.Assignment = a.Name
FROM TestUpdate AS t
INNER JOIN @UpdtAgt AS a
    ON t.ShtID = a.ID
That should do all 4 rows at once.
P.S...
If you do create tables like in your original post in future, please try and keep the naming of your columns and variables consistent with their purpose!
In your example you used ID, AgtID, and ShtID and (most confusingly) TblID (and I think they're all the same thing? [please correct me if I'm wrong!]). If you called it AgtID everywhere (and @AgtID for the variable [there's no real need for @RowID]) then it would be much easier to see at a glance what's going on! The same thing goes for Assignment and Name.
Because this is your first attempt at something like this, I want to congratulate you on something that works. While it is not ideal (and what is?) it meets the main goal: it works. There is a better way to do this using something known as a cursor. I remind myself of the proper syntax using Microsoft's documentation page on cursors.
Having said that, the code at the end of this post shows my quick solution to your situation. Notice the following:
The @TestUpdate table is defined so that the query will run in MSSQL without using permanent tables.
Only the @UpdtAgt table needs to be set up as a temporary table. However, if this is used regularly, it would be best to make it a permanent table.
The CLOSE and DEALLOCATE statements at the end are IMPORTANT - forgetting these will have rather unpleasant consequences.
DECLARE @TestUpdate TABLE (ShtID int, Assignment varchar(25))
INSERT INTO @TestUpdate
VALUES (1,'Fred')
,(2,'Barney')
,(3,'Fred')
,(4,'Wilma')
,(5,'Betty'),(6,'Leopold'),(7,'Frank'),(8,'Fred')
DECLARE @UpdtAgt TABLE (ID INT, Name varchar(25))
INSERT INTO @UpdtAgt
VALUES (1, 'Gandalf')
,(2,'Hank')
,(3,'Icarus')
DECLARE @recid int
DECLARE @AgtID int SET @AgtID=0
DECLARE @MaxID int SET @MaxID = (SELECT COUNT(*) FROM @UpdtAgt)
DECLARE assignment_cursor CURSOR
FOR SELECT ShtID FROM @TestUpdate
OPEN assignment_cursor
FETCH NEXT FROM assignment_cursor
INTO @recid
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @AgtID = @AgtID + 1
    IF @AgtID > @MaxID SET @AgtID = 1
    UPDATE @TestUpdate
    SET Assignment = (SELECT TOP 1 Name FROM @UpdtAgt WHERE ID=@AgtID)
    FROM @TestUpdate TU
    WHERE ShtID=@recid
    FETCH NEXT FROM assignment_cursor INTO @recid
END
CLOSE assignment_cursor
DEALLOCATE assignment_cursor
SELECT * FROM @TestUpdate
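The same round-robin distribution can probably also be written without a loop, by numbering both the records and the agents and matching them with a modulo. The following is a sketch under assumptions, not a drop-in replacement: the table variables hold made-up sample data, the CTE names Rec and Agt are invented for the example, and the records are dealt out in ShtID order. It should run on SQL Server 2008.

```sql
-- Hypothetical sample data standing in for the real TestUpdate table.
DECLARE @TestUpdate TABLE (ShtID int PRIMARY KEY, Assignment varchar(25));
INSERT INTO @TestUpdate (ShtID, Assignment)
VALUES (1, 'Fred'), (2, 'Barney'), (3, 'Fred'), (4, 'Wilma'), (5, 'Betty');

DECLARE @UpdtAgt TABLE (ID int PRIMARY KEY, Name varchar(25));
INSERT INTO @UpdtAgt (ID, Name)
VALUES (1, 'Gandalf'), (2, 'Hank'), (3, 'Icarus');

DECLARE @AgentCount int;
SELECT @AgentCount = COUNT(*) FROM @UpdtAgt;

WITH Rec AS
(
    -- Number the records to be reassigned: 1, 2, 3, ...
    SELECT ShtID, Assignment, ROW_NUMBER() OVER (ORDER BY ShtID) AS rn
    FROM @TestUpdate
),
Agt AS
(
    -- Number the agents 1..N so gaps in ID do not matter.
    SELECT Name, ROW_NUMBER() OVER (ORDER BY ID) AS rn
    FROM @UpdtAgt
)
UPDATE r
SET Assignment = a.Name
FROM Rec AS r
INNER JOIN Agt AS a
    ON a.rn = ((r.rn - 1) % @AgentCount) + 1;  -- wrap around after the last agent

SELECT ShtID, Assignment FROM @TestUpdate ORDER BY ShtID;
-- Gandalf, Hank, Icarus, Gandalf, Hank
```

Because the whole reassignment is one UPDATE, the per-row statement overhead of the WHILE loop and the cursor disappears; a WHERE clause inside the Rec CTE would restrict it to one agent's records.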

Arithmetic overflow on large table

I have a table with 5 billion rows in SQL Server 2014 (Developer Edition, x64, Windows 10 Pro x64):
CREATE TABLE TestTable
(
ID BIGINT IDENTITY(1,1),
PARENT_ID BIGINT NOT NULL,
CONSTRAINT PK_TestTable PRIMARY KEY CLUSTERED (ID)
);
CREATE NONCLUSTERED INDEX IX_TestTable_ParentId
ON TestTable (PARENT_ID);
I'm trying to apply the following patch:
-- Create non-nullable column with default (should be online operation in Enterprise/Developer edition)
ALTER TABLE TestTable
ADD ORDINAL TINYINT NOT NULL CONSTRAINT DF_TestTable_Ordinal DEFAULT 0;
GO
-- Populate column value for existing data
BEGIN
    SET NOCOUNT ON;
    DECLARE @BATCH_SIZE BIGINT = 1000000;
    DECLARE @COUNTER BIGINT = 0;
    DECLARE @ROW_ID BIGINT;
    DECLARE @ORDINAL BIGINT;
    DECLARE ROWS_C CURSOR
        LOCAL FORWARD_ONLY FAST_FORWARD READ_ONLY
    FOR
        SELECT
            ID AS ID,
            ROW_NUMBER() OVER (PARTITION BY PARENT_ID ORDER BY ID ASC) AS ORDINAL
        FROM
            TestTable;
    OPEN ROWS_C;
    FETCH NEXT FROM ROWS_C
    INTO @ROW_ID, @ORDINAL;
    BEGIN TRANSACTION;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        UPDATE TestTable
        SET
            ORDINAL = CAST(@ORDINAL AS TINYINT)
        WHERE
            ID = @ROW_ID;
        FETCH NEXT FROM ROWS_C
        INTO @ROW_ID, @ORDINAL;
        SET @COUNTER = @COUNTER + 1;
        IF @COUNTER = @BATCH_SIZE
        BEGIN
            COMMIT TRANSACTION;
            SET @COUNTER = 0;
            BEGIN TRANSACTION;
        END;
    END;
    COMMIT TRANSACTION;
    CLOSE ROWS_C;
    DEALLOCATE ROWS_C;
    SET NOCOUNT OFF;
END;
GO
-- Drop default constraint from the column
ALTER TABLE TestTable
DROP CONSTRAINT DF_TestTable_Ordinal;
GO
-- Drop IX_TestTable_ParentId index
DROP INDEX IX_TestTable_ParentId
ON TestTable;
GO
-- Create IX_TestTable_ParentId_Ordinal index
CREATE UNIQUE INDEX IX_TestTable_ParentId_Ordinal
ON TestTable (PARENT_ID, ORDINAL);
GO
The aim of patch is to add a column, called ORDINAL, which is an ordinal number of the record within the same parent (defined by PARENT_ID). The patch is run using SQLCMD.
The patch is done in this way for several reasons:
The table is too large to run a single UPDATE statement on it (it takes an enormous amount of time and space in the transaction log/tempdb).
Batch updates using a single UPDATE statement with TOP n rows are not simple to implement (if we update the table in, say, 1m-row batches, the 1000001st row may belong to the same PARENT_ID as the 1000000th, which will lead to a wrong ordinal number being assigned to the 1000001st record). In other words, the SELECT statement run in the cursor should be run once (without paging), or more complicated operations (joins/conditions) should be applied.
Adding a NULL column and changing it to NOT NULL later is not a good solution since I use SNAPSHOT isolation (a full table update will be performed on altering the column to be NOT NULL).
The patch works perfectly on a small database with a few million rows, but when applied to the one with billions of rows, I get:
Msg 3606, Level 16, State 2, Server XXX, Line 22
Arithmetic overflow occurred.
My first guess was ORDINAL value is too big to fit into TINYINT column, but this is not the case. I created a test database with similar structure and populated with data (more than 255 rows per parent). The error message I get is still arithmetic exception, but with different message code and different wording (explicitly saying it can't fit data into TINYINT).
Currently I have a couple of suspicions, but I haven't managed to find anything that could help me:
CURSOR is not able to handle more than MAX(INT32) rows.
SQLCMD imposed limitations.
Do you have any ideas on what the problem could be?
How about using a While loop but making sure that you keep the same parent_ids together:
DECLARE @SegmentSize BIGINT = 1000000
DECLARE @CurrentSegment BIGINT = 0
WHILE 1 = 1
BEGIN
    ;WITH UpdateData AS
    (
        SELECT ID AS ID,
            ROW_NUMBER() OVER (PARTITION BY PARENT_ID ORDER BY ID ASC) AS ORDINAL
        FROM TestData
        WHERE ID > @CurrentSegment AND ID <= (@CurrentSegment + @SegmentSize)
    )
    UPDATE TestData
    SET Ordinal = UpdateData.Ordinal
    FROM TestData
    INNER JOIN UpdateData ON TestData.Id = UpdateData.Id
    IF @@ROWCOUNT = 0
    BEGIN
        BREAK
    END
    SET @CurrentSegment = @CurrentSegment + @SegmentSize
END
EDIT - Amended to segment on Parent_Id as per request. This should be reasonably quick as Parent_id is indexed (I added OPTION (RECOMPILE) to ensure that the actual value is used for the lookup).
Because you are not updating the whole table at once, this will limit the transaction log growth!
DECLARE @SegmentSize BIGINT = 1000000
DECLARE @CurrentSegment BIGINT = 0
WHILE 1 = 1
BEGIN
    ;WITH UpdateData AS
    (
        SELECT ID AS ID,
            ROW_NUMBER() OVER (PARTITION BY PARENT_ID ORDER BY ID ASC) AS ORDINAL
        FROM TestData
        WHERE Parent_ID > @CurrentSegment AND
            Parent_ID <= (@CurrentSegment + @SegmentSize)
    )
    UPDATE TestData
    SET Ordinal = UpdateData.Ordinal
    FROM TestData
    INNER JOIN UpdateData ON TestData.Id = UpdateData.Id
    OPTION (RECOMPILE)
    IF @@ROWCOUNT = 0
    BEGIN
        BREAK
    END
    SET @CurrentSegment = @CurrentSegment + @SegmentSize
END

Generating the Next Id when Id is non-AutoNumber

I have a table called Employee. The EmpId column serves as the primary key. In my scenario, I cannot make it AutoNumber.
What would be the best way of generating the next EmpId for the new row that I want to insert in the table?
I am using SQL Server 2008 with C#.
Here is the code that I am currently using, but only to generate IDs for key-value pair tables or link tables (m*n relations):
Create PROCEDURE [dbo].[mSP_GetNEXTID]
    @NEXTID int out,
    @TABLENAME varchar(100),
    @UPDATE CHAR(1) = NULL
AS
BEGIN
    DECLARE @QUERY VARCHAR(500)
    BEGIN
        IF EXISTS (SELECT LASTID FROM LASTIDS WHERE TABLENAME = @TABLENAME and active=1)
        BEGIN
            SELECT @NEXTID = LASTID FROM LASTIDS WHERE TABLENAME = @TABLENAME and active=1
            IF(@UPDATE IS NULL OR @UPDATE = '')
            BEGIN
                UPDATE LASTIDS
                SET LASTID = LASTID + 1
                WHERE TABLENAME = @TABLENAME
                and active=1
            END
        END
        ELSE
        BEGIN
            SET @NEXTID = 1
            INSERT INTO LASTIDS(LASTID,TABLENAME, ACTIVE)
            VALUES(@NEXTID+1,@TABLENAME, 1)
        END
    END
END
Using MAX(id) + 1 is a bad idea both performance and concurrency wise.
Instead you should resort to sequences, which were designed specifically for this kind of problem.
CREATE SEQUENCE EmpIdSeq AS bigint
START WITH 1
INCREMENT BY 1;
And to generate the next id use:
SELECT NEXT VALUE FOR EmpIdSeq;
You can use the generated value in an insert statement:
INSERT Emp (EmpId, X, Y)
VALUES (NEXT VALUE FOR EmpIdSeq, 'x', 'y');
And even use it as default for your column:
CREATE TABLE Emp
(
EmpId bigint PRIMARY KEY CLUSTERED
DEFAULT (NEXT VALUE FOR EmpIdSeq),
X nvarchar(255) NULL,
Y nvarchar(255) NULL
);
Update: The above solution is only applicable to SQL Server 2012+. For older versions you can simulate the sequence behavior using dummy tables with identity fields:
CREATE TABLE EmpIdSeq (
SeqID bigint IDENTITY PRIMARY KEY CLUSTERED
);
And a procedure that emulates NEXT VALUE:
CREATE PROCEDURE GetNewSeqVal_Emp
    @NewSeqVal bigint OUTPUT
AS
BEGIN
    SET NOCOUNT ON
    INSERT EmpIdSeq DEFAULT VALUES
    SET @NewSeqVal = scope_identity()
    DELETE FROM EmpIdSeq WITH (READPAST)
END;
Usage example:
DECLARE @NewSeqVal bigint
EXEC GetNewSeqVal_Emp @NewSeqVal OUTPUT
The performance overhead of deleting the last inserted element will be minimal; still, as pointed out by the original author, you can optionally remove the delete statement and schedule a maintenance job to delete the table contents off-hour (trading space for performance).
Adapted from SQL Server Customer Advisory Team Blog.
The query
select max(empid) + 1 from employee
is the way to get the next number, but if multiple users are inserting into the database, context switching might cause two users to read the same value for empid, each add 1 to it, and end up with duplicate IDs. If you do have multiple users, you may have to lock the table while inserting. This is not best practice, and it is why auto-increment exists for database tables.
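If the next ID is kept in a counter table, as the LASTIDS table in the question does, the read and the increment can be combined in a single UPDATE, so there is no gap between them in which a second session could read the same value. A minimal sketch, using a hypothetical table variable @LastIds in place of a real table (with a table variable there is of course no concurrency; it only shows the statement shape):

```sql
-- Hypothetical counter table mirroring the LASTIDS idea from the question.
DECLARE @LastIds TABLE (TableName varchar(100) PRIMARY KEY, LastId int NOT NULL);
INSERT INTO @LastIds (TableName, LastId) VALUES ('Employee', 41);

DECLARE @NextId int;

-- "SET @variable = column = expression" assigns the new column value
-- to the variable as part of the same UPDATE statement.
UPDATE @LastIds
SET @NextId = LastId = LastId + 1
WHERE TableName = 'Employee';

SELECT @NextId AS NextId;  -- 42
```

Against a real table the UPDATE holds an exclusive lock on the counter row until its transaction ends, so concurrent callers are serialised on that row rather than on the whole Employee table.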
I hope this works for you, assuming that your ID field is an integer:
INSERT INTO [Table] WITH (TABLOCK)
SELECT CASE WHEN MAX(ID) IS NULL
            THEN 1 ELSE MAX(ID)+1 END, VALUE_1, VALUE_2....
FROM [Table]
Try the following query:
INSERT INTO [Table] VALUES
((SELECT ISNULL(MAX(ID),0)+1 FROM [Table]), VALUE_1, VALUE_2....)
You have to apply ISNULL to the MAX value; otherwise the result will be NULL when the table contains no rows.

Why my T-SQL (WHILE) does not work?

In my code, I need to find the value closest to 0 for which no row exists (the column can hold numbers from 0 to 50), so I tried the code below.
It should start from 0 and test the query for each value. When @Result becomes null, it should return. However, it does not work; it still prints 0.
declare @hold int
declare @Result int
set @hold=0
set @Result=0
WHILE (@Result!=null)
BEGIN
    select @Result=(SELECT Hold from Numbers WHERE Name='Test' AND Hold=@hold)
    set @hold=@hold+1
END
print @hold
First, you can't test equality with NULL. NULL means an unknown value, so you don't know whether or not it equals any specific value. Instead of @Result != NULL, use @Result IS NOT NULL.
Second, don't use this kind of sequential processing in SQL if you can at all help it. SQL is made to handle sets, not process things sequentially. You could do all of this work with one simple SQL command and it will most likely run faster anyway:
SELECT
MIN(hold) + 1
FROM
Numbers N1
WHERE
N1.name = 'Test' AND
NOT EXISTS
(
SELECT
*
FROM
Numbers N2
WHERE
N2.name = 'Test' AND
N2.hold = N1.hold + 1
)
The query above basically tells SQL Server, "Give me the smallest hold value plus 1 (MIN(hold) + 1) in the table Numbers where the name is 'Test' (name = 'Test') and where a row with name 'Test' and a hold of one more than that does not exist (the whole "NOT EXISTS" part)". In the case of the following rows:
Name     Hold
-------- ----
Test     1
Test     2
NotTest  3
Test     20
SQL Server finds all of the rows with name "Test" (1, 2, 20), then finds which ones don't have a row with name = 'Test' and hold = hold + 1. For (Test, 1) there is a row (Test, 2), so it is excluded. For (Test, 2) there is no (Test, 3), so it stays in the potential results. For (Test, 20) there is no (Test, 21), so that leaves us with:
Name     Hold
-------- ----
Test     2
Test     20
Now SQL Server looks for MIN(hold) and gets 2, then it adds 1, so you get 3.
SQL Server may not perform the operations exactly as I described. The SQL statement tells SQL Server what you're looking for, but not how to get it. SQL Server has the freedom to use whatever method it determines is the most efficient for getting the answer.
The key is to always think in terms of sets and how those sets get put together (through JOINs), filtered (through WHERE conditions or ON conditions within a join), and, when necessary, grouped and aggregated (MIN, MAX, AVG, etc.).
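The walkthrough above can be reproduced with a small script. This is only a sketch: the table variable @Numbers stands in for the real Numbers table and holds the four sample rows from the explanation, and the column alias FirstFreeHold is made up for the example.

```sql
-- Sample rows from the explanation above, in a stand-in table variable.
DECLARE @Numbers TABLE (Name varchar(20) NOT NULL, Hold int NOT NULL);
INSERT INTO @Numbers (Name, Hold)
VALUES ('Test', 1), ('Test', 2), ('NotTest', 3), ('Test', 20);

-- Keep only 'Test' rows that have no 'Test' successor (hold + 1),
-- then take the smallest of them and add 1.
SELECT MIN(N1.Hold) + 1 AS FirstFreeHold
FROM @Numbers AS N1
WHERE N1.Name = 'Test'
  AND NOT EXISTS
      (
          SELECT *
          FROM @Numbers AS N2
          WHERE N2.Name = 'Test'
            AND N2.Hold = N1.Hold + 1
      );
-- FirstFreeHold: 3
```

Note that the result is always one more than an existing value, so with this data the query reports 3 even though no 'Test' row with hold 0 exists.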
Have you tried:
WHILE (@Result is not null)
BEGIN
    select @Result=(SELECT Hold from Numbers WHERE Name='Test' AND Hold=@hold)
    set @hold=@hold+1
END
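To see why the IS NOT NULL form matters, the two conditions can be compared side by side. A small sketch, assuming the default ANSI_NULLS ON setting (the column aliases are made up for the example):

```sql
SET ANSI_NULLS ON;  -- the default; stated explicitly for this demo

DECLARE @Result int;
SET @Result = 0;

-- "@Result != NULL" evaluates to UNKNOWN, which a WHILE or CASE treats as not true,
-- so the original loop body never runs. IS NOT NULL gives a real true/false answer.
SELECT CASE WHEN @Result != NULL THEN 'true' ELSE 'not true' END AS NotEqualNull,
       CASE WHEN @Result IS NOT NULL THEN 'true' ELSE 'not true' END AS IsNotNull;
-- NotEqualNull: not true, IsNotNull: true
```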
Here's a more advanced version of Tom H.'s query:
SELECT MIN(N1.hold) + 1
FROM Numbers N1
LEFT OUTER JOIN Numbers N2
ON N2.Name = N1.Name AND N2.hold = N1.hold + 1
WHERE N1.name = 'Test' AND N2.name IS NULL
It's not as intuitive if you're not familiar with SQL, but it uses identical logic. For those who are more familiar with SQL, it makes the relationship between N1 and N2 easier to see. It may also be easier for the query optimizer to handle, depending on your DBMS.
Try this:
declare @hold int
declare @Result int
set @hold=0
set @Result=0
declare @max int
SELECT @max=MAX(Hold) FROM Numbers
WHILE (@hold <= @max)
BEGIN
    select @Result=(SELECT Hold from Numbers WHERE Name='Test' AND Hold=@hold)
    set @hold=@hold+1
END
print @hold
While is tricky in T-SQL - you can use this for (foreach) looping through (temp) tables too - with:
-- Foreach with T-SQL while
DECLARE @tempTable TABLE (rownum int IDENTITY (1, 1) Primary key NOT NULL, Number int)
declare @RowCnt int
declare @MaxRows int
select @RowCnt = 1
select @MaxRows=count(*) from @tempTable
declare @number int
while @RowCnt <= @MaxRows
begin
    -- Number from given RowNumber
    SELECT @number=Number FROM @tempTable where rownum = @RowCnt
    -- next row
    Select @RowCnt = @RowCnt + 1
end