SQL Server modulus operator to skip to every n'th row on a large table

I have a large table with 100,000,000 rows. I'd like to select every n'th row from the table. My first instinct is to use something like this:
SELECT id, name FROM table WHERE id % 125000 = 0
to retrieve an even spread of 800 rows (id is a clustered index).
This technique works fine on smaller data sets, but on my larger table the query takes 2.5 minutes. I assume this is because the modulus operation is applied to every row. Is there a more efficient way to skip rows?

Your query assumes that the IDs are contiguous (and they probably aren't, even if you haven't noticed). Anyway, you should generate the IDs yourself:
select *
from T
where ID in (0, 125000*1, 125000*2, ...)
You may need a TVP (table-valued parameter) to send all the IDs, since there are so many. Alternatively, produce the IDs on the server in T-SQL, in a SQLCLR function, or from a numbers table.
This technique allows you to perform index seeks and will be about the fastest plan you can produce: it reads the minimal amount of data possible (see the sketch below).
Modulo is not SARGable. SQL Server could support this if Microsoft wanted to, but it is an exotic use case; they will never make modulo SARGable, and they shouldn't.
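As a rough sketch of the server-side ID generation above, assuming SQL Server 2022+ for GENERATE_SERIES (on older versions a numbers/tally table plays the same role) and the table T with a clustered index on ID:
select t.id, t.name
from GENERATE_SERIES(0, 100000000, 125000) as g -- the ~800 candidate ids
join T as t on t.id = g.value                   -- each match is a clustered index seek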

The time is not going into the modulus operation itself, but rather into reading the 124,999 unnecessary rows for every row that you actually want (i.e., the Table Scan or Clustered Index Scan).
Just about the only way to speed up a query like this is something that seems illogical at first: add an extra non-clustered index on just that column ([ID]). You may also have to add an index hint to force the optimizer to use that index. Even then it may not actually be faster, though for a modulus of 125,000+ it should be (it will never be truly fast, though).
If your IDs are not necessarily contiguous (any deleted rows will pretty much guarantee this) and you really do need exactly every n'th row in ID order, then you can still use the approach above, but you will have to resequence the IDs for the modulo operation using ROW_NUMBER() OVER (ORDER BY ID) in the query, as sketched below.
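A minimal sketch of that resequencing, using the question's table and step (note this still scans the index once, which is why it will never be truly fast):
WITH Resequenced AS (
    SELECT id, name, ROW_NUMBER() OVER (ORDER BY id) AS rn
    FROM [table]
)
SELECT id, name
FROM Resequenced
WHERE rn % 125000 = 0;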

If id is in an index, then I am thinking of something along these lines:
with ids as (
    select 1 as id
    union all
    select id + 125000
    from ids
    where id <= 100000000
)
select ids.id,
       (select name from table t where t.id = ids.id) as name
from ids
option (MAXRECURSION 1000);
I think this formulation will use the index on table.
EDIT:
As I think about this approach, you can actually use it to get genuinely random ids from the table, rather than just evenly spaced ones:
with ids as (
    select 1 as cnt,
           ABS(CONVERT(BIGINT, CONVERT(BINARY(8), NEWID()))) % 100000000 as id
    union all
    select cnt + 1,
           ABS(CONVERT(BIGINT, CONVERT(BINARY(8), NEWID()))) % 100000000
    from ids
    where cnt < 800
)
select ids.id,
       (select name from table t where t.id = ids.id) as name
from ids
option (MAXRECURSION 1000);
The code for the actual random number generator came from here.
EDIT:
Due to quirks in SQL Server, you can still get non-contiguous ids even in your scenario. This accepted answer explains the cause: identity values are not allocated one at a time, but in groups, and if the server fails, even unused values get skipped.
One reason I wanted to do the random sampling was to help avoid this problem. Presumably, the above situation is rather rare on most systems. You can use the random sampling to generate, say, 900 ids; from these, you should be able to find 800 that actually exist for your sample, as sketched below.
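A hedged sketch of that trimming step, assuming the ~900 random ids from the CTE above have been saved into a temp table #sampled_ids (a hypothetical name):
SELECT TOP (800) t.id, t.name
FROM #sampled_ids AS s
JOIN [table] AS t ON t.id = s.id -- ids that don't exist simply drop out of the join
ORDER BY t.id;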

DECLARE @i int, @max int, @query VARCHAR(MAX) -- MAX: 800 comma-separated ids will not fit in VARCHAR(1000)
SET @i = 0
SET @max = (SELECT max(id)/125000 FROM Table1)
SET @query = 'SELECT id, name FROM Table1 WHERE id in ('
WHILE @i <= @max
BEGIN
    IF @i > 0 SET @query = @query + ','
    SET @query = @query + CAST(@i*125000 as varchar(12))
    SET @i = @i + 1
END
SET @query = @query + ')'
EXEC(@query)
EDIT:
To avoid any "holes" in a non-contiguous ID situation, you can try something like this:
DECLARE @i int, @start int, @id int, @max int, @query VARCHAR(MAX)
SET @i = 0
SET @max = (SELECT max(id)/125000 FROM Table1)
SET @query = 'SELECT id, name FROM Table1 WHERE id in ('
WHILE @i <= @max
BEGIN
    SET @start = @i*125000
    SET @id = (SELECT TOP 1 id FROM Table1 WHERE id >= @start ORDER BY id ASC)
    IF @i > 0 SET @query = @query + ','
    SET @query = @query + CAST(@id as VARCHAR(12))
    SET @i = @i + 1
END
SET @query = @query + ')'
EXEC(@query)

Related

Stored procedure algorithm taking 9 hours - better way of doing this?

I need to build an SQL stored procedure that basically updates an existing table (of about 150,000 rows) with an ID.
The table this stored procedure runs over is basically a list of people: their names, addresses, etc.
The algorithm for a person's ID is as follows:
- Take up to the first 4 characters of the person's first name.
- Take up to the first 2 characters of the person's last name.
- Pad the rest with 0's, with a counting number at the end, until the field is 8 characters.
For instance, the name JOHN SMITH would have an ID of 'JOHNSM00'. If there were 2 JOHN SMITHs, the ID of the next person would be JOHNSM01. If the person's name was FI LYNN, for instance, the ID would be FILY0000.
I've got the following stored procedure that I wrote, but it takes around 9 hours to run! Is there a better way of doing this that I am missing?
ALTER PROCEDURE [dbo].[LM_SP_UPDATE_PERSON_CODES]
AS
DECLARE @NAMEKEY NVARCHAR(10)
DECLARE @NEWNAMEKEY NVARCHAR(10)
DECLARE @LENGTH INT
DECLARE @KEYCOUNT INT
DECLARE @I INT
DECLARE @PADDING NVARCHAR(8)
DECLARE @PERSONS CURSOR
DECLARE @FIRSTNAME NVARCHAR(30)
DECLARE @LASTNAME NVARCHAR(30)
SET @PADDING = '00000000'
--FIRST CLEAR OLD NEW NAMEKEYS IF ANY EXIST
UPDATE LM_T_PERSONS SET NEW_NAMEKEY = NULL
SET @PERSONS = CURSOR FOR
    SELECT NAMEKEY, NAME_2, NAME_1 FROM LM_T_PERSONS
OPEN @PERSONS
FETCH NEXT FROM @PERSONS INTO @NAMEKEY, @FIRSTNAME, @LASTNAME
WHILE @@FETCH_STATUS = 0
BEGIN
    --CHECK THE LENGTH OF FIRST NAME TO MAKE SURE NOTHING EXCEEDS 4
    SET @LENGTH = LEN(@FIRSTNAME)
    IF @LENGTH > 4
        SET @LENGTH = 4
    SET @NEWNAMEKEY = SUBSTRING(@FIRSTNAME, 1, @LENGTH)
    --CHECK THE LENGTH OF LAST NAME TO MAKE SURE NOTHING EXCEEDS 2
    SET @LENGTH = LEN(@LASTNAME)
    IF @LENGTH > 2
        SET @LENGTH = 2
    SET @NEWNAMEKEY = @NEWNAMEKEY + SUBSTRING(@LASTNAME, 1, @LENGTH)
    SET @LENGTH = LEN(@NEWNAMEKEY)
    SET @I = 0
    SET @PADDING = SUBSTRING('00000000', 1, 8 - LEN(@NEWNAMEKEY) - LEN(CONVERT(NVARCHAR(8), @I)))
    --SEE IF THIS KEY ALREADY EXISTS
    SET @KEYCOUNT = (SELECT COUNT(1) FROM LM_T_PERSONS WHERE NEW_NAMEKEY = @NEWNAMEKEY + @PADDING + CONVERT(NVARCHAR(8), @I))
    WHILE @KEYCOUNT > 0
    BEGIN
        SET @I = @I + 1
        SET @PADDING = SUBSTRING('00000000', 1, 8 - LEN(@NEWNAMEKEY) - LEN(CONVERT(NVARCHAR(8), @I)))
        SET @KEYCOUNT = (SELECT COUNT(1) FROM LM_T_PERSONS WHERE NEW_NAMEKEY = @NEWNAMEKEY + @PADDING + CONVERT(NVARCHAR(8), @I))
    END
    UPDATE LM_T_PERSONS SET NEW_NAMEKEY = @NEWNAMEKEY + @PADDING + CONVERT(NVARCHAR(8), @I) WHERE NAMEKEY = @NAMEKEY
    FETCH NEXT FROM @PERSONS INTO @NAMEKEY, @FIRSTNAME, @LASTNAME
END
CLOSE @PERSONS
DEALLOCATE @PERSONS
Something like this can do it without the cursor:
UPDATE P
SET NEW_NAMEKEY = FIRSTNAME + LASTNAME + REPLICATE('0', 8 - LEN(FIRSTNAME) - LEN(LASTNAME) - LEN(I)) + I
FROM LM_T_PERSONS AS P
JOIN (
    SELECT NAMEKEY,
           LEFT(NAME_2, 4) AS FIRSTNAME,
           LEFT(NAME_1, 2) AS LASTNAME,
           -- the "- 1" makes the counter start at 00, matching the spec
           CONVERT(NVARCHAR(8), ROW_NUMBER() OVER (PARTITION BY LEFT(NAME_2, 4), LEFT(NAME_1, 2) ORDER BY NAMEKEY) - 1) AS I
    FROM LM_T_PERSONS
) AS DATA
ON P.NAMEKEY = DATA.NAMEKEY
You can verify the query here:
http://sqlfiddle.com/#!3/47365/19
I don't have a strict "you should do it XYZ way", but here are some thoughts from similar exercises in the past:
If you want to keep the stored proc and you have a window for a long-running task (say, a weekend) where you can be sure you'll be the only operation running, then setting the database to Simple recovery mode for the duration of the work may speed things up, since you're not writing to the transaction log. (I assume you're working on a production database, so it is in Full recovery mode.) Because recoverability is limited in that mode, make sure you really are the only person doing anything, and take a full backup before starting in case things get nasty.
I don't think the problem is the stored proc so much as the cursor usage, SUBSTRING calls, etc.: you're writing procedural code for a problem that is mainly set-based. I understand the "why" behind why these are there, but one option is to take that work out and use something like SQL Server Integration Services, i.e., a technology better suited to looping and per-row transformations.
Following on from using something more suited to procedural work: you could always write a simple .NET application or similar. Speaking from my own (limited) experience, I've seen this done in the past, but the mileage varied with the complexity of the operation (yours sounds simple enough, transforming a user ID field), the volumes, and the person writing it. I'd say I've never seen it go particularly well (in that we never turned around and said "that was awesome"), but more that it got the job done and we moved on, taking neither good nor bad from the experience (just "average").
I think SSIS is a good way to go: you can extract these records from your DB, do the operations you need (SSIS supports a pretty broad variety of things you can do to data, including writing .NET code {albeit VB.NET, from memory} if you have to), and then update your database.
Other kinds of ETL technology will probably let you do similar things, but I'm most familiar with SSIS. 150k rows wouldn't be a huge problem, as it can deal with much larger volumes; from my own experience we wrote SSIS packages that did nothing too special but could do these sorts of operations over 1 million rows in about 15 minutes... which I think the experts will say is still a little slow :-)
HTH a bit, Nathan
This query will get exactly what you want, and much faster:
select FirstName,
       LastName,
       ID + replicate('0', 8 - len(ID) - len(cast(rankNumber as varchar))) + cast(rankNumber as varchar)
from (
    -- the "- 1" makes the counter start at 00, per the spec
    select dense_rank() over (partition by ID order by rownumber) - 1 as rankNumber,
           FirstName,
           LastName,
           ID
    from (
        select row_number() over (order by FirstName) as rownumber,
               FirstName,
               LastName,
               RTRIM(cast(FirstName as char(4))) + RTRIM(cast(LastName as char(2))) as ID
        from person
    ) A
) B
How about avoiding the inner WHILE loop entirely: get the maximum existing sequence-number suffix (@I) for the NEW_NAMEKEY prefix, then add 1 to it if one exists, or use 0 if the lookup returns NULL. A rough sketch follows.
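A rough, untested sketch of that replacement for the inner loop, reusing the procedure's variables (it assumes every NEW_NAMEKEY tail is numeric, and the LIKE pattern assumes no shorter name prefix shadows a longer one; a production version would need a stricter match):
SET @I = ISNULL(
    (SELECT MAX(CAST(RIGHT(NEW_NAMEKEY, 8 - LEN(@NEWNAMEKEY)) AS INT)) + 1
     FROM LM_T_PERSONS
     WHERE NEW_NAMEKEY LIKE @NEWNAMEKEY + '%'), -- highest counter already used for this prefix
    0)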

Updating a large table and minimizing user impact

I have a question on general database/SQL Server design:
There is a table with 3 million rows that is being accessed 24x7. I need to update all the records in the table. Can you give me some methods to do this so that the user impact is minimized while I update my table?
Thanks in advance.
Normally you'd write a single update statement to update rows. But in your case you actually want to break it up.
http://www.sqlfiddle.com/#!3/c9c75/6
is a working example of a common pattern. You don't want a batch size of 2; maybe you want 100,000 or 25,000 - you'll have to test on your system to find the best balance between quick completion and low blocking.
declare @min int, @max int
select @min = min(user_id), @max = max(user_id)
from users
declare @tmp int
set @tmp = @min
declare @batchSize int
set @batchSize = 2
while @tmp <= @max
begin
    print 'from ' + cast(@tmp as varchar(10)) + ' to ' + cast(@tmp + @batchSize as varchar(10)) + ' starting (' + convert(nvarchar(30), GETDATE(), 120) + ')'
    update users
    set name = name + '_foo'
    where user_id >= @tmp and user_id < @tmp + @batchSize and user_id <= @max
    set @tmp = @tmp + @batchSize
    print 'Done (' + convert(nvarchar(30), GETDATE(), 120) + ')'
    waitfor delay '00:00:01' -- give other sessions a chance between batches
end
update users
set name = name + '_foo'
where user_id > @max
We use patterns like this to update a user table about 10x the size of yours. With 100,000-row chunks it takes about an hour. Performance depends on your hardware, of course.
To minimally impact users, I would update only a certain number of records at a time. The number to update depends more on your hardware than anything else, in my opinion.
As with all things database, it depends. What is the load pattern (i.e., are users reading mainly from the end of the table)? How are new records added, if at all? What are your index fill-factor settings and actual values? Will your update force any index recomputes? Can you split up the update to reduce locking? If so, do you need robust rollback ability in case of a failure? Are you setting the same value in every row, do you need a per-row calculation, or do you have a per-row source to match up?
Go through the table one row at a time using a loop or even a cursor, making sure each update takes row locks.
If you don't have a way of identifying rows that still have to be updated, create another table first to hold the primary key and an update indicator, copy all the primary key values in there, and then keep track of how far along you are in that table.
This is also going to be the slowest method. If you need it to go a little faster, update a few thousand rows at a time, still using ROWLOCK hints; a sketch of that pattern follows.
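A minimal sketch of that tracking-table pattern (the table and column names are made up for illustration and reuse the users example above):
-- one-off: snapshot the keys that still need work
SELECT user_id, CAST(0 AS bit) AS updated
INTO dbo.users_to_update
FROM users;

DECLARE @batch TABLE (user_id int PRIMARY KEY);
WHILE 1 = 1
BEGIN
    DELETE FROM @batch;
    -- grab the next few thousand keys that haven't been processed yet
    INSERT INTO @batch (user_id)
    SELECT TOP (5000) user_id
    FROM dbo.users_to_update
    WHERE updated = 0
    ORDER BY user_id;
    IF @@ROWCOUNT = 0 BREAK;

    UPDATE u
    SET name = name + '_foo'
    FROM users AS u WITH (ROWLOCK)
    JOIN @batch AS b ON b.user_id = u.user_id;

    -- mark the batch done so progress survives an interruption
    UPDATE t
    SET updated = 1
    FROM dbo.users_to_update AS t
    JOIN @batch AS b ON b.user_id = t.user_id;
END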

What is the most efficient way in T-SQL to compare answer strings to answer keys for scoring an exam

These exams typically have about 120 questions. Currently, the strings are compared to the keys and a value of 1 or 0 is assigned; when complete, the 1's are totaled for a raw score.
Are there any T-SQL functions like intersect or diff, or something altogether different, that would handle this process as quickly as possible for 100,000 examinees?
Thanks in advance for your expertise.
-Steven
Try selecting the equality of a question to its correct answer. I assume you have the student's tests in one table and the key in another; something like this ought to work:
select student_test.student_id,
       student_test.test_id,
       student_test.question_id,
       -- T-SQL has no boolean in a select list, so CASE produces the 1-or-0 flag
       case when student_test.answer = test_key.answer
              or (student_test.answer IS NULL and test_key.answer IS NULL)
            then 1 else 0 end as is_correct
from student_test
INNER JOIN test_key
    ON student_test.test_id = test_key.test_id
    AND student_test.question_id = test_key.question_id
WHERE student_test.test_id = <the test to grade>
You can group the results by student and test, then SUM the last column if you want the DB to give you the total score. This will give a detailed "right/wrong" analysis of the test.
EDIT: The answers being stored as one continuous string makes it much harder. You will most likely have to implement this procedurally with a cursor, meaning each student's answers are loaded, SUBSTRINGed into varchar(1)s, and compared to the key in an RBAR (row by agonizing row) fashion. You could also implement a scalar-valued function that compares string A to string B one character at a time and returns the number of differences, then call that function for each student from a driving query; a sketch follows.
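A hypothetical sketch of such a function (the names are invented; the raw score would then be the key length minus the returned count):
CREATE FUNCTION dbo.CountDifferences (@answers VARCHAR(200), @key VARCHAR(200))
RETURNS INT
AS
BEGIN
    -- walk both strings in lockstep, counting positions that differ
    DECLARE @i INT = 1, @diff INT = 0;
    WHILE @i <= LEN(@key)
    BEGIN
        IF SUBSTRING(@answers, @i, 1) <> SUBSTRING(@key, @i, 1)
            SET @diff = @diff + 1;
        SET @i = @i + 1;
    END;
    RETURN @diff;
END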
Something like this might work out for you:
select student_id, studentname, answers, 0 as score
into #scores from test_answers
declare @studentid int
declare @i int
declare @answers varchar(120)
declare @testkey varchar(120)
select @testkey = test_key from test_keys where test_id = 1234
declare student_cursor cursor for
    select student_id from #scores
open student_cursor
fetch next from student_cursor into @studentid
while @@FETCH_STATUS = 0
begin
    select @i = 1
    select @answers = answers from #scores where student_id = @studentid
    while @i <= len(@answers) -- <= so the last question is scored too
    begin
        if substring(@answers, @i, 1) = substring(@testkey, @i, 1)
            update #scores set score = score + 1 where student_id = @studentid
        select @i = @i + 1
    end
    fetch next from student_cursor into @studentid
end
close student_cursor
deallocate student_cursor
select * from #scores
drop table #scores
I doubt that's the single most efficient way to do it, but it's not a bad starting point at least.
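For what it's worth, here is a set-based alternative sketch (a different technique from the cursor above, scoring every answer position with a tally derived from sys.all_objects; same hypothetical table names):
SELECT s.student_id,
       SUM(CASE WHEN SUBSTRING(s.answers, n.q, 1) = SUBSTRING(k.test_key, n.q, 1)
                THEN 1 ELSE 0 END) AS score
FROM test_answers AS s
CROSS JOIN test_keys AS k
JOIN (
    -- tally of question positions 1..120
    SELECT TOP (120) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS q
    FROM sys.all_objects
) AS n ON n.q <= LEN(k.test_key)
WHERE k.test_id = 1234
GROUP BY s.student_id;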

Why does my T-SQL (WHILE) not work?

In my code, I need to find the value closest to 0 for which the query on the specified column returns NULL (the column can hold numbers from 0 to 50), so I have tried the code below.
It should start from 0 and test the query for each value. When @Result becomes NULL, it should stop. However, it does not work; it still prints 0.
declare @hold int
declare @Result int
set @hold=0
set @Result=0
WHILE (@Result!=null)
BEGIN
    select @Result=(SELECT Hold from Numbers WHERE Name='Test' AND Hold=@hold)
    set @hold=@hold+1
END
print @hold
First, you can't test equality with NULL. NULL means an unknown value, so you don't know whether or not it equals any specific value. Instead of @Result != NULL, use @Result IS NOT NULL.
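For example, with the default ANSI_NULLS behaviour, neither comparison below ever succeeds:
IF 1 = NULL PRINT 'equal'        -- never prints: the comparison is UNKNOWN, not true
IF 1 <> NULL PRINT 'not equal'   -- never prints either
IF 1 IS NOT NULL PRINT 'prints'  -- IS [NOT] NULL is the correct test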
Second, don't use this kind of sequential processing in SQL if you can at all help it. SQL is made to handle sets, not process things sequentially. You could do all of this work with one simple SQL command and it will most likely run faster anyway:
SELECT MIN(hold) + 1
FROM Numbers N1
WHERE N1.name = 'Test'
  AND NOT EXISTS
      (
        SELECT *
        FROM Numbers N2
        WHERE N2.name = 'Test'
          AND N2.hold = N1.hold + 1
      )
The query above basically tells SQL Server: "Give me the smallest hold value plus 1 (MIN(hold) + 1) from the table Numbers where the name is 'Test' (name = 'Test') and where no row exists with name 'Test' and a hold of one more than that (the whole NOT EXISTS part)." In the case of the following rows:
Name Hold
-------- ----
Test 1
Test 2
NotTest 3
Test 20
SQL Server finds all of the rows with name "Test" (1, 2, 20), then finds which ones don't have a row with name = 'Test' and hold = hold + 1. For Test, 1 there is a row Test, 2, so it is excluded. For Test, 2 there is no Test, 3, so it stays in the potential results. For Test, 20 there is no Test, 21, so that leaves us with:
Name Hold
-------- ----
Test 2
Test 20
Now SQL Server looks for MIN(hold) and gets 2 then it adds 1, so you get 3.
SQL Server may not perform the operations exactly as I described. The SQL statement tells SQL Server what you're looking for, but not how to get it. SQL Server has the freedom to use whatever method it determines is the most efficient for getting the answer.
The key is to always think in terms of sets and how those sets get put together (through JOINs), filtered (through WHERE conditions or ON conditions within a join), and, when necessary, grouped and aggregated (MIN, MAX, AVG, etc.).
Have you tried
WHILE (@Result is not null)
BEGIN
    select @Result=(SELECT Hold from Numbers WHERE Name='Test' AND Hold=@hold)
    set @hold=@hold+1
END
Here's a more advanced version of Tom H.'s query:
SELECT MIN(N1.hold) + 1
FROM Numbers N1
LEFT OUTER JOIN Numbers N2
ON N2.Name = N1.Name AND N2.hold = N1.hold + 1
WHERE N1.name = 'Test' AND N2.name IS NULL
It's not as intuitive if you're not familiar with SQL, but it uses identical logic. For those who are more familiar with SQL, it makes the relationship between N1 and N2 easier to see. It may also be easier for the query optimizer to handle, depending on your DBMS.
Try this:
declare @hold int
declare @Result int
set @hold=0
set @Result=0
declare @max int
SELECT @max=MAX(Hold) FROM Numbers
WHILE (@hold <= @max)
BEGIN
    select @Result=(SELECT Hold from Numbers WHERE Name='Test' AND Hold=@hold)
    set @hold=@hold+1
END
print @hold
WHILE is tricky in T-SQL - you can also use it for (foreach-style) looping through (temp) tables, like this:
-- Foreach with T-SQL while
DECLARE @tempTable TABLE (rownum int IDENTITY(1,1) PRIMARY KEY NOT NULL, Number int)
declare @RowCnt int
declare @MaxRows int
select @RowCnt = 1
select @MaxRows = count(*) from @tempTable
declare @number int
while @RowCnt <= @MaxRows
begin
    -- Number from the given row number
    SELECT @number = Number FROM @tempTable where rownum = @RowCnt
    -- next row
    select @RowCnt = @RowCnt + 1
end

How to keep a rolling checksum in SQL?

I am trying to keep a rolling checksum to account for order: take the previous checksum, XOR it with the current one, and generate a new checksum.
Name Checksum Rolling Checksum
------ ----------- -----------------
foo 11829231 11829231
bar 27380135 checksum(27380135 ^ 11829231) = 93291803
baz 96326587 checksum(96326587 ^ 93291803) = 67361090
How would I accomplish something like this?
(Note that the calculations are completely made up and are for illustration only)
This is basically the running total problem.
Edit:
My original claim was that this is one of the few places where a cursor-based solution actually performs best. The problem with the triangular self-join solution is that it repeatedly recalculates the same cumulative checksum as a sub-calculation for the next step, so it does not scale well: the work grows quadratically with the number of rows.
Corina's answer uses the "quirky update" approach. I've adjusted it to do the checksum, and in my test it took 3 seconds rather than the 26 seconds of the cursor solution. Both produced the same results. Unfortunately, however, it relies on an undocumented aspect of UPDATE behaviour. I would definitely read the discussion here before deciding whether to rely on this in production code.
There is a third possibility described here (using the CLR) which I didn't have time to test. From the discussion there, it seems a good option for calculating running-total-type values at display time, but it is outperformed by the cursor when the result of the calculation must be saved back.
CREATE TABLE TestTable
(
    PK int identity(1,1) primary key clustered,
    [Name] varchar(50),
    [CheckSum] AS CHECKSUM([Name]),
    RollingCheckSum1 int NULL,
    RollingCheckSum2 int NULL
)
/* Insert some random records (753,571 on my machine) */
INSERT INTO TestTable ([Name])
SELECT newid() FROM sys.objects s1, sys.objects s2, sys.objects s3
Approach One: Based on the Jeff Moden Article
DECLARE @RCS int
UPDATE TestTable
SET @RCS = RollingCheckSum1 =
    CASE WHEN @RCS IS NULL
         THEN [CheckSum]
         ELSE CHECKSUM([CheckSum] ^ @RCS)
    END
FROM TestTable WITH (TABLOCKX)
OPTION (MAXDOP 1)
Approach Two - Using the same cursor options as Hugo Kornelis advocates in the discussion for that article.
SET NOCOUNT ON
BEGIN TRAN
DECLARE @RCS2 INT
DECLARE @PK INT, @CheckSum INT
DECLARE curRollingCheckSum CURSOR LOCAL STATIC READ_ONLY
FOR
SELECT PK, [CheckSum]
FROM TestTable
ORDER BY PK
OPEN curRollingCheckSum
FETCH NEXT FROM curRollingCheckSum
INTO @PK, @CheckSum
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @RCS2 = CASE WHEN @RCS2 IS NULL THEN @CheckSum ELSE CHECKSUM(@CheckSum ^ @RCS2) END
    UPDATE dbo.TestTable
    SET RollingCheckSum2 = @RCS2
    WHERE @PK = PK
    FETCH NEXT FROM curRollingCheckSum
    INTO @PK, @CheckSum
END
COMMIT
Test that they are the same:
SELECT * FROM TestTable
WHERE RollingCheckSum1 <> RollingCheckSum2
I'm not sure about a rolling checksum, but for a rolling sum for instance, you can do this using the UPDATE command:
declare @a table (name varchar(2), value int, rollingvalue int)
insert into @a
select 'a', 1, 0 union all select 'b', 2, 0 union all select 'c', 3, 0
select * from @a
declare @sum int
set @sum = 0
update @a
set @sum = rollingvalue = value + @sum
select * from @a
Select Name, [Checksum],
       (Select Checksum_Agg(T1.[Checksum])
        From [Table] As T1
        Where T1.Name <= T.Name) As RollingChecksum -- <= so each row includes its own checksum
From [Table] As T
Order By T.Name
To do a rolling anything, you need some semblance of order to the rows. That can be by name, an integer key, a date, or whatever. In my example I used Name (even though the order in your sample data isn't alphabetical). In addition, I'm using SQL Server's CHECKSUM_AGG aggregate function.
Ideally, you would also have a unique value on which to compare the inner and outer query; e.g., WHERE T1.PK <= T.PK for an integer key (or even a string key) would work well. In my solution, if Name had a unique constraint, it would also work well enough. A sketch of that keyed variant follows.
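A sketch of that keyed variant against the TestTable from the earlier answer (note CHECKSUM_AGG is an order-independent aggregate, so the values will not match the question's chained CHECKSUM(a ^ b) results exactly):
SELECT T.PK, T.[Name], T.[CheckSum],
       (SELECT CHECKSUM_AGG(T1.[CheckSum])
        FROM TestTable AS T1
        WHERE T1.PK <= T.PK) AS RollingCheckSum -- aggregate over all rows up to and including this one
FROM TestTable AS T
ORDER BY T.PK;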