How to get the next number in a sequence - SQL

I have a table like this:
+----+-----------+-----+-------+
| id | Part      | Seq | Model |
+----+-----------+-----+-------+
| 1  | Head      | 0   | 3     |
| 2  | Neck      | 1   | 3     |
| 3  | Shoulders | 2   | 29    |
| 4  | Shoulders | 2   | 3     |
| 5  | Stomach   | 5   | 3     |
+----+-----------+-----+-------+
How can I insert another record with the next Seq after Stomach for Model 3? Here is what the new table is supposed to look like:
+----+-----------+-----+-------+
| id | Part      | Seq | Model |
+----+-----------+-----+-------+
| 1  | Head      | 0   | 3     |
| 2  | Neck      | 1   | 3     |
| 3  | Shoulders | 2   | 29    |
| 4  | Shoulders | 2   | 3     |
| 5  | Stomach   | 5   | 3     |
| 6  | Groin     | 6   | 3     |
+----+-----------+-----+-------+
Is there a way to craft an insert query that will give the next number after the highest Seq for Model 3 only? I am also looking for something that is concurrency safe.

If you do not maintain a counter table, there are two options. Within a transaction, first select the MAX(seq_id) with one of the following table hints:
WITH(TABLOCKX, HOLDLOCK)
WITH(ROWLOCK, XLOCK, HOLDLOCK)
TABLOCKX + HOLDLOCK is a bit overkill. It blocks regular select statements, which can be considered heavy even though the transaction is small.
A ROWLOCK, XLOCK, HOLDLOCK table hint is probably a better idea (but: read the alternative with a counter table further on). The advantage is that it does not block regular select statements, i.e. when the select statements don't appear in a SERIALIZABLE transaction, or when the select statements don't provide the same table hints. Using ROWLOCK, XLOCK, HOLDLOCK will still block insert statements.
Of course you need to be sure that no other parts of your program select the MAX(seq_id) without these table hints (or outside a SERIALIZABLE transaction) and then use this value to insert rows.
Note that depending on the number of rows that are locked this way, it is possible that SQL Server will escalate the lock to a table lock. Read more about lock escalation here.
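For comparison, the TABLOCKX variant would differ only in the hint used on the MAX(seq) select; a minimal sketch of that one line, using the same table and variable names as the procedure below:
DECLARE @max_seq INT=(SELECT MAX(seq) FROM dbo.table_seq WITH(TABLOCKX,HOLDLOCK) WHERE model=@target_model);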
The insert procedure using WITH(ROWLOCK, XLOCK, HOLDLOCK) would look as follows:
DECLARE @target_model INT=3;
DECLARE @part VARCHAR(128)='Spine';

BEGIN TRY
    BEGIN TRANSACTION;
        DECLARE @max_seq INT=(SELECT MAX(seq) FROM dbo.table_seq WITH(ROWLOCK,XLOCK,HOLDLOCK) WHERE model=@target_model);
        IF @max_seq IS NULL SET @max_seq=0;
        INSERT INTO dbo.table_seq(part,seq,model)VALUES(@part,@max_seq+1,@target_model);
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION;
END CATCH
An alternative and probably a better idea is to have a counter table, and provide these table hints on the counter table. This table would look like the following:
CREATE TABLE dbo.counter_seq(model INT PRIMARY KEY, seq INT);
You would then change the insert procedure as follows:
DECLARE @target_model INT=3;
DECLARE @part VARCHAR(128)='Spine';

BEGIN TRY
    BEGIN TRANSACTION;
        DECLARE @new_seq INT=(SELECT seq FROM dbo.counter_seq WITH(ROWLOCK,XLOCK,HOLDLOCK) WHERE model=@target_model);
        IF @new_seq IS NULL
            BEGIN SET @new_seq=1; INSERT INTO dbo.counter_seq(model,seq)VALUES(@target_model,@new_seq); END
        ELSE
            BEGIN SET @new_seq+=1; UPDATE dbo.counter_seq SET seq=@new_seq WHERE model=@target_model; END
        INSERT INTO dbo.table_seq(part,seq,model)VALUES(@part,@new_seq,@target_model);
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION;
END CATCH
The advantage is that fewer row locks are used (i.e. one per model in dbo.counter_seq), and lock escalation cannot lock the whole dbo.table_seq table and thereby block select statements.
You can test all this and see the effects yourself, by placing a WAITFOR DELAY '00:01:00' after selecting the sequence from counter_seq, and fiddling with the table(s) in a second SSMS tab.
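For example, a rough sketch of such a test (assuming the counter-table variant above); run the first batch in one tab and the second in another:
-- Tab 1: take the lock on the counter row and hold it for a minute
BEGIN TRANSACTION;
    DECLARE @new_seq INT=(SELECT seq FROM dbo.counter_seq WITH(ROWLOCK,XLOCK,HOLDLOCK) WHERE model=3);
    WAITFOR DELAY '00:01:00';
    -- ...rest of the procedure...
COMMIT TRANSACTION;

-- Tab 2: this blocks until Tab 1 commits or rolls back
UPDATE dbo.counter_seq SET seq=seq+1 WHERE model=3;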
PS1: Using ROW_NUMBER() OVER (PARTITION BY model ORDER BY ID) is not a good way. If rows are deleted or added, or IDs change, the sequence would change (consider invoice IDs that should never change). Also, in terms of performance, having to determine the row numbers of all previous rows when retrieving a single row is a bad idea.
PS2: I would never use outside resources to provide locking, when SQL Server already provides locking through isolation levels or fine-grained table hints.

The correct way to handle such insertions is to use an identity column or, if you prefer, a sequence and a default value for the column.
However, you have a NULL value for the seq column, which does not seem correct.
The problem with a query such as:
Insert into yourtable(id, Part, Seq, Model)
Select 6, 'Groin', max(Seq) + 1, 3
From yourtable;
is that two such queries, running at the same time, could produce the same value. The recommendation is to declare seq as a unique, identity column and let the database do all the work.
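For the single-sequence case (not restarted per Model), a minimal sketch of the sequence-with-default approach could look like this; the sequence and table names are illustrative:
CREATE SEQUENCE dbo.seq_part AS INT START WITH 1 INCREMENT BY 1;

CREATE TABLE dbo.yourtable (
    id    INT IDENTITY(1,1) PRIMARY KEY,
    Part  VARCHAR(128) NOT NULL,
    Seq   INT NOT NULL DEFAULT (NEXT VALUE FOR dbo.seq_part),
    Model INT NOT NULL
);

-- Seq is assigned automatically and is concurrency safe,
-- but note it is global rather than restarting per Model.
INSERT INTO dbo.yourtable(Part, Model) VALUES ('Groin', 3);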

Let's first list the challenges:
We cannot use a normal constraint as there are existing null values and we also need to cater for duplicates as well as gaps - if we look at the existing data. This is fine, we will figure it out ;-> in step 3
We require safety for concurrent operations (thus some form or mix of transactions, isolation levels and possibly a "kinda SQL mutex".) Gut feel here is a stored proc for a couple of reasons:
2.1 It protects more easily from sql injection
2.2 We can control the isolation levels (table locking) more easily and recover from some issues which come with this kind of requirement
2.3 We can use application level db locks to control the concurrency
We must store or find the next value on every insert. The word concurrency tells us already that there will be contention and probably high throughput (else please stick to single threads). So we must already be thinking: do not read from the same table you want to write to in an already complicated world.
So with that short prequel, let's attempt a solution:
As a start, we are creating your original table and then also a table to hold the sequence (BodyPartsCounter) which we are setting to the last used sequence + 1:
CREATE TABLE BodyParts
([id] int identity, [Part] varchar(9), [Seq] varchar(4), [Model] int)
;
INSERT INTO BodyParts
([Part], [Seq], [Model])
VALUES
('Head', NULL, 3),
('Neck', '1', 3),
('Shoulders', '2', 29),
('Shoulders', '2', 3),
('Stomach', '5', 3)
;
CREATE TABLE BodyPartsCounter
([id] int
, [counter] int)
;
INSERT INTO BodyPartsCounter
([id], [counter])
SELECT 1, MAX(id) + 1 AS id FROM BodyParts
;
Then we need to create the stored procedure which will do the magic. In short, it acts as a mutex, basically guaranteeing you concurrency (if you do not do inserts or updates into the same tables elsewhere). It then gets the next seq, updates it and inserts the new row. After this has all happened it will commit the transaction and release the stored proc for the next waiting calling thread.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- =============================================
-- Author:      Charlla
-- Create date: 2016-02-15
-- Description: Inserts a new row in a concurrently safe way
-- =============================================
CREATE PROCEDURE InsertNewBodyPart
    @bodypart varchar(50),
    @Model int = 3
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

    BEGIN TRANSACTION;

    -- Get an application lock in your threaded calls
    -- Note: this is blocking for the duration of the transaction
    DECLARE @lockResult int;
    EXEC @lockResult = sp_getapplock @Resource = 'BodyPartMutex',
                                     @LockMode = 'Exclusive';
    IF @lockResult = -3 --deadlock victim
    BEGIN
        ROLLBACK TRANSACTION;
    END
    ELSE
    BEGIN
        DECLARE @newId int;
        --Get the next sequence and update - part of the transaction, so if the insert fails this will roll back
        SELECT @newId = [counter] FROM BodyPartsCounter WHERE [id] = 1;
        UPDATE BodyPartsCounter SET [counter] = @newId + 1 WHERE id = 1;

        -- INSERT THE NEW ROW
        INSERT INTO dbo.BodyParts(
              Part
            , Seq
            , Model
        )
        VALUES(
              @bodypart
            , @newId
            , @Model
        )
        -- END INSERT THE NEW ROW

        EXEC @lockResult = sp_releaseapplock @Resource = 'BodyPartMutex';

        COMMIT TRANSACTION;
    END;
END
GO
Now run the test with this:
DECLARE @return_value int;

EXEC @return_value = [dbo].[InsertNewBodyPart]
    @bodypart = N'Stomach',
    @Model = 4;

SELECT 'Return Value' = @return_value;

SELECT * FROM BodyParts;
SELECT * FROM BodyPartsCounter;
This all works - but be careful - there's a lot to consider with any kind of multithreaded app.
Hope this helps!

I believe the best bet to handle this kind of sequence generation scenario is the counter table, as TT suggested. I just wanted to show here a slightly simplified version of TT's implementation.
Tables:
CREATE TABLE dbo.counter_seq(model INT PRIMARY KEY, seq INT);
CREATE TABLE dbo.table_seq(part varchar(128), seq int, model int);
Simpler version (No SELECT statement to retrieve the current seq):
DECLARE @target_model INT=3;
DECLARE @part VARCHAR(128)='Otra MAS';

BEGIN TRY
    BEGIN TRANSACTION;
        DECLARE @seq int = 1;
        UPDATE dbo.counter_seq WITH(ROWLOCK,HOLDLOCK) SET @seq = seq = seq + 1 WHERE model=@target_model;
        IF @@ROWCOUNT = 0 INSERT INTO dbo.counter_seq VALUES (@target_model, 1);
        INSERT INTO dbo.table_seq(part,seq,model)VALUES(@part,@seq,@target_model);
    COMMIT
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION;
END CATCH

Since you want the sequence to be based on a specific model, just add that to the WHERE clause when doing the select. This will ensure the MAX(Seq) pertains only to that model series. Also, since Seq can be NULL, wrap it in ISNULL, so if it is NULL it becomes 0, and 0 + 1 sets the next value to 1.
The basic way to do this is:
Insert into yourtable(id, Part, Seq, Model)
Select 6, 'Groin', ISNULL(max(Seq),0) + 1, 3
From yourtable
where MODEL = 3;

I would not try to store the Seq value in the table in the first place.
As you said in the comments, your ID is IDENTITY, which increases automatically in a very efficient and concurrent-safe way by the server. Use it for determining the order in which rows were inserted and the order in which the Seq values should be generated.
Then use ROW_NUMBER to generate values of Seq partitioned by Model (the sequence restarts from 1 for each value of Model) as needed in the query.
SELECT
ID
,Part
,Model
,ROW_NUMBER() OVER(PARTITION BY Model ORDER BY ID) AS Seq
FROM YourTable
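If several queries need it, a sketch of wrapping this in a view (the view name is illustrative):
CREATE VIEW dbo.YourTableWithSeq
AS
SELECT
     ID
    ,Part
    ,Model
    ,ROW_NUMBER() OVER(PARTITION BY Model ORDER BY ID) AS Seq
FROM YourTable;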

insert into tableA (id,part,seq,model)
values
(6,'Groin',(select MAX(seq)+1 from tableA where model=3),3)

create function dbo.fncalnxt(@model int)
returns int
begin
    declare @seq int
    select @seq = case when @model=3 then max(id) --else
                  end
    from tblBodyParts
    return @seq+1
end
--query idea to insert values, ideal if using an SP to insert
insert into tblBodyParts values('groin', dbo.fncalnxt(@model), @model)
You can try this, I guess. A novice shot, correct me if I'm wrong. I'd suggest using a function to get the next value for the seq column based on the model;
you'll have to handle the else case though, to return whatever other value you want; when model != 3 it will return NULL for now.

Assuming you have following table:
CREATE TABLE tab (
id int IDENTITY(1,1) PRIMARY KEY,
Part VARCHAR(32) not null,
Seq int not null,
Model int not null
);
INSERT INTO
tab(Part,Seq,Model)
VALUES
('Head', 0, 3),
('Neck', 1, 3),
('Shoulders', 2, 29),
('Shoulders', 2, 3),
('Stomach', 5, 3);
The query below will allow you to import multiple records without ruining the per-model seq numbering:
INSERT INTO
    tab (Model, Part, Seq)
SELECT
    t.model,
    t.part,
    -- ensure new records receive the proper per-model seq
    ISNULL(t.max_seq + t.model_seq, t.model_seq) AS seq
FROM
    (
        SELECT
            -- row number for each new record per model
            ROW_NUMBER() OVER(PARTITION BY n.model ORDER BY n.part) AS model_seq,
            n.model,
            n.part,
            MAX(tab.seq) AS max_seq
        FROM
            -- Table-value constructor allows you to prepare the
            -- temporary data (with multiple rows),
            -- and join the existing table
            -- to retrieve MAX(seq) per model, if any
            (VALUES
                ('Stomach', 3),
                ('Legs', 3),
                ('Legs', 29),
                ('Arms', 1)
            ) AS n(part, model)
            LEFT JOIN tab ON tab.model = n.model
        GROUP BY
            n.model, n.part
    ) AS t
We need ROW_NUMBER() to ensure that, if we import more than one value per model, the ordering is kept. More info about ROW_NUMBER() OVER() (Transact-SQL).
A table value constructor is used to build a table with the new values and join in the existing MAX(seq) per model.
You can find more about table value constructors here: Table Value Constructor (Transact-SQL)

Related

Azure Synapse fastest way to process 20k statements in order

I am designing an incremental update process for a cloud-based database (Azure). The only existing changelog is a .txt file that records every insert, delete, and update statement that the database processes. There is no change data capture table available, or any database table that records changes, and I cannot enable watermarking on the database. The .txt file is structured as follows:
update [table] set x = 'data' where y = 'data'
go
insert into [table] values (data)
go
delete from [table] where x = data
go
I have built my process to convert the .txt file into a table in the cloud as follows:
update_id | db_operation | statement                                    | user  | processed_flag
----------|--------------|----------------------------------------------|-------|---------------
1         | 'update'     | 'update [table] set x = data where y = data' | user1 | 0
2         | 'insert'     | 'insert into [table] values (data)'          | user2 | 0
3         | 'delete'     | 'delete from [table] where x = data'         | user3 | 1
I use this code to create a temporary table of the unprocessed transactions, then loop over that table, build each SQL statement, and execute it:
CREATE TABLE temp_incremental_updates
WITH
(
DISTRIBUTION = HASH ( [user] ),
HEAP
)
AS
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Sequence,
[user],
[statement]
FROM upd.incremental_updates
WHERE processed_flag = 0;
DECLARE @nbr_statements INT = (SELECT COUNT(*) FROM temp_incremental_updates),
        @i INT = 1;

WHILE @i <= @nbr_statements
BEGIN
    DECLARE @sql_code NVARCHAR(4000) = (SELECT [statement] FROM temp_incremental_updates WHERE Sequence = @i);
    EXEC sp_executesql @sql_code;
    SET @i += 1;
END
DROP TABLE temp_incremental_updates;
UPDATE incremental_updates SET processed_flag = 1
This is taking a very long time, upwards of an hour. Is there a different way I can quickly process multiple SQL statements that need to occur in a specific order? Order is relevant because, for example, if I try to process a delete statement before the insert statement that created that data, Azure Synapse will throw an error.
Less than 2 hours for 20k individual statements is pretty good for Synapse!
Synapse isn't meant to do transactional processing. You need to convert individual updates to batch updates and execute statements like MERGE for big batches of rows instead of INSERT, UPDATE and DELETE for each row.
In your situation, you could:
Group all inserts/updates by table name
Create a temp table for each group. E.g. table1_insert_updates
Run a MERGE-like statement from table1_insert_updates to table1 (see the sketch after this list).
For deletes:
Group primary keys by table name
Run one DELETE FROM table1 where key in (primary keys) per table.
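Here is a hedged sketch of that grouped approach; the table names table1_insert_updates and table1_deletes and the columns (id, col1, col2) are placeholders to adapt to your schema:
-- Upsert one whole batch for table1 in a single statement
MERGE INTO dbo.table1 AS tgt
USING dbo.table1_insert_updates AS src
    ON tgt.id = src.id
WHEN MATCHED THEN
    UPDATE SET tgt.col1 = src.col1, tgt.col2 = src.col2
WHEN NOT MATCHED THEN
    INSERT (id, col1, col2) VALUES (src.id, src.col1, src.col2);

-- Deletes: one statement per table instead of one per row
DELETE FROM dbo.table1
WHERE id IN (SELECT id FROM dbo.table1_deletes);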
Frankly, 20k is an awkward number: not too small, yet far from big enough. So even after "grouping" you could still have performance issues if your batch/group sizes are too small.
Synapse isn't meant for transaction processing. It'll merge a table with a million rows into a table with a billion rows in less than 5 minutes using a single MERGE statement to upsert a million rows, but if you run 1000 delete and 1000 insert statements one after the other it'll probably take longer!
EDIT: You'll also have to use PARTITION BY and RANK (or ROW_NUMBER) to de-duplicate in case there are multiple updates to the same row in a single batch. Depending on what your input looks like (whether an update contains all columns, even unchanged ones, or only the changed columns), this might become very complicated.
Again Synapse is not meant for transaction processing.
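A minimal sketch of that de-duplication step, assuming each staged row carries its key (id) and a change-order column (update_seq); all of these names are placeholders:
-- Keep only the latest change per key within the batch (CTAS style, as in the question)
CREATE TABLE dbo.table1_insert_updates_dedup
WITH (DISTRIBUTION = HASH([id]), HEAP)
AS
SELECT id, col1, col2
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY update_seq DESC) AS rn
    FROM dbo.table1_insert_updates
) AS x
WHERE rn = 1;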
Try to declare a cursor for selecting all the data from temp_incremental_updates at once, instead of making multiple reads:
CREATE TABLE temp_incremental_updates
WITH
(
DISTRIBUTION = HASH ( [user] ),
HEAP
)
AS
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Sequence,
[user],
[statement]
FROM upd.incremental_updates
WHERE processed_flag = 0;
DECLARE @sql_code NVARCHAR(4000);

DECLARE cur CURSOR FOR SELECT [statement] FROM temp_incremental_updates ORDER BY Sequence;

OPEN cur;
FETCH NEXT FROM cur INTO @sql_code;

WHILE @@FETCH_STATUS = 0
BEGIN
    EXEC sp_executesql @sql_code;
    FETCH NEXT FROM cur INTO @sql_code;
END

-- Rest of the code

In SQL Server 2008 R2, is there a way to create a custom auto increment identity field without using IDENTITY(1,1)?

I would like to be able to pull the custom key value from a table, but would also like it to perform like SQL Server's IDENTITY(1,1) column on inserts.
The custom key is for another application and will need to be used by different functions so the value will need to be pulled from a table and available for other areas.
Here are some of my attempts:
I tried a trigger on the table; it works well on single inserts, but failed when using a set-based SQL insert (forgetting the fact that triggers fire per statement/batch, not per row):
ALTER TRIGGER [sales].[trg_NextInvoiceDocNo]
ON [sales].[Invoice]
AFTER INSERT
AS
BEGIN
DECLARE @ResultVar VARCHAR(25)
DECLARE @Key VARCHAR(25)
EXEC [dbo].[usp_GetNextKeyCounterChar]
    @tcForTbl = 'docNbr', @tcForGrp = 'docNbr', @NewKey = @ResultVar OUTPUT
UPDATE sales.InvoiceRET
SET DocNbr = @ResultVar
FROM sales.InvoiceRET
JOIN inserted ON inserted.id = sales.InvoiceRET.id;
END;
I thought about a scalar function, but functions cannot execute stored procedures or UPDATE statements in order to set the new key value in the lookup table.
Thanks
You can use ROW_NUMBER() depending on the type of concurrency you are dealing with. Here is some sample data and a demo you can run locally.
-- Sample table
USE tempdb
GO
IF OBJECT_ID('dbo.sometable','U') IS NOT NULL DROP TABLE dbo.sometable;
GO
CREATE TABLE dbo.sometable
(
SomeId INT NULL,
Col1 INT NOT NULL
);
GO
-- Stored Proc to insert data
CREATE PROC dbo.InsertProc @output BIT AS
BEGIN -- Your proc starts here
INSERT dbo.sometable(Col1)
SELECT datasource.[value]
FROM (VALUES(CHECKSUM(NEWID())%100)) AS datasource([value]) -- simulating data from somewhere
CROSS APPLY (VALUES(1),(1),(1)) AS x(x);
WITH
id(MaxId) AS (SELECT ISNULL(MAX(t.SomeId),0) FROM dbo.sometable AS t),
xx AS
(
SELECT s.SomeId, RN = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))+id.MaxId, s.Col1, id.MaxId
FROM id AS id
CROSS JOIN dbo.sometable AS s
WHERE s.SomeId IS NULL
)
UPDATE xx SET xx.SomeId = xx.RN;
IF @output = 1
SELECT t.* FROM dbo.sometable AS t;
END
GO
Each time I run: EXEC dbo.InsertProc 1; it returns 3 more rows with the correct ID col. Each time I execute it, it adds more rows and auto-increments as needed.
SomeId Col1
-------- ------
1 62
2 73
3 -17

Arithmetic overflow on large table

I have a table with 5 billion rows in SQL Server 2014 (Developer Edition, x64, Windows 10 Pro x64):
CREATE TABLE TestTable
(
ID BIGINT IDENTITY(1,1),
PARENT_ID BIGINT NOT NULL,
CONSTRAINT PK_TestTable PRIMARY KEY CLUSTERED (ID)
);
CREATE NONCLUSTERED INDEX IX_TestTable_ParentId
ON TestTable (PARENT_ID);
I'm trying to apply the following patch:
-- Create non-nullable column with default (should be online operation in Enterprise/Developer edition)
ALTER TABLE TestTable
ADD ORDINAL TINYINT NOT NULL CONSTRAINT DF_TestTable_Ordinal DEFAULT 0;
GO
-- Populate column value for existing data
BEGIN
SET NOCOUNT ON;
DECLARE @BATCH_SIZE BIGINT = 1000000;
DECLARE @COUNTER BIGINT = 0;
DECLARE @ROW_ID BIGINT;
DECLARE @ORDINAL BIGINT;
DECLARE ROWS_C CURSOR
LOCAL FORWARD_ONLY FAST_FORWARD READ_ONLY
FOR
SELECT
ID AS ID,
ROW_NUMBER() OVER (PARTITION BY PARENT_ID ORDER BY ID ASC) AS ORDINAL
FROM
TestTable;
OPEN ROWS_C;
FETCH NEXT FROM ROWS_C
INTO @ROW_ID, @ORDINAL;
BEGIN TRANSACTION;
WHILE @@FETCH_STATUS = 0
BEGIN
UPDATE TestTable
SET
ORDINAL = CAST(@ORDINAL AS TINYINT)
WHERE
ID = @ROW_ID;
FETCH NEXT FROM ROWS_C
INTO @ROW_ID, @ORDINAL;
SET @COUNTER = @COUNTER + 1;
IF @COUNTER = @BATCH_SIZE
BEGIN
COMMIT TRANSACTION;
SET @COUNTER = 0;
BEGIN TRANSACTION;
END;
END;
COMMIT TRANSACTION;
CLOSE ROWS_C;
DEALLOCATE ROWS_C;
SET NOCOUNT OFF;
END;
GO
-- Drop default constraint from the column
ALTER TABLE TestTable
DROP CONSTRAINT DF_TestTable_Ordinal;
GO
-- Drop IX_TestTable_ParentId index
DROP INDEX IX_TestTable_ParentId
ON TestTable;
GO
-- Create IX_TestTable_ParentId_Ordinal index
CREATE UNIQUE INDEX IX_TestTable_ParentId_Ordinal
ON TestTable (PARENT_ID, ORDINAL);
GO
The aim of patch is to add a column, called ORDINAL, which is an ordinal number of the record within the same parent (defined by PARENT_ID). The patch is run using SQLCMD.
The patch is done is this way for a set of reasons:
Table is too large to run a single UPDATE statement on it (takes enormous amount of time and space in transaction log/tempdb).
Batch updates using a single UPDATE statement with TOP n rows are not simple to implement (if we update table in, say, 1m rows batches, 1000001st row may belong to the same PARENT_ID as 1000000th which will lead to wrong ordinal number assigned to 1000001st record). In other words, SELECT statement run in cursor should be run once (without paging) or more complicated operations (joins/conditions) should be applied.
Adding NULL column and changing it to NOT NULL later is not a good solution since I use SNAPSHOT isolation (full table update will be performed on altering column to be NOT NULL).
The patch works perfect on a small database with a few millions of rows, but, when applied to the one with billions of rows, I get:
Msg 3606, Level 16, State 2, Server XXX, Line 22
Arithmetic overflow occurred.
My first guess was that the ORDINAL value is too big to fit into a TINYINT column, but this is not the case. I created a test database with a similar structure and populated it with data (more than 255 rows per parent). The error message I get there is also an arithmetic exception, but with a different error code and different wording (explicitly saying it cannot fit the data into TINYINT).
Currently I have a couple of suspicions, but I haven't managed to find anything that could help me:
CURSOR is not able to handle more than MAX(INT32) rows.
SQLCMD imposed limitations.
Do you have any ideas on what the problem could be?
How about using a While loop but making sure that you keep the same parent_ids together:
DECLARE @SegmentSize BIGINT = 1000000
DECLARE @CurrentSegment BIGINT = 0

WHILE 1 = 1
BEGIN
    ;WITH UpdateData AS
    (
        SELECT ID AS ID,
               ROW_NUMBER() OVER (PARTITION BY PARENT_ID ORDER BY ID ASC) AS ORDINAL
        FROM TestData
        WHERE ID > @CurrentSegment AND ID <= (@CurrentSegment + @SegmentSize)
    )
    UPDATE TestData
    SET Ordinal = UpdateData.Ordinal
    FROM TestData
    INNER JOIN UpdateData ON TestData.Id = UpdateData.Id

    IF @@ROWCOUNT = 0
    BEGIN
        BREAK
    END

    SET @CurrentSegment = @CurrentSegment + @SegmentSize
END
EDIT - Amended to segment on Parent_ID as per the request. This should be reasonably quick, as Parent_ID is indexed (OPTION (RECOMPILE) was added to ensure that the actual value is used for the lookup). Because you are not updating the whole table, this will limit the transaction log growth!
DECLARE @SegmentSize BIGINT = 1000000
DECLARE @CurrentSegment BIGINT = 0

WHILE 1 = 1
BEGIN
    ;WITH UpdateData AS
    (
        SELECT ID AS ID,
               ROW_NUMBER() OVER (PARTITION BY PARENT_ID ORDER BY ID ASC) AS ORDINAL
        FROM TestData
        WHERE Parent_ID > @CurrentSegment AND
              Parent_ID <= (@CurrentSegment + @SegmentSize)
    )
    UPDATE TestData
    SET Ordinal = UpdateData.Ordinal
    FROM TestData
    INNER JOIN UpdateData ON TestData.Id = UpdateData.Id
    OPTION (RECOMPILE)

    IF @@ROWCOUNT = 0
    BEGIN
        BREAK
    END

    SET @CurrentSegment = @CurrentSegment + @SegmentSize
END

Pipes and filters at DBMS-level: Splitting the MERGE output stream

Scenario
We have a pretty standard data import process in which we load a
staging table, then MERGE it into a target table.
New requirements (green) involve capturing a subset of the imported data
into a separate queue table for completely unrelated processing.
The "challenge"
(1) The subset consists of a selection of the records: those that were
newly inserted into the target table only.
(2) The subset is a projection of some of the inserted columns, but also
at least one column that is only present in the source (the staging
table).
(3) The MERGE statement already uses the OUTPUT..INTO clause
strictly to record the $actions taken by MERGE, so that we can
PIVOT the result and COUNT the number of insertions, updates and
deletions for statistics purposes. We don't really enjoy buffering the
actions for the entire dataset like that and would prefer aggregating
the sums on the fly. Needless to say, we don't want to add more data to
this OUTPUT table.
(4) We don't want to do the matching work that the MERGE
performs a second time for whatever reason, even partially. The
target table is really big, we can't index everything, and the
operation is generally quite expensive (minutes, not seconds).
(5) We're not considering roundtripping any output from the MERGE to
the client just so that the client can route it to the queue by
sending it back immediately. The data has to stay on the server.
(6) We wish to avoid buffering the entire dataset in temporary storage
between staging and the queue.
What would be the best way of going about it?
Failures
(a) The requirement to enqueue only the inserted records prevents us
from targeting the queue table directly in an OUTPUT..INTO clause of
the MERGE, as it doesn't allow any WHERE clause. We can use some
CASE trickery to mark the unwanted records for subsequent deletion
from the queue without processing, but this seems crazy.
(b) Because some columns intended for the queue don't appear in the
target table, we cannot simply add an insertion trigger on the target
table to load the queue. The "data flow split" has to happen sooner.
(c) Since we already use an OUTPUT..INTO clause in the MERGE, we
cannot add a second OUTPUT clause and nest the MERGE into an
INSERT..SELECT to load the queue either. This is a shame, because it
feels like a completely arbitrary limitation for something that works
very well otherwise; the SELECT filters only the records with the
$action we want (INSERT) and INSERTs them in the queue in a single
statement. Thus, the DBMS can theoretically avoid buffering the whole
dataset and simply stream it into the queue. (Note: we didn't pursue this,
and it's likely that it actually didn't optimize the plan this way.)
Situation
We feel we've exhausted our options, but decided to turn to the hivemind
to be sure. All we can come up with is:
(S1) Create a VIEW of the target table that also contains nullable
columns for the data intended for the queue only, and have the
SELECT statement define them as NULL. Then, setup INSTEAD OF
triggers that populate both the target table and the queue
appropriately. Finally, wire the MERGE to target the view. This
works, but we're not fans of the construct -- it definitely
looks tricky.
(S2) Give up, buffer the entire dataset in a temporary table using
another MERGE..OUTPUT. After the MERGE, immediately copy the data
(again!) from temporary table into the queue.
My understanding is that the main obstacle is the limitation of the OUTPUT clause in SQL Server. It allows one OUTPUT INTO table and/or one OUTPUT that returns result set to the caller.
You want to save the outcome of the MERGE statement in two different ways:
all rows that were affected by MERGE for gathering statistics
only inserted rows for queue
Simple variant
I would use your S2 solution. At least to start with. It is easy to understand and maintain and should be quite efficient, because the most resource-intensive operation (the MERGE into Target itself) would be performed only once. There is a second variant below and it would be interesting to compare their performance on real data.
So:
Use OUTPUT INTO @TempTable in the MERGE.
Either INSERT all rows from @TempTable into Stats, or aggregate before inserting. If all you need is aggregated statistics, it makes sense to aggregate the results of this batch and merge them into the final Stats instead of copying all rows.
INSERT into Queue only the "inserted" rows from @TempTable.
I'll take sample data from the answer by @i-one.
Schema
-- I'll return to commented lines later
CREATE TABLE [dbo].[TestTarget](
-- [ID] [int] IDENTITY(1,1) NOT NULL,
[foo] [varchar](10) NULL,
[bar] [varchar](10) NULL
);
CREATE TABLE [dbo].[TestStaging](
[foo] [varchar](10) NULL,
[bar] [varchar](10) NULL,
[baz] [varchar](10) NULL
);
CREATE TABLE [dbo].[TestStats](
[MergeAction] [nvarchar](10) NOT NULL
);
CREATE TABLE [dbo].[TestQueue](
-- [TargetID] [int] NOT NULL,
[foo] [varchar](10) NULL,
[baz] [varchar](10) NULL
);
Sample data
TRUNCATE TABLE [dbo].[TestTarget];
TRUNCATE TABLE [dbo].[TestStaging];
TRUNCATE TABLE [dbo].[TestStats];
TRUNCATE TABLE [dbo].[TestQueue];
INSERT INTO [dbo].[TestStaging]
([foo]
,[bar]
,[baz])
VALUES
('A', 'AA', 'AAA'),
('B', 'BB', 'BBB'),
('C', 'CC', 'CCC');
INSERT INTO [dbo].[TestTarget]
([foo]
,[bar])
VALUES
('A', 'A_'),
('B', 'B?');
Merge
DECLARE @TempTable TABLE (
    MergeAction nvarchar(10) NOT NULL,
    foo varchar(10) NULL,
    baz varchar(10) NULL);

MERGE INTO TestTarget AS Dst
USING TestStaging AS Src
ON Dst.foo = Src.foo
WHEN MATCHED THEN
    UPDATE SET
        Dst.bar = Src.bar
WHEN NOT MATCHED BY TARGET THEN
    INSERT (foo, bar)
    VALUES (Src.foo, Src.bar)
OUTPUT $action AS MergeAction, inserted.foo, Src.baz
INTO @TempTable(MergeAction, foo, baz)
;
INSERT INTO [dbo].[TestStats] (MergeAction)
SELECT T.MergeAction
FROM @TempTable AS T;
INSERT INTO [dbo].[TestQueue]
([foo]
,[baz])
SELECT
T.foo
,T.baz
FROM @TempTable AS T
WHERE T.MergeAction = 'INSERT'
;
SELECT * FROM [dbo].[TestTarget];
SELECT * FROM [dbo].[TestStats];
SELECT * FROM [dbo].[TestQueue];
Result
TestTarget
+-----+-----+
| foo | bar |
+-----+-----+
| A | AA |
| B | BB |
| C | CC |
+-----+-----+
TestStats
+-------------+
| MergeAction |
+-------------+
| INSERT |
| UPDATE |
| UPDATE |
+-------------+
TestQueue
+-----+-----+
| foo | baz |
+-----+-----+
| C | CCC |
+-----+-----+
Second variant
Tested on SQL Server 2014 Express.
OUTPUT clause can send its result set to a table and to the caller. So, OUTPUT INTO can go into the Stats directly and if we wrap the MERGE statement into a stored procedure, then we can use INSERT ... EXEC into the Queue.
If you examine the execution plan you'll see that INSERT ... EXEC creates a temporary table behind the scenes anyway (see also The Hidden Costs of INSERT EXEC by
Adam Machanic), so I expect that overall performance would be similar to the first variant, where you create the temporary table explicitly.
One more problem to solve: the Queue table should have only "inserted" rows, not all affected rows. To achieve that you could use a trigger on the Queue table to discard rows other than "inserted". Another possibility is to define a unique index with IGNORE_DUP_KEY = ON and prepare the data in such a way that "non-inserted" rows would violate the unique index and would not be inserted into the table.
So, I'll add an ID IDENTITY column to the Target table and I'll add a TargetID column to the Queue table. (Uncomment them in the script above).
Also, I'll add an index to the Queue table:
CREATE UNIQUE NONCLUSTERED INDEX [IX_TargetID] ON [dbo].[TestQueue]
(
[TargetID] ASC
) WITH (
PAD_INDEX = OFF,
STATISTICS_NORECOMPUTE = OFF,
SORT_IN_TEMPDB = OFF,
IGNORE_DUP_KEY = ON,
DROP_EXISTING = OFF,
ONLINE = OFF,
ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON)
Important part is UNIQUE and IGNORE_DUP_KEY = ON.
Here is the stored procedure for the MERGE:
CREATE PROCEDURE [dbo].[TestMerge]
AS
BEGIN
SET NOCOUNT ON;
SET XACT_ABORT ON;
MERGE INTO dbo.TestTarget AS Dst
USING dbo.TestStaging AS Src
ON Dst.foo = Src.foo
WHEN MATCHED THEN
UPDATE SET
Dst.bar = Src.bar
WHEN NOT MATCHED BY TARGET THEN
INSERT (foo, bar)
VALUES (Src.foo, Src.bar)
OUTPUT $action INTO dbo.TestStats(MergeAction)
OUTPUT CASE WHEN $action = 'INSERT' THEN inserted.ID ELSE 0 END AS TargetID,
inserted.foo,
Src.baz
;
END
Usage
TRUNCATE TABLE [dbo].[TestTarget];
TRUNCATE TABLE [dbo].[TestStaging];
TRUNCATE TABLE [dbo].[TestStats];
TRUNCATE TABLE [dbo].[TestQueue];
-- Make sure that `Queue` has one special row with TargetID=0 in advance.
INSERT INTO [dbo].[TestQueue]
([TargetID]
,[foo]
,[baz])
VALUES
(0
,NULL
,NULL);
INSERT INTO [dbo].[TestStaging]
([foo]
,[bar]
,[baz])
VALUES
('A', 'AA', 'AAA'),
('B', 'BB', 'BBB'),
('C', 'CC', 'CCC');
INSERT INTO [dbo].[TestTarget]
([foo]
,[bar])
VALUES
('A', 'A_'),
('B', 'B?');
INSERT INTO [dbo].[TestQueue]
EXEC [dbo].[TestMerge];
SELECT * FROM [dbo].[TestTarget];
SELECT * FROM [dbo].[TestStats];
SELECT * FROM [dbo].[TestQueue];
Result
TestTarget
+----+-----+-----+
| ID | foo | bar |
+----+-----+-----+
| 1 | A | AA |
| 2 | B | BB |
| 3 | C | CC |
+----+-----+-----+
TestStats
+-------------+
| MergeAction |
+-------------+
| INSERT |
| UPDATE |
| UPDATE |
+-------------+
TestQueue
+----------+------+------+
| TargetID | foo | baz |
+----------+------+------+
| 0 | NULL | NULL |
| 3 | C | CCC |
+----------+------+------+
If MERGE updated some rows, there will be an extra message during INSERT ... EXEC:
Duplicate key was ignored.
This warning message is sent when the unique index discards some rows during the INSERT due to IGNORE_DUP_KEY = ON.
A warning message will occur when duplicate key values are inserted
into a unique index. Only the rows violating the uniqueness constraint
will fail.
Consider following two approaches to solve the problem:
Merge data into target and output inserted into queue in a single statement, and summarize statistics in the trigger created on target. Batch identifier can be passed into trigger via temporary table.
Merge data into target and output inserted into queue in a single statement, and summarize statistics immediately after the merge, using built-in change tracking capabilities, instead of doing it in the trigger.
Approach 1 (merge data and gather statistics in the trigger):
Sample data setup (indexes and constraints omitted for simplicity):
create table staging (foo varchar(10), bar varchar(10), baz varchar(10));
create table target (foo varchar(10), bar varchar(10));
create table queue (foo varchar(10), baz varchar(10));
create table stats (batchID int, inserted bigint, updated bigint, deleted bigint);
insert into staging values
('A', 'AA', 'AAA')
,('B', 'BB', 'BBB')
,('C', 'CC', 'CCC')
;
insert into target values
('A', 'A_')
,('B', 'B?')
,('E', 'EE')
;
Trigger for gathering inserted/updated/deleted statistics:
create trigger target_onChange
on target
after delete, update, insert
as
begin
set nocount on;
if object_id('tempdb..#targetMergeBatch') is NULL
return;
declare @batchID int;
select @batchID = batchID from #targetMergeBatch;
merge into stats t
using (
select
batchID = @batchID,
cntIns = count_big(case when i.foo is not NULL and d.foo is NULL then 1 end),
cntUpd = count_big(case when i.foo is not NULL and d.foo is not NULL then 1 end),
cntDel = count_big(case when i.foo is NULL and d.foo is not NULL then 1 end)
from inserted i
full join deleted d on d.foo = i.foo
) s
on t.batchID = s.batchID
when matched then
update
set
t.inserted = t.inserted + s.cntIns,
t.updated = t.updated + s.cntUpd,
t.deleted = t.deleted + s.cntDel
when not matched then
insert (batchID, inserted, updated, deleted)
values (s.batchID, s.cntIns, s.cntUpd, s.cntDel);
end
Merge statements:
declare @batchID int;
set @batchID = 1;-- or select @batchID = batchID from ...;
create table #targetMergeBatch (batchID int);
insert into #targetMergeBatch (batchID) values (@batchID);
insert into queue (foo, baz)
select foo, baz
from
(
merge into target t
using staging s
on t.foo = s.foo
when matched then
update
set t.bar = s.bar
when not matched then
insert (foo, bar)
values (s.foo, s.bar)
when not matched by source then
delete
output $action, inserted.foo, s.baz
) m(act, foo, baz)
where act = 'INSERT'
;
drop table #targetMergeBatch
Check the results:
select * from target;
select * from queue;
select * from stats;
Target:
foo bar
---------- ----------
A AA
B BB
C CC
Queue:
foo baz
---------- ----------
C CCC
Stats:
batchID inserted updated deleted
-------- ---------- --------- ---------
1 1 2 1
Approach 2 (gather statistics, using change tracking capabilities):
Sample data setup is the same as in previous case (just drop everything incl. trigger and recreate tables from scratch), except that in this case we need to have PK on target to make sample work:
create table target (foo varchar(10) primary key, bar varchar(10));
Enable change tracking on database:
alter database Test
set change_tracking = on
Enable change tracking on target table:
alter table target
enable change_tracking
Merge data and grab statistics immediately after that, filtering by the change context to count only rows affected by merge:
begin transaction;
declare @batchID int, @chVersion bigint, @chContext varbinary(128);
set @batchID = 1;-- or select @batchID = batchID from ...;
SET @chVersion = change_tracking_current_version();
set @chContext = newid();
with change_tracking_context(@chContext)
insert into queue (foo, baz)
select foo, baz
from
(
merge into target t
using staging s
on t.foo = s.foo
when matched then
update
set t.bar = s.bar
when not matched then
insert (foo, bar)
values (s.foo, s.bar)
when not matched by source then
delete
output $action, inserted.foo, s.baz
) m(act, foo, baz)
where act = 'INSERT'
;
with ch(foo, op) as (
select foo, sys_change_operation
from changetable(changes target, @chVersion) ct
where sys_change_context = @chContext
)
insert into stats (batchID, inserted, updated, deleted)
select @batchID, [I], [U], [D]
from ch
pivot(count_big(foo) for op in ([I], [U], [D])) pvt
;
commit transaction;
Check the results:
select * from target;
select * from queue;
select * from stats;
They are same as in previous sample.
Target:
foo bar
---------- ----------
A AA
B BB
C CC
Queue:
foo baz
---------- ----------
C CCC
Stats:
batchID inserted updated deleted
-------- ---------- --------- ---------
1 1 2 1
I suggest extracting the stats by coding three independent AFTER INSERT / DELETE / UPDATE triggers along the lines of:
create trigger dbo.insert_trigger_target
on [dbo].[target]
after insert
as
insert into dbo.[stats] ([action],[count])
select 'insert', count(1)
from inserted;
go
create trigger dbo.update_trigger_target
on [dbo].[target]
after update
as
insert into dbo.[stats] ([action],[count])
select 'update', count(1) from inserted -- or deleted == after / before image, count will be the same
go
create trigger dbo.delete_trigger_target
on [dbo].[target]
after delete
as
insert into dbo.[stats] ([action],[count])
select 'delete', count(1) from deleted
go
If you need more context, put something in CONTEXT_INFO and pluck it out from the triggers.
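A minimal sketch of that hand-off (the batch id value is illustrative):
-- In the batch that runs the MERGE: stash a batch id in the session context
DECLARE @ctx varbinary(128) = CAST(42 AS varbinary(4));
SET CONTEXT_INFO @ctx;

-- Inside one of the triggers: read it back
DECLARE @batchID int = CAST(SUBSTRING(CONTEXT_INFO(), 1, 4) AS int);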
Now, I'm going to assert that the AFTER triggers are not that expensive, but you'll need to test that to be sure.
Having dealt with that, you'll be free to use the OUTPUT clause (NOT OUTPUT INTO) in the MERGE and then use that nested inside a select to subset the data that you want to go into the queue table.
Justification
Because of the need to access columns from both staging and target in order to build the data for queue, this HAS to be done using the OUTPUT option in MERGE, since nothing else has access to "both sides".
Then, if we have hijacked the OUTPUT clause for queue, how can we re-work that functionality? I think the AFTER triggers will work, given the requirements for stats that you have described. Indeed, the stats could be quite complex if required, given the images that are available. I'm asserting that the AFTER triggers are "not that expensive" since the data of both before and after must always be available in order that a transaction can be both COMMITTED OR ROLLED BACK - yes, the data needs to be scanned (even to get the count) but that doesn't seem like too much of a cost.
In my own analysis that scan added about 5% to the execution plan's base cost
Sound like a solution?
Have you considered ditching the merge and just doing an insert where not exists and an update? You could then use the output clause from the insert to populate your queue table.
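A hedged sketch of that shape, reusing the staging/target/queue sample tables from the answers above; note that OUTPUT on a plain INSERT can only reference inserted columns, so a staging-only column such as baz cannot be routed to the queue this way:
-- Update existing rows first
UPDATE t
SET t.bar = s.bar
FROM target t
JOIN staging s ON s.foo = t.foo;

-- Insert the new rows and capture them for the queue in the same statement
INSERT INTO target (foo, bar)
OUTPUT inserted.foo INTO queue (foo)
SELECT s.foo, s.bar
FROM staging s
WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.foo = s.foo);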
Import through a staging table might be more efficient with sequential rather than set-oriented processing. I would consider rewriting the MERGE into a stored procedure with a cursor scan. Then for each record you can have as many outputs as you like, plus any counts without a PIVOT, at the total cost of one staging table scan.
A stored procedure might also provide opportunities to split processing into smaller transactions, whereas triggers on bigger data sets might lead to transaction log overflow.
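A rough sketch of that sequential shape, reusing the staging/target/queue sample tables and keeping running counts instead of a PIVOT (all of this is an assumption, not the poster's actual code):
DECLARE @foo varchar(10), @bar varchar(10), @baz varchar(10),
        @cntIns bigint = 0, @cntUpd bigint = 0;

DECLARE staging_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT foo, bar, baz FROM staging;
OPEN staging_cur;
FETCH NEXT FROM staging_cur INTO @foo, @bar, @baz;
WHILE @@FETCH_STATUS = 0
BEGIN
    IF EXISTS (SELECT 1 FROM target WHERE foo = @foo)
    BEGIN
        UPDATE target SET bar = @bar WHERE foo = @foo;
        SET @cntUpd += 1;
    END
    ELSE
    BEGIN
        INSERT INTO target (foo, bar) VALUES (@foo, @bar);
        INSERT INTO queue (foo, baz) VALUES (@foo, @baz);
        SET @cntIns += 1;
    END
    FETCH NEXT FROM staging_cur INTO @foo, @bar, @baz;
END
CLOSE staging_cur;
DEALLOCATE staging_cur;

SELECT @cntIns AS inserted, @cntUpd AS updated;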
Unless I'm missing something, a simple insert command should meet all your requirements.
insert into queue
(foo, baz)
select staging.foo, staging.baz
from staging join target on staging.foo = target.foo
where whatever
This would happen after the merge into target.
For new records only, do this before the merge
insert into queue
(foo, baz)
select staging.foo, staging.baz
from staging left join target on staging.foo = target.foo
where target.foo is null

Check for duplicates in an insertion of a stored procedure

I am trying to write a stored procedure that inserts data, but with some fairly simple checks that seem like good practice.
The table currently has 300 columns: a sequential primary_key_id, a column that we want to check before inserting (say address), a child_of column used when there is new data (what we are inserting), and then the remaining 297 columns.
So let's say the table currently looks like this:
---------------------------------------------------------------------
| PK | Address      | child_of | other_attr_1 | other_attr_2 | ...
---------------------------------------------------------------------
| 1  | 123 Main St  | NULL     | ...          | ...          | ...
| 2  | 234 South Rd | NULL     | ...          | ...          | ...
| 3  | 345 West Rd  | NULL     | ...          | ...          | ...
---------------------------------------------------------------------
and we want to add this row, where the address has a new attribute new in the other_attr_1 column. We would use the child_of to reference the primary_key_id of the previous row record. This will allow for a basic history (I hope).
| 4  | 123 Main St  | 1        | new          | ...          | ...
How do I check for the duplication in the stored procedure? Do I compare each incoming parameter with what is already in the DB, if it is there?
Here is the code I have thus far:
USE [databaseINeed]
-- SET some_stuff ON --or off :)
-- ....
-- GO
CREATE Procedure [dbo].[insertNonDuplicatedData]
@address text, @other_attr_1 numeric = NULL, @other_attr_2 numeric = NULL, @other_attr_3 numeric = NULL,....;
AS
BEGIN TRY
-- If the address already exists, lets check for updated data
IF EXISTS (SELECT 1 FROM tableName WHERE address = @address)
BEGIN
-- Look at the incoming data vs the data already in the record
--HERE IS WHERE I THINK THE CODE SHOULD GO, WITH SOMETHING LIKE the following pseudocode:
if any attribute parameter values are different from what is already stored
then Insert into tableName (address, child_of, attrs) Values (@address, THE_PRIMARY_KEY_OF_THE_RECORD_THAT_SHARES_THE_ADDRESS, @other_attrs...)
RETURN
END
-- We don't have any data like this, so lets create a new record altogther
ELSE
BEGIN
-- Every time a SQL statement is executed it returns the number of rows that were affected. By using "SET NOCOUNT ON" within your stored procedure you can shut off these messages and reduce some of the traffic.
SET NOCOUNT ON
INSERT INTO tableName (address, other_attr_1, other_attr_2, other_attr_3, ...)
VALUES(@address, @other_attr_1, @other_attr_2, @other_attr_3, ...)
END
END TRY
BEGIN CATCH
...
END CATCH
I tried adding a CONSTRAINT on the table itself for all of the 297 attributes that need to be unique when checking against the address column via:
ALTER TABLE tableName ADD CONSTRAINT
uniqueAddressAttributes UNIQUE -- tried also with NONCLUSTERED
(other_attr_1,other_attr_2,...)
but I get an error
ERROR: cannot use more than 32 columns in an index SQL state: 54011
and I think I might be heading down the wrong path trying to rely on the unique constraint.
Surely having that many columns is not good practice; anyway, you can try using INTERSECT to check all the values at once:
-- I assume you get the last id to set the
-- THE_PRIMARY_KEY_OF_THE_RECORD_THAT_SHARES_THE_ADDRESS
DECLARE @PK int = (SELECT MAX(PK) FROM tableName WHERE address = @address)

-- No need for an EXISTS(), just check the @PK
IF @PK IS NOT NULL
BEGIN
    IF EXISTS(
        -- List of attributes from table
        -- Possibly very poor performance to get the row by ntext
        SELECT other_attr_1, other_attr_2 ... FROM tableName WHERE PK = @PK
        INTERSECT
        -- List of attributes from variables
        SELECT @other_attr_1, @other_attr_2 ...
    )
    BEGIN
        Insert into tableName (address, child_of, attrs) Values
        (@address, @PK, @other_attr_1, @other_attr_2 ...)
    END
END
With that many columns you could consider doing a hash of all your columns at time of insert, then storing the result in (yet another) column. In your stored procedure you could do the same hash to the input parameters, then check for hash collisions instead of doing field by field comparison on all those fields.
You'd have to probably do some data conversion to make your 300ish columns all nvarchar so they could be concatenated for input into the HASHBYTES function. Also, if any of the columns may be NULL, you'd have to consider how to treat them. For example, if an existing record has field 216 set to NULL and the row attempting to be added is exactly the same, except field 216 is an empty string, is that a match?
Also, with that many columns, the concatenation may run over the max input size of the hashbytes function, so you may need to break it up into multiple hashes of smaller chunks.
That all said, does your architecture really require this 300ish column structure? If you could get away from that, I wouldn't be having to get quite so creative here.
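A minimal sketch of the hashing idea with only a few of the columns and a hypothetical persisted row_hash column (NULL handling and the chunking mentioned above are left aside):
-- Hash of the comparable columns for the incoming row (CONCAT treats NULLs as empty strings)
DECLARE @incoming_hash varbinary(32) = HASHBYTES('SHA2_256',
    CONCAT(CAST(@address AS nvarchar(max)), '|', @other_attr_1, '|', @other_attr_2, '|', @other_attr_3));

-- The hash already covers the address, so collisions can be checked on row_hash alone
IF NOT EXISTS (SELECT 1 FROM tableName WHERE row_hash = @incoming_hash)
BEGIN
    INSERT INTO tableName (address, other_attr_1, other_attr_2, other_attr_3, row_hash)
    VALUES (@address, @other_attr_1, @other_attr_2, @other_attr_3, @incoming_hash);
END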
I don't have enough rep to comment, so I am posting as an answer instead.
Eric's SQL should be changed from IF EXISTS to IF NOT EXISTS
I believe the desired logic should be:
If there is an existing address record, check if any attributes are different.
If any attributes are different, insert a new address record, storing the primary key of the latest existing address record in the child_of column
Refactoring Chris & Eric's SQL:
USE [databaseINeed]
-- SET some_stuff ON --or off :)
-- ....
-- GO
CREATE Procedure [dbo].[insertNonDuplicatedData]
@address text, @other_attr_1 numeric = NULL, @other_attr_2 numeric = NULL, @other_attr_3 numeric = NULL,....;
AS
BEGIN TRY
-- If the address already exists, lets check for updated data
IF EXISTS (SELECT 1 FROM tableName WHERE address = @address)
BEGIN
-- Look at the incoming data vs the data already in the record
--HERE IS WHERE I THINK THE CODE SHOULD GO, WITH SOMETHING LIKE the following pseudocode:
DECLARE @PK int = (SELECT MAX(PK) FROM tableName WHERE address = @address)
IF NOT EXISTS(
-- List of attributes from table
-- Possibly very poor performance to get the row by ntext
SELECT other_attr_1, other_attr_2 ... FROM tableName WHERE PK = @PK
INTERSECT
-- List of attributes from variables
SELECT @other_attr_1, @other_attr_2 ...
)
BEGIN
-- @simplyink: existing address record has a different combination of (297 column) attribute values
-- at least one attribute column is different (no intersection)
Insert into tableName (address, child_of, attrs) Values
(@address, @PK, @other_attr_1, @other_attr_2 ...)
END
RETURN
END
-- We don't have any data like this, so lets create a new record altogther
ELSE
BEGIN
-- Every time a SQL statement is executed it returns the number of rows that were affected. By using "SET NOCOUNT ON" within your stored procedure you can shut off these messages and reduce some of the traffic.
SET NOCOUNT ON
INSERT INTO tableName (address, other_attr_1, other_attr_2, other_attr_3, ...)
VALUES(@address, @other_attr_1, @other_attr_2, @other_attr_3, ...)
END
END TRY
BEGIN CATCH
...
END CATCH