"Merging" two tables in T-SQL - replacing or preserving duplicate IDs - sql

I have a web application that uses a fairly large table (millions of rows, about 30 columns). Let's call that TableA. Among the 30 columns, this table has a primary key named "id", and another column named "campaignID".
As part of the application, users are able to upload new sets of data pertaining to new "campaigns".
These data sets have the same structure as TableA, but typically only about 10,000-20,000 rows.
Every row in a new data set will have a unique "id", but they'll all share the same campaignID. In other words, the user is loading the complete data for a new "campaign", so all 10,000 rows have the same "campaignID".
Usually, users are uploading data for a NEW campaign, so there are no rows in TableA with the same campaignID. Since the "id" is unique to each campaign, the id of every row of new data will be unique in TableA.
However, in the rare case where a user tries to load a new set of rows for a "campaign" that's already in the database, the requirement was to remove all the old rows for that campaign from TableA first, and then insert the new rows from the new data set.
So, my stored procedure was simple:
1. BULK INSERT the new data into a temporary table (#TableB)
2. DELETE any existing rows in TableA with the same campaignID
3. INSERT INTO TableA ([columns]) SELECT [columns] FROM #TableB
4. DROP #TableB
This worked just fine.
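In rough T-SQL, the existing procedure looks something like this (just a sketch: the file path is a placeholder and col1 stands in for the remaining ~28 columns):
CREATE TABLE #TableB (id int PRIMARY KEY, campaignID int, col1 varchar(50))  -- same structure as TableA

BULK INSERT #TableB FROM 'C:\uploads\new_campaign.dat'  -- placeholder path

-- option 1 behavior: wipe the campaign, then load the new rows
DELETE FROM TableA
WHERE campaignID IN (SELECT DISTINCT campaignID FROM #TableB)

INSERT INTO TableA (id, campaignID, col1)
SELECT id, campaignID, col1 FROM #TableB

DROP TABLE #TableB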
But the new requirement is to give users 3 options when they upload new data for handling "duplicates" - instances where the user is uploading data for a campaign that's already in TableA.
Option 1: Remove ALL data in TableA with the same campaignID, then insert all the new data from #TableB. (This is the old behavior. With this option, there will never be duplicates.)
Option 2: If a row in #TableB has the same id as a row in TableA, then update that row in TableA with the row from #TableB. (Effectively, this is "replacing" the old data with the new data.)
Option 3: If a row in #TableB has the same id as a row in TableA, then ignore that row in #TableB. (Essentially, this preserves the original data and ignores the new data.)
A user doesn't get to choose this on a row-by-row basis. She chooses how the data will be merged, and this logic is applied to the entire data set.
In a similar application I worked on that used MySQL, I used the "LOAD DATA INFILE" function, with the "REPLACE" or "IGNORE" option. But I don't know how to do this with SQL Server/T-SQL.
Any solution needs to be efficient enough to handle the fact that TableA has millions of rows, and #TableB (the new data set) may have 10k-20k rows.
I googled and found the MERGE statement (which seems to be supported in SQL Server 2008 and later), but I only have access to SQL Server 2005.
In rough pseudocode, I need something like this:
If user selects option 1:
[I'm all set here - I have this working]
If user selects option 2 (replace):
merge into TableA as Target
using #TableB as Source
on TableA.id=#TableB.id
when matched then
update row in TableA with row from #TableB
when not matched then
insert row from #TableB into TableA
If user selects option 3 (preserve):
merge into TableA as Target
using #TableB as Source
on TableA.id=#TableB.id
when matched then
do nothing
when not matched then
insert row from #TableB into TableA

How about this?
option 2:
begin tran;
delete from tablea where exists (select 1 from tableb where tablea.id=tableb.id);
insert into tablea select * from tableb;
commit tran;
option 3:
begin tran;
delete from tableb where exists (select 1 from tablea where tablea.id=tableb.id);
insert into tablea select * from tableb;
commit tran;
As for performance, so long as the id field(s) in tablea (the big table) are indexed, you should be fine.
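If you'd rather not delete rows from the staging table, option 3 can also be written as a single INSERT that skips ids already present in TableA - a sketch in the same style as above (like the answer, it assumes the column orders match):
begin tran;
insert into tablea
select b.*
from tableb b
where not exists (select 1 from tablea a where a.id = b.id);
commit tran;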

Why are you using upserts when he claims he wanted a MERGE? MERGE in SQL 2008 is faster and more efficient.
I would let the MERGE handle the differences.
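For anyone who does have SQL Server 2008 or later, options 2 and 3 could be written with MERGE roughly like this (a sketch only; col1 and col2 stand in for the real ~30 columns):
-- option 2 (replace): update matched rows, insert the rest
MERGE TableA AS Target
USING #TableB AS Source
    ON Target.id = Source.id
WHEN MATCHED THEN
    UPDATE SET Target.campaignID = Source.campaignID,
               Target.col1 = Source.col1,
               Target.col2 = Source.col2
WHEN NOT MATCHED THEN
    INSERT (id, campaignID, col1, col2)
    VALUES (Source.id, Source.campaignID, Source.col1, Source.col2);
-- option 3 (preserve): the same statement with the WHEN MATCHED branch removed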

Related

Update trigger select fields from same row

I want an update trigger on a specific field. If that field's value is changed, I want to insert into a different table, selecting all values of the row where the update was made, even though only one field changed.
Example:
id    value1    value2
1     abc       efg
If value1 is updated to hij, I want to select id (1), value1 (hij) and value2 (efg) and insert them into a different table.
I cannot do inserted.Id or inserted.value2, since those fields are not the ones being updated.
NOTE: Only one field is updated; the other field values are the same before and after. In my question I have just used an example, but in real life a record will be inserted and I am expected to insert the same values into a different table. However, upon insert the record won't be approved until later; when the approved field value is changed, that is when I am expected to bring the values from the other fields over to the different table.
In your UPDATE trigger, you have access to the Deleted and Inserted pseudo tables which contain the old values (before the UPDATE) and the new ones after the UPDATE.
So you should be able to write something like this:
CREATE TRIGGER trg_Updated
ON dbo.YourTableName
FOR UPDATE
AS
INSERT INTO dbo.ThisOtherTableOfYours(Id, Value1, Value2)
SELECT
i.Id, i.Value1, i.Value2
FROM
Inserted i
INNER JOIN
Deleted d ON i.Id = d.Id
WHERE
i.Value1 <> d.Value1
The SELECT basically joins the two pseudo tables with the old and new values, and selects those rows which have a difference in the Value1 column.
From those columns, the new values after the update are being inserted into your other table. And the Inserted table does contain ALL columns (with their new values) from your table - not just those that have been actually updated - ALL of them!
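For example (a hypothetical test using the sample data from the question), an update like this fires the trigger and copies the complete new row - id 1, value1 'hij', value2 'efg' - into the other table:
UPDATE dbo.YourTableName
SET Value1 = 'hij'
WHERE Id = 1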
You should use a simple trigger (in T-SQL syntax):
CREATE TRIGGER NAME ON TABLENAME AFTER UPDATE AS
-- copy the old values of every row whose value1 was just changed to 'hij'
INSERT INTO OTHERTABLE (id, value1, value2)
SELECT d.id, d.value1, d.value2
FROM Deleted d
INNER JOIN Inserted i ON i.id = d.id
WHERE i.value1 = 'hij'

SQL Triggers - Deleted or Updated? or maybe something else?

I am trying to figure out which I need to use here: deleted, inserted or updated.
Basically, I need to write some data to the history table when the main table is updated, and only if the status changes from something to either pending or active.
This is what I have now:
ALTER TRIGGER [dbo].[trg_SourceHistory] ON [dbo].[tblSource]
FOR UPDATE AS
DECLARE #statusOldValue char(1)
DECLARE #statusNewValue char(1)
SELECT #statusOldValue = statusCode FROM deleted
SELECT #statusNewValue= statusCode FROM updated
IF (#statusOldValue <> #statusNewValue) AND
(#statusOldValue = 'P' or #statusOldValue = 'A')
BEGIN TRY
INSERT * INTO tblHistoryTable)
select * from [DELETED]
So I want the new data to stay in the main table, and the history table to be updated with what is being overwritten. Right now it just copies the same info over, so after the update both my tables have the same data.
There are only the Inserted and Deleted pseudo tables - there's no Updated.
For an UPDATE, Inserted contains the new values (after the update) while Deleted contains the old values before the update.
Also be aware that the trigger is fired once per statement - not once for each row. So both pseudo tables will potentially contain multiple rows! Don't just assume a single row and assign it to a variable - this
SELECT #statusOldValue = statusCode FROM deleted
SELECT #statusNewValue = statusCode FROM updated
will fail if you have multiple rows! You need to write your triggers in such a fashion that they work with multiple rows in Inserted and Deleted!
Update: yes - there IS a much better way to write this:
ALTER TRIGGER [dbo].[trg_SourceHistory] ON [dbo].[tblSource]
FOR UPDATE
AS
INSERT INTO dbo.tblHistoryTable(Col1, Col2, Col3, ...., ColN)
SELECT Col1, Col2, Col3, ..... ColN
FROM Deleted d
INNER JOIN Inserted i ON i.PrimaryKey = d.PrimaryKey
WHERE i.statusCode <> d.statusCode
AND d.statusCode IN ('A', 'P')
Basically:
- explicitly specify the columns you want to insert - both in the INSERT statement as well as the SELECT statement retrieving the data to insert - to avoid any nasty surprises
- create an INNER JOIN between the Inserted and Deleted pseudo tables to get all rows that were updated
- specify all other conditions (different status codes etc.) in the WHERE clause of the SELECT
This solution works for batches of rows being updated - it won't fail on a multi-row update.
You need to use both the inserted and deleted tables together to check for records that:
1. Already existed (to check it's not an insert)
2. Still exists (to check it's not a delete)
3. The Status field changed
You also need to make sure you do that in a set-based approach; as per marc_s's answer, triggers are not single-record processes.
INSERT INTO
tblHistoryTable
SELECT
deleted.*
FROM
inserted
INNER JOIN
deleted
ON inserted.PrimaryKey = deleted.PrimaryKey
WHERE
inserted.StatusCode <> deleted.StatusCode
AND (inserted.StatusCode = 'P' OR inserted.StatusCode = 'A')
inserted = the new values
deleted = the old values
There is no updated table, you are looking for inserted.

How to fix this stored procedure problem

I have 2 tables. The following is just a stripped-down version of these tables.
TableA
Id <pk> incrementing
Name varchar(50)
TableB
TableAId <pk> non incrementing
Name varchar(50)
Now these tables have a relationship to each other.
Scenario
User 1 comes to my site and does some actions (in this case, adds rows to Table A). So I use SqlBulkCopy to copy all this data into Table A.
However, I also need to add the data to Table B, but I don't know the newly created Ids from Table A, as SqlBulkCopy won't return these.
So I am thinking of having a stored procedure that finds all the Ids that don't exist in Table B and then inserts them.
INSERT INTO TableB (TableAId , Name)
SELECT Id,Name FROM TableA as tableA
WHERE not exists( ...)
However, this comes with a problem. A user can delete something from Table B at any time, so if a user deletes a row and then another user (or even the same user) comes along and does something to Table A, my stored procedure will bring that deleted row back into Table B, since it will still exist in Table A but not in Table B and thus satisfy the stored procedure's condition.
So is there a better way of dealing with two tables that need to be updated when using bulk insert?
SQLBulkCopy complicates this so I'd consider using a staging table and an OUTPUT clause
Example, in a mixture of client pseudo code and SQL
create SQLConnection
Create #temptable
Bulkcopy to #temptable
Call proc on same SQLConnection
proc:
INSERT tableA (..)
OUTPUT INSERTED.key, .. INTO TableB
SELECT .. FROM #temptable
close connection
Notes:
- the temp table will be local to the connection and isolated
- the writes to A and B will be atomic
- overlapping or later writes don't affect what happens to A and B here
- emphasising the last point, A and B will only ever be populated from the set of rows in #temptable
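A more concrete sketch of the proc, using the column names from the question (it assumes TableA.Id is an IDENTITY column; also note SQL Server does not allow the OUTPUT ... INTO target table to have enabled triggers or to participate in a foreign key constraint):
CREATE PROCEDURE dbo.CopyStagingToTables
AS
-- insert into TableA and, in the same statement, write the generated Ids to TableB
INSERT INTO TableA (Name)
OUTPUT INSERTED.Id, INSERTED.Name INTO TableB (TableAId, Name)
SELECT Name
FROM #temptable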
Alternative:
Add another column to A and B called sessionid and use that to identify row batches.
One option would be to use SQL Server's OUTPUT clause:
INSERT YourTable (name)
OUTPUT INSERTED.*
VALUES ('NewName')
This will return the id, name of the inserted rows to the client, so you can use them in the insert operation for the second table.
Just as an alternative solution you could use database triggers to update the second table.
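If you go the trigger route, a rough sketch might look like this (column names taken from the stripped-down tables above; also note SqlBulkCopy only fires triggers when the SqlBulkCopyOptions.FireTriggers option is set):
CREATE TRIGGER trg_TableA_Insert ON TableA
AFTER INSERT
AS
-- copy newly inserted rows into TableB, skipping ids that already exist there
INSERT INTO TableB (TableAId, Name)
SELECT i.Id, i.Name
FROM Inserted i
WHERE NOT EXISTS (SELECT 1 FROM TableB b WHERE b.TableAId = i.Id)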

Is it wiser to use a function in between First and Next Insertions based on Select?

CREATE PROCEDURE add_values
AS
BEGIN
    INSERT INTO TableA
    SELECT id, name
    FROM TableC   -- this selection will return multiple records
END
While it inserts into TableA, I would also like to insert into another table (TableB) for each record that got inserted into TableA.
Note: The columns in TableA and TableB are different. Is it wise to call a function before inserting into TableB, as I would like to perform certain gets and sets based on the id inserted into TableA?
If you want to insert a set of rows into two tables, you'd have to store it in a temporary table first and then do the two INSERT statements from there:
SELECT id, name
INTO #TempTable
FROM TableC   -- this selection will return multiple records

INSERT INTO TableA
SELECT (fieldlist) FROM #TempTable

INSERT INTO TableB
SELECT (fieldlist) FROM #TempTable
Apart from marc_s's answer, one more way is:
First insert the needed records into Table A from Table C. Then pump the needed records from Table A into Table B, as sketched below.
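A rough sketch of that two-step approach (TableB's real columns differ per the question, so the column lists here are placeholders):
-- step 1: load TableA from TableC
INSERT INTO TableA (id, name)
SELECT id, name
FROM TableC

-- step 2: pump rows from TableA into TableB that aren't there yet
INSERT INTO TableB (id, name)
SELECT a.id, a.name
FROM TableA a
WHERE NOT EXISTS (SELECT 1 FROM TableB b WHERE b.id = a.id)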
Though many approaches have already been suggested by many people in the question you asked just 3 hours ago: How to Insert Records based on the Previous Insert

Row number in Sybase tables

Sybase db tables do not have a concept of self-updating row numbers. However, for one of the modules, I require a row number corresponding to each row in the database, such that max(Column) would always tell me the number of rows in the table.
I thought I'd introduce an int column and keep updating it to keep track of the row number. However, I'm having problems updating this column in the case of deletes. What SQL should I use in the delete trigger to update this column?
You can easily assign a unique number to each row by using an identity column. The identity can be a numeric or an integer (in ASE12+).
This will almost do what you require. There are certain circumstances in which you will get a gap in the identity sequence. (These are called "identity gaps", the best discussion on them is here). Also deletes will cause gaps in the sequence as you've identified.
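For example, a minimal ASE table with an identity column might look like this (table and column names are placeholders):
-- rownum is assigned automatically on each insert; gaps are possible as noted above
CREATE TABLE myTable
(
    rownum  numeric(10,0) identity,
    payload varchar(50)   not null
)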
Why do you need to use max(col) to get the number of rows in the table, when you could just use count(*)? If you're trying to get the last row from the table, then you can do
select * from table where column = (select max(column) from table).
Regarding the delete trigger to update a manually managed column, I think this would be a potential source of deadlocks, and many performance issues. Imagine you have 1 million rows in your table, and you delete row 1, that's 999999 rows you now have to update to subtract 1 from the id.
Delete trigger
CREATE TRIGGER tigger ON myTable FOR DELETE
AS
-- renumber the remaining rows: each id drops by the number of deleted rows that were below it
update myTable
set id = id - (select count(*) from deleted d where d.id < myTable.id)
To avoid locking problems
You could add an extra table (which joins to your primary table) like this:
CREATE TABLE rowCounter
(id int, -- foreign key to main table
rownum int)
... and use the rownum field from this table.
If you put the delete trigger on this table then you would hugely reduce the potential for locking problems.
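The renumbering trigger would then live on the small rowCounter table instead, for example (just a sketch):
CREATE TRIGGER rowCounter_delete ON rowCounter FOR DELETE
AS
-- close up the gaps left by the deleted rows, touching only rowCounter
update rowCounter
set rownum = rownum - (select count(*) from deleted d where d.rownum < rowCounter.rownum)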
Approximate solution?
Does the table need to keep its rownumbers up to date all the time?
If not, you could have a job which runs every minute or so, which checks for gaps in the rownum, and does an update.
Question: do the rownumbers have to reflect the order in which rows were inserted?
If not, you could do far fewer updates, but only updating the most recent rows, "moving" them into gaps.
Leave a comment if you would like me to post any SQL for these ideas.
I'm not sure why you would want to do this. You could experiment with using temporary tables and "select into" with an Identity column like below.
create table test
(
col1 int,
col2 varchar(3)
)
insert into test values (100, "abc")
insert into test values (111, "def")
insert into test values (222, "ghi")
insert into test values (300, "jkl")
insert into test values (400, "mno")
select rank = identity(10), col1 into #t1 from Test
select * from #t1
delete from test where col2="ghi"
select rank = identity(10), col1 into #t2 from Test
select * from #t2
drop table test
drop table #t1
drop table #t2
This would give you a dynamic id (of sorts)