Create a procedure to merge data and avoid duplicates - sql

I am trying to create a SQL Server merge procedure that would allow me to merge new entries into the data set and nullify duplicates in the table. Both tables are of the same type. I am trying to perform a merge and avoid duplicates. The Id and Email will always be a one-to-one relation; however, the source table will sometimes send the same email with two different Ids. We want to keep only one record per person and nullify the email on the invalid record. My initial thought is to join the source table with the target table on email, check which emails have two occurrences, and nullify them, but how could I put this in one procedure?
Table 1 and Table 2:
Id | Email | First | Last | Building | Date |....
Example of duplicate:
1 | tst@tst.com | ...
2 | tst@tst.com | ...
Needed output:
1 | tst#tst.com
2 | null
Procedure:
CREATE PROCEDURE mergingTwo @TableType
AS
BEGIN
MERGE [target]
USING [source] ON [target].Id = [source].Id OR [target].Email = [source].Email
WHEN MATCHED THEN
UPDATE
SET
WHEN NOT MATCHED BY TARGET THEN
INSERT

You can do the MERGE first, then nullify the email in a second UPDATE, like:
with cte as (
    select id, row_number() over (partition by email order by id asc) as n_row
    from table_foo)
update table_foo
set email = null
from table_foo
inner join cte
    on cte.id = table_foo.id
    and cte.n_row > 1
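To get this into one procedure, you can combine the two steps: run the MERGE keyed on Id only, then de-duplicate by email in the same batch. A minimal sketch, assuming a user-defined table type for the source and the column list shown above; all object names here are illustrative, not from the original post:
CREATE PROCEDURE dbo.MergeAndDedupe
    @Source dbo.PersonTableType READONLY -- hypothetical table type
AS
BEGIN
    SET NOCOUNT ON;

    -- Step 1: merge on Id only; matching on Id OR Email makes the MERGE
    -- non-deterministic when one email arrives under two different Ids
    MERGE dbo.[target] AS t
    USING @Source AS s ON t.Id = s.Id
    WHEN MATCHED THEN
        UPDATE SET t.Email = s.Email, t.[First] = s.[First], t.[Last] = s.[Last]
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (Id, Email, [First], [Last])
        VALUES (s.Id, s.Email, s.[First], s.[Last]);

    -- Step 2: nullify the email on every record except the lowest Id per email
    WITH cte AS (
        SELECT Id, ROW_NUMBER() OVER (PARTITION BY Email ORDER BY Id ASC) AS n_row
        FROM dbo.[target]
        WHERE Email IS NOT NULL
    )
    UPDATE t
    SET Email = NULL
    FROM dbo.[target] AS t
    INNER JOIN cte ON cte.Id = t.Id
    WHERE cte.n_row > 1;
END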

Sounds like a job for a union (unless you really want those NULL entries).
SELECT Email FROM Table1
UNION
SELECT Email FROM Table2
;

Related

How to add a row and timestamp one SQL Server table based on a change in a single column of another SQL Server table

[UPDATE: 2/20/19]
I figured out a pretty trivial solution to this problem.
CREATE TRIGGER TriggerClaims_History ON Claims
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON

    INSERT INTO Claims_History (name, status, claim_date)
    SELECT name, status, claim_date
    FROM Claims
    EXCEPT
    SELECT name, status, claim_date FROM Claims_History
END
GO
I am standing up a SQL Server database for a project I am working on. Important info: I have 3 tables - enrollment, cancel, and claims. There are files located on a server that populate these tables every day. These files are NOT deltas (i.e. each new file placed on the server contains data from all previous files), and because of this I am able to simply drop all tables, recreate them, and repopulate them from the files each day. My question is regarding my claims table - since the tables will be dropped and created each night, I need a way to keep track of all the different status changes.
I'm struggling to figure out the best way to go about this.
I was thinking of creating a claims_history table that is NOT dropped each night. Essentially I'd want my claims_history table to be populated each time a new record is first added to the claims table. Then I'd want to scan the claims table and add a row to the claims_history table if and only if there was a change in the status column (i.e. claims.status != claims_history.status).
Day 1:
select * from claims
id | name | status
1 | jane doe | received
select * from claims_history
id | name | status | timestamp
1 | jane doe | received | datetime
Day 2:
select * from claims
id | name | status
1 | jane doe | processed
select * from claims_history
id | name | status | timestamp
1 | jane doe | received | datetime
1 | jane doe | processed | datetime
Is there a SQL script that can do this? I'd also like the timestamp field in the claims_history table to populate automatically each time a new row is added (status change). I know I could write a Python script to handle something like this, but I'd like to keep it in SQL if at all possible. Thank you.
According to your question, you need to create a trigger AFTER UPDATE on the column claims.status, and it is very simple to do that; see this link for how to write a basic trigger: create a simple sql server trigger.
Also, since manipulating DATETIME in a query can be a hassle, I would suggest you use Unix time instead of DATETIME: store the date as a number (BIGINT). To get the current time, simply use the query SELECT UNIX_TIMESTAMP() (that function is MySQL; in SQL Server, DATEDIFF(SECOND, '1970-01-01', GETUTCDATE()) returns the same value).
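For illustration, a rough sketch of such an AFTER UPDATE trigger in SQL Server syntax, storing Unix time; the table and column names are assumed from the question:
CREATE TRIGGER TriggerClaims_StatusChange ON Claims
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    IF UPDATE(status)
        INSERT INTO Claims_History (name, status, [timestamp])
        SELECT i.name, i.status,
               DATEDIFF(SECOND, '1970-01-01', GETUTCDATE()) -- current Unix time
        FROM inserted AS i
        INNER JOIN deleted AS d ON d.id = i.id
        WHERE i.status <> d.status; -- only rows whose status actually changed
END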
A very common approach is to use a staging table and a production (or final) table. All your ETLs truncate and load the staging table (volatile), and then you execute a stored procedure that adds only the new records to your final table. This requires that all the data you handle this way have some form of key that unequivocally identifies a row.
What happens if your files suddenly change format or are badly formatted? You would drop your table and be unable to load it back until you fix your ETL. This approach saves you from that, since the process will fail while loading the staging table and won't impact the final table. You can also keep deleted records for historic reasons instead of physically deleting them.
I prefer to separate the staging tables into their proper schema, for example:
CREATE SCHEMA Staging
GO
CREATE TABLE Staging.Claims (
    ID INT,
    Name VARCHAR(100),
    Status VARCHAR(100))
Now you do all your loads from your files into these staging tables, truncating them first:
TRUNCATE TABLE Staging.Claims
BULK INSERT Staging.Claims
FROM '\\SomeFile.csv'
WITH
--...
Once this table is loaded you execute a specific SP that applies the delta between the staging content and your final table. You can add whichever logic you want here, like doing only inserts for new records, or updating already existing records whose values changed. For example:
CREATE TABLE dbo.Claims (
    ClaimAutoID INT IDENTITY PRIMARY KEY,
    ClaimID INT,
    Name VARCHAR(100),
    Status VARCHAR(100),
    WasDeleted BIT DEFAULT 0,
    ModifiedDate DATETIME,
    CreatedDate DATETIME DEFAULT GETDATE())
GO
CREATE PROCEDURE Staging.UpdateClaims
AS
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION

        -- Update changed values
        UPDATE C SET
            Name = S.Name,
            Status = S.Status,
            ModifiedDate = GETDATE()
        FROM
            Staging.Claims AS S
            INNER JOIN dbo.Claims AS C ON S.ID = C.ClaimID -- This has to be by the key columns
        WHERE
            ISNULL(C.Name, '') <> ISNULL(S.Name, '') OR -- OR, not AND: update if any column changed
            ISNULL(C.Status, '') <> ISNULL(S.Status, '')

        -- Insert new records
        INSERT INTO dbo.Claims (
            ClaimID,
            Name,
            Status)
        SELECT
            ClaimID = S.ID,
            Name = S.Name,
            Status = S.Status
        FROM
            Staging.Claims AS S
        WHERE
            NOT EXISTS (SELECT 'not yet loaded' FROM dbo.Claims AS C WHERE S.ID = C.ClaimID) -- This has to be by the key columns

        -- Mark deleted records as deleted
        UPDATE C SET
            WasDeleted = 1,
            ModifiedDate = GETDATE()
        FROM
            dbo.Claims AS C
        WHERE
            NOT EXISTS (SELECT 'not anymore on files' FROM Staging.Claims AS S WHERE S.ID = C.ClaimID) -- This has to be by the key columns

        COMMIT
    END TRY
    BEGIN CATCH
        DECLARE @v_ErrorMessage VARCHAR(MAX) = ERROR_MESSAGE()

        IF @@TRANCOUNT > 0
            ROLLBACK

        RAISERROR (@v_ErrorMessage, 16, 1)
    END CATCH
END
This way you always work with dbo.Claims and the records are never lost (just updated or inserted).
If you need to check the last status of a particular claim you can create a view:
CREATE VIEW dbo.vClaimLastStatus
AS
WITH ClaimsOrdered AS
(
    SELECT
        C.ClaimAutoID,
        C.ClaimID,
        C.Name,
        C.Status,
        C.ModifiedDate,
        C.CreatedDate,
        DateRanking = ROW_NUMBER() OVER (PARTITION BY C.ClaimID ORDER BY C.CreatedDate DESC)
    FROM
        dbo.Claims AS C
)
SELECT
    C.ClaimAutoID,
    C.ClaimID,
    C.Name,
    C.Status,
    C.ModifiedDate,
    C.CreatedDate
FROM
    ClaimsOrdered AS C
WHERE
    DateRanking = 1
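For example, to check the latest status of a single claim (illustrative):
SELECT Status, ModifiedDate
FROM dbo.vClaimLastStatus
WHERE ClaimID = 1;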

Update bulk number of records in oracle

I am new to SQL. Can someone help me with this requirement?
I have a table with 10000 records like this:
CompanyID Name
300001 A
300004 B
300005 C
300007 D
|
|
|
310000 XXX
And I have another list of company IDs with which I am going to update the above table (it is just an Excel sheet, not a table):
OldID NewID
300001 500001
300002 500002
300003 500003
300004 500004
300005 500005
|
|
310000 510000
My requirement is: if I find the CompanyID in the first table, I need to update it with the NewID, and if I don't find the CompanyID in the first table, I have to create a new row in the table with the NewID regardless of the OldID.
Is there any possibility to do both update and insert in a single query?
You're describing an "upsert" or MERGE statement, typically:
merge into table_a
using (<some_statement>)
on (<some_condition>)
when matched then
update
set ...
when not matched then
insert (<column_list>)
values (<column_list>);
However, a MERGE can't update a value that's referenced in the ON clause, which is what will be required in order to do what you're asking. You will, therefore, require two statements:
update table_to_be_updated t
set companyid = (select newid from new_table where oldid = t.companyid)
where exists (select 1 from new_table where oldid = t.companyid);

insert into table_to_be_updated
select newid
from new_table t
where not exists ( select 1
                   from table_to_be_updated
                   where t.newid = companyid );
If it's possible for a newid and an oldid to be the same then you're going to run into problems. This also assumes that your new table is unique on oldid and newid - it has to be unique in order to do what you want so I don't think this is an unreasonable assumption.
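Since the OldID/NewID mapping lives in an Excel sheet, you would first load it into a helper table. A sketch, where the table name and loading method are assumptions:
-- Hypothetical mapping table for the spreadsheet contents
CREATE TABLE company_id_map (
    oldid NUMBER PRIMARY KEY,
    newid NUMBER NOT NULL UNIQUE
);
-- Populate via SQL*Loader, an external table, or plain inserts:
INSERT INTO company_id_map (oldid, newid) VALUES (300001, 500001);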

Return rows of a table that actually changed in an UPDATE

Using Postgres, I can perform an update statement and return the rows affected by the command.
UPDATE accounts
SET status = merge_accounts.status,
field1 = merge_accounts.field1,
field2 = merge_accounts.field2,
etc.
FROM merge_accounts WHERE merge_accounts.uid = accounts.uid
RETURNING accounts.*
This will give me a list of all records that matched the WHERE clause, however will not tell me which rows were actually updated by the operation.
In this simplified use-case it would of course be trivial to simply add another guard AND status != 'Closed', however my real-world use-case involves updating potentially dozens of fields from a merge table with 10,000+ rows, and I want to be able to detect which rows were actually changed and which are identical to their previous version. (The expectation is that very few rows will actually have changed.)
The best I've got so far is
UPDATE accounts
SET x=..., y=...
FROM accounts AS old, merge_accounts
WHERE old.uid = accounts.uid
  AND merge_accounts.uid = accounts.uid
RETURNING accounts, old
Which will return a tuple of old and new rows that can then be diff'ed inside my Java codebase itself - however this requires significant additional network traffic and is potentially error prone.
The ideal scenario is to be able to have postgres return just the rows that actually had any values changed - is this possible?
Here on GitHub is a more real-world example of what I'm doing, incorporating some of the suggestions so far.
Using Postgres 9.1, but can use 9.4 if required. The requirements are effectively
Be able to perform an upsert of new data
Where we may only know the specific key/value pair to update on any given row
Get back a result containing just the rows that were actually changed by the upsert
Bonus - get a copy of the old records as well.
Since this question was opened I've gotten most of this working now, although I'm unsure if my approach is a good idea or not - it's a bit hacked together.
Only update rows that actually change
That saves expensive updates and expensive checks after the UPDATE.
To update every column with the new value provided (if anything changes):
UPDATE accounts a
SET (status, field1, field2) -- short syntax for ..
= (m.status, m.field1, m.field2) -- .. updating multiple columns
FROM merge_accounts m
WHERE m.uid = a.uid
AND (a.status IS DISTINCT FROM m.status OR
a.field1 IS DISTINCT FROM m.field1 OR
a.field2 IS DISTINCT FROM m.field2)
RETURNING a.*;
Due to PostgreSQL's MVCC model any change to a row writes a new row version. Updating a single column is almost as expensive as updating every column in the row at once. Rewriting the rest of the row comes at practically no cost, as soon as you have to update anything.
Details:
How do I (or can I) SELECT DISTINCT on multiple columns?
UPDATE a whole row in PL/pgSQL
Shorthand for whole rows
If the row types of accounts and merge_accounts are identical and you want to adopt everything from merge_accounts into accounts, there is a shortcut comparing the whole row type:
UPDATE accounts a
SET (status, field1, field2)
= (m.status, m.field1, m.field2)
FROM merge_accounts m
WHERE a.uid = m.uid
AND m IS DISTINCT FROM a
RETURNING a.*;
This even works for NULL values. Details in the manual.
But it's not going to work for your home-grown solution where (quoting your comment):
merge_accounts is identical, save that all non-pk columns are array types
It requires compatible row types, i.e. each column shares the same data type or there is at least an implicit cast between the two types.
For your special case
UPDATE accounts a
SET (status, field1, field2)
= (COALESCE(m.status[1], a.status) -- default to original ..
, COALESCE(m.field1[1], a.field1) -- .. if m.column[1] IS NULL
, COALESCE(m.field2[1], a.field2))
FROM merge_accounts m
WHERE m.uid = a.uid
AND (m.status[1] IS NOT NULL AND a.status IS DISTINCT FROM m.status[1]
OR m.field1[1] IS NOT NULL AND a.field1 IS DISTINCT FROM m.field1[1]
OR m.field2[1] IS NOT NULL AND a.field2 IS DISTINCT FROM m.field2[1])
RETURNING a.*
m.status IS NOT NULL works if columns that shouldn't be updated are NULL in merge_accounts.
m.status <> '{}' if you operate with empty arrays.
m.status[1] IS NOT NULL covers both options.
Related:
Return pre-UPDATE column values using SQL only
If you aren't relying on side-effects of the update, only update the records that need to change:
UPDATE accounts
SET status = merge_accounts.status,
field1 = merge_accounts.field1,
field2 = merge_accounts.field2,
etc.
FROM merge_accounts WHERE merge_accounts.uid = accounts.uid
AND NOT (status IS NOT DISTINCT FROM merge_accounts.status
AND field1 IS NOT DISTINCT FROM merge_accounts.field1
AND field2 IS NOT DISTINCT FROM merge_accounts.field2
)
RETURNING accounts.*
I would recommend using the information_schema.columns table to introspect the columns dynamically, and then use those within a plpgsql function to dynamically generate the UPDATE statement.
i.e. this DDL:
create table foo
(
id serial,
val integer,
name text
);
insert into foo (val, name) VALUES (10, 'foo'), (20, 'bar'), (30, 'baz');
And this query:
select column_name
from information_schema.columns
where table_name = 'foo'
order by ordinal_position;
will yield the columns for the table in the order that they were defined in the table DDL.
Essentially you would use the above SELECT within the function to dynamically build up your UPDATE statement by iterating over the results of the above SELECT in a FOR LOOP to dynamically build up both the SET and WHERE clauses.
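A rough plpgsql sketch of that idea; keying on id and reading from a staging table named foo_staging are assumptions, not part of the original answer:
CREATE OR REPLACE FUNCTION update_foo_dynamic()
RETURNS void AS
$$
DECLARE
    col  record;
    sets text := '';
    difs text := '';
BEGIN
    -- Collect every non-key column in table-definition order
    FOR col IN
        SELECT column_name
        FROM information_schema.columns
        WHERE table_name = 'foo' AND column_name <> 'id'
        ORDER BY ordinal_position
    LOOP
        sets := sets || format('%1$I = s.%1$I, ', col.column_name);
        difs := difs || format('t.%1$I IS DISTINCT FROM s.%1$I OR ', col.column_name);
    END LOOP;

    -- Trim the trailing separators and run the generated UPDATE
    EXECUTE 'UPDATE foo t SET ' || left(sets, -2)
         || ' FROM foo_staging s WHERE s.id = t.id AND ('
         || left(difs, -4) || ')';
END;
$$ LANGUAGE plpgsql;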
Some variation of this?
SELECT * FROM old;
id | val
----+-----
1 | 1
2 | 2
4 | 5
5 | 1
6 | 2
SELECT * FROM new;
id | val
----+-----
1 | 2
2 | 2
3 | 2
5 | 1
6 | 1
SELECT * FROM old JOIN new ON old.id = new.id;
id | val | id | val
----+-----+----+-----
1 | 1 | 1 | 2
2 | 2 | 2 | 2
5 | 1 | 5 | 1
6 | 2 | 6 | 1
(4 rows)
WITH sel AS (
SELECT o.id , o.val FROM old o JOIN new n ON o.id=n.id ),
upd AS (
UPDATE old SET val = new.val FROM new WHERE new.id=old.id RETURNING old.* )
SELECT * from sel, upd WHERE sel.id = upd.id AND sel.val <> upd.val;
id | val | id | val
----+-----+----+-----
1 | 1 | 1 | 2
6 | 2 | 6 | 1
(2 rows)
Refer to this SO answer and read the entire discussion.
If you are updating a single table and want to know if the row is actually changed you can use this query:
with rows_affected as (
update mytable set (field1, field2, field3)=('value1', 'value2', 3) where id=1 returning *
)
select count(*)>0 as is_modified from rows_affected
join mytable on mytable.id=rows_affected.id
where rows_affected is distinct from mytable;
And you can wrap your existing queries into this one without the need to modify the actual update statements.

SQL Multiple Row Insert w/ multiple selects from different tables

I am trying to do a multiple-row insert based on values that I am pulling from another table. Basically I need to give all existing users access to a service that previously had access to a different one. Table1 will take the data and run a job to do this.
INSERT INTO Table1 (id, serv_id, clnt_alias_id, serv_cat_rqst_stat)
SELECT
    (SELECT MAX(id) + 1
     FROM Table1),
    '33', -- The new service id
    clnt_alias_id,
    'PI' -- The code to let the job know to grant access
FROM Table2
WHERE serv_id = '11' -- The old service id
I am getting a Primary key constraint error on id.
Please help.
Thanks,
Colin
This query is impossible. The max(id) sub-select will evaluate only ONCE and return the same value for all rows in the parent query:
MariaDB [test]> create table foo (x int);
MariaDB [test]> insert into foo values (1), (2), (3);
MariaDB [test]> select *, (select max(x)+1 from foo) from foo;
+------+----------------------------+
| x | (select max(x)+1 from foo) |
+------+----------------------------+
| 1 | 4 |
| 2 | 4 |
| 3 | 4 |
+------+----------------------------+
3 rows in set (0.04 sec)
You will have to run your query multiple times, once for each record you're trying to copy. That way the max(id) will get the ID from the previous query.
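Alternatively, you can generate distinct ids in a single set-based statement by offsetting MAX(id) with ROW_NUMBER(). A sketch, assuming your engine supports window functions and that nothing else inserts into Table1 concurrently:
INSERT INTO Table1 (id, serv_id, clnt_alias_id, serv_cat_rqst_stat)
SELECT
    (SELECT MAX(id) FROM Table1) + ROW_NUMBER() OVER (ORDER BY clnt_alias_id),
    '33',
    clnt_alias_id,
    'PI'
FROM Table2
WHERE serv_id = '11';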
Is there a requirement that Table1.id be incremental ints? If not, just add the clnt_alias_id to Max(id). This is a nasty workaround though, and you should really try to get that column's type changed to auto_increment, like Marc B suggested.

Reset or Update Row Position Integer in Database Table

I am working on a stored procedure in SQL Server 2008 for resetting an integer column in a database table. This integer column stores or persists the display order of the item rows. Users are able to drag and drop items in a particular sort order and we persist that order in the database table using this "Order Rank Integer".
Display queries for items always append a "ORDER BY OrderRankInt" when retrieving data so the user sees the items in the order they previously specified.
The problem is that this integer column collects a lot of duplicate values after the table items are re-ordered a bit. Hence...
Table
--------
Name | OrderRankInt
a | 1
b | 2
c | 3
d | 4
e | 5
f | 6
After a lot of reordering by the user becomes....
Table
--------
Name | OrderRankInt
a | 1
b | 2
c | 2
d | 2
e | 2
f | 6
These duplicates are primarily because of insertions and user specified order numbers. We're not trying to prevent duplicate order ranks, but we'd like a way to 'Fix' the table on item inserts/modifies.
Is there a way I can reset the OrderRankInt column with a single UPDATE Query?
Or do I need to use a cursor? What would the syntax for that cursor look like?
Thanks,
Kervin
EDIT
Update with Remus Rusanu solution. Thanks!!
CREATE PROCEDURE EPC_FixTableOrder
    @sectionId int = 0
AS
BEGIN
    -- "Common Table Expression" to append a 'Row Number' to the table
    WITH tempTable AS
    (
        SELECT OrderRankInt, ROW_NUMBER() OVER (ORDER BY OrderRankInt) AS rn
        FROM dbo.[Table]
        WHERE sectionId = @sectionId -- Fix for a specified section
    )
    UPDATE tempTable
    SET OrderRankInt = rn; -- Set the Order number to the row number via CTE
END
GO
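To run the fix for a single section (illustrative section id):
EXEC EPC_FixTableOrder @sectionId = 5;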
with cte as (
    select OrderId, row_number() over (order by Name) as rn
    from Table)
update cte
set OrderId = rn;
This doesn't account for any foreign key relationships; I hope you have taken care of those.
Fake it. Make the column nullable, set the values to NULL, alter it to be an autonumber, and then turn off autonumber and nullable.
(You could skip the nullable steps.)