BigQuery - How can i make Merge Statement Handle Duplicate rows? - sql

I currently have a table which uses compound columns to uniquely identify each row of data.. So they are times we get duplicate rows. I have an Merge Statement that copies data from the source table to the destination. Anytime we have duplicate rows on the Source Table. The Merge statement copies the Duplicate along with the other rows of data to the Destination. The merge statement is
MERGE `project.dataset.destination` T
USING `project.dataset.source` S
ON (T.department = S.department OR T.department IS NULL and S.department IS NULL) AND
(T.category = S.category OR T.category IS NULL AND S.category IS NULL) AND
(T.subcategory = S.subcategory OR T.subcategory IS NULL AND S.subcategory IS NULL) AND
(T.subset = S.subset OR T.subset IS NULL AND S.subset IS NULL) AND
(T.country = S.country OR T.country IS NULL AND S.country IS NULL) AND
(T.state = S.state OR T.state IS NULL AND S.state IS NULL) AND
(T.county = S.county OR T.county IS NULL AND S.country IS NULL) AND
(T.date = S.date OR T.date IS NULL AND S.date IS NULL)
WHEN NOT MATCHED AND ((department = "SPORT" AND subcategory IN ("FOOTBALL", "PONG")) AND
(country IN("USA", "CANADA") )) THEN
INSERT ROW
WHEN NOT MATCHED BY SOURCE THEN
DELETE
My question is; is there any way i can handle the Inserting of Duplicates to the Destination Table?
Or if the Duplicates rows is Inserting into the Destination. when this Merge statement runs the next day; is there anyway i can modify this Merge Statement to Delete any duplicate found on the Destination table?
Thanks in Advance

General guideline for a good SQL question: provide sample source table and dest table, then the expected results.
That said, based on the comments, 'duplicate' here means:
having the same row of data appear more than one on the source table
Deduping the source table could happen in a subquery like
MERGE `project.dataset.destination` T
USING (
SELECT ...
FROM `project.dataset.source`
GROUP BY <your_unique_key>
) S
ON ...

Related

How to solve this SQL specific merge problem

How to solve the error:
The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.
merge CARD_ALERTS as t
using #tblAlerts as s
on (t.Id = s.AlertId and t.CardId = s.CardId)
when not matched by target
then insert(Id, ExternalCodeHolder, CardId, IsCardOwner, IBAN, PAN, MinAmount, Currency, ByEmail, BySMS, IssueDate, IsActive)
values(s.AlertId, s.ExternalCodeHolder, s.CardId, s.IsCardOwner, s.IBAN, s.PAN, s.MinAmount, s.Currency, s.ByEmail, s.BySMS, getdate(), 1)
when matched
then update set t.ByEmail = s.ByEmail, t.BySMS = s.BySMS, IsActive = 1, t.MinAmount = s.MinAmount
when not matched by source and t.Id=#AlertId
then update set t.IsActive = 3
As explained in the error message and in the comments, multiple rows of the source table correspond to the same row in the target one.
This will show you which rows of the target table have multiple rows from the source table that try to update them:
select t.Id,t.CardId,count(*) as [count]
from CARD_ALERTS as t
inner join #tblAlerts as s
on (t.Id = s.AlertId and t.CardId = s.CardId)
group by t.Id,t.CardId
having count(*)>1
This will also show you the multiple rows of the source table:
select t.*,s.*
from CARD_ALERTS as t
inner join #tblAlerts as s on (t.Id = s.AlertId and t.CardId = s.CardId)
inner join
(
select ca.Id,ca.CardId
from CARD_ALERTS as ca
inner join #tblAlerts as s
on (ca.Id = s.AlertId and ca.CardId = s.CardId)
group by ca.Id,ca.CardId
having count(*)>1
)tkey on t.Id=tkey.Id and t.CardId=tkey.CardId

TSQL: Prevent Nulls update over existing data

I have 2 tables:
Orders (Table I update and want keep existing data, and prevent overwriting with nulls)
WRK_Table (This table is identical to Orders but can not guarantee all columns will have data when running update)
PK column 'Master_Ordernum'
There are many columns in the tables, the WRK_Table will have the PK, but data in other columns can not be counted on.
I want the WRK_Table to update Orders only with actual data and not Nulls.
I was thinking along this line:
UPDATE [Orders]
SET [Status] = CASE WHEN S.[Status] IS NOT NULL THEN S.[Status] END
FROM [WRK_TABLE] S
WHERE [Orders].[Master_Ordernum] = S.[Master_Ordernum]
The existing data in Orders is being updated with nulls
does this help?
update orders set name = work.name
from orders
inner join work on orders.id = work.id and work.Name is not null
Not the same col / table name but you should get the idea
EDIT
Your SQL, NOT TESTED
UPDATE Orders
SET [Status] = WRK_Table.[Status]
FROM [Orders]
inner join WRK_Table on [Orders].[Master_Ordernum] = [WRK_Table].[Master_Ordernum] and WRK_Table.[Status] is not null
If you want to test and undo your data just do something like (assuming you don't have 100s of rows)
begin tran
/* .. SQL THAT UPDATES ... */
select * from orders // review the data
rollback tran // Undo changes
If you're going to use a Case statement, you need to include an else to prevent it from just inserting that null. Try this - if the work table has a null, let it just overwrite the value with the existing value.
UPDATE [Orders]
SET [Status] = CASE WHEN S.[Status] IS NOT NULL THEN S.[Status] else [Orders].[Status] END
FROM [WRK_TABLE] S
WHERE [Orders].[Master_Ordernum] = S.[Master_Ordernum]
Not sure if you can use
Update my_table set col = ISNULL(#newval,#existingval) where id=#id
https://msdn.microsoft.com/en-us/library/ms184325.aspx

Copy values from one table to another in SQL

I have 2 tables. I need to update all rows of table 1 with the values in specific columns from table 2. They have the same structure.
UPDATE #TempTable
SET [MyColumn] =
(
SELECT [MyColumn]
FROM
[udf_AggregateIDs] (#YearId) AS [af]
INNER JOIN [MyForm] ON
(
[af].[FormID] = [MyForm].[FormID] AND
[af].[FormID] = #MyFormId
)
WHERE [Description] = [MyForm].[Description]
)
I get an error saying Subquery returned more than 1 value. I only added the where clause in because i thought sql is struggling to match the rows, but both tables have the same rows.
It should return multiple values because i'm trying to copy across all rows for MyColumn from the one table to the other.
Ideas?
is Description unique ?
select [Description], count(*) from [MyForm] group by [Description] having count(*)>1
You don't need a sub query..just join the tables..
same type of question has been answered here. Hope it helps.
Have to guess here because your query isn't self-documenting. Does MyColumn come from the function? Does #TempTable have a description column? Who knows, because you didn't prefix them with an alias? Try this. You may have to adjust since you know your schema and we don't.
UPDATE t
SET [MyColumn] = func.MyColumn -- have to guess here
FROM dbo.[udf_AggregateIDs] (#YearId) AS func
INNER JOIN dbo.MyForm AS f
ON func.FormID = f.FormID
INNER JOIN #TempTable AS t
ON t.Description = f.Description -- guessing here also
WHERE f.FormID = #MyFormID;

Compare tables and identify new or changed fields

New to SQL, would like to compare fields between a stg and src table. The identify any differences between the tables and assign transaction status of 'C' for change. Any new records will be set with 'A' for add.
STG_DM_CLIENT and SRC_DM_CLIENT
What is the best way to do it, would it be best to do a some form of union all. Unsure how to proceed, any assistance welcomed.
You can identify new records by using NOT IN or NOT EXISTS
update STG_DM_CLIENT SET TransactionStatus = 'A' WHERE ID IN
(select Id from STG_DM_CLIENT
where Id not in (select Id from SRC_DM_CLIENT))
Then, you can identify changed records by comparing fields:
update STG_DM_CLIENT SET TransactionStatus = 'C' WHERE ID IN
(select STG_DM_CLIENT.Id from STG_DM_CLIENT
join SRC_DM_CLIENT on SRC_DM_CLIENT.Id = STG_DM_CLIENT.Id
where (SRC_DM_CLIENT.Field1 != STG_DM_CLIENT.Field1
OR SRC_DM_CLIENT.Field2 != STG_DM_CLIENT.Field2 ...))
update
[STG]
set
TransactionStatus = CASE WHEN [SRC].id IS NULL THEN 'A' ELSE 'C' END
from
STG_DM_CLIENT AS [STG]
left join
SRC_DM_CLIENT AS [SRC]
ON [STG].id = [SRC].id -- Or whatever relates the records 1:1
WHERE
[SRC].id IS NULL
OR [STG].field1 <> [SRC].field1
OR [STG].field2 <> [SRC].field2
OR [STG].field3 <> [SRC].field3
...
OR [STG].fieldn <> [SRC].fieldn

Doing an Update Ignore in SQL Server 2005

I have a table where I wish to update some of the rows. All the fields are not null. I'm doing a sub-query, and I wish to update the table with the non-Null results.
See Below for my final answer:
In MySQL, I solve this problem by doing an UPDATE IGNORE. How do I make this work in SQL Server 2005? The sub-query uses a four-table Join to find the data to insert if it exists. The Update is being run against a table that could have 90,000+ records, so I need a solution that uses SQL, rather than having the Java program that's querying the database retrieve the results and then update those fields where we've got non-Null values.
Update: My query:
UPDATE #SearchResults SET geneSymbol = (
SELECT TOP 1 symbol.name FROM
GeneSymbol AS symbol JOIN GeneConnector AS geneJoin
ON symbol.id = geneJoin.geneSymbolID
JOIN Result AS sSeq ON geneJoin.sSeqID = sSeq.id
JOIN IndelConnector AS joiner ON joiner.sSeqID = sSeq.id
WHERE joiner.indelID = #SearchResults.id ORDER BY symbol.id ASC)
WHERE isSNV = 0
If I add "AND symbol.name IS NOT NULL" to either WHERE I get a SQL error. If I run it as is I get "adding null to a non-null column" errors. :-(
Thank you all, I ended up finding this:
UPDATE #SearchResults SET geneSymbol =
ISNULL ((SELECT TOP 1 symbol.name FROM
GeneSymbol AS symbol JOIN GeneConnector AS geneJoin
ON symbol.id = geneJoin.geneSymbolID
JOIN Result AS sSeq ON geneJoin.sSeqID = sSeq.id
JOIN IndelConnector AS joiner ON joiner.sSeqID = sSeq.id
WHERE joiner.indelID = #SearchResults.id ORDER BY symbol.id ASC), ' ')
WHERE isSNV = 0
While it would be better not to do anything in the null case (so I'm going to try to understand the other answers, and see if they're faster) setting the null cases to a blank answer also works, and that's what this does.
Note: Wrapping the ISNULL (...) with () leads to really obscure (and wrong) errors.
with UpdatedGenesDS (
select joiner.indelID, name, row_number() over (order by symbol.id asc) seq
from
GeneSymbol AS symbol JOIN GeneConnector AS geneJoin
ON symbol.id = geneJoin.geneSymbolID
JOIN Result AS sSeq ON geneJoin.sSeqID = sSeq.id
JOIN IndelConnector AS joiner ON joiner.sSeqID = sSeq.id
WHERE name is not null ORDER BY symbol.id ASC
)
update Genes
set geneSymbol = upd.name
from #SearchResults a
inner join UpdateGenesDs upd on a.id = b.intelID
where upd.seq =1 and isSNV = 0
this handles the null completely as all are filtered out by the where predicate (can also be filtered by join predicate if You wish. Is it what You are looking for?
Here's another option, where only those rows in #SearchResults that are succesfully joined will be udpated. If there are no null values in the underlying data, then the inner joins will pull in no null values, and you won't have to worry about filtering them out.
UPDATE #SearchResults
set geneSymbol = symbol.name
from #SearchResults sr
inner join IndelConnector AS joiner
on joiner.indelID = sr.id
inner join Result AS sSeq
on sSeq.id = joiner.sSeqID
inner join GeneConnector AS geneJoin
on geneJoin.sSeqID = sSeq.id
-- Get "lowest" (i.e. first if listed alphabetically) value of name for each id
inner join (select id, min(name) name
from GeneSymbol
group by id) symbol
on symbol.id = geneJoin.geneSymbolID
where isSNV = 0 -- Which table is this value from?
(There might be some syntax problems, without tables I can't debug it)