I am doing some house-keeping on duplicate data. I have different tables like Recipes, Ingredients, and RecipeIngredients.
In the Ingredients table, users have previously added multiple ingredients with the same name/title, e.g., "chicken" will appear many times instead of just one. I want to remove the duplicates but still keep a reference to the recipe.
I am trying to use SQL MERGE but it is deleting the wrong data, and I have stared myself blind at it. What am I doing wrong? It's probably just some quick fix.
When I run the code below, I get this result:
Chicken Recipe
Chicken
Burger Recipe
Salt, Pepper, Patty
But what I really want is:
Chicken Recipe
Chicken, Salt
Burger Recipe
Salt, Pepper, Patty
The MERGE statement deletes the "Salt" from RecipeIngredient instead of removing the duplicate. What am I doing wrong?
-- create table structure
CREATE TABLE #Recipes (
Id int,
Title nvarchar(50)
)
CREATE TABLE #Ingredients (
Id int,
Title nvarchar(50)
)
CREATE TABLE #RecipeIngredients (
Id int,
Recipe_id int,
Ingredient_id int
)
-- load data
INSERT INTO #Recipes (Id,Title) VALUES (1,'Chicken Recipe');
INSERT INTO #Recipes (Id,Title) VALUES (2,'Burger Recipe');
INSERT INTO #Ingredients (Id,Title) VALUES (1,'Chicken');
INSERT INTO #Ingredients (Id,Title) VALUES (2,'Chicken'); -- duplicate ingredient
INSERT INTO #Ingredients (Id,Title) VALUES (3,'Salt');
INSERT INTO #Ingredients (Id,Title) VALUES (4,'Pepper');
INSERT INTO #Ingredients (Id,Title) VALUES (5,'Patty');
INSERT INTO #RecipeIngredients (Id,Recipe_id,Ingredient_id) VALUES (1,1,2); -- chicken has chicken
INSERT INTO #RecipeIngredients (Id,Recipe_id,Ingredient_id) VALUES (2,1,3); -- chicken has salt
INSERT INTO #RecipeIngredients (Id,Recipe_id,Ingredient_id) VALUES (3,2,3); -- burger has salt
INSERT INTO #RecipeIngredients (Id,Recipe_id,Ingredient_id) VALUES (4,2,4); -- burger has pepper
INSERT INTO #RecipeIngredients (Id,Recipe_id,Ingredient_id) VALUES (5,2,5); -- burger has patty
-- try to clean up
MERGE #RecipeIngredients
USING
(
SELECT MAX(id) as MyId
FROM #Ingredients
GROUP BY Title
) NewIngredients ON #RecipeIngredients.Id = NewIngredients.MyId
WHEN MATCHED THEN
UPDATE SET #RecipeIngredients.Ingredient_id = NewIngredients.MyId
WHEN NOT MATCHED BY SOURCE THEN DELETE;
GO
-- delete duplicate ingredients, i.e., those that no longer have a reference in #RecipeIngredients
DELETE FROM #Ingredients WHERE Id NOT IN (SELECT Ingredient_Id FROM #RecipeIngredients)
-- clean up
DROP TABLE #Recipes
DROP TABLE #Ingredients
DROP TABLE #RecipeIngredients
You should first update all the duplicate IDs to the single ID and then do the cleanup.
I've changed the winning-ID determination from MAX to MIN, since MIN stays stable even if another duplicate is inserted in between (hopefully you increment IDs as identity). Alternatively, you could use the SNAPSHOT isolation level to make MAX safe (or SERIALIZABLE, to stop new duplicates from being produced during this transaction). Also, the cleanup of the #Ingredients table should not use a NOT IN filter: by design it is OK to have unused ingredients, and users do not want to lose their data. So I've deleted the duplicates in the same way, with MIN(id).
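If you do want to guard against new duplicates arriving mid-cleanup, a sketch of that isolation wrapper (assuming SQL Server; the comment marks where the cleanup statements from below would go) looks like:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;
-- run the MERGE / DELETE cleanup statements here, so no new
-- duplicate ingredients can be committed while the cleanup runs
COMMIT TRANSACTION;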
Here is the updated MERGE statement, which points every Ingredient_id at the single surviving ID:
MERGE #RecipeIngredients as t
USING
(
SELECT id, min(id) over(partition by title) as MinId
FROM #Ingredients
) as NewIngredients
ON t.Ingredient_Id = NewIngredients.Id
WHEN MATCHED THEN
UPDATE SET t.Ingredient_id = NewIngredients.MinId;
Then I remove the duplicates from the #RecipeIngredients:
/*Cleanup duplicates from RecipeIngredients*/
delete t from (
select
row_number() over(
partition by
Recipe_id,
Ingredient_id
order by
id asc
) as rn
from #RecipeIngredients
) as t
where rn > 1
And finally cleanup the deduplicated records in #Ingredients table:
delete t from (
select
id,
min(id) over(partition by title) as minid
from #Ingredients
) as t
where id <> minid
Update:
I've added a more robust way to do the cleanup:
1. remove the duplicates from the #Ingredients table first;
2. capture the deleted records;
3. then update Ingredient_Id for those deleted IDs in the #RecipeIngredients table, purging any duplicates in it (which can be created after unification; I do not know if that is the case for you) with a MERGE statement.
Here's the new code for it. I've also added another duplicated ingredient in the #Ingredients table and another row with a different Ingredient_Id in the #RecipeIngredients table (to show the MERGE's delete branch).
/*Declare the table for unified ingredients*/
declare #deletedIngredients table (
id int,
unifiedId int
);
/*Cleanup of duplicate ingredients and
catch the deleted records with the corresponding unified Id
*/
with i_del as (
/*Leave only the first (by ID) record with the same name*/
select id, min(id) over(partition by title) as unifiedId
from #Ingredients
)
delete from i
/*Catch the deleted records and corresponding unified Ids*/
output
deleted.id,
i_del.unifiedId
into #deletedIngredients
from #Ingredients as i
join i_del
on i.id = i_del.id
/*Remove only duplicates where Id is not equal to the master record Id*/
where i.id <> i_del.unifiedId
;
/*Then do an update of IDs on the RecipeIngredients
and delete the duplicates from it (that can be created during the unification of Ingredients_Id)
*/
merge into #RecipeIngredients as t
using (
select
ri.id,
i.unifiedid,
/*Number the rows per Receipe_Id and new Ingredient_Id*/
row_number() over(
partition by
ri.Recipe_Id,
i.unifiedId
order by ri.id asc
) as rn
from #RecipeIngredients as ri
join #deletedIngredients as i
on ri.Ingredient_Id = i.id
) as s
on t.id = s.id
/*The first record should have the new unified id*/
when matched and s.rn = 1 then
update set ingredient_id = s.unifiedId
/*And unintentionally created duplicate should be removed*/
when matched and s.rn > 1 then delete
;
Related
I have two database tables. What I need to do is copy specific data from one storage to another, but also keep the mapping to the photos. The first part I can do easily by writing
INSERT INTO item (storage_id, price, quantity, description, document_id)
SELECT 10, price, quantity, description, document_id
FROM item
WHERE quantity >= 10 AND price <= 100
but after that the newly inserted items do not have photos. Note that the document_id field is unique among the not-yet-copied items.
Assuming id columns are auto-generated surrogate primary keys, like a serial or IDENTITY column.
Use a data-modifying CTE with RETURNING to make do with a single scan on each source table:
WITH sel AS (
SELECT id, price, quantity, description, document_id
FROM item
WHERE quantity >= 10
AND price <= 100
)
, ins_item AS (
INSERT INTO item
(storage_id, price, quantity, description, document_id)
SELECT 10 , price, quantity, description, document_id
FROM sel
RETURNING id, document_id -- document_id is UNIQUE in this set!
)
INSERT INTO item_photo
(item_id, date, size)
SELECT ii.id , ip.date, ip.size
FROM ins_item ii
JOIN sel s USING (document_id) -- link back to org item.id
JOIN item_photo ip ON ip.item_id = s.id; -- join to org item.id
CTE sel reads all we need from table item.
CTE ins_item inserts into table item. The RETURNING clause returns newly generated id values, together with the (unique!) document_id, so we can link back.
Finally, the outer INSERT inserts into item_photo. We can select matching rows after linking back to the original item.id.
But:
document_id field is unique for not copied items.
Does that guarantee we are dealing with unique document_id values?
Given that document_id is the same in the two sets, we can use it to ensure that, after the first copy, all duplicate entries that have photos get those photos copied across.
Note: this is still a dirty hack, but it will work. Ideally, with data synchronization, we make sure there is a reference or common key in all the target tables. You could also use output parameters to capture the new id values, or use a cursor or another looping construct to process the records one by one and copy the photos at the same time, instead of trying to update the photos after the initial copy stage.
This query will insert photos for items that do NOT have photos but another item with the same document_id does have photos.
INSERT INTO item_photo (item_id, "date", size)
SELECT "source_photo".item_id, "source_photo"."date", "source_photo". Size
FROM item "target_item"
INNER JOIN item "source_item" on "target_item".document_id = "source_item".document_id
INNER JOIN item_photo "source_photo" ON "source_item".id = "source_photo".item_id
WHERE "target_item".id <> "source_item".id
AND NOT EXISTS ( SELECT id FROM item_photo WHERE item_id = "target_item".id)
AND source_item.id IN (
SELECT MIN(p.item_id) as "item_id"
FROM item_photo p
INNER JOIN item i ON p.item_id = i.id
GROUP BY document_id
)
I wrote a stored procedure that inserts bulk data into a table using a MERGE statement.
The problem is that when I insert itemid values 1024, 1000, 1012, 1025 in that order, SQL Server gives them back in the order 1000, 1012, 1024, 1025.
I want the data back in the order I actually passed it.
Here is sample code. This will parse XML string into table object:
DECLARE #tblPurchase TABLE
(
Purchase_Detail_ID INT ,
Purchase_ID INT ,
Head_ID INT ,
Item_ID INT
);
INSERT INTO #tblPurchase (Purchase_Detail_ID, Purchase_ID, Head_ID, Item_ID)
SELECT
Tbl.Col.value('Purchase_Detail_ID[1]', 'INT') AS Purchase_Detail_ID,
Tbl.Col.value('Purchase_ID[1]', 'INT') AS Purchase_ID,
Tbl.Col.value('Head_ID[1]', 'INT') AS Head_ID,
Tbl.Col.value('Item_ID[1]', 'INT') AS Item_ID
FROM
#PurchaseDetailsXML.nodes('/documentelement/TRN_Purchase_Details') Tbl(Col)
This will insert bulk data into the TRN_Purchase_Details table:
MERGE TRN_Purchase_Details MTD
USING (SELECT
Purchase_Detail_ID,
Id AS Purchase_ID,
Head_ID, Item_ID
FROM
#tblPurchase
LEFT JOIN
#ChangeResult ON 1 = 1) AS TMTD ON MTD.Purchase_Detail_ID = TMTD.Purchase_Detail_ID
AND MTD.Purchase_ID = TMTD.Purchase_ID
WHEN MATCHED THEN
UPDATE SET MTD.Head_ID = TMTD.Head_ID,
MTD.Item_ID = TMTD.Item_ID
WHEN NOT MATCHED BY TARGET THEN
INSERT (Purchase_ID, Head_ID, Item_ID)
VALUES (Purchase_ID, Head_ID, Item_ID)
WHEN NOT MATCHED BY SOURCE AND
MTD.Purchase_ID = (SELECT TOP 1 Id
FROM #ChangeResult
WHERE Id > 0) THEN
DELETE;
Rows in a SQL table don't have any order. They come back in indeterminate order unless you specify an order by.
Try adding an identity column to your temporary table?
DECLARE #tblPurchase TABLE
(
ID int identity,
Purchase_Detail_ID INT ,
The identity column might capture the order of the XML elements.
If that doesn't work, you can calculate the position of the elements in the XML and store that position in the temporary table.
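One possible sketch of that (assuming SQL Server's XQuery support and an extra Position column added to #tblPurchase; the Position column and the `<<` node-order comparison are my additions, not from the original post):

INSERT INTO #tblPurchase (Position, Purchase_Detail_ID, Purchase_ID, Head_ID, Item_ID)
SELECT
    -- count the preceding siblings of each node to get its document position
    Tbl.Col.value('for $i in . return count(../*[. << $i]) + 1', 'INT') AS Position,
    Tbl.Col.value('Purchase_Detail_ID[1]', 'INT'),
    Tbl.Col.value('Purchase_ID[1]', 'INT'),
    Tbl.Col.value('Head_ID[1]', 'INT'),
    Tbl.Col.value('Item_ID[1]', 'INT')
FROM #PurchaseDetailsXML.nodes('/documentelement/TRN_Purchase_Details') Tbl(Col);

Ordering by Position then reproduces the element order from the XML regardless of how the rows are stored.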
As mentioned elsewhere, data in a table is stored as an unordered set. If you need to be able to go back to your table after data is inserted and determine the order that it was inserted, you'll have to add a column to the table schema to record that information.
It could be something as simple as adding an IDENTITY column, which will increment on each row addition, or perhaps a column with a DATETIME data type and a GETDATE() default value so you not only know the order rows were added, but exactly when that happened.
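A minimal sketch of that idea (the table and column names here are illustrative, not from the question):

CREATE TABLE Purchase_Log (
    InsertSeq  int IDENTITY(1,1),            -- increments on each row, recording insertion order
    InsertedAt datetime DEFAULT GETDATE(),   -- records when the row was added
    Item_ID    int
);

-- ORDER BY InsertSeq (or InsertedAt) reproduces the insertion order on the way out
SELECT Item_ID FROM Purchase_Log ORDER BY InsertSeq;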
I'm stuck with this since last week. I have two tables, where the id column of CustomerTbl correlates with CustomerID column of PurchaseTbl:
What I'm trying to achieve is I want to duplicate the data of the table from itself, but copy the newly generated id of CustomerTbl to PurchaseTbl's CustomerID
Just like from the screenshots above. Glad for any help :)
You may use the OUTPUT clause to access the new ID. But to access both the old ID and the new ID, you will need a MERGE statement: an INSERT statement's OUTPUT clause does not let you reference the source's old id.
First you need somewhere to store the old and new id: a mapping table. You may use a table variable or a temp table:
declare #out table
(
old_id int,
new_id int
)
then the merge statement with output clause
merge
CustomerTbl as t
using
(
select id, name
from CustomerTbl
) as s
on 1 = 2 -- force it to `false`, not matched
when not matched then
insert (name)
values (name)
output -- the output clause
s.id, -- old_id
inserted.id -- new_id
into #out (old_id, new_id);
After that, you just use #out, joining back on old_id, to obtain the new_id for PurchaseTbl:
insert into PurchaseTbl (CustomerID, Item, Price)
select o.new_id, p.Item, p.Price
from #out o
inner join PurchaseTbl p on o.old_id = p.CustomerID
Not sure what your end game is, but one way you could solve this is this:
INSERT INTO purchaseTbl ( customerid ,
item ,
price )
SELECT customerid + 3 ,
item ,
price
FROM purchaseTbl;
I have 2 tables and one nested table:
1. stores data about products and includes the following columns:
ITEM - product id(key)
STORE - store id(key)
PRICE
NORMAL_PRICE
DISCOUNTS - nested table with info about discounts, including the columns:
PromotionId(key)
PromotionDescription
PromotionEndDate
MinQty
DiscountedPrice
DiscountedPricePerMida
2. temp table with the new discounts, including the columns:
PROMOTIONID(key)
PRODUCTID(key)
PROMOTIONDESCRIPTION
PROMOTIONENDDATE
MINQTY
DISCOUNTEDPRICE
DISCOUNTEDPRICEPERMIDA
What I need to do is merge table 2 into table 1: if there is no match, insert; otherwise ignore.
(A match means: the product id matches between tables 1 and 2, and within that product's sub-table a PromotionId matches the PROMOTIONID from table 2.)
This is how far I've got; I'm having difficulty with the nested part, i.e., the ON clause and the INSERT clause:
MERGE INTO PRICES P
USING(SELECT * FROM TMP_PROMO)T
ON(P.ITEM=T.PRODUCTID AND P.STORE=50 AND P.DISCOUNTS.PROMOTIONID=T.PROMOTIONID)
WHEN NOT MATCHED THEN INSERT (P.DISCOUNTS)
VALUES(T.PROMOTIONID,
T.PROMOTIONDESCRIPTION,
T.PROMOTIONENDDATE,
T.MINQTY,
T.DISCOUNTEDPRICE,
T.DISCOUNTEDPRICEPERMIDA);
I know that this is wrong but I can't find anywhere how to do it
example:
Prices table:
row1(1,50,...,nested_table[(11,...),(12,...)])
row2(2,50,...,nested_table[(10,...),(12,...)])
new promo table:
(15,1,...)
(11,1,...)
new promo with id 15 will be added to row1 and row2
and promo with id 11 will not be added
Please help,
thanks
What you intend to do is not really a MERGE: you are adding a new promotion to each record that doesn't contain it yet.
Below is how you would approach it if you used a conventional child table instead of a nested table.
Setup (simplified to a minimum)
create table ITEM
(ITEM_ID NUMBER PRIMARY KEY);
create table ITEM_PROMO
(ITEM_ID NUMBER REFERENCES ITEM(ITEM_ID),
PROMO_ID NUMBER);
create table TMP_PROMO
(PROMO_ID NUMBER);
insert into ITEM values (1);
insert into ITEM values (2);
insert into ITEM_PROMO values (1,11);
insert into ITEM_PROMO values (1,12);
insert into ITEM_PROMO values (2,10);
insert into ITEM_PROMO values (2,12);
insert into TMP_PROMO values (15);
insert into TMP_PROMO values (11);
commit;
The first thing you need to find is which promotions are missing for each item.
Use a cross join to get all combinations, then keep only those where the promotion does not already EXIST for that ITEM_ID:
select ITEM.ITEM_ID, TMP_PROMO.PROMO_ID
from ITEM cross join TMP_PROMO
where NOT EXISTS (select NULL from ITEM_PROMO where ITEM_ID = ITEM.ITEM_ID and PROMO_ID = TMP_PROMO.PROMO_ID)
;
This gives as expected
ITEM_ID PROMO_ID
---------- ----------
2 11
1 15
2 15
Now simply add those new promotions:
INSERT INTO ITEM_PROMO
select ITEM.ITEM_ID, TMP_PROMO.PROMO_ID
from ITEM cross join TMP_PROMO
where NOT EXISTS (select NULL from ITEM_PROMO where ITEM_ID = ITEM.ITEM_ID and PROMO_ID = TMP_PROMO.PROMO_ID)
;
This should give you a hint of how to approach it while using nested tables (or how to change the DB design :)
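If you do stay with the nested table, Oracle's table collection expression lets you insert into the nested DISCOUNTS collection of one parent row. A hedged sketch against the original PRICES table (bind placeholders stand in for the values coming from TMP_PROMO; you would run this once per missing item/promo pair found by a query like the one above):

INSERT INTO TABLE (SELECT P.DISCOUNTS
                   FROM PRICES P
                   WHERE P.ITEM = :product_id
                     AND P.STORE = 50)
VALUES (:promotion_id, :promotion_description, :promotion_end_date,
        :min_qty, :discounted_price, :discounted_price_per_mida);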
I have an Access table of the form (I'm simplifying it a bit)
ID AutoNumber Primary Key
SchemeName Text (50)
SchemeNumber Text (15)
This contains some data eg...
ID SchemeName SchemeNumber
--------------------------------------------------------------------
714 Malcolm ABC123
80 Malcolm ABC123
96 Malcolms Scheme ABC123
101 Malcolms Scheme ABC123
98 Malcolms Scheme DEF888
654 Another Scheme BAR876
543 Whatever Scheme KJL111
etc...
Now. I want to remove duplicate names under the same SchemeNumber. But I want to leave the record which has the longest SchemeName for that scheme number. If there are duplicate records with the same longest length then I just want to leave only one, say, the lowest ID (but any one will do really). From the above example I would want to delete IDs 714, 80 and 101 (to leave only 96).
I thought this would be relatively easy to achieve, but it's turning into a bit of a nightmare! Thanks for any suggestions. I know I could loop it programmatically, but I'd rather have a single DELETE query.
See if this query returns the rows you want to keep:
SELECT r.SchemeNumber, r.SchemeName, Min(r.ID) AS MinOfID
FROM
(SELECT
SchemeNumber,
SchemeName,
Len(SchemeName) AS name_length,
ID
FROM tblSchemes
) AS r
INNER JOIN
(SELECT
SchemeNumber,
Max(Len(SchemeName)) AS name_length
FROM tblSchemes
GROUP BY SchemeNumber
) AS w
ON
(r.SchemeNumber = w.SchemeNumber)
AND (r.name_length = w.name_length)
GROUP BY r.SchemeNumber, r.SchemeName
ORDER BY r.SchemeName;
If so, save it as qrySchemes2Keep. Then create a DELETE query to discard rows from tblSchemes whose ID value is not found in qrySchemes2Keep.
DELETE
FROM tblSchemes AS s
WHERE Not Exists (SELECT * FROM qrySchemes2Keep WHERE MinOfID = s.ID);
Just beware, if you later use Access' query designer to make changes to that DELETE query, it may "helpfully" convert the SQL to something like this:
DELETE s.*, Exists (SELECT * FROM qrySchemes2Keep WHERE MinOfID = s.ID)
FROM tblSchemes AS s
WHERE (((Exists (SELECT * FROM qrySchemes2Keep WHERE MinOfID = s.ID))=False));
DELETE FROM Table t1
WHERE EXISTS (SELECT 1 from Table t2
WHERE t1.SchemeNumber = t2.SchemeNumber
AND Length(t2.SchemeName) > Length(t1.SchemeName)
)
Depending on your RDBMS, you may need a length function with a different name (Oracle: LENGTH, MySQL: LENGTH, SQL Server: LEN).
delete ShortScheme
from Scheme ShortScheme
join Scheme LongScheme
on ShortScheme.SchemeNumber = LongScheme.SchemeNumber
and (len(ShortScheme.SchemeName) < len(LongScheme.SchemeName) or (len(ShortScheme.SchemeName) = len(LongScheme.SchemeName) and ShortScheme.ID > LongScheme.ID))
(SQL Server flavored)
Now updated to include the specified tie resolution. You may get better performance doing it in two queries, though: first delete the schemes with shorter names, as in my original query, and then go back and delete the higher IDs where there was a tie in name length.
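That two-query variant could be sketched like this (same SQL Server flavor, same Scheme table as the single-query version above):

-- Step 1: remove every row that has a longer-named row for the same SchemeNumber
delete s
from Scheme s
join Scheme longer
  on s.SchemeNumber = longer.SchemeNumber
 and len(s.SchemeName) < len(longer.SchemeName);

-- Step 2: among the remaining (equal-length) ties, keep only the lowest ID
delete s
from Scheme s
join Scheme keeper
  on s.SchemeNumber = keeper.SchemeNumber
 and len(s.SchemeName) = len(keeper.SchemeName)
 and s.ID > keeper.ID;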
I'd do this in multiple steps. Large delete operations done in a single step make me too nervous -- what if you make a mistake? There's no SQL 'undo' statement.
-- Setup the data
DROP Table foo;
DROP Table bar;
DROP Table bat;
DROP Table baz;
CREATE TABLE foo (
id int(11) NOT NULL,
SchemeName varchar(50),
SchemeNumber varchar(15),
PRIMARY KEY (id)
);
insert into foo values (714, 'Malcolm', 'ABC123' );
insert into foo values (80, 'Malcolm', 'ABC123' );
insert into foo values (96, 'Malcolms Scheme', 'ABC123' );
insert into foo values (101, 'Malcolms Scheme', 'ABC123' );
insert into foo values (98, 'Malcolms Scheme', 'DEF888' );
insert into foo values (654, 'Another Scheme ', 'BAR876' );
insert into foo values (543, 'Whatever Scheme ', 'KJL111' );
-- Find all the records that have dups, find the longest one
create table bar as
select max(length(SchemeName)) as max_length, SchemeNumber
from foo
group by SchemeNumber
having count(*) > 1;
-- Find the one we want to keep
create table bat as
select min(a.id) as id, a.SchemeNumber
from foo a join bar b on a.SchemeNumber = b.SchemeNumber
and length(a.SchemeName) = b.max_length
group by SchemeNumber;
-- Select into this table all the rows to delete
create table baz as
select a.id from foo a join bat b on a.SchemeNumber = b.SchemeNumber
and a.id != b.id;
This will give you a new table containing only the rows you want to remove.
Now check these out and make sure that they contain only the rows you want deleted. This way you can make sure that when you do the delete, you know exactly what to expect. It should also be pretty fast.
Then, when you're ready, delete the rows with this command:
delete from foo where id in (select id from baz);
This seems like more work because of the extra tables, but it's safer and probably just as fast as the other ways. Plus, you can stop at any step and make sure the data is what you expect before you do any actual deletes.
If your platform supports ranking functions and common table expressions:
with cte as (
select row_number()
over (partition by SchemeNumber order by len(SchemeName) desc) as rn
from Table)
delete from cte where rn > 1;
try this:
Select * From Table t
Where Len(SchemeName) <
(Select Max(Len(Schemename))
From Table
Where SchemeNumber = t.SchemeNumber )
And Id >
(Select Min (Id)
From Table
Where SchemeNumber = t.SchemeNumber
And SchemeName = t.SchemeName)
or this:
Select * From Table t
Where Id >
(Select Min(Id) From Table
Where SchemeNumber = t.SchemeNumber
And Len(SchemeName) <
(Select Max(Len(Schemename))
From Table
Where SchemeNumber = t.SchemeNumber))
If either of these selects the records that should be deleted, just change it to a DELETE:
Delete
From Table t
Where Len(SchemeName) <
(Select Max(Len(Schemename))
From Table
Where SchemeNumber = t.SchemeNumber )
And Id >
(Select Min (Id)
From Table
Where SchemeNumber = t.SchemeNumber
And SchemeName = t.SchemeName)
or using the second construction:
Delete From Table t Where Id >
(Select Min(Id) From Table
Where SchemeNumber = t.SchemeNumber
And Len(SchemeName) <
(Select Max(Len(Schemename))
From Table
Where SchemeNumber = t.SchemeNumber))