Copy data on parent and child table - sql

I have two database tables. What I need to do is copy specific data from one storage to another, but also keep the mapping to the photos. The first part I can do easily by writing
INSERT INTO item (storage_id, price, quantity, description, document_id)
SELECT 10, price, quantity, description, document_id
FROM item
WHERE quantity >= 10 AND price <= 100
but after that the newly inserted items do not have photos. Note that the document_id field is unique among the not-yet-copied items.
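For reference, a minimal sketch of the assumed table definitions (no DDL was posted, so the types and the generated id column are guesses):
CREATE TABLE item (
    id          serial PRIMARY KEY,  -- auto-generated surrogate key (assumed)
    storage_id  int,
    price       numeric,
    quantity    int,
    description text,
    document_id int
);
CREATE TABLE item_photo (
    item_id int REFERENCES item (id),  -- child rows point at item.id
    "date"  date,
    size    int
);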

Assuming the id columns are auto-generated surrogate primary keys (like a serial or IDENTITY column), use a data-modifying CTE with RETURNING to make do with a single scan on each source table:
WITH sel AS (
SELECT id, price, quantity, description, document_id
FROM item
WHERE quantity >= 10
AND price <= 100
)
, ins_item AS (
INSERT INTO item
(storage_id, price, quantity, description, document_id)
SELECT 10 , price, quantity, description, document_id
FROM sel
RETURNING id, document_id -- document_id is UNIQUE in this set!
)
INSERT INTO item_photo
(item_id, date, size)
SELECT ii.id , ip.date, ip.size
FROM ins_item ii
JOIN sel s USING (document_id) -- link back to org item.id
JOIN item_photo ip ON ip.item_id = s.id; -- join to org item.id
CTE sel reads everything we need from table item.
CTE ins_item inserts into table item. The RETURNING clause returns the newly generated id values together with the (unique!) document_id, so we can link back.
Finally, the outer INSERT inserts into item_photo. We can select matching rows after linking back to the original item.id.
Related:
Insert New Foreign Key Row for new Strings
But:
document_id field is unique for not copied items.
Does that guarantee we are dealing with unique document_id values?

Given that document_id is the same in the two sets, we can use it to ensure that, after the first copy, all duplicate entries that have photos get their photos copied across as well.
Note: this is still a dirty hack, but it will work. Ideally, with data synchronizations, we make sure there is a reference or common key in all the target tables. You could also use output parameters to capture the new id values, or use a cursor or other looping construct to process the records one by one and copy the photos at the same time, instead of trying to update the photos after the initial copy stage.
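For completeness, here is a rough sketch of that cursor/looping variant (T-SQL flavoured; it assumes id is an IDENTITY column and is purely illustrative, the set-based query below is what this answer actually uses):
DECLARE @old_id int, @new_id int;

DECLARE item_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT id FROM item WHERE quantity >= 10 AND price <= 100;

OPEN item_cur;
FETCH NEXT FROM item_cur INTO @old_id;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- copy the item into the new storage
    INSERT INTO item (storage_id, price, quantity, description, document_id)
    SELECT 10, price, quantity, description, document_id
    FROM item WHERE id = @old_id;

    SET @new_id = SCOPE_IDENTITY();  -- capture the freshly generated id

    -- copy its photos, pointing them at the new id
    INSERT INTO item_photo (item_id, "date", size)
    SELECT @new_id, "date", size
    FROM item_photo WHERE item_id = @old_id;

    FETCH NEXT FROM item_cur INTO @old_id;
END

CLOSE item_cur;
DEALLOCATE item_cur;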
This query will insert photos for items that do NOT have photos but another item with the same document_id does have photos.
INSERT INTO item_photo (item_id, "date", size)
SELECT "target_item".id, "source_photo"."date", "source_photo".size -- attach the photo to the target item, not the source
FROM item "target_item"
INNER JOIN item "source_item" ON "target_item".document_id = "source_item".document_id
INNER JOIN item_photo "source_photo" ON "source_item".id = "source_photo".item_id
WHERE "target_item".id <> "source_item".id
AND NOT EXISTS (SELECT id FROM item_photo WHERE item_id = "target_item".id)
AND "source_item".id IN (
SELECT MIN(p.item_id) AS "item_id"
FROM item_photo p
INNER JOIN item i ON p.item_id = i.id
GROUP BY document_id
)

How to use SQL MERGE to remove duplicates, and update data?

I am doing some housekeeping on duplicate data. I have different tables like Recipes, Ingredients, and RecipeIngredients.
In the Ingredients table, users have previously added multiple ingredients with the same name/title, e.g., "chicken" appears many times instead of just once. I want to remove the duplicates but still keep a reference to the recipe.
I am trying to use SQL MERGE but it is deleting the wrong data, and I have stared myself blind at it. What am I doing wrong? It's probably just some quick fix.
When I run the code below, I get this table relation:
Chicken Recipe: Chicken
Burger Recipe: Salt, Pepper, Patty
But what I really want is:
Chicken Recipe: Chicken, Salt
Burger Recipe: Salt, Pepper, Patty
The MERGE statement deletes the "Salt" from RecipeIngredient instead of removing the duplicate. What am I doing wrong?
-- create table structure
CREATE TABLE #Recipes (
Id int,
Title nvarchar(50)
)
CREATE TABLE #Ingredients (
Id int,
Title nvarchar(50)
)
CREATE TABLE #RecipeIngredients (
Id int,
Recipe_id int,
Ingredient_id int
)
-- load data
INSERT INTO #Recipes (Id,Title) VALUES (1,'Chicken Recipe');
INSERT INTO #Recipes (Id,Title) VALUES (2,'Burger Recipe');
INSERT INTO #Ingredients (Id,Title) VALUES (1,'Chicken');
INSERT INTO #Ingredients (Id,Title) VALUES (2,'Chicken'); -- duplicate ingredient
INSERT INTO #Ingredients (Id,Title) VALUES (3,'Salt');
INSERT INTO #Ingredients (Id,Title) VALUES (4,'Pepper');
INSERT INTO #Ingredients (Id,Title) VALUES (5,'Patty');
INSERT INTO #RecipeIngredients (Id,Recipe_id,Ingredient_id) VALUES (1,1,2); -- chicken has chicken
INSERT INTO #RecipeIngredients (Id,Recipe_id,Ingredient_id) VALUES (2,1,3); -- chicken has salt
INSERT INTO #RecipeIngredients (Id,Recipe_id,Ingredient_id) VALUES (3,2,3); -- burger has salt
INSERT INTO #RecipeIngredients (Id,Recipe_id,Ingredient_id) VALUES (4,2,4); -- burger has pepper
INSERT INTO #RecipeIngredients (Id,Recipe_id,Ingredient_id) VALUES (5,2,5); -- burger has patty
-- try to clean up
MERGE #RecipeIngredients
USING
(
SELECT MAX(id) as MyId
FROM #Ingredients
GROUP BY Title
) NewIngredients ON #RecipeIngredients.Id = NewIngredients.MyId
WHEN MATCHED THEN
UPDATE SET #RecipeIngredients.Ingredient_id = NewIngredients.MyId
WHEN NOT MATCHED BY SOURCE THEN DELETE;
GO
-- delete duplicate ingredients, i.e., those that no longer have a reference in table #RecipeIngredients
DELETE FROM #Ingredients WHERE Id NOT IN (SELECT Ingredient_Id FROM #RecipeIngredients)
-- clean up
DROP TABLE #Recipes
DROP TABLE #Ingredients
DROP TABLE #RecipeIngredients
You should first update all the duplicate IDs to the single surviving ID and then do the cleanup.
I've changed the winning-ID determination from MAX to MIN, since MIN keeps the same value even if a new duplicate is inserted in between (hopefully you increment IDs as an identity). Alternatively, you may use the SNAPSHOT isolation level to safely keep MAX (or SERIALIZABLE to stop more new duplicates from being produced during this transaction). Also, the cleanup of the #Ingredients table should not use the NOT IN filter, because by design it is OK to have unused ingredients and users do not want to lose their data. So I've deleted the duplicates in the same way, with MIN(id).
This is the updated MERGE statement, which sets Ingredient_id to the single surviving value:
MERGE #RecipeIngredients as t
USING
(
SELECT id, min(id) over(partition by title) as MinId
FROM #Ingredients
) as NewIngredients
ON t.Ingredient_Id = NewIngredients.Id
WHEN MATCHED THEN
UPDATE SET t.Ingredient_id = NewIngredients.MinId;
Then I remove the duplicates from the #RecipeIngredients:
/*Cleanup duplicates from RecipeIngredients*/
delete t from (
select
row_number() over(
partition by
Recipe_id,
Ingredient_id
order by
id asc
) as rn
from #RecipeIngredients
) as t
where rn > 1
And finally clean up the now-duplicate records in the #Ingredients table:
delete t from (
select
id,
min(id) over(partition by title) as minid
from #Ingredients
) as t
where id <> minid
And all the code in the db<>fiddle is here
UPD
I've added a more robust way to do the cleanup:
remove duplicates from the #Ingredients table first,
capture the deleted records,
then update Ingredient_Id for those deleted IDs in the #RecipeIngredients table, purging the duplicates in it (which can be created after the unification; I do not know if that is the case for you) with a MERGE statement.
Here's the new code, and the db<>fiddle for it. I've also added another duplicated ingredient to the #Ingredients table and another row with a different Ingredient_Id to the #RecipeIngredients table (to show MERGE's deletion part).
/*Declare the table for unified ingredients*/
declare @deletedIngredients table (
id int,
unifiedId int
);
/*Cleanup of duplicate ingredients and
catch the deleted records with the corresponding unified Id
*/
with i_del as (
/*Leave only the first (by ID) record with the same name*/
select id, min(id) over(partition by title) as unifiedId
from #Ingredients
)
delete from i
/*Catch the deleted records and corresponding unified Ids*/
output
deleted.id,
i_del.unifiedId
into @deletedIngredients
from #Ingredients as i
join i_del
on i.id = i_del.id
/*Remove only duplicates where Id is not equal to the master record Id*/
where i.id <> i_del.unifiedId
;
/*Then do an update of IDs on the RecipeIngredients
and delete the duplicates from it (that can be created during the unification of Ingredients_Id)
*/
merge into #RecipeIngredients as t
using (
select
ri.id,
i.unifiedid,
/*Number the rows per Receipe_Id and new Ingredient_Id*/
row_number() over(
partition by
ri.Recipe_Id,
i.unifiedId
order by ri.id asc
) as rn
from #RecipeIngredients as ri
join @deletedIngredients as i
on ri.Ingredient_Id = i.id
) as s
on t.id = s.id
/*The first record should have the new unified id*/
when matched and s.rn = 1 then
update set ingredient_id = s.unifiedId
/*And unintentionally created duplicate should be removed*/
when matched and s.rn > 1 then delete
;
commit;
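As a quick sanity check after the cleanup, a select like the following (illustrative only) should show each recipe with a single, de-duplicated ingredient list:
SELECT r.Title AS Recipe, i.Title AS Ingredient
FROM #Recipes r
JOIN #RecipeIngredients ri ON ri.Recipe_id = r.Id
JOIN #Ingredients i ON i.Id = ri.Ingredient_id
ORDER BY r.Title, i.Title;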

Column Not Allowed SQL

Error Code Screenshot (ShipDate is now the error)
For my school project we are to create a product database where the customer places an order, etc. I've compared my code to my classmates' and it's essentially the same, except I have fewer columns. This section of the code inserts the user input into the Orders table.
The 2nd-to-last column, OrderStatus, is where the * appears in the console. I apologize ahead of time if it looks messy; for some reason the formatting in the body doesn't carry over to the published post.
CODE:
INSERT INTO Orders
VALUES (OrderNum,
OrderDate,
CustID,
PNum,
UnitPrice,
QtyOrder,
TotalCost,
ShipDate,
QtyShipped,
OrderStatus,
NULL);
SELECT MaxNum,
SYSDATE,
&vCustID,
'&vPNum',
UnitPrice,
&vQty,
TotalCost,
ShipDate,
QtyShipped,
'Open',
Orders.ReasonNum
FROM CancelledOrder, Orders, Counter
WHERE Orders.ReasonNum = CancelledOrder.ReasonNum;
COMMIT;
Orders Table for reference
CREATE TABLE Orders
(
OrderNum NUMBER (4) PRIMARY KEY,
OrderDate DATE,
CustID CHAR (3),
PNum VARCHAR2 (3),
UnitPrice NUMBER,
QtyOrder NUMBER,
TotalCost NUMBER,
ShipDate DATE,
QtyShipped NUMBER,
OrderStatus VARCHAR2 (10),
ReasonNum NUMBER,
CONSTRAINT fk_CustID FOREIGN KEY (CustID) REFERENCES Customer (CustID),
CONSTRAINT fk_PNum FOREIGN KEY (PNum) REFERENCES Product (PNum),
CONSTRAINT fk_ReasonNum FOREIGN KEY
(ReasonNum)
REFERENCES CancelledOrder (ReasonNum)
);
I presume that INSERT should go along with SELECT, i.e.
insert into ...
select ... from
Applied to your example:
INSERT INTO Orders (OrderNum, --> no VALUES keyword, but list of columns
OrderDate,
CustID,
PNum,
UnitPrice,
QtyOrder,
TotalCost,
ShipDate,
QtyShipped,
OrderStatus,
reasonnum) --> reasonnum instead of null
SELECT MaxNum,
SYSDATE,
&vCustID,
'&vPNum',
UnitPrice,
&vQty,
TotalCost,
ShipDate,
QtyShipped,
'Open',
Orders.ReasonNum
FROM CancelledOrder, Orders, Counter
WHERE Orders.ReasonNum = CancelledOrder.ReasonNum;
Also, check the FROM & WHERE clauses: there are 3 tables involved with only one join condition. As a result, you'll get more rows than you expected, unless you fix that (or unless the Counter table contains only 1 row).
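One quick way to see the effect is to run just the SELECT part on its own and count the rows it would produce (sketch):
SELECT COUNT(*)
FROM CancelledOrder, Orders, Counter
WHERE Orders.ReasonNum = CancelledOrder.ReasonNum;
-- if Counter has N rows, this count is N times the size of the
-- CancelledOrder/Orders join, because Counter is not constrained at all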
For these examples, imagine two tables, a and b, with 3 columns each.
When you are inserting, the statement must either use:
Method A) Here we instruct the database to INSERT (into all or into specific columns) the results of a query. To do this, we write INSERT INTO ... SELECT ..., for example:
INSERT INTO table_a select table_b.* from table_b; -- useful when we know table_a and table_b have the same columns in the same order
or
INSERT INTO table_a select b.column_2, b.column_3, b.column_1 from table_b b -- useful if b has more columns and we only want these three, or if the column order of b differs from the column order of a
[In this case, all columns of table a will be filled with the respective columns from the rows of table b that the select part of the query returns]
or:
INSERT INTO table_a (tab_a_column1, tab_a_column3) select b.column_1, b.column_3 from table_b b
[In this case, only the specified columns of table a will be filled with the columns from table b that the select part of the query returns]
-> Note that in these examples the VALUES keyword is never used
Method B) In this case we instruct the database to insert a single new row with specific values into the table (into all or into specific columns of the table). In this method we do not use a SELECT query at all:
INSERT INTO table_a VALUES ( 1, 'asdf', 5658 ) ;
In this example we just give some values to be inserted. They will be put in the corresponding columns of table_a, in the order that the columns are in the table.
INSERT INTO table_a (tab_a_column1, tab_a_column3) VALUES (1, 5658);
The numbers 1 and 5658 will be inserted into the first and third columns, while the second one will be left NULL.
So, when using VALUES, we are only inserting one row.
But when using Method A, our one statement may insert any number of rows at one go. It all will depend on how many rows the SELECT part of the query returns.
NOTE: the SELECT part of the query in Method A has no limit on how complex it can be. For example, it can have multiple joins, WHERE clauses, GROUP BY, and more.
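For instance, sticking with the hypothetical table_a/table_b from above (tab_a_column2 is made up for the example), the SELECT feeding the INSERT can filter and aggregate before any rows are written:
INSERT INTO table_a (tab_a_column1, tab_a_column2)
SELECT b.column_1, COUNT(*)      -- one row per distinct column_1 value
FROM table_b b
WHERE b.column_3 IS NOT NULL     -- filter first
GROUP BY b.column_1;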
A good link that explains INSERT INTO can be found here:
https://www.techonthenet.com/sql/insert.php

How to copy the column id from another table?

I've been stuck on this since last week. I have two tables, where the id column of CustomerTbl correlates with the CustomerID column of PurchaseTbl:
What I'm trying to achieve is to duplicate the table's data from itself, but copy the newly generated id of CustomerTbl to PurchaseTbl's CustomerID,
just like in the screenshots above. Glad for any help :)
You may use the OUTPUT clause to access the new ID. But to access both the OLD ID and the NEW ID, you will need to use a MERGE statement; an INSERT statement's OUTPUT clause does not let you reference the source's old id.
First you need somewhere to store the old and new id: a mapping table. You may use a table variable or a temp table:
declare @out table
(
old_id int,
new_id int
)
then the MERGE statement with the OUTPUT clause:
merge
CustomerTbl as t
using
(
select id, name
from CustomerTbl
) as s
on 1 = 2 -- force it to `false`, so every source row is "not matched"
when not matched then
insert (name)
values (s.name)
output -- the output clause
s.id, -- old_id
inserted.id -- new_id
into @out (old_id, new_id);
After that you just use @out, joining back on old_id, to obtain the new_id for PurchaseTbl:
insert into PurchaseTbl (CustomerID, Item, Price)
select o.new_id, p.Item, p.Price
from @out o
inner join PurchaseTbl p on o.old_id = p.CustomerID
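If you want to double-check the captured mapping before trusting the result, a quick look at @out helps (illustrative):
select old_id, new_id from @out order by old_id;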
Not sure what your end game is, but one way you could solve this is:
INSERT INTO purchaseTbl ( customerid ,
item ,
price )
SELECT customerid + 3 ,
item ,
price
FROM purchaseTbl;

SQL aliasing with FROM AS

SELECT A.barName AS BarName1, B.barName AS BarName2
FROM (
SELECT Sells.barName, COUNT(barName) AS count
FROM Sells
GROUP BY barName
) AS A, B
WHERE A.count = B.count
I'm trying to do a self join on this table that I created, but I'm not sure how to alias the table twice in this format (i.e., FROM ... AS). Unfortunately, this is a school assignment where I can't create any new tables. Does anyone have experience with this syntax?
edit: For clarification I'm using PostgreSQL 8.4. The schema for the tables I'm dealing with are as follows:
Drinkers(name, addr, hobby, frequent)
Bars(name, addr, owner)
Beers(name, brewer, alcohol)
Drinks(drinkerName, drinkerAddr, beerName, rating)
Sells(barName, beerName, price, discount)
Favorites(drinkerName, drinkerAddr, barName, beerName, season)
Again, this is for a school assignment, so I'm given read-only access to the above tables.
What I'm trying to find is pairs of bars (Name1, Name2) that sell the same set of drinks. My thinking in doing the above was to try and find pairs of bars that sell the same number of drinks, then list the names and drinks side by side (BarName1, Drink1, BarName2, Drink2) to try and compare if they are indeed the same set.
You have not mentioned what RDBMS you use.
If Oracle or MS SQL, you can do something like this (I use my sample data table, but you can try it with your tables):
create table some_data (
parent_id int,
id int,
name varchar(10)
);
insert into some_data values(1, 2, 'val1');
insert into some_data values(2, 3, 'val2');
insert into some_data values(3, 4, 'val3');
with data as (
select * from some_data
)
select *
from data d1
left join data d2 on d1.parent_id = d2.id
In your case this query
SELECT Sells.barName, COUNT(barName) AS count
FROM Sells
GROUP BY barName
should be placed in the WITH section and referenced from the main query twice, as A and B.
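Spelled out against the Sells table (this also works on the asker's PostgreSQL 8.4, which supports WITH), a sketch of that would be:
WITH bar_counts AS (
    SELECT barName, COUNT(barName) AS count
    FROM Sells
    GROUP BY barName
)
SELECT A.barName AS BarName1, B.barName AS BarName2
FROM bar_counts A
JOIN bar_counts B ON A.count = B.count
WHERE A.barName < B.barName;  -- skip self-pairs and mirrored duplicates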
It is slightly unclear what you are trying to achieve. Are you looking for a list of bar names with how many times each appears in the table? If so, there are a couple of ways you could do this. Firstly:
SELECT SellsA.barName AS BarName1, SellsB.count AS Count
FROM
(
SELECT DISTINCT barName
FROM Sells
) SellsA
LEFT JOIN
(
SELECT Sells.barName, COUNT(barName) AS count
FROM Sells
GROUP BY barName
) AS SellsB
ON SellsA.barName = SellsB.barName
Secondly, if you are using MSSQL:
SELECT barName, MAX(rn) AS Count
FROM
(
SELECT barName,
ROW_NUMBER() OVER(PARTITION BY barName ORDER BY barName) as rn
FROM Sells
) CountSells
GROUP BY barName
Thirdly, you could avoid a self-join in MSSQL by using OVER():
SELECT
barName,
COUNT(*) OVER(PARTITION BY barName) AS Count
FROM Sells
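Note that this form returns one row per row in Sells; if you want one row per bar, a small tweak (sketch) is to add DISTINCT:
SELECT DISTINCT
barName,
COUNT(*) OVER(PARTITION BY barName) AS Count
FROM Sells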

Update query with 'not exists' check causes primary key violation

The following tables are involved:
Table Product:
product_id
merged_product_id
product_name
Table Company_Product:
product_id
company_id
(Company_Product has a primary key on both the product_id and company_id columns)
I now want to run an update on Company_Product to set the product_id column to a merged_product_id. This update could cause duplicates, which would trigger a primary key violation, so I added a 'not exists' check in the where clause and my query looks like this:
update cp
set cp.product_id = p.merged_product_id
from Company_Product cp
join Product p on p.product_id = cp.product_id
where p.merged_product_id is not null
and not exists
(select * from Company_Product cp2
where cp2.company_id = cp.company_id and
cp2.product_id = p.merged_product_id)
But this query fails with a primary key violation.
What I think might be happening is that, because the Product table contains multiple rows with the same merged_product_id, it succeeds for the first product, but when going to the next product with the same merged_product_id it fails, because the 'not exists' subquery does not see the first change, as the query has not finished and committed yet.
Am I right in thinking this, and how would I change the query to make it work?
[EDIT] Some data examples:
Product:
product_id merged_product_id
23 35
24 35
25 12
26 35
27 NULL
Company_Product:
product_id company_id
23 2
24 2
25 2
26 3
27 4
[EDIT 2] Eventually I went with this solution, which uses a temporary table to do the update on and then inserts the updated data back into the original Company_Product table:
create table #Company_Product
(product_id int, company_id int)
insert #Company_Product select * from Company_Product
update cp
set cp.product_id = p.merged_product_id
from #Company_Product cp
join Product p on p.product_id = cp.product_id
where p.merged_product_id is not null
delete from Company_Product
insert Company_Product select distinct * from #Company_Product
drop table #Company_Product
A primary key is supposed to be three things:
Non-null
Unique
Unchanging
By altering part of the primary key you're violating requirement #3.
I think you'd be better off creating a new table, populating it, then drop the constraints, drop the original table, and rename the new table to the desired name (then of course, re-apply the original constraints). In my experience this gives you the chance to check out the 'new' data before making it 'live'.
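A rough sketch of that rebuild, applied to this case (T-SQL flavoured; the new table name and constraint name are made up, and any foreign keys referencing Company_Product would need the same drop/re-apply treatment):
SELECT DISTINCT
       ISNULL(p.merged_product_id, cp.product_id) AS product_id,
       cp.company_id
INTO   Company_Product_new
FROM   Company_Product cp
LEFT JOIN Product p ON p.product_id = cp.product_id;

-- inspect Company_Product_new here before making it live

DROP TABLE Company_Product;
EXEC sp_rename 'Company_Product_new', 'Company_Product';

ALTER TABLE Company_Product ALTER COLUMN product_id int NOT NULL;  -- PK columns must be NOT NULL (SELECT INTO may have created it nullable)
ALTER TABLE Company_Product
    ADD CONSTRAINT PK_Company_Product PRIMARY KEY (product_id, company_id);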
Share and enjoy.
You can use MERGE if you are on SQL 2008 at least.
Otherwise you're going to have to choose a criterion to establish which merged_product_id you want in and which one you leave out:
update cp
set cp.product_id = p.merged_product_id
from Company_Product cp
cross apply (
select top(1) merged_product_id
from Product
where Product.product_id = cp.product_id
and Product.merged_product_id is not null -- the apply's own alias p cannot be referenced inside itself
and not exists (
select * from Company_Product cp2
where cp2.company_id = cp.company_id and
cp2.product_id = Product.merged_product_id)
order by <insert differentiating criteria here>) as p
Note that this is not safe if multiple concurrent requests are running the merge logic.
I can't quite see how your structure is meant to work or what this update is trying to achieve. You seem to be updating Company_Product and setting a (new) product_id on an existing row that apparently has a different product_id; e.g., changing the row from one product to another. This seems...an odd use case, I'd expect you to be inserting a new unique row. So I think I'm missing something.
If you're converting Company_Product to using a new set of product IDs instead of an old set (the name "merged_product_id" makes me speculate this), are you sure that there is no overlap between the old and new? That would cause a problem like what you're describing.
Without seeing your data, I believe your analysis is correct - the entire set is updated and then the commit fails since it results in a constraint violation. An EXISTS is never re-evaluated after "partial commit" of some of the UPDATE.
I think you need to more precisely define your rules regarding attempting to change multiple products to the same product according to the merged_product_id and then make those explicit in your query. For instance, you could exclude any products which would fall into that category with a further NOT EXISTS with appropriate query.
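For example, a sketch of that extra exclusion, which only updates the lowest product_id per (company_id, merged_product_id) group and skips the rest (the leftover rows then still need a separate delete, as in the other answers):
update cp
set cp.product_id = p.merged_product_id
from Company_Product cp
join Product p on p.product_id = cp.product_id
where p.merged_product_id is not null
and not exists
    (select * from Company_Product cp2
     where cp2.company_id = cp.company_id
       and cp2.product_id = p.merged_product_id)
and not exists
    (select * from Company_Product cp3  -- skip all but the lowest product_id in the group
     join Product p3 on p3.product_id = cp3.product_id
     where cp3.company_id = cp.company_id
       and p3.merged_product_id = p.merged_product_id
       and cp3.product_id < cp.product_id);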
I think you are correct about why the update is failing. To fix this, run a delete query on your Company_Product table to remove the extra product_ids where the same merged_product_id will be applied.
Here is a stab at what the query might be:
delete company_product
where product_id not in (
select min(product_id)
from product
group by merged_product_id
)
and product_id not in (
select product_id
from product
where merged_product_id is null
)
-- Explanation added in response to comment --
What this tries to do is delete rows that will be duplicates after the update. Since you have multiple products mapping to the same merged id, you really only need one of those products (for each company) in the table when you are done. So, my query (if it works...) will keep the minimum original product id for each merged product id; then your update will work.
So, let's say you have 3 product ids which will map to 2 merged ids: 1 -> 10, 2 -> 20, 3 -> 20. And you have the following company_product data:
product_id company_id
1 A
2 A
3 A
If you run your update against this, it will try to change both the second and third rows to product id 20, and it will fail. If you run the delete I suggest, it will remove the third row. After the delete and the update, the table will look like this:
product_id company_id
10 A
20 A
Try this:
create table #Company_Product
(product_id int, company_id int)
create table #Product (product_id int,merged_product_id int)
insert into #Company_Product
select 23, 2
union all select 24, 2
union all select 25, 2
union all select 26, 3
union all select 27, 4
insert into #product
Select 23, 35
union all select 24, 35
union all select 25, 12
union all select 26, 35
union all select 27, NULL
update cp
set product_id = merged_product_id
from #company_product cp
join
(
select min(product_id) as product_id, merged_product_id
from #product where merged_product_id is not null
group by merged_product_id
) a on a.product_id = cp.product_id
delete cp
--select *
from #company_product cp
join #product p on cp.product_id = p.product_id
where cp.product_id <> p.merged_product_id
and p.merged_product_id is not null