Removing duplicate SQL records to permit a unique key - sql

I have a table ('sales') in a MySQL DB which should rightfully have had a unique constraint enforced to prevent duplicates. Removing the dupes first and then setting the constraint is proving a bit tricky.
Table structure (simplified):
'id (unique, autoinc)'
product_id
The goal is to enforce uniqueness for product_id. The de-duping policy I want to apply is to remove all duplicate records except the most recently created, i.e. the one with the highest id.
Or to put another way, I would like to delete only duplicate records, excluding the ids matched by the following query whilst also preserving the existing non-duped records:
select s.id
from sales s
inner join (select product_id,
max(id) as maxId
from sales
group by product_id
having count(product_id) > 1) groupedByProdId on s.product_id = groupedByProdId.product_id
and s.id = groupedByProdId.maxId
I've struggled with this on two fronts: writing the query to select the correct records to delete, and the MySQL restriction that a subquery in a DELETE cannot select from the same table that data is being removed from.
I checked out this answer and it seemed to deal with the subject, but it seems specific to sql-server, though I wouldn't rule this question out from duplicating another.

In reply to your comment, here's a query that works in MySQL:
delete YourTable
from YourTable
inner join YourTable yt2
on YourTable.product_id = yt2.product_id
and YourTable.id < yt2.id
This would only remove duplicate rows. The inner join will filter out the latest row for each product, even if no other rows for the same product exist.
P.S. If you try to alias the table after FROM, MySQL requires you to specify the name of the database, like:
delete <DatabaseName>.yt
from YourTable yt
inner join YourTable yt2
on yt.product_id = yt2.product_id
and yt.id < yt2.id;

Perhaps use ALTER IGNORE TABLE ... ADD UNIQUE KEY.
For example:
describe sales;
+------------+---------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+---------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| product_id | int(11) | NO | | NULL | |
+------------+---------+------+-----+---------+----------------+
select * from sales;
+----+------------+
| id | product_id |
+----+------------+
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 3 |
| 5 | 3 |
| 6 | 2 |
+----+------------+
ALTER IGNORE TABLE sales ADD UNIQUE KEY idx1(product_id), ORDER BY id DESC;
Query OK, 6 rows affected (0.03 sec)
Records: 6 Duplicates: 3 Warnings: 0
select * from sales;
+----+------------+
| id | product_id |
+----+------------+
| 6 | 2 |
| 5 | 3 |
| 2 | 1 |
+----+------------+
See this pythian post for more information.
Note that the ids end up in reverse order. I don't think this matters, since order of the ids should not matter in a database (as far as I know!). If this displeases you however, the post linked to above shows a way to solve this problem too. However, it involves creating a temporary table which requires more hard drive space than the in-place method I posted above.

I might do the following in sql-server to eliminate the duplicates:
DELETE FROM Sales
FROM Sales
INNER JOIN Sales b ON Sales.product_id = b.product_id AND Sales.id < b.id
It looks like the analogous delete statement for MySQL might be:
DELETE FROM Sales
USING Sales
INNER JOIN Sales b ON Sales.product_id = b.product_id AND Sales.id < b.id

This type of problem is easier to solve with CTEs and Ranking functions, however, you should be able to do something like the following to solve your problem:
Delete Sales
Where Exists(
Select 1
From Sales As S2
Where S2.product_id = Sales.product_id
And S2.id > Sales.Id
Having Count(*) > 0
)
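The EXISTS-based delete above can be tried on the sample data without a MySQL server. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in, so the DDL is SQLite's rather than MySQL's, and the index name idx_sales_product is made up for illustration. It deletes every row that has a newer sibling with the same product_id, then adds the unique index.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL `sales` table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, product_id INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO sales (id, product_id) VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 2), (4, 3), (5, 3), (6, 2)],
)

# Delete every row for which a row with the same product_id and a higher id exists.
conn.execute(
    """
    DELETE FROM sales
    WHERE EXISTS (
        SELECT 1 FROM sales AS s2
        WHERE s2.product_id = sales.product_id
          AND s2.id > sales.id
    )
    """
)

# With the duplicates gone, the unique constraint can be created.
# idx_sales_product is a hypothetical name.
conn.execute("CREATE UNIQUE INDEX idx_sales_product ON sales (product_id)")

remaining = conn.execute("SELECT id, product_id FROM sales ORDER BY id").fetchall()
print(remaining)  # [(2, 1), (5, 3), (6, 2)]
```

Only the highest id per product_id survives, and rows that were never duplicated are left alone because no newer sibling exists for them.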


Complex SQL Joins with Where Clauses

Being pretty new to SQL, I ask for your patience. I have been banging my head trying to figure out how to create this VIEW by joining 3 tables. I am going to use mock tables, etc. to keep this very simple, so that I can try to understand the answer and not just copy and paste.
ICS_Supplies:
Supplies_ID Item_Description
-------------------------------------
1 | PaperClips
2 | Rubber Bands
3 | Stamps
4 | Staples
ICS_Orders:
ID SuppliesID RequisitionNumber
----------------------------------------------------
1 | 1 | R1234a
6 | 4 | R1234a
2 | 1 | P2345b
3 | 2 | P3456c
4 | 3 | R4567d
5 | 4 | P5678e
ICS_Transactions:
ID RequisitionNumber OrigDate TransType OpenClosed
------------------------------------------------------------------
1 | R1234a | 06/12/20 | Req | Open
2 | P2345b | 07/09/20 | PO | Open
3 | P3456c | 07/14/20 | PO | Closed
4 | R4567d | 08/22/20 | Req | Open
5 | P5678e | 11/11/20 | PO | Open
And this is what I want to see in my View Results
Supplies_ID Item RequisitionNumber OriginalDate TransType OpenClosed
---------------------------------------------------------------------------------------
1 | PaperClips | P2345b | 07/09/20 | PO | OPEN
2 | Rubber Bands | Null | Null | Null | Null
3 | Stamps | Null | Null | Null | Null
4 | Staples | P5678e | 11/11/20 | PO | OPEN
I just can't get there. I want to always have the same number of records as in the ICS_Supplies table. I need to join to the ICS_Orders table to grab the RequisitionNumber, because that is what I need to join on to the ICS_Transactions table. I don't want to see data in the newly added fields UNLESS ICS_Transactions.TransType = 'PO' AND ICS_Transactions.OpenClosed = 'OPEN'; otherwise the joined fields should be null, regardless of what they contain. Is that possible?
My research shows this is probably a LEFT JOIN, which is very new to me. I had made many attempts on my own, and then posted my question yesterday. But I was struggling to ask the correct question and it was recommended by other members that I post the question again.
If needed, I can share what I have done, but I fear it will make things overly confusing as I was going in the wrong direction.
I am adding a link to the original question, for those that need some background info
Original Question
If there is any additional information needed, just ask. I do apologize in advance if I have left out any needed details.
This is a bit tricky, because you want to exclude rows in the second table depending on whether there is a match in the third table - so two left joins are not what you are after.
I think this implements the logic you want:
select s.Supplies_ID, s.Item_Description,
t.RequisitionNumber, t.OrigDate, t.TransType, t.OpenClosed
from ICS_Supplies s
left join ICS_Transactions t
on t.TransType = 'PO'
and t.OpenClosed = 'Open'
and exists (
select 1
from ICS_Orders o
where o.SuppliesID = s.Supplies_ID and o.RequisitionNumber = t.RequisitionNumber
)
Another way to phrase this would be an inner join in a subquery, then a left join:
select s.Supplies_ID, s.Item_Description,
t.RequisitionNumber, t.OrigDate, t.TransType, t.OpenClosed
from ICS_Supplies s
left join (
select o.SuppliesID, t.*
from ICS_Orders o
inner join ICS_Transactions t
on t.RequisitionNumber = o.RequisitionNumber
where t.TransType = 'PO' and t.OpenClosed = 'Open'
) t on t.SuppliesID = s.Supplies_ID
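The "inner join in a subquery, then left join" shape can be checked against the mock tables from the question. This is a sketch only: it runs on SQLite through Python's sqlite3 module rather than on the asker's database, and it uses the column names from the question's tables (Supplies_ID, SuppliesID, RequisitionNumber, OrigDate, TransType, OpenClosed).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ICS_Supplies (Supplies_ID INTEGER PRIMARY KEY, Item_Description TEXT);
CREATE TABLE ICS_Orders (ID INTEGER PRIMARY KEY, SuppliesID INTEGER, RequisitionNumber TEXT);
CREATE TABLE ICS_Transactions (ID INTEGER PRIMARY KEY, RequisitionNumber TEXT,
                               OrigDate TEXT, TransType TEXT, OpenClosed TEXT);
INSERT INTO ICS_Supplies VALUES (1,'PaperClips'),(2,'Rubber Bands'),(3,'Stamps'),(4,'Staples');
INSERT INTO ICS_Orders VALUES (1,1,'R1234a'),(6,4,'R1234a'),(2,1,'P2345b'),
                              (3,2,'P3456c'),(4,3,'R4567d'),(5,4,'P5678e');
INSERT INTO ICS_Transactions VALUES
  (1,'R1234a','06/12/20','Req','Open'),
  (2,'P2345b','07/09/20','PO','Open'),
  (3,'P3456c','07/14/20','PO','Closed'),
  (4,'R4567d','08/22/20','Req','Open'),
  (5,'P5678e','11/11/20','PO','Open');
""")

# Filter orders down to open POs first, then left join so every supply survives.
rows = conn.execute("""
    SELECT s.Supplies_ID, s.Item_Description, t.RequisitionNumber, t.OrigDate
    FROM ICS_Supplies s
    LEFT JOIN (
        SELECT o.SuppliesID, t.RequisitionNumber, t.OrigDate
        FROM ICS_Orders o
        INNER JOIN ICS_Transactions t ON t.RequisitionNumber = o.RequisitionNumber
        WHERE t.TransType = 'PO' AND t.OpenClosed = 'Open'
    ) t ON t.SuppliesID = s.Supplies_ID
    ORDER BY s.Supplies_ID
""").fetchall()

for row in rows:
    print(row)
# (1, 'PaperClips', 'P2345b', '07/09/20')
# (2, 'Rubber Bands', None, None)
# (3, 'Stamps', None, None)
# (4, 'Staples', 'P5678e', '11/11/20')
```

Because the PO/Open filter is applied inside the subquery, supplies with only Req or Closed transactions still appear, with nulls in the joined columns.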
This query returns every row of ICS_Supplies. The left joins add the order and transaction columns where a match exists and return null where it does not. Note that the TransType and OpenClosed conditions belong in the ON clause of the join: in a WHERE clause they would discard the rows whose transaction columns are null. A supply with several orders still produces one row per order.
select
s.Supplies_ID
,s.Item_Description as [Item]
,t.RequisitionNumber
,t.OrigDate as [OriginalDate]
,t.TransType
,t.OpenClosed
from ICS_Supplies s
left join ICS_Orders o on o.SuppliesID = s.Supplies_ID
left join ICS_Transactions t on t.RequisitionNumber = o.RequisitionNumber
and t.TransType = 'PO'
and t.OpenClosed = 'Open'
The joined columns automatically show null when no matching record exists. For example, if a supply has no order with an open PO transaction, the transaction columns for that supply are null.
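Where the TransType/OpenClosed filter is placed changes what two chained left joins return. The sketch below (SQLite via Python's sqlite3, with the mock data from the question; the date column is left out for brevity) runs the same joins twice: once with the filter in a WHERE clause and once with it in the ON clause of the last join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ICS_Supplies (Supplies_ID INTEGER PRIMARY KEY, Item_Description TEXT);
CREATE TABLE ICS_Orders (ID INTEGER PRIMARY KEY, SuppliesID INTEGER, RequisitionNumber TEXT);
CREATE TABLE ICS_Transactions (ID INTEGER PRIMARY KEY, RequisitionNumber TEXT,
                               TransType TEXT, OpenClosed TEXT);
INSERT INTO ICS_Supplies VALUES (1,'PaperClips'),(2,'Rubber Bands'),(3,'Stamps'),(4,'Staples');
INSERT INTO ICS_Orders VALUES (1,1,'R1234a'),(6,4,'R1234a'),(2,1,'P2345b'),
                              (3,2,'P3456c'),(4,3,'R4567d'),(5,4,'P5678e');
INSERT INTO ICS_Transactions VALUES (1,'R1234a','Req','Open'),(2,'P2345b','PO','Open'),
                                    (3,'P3456c','PO','Closed'),(4,'R4567d','Req','Open'),
                                    (5,'P5678e','PO','Open');
""")

BASE = """
SELECT s.Supplies_ID, t.RequisitionNumber
FROM ICS_Supplies s
LEFT JOIN ICS_Orders o ON o.SuppliesID = s.Supplies_ID
LEFT JOIN ICS_Transactions t ON t.RequisitionNumber = o.RequisitionNumber
"""
FILTER = "t.TransType = 'PO' AND t.OpenClosed = 'Open'"

# Filter in WHERE: rows whose transaction columns are NULL are thrown away.
where_rows = conn.execute(BASE + " WHERE " + FILTER).fetchall()
# Filter in ON: unmatched rows are kept, with NULL transaction columns.
on_rows = conn.execute(BASE + " AND " + FILTER).fetchall()

where_ids = sorted({r[0] for r in where_rows})
on_ids = sorted({r[0] for r in on_rows})
print(where_ids)     # [1, 4]
print(on_ids)        # [1, 2, 3, 4]
print(len(on_rows))  # 6
```

The ON version keeps all four supplies, but it returns six rows because each order contributes its own row, including all-null ones. That is why the requested one-row-per-supply result needs the orders and transactions to be combined before the left join.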
Modify your query, run it, then maybe update your question using real examples if it's possible.
In the original question you wrote:
"I only need ONE matching record from the ICS_Transactions Table.
Ideally, the one that I want is the most current
'ICS_Transactions.OriginalDate'."
So the goal is to get the most recent transaction for which the TransType is 'PO' and OpenClosed is 'Open'. That is the purpose of the CTE 'oa_cte' in this code. The appropriate transactions are then LEFT JOINed on the supplies id. Something like this:
with oa_cte(SuppliesID, RequisitionNumber, OriginalDate,
TransType, OpenClosed, RowNum) as (
select o.SuppliesID, o.RequisitionNumber,
t.OrigDate, t.TransType, t.OpenClosed,
row_number() over (partition by o.SuppliesID
order by t.OrigDate desc)
from ICS_Orders o
join ICS_Transactions t on o.RequisitionNumber=t.RequisitionNumber
where t.TransType='PO'
and t.OpenClosed='Open')
select s.*, oa.*
from ICS_Supplies s
left join oa_cte oa on s.Supplies_ID=oa.SuppliesID
and oa.RowNum=1;

One-to-Many SQL SELECT concatenated into single row

I'm using Postgres and I have the following schemes.
Orders
| id | status |
|----|-------------|
| 1 | delivered |
| 2 | recollected |
Comments
| id | text | user | order |
|----|---------|------|-------|
| 1 | text 1 | 10 | 1 |
| 2 | text 2 | 20 | 1 |
So, in this case, an order can have many comments.
I need to iterate over the orders and get something like this:
| id | status | comments |
|----|-------------|----------------|
| 1 | delivered | text 1, text 2 |
| 2 | recollected | |
I tried to use LEFT JOIN but it didn't work
SELECT
"Order".id,
"Order".status,
"Comment".text
FROM "Order"
LEFT JOIN "Comment" ON "Order".id = "Comment"."order"
it returns this:
| id | status | text |
|----|-------------|--------|
| 1 | delivered | text 1 |
| 1 | delivered | text 2 |
| 2 | recollected| |
You are almost there - you just need aggregation:
SELECT
o.id,
o.status,
STRING_AGG(c.text, ',') comments
FROM "Order" o
LEFT JOIN "Comment" c ON o.id = c."order"
GROUP BY o.id, o.status
I would strongly recommend against having a table (and/or a column) called order, because it conflicts with a language keyword. I would also recommend avoiding quoted identifiers as much as possible - they make the queries longer to write, for no benefit.
Note that you can also use a correlated subquery:
SELECT
o.id,
o.status,
(SELECT STRING_AGG(c.text, ',') FROM "Comment" c WHERE c."order" = o.id) comments
FROM "Order" o
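The join-then-aggregate pattern can be reproduced locally as a rough sketch. It runs on SQLite through Python's sqlite3 module, so group_concat stands in for Postgres's string_agg, and the table and column names (orders, comments, order_id, user_id) are illustrative unquoted replacements for "Order" and "Comment", not the asker's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE comments (id INTEGER PRIMARY KEY, text TEXT, user_id INTEGER, order_id INTEGER);
INSERT INTO orders VALUES (1, 'delivered'), (2, 'recollected');
INSERT INTO comments VALUES (1, 'text 1', 10, 1), (2, 'text 2', 20, 1);
""")

# LEFT JOIN keeps orders without comments; GROUP BY collapses to one row per order.
# group_concat is SQLite's counterpart to string_agg (assumption: same role here).
rows = conn.execute("""
    SELECT o.id, o.status, group_concat(c.text, ', ') AS comments
    FROM orders o
    LEFT JOIN comments c ON c.order_id = o.id
    GROUP BY o.id, o.status
    ORDER BY o.id
""").fetchall()

for row in rows:
    print(row)
```

Order 2 comes back with None for comments, since an aggregate over no non-null values yields NULL. The order of the concatenated elements is not guaranteed without an explicit ordering inside the aggregate.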
You can make it work with LEFT JOIN and aggregate after the join. But it's typically more efficient to aggregate first and join later.
If most or all rows in "Comment" are involved:
SELECT o.id, o.status, c.comments
FROM "Order" o
LEFT JOIN (
SELECT "order" AS id, string_agg(text, ', ') AS comments
FROM "Comment"
GROUP BY 1
) c USING (id);
Indexes won't matter, since most rows have to be read anyway.
For only a small percentage of rows (like, if you have a selective filter on "Order"):
SELECT o.id, o.status, c.comments
FROM "Order" o
LEFT JOIN LATERAL (
SELECT string_agg(text, ', ') AS comments
FROM "Comment"
WHERE "order" = o.id
) c ON true
WHERE <some_selective_filter>;
In this case, be sure to have an index on ("Comment"."order"), or more specialized, a covering index including text:
CREATE INDEX foo ON "Comment" ("order") INCLUDE (text);
Related:
Concatenate multiple result rows of one column into one, group by another column
Multiple array_agg() calls in a single query
Does a query with a primary key and foreign keys run faster than a query with just primary keys?
Aside: Consider legal, lower-case, unquoted identifiers in Postgres. In particular, don't (ab-)use completely reserved SQL keywords like ORDER as identifier. Much clearer and less potential for sneaky errors. See:
Are PostgreSQL column names case-sensitive?

Deleting duplicate rows with primary keys that are connected to other tables

A process was causing duplicate rows in a table where there were not supposed to be any. There are several great answers to deleting duplicate rows online. But, what if those duplicates with ID primary keys all have data in other tables tied to them?
Is there a way to delete all duplicates in the first table and migrate all data tied to those keys to the single PK ID that wasn't deleted?
For example:
TABLE 1
+-------+----------+----------+------------+
| ID(PK)| Model | ItemType | Color |
+-------+----------+----------+------------+
| 1 | 4 | B | Red |
| 2 | 4 | B | Red |
| 3 | 5 | A | Blue |
+-------+----------+----------+------------+
TABLE 2
+-------+----------+---------+
| ID(PK)| OtherID | Type |
+-------+----------+---------+
| 1 | 1 | Type1 |
| 2 | 1 | Type2 |
| 3 | 2 | Type3 |
| 4 | 2 | Type4 |
| 5 | 2 | Type5 |
+-------+----------+---------+
So I would theoretically want to delete the entry with ID: 2 from TABLE 1, and then have the OtherID fields in TABLE 2 switch to 1. This would actually be needed for X number of tables. This particular situation has 4 tables connected to its ID PK.
You cannot do this automatically. But you can do this with some queries. First, you set all the foreign keys to the correct id, which is presumably the smallest one:
with ids as (
select t1.*, min(id) over (partition by Model, ItemType, Color) as min_id
from table1 t1
)
update t2
set t2.otherid = ids.min_id
from table2 t2 join
ids
on t2.otherid = ids.id
where ids.id <> ids.min_id;
Then delete the duplicated ids, which are no longer referenced in table2:
with ids as (
select t1.*, min(id) over (partition by Model, ItemType, Color) as min_id
from table1 t1
)
delete from ids
where id <> min_id;
Note: If the database has concurrent users, you might want to put it in single user mode for this operation or lock the tables so they are not modified during these two operations.
To do this right, you want to wrap everything in a single transaction and perform this during a regular maintenance period. Anything else could leave things as inconsistent as they are now.
1. Decide which "key" you will keep.
2. Update all of the child tables to use the new "key" where the value is the old "key".
3. There should now be no FK dependencies on the duplicate records; delete them.
4. Once all ambiguities are resolved, place a unique constraint on (ItemType,Color) (or whatever the real columns are).
If there are a lot of instances, you may need to write a script to handle this and use the information in sys.foreign_keys and sys.foreign_key_columns to determine which records to update and in which order.
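The steps above can be sketched end to end on the example tables. This is a rough illustration on SQLite via Python's sqlite3 module, not the SQL Server syntax used in the answers. The names table1, table2, other_id, item_type and ux_table1 are assumptions, and the smallest id in each duplicate group is taken as the surviving key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER PRIMARY KEY, model INTEGER, item_type TEXT, color TEXT);
CREATE TABLE table2 (id INTEGER PRIMARY KEY, other_id INTEGER REFERENCES table1(id), type TEXT);
INSERT INTO table1 VALUES (1,4,'B','Red'),(2,4,'B','Red'),(3,5,'A','Blue');
INSERT INTO table2 VALUES (1,1,'Type1'),(2,1,'Type2'),(3,2,'Type3'),(4,2,'Type4'),(5,2,'Type5');
""")

with conn:  # one transaction: committed on success, rolled back on any error
    # Steps 1-2: re-point each child row at the smallest id of its duplicate group.
    conn.execute("""
        UPDATE table2
        SET other_id = (
            SELECT MIN(k.id)
            FROM table1 AS d
            JOIN table1 AS k
              ON k.model = d.model AND k.item_type = d.item_type AND k.color = d.color
            WHERE d.id = table2.other_id
        )
        WHERE other_id IN (SELECT id FROM table1)
    """)
    # Step 3: the duplicates are no longer referenced, so delete them.
    conn.execute("""
        DELETE FROM table1
        WHERE id NOT IN (SELECT MIN(id) FROM table1 GROUP BY model, item_type, color)
    """)
    # Step 4: prevent the duplicates from coming back (ux_table1 is a made-up name).
    conn.execute("CREATE UNIQUE INDEX ux_table1 ON table1 (model, item_type, color)")

parent_ids = [r[0] for r in conn.execute("SELECT id FROM table1 ORDER BY id")]
child_refs = [r[0] for r in conn.execute("SELECT other_id FROM table2 ORDER BY id")]
print(parent_ids)  # [1, 3]
print(child_refs)  # [1, 1, 1, 1, 1]
```

With several child tables, the UPDATE would be repeated once per table inside the same transaction before the DELETE runs.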

SQL many-to-many select help needed

I have 2 tables
Bid_customer
|bidkey | customerkey
| 1 | 1
| 1 | 2
| 1 | 3
customer_groups
| groupkey | customerkey
| 1 | 1
| 1 | 2
| 1 | 3
What I'm trying to get is a result that will look like
| bidkey | groupkey
| 1 | 1
I've tried a cursor and joins but just don't seem to be able to get what I need. Any ideas or suggestions?
EDIT: customers can belong to more than one group also
I am not sure how meaningful your sample data is. However, the following is a simple example.
Query:
select distinct b.bidkey, g.gkey
from bidcus b
inner join cusgroup g
on
b.cuskey = g.cuskey
and g.gkey = 10;
Results:
BIDKEY GKEY
1 10
Reference: SQLFIDDLE
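The effect of DISTINCT on this join can be seen with the question's own sample rows. This is a small sketch on SQLite via Python's sqlite3 module, using the bid_customer and customer_groups names from the question rather than the shortened names in the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bid_customer (bidkey INTEGER, customerkey INTEGER);
CREATE TABLE customer_groups (groupkey INTEGER, customerkey INTEGER);
INSERT INTO bid_customer VALUES (1, 1), (1, 2), (1, 3);
INSERT INTO customer_groups VALUES (1, 1), (1, 2), (1, 3);
""")

JOIN = """
FROM bid_customer b
INNER JOIN customer_groups g ON g.customerkey = b.customerkey
"""

# One row per shared customer: the same (bidkey, groupkey) pair repeats.
all_pairs = conn.execute("SELECT b.bidkey, g.groupkey " + JOIN).fetchall()
# DISTINCT collapses the repeats to one row per bid/group combination.
distinct_pairs = conn.execute("SELECT DISTINCT b.bidkey, g.groupkey " + JOIN).fetchall()

print(all_pairs)       # [(1, 1), (1, 1), (1, 1)]
print(distinct_pairs)  # [(1, 1)]
```

Note that this pairs a bid with a group as soon as they share any one customer. If a customer belongs to several groups, every one of those groups shows up for the bid.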
In order to have a working Many-to-Many relationship in a database you need to have an intermediary table that defines the relationship so you do not get duplicates or mismatched values.
This select statement will join all bids with all groups because the customer matches.
Select bidkey, groupkey
From customer_groups
Inner Join bid_customer
On customer_groups.customerkey = Bid_customer.customerkey
Here is a sample Many to Many Relationship:
For your question:
You will need another table that joins the data. For example, GroupBids
customer_groups and bid_customer would have a one-to-many relationship with GroupBids
You would then do the following select to get your data.
Select bidkey, groupkey
From bid_customer
inner join GroupBids
ON bid_customer.primarykey = GroupBids.idBidKey
inner join customer_groups
ON customer_groups.primarykey = GroupBids.idCustomerGroupkey
This would make sure only related groups and bids are returned

prevent from double/triple SUMing when JOINing

I am joining two tables: accn_demographics and accn_payments. The relationship between the two tables is one to many between accn_demographics.accn_id and accn_payments.accn_id
My question is when I am summing the PAID_AMT and COPAY_AMT, I am getting double/triple/quadruple the number that I should be getting.
Is there an obvious problem with my join condition?
select sum(p.paid_amt) as SumPaidAmount
, sum(p.copay_amt) as SumCoPay
, p.pmt_date
, d.load_Date
, p.ACCN_ID
from accn_payments p
join
(
select distinct load_date, accn_id
from accn_demographics
) d
on p.ACCN_ID=d.ACCN_ID
where p.POSTED='Y'
and p.pmt_date between '20120701' and '20120731'
group by p.pmt_date, d.load_Date,p.ACCN_ID
order by 3 desc
thanks so much for your guidance.
You need to do the summation in a subquery:
select sum(p.SumPaidAmount) as SumPaidAmount, sum(p.SumCoPay) as SumCoPay,
p.pmt_date, d.load_Date, p.ACCN_ID
from (select accn_id, p.pmt_date, sum(paid_amt) as SumPaidAmount,
sum(copay_amt) as SumCoPay
from accn_payments p
where p.POSTED='Y' and
p.pmt_date between '20120701' and '20120731'
group by accn_id, pmt_date
) p join
(select distinct load_date, accn_id from accn_demographics) d
on p.ACCN_ID=d.ACCN_ID
group by p.pmt_date, d.load_Date,p.ACCN_ID
order by 3 desc
Question: do you really intend for pmt_date to be in the final results? It looks like you want to remove it from both the outer SELECT and the subquery.
The only thing I can see is that (select distinct load_date, accn_id from accn_demographics) might return several matches. Look at your data and run a separate query
select distinct load_date, accn_id from accn_demographics WHERE accn_id=SomeID
where SomeID is one of the result accounts that is returning double/triple values. That should pinpoint your problem.
Yes, but it's not so obvious for beginners. What happens is that for every accn_payments record, you're matching on ONLY the accn_id, which means if there are multiple records in accn_demographics for that particular accn_id, then you will get duplicate accn_payment records due to the join. Is there another limiting field on accn_demographics to join back to the payments?
Ultimately, think of it this way:
accn_payments (p):
accn_id | paid_amt | copay_amt | ...
----------------------------------------------------
1 | 100.00 | 20.00 | ...
accn_demographics (d):
accn_id | load_date | ...
------------------------------------
1 | 2012/01/01 | ...
1 | 2012/03/05 | ...
1 | 2012/06/23 | ...
After joining, your results will look like this:
p.accn_id | p.paid_amt | p.copay_amt | p... | d.accn_id | d.load_date | d...
----------------------------------------------------------------------------
1 | 100.00 | 20.00 | .... | 1 | 2012/01/01 | ....
1 | 100.00 | 20.00 | .... | 1 | 2012/03/05 | ....
1 | 100.00 | 20.00 | .... | 1 | 2012/06/23 | ....
As you can see, the same row from accn_payments gets replicated for every matching accn_demographics record, since you specified only the accn_id column as the join criterion. It can't limit the results any further, so the DB engine says "Hey, look, this p record matches all these d records, this must be what he was asking for!" Obviously that is not what was intended: when you sum p.paid_amt and p.copay_amt, the sum runs over ALL ROWS (even though they are duplicated).
Ultimately, see if you can limit the join criteria for accn_demographics even further (by some date, perhaps), that way you limit the number of duplicate payment records during the join.
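The fan-out described above is easy to reproduce with the three-row example. The sketch below uses SQLite via Python's sqlite3 module and trims the tables to the columns shown. The second query swaps in an EXISTS check, which is not one of the approaches suggested in this thread; it is included only to show a sum that is unaffected by how many demographics rows match, and it applies only if load_date is not needed in the output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accn_payments (accn_id INTEGER, paid_amt REAL, copay_amt REAL);
CREATE TABLE accn_demographics (accn_id INTEGER, load_date TEXT);
INSERT INTO accn_payments VALUES (1, 100.0, 20.0);
INSERT INTO accn_demographics VALUES (1, '2012/01/01'), (1, '2012/03/05'), (1, '2012/06/23');
""")

# Joining on accn_id alone repeats the payment once per demographics row.
naive = conn.execute("""
    SELECT SUM(p.paid_amt), SUM(p.copay_amt)
    FROM accn_payments p
    JOIN accn_demographics d ON d.accn_id = p.accn_id
""").fetchone()

# EXISTS only tests for a match, so each payment is counted exactly once.
once = conn.execute("""
    SELECT SUM(p.paid_amt), SUM(p.copay_amt)
    FROM accn_payments p
    WHERE EXISTS (SELECT 1 FROM accn_demographics d WHERE d.accn_id = p.accn_id)
""").fetchone()

print(naive)  # (300.0, 60.0)
print(once)   # (100.0, 20.0)
```

The tripled total in the first query matches the three demographics rows for accn_id 1, which is the symptom described in the question.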