Remove duplicate rows based on specific columns - sql

I have a table that contains these columns:
ID (varchar)
SETUP_ID (varchar)
MENU (varchar)
LABEL (varchar)
The thing I want to achieve is to remove all duplicates from the table based on two columns (SETUP_ID, MENU).
Table I have:
id | setup_id | menu | label |
-------------------------------------
1 | 10 | main | txt |
2 | 10 | main | txt |
3 | 11 | second | txt |
4 | 11 | second | txt |
5 | 12 | third | txt |
Table I want:
id | setup_id | menu | label |
-------------------------------------
1 | 10 | main | txt |
3 | 11 | second | txt |
5 | 12 | third | txt |

You can achieve this with a common table expression (CTE):
with cte as (
select id,
row_number() over (partition by setup_id, menu order by id) as rownum
from atable )
delete from atable
where id in (select id from cte where rownum >= 2);
This keeps the row with the lowest id in each (setup_id, menu) group and deletes the rest, which gives you your desired output.
Common Table Expression docs

Assuming a table named tbl where both setup_id and menu are defined NOT NULL and id is the PRIMARY KEY.
EXISTS will do nicely:
DELETE FROM tbl t0
WHERE EXISTS (
SELECT FROM tbl t1
WHERE t1.setup_id = t0.setup_id
AND t1.menu = t0.menu
AND t1.id < t0.id
);
This deletes every row where a dupe with lower id is found, effectively only keeping the row with the smallest id from each set of dupes. An index on (setup_id, menu) or even (setup_id, menu, id) will help performance with big tables a lot.
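A minimal sketch of the EXISTS approach, run against the sample data from the question. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, which is an assumption made only to keep the example self-contained. The id column is declared as an integer here (the question uses varchar), and the DELETE refers to the table by name instead of the t0 alias.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (id INTEGER PRIMARY KEY, setup_id TEXT NOT NULL,"
    " menu TEXT NOT NULL, label TEXT)"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [
        (1, "10", "main", "txt"),
        (2, "10", "main", "txt"),
        (3, "11", "second", "txt"),
        (4, "11", "second", "txt"),
        (5, "12", "third", "txt"),
    ],
)

# Delete every row for which a duplicate with a smaller id exists.
conn.execute(
    """
    DELETE FROM tbl
    WHERE EXISTS (
        SELECT 1 FROM tbl AS t1
        WHERE t1.setup_id = tbl.setup_id
          AND t1.menu = tbl.menu
          AND t1.id < tbl.id
    )
    """
)

remaining = conn.execute(
    "SELECT id, setup_id, menu FROM tbl ORDER BY id"
).fetchall()
print(remaining)
```

Only the row with the smallest id of each (setup_id, menu) pair should remain.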
If there is no PK and no reliable UNIQUE (combination of) column(s), you can fall back to using the ctid. If NULL values can be involved, you need to specify how to deal with those.
Consider:
Delete duplicate rows from small table
How to delete duplicate rows without unique identifier
How do I (or can I) SELECT DISTINCT on multiple columns?
After cleaning up duplicates, add a UNIQUE constraint to prevent new dupes:
ALTER TABLE tbl ADD CONSTRAINT tbl_setup_id_menu_uni UNIQUE (setup_id, menu);
If you had an index on (setup_id, menu), drop that now. It's superseded by the UNIQUE constraint.
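To illustrate what the constraint buys you, here is a small sketch using Python's sqlite3 module as a stand-in (an assumption for the demo). SQLite has no ALTER TABLE ... ADD CONSTRAINT, so a unique index with the same name plays the role of the UNIQUE constraint here; the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (id INTEGER PRIMARY KEY, setup_id TEXT NOT NULL,"
    " menu TEXT NOT NULL)"
)
# Already de-duplicated rows (hypothetical sample data).
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [(1, "10", "main"), (3, "11", "second")],
)

# Stand-in for: ALTER TABLE tbl ADD CONSTRAINT ... UNIQUE (setup_id, menu)
conn.execute("CREATE UNIQUE INDEX tbl_setup_id_menu_uni ON tbl (setup_id, menu)")

# A new duplicate of (setup_id, menu) is now rejected by the database.
try:
    conn.execute("INSERT INTO tbl VALUES (2, '10', 'main')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM tbl").fetchone()[0]
print(rejected, row_count)
```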

I have found a solution that fits me the best.
Here it is if anyone needs it:
DELETE FROM table_name
WHERE id IN
(SELECT id
FROM
(SELECT id,
ROW_NUMBER() OVER( PARTITION BY setup_id,
menu
ORDER BY id ) AS row_num
FROM table_name ) t
WHERE t.row_num > 1 );

Links: https://www.postgresql.org/docs/current/queries-union.html
https://www.postgresql.org/docs/current/sql-select.html#SQL-DISTINCT
Let's say the table name is a:
select distinct on (setup_id, menu) a.* from a order by setup_id, menu, id;
Key point: The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.
This means the ORDER BY must start with setup_id, menu; any further expressions (here id) decide which row of each group is kept. Without an ORDER BY, the row kept from each group is unpredictable.
If you want the opposite (the rows that would be removed):
EXCEPT returns all rows that are in the result of query1 but not in the result of query2. (This is sometimes called the difference between two queries.) Again, duplicates are eliminated unless EXCEPT ALL is used.
SELECT * FROM a
EXCEPT
(select distinct on (setup_id, menu) a.* from a order by setup_id, menu, id);

You can try something along these lines to delete all but the first row in case of duplicates (please note that this is not tested in any way!):
DELETE FROM your_table WHERE id IN (
SELECT unnest(duplicate_ids[2:]) FROM (
SELECT array_agg(id ORDER BY id) AS duplicate_ids FROM your_table
GROUP BY SETUP_ID, MENU
HAVING COUNT(*) > 1
) AS dupes
);
The above collects the ids of the duplicate rows (COUNT(*) > 1) in an array (array_agg), then takes all but the first element in that array ([2:]) and "explodes" the id values into rows (unnest).
The outer query just deletes every id that ends up in that result.

For MySQL, a similar question has already been answered here: Find and remove duplicate rows by two columns
See if any of those approaches helps.
I like the one below for MySQL (note that ALTER IGNORE TABLE was removed in MySQL 5.7.4, so it only works on older versions):
ALTER IGNORE TABLE your_table ADD UNIQUE (SETUP_ID, MENU);

DELETE t1
FROM table_name t1
join table_name t2 on
t2.setup_id = t1.setup_id and t2.menu = t1.menu and t2.id < t1.id;

There are many ways to find and delete duplicate rows based on conditions. I like the inner join method, which works fast even on a large amount of data. The general form is:
DELETE T1 FROM <TableName> T1
INNER JOIN <TableName> T2
WHERE
T1.id > T2.id AND
T1.<ColumnName1> = T2.<ColumnName1> AND T1.<ColumnName2> = T2.<ColumnName2>;
In your case, where duplicates are defined by both setup_id and menu, you can write it as follows:
DELETE T1 FROM <TableName> T1
INNER JOIN <TableName> T2
WHERE
T1.id > T2.id AND
T1.setup_id = T2.setup_id AND T1.menu = T2.menu;
Let me know if you face any issues or need more help.

Related

Combine multiple rows with different column values into a single one

I'm trying to create a single row starting from multiple ones, combining them based on different column values. Here is the query I have so far:
select distinct ID, case info when 'name' then value end as 'NAME', case info when 'id' then value end as 'serial'
FROM TABLENAME t
WHERE info = 'name' or info = 'id'
However, the expected result should be something along the lines of
I tried with group by clauses but that doesn't seem to work.
The RDBMS is Microsoft SQL Server.
Thanks
SELECT X.ID, MAX(X.NAME) AS NAME, MAX(X.SERIAL) AS SERIAL FROM
(
SELECT 100 AS ID, NULL AS NAME, '24B6-97F3' AS SERIAL UNION ALL
SELECT 100, 'A', NULL UNION ALL
SELECT 200, NULL, '8113-B600' UNION ALL
SELECT 200, 'B', NULL
) X
GROUP BY X.ID
For me GROUP BY works
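The reason GROUP BY works here is that each CASE expression is wrapped in an aggregate such as MAX, which ignores the NULLs and collapses the rows of one ID into a single row. A minimal sketch of that applied directly to the question's query, using Python's sqlite3 module as a stand-in for SQL Server (an assumption for the demo); the sample values are taken from the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (id INTEGER, info TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?, ?)",
    [
        (100, "name", "A"),
        (100, "id", "24B6-97F3"),
        (200, "name", "B"),
        (200, "id", "8113-B600"),
    ],
)

# MAX() skips the NULLs produced by the non-matching CASE branch,
# so each group ends up with one name and one serial.
rows = conn.execute(
    """
    SELECT id,
           MAX(CASE info WHEN 'name' THEN value END) AS name,
           MAX(CASE info WHEN 'id' THEN value END) AS serial
    FROM tablename
    WHERE info IN ('name', 'id')
    GROUP BY id
    ORDER BY id
    """
).fetchall()
print(rows)
```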
A simple PIVOT operator can achieve this for dynamic results:
SELECT *
FROM
(
SELECT id AS id_column, info, value
FROM tablename
) src
PIVOT
(
MAX(value) FOR info IN ([name], [id])
) piv
ORDER BY id_column ASC;
Result:
| id_column | name | id |
|-----------|------|------------|
| 100 | a | 24b6-97f3 |
| 200 | b | 8113-b600 |
Fiddle here.
I'm a fan of a self join for things like this:
SELECT tName.ID, tName.Value AS Name, tSerial.Value AS Serial
FROM TableName AS tName
INNER JOIN TableName AS tSerial ON tSerial.ID = tName.ID AND tSerial.Info = 'id'
WHERE tName.Info = 'name'
This first selects only the name rows, then self joins on the same ID and filters the joined rows to the serial rows (Info = 'id'). You may want to change the INNER JOIN to a LEFT JOIN if not every ID has both a name and a serial and you want to know which names don't have a serial.

ORACLE - Setting RANK of duplicated on a big table, optimization needed

This is a simplified extract for a more complex algorithm.
The problem is I have a simple table C_HASH like this:
CREATE TABLE C_HASH
(
HASH CHAR (48),
RANK INTEGER
);
First I fill the table with all the hash values. But because I can have duplicates in HASH, to identify the duplicates one by one I need to set the RANK by HASH.
I use this SQL statement, but it takes way too long. I have indexed the HASH column, with no effect:
UPDATE C_HASH a set RANK = ( select temp.rank from ( select rowid, rank() over ( PARTITION BY HASH ORDER BY ROWID ) rank from C_HASH ) temp where temp.rowid = a.rowid);
I need to optimize this! A clue?
You could use the merge syntax:
merge into c_hash c
using (
select rowid, row_number() over(partition by hash order by rowid) rank
from c_hash
) c1
on (c1.rowid = c.rowid)
when matched then update set c.rank = c1.rank
Demo on DB Fiddle
Sample data:
HASH | RANK
:----------------------------------------------- | ---:
foo | null
foo | null
foo | null
bar | null
Results:
HASH | RANK
:----------------------------------------------- | ---:
foo | 1
foo | 2
foo | 3
bar | 1
If you are going to update a lot of rows, it might be more efficient to create a new table, using the create table ... as select syntax:
create table c_hash2 as
select hash, row_number() over(partition by hash order by rowid) as rank
from c_hash
This is going to take a long time, because you are updating all rows. But you can simplify the logic to:
update c_hash h
set rank = (select count(*)
from c_hash h2
where h2.hash = h.hash and h2.rowid <= h.rowid
);
This should be able to take advantage of your existing index.
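A small sketch of the correlated-count update on the sample data, using Python's sqlite3 module as a stand-in for Oracle (an assumption for the demo; SQLite happens to expose a rowid as well, but it says nothing about Oracle performance). The rank of a row is the number of rows with the same hash and a rowid less than or equal to its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE c_hash (hash TEXT, "rank" INTEGER)')
conn.executemany(
    "INSERT INTO c_hash (hash) VALUES (?)",
    [("foo",), ("foo",), ("foo",), ("bar",)],
)

# Count the rows sharing the hash that come at or before this row.
conn.execute(
    """
    UPDATE c_hash
    SET "rank" = (
        SELECT COUNT(*) FROM c_hash AS h2
        WHERE h2.hash = c_hash.hash
          AND h2.rowid <= c_hash.rowid
    )
    """
)

ranked = conn.execute('SELECT hash, "rank" FROM c_hash ORDER BY rowid').fetchall()
print(ranked)
```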

duplicates removal in database [duplicate]

I am using postgres.
I want to delete duplicate rows.
The condition is that one copy from each set of duplicate rows must not be deleted,
i.e. if there are 5 duplicate records, then 4 of them will be deleted.
Try the steps described in this article: Removing duplicates from a PostgreSQL database.
It describes a situation when you have to deal with huge amount of data which isn't possible to group by.
A simple solution would be this:
DELETE FROM foo
WHERE id NOT IN (SELECT min(id) --or max(id)
FROM foo
GROUP BY hash)
Where hash is something that gets duplicated.
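A minimal sketch of this statement in action, using Python's sqlite3 module as a stand-in for PostgreSQL (an assumption for the demo) and made-up sample rows. It assumes id is never NULL, since NOT IN behaves unexpectedly when the subquery returns NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, hash TEXT)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?)",
    [(1, "a"), (2, "a"), (3, "b"), (4, "a"), (5, "b")],
)

# Keep the row with the smallest id per hash, delete everything else.
conn.execute(
    """
    DELETE FROM foo
    WHERE id NOT IN (SELECT MIN(id) FROM foo GROUP BY hash)
    """
)

kept = conn.execute("SELECT id, hash FROM foo ORDER BY id").fetchall()
print(kept)
```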
delete from table
where not id in
(select max(id) from table group by [duplicate row])
Keeping the row with max(id) is an arbitrary choice of which row to keep.
If that does not work for you, please provide more details.
The fastest way is a join to the same table.
http://www.postgresql.org/docs/8.1/interactive/sql-delete.html
CREATE TABLE test(id INT,id2 INT);
CREATE TABLE
mapy=# INSERT INTO test VALUES(1,2);
INSERT 0 1
mapy=# INSERT INTO test VALUES(1,3);
INSERT 0 1
mapy=# INSERT INTO test VALUES(1,4);
INSERT 0 1
DELETE FROM test t1 USING test t2 WHERE t1.id=t2.id AND t1.id2<t2.id2;
DELETE 2
mapy=# SELECT * FROM test;
id | id2
----+-----
1 | 4
(1 row)
delete from table t1
where rowid > (SELECT min(rowid) FROM table t2
where t2.id = t1.id and t2.name = t1.name);
DELETE f1 from foo as f1, foo as f2
where f1.duplicate_column= f2.duplicate_column
AND f1.id > f2.id;

MySQL get rows but prefer one column value over another

A bit of a strange one, I want to write a MySQL query that will get results from a table, but prefer one value of a column over another, ie
id name value priority
1 name1 value1 NULL
2 name1 value1 1
3 name2 value2 NULL
4 name3 value3 NULL
So here name1 has two entries, but one has a priority of 1. I want to get all the values from the table, but prefer the values with whatever priority I'm after.
The results I'd be after would be
id name value priority
2 name1 value1 1
3 name2 value2 NULL
4 name3 value3 NULL
An equivalent way of saying it would be 'get all rows from the table, but prefer rows with a priority of x'.
This should do it:
SELECT
T1.id,
T1.name,
T1.value,
T1.priority
FROM
My_Table T1
LEFT OUTER JOIN My_Table T2 ON
T2.name = T1.name AND
T2.priority > COALESCE(T1.priority, -1)
WHERE
T2.id IS NULL
This also allows you to have multiple priority levels with the highest being the one that you want to return (if you had a 1 and 2, the 2 would be returned).
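A minimal sketch of this anti-join on the question's sample data, using Python's sqlite3 module as a stand-in for MySQL (an assumption for the demo). A row survives only if no other row with the same name has a higher priority, with NULL treated as -1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE My_Table (id INTEGER PRIMARY KEY, name TEXT,"
    " value TEXT, priority INTEGER)"
)
conn.executemany(
    "INSERT INTO My_Table VALUES (?, ?, ?, ?)",
    [
        (1, "name1", "value1", None),
        (2, "name1", "value1", 1),
        (3, "name2", "value2", None),
        (4, "name3", "value3", None),
    ],
)

# T2 matches only when a higher-priority row exists for the same name;
# keeping the rows where T2.id IS NULL keeps the top-priority row per name.
preferred = conn.execute(
    """
    SELECT T1.id, T1.name, T1.value, T1.priority
    FROM My_Table T1
    LEFT OUTER JOIN My_Table T2
      ON T2.name = T1.name
     AND T2.priority > COALESCE(T1.priority, -1)
    WHERE T2.id IS NULL
    ORDER BY T1.id
    """
).fetchall()
print(preferred)
```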
I will also say though that it does seem like there are some design problems in the DB. My approach would have been:
My_Table (id, name)
My_Values (id, priority, value)
with an FK on id to id. PKs on id in My_Table and id, priority in My_Values. Of course, I'd use appropriate table names too.
You need to redesign your table first.
It should be:
YourTable (Id, Name, Value)
YourTablePriority (PriorityId, Priority, Id)
Update:
select * from YourTable a
where a.Id not in
(select b.Id from YourTablePriority b)
This should work in sql server, you may need a little change to make it work in mysql.
Maybe something like:
SELECT id, name, value, priority FROM
table_name GROUP BY name ORDER BY priority
Although not having a database in front of me I can't test it...
If I understand correctly, you want the value of a name given a specific priority, or the value associated with a NULL priority. (You do not necessarily want the MAX(priority) that exists.)
Yes, you've got some awkward design issues which you should address, but let's solve the problem you do have at present (and you can later migrate to the problem you ought to have :) ):
mysql> SET @priority = 1; -- the priority we want, if recorded
mysql> PREPARE stmt FROM "
SELECT
t0.*
FROM
t t0
LEFT JOIN
(SELECT DISTINCT name, priority FROM t WHERE priority = ?) t1
ON t0.name = t1.name
WHERE
t0.priority = t1.priority
OR
t1.priority IS NULL
";
mysql> EXECUTE stmt USING @priority;
+----+-------+--------+----------+
| id | name | value | priority |
+----+-------+--------+----------+
| 2 | name1 | valueX | 1 |
| 3 | name2 | value2 | NULL |
| 4 | name3 | value3 | NULL |
+----+-------+--------+----------+
3 rows in set (0.00 sec)
(Note that I changed the prioritized value of "name1" to "valueX" in the above -- your original formulation had identical value values for "name1" regardless of priority, which made it hard for me to understand why you cared to discriminate one from the other.)

Adding Row Numbers To a SELECT Query Result in SQL Server Without use Row_Number() function

I need to add row numbers to a SELECT query without using the Row_Number() function,
and without using user-defined functions or stored procedures.
Select (obtain the row number) as [Row], field1, field2, fieldn from aTable
UPDATE
I am using SAP B1 DIAPI to run the query; this system does not allow the use of the Row_Number() function in the SELECT statement.
Bye.
I'm not sure if this will work for your particular situation or not, but can you execute this query with a stored procedure? If so, you can:
A) Create a temp table with all your normal result columns, plus a Row column as an auto-incremented identity.
B) Select-Insert your original query, sans the row column (SQL will fill this in automatically for you)
C) Select * on the temp table for your result set.
Not the most elegant solution, but will accomplish the row numbering you are wanting.
This query will give you the row_number,
SELECT
(SELECT COUNT(*) FROM @table t2 WHERE t2.field <= t1.field) AS row_number,
field,
otherField
FROM @table t1
but there are some restrictions when you want to use it. You have to have one column in your table (in the example it is field) which is unique and numeric and you can use it as a reference. For example:
DECLARE @table TABLE
(
field INT,
otherField VARCHAR(10)
)
INSERT INTO @table(field,otherField) VALUES (1,'a')
INSERT INTO @table(field,otherField) VALUES (4,'b')
INSERT INTO @table(field,otherField) VALUES (6,'c')
INSERT INTO @table(field,otherField) VALUES (7,'d')
SELECT * FROM @table
returns
field | otherField
------------------
1 | a
4 | b
6 | c
7 | d
and
SELECT
(SELECT COUNT(*) FROM @table t2 WHERE t2.field <= t1.field) AS row_number,
field,
otherField
FROM @table t1
returns
row_number | field | otherField
-------------------------------
1 | 1 | a
2 | 4 | b
3 | 6 | c
4 | 7 | d
This is the solution without functions and stored procedures, but as I said there are the restrictions. But anyway, maybe it is enough for you.
RRUZ, you might be able to hide the use of a function by wrapping your query in a View. It would be transparent to the caller. I don't see any other options, besides the ones already mentioned.