Removing duplicates and keeping one copy - SQL

I have been going through the threads about removing duplicates from a table while keeping one copy, but I have not seen an illustration for the case where the table has a composite key. Anyone with an idea?
The table CONTR has the composite key (CHECKNO, SALARY_MONTH, SALARY_YEAR). My attempt:
delete (select * from CONTR t1
        INNER JOIN
        (select CHECKNO, SALARY_YEAR, SALARY_MONTH FROM CONTR
         group by CHECKNO, SALARY_YEAR, SALARY_MONTH HAVING COUNT(*) > 1) dupes
        ON t1.CHECKNO      = dupes.CHECKNO AND
           t1.SALARY_YEAR  = dupes.SALARY_YEAR AND
           t1.SALARY_MONTH = dupes.SALARY_MONTH);
I expected one duplicate to be removed and one maintained.

You can use the query below to remove the duplicates, using ROWID as a unique-valued column:
delete from contr t1
 where rowid < (select max(rowid)
                  from contr t2
                 where t2.checkno      = t1.checkno
                   and t2.salary_year  = t1.salary_year
                   and t2.salary_month = t1.salary_month);
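As a quick demo, here is a minimal sketch with an assumed simplified CONTR table and made-up values:
create table contr (checkno number, salary_year number, salary_month number);
insert into contr values (100, 2020, 1);
insert into contr values (100, 2020, 1); -- duplicate
insert into contr values (200, 2020, 2);

delete from contr t1
 where rowid < (select max(rowid)
                  from contr t2
                 where t2.checkno      = t1.checkno
                   and t2.salary_year  = t1.salary_year
                   and t2.salary_month = t1.salary_month);

select * from contr; -- one (100, 2020, 1) row remains, plus (200, 2020, 2)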

Another way to achieve this, assuming your dupes are defined by the 3 columns you have mentioned, is:
1. Create a temp table with the distinct values
2. Drop your table
3. Rename the temp table
Especially if you are dealing with a huge volume of data, this approach will be a lot faster than a DELETE.
If the dup data you are working on is a subset of your main table, the steps would be:
1. Create a temp table with the distinct values
2. Delete all dup rows from the main table
3. Insert the data from the temp table back into the main table
The SQL for the first step would be:
create table tmp_CONTR AS
select distinct CHECKNO, SALARY_YEAR, SALARY_MONTH -- modify this column list to match your needs
  from CONTR;
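For completeness, here is a sketch of the remaining steps, assuming the three columns make up the whole table (Oracle syntax; adjust names to your schema):
-- drop-and-rename variant (note: this loses indexes, grants and
-- constraints defined on the original table):
drop table CONTR;
alter table tmp_CONTR rename to CONTR;

-- delete-and-reinsert variant: remove every row whose key appears in
-- the temp table, then put back exactly one copy of each
delete from CONTR
 where (CHECKNO, SALARY_YEAR, SALARY_MONTH) in
       (select CHECKNO, SALARY_YEAR, SALARY_MONTH from tmp_CONTR);
insert into CONTR (CHECKNO, SALARY_YEAR, SALARY_MONTH)
select CHECKNO, SALARY_YEAR, SALARY_MONTH from tmp_CONTR;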

Related

Copy data from one table to another, ignoring duplicates - PostgreSQL

I am using a PostgreSQL db. I have data in two tables: Table A has 10 records and Table B has 5.
I would like to copy Table A's data to Table B, but only the new entries (5 records), ignoring the duplicates/already existing data.
Afterwards, Table B should have 10 records (5 old records + 5 new records from Table A).
Can you please help me with how this can be done?
Assuming id is your primary key and the table structures are identical (both tables have the same columns with the same data types, in the same order), use NOT EXISTS:
insert into TableB
select *
from TableA a
where not exists ( select 0 from TableB b where b.id = a.id )
If you are looking to copy the rows in A that are not in B, you can use INSERT ... SELECT, where the SELECT uses the set operator EXCEPT:
INSERT INTO B (column)
SELECT column FROM A
EXCEPT
SELECT column FROM B;
EXCEPT (https://www.postgresql.org/docs/current/queries-union.html) compares the two result sets and returns the distinct rows present in the result of A but not of B, then supplies those rows to the INSERT. For this to work, both the columns and their respective data types must match across the two SELECT queries and the INSERT.
INSERT INTO TableB
SELECT *
FROM TableA
ON CONFLICT DO NOTHING;
Here, the conflict will be detected based on your primary key (or any other unique constraint).
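A minimal sketch of the ON CONFLICT variant, assuming hypothetical one-column-key tables (names and columns are made up for illustration):
CREATE TABLE TableB (id int PRIMARY KEY, val text);
CREATE TABLE TableA (id int PRIMARY KEY, val text);
-- ... populate both tables so that some ids overlap ...
INSERT INTO TableB
SELECT * FROM TableA
ON CONFLICT (id) DO NOTHING; -- rows whose id already exists in TableB are skipped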

Join two datasets by two columns in PostgreSQL

I have a PostgreSQL query where I create a temp table by joining another temp table and a table from my database.
DROP TABLE IF EXISTS other_temp_table;
CREATE TEMP TABLE other_temp_table AS
SELECT *
FROM base.main_data
WHERE period_start_time::DATE >= '2017-06-20' AND period_start_time::DATE <= '2017-07-26';
------------------------------------------------------------------------------------------------------------------------
DROP TABLE IF EXISTS first_temp_table;
CREATE TEMP TABLE first_temp_table AS
SELECT *
FROM other_temp_table
LEFT JOIN base."UL_parameters"
ON other_temp_table.base_col::INT = base."UL_parameters".base_col::INT
and other_temp_table.sec_col::INT = base."UL_parameters".sec_col::INT;
Now the problem is: ERROR: column "sec_col" specified more than once.
But when I delete the sec_col join condition and join just on base_col, everything is OK. I think that I need to create an alias, but I am not sure how.
I think the problem is that there is a sec_col column in both joined tables. Try replacing select * with an explicit column list: select column1, column2, ....
It's good practice to avoid select * in general, because it can cause errors like this when a table definition changes.
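For example, a sketch of the failing query rewritten with an explicit column list (param_value is an assumed column name; substitute the real columns of base."UL_parameters"):
CREATE TEMP TABLE first_temp_table AS
SELECT t.base_col,
       t.sec_col,
       u.param_value,          -- hypothetical column from UL_parameters
       u.sec_col AS ul_sec_col -- alias any column name that exists in both tables
FROM other_temp_table t
LEFT JOIN base."UL_parameters" u
  ON t.base_col::INT = u.base_col::INT
 AND t.sec_col::INT = u.sec_col::INT;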

Subquery has too many columns

I have two tables with the same structure: tmp_grn and grn.
I have to delete the rows from tmp_grn that already exist in grn.
The problem is that I don't have a unique or primary key, but I can identify a unique row by the combination of two columns. Let's say the column names are grn_code and item_skucode.
My query:
DELETE FROM tmp_grn
WHERE grn_code AND item_skucode IN
(SELECT grn_code , item_skucode FROM grn);
I am getting this error:
ERROR: subquery has too many columns
What should be the right way to do this?
If you want to compare the two columns as a pair, you need to put them into parentheses:
DELETE FROM tmp_grn
WHERE (grn_code, item_skucode) IN (SELECT grn_code, item_skucode
FROM grn);
But suslov's answer using EXISTS is most probably faster - you need to check the execution plan to verify that.
You can use exists (if you want to check the pair of values):
delete from tmp_grn t
where exists (select *
              from grn
              where grn_code = t.grn_code
                and item_skucode = t.item_skucode);
Using INTERSECT to find the common pairs:
delete from tmp_grn
where (grn_code, item_skucode) in (select grn_code, item_skucode from tmp_grn
                                   intersect
                                   select grn_code, item_skucode from grn);

Delete subset of a table based on temp table

I have a table, say myTable. I also have a temp table, say myTableTemp, that contains the exact values I want to eliminate from myTable (myTable has more values than I need).
I was initially thinking I could drop myTable and then rename myTableTemp to myTable. However, there are many FK constraints that I do not want to touch. In theory, my query would look like:
DELETE FROM myTable where in (myTableTemp);
At least logically that is how I think about it.
EDIT: The temp table contains the data I want to DELETE from myTable
DELETE FROM myTable where in (myTableTemp);
Isn't the above backwards? Don't you want to keep all the values in myTableTemp?
I would do the following:
DELETE FROM myTable t1
WHERE NOT EXISTS ( SELECT 1 FROM myTableTemp t2
WHERE t2.primary_key = t1.primary_key );
Again, that's assuming that you want to keep everything in myTableTemp and delete everything in myTable that isn't in myTableTemp.
As an alternate solution to eliminate from myTable items present in myTableTemp:
DELETE FROM myTable
WHERE primary_key IN ( SELECT primary_key FROM myTableTemp )
;
It is usually believed that [NOT] EXISTS queries perform better than those using [NOT] IN, but it is not always that obvious.
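One way to settle it for your case is to compare the execution plans; a sketch, assuming PostgreSQL (EXPLAIN ANALYZE actually executes the statement, hence the rollback):
BEGIN;
EXPLAIN ANALYZE
DELETE FROM myTable t1
WHERE NOT EXISTS (SELECT 1 FROM myTableTemp t2
                  WHERE t2.primary_key = t1.primary_key);
ROLLBACK; -- undo the delete so the IN variant can be tested the same way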

Deleting at most one record for each unique tuple combination

I want to delete at most one record for each unique (columnA, columnB)-tuple in my following delete statement:
DELETE FROM tableA
WHERE columnA IN
(
--some subqueryA
)
AND columnB IN
(
--some subqueryB
)
How is this accomplished? Please only consider statements that work against SQL Server 2000 (i.e., T-SQL 2000 syntax). I can do it by iterating through a temp table, but I want to write it using only sets.
Example:
subqueryA returns 1
subqueryB returns 2,3
If the original table contained
(columnA, columnB, columnC)
5,2,5
1,2,34
1,2,45
1,3,86
Then
1,2,34
1,3,86
should be deleted. Each unique (columnA, columnB)-tuple will appear at most twice in tableA and each time I run my SQL statement I want to delete at most one of these unique combinations - never two.
If there is one record for a given unique (columnA, columnB)-tuple, delete it.
If there are two records for a given unique (columnA, columnB)-tuple, delete only one of them.
Delete tabA
from TableA tabA
where tabA.columnC in (select max(tabAA.columnC)
                       from TableA tabAA
                       where tabAA.columnA in (1)
                         and tabAA.columnB in (2,3)
                       group by tabAA.columnA, tabAA.columnB);
How often are you going to be running this, that it matters whether you use temp tables or not? Maybe you should consider adding constraints to the table so you only have to do this once...
That said, in all honesty, the best way to do this in SQL Server 2000 is probably to use the #temp table as you're already doing. If you were trying to delete all but one of each dupe, you could do something like:
insert the distinct rows into a separate table
delete all the rows from the old table
move the distinct rows back into the original table
I've also done things like copy the distinct rows into a new table, drop the old table, and rename the new table.
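A rough sketch of the insert/delete/move steps above, with assumed column names (valid T-SQL 2000):
-- 1. copy the distinct rows into a #temp table
SELECT DISTINCT columnA, columnB, columnC
INTO #distinctRows
FROM dbo.TableA;
-- 2. delete all the rows from the old table
DELETE FROM dbo.TableA;
-- 3. move the distinct rows back
INSERT INTO dbo.TableA (columnA, columnB, columnC)
SELECT columnA, columnB, columnC FROM #distinctRows;
DROP TABLE #distinctRows;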
But this doesn't sound like the goal. Can you show the code you're currently using with the #temp table? I'm trying to envision how you're identifying the rows to keep, and maybe seeing your existing code will trigger something.
EDIT - now, with the requirements better understood, I can propose the following query. Please test it on a copy of the table first!
DELETE a
FROM dbo.TableA AS a
INNER JOIN
(
    SELECT columnA, columnB, columnC = MIN(columnC)
    FROM dbo.TableA
    WHERE columnA IN
    (
        -- some subqueryA
        SELECT 1
    )
    AND columnB IN
    (
        -- some subqueryB
        SELECT 2 UNION SELECT 3
    )
    GROUP BY columnA, columnB
) AS x
ON a.columnA = x.columnA
AND a.columnB = x.columnB
AND a.columnC = x.columnC;
Note that this doesn't confirm that there are exactly one or two rows that match the grouping on columnA and columnB. Also note that if you run this twice it will delete the remaining row that still matches the subquery!
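To see it against the sample data from the question, a test sketch (one INSERT per row, to stay within SQL Server 2000 syntax):
CREATE TABLE dbo.TableA (columnA int, columnB int, columnC int);
INSERT INTO dbo.TableA VALUES (5, 2, 5);
INSERT INTO dbo.TableA VALUES (1, 2, 34);
INSERT INTO dbo.TableA VALUES (1, 2, 45);
INSERT INTO dbo.TableA VALUES (1, 3, 86);
-- run the DELETE above: the (1,2) group loses its MIN(columnC) row (1,2,34),
-- the (1,3) group loses its only row (1,3,86), and (5,2,5) is untouched
-- because columnA = 5 is not returned by subqueryA
SELECT * FROM dbo.TableA; -- leaves (5,2,5) and (1,2,45)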