Remove duplicate rows in Postgres

I have two tables:
Employee:
ID    Name     Surname
143   Amy      Flowers
245   Natasha  Smith
365   John     Alexander
445   Natasha  Smith
565   Monica   Withhouse
644   Amy      Flowers
1023  Amy      Alexander
And employee_details:
ID  Employee_id  Document_numer
1   644          XXXXXXXXX
2   245          XXXXXX
3   365          XXXXXX
I need to remove duplicate records from the Employee table that are not referenced in the employee_details table. In the example data, I would like to delete the duplicate employees with ids 143 and 445.
I must admit that I have no idea how to do it. Could you give me a hint?
The database is Postgres.

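-- Match on name and surname together; grouping by name alone would also
-- delete id 1023 (Amy Alexander), who is not a duplicate.
Delete from Employee
Where id not in (
Select Employee_id
from employee_details
-- a NULL Employee_id would make NOT IN match nothing
where Employee_id is not null
)
and (name, surname) in (
Select name, surname
from Employee
Group by name, surname having count(*) > 1
);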
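
Before running a delete like this, it can help to preview which rows it would remove. The following is a minimal sketch for Postgres, assuming the Employee and employee_details tables from the question and assuming that "duplicate" means the same name and surname. It uses NOT EXISTS rather than NOT IN, so a NULL in Employee_id cannot empty the result.

```sql
-- Preview: unlinked employees that share name and surname with another row.
SELECT e.*
FROM employee e
WHERE NOT EXISTS (
          SELECT 1
          FROM employee_details d
          WHERE d.employee_id = e.id
      )
  AND EXISTS (
          SELECT 1
          FROM employee o
          WHERE o.name = e.name
            AND o.surname = e.surname
            AND o.id <> e.id
      )
ORDER BY e.id;
```

On the sample data this lists ids 143 and 445. Once the preview looks right, the same WHERE clause can be used in a DELETE.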

Though the question is already answered, I am adding two different answers here using CTEs.
create table Employee(ID int, Name varchar(50), Surname varchar(50));
insert into Employee values(143, 'Amy', 'Flowers');
insert into Employee values(245, 'Natasha', 'Smith');
insert into Employee values(365, 'John', 'Alexander');
insert into Employee values(445, 'Natasha', 'Smith');
insert into Employee values(565, 'Monica', 'Withhouse');
insert into Employee values(644, 'Amy', 'Flowers');
insert into Employee values(1023, 'Amy', 'Alexander');
create table employee_details ( ID int, Employee_id int, Document_numer varchar(50));
insert into employee_details values(1, 644, 'XXXXXXXXX');
insert into employee_details values(2, 245, 'XXXXXX');
insert into employee_details values(3, 365, 'XXXXXX');
Delete query 1:
with duplicate_employees as
(
select * , count(id)over(partition by name,surname) duplicate_count from Employee
)
delete from Employee where id in(
select id from duplicate_employees de
where duplicate_count >1
and not exists
(
select 1 from employee_details e where e.Employee_id = de.ID
)
);
select * from employee;
Output:
id    name     surname
245   Natasha  Smith
365   John     Alexander
565   Monica   Withhouse
644   Amy      Flowers
1023  Amy      Alexander
Delete query 2:
with cte as
(
Select *, count(*)over(partition by name,surname) duplicate_count,
(case when exists
(
select 1 from employee_details ed where ed.Employee_id = e.ID
)
then 1 else 0 end) exist_in_details
from Employee e
)
delete from Employee where id in (select id from cte where duplicate_count>1 and exist_in_details=0 );
select * from Employee;
Output:
id    name     surname
245   Natasha  Smith
365   John     Alexander
565   Monica   Withhouse
644   Amy      Flowers
1023  Amy      Alexander
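One caveat with both delete queries: if none of the rows in a duplicate group is referenced in employee_details, every copy is deleted. If one copy should survive in that case, a possible variant (a sketch for Postgres, not part of the answers above; the CTE names flagged and ranked are made up for illustration) ranks the rows in each group so that linked rows come first and only deletes unlinked rows that are not ranked first.

```sql
WITH flagged AS (
    -- mark each employee as linked or not linked to employee_details
    SELECT e.id, e.name, e.surname,
           EXISTS (
               SELECT 1
               FROM employee_details d
               WHERE d.employee_id = e.id
           ) AS linked
    FROM employee e
),
ranked AS (
    -- within each (name, surname) group: linked rows first, then lowest id
    SELECT id, linked,
           row_number() OVER (
               PARTITION BY name, surname
               ORDER BY linked DESC, id
           ) AS rn
    FROM flagged
)
DELETE FROM employee
WHERE id IN (SELECT id FROM ranked WHERE rn > 1 AND NOT linked);
```

On the sample data this still removes only ids 143 and 445. Keeping the lowest id among unlinked copies is an arbitrary choice; any other tie-breaker can go into the ORDER BY.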

Related

SQL Server - How to make partially duplicate rows inherit values from original row

In order to link records across datasets I first deleted the records down to non-duplicates based on key linking variables (partitioning over names, dob, sex etc. and deleting where row_number > 1). After the linking was done I'm left with a new variable "unique_id" however this will only be attributed to the original record (since I removed the partial duplicates). I now want to reattach this "unique_id" back to all of the partial duplicates. How could I go about doing this? Is there perhaps a better method I could have used from the start?
Data currently looks like this:
row_number unique_id id first_name last_name activity_date
1 10 2 Davy Jones 1726-11-25
2 -- 12 Davy Jones 1751-02-11
3 -- 43 Davy Jones 1811-06-15
1 100 12114 John Smith 2018-06-01
2 -- 123123 John Smith 2022-07-05
1 90 2591 Mary Sue 2013-05-18
And I want the "unique_id" to inherit the originals like this:
row_number unique_id id first_name last_name activity_date
1 10 2 Davy Jones 1726-11-25
2 10 12 Davy Jones 1751-02-11
3 10 43 Davy Jones 1811-06-15
1 100 12114 John Smith 2018-06-01
2 100 123123 John Smith 2022-07-05
1 90 2591 Mary Sue 2013-05-18
Code to produce this table is as follows:
create table #test (
unique_id int,
id int,
first_name varchar(255),
last_name varchar(255),
activity_date date
)
insert into #test
values (100, 12114, 'John', 'Smith', '2018-06-01')
insert into #test (id, first_name, last_name, activity_date)
values (123123, 'John', 'Smith', '2022-07-05')
insert into #test
values (90, 2591, 'Mary', 'Sue', '2013-05-18')
insert into #test
values (10, 2, 'Davy', 'Jones', '1726-11-25')
insert into #test (id, first_name, last_name, activity_date)
values (12, 'Davy', 'Jones', '1751-02-11')
insert into #test (id, first_name, last_name, activity_date)
values (43, 'Davy', 'Jones', '1811-06-15')
select
row_number() over (partition by first_name, last_name order by first_name, last_name) as row_number
,unique_id, id, first_name, last_name, activity_date
from #test

A simple method -- assuming one value per first_name/last_name pair -- is to use window functions:
select t.*, max(unique_id) over (partition by first_name, last_name) as new_unique_id
from #test t;
This can be put into an update:
with toupdate as (
select t.*, max(unique_id) over (partition by first_name, last_name) as new_unique_id
from #test t
)
update toupdate
set unique_id = new_unique_id;

Try this:
with Dups as(
select
row_number() over (partition by first_name, last_name order by first_name, last_name) as dup_number,
-- dense_rank() over (order by first_name, last_name) as DuplicateGroupNumber, -- this allows you to see groups
max(unique_id) over (partition by first_name, last_name) as GroupUniqueID,
unique_id, id, first_name, last_name, activity_date
from #test
)
update a
set unique_id = GroupUniqueID
from #test as a
inner join Dups as b on a.id = b.id
select * from #test
Result
unique_id id first_name
----------- ----------- ------------
100 12114 John
100 123123 John
90 2591 Mary
10 2 Davy
10 12 Davy
10 43 Davy

Looks like you should join a subset of the records that has the linking id with the records that don't have the linking id using whatever fields you think appropriate and then update the id in the unlinked set from the id in the linked set.
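
A minimal sketch of that join in T-SQL, assuming the #test table from the question and assuming first_name and last_name are the fields that identify a person (u stands for the unlinked rows, l for the linked ones):

```sql
-- Copy unique_id from the linked row to the unlinked rows of the same person.
UPDATE u
SET unique_id = l.unique_id
FROM #test AS u
INNER JOIN #test AS l
    ON  l.first_name = u.first_name
    AND l.last_name  = u.last_name
WHERE u.unique_id IS NULL
  AND l.unique_id IS NOT NULL;
```

This relies on each person having exactly one linked row. If a name matches several different unique_id values, which one gets copied is not defined, so it is worth checking for that before running the update.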

Updating the first occurrence of the column with a value and the remaining with other value

select emp_id, emp_dept, emp_name
from employee
where emp_id in (123, 234);
emp_id emp_dept emp_name
*****************************
123 222 1234
123 222 5678
123 222 9101
234 222 1011
234 222 1112
234 222 1213
Here there are 3 records for each emp_id.
I want a query to update emp_dept such that, out of the three records for each emp_id, only one record is updated to 555 (it can be any record, it doesn't matter) and the other two are updated to 666.

Create a CTE (common table expression) that adds a ROW_NUMBER window function partitioned by emp_id, then update through the CTE with a CASE expression on the row number. Joining the CTE back to the table on emp_id alone would match every row to all row numbers in its group, so the update targets the CTE directly.
The code below assumes the data is in a temp table #Table that has an emp_SSN column to order by.
;WITH cte AS (
SELECT
emp_dept
,ROW_NUMBER() OVER (PARTITION BY emp_id ORDER BY emp_SSN) AS RowNum
FROM
#Table
)
-- SQL Server applies an update on the CTE to the underlying table
UPDATE cte
SET emp_dept = CASE WHEN RowNum = 1 THEN 555 ELSE 666 END

You can use MERGE.
Data Preparation
create table em1(
emp_id number, emp_dept number, emp_name varchar2(10));
insert into em1 values(123,1,'we');
insert into em1 values(123,1,'asd');
insert into em1 values(123,1,'rfw');
insert into em1 values(345,2,'rtg');
insert into em1 values(345,2,'bfg');
insert into em1 values(345,2,'uyi');
commit;
Query
MERGE INTO em1 e
USING (
SELECT emp_id, emp_dept, emp_name,
row_number() over (partition by emp_id order by 1) r
FROM em1
WHERE emp_id in (123,345)
) f
ON (f.emp_id = e.emp_id and f.emp_name = e.emp_name)
WHEN MATCHED THEN
UPDATE SET e.emp_dept = case when f.r = 1 then 555 else 666 end;
Result
emp_id emp_dept emp_name
-------------------------
123 555 we
123 666 asd
123 666 rfw
345 555 rtg
345 666 bfg
345 666 uyi

SQL OVER (Partition by) - Handle nulls

I have the following scenario:
Table Employees:
First Name | Last Name | Department | Salary
-----------|-----------|------------|---------
John | Doe | Finance | 20
John | Doe | R&D | 20
John | null | Finance | 20
John | long | Finance | 20
I want one row for each (First Name, Last Name),
unless there is a null in the last name, in which case I want just one row with (First Name, null).
For the above example the result is:
First Name | Last Name | Department | Salary
-----------|-----------|------------|---------
John | null | Finance | 20
But if I didn't have that record, the result should have been:
First Name | Last Name | Department | Salary
-----------|-----------|------------|---------
John | Doe | R&D | 20
John | long | Finance | 20
I guess the answer involves some PARTITION BY clauses, but I'm not sure where.
So far I have come up with this:
SELECT FirstName,LastName, DEPARTMENT,Salary,RK FROM
(
select * from
SELECT EXT.*,
ROW_NUMBER() OVER(PARTITION BY EXT.FirstName,EXT.LastName
ORDER BY rownum ASC) AS RK
FROM Employees EXT
)
WHERE RK = 1 ;
Thanks !

Your problem is in the PARTITION clause. You want every first name where there is a surname unless at least one surname with that first name is NULL, in which case you want only those first names that have a NULL surname.
The answer here is to use RANK() instead of ROW_NUMBER(). RANK() does not create a consecutive list; instead rows with equal values get the same rank.
select firstname, lastname, department, salary, rnk
from ( select a.*
, rank() over ( partition by firstname
order by case when lastname is null then 0
else 1
end
) as rnk
from employees a
)
where rnk = 1
This works by making the existence of a surname relevant rather than the surname itself.
Two more points:
You had a nested select without parentheses. This won't work.
There's no point ordering by ROWNUM. By definition rownum returns rows in the order returned by the statement, which means the rows will always be in the order of the ROWNUM.
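Note that RANK() alone still returns every tied row, so two John Doe rows both come back when no surname is NULL. A possible extension (a sketch in Oracle syntax, assuming the employees table and column names used above) adds a ROW_NUMBER() per (firstname, lastname) and keeps only the first row of each pair.

```sql
SELECT firstname, lastname, department, salary
FROM (
    SELECT a.*,
           -- 1 for the NULL-surname rows if any exist, otherwise 1 for all rows
           RANK() OVER (
               PARTITION BY firstname
               ORDER BY CASE WHEN lastname IS NULL THEN 0 ELSE 1 END
           ) AS rnk,
           -- numbers the rows inside each (firstname, lastname) pair
           ROW_NUMBER() OVER (
               PARTITION BY firstname, lastname
               ORDER BY department
           ) AS rn
    FROM employees a
)
WHERE rnk = 1
  AND rn = 1;
```

Ordering by department is an arbitrary tie-breaker chosen for this sketch. With it, the Finance row for John Doe is kept rather than the R&D row shown in the question; change the ORDER BY if a different row should survive.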

something like this:
SQL> create table person
2 (
3 fname varchar2(10),
4 lname varchar2(10),
5 dept varchar2(10),
6 sal number
7 );
Table created.
SQL> insert into person values ('John', 'Doe', 'Finance', 20);
1 row created.
SQL> insert into person values ('John', 'Doe', 'R&D', 20);
1 row created.
SQL> insert into person values ('John', '', 'Finance', 20);
1 row created.
SQL> insert into person values ('John', 'Long', 'Finance', 20);
1 row created.
SQL> insert into person values ('Paul', 'Doe', 'R&D', 30);
1 row created.
SQL> insert into person values ('Paul', 'Doe', 'Finance', 30);
1 row created.
SQL> insert into person values ('Paul', 'Long', 'Finance', 30);
1 row created.
SQL> select fname, lname, dept, sal
2 from (select fname, lname, dept, sal,has_null,
3 row_number() over(partition by fname,
4 case when has_null = 'N' then lname else null end
5 order by lname desc nulls first) rn
6 from (select fname, lname,
7 nvl(max(case when lname is null then 'Y'
8 end) over(partition by fname), 'N') has_null, dept, sal
9 from person))
10 where rn = 1;
FNAME LNAME DEPT SAL
---------- ---------- ---------- ----------
John Finance 20
Paul Doe R&D 30
Paul Long Finance 30

This query does the same trick, but performs better.
SELECT fname,
lname,
dept,
sal
FROM (SELECT fname,
lname,
dept,
sal,
First_value(lname)
OVER(
partition BY fname
ORDER BY lname nulls first) null_domain,
Row_number()
OVER (
partition BY fname, lname
ORDER BY fname) r
FROM person)
WHERE ( ( null_domain IS NULL
AND lname IS NULL )
OR null_domain IS NOT NULL )
AND r = 1;

Display the Employee Name (Boss) and number of Employee (Subordinates) in SQL

I have a table emp with the following data:
EmpID EmpName MgrID
100 King NULL
101 Smith 100
102 Shine 100
103 Racy 102
Now I want to display the employee name (boss) and the number of employees (subordinates), something like this:
BOSS SUBORDINATES
BLAKE 5
CLARK 1
FORD 1
JONES 2
KING 3
SCOTT 1
Please guide how to go about querying this table in SQL Server 2008.
Attempted query:
select e.first_name as ename,m.first_name as mname from employees e,employees m where e.manager_id=m.employee_id

Start by self-joining on EmpID=MgrID
Group by MgrID and EmpName
Select EmpName and count(*)
Translating this to SQL is mechanical:
SELECT b.EmpName, COUNT(*)
FROM Employee e
JOIN Employee b ON b.EmpID=e.MgrID
GROUP BY b.EmpID, b.EmpName

CREATE TABLE test (
EmpID INT,
EmpName VARCHAR(100),
MgrID INT)
INSERT INTO test VALUES (100, 'King', NULL),
(101, 'Smith', 100),
(102, 'Shine', 100),
(103, 'Racy', 102)
SELECT t1.EmpName AS Boss,
COUNT(*) AS Subordinates
FROM test AS t1 INNER JOIN test AS t2 ON t1.EmpID = t2.MgrID
GROUP BY t1.EmpName

dense_rank() function equivalent in access database

SQL Server has a dense_rank function. Is there an equivalent in Access?
I have a table
employee_name employee_address
RON 23-B, TORONTO
PETER 15-C, NY
TED 23-C, LONDON
RON 23-B, TORONTO
I have to add new column to this table as follows:
employee_name employee_address employee_no
RON 23-B, TORONTO 1
PETER 15-C, NY 2
TED 23-C, LONDON 3
RON 23-B, TORONTO 1

I assume you want to port to Access a SQL Server query like this:
SELECT *
,DENSE_RANK() OVER(ORDER BY employee_name, employee_address) AS DenseRank
FROM Employee
The main idea is to generate a list of distinct (employee_name, employee_address) pairs and then number those pairs consecutively, without gaps. In the final step, we JOIN the initial data set (the Employee table) to that numbered list.
Solution 1
Query0
CREATE TABLE Employee
(
employee_id INT PRIMARY KEY
,employee_name VARCHAR(100) NOT NULL
,employee_address VARCHAR(100) NOT NULL
);
Query1
INSERT INTO Employee (employee_id, employee_name, employee_address)
VALUES (1,'RON','23-B, TORONTO');
Query2
INSERT INTO Employee (employee_id, employee_name, employee_address)
VALUES (2,'PETER','15-C, NY');
Query3
INSERT INTO Employee (employee_id, employee_name, employee_address)
VALUES (3,'TED','23-C, LONDON');
Query4
INSERT INTO Employee (employee_id, employee_name, employee_address)
VALUES (4,'SORIN','09-S, VASCAUTI');
Query5
INSERT INTO Employee (employee_id, employee_name, employee_address)
VALUES (5,'RON','23-B, TORONTO');
Query6
INSERT INTO Employee (employee_id, employee_name, employee_address)
VALUES (6,'PETER','15-C, NY');
Query7
INSERT INTO Employee (employee_id, employee_name, employee_address)
VALUES (7,'SORIN','09-S, VASCAUTI');
Query8
INSERT INTO Employee (employee_id, employee_name, employee_address)
VALUES (8,'PETER','15-C, NY');
So, the Employee content will be:
employee_id  employee_name  employee_address
-----------  -------------  ----------------
1            RON            23-B, TORONTO
2            PETER          15-C, NY
3            TED            23-C, LONDON
4            SORIN          09-S, VASCAUTI
5            RON            23-B, TORONTO
6            PETER          15-C, NY
7            SORIN          09-S, VASCAUTI
8            PETER          15-C, NY
Query9 We generate row numbers for every employee_name & employee_address tuple
CREATE TABLE TmpEmployee
(
rownumber COUNTER(1,1) PRIMARY KEY
,employee_name VARCHAR(100) NOT NULL
,employee_address VARCHAR(100) NOT NULL
);
(COUNTER(1,1) is the Access SQL AutoNumber data type. Before every execution of Query10 you need to recreate the TmpEmployee table or compact the Access database, so that the rownumber counter is reset to 1.)
Query10
INSERT INTO TmpEmployee (employee_name, employee_address)
SELECT e.employee_name, e.employee_address
FROM Employee e
GROUP BY e.employee_name, e.employee_address
Results:
rownumber  employee_name  employee_address
---------  -------------  ----------------
1          PETER          15-C, NY
2          RON            23-B, TORONTO
3          SORIN          09-S, VASCAUTI
4          TED            23-C, LONDON
Query11 The final results:
SELECT e.*, t.RowNumber AS DenseRank
FROM Employee e
INNER JOIN TmpEmployee t ON e.employee_name = t.employee_name AND e.employee_address = t.employee_address
ORDER BY e.employee_name, e.employee_address
Results:
2 PETER 15-C, NY 1
6 PETER 15-C, NY 1
8 PETER 15-C, NY 1
5 RON 23-B, TORONTO 2
1 RON 23-B, TORONTO 2
4 SORIN 09-S, VASCAUTI 3
7 SORIN 09-S, VASCAUTI 3
3 TED 23-C, LONDON 4
Solution 2
Query9 The final results:
SELECT e.*, c.RowNumber
FROM Employee e INNER JOIN
(
SELECT a.employee_name, a.employee_address, COUNT(b.employee_name) AS RowNumber
FROM
(
SELECT e.employee_name, e.employee_address
FROM Employee e
GROUP BY e.employee_name, e.employee_address
) a,
(
SELECT e.employee_name, e.employee_address
FROM Employee e
GROUP BY e.employee_name, e.employee_address
) b
WHERE a.employee_name > b.employee_name OR a.employee_name = b.employee_name AND a.employee_address >= b.employee_address
GROUP BY a.employee_name, a.employee_address
) c ON e.employee_name = c.employee_name AND e.employee_address = c.employee_address
ORDER BY c.employee_name, c.employee_address
