Set Duplicate Values to Null in PostgreSQL retaining one of the values - sql

I have a database like this:
id name email
0 Bill bill#fakeemail.com
1 John john#fakeemail.com
2 Susan susan#fakeemail.com
3 Susan J susan#fakeemail.com
I want to remove duplicate emails by setting the value to null, but retain at least 1 email on one of the rows (doesn't really matter which one).
So that the resulting database would look like this:
id name email
0 Bill bill#fakeemail.com
1 John john#fakeemail.com
2 Susan susan#fakeemail.com
3 Susan J
I was able to target the rows like this
SELECT email, COUNT(email) AS count FROM users GROUP BY email HAVING COUNT(email) > 1
But can't figure out how to set the value to null while still retaining at least 1.

Update the rows which have the same email but greater id:
update my_table t1
set email = null
where exists (
select from my_table t2
where t1.email = t2.email and t1.id > t2.id
);
Working example in rextester.
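As a rough, self-contained check of the EXISTS approach above, here is a minimal sketch that runs the same idea against an in-memory SQLite database rather than PostgreSQL. The function name is hypothetical, and the inner query uses SELECT 1 and no alias on the update target because that is an assumption about what SQLite accepts most portably; the logic is otherwise the same.

```python
import sqlite3


def null_out_duplicate_emails():
    """Null every email except the one on the lowest id of each duplicate group."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE my_table (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )
    # Sample rows copied from the question.
    conn.executemany(
        "INSERT INTO my_table VALUES (?, ?, ?)",
        [
            (0, "Bill", "bill#fakeemail.com"),
            (1, "John", "john#fakeemail.com"),
            (2, "Susan", "susan#fakeemail.com"),
            (3, "Susan J", "susan#fakeemail.com"),
        ],
    )
    # A row loses its email when another row has the same email and a smaller id.
    conn.execute(
        """
        UPDATE my_table
        SET email = NULL
        WHERE EXISTS (
            SELECT 1 FROM my_table AS t2
            WHERE t2.email = my_table.email AND t2.id < my_table.id
        )
        """
    )
    rows = conn.execute("SELECT id, name, email FROM my_table ORDER BY id").fetchall()
    conn.close()
    return rows


for row in null_out_duplicate_emails():
    print(row)
```

Because the row with the smallest id in each group never matches the EXISTS condition, at least one copy of every email survives.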

You can use a windowed partition to assign a row number to each email group, and then use that generated row number to modify all rows except for one. Something like this:
WITH annotated_persons AS (
SELECT
id,
name,
email,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS i
FROM
persons
)
UPDATE persons
SET email = null
FROM annotated_persons
WHERE persons.id = annotated_persons.id AND annotated_persons.i <> 1;
If your database does not accept UPDATE ... FROM, gather the IDs of the persons whose row number is not 1 in a subquery instead, and change the update's condition to
WHERE id IN (SELECT id FROM annotated_persons WHERE i <> 1)
It's been a while since I've used a window function.
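The row-number idea can be tried out with a small sketch like the one below. It assumes an in-memory SQLite database (version 3.25 or later, for window functions) instead of PostgreSQL, uses the subquery form of the condition, and the function name is made up for illustration.

```python
import sqlite3


def keep_first_email_per_group():
    """Number the rows inside each email group and null all but row number 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE persons (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )
    # Sample rows copied from the question.
    conn.executemany(
        "INSERT INTO persons VALUES (?, ?, ?)",
        [
            (0, "Bill", "bill#fakeemail.com"),
            (1, "John", "john#fakeemail.com"),
            (2, "Susan", "susan#fakeemail.com"),
            (3, "Susan J", "susan#fakeemail.com"),
        ],
    )
    # ORDER BY id makes the surviving row deterministic (the lowest id).
    conn.execute(
        """
        WITH annotated_persons AS (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS i
            FROM persons
        )
        UPDATE persons
        SET email = NULL
        WHERE id IN (SELECT id FROM annotated_persons WHERE i <> 1)
        """
    )
    rows = conn.execute("SELECT id, name, email FROM persons ORDER BY id").fetchall()
    conn.close()
    return rows


for row in keep_first_email_per_group():
    print(row)
```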

Related

Removing records based on two columns

What I'm trying to do is taking these records that looks like this:
Name Enrollment_Month Premium
John 20201201 $76.00
John 20201201 $54.00
Tony 20201201 $20
and change it to look like this:
Name Enrollment_Month Premium
Tony 20201201 $20
Basically trying to remove both records where name and enrollment month are the same.
Any thoughts? I would really appreciate it.
You can do:
delete from t
where exists (select 1
from t t2
where t2.name = t.name and
t2.enrollment_month = t.enrollment_month and
t2.rowid <> t.rowid
);
Note: This will delete rows where there are 2 or more. If you specifically want only pairs deleted:
delete from t
where 2 = (select count(*)
from t t2
where t2.name = t.name and
t2.enrollment_month = t.enrollment_month
);
One option is to use a HAVING clause to find the duplicates when grouping by those columns (enrollment_month, name), such as
DELETE FROM t
WHERE (enrollment_month, name) IN
(SELECT enrollment_month, name
FROM t
GROUP BY enrollment_month, name
HAVING COUNT(*) > 1)
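To see the grouped delete in action, the following sketch runs it on the question's three rows. It is only an illustration: it assumes an in-memory SQLite database (which supports the row-value IN form from version 3.15) rather than Oracle, and the function name and column types are assumptions.

```python
import sqlite3


def delete_all_duplicated_rows():
    """Delete every row whose (enrollment_month, name) pair occurs more than once."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (name TEXT, enrollment_month TEXT, premium REAL)")
    # Sample rows copied from the question.
    conn.executemany(
        "INSERT INTO t VALUES (?, ?, ?)",
        [
            ("John", "20201201", 76.00),
            ("John", "20201201", 54.00),
            ("Tony", "20201201", 20.00),
        ],
    )
    # The subquery lists the duplicated pairs; every row in such a pair is removed.
    conn.execute(
        """
        DELETE FROM t
        WHERE (enrollment_month, name) IN (
            SELECT enrollment_month, name
            FROM t
            GROUP BY enrollment_month, name
            HAVING COUNT(*) > 1
        )
        """
    )
    rows = conn.execute("SELECT name, enrollment_month, premium FROM t").fetchall()
    conn.close()
    return rows


print(delete_all_duplicated_rows())
```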
You can use an analytic function to identify duplicate records and delete them based on rowid as follows:
DELETE FROM YOUR_TABLE T
WHERE T.ROWID IN (
SELECT CASE
WHEN COUNT(1) OVER (PARTITION BY ENROLLMENT_MONTH, NAME) > 1
THEN ROWID
END AS ROWIDS
FROM YOUR_TABLE)

How to select rows with condition? sql, select sentence

I have table like this:
NAME IDENTIFICATIONR SCORE
JOHN DB 10
JOHN IT NULL
KAL DB 9
HENRY KK 3
KAL DB 10
HENRY IP 9
ALI IG 10
ALI PA 9
With a select statement, I want the result to contain only those names whose scores are all 9 or above. For example, Henry cannot be selected because, although he has a score of 9 in one row, he has a score of 3 in the other (names with null values should also be omitted).
My newtable should look like this:
NAME
KAL
ALI
I'm using a sas program. THANK YOU!!
The COUNT of names will be <> COUNT of scores if there is a missing score. Requesting equality in the having clause will ensure no person with a missing score is in your result set.
proc sql;
create table want as
select distinct name from have
group by name
having count(name) = count(score) and min(score) >= 9;
quit;
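Here is a solution:
select name
from table_name where score >= 9
and score is not null;
Select NAME from YOUR_TABLE_NAME where SCORE >= 9 and score is not null
Note that both of these filter individual rows rather than names, so a name with one low or missing score (HENRY, JOHN) is still returned for its qualifying row.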
You can use aggregation:
select name
from table t
group by name
having sum(case when (score < 9 or score is null) then 1 else 0 end) = 0;
If you want full rows, you can use not exists:
select t.*
from table t
where not exists (select 1
from table t1
where t1.name = t.name and (t1.score < 9 or t1.score is null)
);
You seem to be treating NULL scores as a value less than 9. You can also just use coalesce() with min():
select name
from have
group by name
having min(coalesce(score, 0)) >= 9;
Note that select distinct is almost never useful with group by -- and SAS proc sql probably does not optimize it well.
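The aggregate answers in this thread can be checked against the question's data with a sketch like this one. It assumes an in-memory SQLite database instead of SAS proc sql, and the function name and threshold parameter are hypothetical.

```python
import sqlite3


def names_with_all_scores_at_least(threshold=9):
    """Return names whose every score is present and at least `threshold`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE have (name TEXT, identificationr TEXT, score INTEGER)")
    # Sample rows copied from the question; None stands for a NULL score.
    conn.executemany(
        "INSERT INTO have VALUES (?, ?, ?)",
        [
            ("JOHN", "DB", 10),
            ("JOHN", "IT", None),
            ("KAL", "DB", 9),
            ("HENRY", "KK", 3),
            ("KAL", "DB", 10),
            ("HENRY", "IP", 9),
            ("ALI", "IG", 10),
            ("ALI", "PA", 9),
        ],
    )
    # COUNT(score) skips NULLs, so the two counts differ when a score is missing.
    rows = conn.execute(
        """
        SELECT name
        FROM have
        GROUP BY name
        HAVING COUNT(*) = COUNT(score) AND MIN(score) >= ?
        ORDER BY name
        """,
        (threshold,),
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]


print(names_with_all_scores_at_least())
```

The condition is evaluated per group, which is why HENRY and JOHN drop out even though each has one row that would pass a plain WHERE filter.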

How to select rows with exactly 2 values in a column fast within a table that has 10 million records?

I have a table (TestFI) with the following data for instance
FIID Email
---------
null a#a.com
1 a#a.com
null b#b.com
2 b#b.com
3 c#c.com
4 c#c.com
5 c#c.com
null d#d.com
null d#d.com
and I need emails that appear exactly twice AND have one row where FIID is null and one where it is not. So for the data above, only a#a.com and b#b.com fit the bill.
I was able to construct a multilevel query like so
Select
FIID,
Email
from
TestFI
where
Email in
(
Select
Email
from
(
Select
Email
from
TestFI
where
Email in
(
select
Email
from
TestFI
where
FIID is null or FIID is not null
group by Email
having
count(Email) = 2
)
and
FIID is null
)as Temp1
group by Email
having count(Email) = 1
)
However, it took nearly 10 minutes to go through 10 million records. Is there a better way to do this? I know I must be doing some dumb things here.
Thanks
I would try this query:
SELECT EMail, MAX(FIID)
FROM TestFI
GROUP BY EMail
HAVING COUNT(*)=2 AND COUNT(FIID)=1
It will return the EMail column, and the non-null value of FIID. The other value of FIID is null.
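A quick way to convince yourself that the two counts are enough is to run the grouped query on the sample data. The sketch below assumes an in-memory SQLite database and a made-up function name; it says nothing about performance on 10 million rows.

```python
import sqlite3


def emails_with_one_null_and_one_fiid():
    """Return (email, fiid) for emails with exactly two rows, one of them NULL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE TestFI (FIID INTEGER, Email TEXT)")
    # Sample rows copied from the question; None stands for a NULL FIID.
    conn.executemany(
        "INSERT INTO TestFI VALUES (?, ?)",
        [
            (None, "a#a.com"),
            (1, "a#a.com"),
            (None, "b#b.com"),
            (2, "b#b.com"),
            (3, "c#c.com"),
            (4, "c#c.com"),
            (5, "c#c.com"),
            (None, "d#d.com"),
            (None, "d#d.com"),
        ],
    )
    # COUNT(*) counts all rows in the group, COUNT(FIID) only the non-null ones.
    rows = conn.execute(
        """
        SELECT Email, MAX(FIID)
        FROM TestFI
        GROUP BY Email
        HAVING COUNT(*) = 2 AND COUNT(FIID) = 1
        ORDER BY Email
        """
    ).fetchall()
    conn.close()
    return rows


print(emails_with_one_null_and_one_fiid())
```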
With an index on (email, fiid), I would be tempted to try:
select tnull.*, tnotnull.*
from testfi tnull join
testfi tnotnull
on tnull.email = tnotnull.email left outer join
testfi tnothing
on tnull.email = tnothing.email and
tnothing.fiid <> tnotnull.fiid
where tnothing.email is null and
tnull.fiid is null and
tnotnull.fiid is not null;
The extra condition on tnothing keeps it from matching the two rows already joined; this assumes FIID values are distinct and that an email has at most one NULL row. Performance definitely depends on the database. This will keep all the accesses within the index. In some databases, an aggregation might be faster. Performance also depends on the selectivity of the queries. For instance, if there is one NULL record and you have the index (fiid, email), this should be much faster than an aggregation.
Maybe something like ...
select
a.FIID,
a.Email
from
TestFI a
inner join TestFI b on (a.Email=b.Email)
where
a.FIID is not null
and b.FIID is null
;
And make sure Email and FIID are indexed.
I need records that appear exactly twice AND have 1 row with FIID is null and one is not
1. In the innermost select, replace null FIIDs with a sentinel value (-1 here, assuming real FIIDs are never negative):
select email, coalesce(fiid,-1) as AdjustedFIID from T
2. Group by email, keeping the emails that have exactly two rows, one with the sentinel and one without:
select email
from
(
select email, coalesce(fiid,-1) as AdjustedFIID from T
) as X
group by email
having count(email) = 2 and min(AdjustedFIID) = -1 and max(AdjustedFIID) > -1

SQL - Removing Duplicate without 'hard' coding?

Here's my scenario.
I have a table with 3 columns I want to return within a stored procedure: email, name and id. id must be 3 or 4, and each email must appear only once per user, as some users have multiple entries.
I have a Select statement as follows
SELECT
DISTINCT email,
name,
id
from table
where
id = 3
or id = 4
OK, fairly simple, but some users have entries with both 3 and 4, so they appear twice. If they appear twice, I want only the row with the id of 4 to remain. I'll give another example below as it's hard to explain.
Table -
Email Name Id
jimmy#domain.com jimmy 4
brian#domain.com brian 4
kevin#domain.com kevin 3
jimmy#domain.com jimmy 3
So in the above scenario I would want to ignore the jimmy with the id of 3, any way of doing this without hard coding?
Thanks
SELECT
email,
name,
max(id)
from table
where
id in( 3, 4 )
group by email, name
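Here is a small sketch of the MAX(id) grouping run on the question's data. It assumes an in-memory SQLite database, and the table name user_entries and the function name are hypothetical stand-ins, since the original table is only called "table".

```python
import sqlite3


def highest_id_per_user():
    """Return one row per (email, name), keeping the larger of ids 3 and 4."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE user_entries (email TEXT, name TEXT, id INTEGER)")
    # Sample rows copied from the question.
    conn.executemany(
        "INSERT INTO user_entries VALUES (?, ?, ?)",
        [
            ("jimmy#domain.com", "jimmy", 4),
            ("brian#domain.com", "brian", 4),
            ("kevin#domain.com", "kevin", 3),
            ("jimmy#domain.com", "jimmy", 3),
        ],
    )
    # Grouping collapses jimmy's two rows; MAX(id) picks the 4 over the 3.
    rows = conn.execute(
        """
        SELECT email, name, MAX(id)
        FROM user_entries
        WHERE id IN (3, 4)
        GROUP BY email, name
        ORDER BY email
        """
    ).fetchall()
    conn.close()
    return rows


for row in highest_id_per_user():
    print(row)
```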
Is this what you want to achieve?
SELECT Email, Name, MAX(Id) FROM Table WHERE Id IN (3, 4) GROUP BY Email, Name;
Sometimes using HAVING COUNT(*) > 1 may be useful to find duplicated records.
select Email, count(*) from table group by Email having count(*) > 1
or
select Email, count(*) from table group by Email having count(*) > 1 and max(id) > 3
The solution provided before with the select MAX(ID) from table sounds good for this case.
This may be an alternative solution.
What RDBMS are you using? This will return only one "Jimmy", using RANK():
SELECT A.email, A.name,A.id
FROM SO_Table A
INNER JOIN(
SELECT
email, name,id,RANK() OVER (Partition BY name ORDER BY ID DESC) AS COUNTER
FROM SO_Table B
) X ON X.ID = A.ID AND X.NAME = A.NAME
WHERE X.COUNTER = 1
Returns:
email name id
------------------------------
jimmy#domain.com jimmy 4
brian#domain.com brian 4
kevin#domain.com kevin 3

Using sql to keep only a single record where both name field and address field repeat in 5+ records

I am trying to delete all but one record from table where name field repeats same value more than 5 times and the address field repeats more than five times for a table. So if there are 5 records with a name field and address field that are the same for all 5, then I would like to delete 4 out of 5. An example:
id name address
1 john 6440
2 john 6440
3 john 6440
4 john 6440
5 john 6440
I would only want to return 1 record from the 5 records above.
I'm still having problems with this.
1) I create a table called KeepThese and give it a primary key id.
2) I create a query called delete_1 and copy this into it:
INSERT INTO KeepThese
SELECT ID FROM
(
SELECT Min(ID) AS ID
FROM Print_Ready
GROUP BY names_1, addresses
HAVING COUNT(*) >=5
UNION ALL
SELECT ID FROM Print_Ready as P
INNER JOIN
(SELECT Names_1, addresses
FROM Print_ready
GROUP BY Names_1, addresses
HAVING COUNT(*) < 5) as ThoseLessThan5
ON ThoseLessThan5.Names_1 = P.Names_1
AND ThoseLessThan5.addresses = P.addresses
)
3) I create a query called delete_2 and copy this into it:
DELETE P.* FROM Print_Ready as P
LEFT JOIN KeepThese as K
ON K.ID = P.ID
WHERE K.ID IS NULL
4) Then I run delete_1. I get a message that says "circular reference caused by alias ID" So I change this piece:
FROM (SELECT Min(ID) AS ID
to say this:
FROM (SELECT Min(ID) AS ID2
Then I double click again and a popup displays saying Enter Parameter Value for ID. This indicates that it doesn't know what ID is. But print_ready is only a query and while it has an id, it is in reality the id of another table that got filtered into this query.
Not sure what to do at this point.
I'm not sure CREATE TABLE isolate_duplicates AS works in Access; besides, you should give count(*) a name for the new table.
This may work:
SELECT name, address
INTO isolate_duplicate
FROM print_ready
GROUP BY name, address
HAVING COUNT(*) > 4;
DELETE FROM print_ready
WHERE name + address
IN (SELECT name + address
FROM isolate_duplicate);
INSERT INTO print_ready (name, address)
SELECT name, address
FROM isolate_duplicate;
DROP TABLE isolate_duplicate;
Not tested.
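For comparison, the same goal (keep one row per name/address pair that occurs five or more times) can be sketched as a single delete that keeps the lowest id. This is an assumption-laden illustration: it runs on an in-memory SQLite database rather than Access, treats print_ready as a real table with its own id, and the "mary" rows and the function name are invented to show that smaller groups are left alone.

```python
import sqlite3


def collapse_large_duplicate_groups(min_group_size=5):
    """Keep only the lowest id in each (name, address) group of the given size or more."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE print_ready (id INTEGER PRIMARY KEY, name TEXT, address TEXT)"
    )
    # Five "john" rows from the question plus two hypothetical "mary" rows.
    conn.executemany(
        "INSERT INTO print_ready VALUES (?, ?, ?)",
        [(i, "john", "6440") for i in range(1, 6)]
        + [(6, "mary", "100"), (7, "mary", "100")],
    )
    # g holds one keep_id per large group; every other row of that group is deleted.
    conn.execute(
        """
        DELETE FROM print_ready
        WHERE id IN (
            SELECT p.id
            FROM print_ready AS p
            JOIN (
                SELECT name, address, MIN(id) AS keep_id
                FROM print_ready
                GROUP BY name, address
                HAVING COUNT(*) >= ?
            ) AS g
              ON g.name = p.name AND g.address = p.address
            WHERE p.id <> g.keep_id
        )
        """,
        (min_group_size,),
    )
    ids = [row[0] for row in conn.execute("SELECT id FROM print_ready ORDER BY id")]
    conn.close()
    return ids


print(collapse_large_duplicate_groups())
```

Comparing name and address as separate columns also avoids the ambiguity of concatenating them, where different pairs can produce the same combined string.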