Removing records based on two columns - sql

What I'm trying to do is take records that look like this:
Name Enrollment_Month Premium
John 20201201 $76.00
John 20201201 $54.00
Tony 20201201 $20
and change it to look like this:
Name Enrollment_Month Premium
Tony 20201201 $20
Basically, I'm trying to remove both records whenever the name and enrollment month are the same.
Any thoughts would be really appreciated.

You can do:
delete from t
where exists (select 1
from t t2
where t2.name = t.name and
t2.enrollment_month = t.enrollment_month and
t2.rowid <> t.rowid
);
Note: This will delete rows where there are 2 or more. If you specifically want only pairs deleted:
delete from t
where 2 = (select count(*)
from t t2
where t2.name = t.name and
t2.enrollment_month = t.enrollment_month
);
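To see how the two statements above differ, here is a minimal sketch that runs both against an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in chosen because it also exposes a rowid; the three "Ann" rows are hypothetical and were added so that one group has more than two members.

```python
import sqlite3

# Sample rows from the question, plus a hypothetical "Ann" group of three.
ROWS = [
    ("John", 20201201, 76.00),
    ("John", 20201201, 54.00),
    ("Tony", 20201201, 20.00),
    ("Ann", 20201201, 10.00),
    ("Ann", 20201201, 11.00),
    ("Ann", 20201201, 12.00),
]

def remaining_names(delete_sql):
    """Load ROWS into a fresh table, run the DELETE, return the surviving names."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (name TEXT, enrollment_month INTEGER, premium REAL)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", ROWS)
    conn.execute(delete_sql)
    names = [row[0] for row in conn.execute("SELECT name FROM t ORDER BY name")]
    conn.close()
    return names

# Deletes every row that has at least one other row with the same key.
DELETE_ANY_DUPLICATE = """
DELETE FROM t
WHERE EXISTS (SELECT 1 FROM t AS t2
              WHERE t2.name = t.name
                AND t2.enrollment_month = t.enrollment_month
                AND t2.rowid <> t.rowid)
"""

# Deletes only rows whose key occurs exactly twice.
DELETE_EXACT_PAIRS = """
DELETE FROM t
WHERE 2 = (SELECT COUNT(*) FROM t AS t2
           WHERE t2.name = t.name
             AND t2.enrollment_month = t.enrollment_month)
"""

any_dup = remaining_names(DELETE_ANY_DUPLICATE)
pairs = remaining_names(DELETE_EXACT_PAIRS)
print(any_dup)  # ['Tony']
print(pairs)    # ['Ann', 'Ann', 'Ann', 'Tony']
```

With the question's data alone both statements leave only Tony; they diverge only when a name and month combination appears three or more times.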

One option would be to use a HAVING clause to find the duplicates after grouping by those columns (enrollment_month, name), such as:
DELETE t
WHERE (enrollment_month, name) IN
(SELECT enrollment_month, name
FROM t
GROUP BY enrollment_month, name
HAVING COUNT(*) > 1)

You can use an analytic function to identify duplicate records and delete them based on rowid, as follows:
DELETE FROM YOUR_TABLE T
WHERE T.ROWID IN (
SELECT CASE
WHEN COUNT(1) OVER (PARTITION BY ENROLLMENT_MONTH, NAME) > 1
THEN ROWID
END AS ROWIDS
FROM YOUR_TABLE)

deleting specific duplicate and original entries in a table based on date

I have a table called "main" which has 4 columns: ID, name, DateID and Sign.
I want to create a query that deletes entries from this table if the same ID appears twice within a certain DateID range.
I have a where clause that searches the previous 3 weeks:
where DateID =((SELECT MAX( DateID)
WHERE DateID < ( SELECT MAX( DateID )-3))
An example of the dataset I'm working with:
id name DateID sign
12345 Paul 1915 Up
23658 Danny 1915 Down
37868 Jake 1916 Up
37542 Elle 1917 Up
12345 Paul 1917 Down
87456 John 1918 Up
78563 Luke 1919 Up
23658 Danny 1920 Up
In the case above, both entries for ID 12345 would need to be removed.
However, the entries for ID 23658 would need to be kept, as their DateIDs are more than 3 apart.
How would this be possible?
You can use window functions for this.
It's not quite clear, but it seems LAG and conditional COUNT should fit what you need.
DELETE t
FROM (
SELECT *,
CountWithinDate = COUNT(CASE WHEN t.PrevDate >= t.DateId - 3 THEN 1 END) OVER (PARTITION BY t.id)
FROM (
SELECT *,
PrevDate = LAG(t.DateID) OVER (PARTITION BY t.id ORDER BY t.DateID)
FROM YourTable t
) t
) t
WHERE CountWithinDate > 0;
Note that you do not need to re-join the table, you can delete directly from the t derived table.
Hope this works:
DELETE FROM test_tbl
WHERE id IN (
SELECT T1.id
FROM test_tbl T1
WHERE EXISTS (SELECT 1 FROM test_tbl T2 WHERE T1.id = T2.id AND ABS(T2.dateid - T1.dateid) < 3 AND T1.dateid <> T2.dateid)
)
In case you need more logic for data processing, I would suggest using Stored Procedure.
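As a rough check of the EXISTS approach against the question's sample data, here is a sketch using Python's sqlite3 module. The table name main_table and the WINDOW constant are placeholders for this example, and treating a gap of exactly 3 as "within range" is an assumption, since the question leaves the boundary open and the two answers treat it differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# main_table stands in for the asker's "main" table (hypothetical name).
conn.execute("CREATE TABLE main_table (id INTEGER, name TEXT, DateID INTEGER, sign TEXT)")
conn.executemany(
    "INSERT INTO main_table VALUES (?, ?, ?, ?)",
    [
        (12345, "Paul", 1915, "Up"),
        (23658, "Danny", 1915, "Down"),
        (37868, "Jake", 1916, "Up"),
        (37542, "Elle", 1917, "Up"),
        (12345, "Paul", 1917, "Down"),
        (87456, "John", 1918, "Up"),
        (78563, "Luke", 1919, "Up"),
        (23658, "Danny", 1920, "Up"),
    ],
)

WINDOW = 3  # assumed inclusive: a DateID gap of up to 3 counts as "within range"

# Delete every row that has another row with the same id on a different,
# but nearby, DateID.
conn.execute(
    """
    DELETE FROM main_table
    WHERE EXISTS (SELECT 1 FROM main_table AS other
                  WHERE other.id = main_table.id
                    AND other.DateID <> main_table.DateID
                    AND ABS(other.DateID - main_table.DateID) <= ?)
    """,
    (WINDOW,),
)

remaining = conn.execute(
    "SELECT id, DateID FROM main_table ORDER BY DateID, id"
).fetchall()
print(remaining)
# [(23658, 1915), (37868, 1916), (37542, 1917), (87456, 1918), (78563, 1919), (23658, 1920)]
```

Both rows for 12345 (gap of 2) are gone, while both rows for 23658 (gap of 5) survive. Two rows sharing the same id and the same DateID would not be caught by this condition.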

How to select rows with condition? sql, select sentence

I have table like this:
NAME IDENTIFICATIONR SCORE
JOHN DB 10
JOHN IT NULL
KAL DB 9
HENRY KK 3
KAL DB 10
HENRY IP 9
ALI IG 10
ALI PA 9
With a select statement, I want my result to contain only those names whose scores are all 9 or above. For example, Henry cannot be selected: he has a score of 9 in one row, but a score of 3 in another (names with null scores should also be omitted).
My new table should look like this:
NAME
KAL
ALI
I'm using SAS. Thank you!
The COUNT of names will be <> COUNT of scores if there is a missing score. Requesting equality in the having clause will ensure no person with a missing score is in your result set.
proc sql;
create table want as
select distinct name from have
group by name
having count(name) = count(score) and min(score) >= 9;
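Here is a solution:
select name
from table_name
where score >= 9
and score is not null;
Note that this filters individual rows rather than names, so a name such as HENRY, who has one score of 9 and another of 3, is still returned.
A similar row-level query:
select name from your_table_name where score >= 9 and score is not null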
You can use aggregation:
select name
from table t
group by name
having sum(case when (score < 9 or score is null) then 1 else 0 end) = 0;
If you want full rows, then you can use not exists:
select t.*
from table t
where not exists (select 1
from table t1
where t1.name = t.name and (t1.score < 9 or t1.score is null)
);
You seem to be treating NULL scores as a value less than 9. You can also just use coalesce() with min():
select name
from have
group by name
having min(coalesce(score, 0)) >= 9;
Note that select distinct is almost never useful with group by -- and SAS proc sql probably does not optimize it well.
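To make the difference between filtering rows and filtering names concrete, here is a small sketch that loads the question's data into an in-memory SQLite database via Python's sqlite3 module. SQLite is only a stand-in for SAS proc sql here; the table name have is taken from the answers above, and the column name identification is a placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE have (name TEXT, identification TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO have VALUES (?, ?, ?)",
    [
        ("JOHN", "DB", 10),
        ("JOHN", "IT", None),  # missing score
        ("KAL", "DB", 9),
        ("HENRY", "KK", 3),
        ("KAL", "DB", 10),
        ("HENRY", "IP", 9),
        ("ALI", "IG", 10),
        ("ALI", "PA", 9),
    ],
)

# Per-name test: no missing scores, and the lowest score is at least 9.
per_name = [
    row[0]
    for row in conn.execute(
        """
        SELECT name FROM have
        GROUP BY name
        HAVING COUNT(name) = COUNT(score) AND MIN(score) >= 9
        ORDER BY name
        """
    )
]

# Per-row test: keeps any name that has at least one qualifying row.
per_row = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT name FROM have WHERE score >= 9 ORDER BY name"
    )
]

print(per_name)  # ['ALI', 'KAL']
print(per_row)   # ['ALI', 'HENRY', 'JOHN', 'KAL']
```

Only the grouped query matches the expected result, because a plain WHERE clause cannot see a name's other rows.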

Set Duplicate Values to Null in PostgreSQL retaining one of the values

I have a database like this:
id name email
0 Bill bill@fakeemail.com
1 John john@fakeemail.com
2 Susan susan@fakeemail.com
3 Susan J susan@fakeemail.com
I want to remove duplicate emails by setting the value to null, but retain at least 1 email on one of the rows (doesn't really matter which one).
So that the resulting database would look like this:
id name email
0 Bill bill@fakeemail.com
1 John john@fakeemail.com
2 Susan susan@fakeemail.com
3 Susan J
I was able to target the rows like this
SELECT COUNT(email) as count FROM users WHERE count > 1
But can't figure out how to set the value to null while still retaining at least 1.
Update the rows which have the same email but greater id:
update my_table t1
set email = null
where exists (
select from my_table t2
where t1.email = t2.email and t1.id > t2.id
);
Working example in rextester.
You can use a window function to assign a row number within each email group, and then use that row number to modify every row except one. Something like this:
WITH annotated_persons AS (
SELECT
id,
name,
email,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS i
FROM
persons
)
UPDATE persons
SET email = NULL
FROM annotated_persons
WHERE persons.id = annotated_persons.id AND annotated_persons.i <> 1;
Alternatively, you can gather the ids of the rows whose row number is not 1 in a subquery and change the update's condition to
WHERE id IN (SELECT id FROM annotated_persons WHERE i <> 1)
It's been a while since I've used a window function.
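Here is a sketch of the subquery variant (collect the ids whose row number is not 1, then update by id), run against the question's data with Python's sqlite3 module. SQLite stands in for PostgreSQL, and window functions need SQLite 3.25 or later; the ORDER BY id inside the window is an assumption that makes the lowest id the row that keeps its email.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE persons (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO persons VALUES (?, ?, ?)",
    [
        (0, "Bill", "bill@fakeemail.com"),
        (1, "John", "john@fakeemail.com"),
        (2, "Susan", "susan@fakeemail.com"),
        (3, "Susan J", "susan@fakeemail.com"),
    ],
)

# Number the rows within each email group, then null out every row
# except the first one in its group.
conn.execute(
    """
    UPDATE persons
    SET email = NULL
    WHERE id IN (
        SELECT id
        FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS i
            FROM persons
        ) AS annotated_persons
        WHERE i <> 1
    )
    """
)

rows = conn.execute("SELECT id, name, email FROM persons ORDER BY id").fetchall()
for row in rows:
    print(row)
# (0, 'Bill', 'bill@fakeemail.com')
# (1, 'John', 'john@fakeemail.com')
# (2, 'Susan', 'susan@fakeemail.com')
# (3, 'Susan J', None)
```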

Delete rows where date was least updated

How can I delete rows where dateupdated was least updated?
My table is
Name Dateupdated ID status
john 1/02/17 JHN1 A
john 1/03/17 JHN2 A
sally 1/02/17 SLLY1 A
sally 1/03/17 SLLY2 A
Mike 1/03/17 MK1 A
Mike 1/04/17 MK2 A
I want to be left with the following after the data removal:
Name Date ID status
john 1/03/17 JHN2 A
sally 1/03/17 SLLY2 A
Mike 1/04/17 MK2 A
If you really want to "delete rows where dateupdated was least updated" then a simple single-row subquery should do the trick.
DELETE MyTable
WHERE Date = (SELECT MIN(Date) From MyTable)
If on the other hand you just want to delete the row with the earliest Date per person (as identified by their ID) you could use:
DELETE MyTable
FROM MyTable a
JOIN (SELECT ID, MIN(Date) MinDate FROM MyTable GROUP BY ID) b
ON a.ID = b.ID AND a.Date = b.MinDate
The idea here is you create an aggregate query that returns rows containing the columns that would match the rows you want deleted, then join to it. Because it's an inner join, rows that do not match the criteria will be excluded.
If people are uniquely identified by something else (e.g. Name), then you can just substitute that for the ID in my example above.
I am thinking though that you don't want either of these. I think you want to delete everything except for each person's latest row. If that is the case, try this:
DELETE MyTable
WHERE EXISTS (SELECT 0 FROM MyTable b WHERE b.ID = MyTable.ID AND b.Date > MyTable.Date)
The idea here is you check for existence of another data row with the same ID and a later date. If there is a later record, delete this one.
The nice thing about the last example is you can run it over and over and every person will still be left with exactly one row. The other two queries, if run over and over, will nibble away at the table until it is empty.
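The "safe to run repeatedly" property of the last query can be checked with a short sketch using Python's sqlite3 module. This version matches people by Name rather than ID, because the sample IDs are unique per row, and it stores the dates as ISO strings (2017-01-02 and so on), which is an assumption about how the question's dates should be read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (Name TEXT, Dateupdated TEXT, ID TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?, ?)",
    [
        ("john", "2017-01-02", "JHN1", "A"),
        ("john", "2017-01-03", "JHN2", "A"),
        ("sally", "2017-01-02", "SLLY1", "A"),
        ("sally", "2017-01-03", "SLLY2", "A"),
        ("Mike", "2017-01-03", "MK1", "A"),
        ("Mike", "2017-01-04", "MK2", "A"),
    ],
)

# Delete a row whenever the same person has a later row.
KEEP_LATEST = """
DELETE FROM MyTable
WHERE EXISTS (SELECT 1 FROM MyTable AS b
              WHERE b.Name = MyTable.Name
                AND b.Dateupdated > MyTable.Dateupdated)
"""

conn.execute(KEEP_LATEST)
after_first = conn.execute("SELECT Name, ID FROM MyTable ORDER BY ID").fetchall()

conn.execute(KEEP_LATEST)  # second run finds nothing left to delete
after_second = conn.execute("SELECT Name, ID FROM MyTable ORDER BY ID").fetchall()

print(after_first)   # [('john', 'JHN2'), ('Mike', 'MK2'), ('sally', 'SLLY2')]
print(after_second)  # same rows as after_first
```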
P.S. As these are significantly different solutions, I suggest you spend some effort learning how to articulate unambiguous requirements. This is an extremely important skill for any developer.
This deletes rows where the name is a duplicate, and deletes all but the latest row for each name. This is different from your stated question.
Using a common table expression (cte) and row_number():
;with cte as (
select *
, rn = row_number() over (
partition by Name
order by Dateupdated desc
)
from t
)
/* ------------------------------------------------
-- Remove duplicates by deleting rows
-- where the row number (rn) is greater than 1
-- leaving the first row for each partition
------------------------------------------------ */
delete
from cte
where cte.rn > 1
select * from t
rextester: http://rextester.com/HZBQ50469
returns:
+-------+-------------+-------+--------+
| Name | Dateupdated | ID | status |
+-------+-------------+-------+--------+
| john | 2017-01-03 | JHN2 | A |
| sally | 2017-01-03 | SLLY2 | A |
| Mike | 2017-01-04 | MK2 | A |
+-------+-------------+-------+--------+
Without using the cte it can be written as:
delete d
from (
select *
, rn = row_number() over (
partition by Name
order by Dateupdated desc
)
from t
) as d
where d.rn > 1
This should do the trick:
delete a
from MyTable a
where not exists (
select top 1 1
from MyTable b
where b.name = a.name
and b.DateUpdated < a.DateUpdated
)
That is, remove every row for which no other row with the same name has an earlier date. Note that a name with only one row also meets this condition, so that row is deleted too.
Your Name column has Mike and Mik2, which are different from each other.
So, if that is not a mistake, the column to group by must be the ID column without its last digit.
If it is not a mistake, I think the following is more accurate:
delete a
from MyTable a
inner join
(select substring(ID, 1, len(ID) - 1) as ID, min(Dateupdated) as MinDate
from MyTable
group by substring(ID, 1, len(ID) - 1)
) b
on substring(a.ID, 1, len(a.ID) - 1) = b.ID and a.Dateupdated = b.MinDate
You can test it at SQLFiddle: http://sqlfiddle.com/#!6/9c440/1

write a query to identify discrepancy

I have a table with student IDs and student names. There have been issues with assigning unique student IDs to students, and hence I want to find the duplicates.
Here is the sample Table:
Student ID Student Name
1 Jack
1 John
1 Bill
2 Amanda
2 Molly
3 Ron
4 Matt
5 James
6 Kathy
6 Will
Here I want a third column, "Duplicate_Count", to display the count of duplicate records.
For example, "Duplicate_Count" would display "3" for Student ID = 1, and so on. How can I do this?
Thanks in advance
Select StudentId, Count(*) DupCount
From Table
Group By StudentId
Having Count(*) > 1
Order By Count(*) desc
Select
aa.StudentId, aa.StudentName, bb.DupCount
from
Table as aa
join
(
Select StudentId, Count(*) as DupCount from Table group by StudentId
) as bb
on aa.StudentId = bb.StudentId
The derived table gives the count for each StudentId; it is joined back to the original table to add the count to each student record.
If you want to add a column to the table to hold the count, this query can be used in an update statement to populate that column.
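As a quick illustration of the join-back query on the question's sample data, here is a sketch using Python's sqlite3 module. The table name students is a placeholder for the unnamed table, and the ORDER BY was added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (StudentId INTEGER, StudentName TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [
        (1, "Jack"), (1, "John"), (1, "Bill"),
        (2, "Amanda"), (2, "Molly"),
        (3, "Ron"), (4, "Matt"), (5, "James"),
        (6, "Kathy"), (6, "Will"),
    ],
)

# Count rows per StudentId in a derived table, then join the count back
# onto every student row.
result = conn.execute(
    """
    SELECT aa.StudentId, aa.StudentName, bb.DupCount
    FROM students AS aa
    JOIN (SELECT StudentId, COUNT(*) AS DupCount
          FROM students
          GROUP BY StudentId) AS bb
      ON aa.StudentId = bb.StudentId
    ORDER BY aa.StudentId, aa.StudentName
    """
).fetchall()

for row in result:
    print(row)
# (1, 'Bill', 3)
# (1, 'Jack', 3)
# (1, 'John', 3)
# (2, 'Amanda', 2)
# (2, 'Molly', 2)
# (3, 'Ron', 1)
# (4, 'Matt', 1)
# (5, 'James', 1)
# (6, 'Kathy', 2)
# (6, 'Will', 2)
```

Students with a unique ID get a count of 1; add a filter on DupCount > 1 to list only the conflicting IDs.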
This should work:
update mytable
set duplicate_count = (select count(*) from mytable t where t.id = mytable.id)
UPDATE:
As mentioned by @HansUp, adding a new column with the duplicate count probably doesn't make sense, but that really depends on what the OP originally thought of using it for. I'm leaving the answer in case it is of help for someone else.