Removing duplicates in SQL based on column

I'm trying to remove orphaned rows from a database. Let's say I have this table t:
session | name | record_date | uniqueid
1 | a | 2019-04-03 | 1x
2 | a | 2019-09-19 | 1x
3 | b | 2019-08-09 | zr
4 | c | 2019-09-19 | ww
5 | d | 2019-09-03 | yy
6 | d | 2019-09-25 | rr
7 | e | 2019-09-28 | dd
8 | e | 2019-04-19 |
I'm trying to remove the older of any duplicate entries, based on record_date, while evaluating both name and uniqueid to ensure they're actual duplicates (not a duplicate just based on name). I can't simply evaluate uniqueid alone because some rows have a null uniqueid. So in my example table, I'd want to remove the first and last rows.

You can use delete:
delete from t
where t.record_date < (select max(t2.record_date)
from t t2
where t2.name = t.name and
t2.uniqueid = t.uniqueid
);
Note: The above keeps only the most recent record for each name/uniqueid pair. Because = never matches NULL, rows with a null uniqueid are never deleted by it.
If you want the unique rows in a query, I would recommend:
select t.*
from t
where t.record_date = (select max(t2.record_date)
from t t2
where t2.name = t.name and
t2.uniqueid = t.uniqueid
);
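As a rough check of the delete above against the question's sample table, here is a minimal sketch using Python's built-in sqlite3 module (SQLite is an assumption; the question does not name a database). It also tries a second variant that treats a null uniqueid as matching any uniqueid with the same name. That variant is only one possible reading of "remove the first and last rows", not something the answer states.

```python
import sqlite3

# Sample rows from the question; None stands for the null uniqueid.
ROWS = [
    (1, "a", "2019-04-03", "1x"),
    (2, "a", "2019-09-19", "1x"),
    (3, "b", "2019-08-09", "zr"),
    (4, "c", "2019-09-19", "ww"),
    (5, "d", "2019-09-03", "yy"),
    (6, "d", "2019-09-25", "rr"),
    (7, "e", "2019-09-28", "dd"),
    (8, "e", "2019-04-19", None),
]

# The delete from the answer: strict equality on uniqueid.
STRICT = """
delete from t
where t.record_date < (select max(t2.record_date)
                       from t t2
                       where t2.name = t.name and
                             t2.uniqueid = t.uniqueid)
"""

# Hypothetical variant: a null uniqueid on either side counts as a match.
NULL_AS_WILDCARD = """
delete from t
where t.record_date < (select max(t2.record_date)
                       from t t2
                       where t2.name = t.name and
                             (t2.uniqueid = t.uniqueid or
                              t2.uniqueid is null or
                              t.uniqueid is null))
"""


def remaining_sessions(delete_sql):
    """Load the sample rows into a fresh table, run the delete, return the surviving sessions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table t (session integer, name text, record_date text, uniqueid text)"
    )
    conn.executemany("insert into t values (?, ?, ?, ?)", ROWS)
    conn.execute(delete_sql)
    sessions = [row[0] for row in conn.execute("select session from t order by session")]
    conn.close()
    return sessions


strict_result = remaining_sessions(STRICT)
wildcard_result = remaining_sessions(NULL_AS_WILDCARD)
print(strict_result)    # session 8 survives: its uniqueid is null
print(wildcard_result)  # sessions 1 and 8 are both removed
```

The dates are stored as ISO-8601 text, so plain string comparison orders them correctly; with a real date column the same queries should apply unchanged.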

I think you want max(record_date) and don't want to consider rows with a null uniqueid, so filter the null values out with a where clause:
select name, max(record_date), uniqueid from table_name
where uniqueid is not null
group by name, uniqueid

You can try the query below, which uses aggregation and group by, with a where clause to filter out null uniqueid values:
select name,uniqueid,max(record_date)
from tablename
where uniqueid is not null
group by name,uniqueid

Related

Find number of rows identical on some columns, but different on another column

Say I have the following table:
CREATE TABLE data (
PROJECT_ID VARCHAR,
TASK_ID VARCHAR,
REF_ID VARCHAR,
REF_VALUE VARCHAR
);
I want to identify rows where
PROJECT_ID, REF_ID, REF_VALUE are the same
but TASK_ID are different.
The desired output is a list of TASK_ID_1, TASK_ID_2 and COUNT(*) of such conflicts. So, for example,
DATA
+------------+---------+--------+-----------+
| PROJECT_ID | TASK_ID | REF_ID | REF_VALUE |
+------------+---------+--------+-----------+
| 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 2 |
| 1 | 2 | 1 | 1 |
| 1 | 2 | 1 | 2 |
+------------+---------+--------+-----------+
OUTPUT
+-----------+-----------+----------+
| TASK_ID_1 | TASK_ID_2 | COUNT(*) |
+-----------+-----------+----------+
| 1 | 2 | 2 |
| 2 | 1 | 2 |
+-----------+-----------+----------+
would mean that there are two entries with TASK_ID == 1 and two entries with TASK_ID == 2 that share the same values for the other three columns. The inherent symmetry in the output is fine.
How would I go about finding this information? I've tried joining the table onto itself and grouping, but this turned up more results for a single task than the table had rows altogether, so it's clearly wrong.
The database used is PostgreSQL, though a solution that applies to most common SQL systems would be preferable.

You want a self join and aggregation:
select d1.task_id as task_id_1, d2.task_id as task_id_2, count(*)
from data d1 join
data d2
on d1.project_id = d2.project_id and
d1.ref_id = d2.ref_id and
d1.ref_value = d2.ref_value and
d1.task_id <> d2.task_id
group by d1.task_id, d2.task_id;
Notes:
Add the condition d1.task_id < d2.task_id if you want each pair to occur only once in the result set.
This does not handle NULL values, although that is easy enough to handle. Use is not distinct from instead of =.
You can also simplify this a bit with the using clause:
select d1.task_id as task_id_1, d2.task_id as task_id_2, count(*)
from data d1 join
data d2
using (project_id, ref_id, ref_value)
where d1.task_id <> d2.task_id
group by d1.task_id, d2.task_id;
You can get an idea of how many rows might be returned by using:
select d.project_id, d.ref_id, d.ref_value, count(distinct d.task_id), count(*)
from data d
group by d.project_id, d.ref_id, d.ref_value;
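To see the self join reproduce the desired output from the question, here is a small sketch that runs it on the four sample rows. It uses Python's sqlite3 module rather than PostgreSQL, which is an assumption made only to keep the example self-contained; the join itself is standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table data (project_id text, task_id text, ref_id text, ref_value text)"
)
conn.executemany(
    "insert into data values (?, ?, ?, ?)",
    [("1", "1", "1", "1"), ("1", "1", "1", "2"), ("1", "2", "1", "1"), ("1", "2", "1", "2")],
)

# {op} is "<>" for the symmetric result or "<" to list each pair once.
PAIR_SQL = """
select d1.task_id as task_id_1, d2.task_id as task_id_2, count(*)
from data d1 join
     data d2
     on d1.project_id = d2.project_id and
        d1.ref_id = d2.ref_id and
        d1.ref_value = d2.ref_value and
        d1.task_id {op} d2.task_id
group by d1.task_id, d2.task_id
order by d1.task_id, d2.task_id
"""

symmetric_pairs = conn.execute(PAIR_SQL.format(op="<>")).fetchall()
unique_pairs = conn.execute(PAIR_SQL.format(op="<")).fetchall()
print(symmetric_pairs)  # both (1, 2) and (2, 1)
print(unique_pairs)     # only (1, 2)
```

Each count is the number of matching row pairs, so it grows multiplicatively if a task repeats the same (project_id, ref_id, ref_value) combination.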

This is how I understand your question. It assumes there are only two tasks for the same combination.
SQL DEMO
SELECT "PROJECT_ID", "REF_ID", "REF_VALUE",
MIN("TASK_ID") as TASK_ID_1,
MAX("TASK_ID") as TASK_ID_2,
COUNT(*) as cnt
FROM Table1
GROUP BY "PROJECT_ID", "REF_ID", "REF_VALUE"
HAVING MIN("TASK_ID") != MAX("TASK_ID")
-- COUNT(DISTINCT "TASK_ID") > 1 should also work
OUTPUT
I added more columns to make clear which elements are the same:
| PROJECT_ID | REF_ID | REF_VALUE | task_id_1 | task_id_2 | cnt |
|------------|--------|-----------|-----------|-----------|-----|
| 1 | 1 | 2 | 1 | 2 | 2 |
| 1 | 1 | 1 | 1 | 2 | 2 |

Filter query with a GROUP BY based on column not in GROUP BY statement

Given the following table structure and sample data:
+------------+------+----------+
| EmployeeID | Name | WorkWeek |
+------------+------+----------+
| 1 | A | 1 |
| 2 | B | 1 |
| 2 | B | 2 |
| 3 | C | 1 |
| 3 | C | 2 |
| 4 | D | 2 |
+------------+------+----------+
I am looking to select all employees that only worked week 1 (so in this example, only employeeid = 1 would be returned). I am able to get the data with the following query:
SELECT EmployeeId, Name
FROM SomeTable
GROUP BY EmployeeId, Name
HAVING SUM ( WorkWeek ) = 1;
To me, the HAVING SUM( WorkWeek ) = 1 is a hack and this should be handled with some form of a GROUP BY and COUNT but I cannot wrap my head around how that query would be structured.
Any help would be useful and enlightening.

HAVING SUM( WorkWeek ) = 1 may work for week 1 or 2, but will fail for week 3 (since 1+2 = 3, an employee who worked weeks 1 and 2 would also match).
Use NOT EXISTS operator with a subquery instead:
SELECT EmployeeId, Name
FROM SomeTable t1
WHERE NOT EXISTS (
SELECT * FROM SomeTable t2
WHERE t1.EmployeeId = t2.EmployeeId
AND t2.WorkWeek <> 1
)
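The difference between the two approaches shows up on the question's own sample data. The sketch below (Python's sqlite3 module, chosen here only so it runs without a server) parameterizes the week; the function names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table SomeTable (EmployeeId integer, Name text, WorkWeek integer)")
conn.executemany(
    "insert into SomeTable values (?, ?, ?)",
    [(1, "A", 1), (2, "B", 1), (2, "B", 2), (3, "C", 1), (3, "C", 2), (4, "D", 2)],
)


def only_week_by_sum(week):
    """The SUM-based filter from the question, generalized to any week."""
    sql = """
        SELECT EmployeeId
        FROM SomeTable
        GROUP BY EmployeeId, Name
        HAVING SUM(WorkWeek) = ?
        ORDER BY EmployeeId
    """
    return [row[0] for row in conn.execute(sql, (week,))]


def only_week_by_not_exists(week):
    """The NOT EXISTS filter: no row for this employee in any other week."""
    sql = """
        SELECT DISTINCT EmployeeId
        FROM SomeTable t1
        WHERE NOT EXISTS (
            SELECT * FROM SomeTable t2
            WHERE t1.EmployeeId = t2.EmployeeId
              AND t2.WorkWeek <> ?
        )
        ORDER BY EmployeeId
    """
    return [row[0] for row in conn.execute(sql, (week,))]


print(only_week_by_sum(1), only_week_by_not_exists(1))  # both find employee 1
print(only_week_by_sum(3), only_week_by_not_exists(3))  # SUM wrongly matches 2 and 3
```

Nobody in the sample worked week 3, yet the SUM version returns employees 2 and 3 because their weeks 1 and 2 add up to 3. The DISTINCT is an addition to guard against an employee having several rows for the same week.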

Actually, that's exactly what the HAVING clause is for: filtering records according to aggregated values.
From w3schools sql tutorial:
The HAVING clause was added to SQL because the WHERE keyword could not be used with aggregate functions.

sql query to find unique records

I am new to SQL and need your help to achieve the result below. I have tried using group by and count, but I am getting all the rows in the unique group, including the duplicated ones.
Below is my source data.
CDR_ID,TelephoneNo,Call_ID,call_Duration,Call_Plan
543,xxx-23,12,12,500
543,xxx-23,12,12,501
543,xxx-23,12,12,510
643,xxx-33,11,17,700
343,xxx-33,11,17,700
766,xxx-74,32,1,300
766,xxx-74,32,1,300
877,xxx-32,12,2,300
877,xxx-32,12,2,300
877,xxx-32,12,2,301
Please note: the source has multiple copies of each unique record, so when I do the count, the unique set does not appear with count = 1.
Example: the data below has 60 records in the source for each combination
877,xxx-32,12,2,300 -- 60 records
877,xxx-32,12,2,301 -- 60 records
I am trying to get the unique records, but the duplicate records are also coming through.
Below are the rows that should come up in the unique group. There can be multiple call_Plans for the same combination of CDR_ID, TelephoneNo, Call_ID, call_Duration. I want to read only the records for which there is exactly one call plan for each unique combination of CDR_ID, TelephoneNo, Call_ID, call_Duration:
CDR_ID,TelephoneNo,Call_ID,call_Duration,Call_Plan
643,xxx-33,11,17,700
343,xxx-33,11,17,700
766,xxx-74,32,1,300
Please advise on this.
Thanks and Regards

To do more complex groupings you could also use a Common Table Expression/Derived Table along with windowed functions:
declare @t table(CDR_ID int,TelephoneNo nvarchar(20),Call_ID int,call_Duration int,Call_Plan int);
insert into @t values (543,'xxx-23',12,12,500),(543,'xxx-23',12,12,501),(543,'xxx-23',12,12,510),(643,'xxx-33',11,17,700),(343,'xxx-33',11,17,700),(766,'xxx-74',32,1,300),(766,'xxx-74',32,1,300),(877,'xxx-32',12,2,300),(877,'xxx-32',12,2,300),(877,'xxx-32',12,2,301);
with cte as
(
select CDR_ID
,TelephoneNo
,Call_ID
,call_Duration
,Call_Plan
,count(*) over (partition by CDR_ID,TelephoneNo,Call_ID,call_Duration) as c
from (select distinct * from @t) a
)
select *
from cte
where c = 1;
Output:
+--------+-------------+---------+---------------+-----------+---+
| CDR_ID | TelephoneNo | Call_ID | call_Duration | Call_Plan | c |
+--------+-------------+---------+---------------+-----------+---+
| 343 | xxx-33 | 11 | 17 | 700 | 1 |
| 643 | xxx-33 | 11 | 17 | 700 | 1 |
| 766 | xxx-74 | 32 | 1 | 300 | 1 |
+--------+-------------+---------+---------------+-----------+---+

Using not exists():
select distinct *
from t
where not exists (
select 1
from t as i
where i.cdr_id = t.cdr_id
and i.telephoneno = t.telephoneno
and i.call_id = t.call_id
and i.call_duration = t.call_duration
and i.call_plan <> t.call_plan
)
rextester demo: http://rextester.com/RRNNE20636
returns:
+--------+-------------+---------+---------------+-----------+-----+
| cdr_id | TelephoneNo | Call_id | call_Duration | Call_Plan | cnt |
+--------+-------------+---------+---------------+-----------+-----+
| 343 | xxx-33 | 11 | 17 | 700 | 1 |
| 643 | xxx-33 | 11 | 17 | 700 | 1 |
| 766 | xxx-74 | 32 | 1 | 300 | 1 |
+--------+-------------+---------+---------------+-----------+-----+

Basically you should try this:
SELECT DISTINCT A.CDR_ID, A.TelephoneNo, A.Call_ID, A.call_Duration, A.Call_Plan
FROM YOUR_TABLE A
INNER JOIN (SELECT CDR_ID,TelephoneNo,Call_ID,call_Duration
FROM YOUR_TABLE
GROUP BY CDR_ID,TelephoneNo,Call_ID,call_Duration
HAVING COUNT(DISTINCT Call_Plan)=1
) B ON A.CDR_ID= B.CDR_ID AND A.TelephoneNo=B.TelephoneNo AND A.Call_ID=B.Call_ID AND A.call_Duration=B.call_Duration
You can also write a shorter query using the window function COUNT(*) OVER ...

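The query below returns the rows along with their count:
SELECT CDR_ID,TelephoneNo,Call_ID,call_Duration,Call_Plan, COUNT(*)
FROM TABLE_NAME GROUP BY CDR_ID,TelephoneNo,Call_ID,call_Duration,Call_Plan
HAVING COUNT(*) < 2;
You can remove the count from the select list if it is not required. Note that because Call_Plan is part of the group by, this returns rows that occur exactly once, which is not the same as combinations that have a single call plan.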

Select CDR_ID, TelephoneNo, Call_ID, call_Duration, Call_Plan, count(CDR_ID)
from table_name
group by CDR_ID, TelephoneNo, Call_ID, call_Duration, Call_Plan
having count(CDR_ID) = 1
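Since the source contains repeated identical rows, one way to express "exactly one call plan per combination" is to count distinct call plans per group. The sketch below runs that idea on the question's ten sample rows with Python's sqlite3 module; the table name cdr is made up for illustration, and this is a variation on the answers above rather than any one of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "cdr" is a hypothetical table name; the question does not give one.
conn.execute(
    "create table cdr (CDR_ID integer, TelephoneNo text, Call_ID integer, "
    "call_Duration integer, Call_Plan integer)"
)
conn.executemany(
    "insert into cdr values (?, ?, ?, ?, ?)",
    [
        (543, "xxx-23", 12, 12, 500),
        (543, "xxx-23", 12, 12, 501),
        (543, "xxx-23", 12, 12, 510),
        (643, "xxx-33", 11, 17, 700),
        (343, "xxx-33", 11, 17, 700),
        (766, "xxx-74", 32, 1, 300),
        (766, "xxx-74", 32, 1, 300),
        (877, "xxx-32", 12, 2, 300),
        (877, "xxx-32", 12, 2, 300),
        (877, "xxx-32", 12, 2, 301),
    ],
)

# Keep a combination only if all of its rows share one Call_Plan.
# MIN(Call_Plan) simply reports that single plan.
single_plan_rows = conn.execute(
    """
    select CDR_ID, TelephoneNo, Call_ID, call_Duration, min(Call_Plan) as Call_Plan
    from cdr
    group by CDR_ID, TelephoneNo, Call_ID, call_Duration
    having count(distinct Call_Plan) = 1
    order by CDR_ID
    """
).fetchall()

for row in single_plan_rows:
    print(row)
```

Because the grouping collapses repeated rows, the result is the same whether a record appears once or 60 times in the source.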

SQL DELETE group of records based on opposite group being empty

In table T, I'm trying to delete all records in a group having the same value of A, but only if all members of this group have B set to 'x'.
Given the Table T:
+-------+--------+
| A | B |
+-------+--------+
| 2 | '' |
| 2 | 'x' |
| 2 | '' |
| 8 | 'x' |
| 8 | 'x' |
| 15 | '' |
| 15 | '' |
+-------+--------+
The two records with A == 8 have to be deleted because both of them have B == 'x'. The group with A == 2 has mixed values of B, so it stays. The group with A == 15 doesn't have all of its B values equal to 'x', so it also stays.
Is this possible to do by one query?
If not, any other way that is fast enough for a table with a lot of records?

You can try this query, which assumes B is numeric with 1 marking the flagged rows:
delete from T
where A in (
select A
from T
group by A
having sum(B) = count(*)
)
If column B can contain values other than 0 and 1, you can add additional conditions:
having sum(B) = count(*) and min(b)=1 and max(b)=1
If you can't use numeric values, you can just use min/max, like
having min(b)='x' and max(b)='x'

Try this. Group by and Having with some aggregate should work
DELETE FROM tablename
WHERE a IN(SELECT a
FROM tablename
GROUP BY a
HAVING count(case when b='x' then 1 end) = Count(b)
)
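Here is a small sketch of the min/max form of the delete applied to the sample table from the question, using Python's sqlite3 module (an assumption; the question does not name a database).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table T (A integer, B text)")
conn.executemany(
    "insert into T values (?, ?)",
    [(2, ""), (2, "x"), (2, ""), (8, "x"), (8, "x"), (15, ""), (15, "")],
)

# A group qualifies only when its smallest and largest B are both 'x',
# i.e. every non-null B in the group is 'x'.
conn.execute(
    """
    delete from T
    where A in (
        select A
        from T
        group by A
        having min(B) = 'x' and max(B) = 'x'
    )
    """
)

remaining = conn.execute("select A, B from T order by rowid").fetchall()
print(remaining)  # the A = 8 group is gone; A = 2 and A = 15 remain
```

One caveat: min and max ignore NULL, so a group containing 'x' and NULL would still be deleted. If B can be NULL, comparing count(case when B = 'x' then 1 end) with count(*) avoids that.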

microsoft sql server - calculate return between every row and the last row

I have a table like the following:
+-------+--------------+
| Value | Date |
+-------+--------------+
| 14 | 10/11/2010 |
| 12 | 10/12/2010 |
| 12 | 10/13/2010 |
| 10 | 10/14/2010 |
| 8 | 10/15/2010 |
| 6 | 10/16/2010 |
| 4 | 10/17/2010 |
| 2 | 10/18/2010 |
+-------+--------------+
I would like to calculate the return (the quotient) between every row and the last row (the one with the latest date). E.g., for the row with date "10/16/2010", the result should be 6/2=3.
Hence, the resulting table should be
+-------+--------------+
| result| Date |
+-------+--------------+
| 7 | 10/11/2010 |
| 6 | 10/12/2010 |
| 6 | 10/13/2010 |
| 5 | 10/14/2010 |
| 4 | 10/15/2010 |
| 3 | 10/16/2010 |
| 2 | 10/17/2010 |
| 1 | 10/18/2010 |
+-------+--------------+
Is it possible to do this? Thank you!

You can get the value you want to divide by. Since that's always going to be a single row, you can just use a cross join to join to that and perform your division. SQL Fiddle
with maxdate as
(select max([Date]) as maxdate from table1),
divby as
(select
value as divby
from
table1
inner join maxdate md
on md.maxdate = table1.[date])
select
value / divby
,[date]
from
table1
cross join divby
To break it down a bit, the first CTE (cleverly named maxdate) gets the maximum date for the whole table. The second CTE (divby) gets the value (that you will be dividing by) for that max date. As long as you only get one row back from that, you can safely use a cross join, resulting in each row in your table being divided by that one value.

Another possible solution is to JOIN the table to itself.
SQL Fiddle Example
select (t1b.value / t1a.value) as result,
t1b.date from table1 t1a
join table1 t1b on t1a.date = (select max(date) from table1)

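Thanks for the fiddle, Andrew! This can be accomplished like this as well on SQL Server 2008 and above (fiddle: http://sqlfiddle.com/#!3/ecda1/11). Note that it relies on the latest row holding the smallest value, as it does in the sample data: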
SELECT [Value] / MIN([Value]) OVER () AS result,
[Date]
FROM Table1
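The cross-join idea can be checked against the expected results with the sketch below. It runs on SQLite through Python's sqlite3 module instead of SQL Server, stores the dates as ISO-8601 text so they sort correctly, and picks the latest row with order by and limit; all three are assumptions made only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table table1 (value integer, date text)")
conn.executemany(
    "insert into table1 values (?, ?)",
    [
        (14, "2010-10-11"),
        (12, "2010-10-12"),
        (12, "2010-10-13"),
        (10, "2010-10-14"),
        (8, "2010-10-15"),
        (6, "2010-10-16"),
        (4, "2010-10-17"),
        (2, "2010-10-18"),
    ],
)

# The derived table "latest" holds exactly one row: the value on the
# most recent date. Cross joining it divides every row by that value.
rows = conn.execute(
    """
    select t.value / latest.value as result, t.date
    from table1 t
    cross join (select value from table1 order by date desc limit 1) as latest
    order by t.date
    """
).fetchall()

results = [row[0] for row in rows]
print(results)
```

Both columns are integers, so the division is integer division, which is exact for the sample data. For values that do not divide evenly, multiplying one side by 1.0 gives a fractional result.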
