Remove duplicates using multiple criteria in SQL - sql

I've tried all afternoon to dedup a table that looks like this:
ID1 | ID2 | Date | Time |Status | Price
----+-----+------------+-----------------+--------+-------
01 | A | 01/01/2022 | 10:41:47.000000 | DDD | 55
01 | B | 02/01/2022 | 16:22:31.000000 | DDD | 53
02 | C | 01/01/2022 | 08:54:03.000000 | AAA | 72
02 | D | 03/01/2022 | 11:12:35.000000 | DDD |
03 | E | 01/01/2022 | 17:15:41.000000 | DDD | 67
03 | F | 01/01/2022 | 19:27:22.000000 | DDD | 69
03 | G | 02/01/2022 | 06:45:52.000000 | DDD | 78
Basically, I need to dedup based on two conditions:
Status: where AAA > BBB > CCC > DDD. So, pick the highest one.
When the Status is the same given the same ID1, pick the latest one based on Date and Time.
The final table should look like:
ID1 | ID2 | Date | Time |Status | Price
----+-----+------------+-----------------+--------+-------
01 | B | 02/01/2022 | 16:22:31.000000 | DDD | 53
02 | C | 01/01/2022 | 08:54:03.000000 | AAA | 72
03 | G | 02/01/2022 | 06:45:52.000000 | DDD | 78
Is there a way to do this in Redshift SQL / PostgreSQL?
I tried variations of this, but every time it fails because it demands that I add all columns to the GROUP BY, which defeats the purpose:
select a.id1,
b.id2,
b.date,
b.time,
b.status,
b.price,
case when (status = 'AAA') then 4
when (status = 'BBB') then 3
when (status= 'CCC') then 2
when (status = 'DDD') then 1
when (status = 'EEE') then 0
else null end as row_order
from table1 a
left join table2 b
on a.id1=b.id1
group by id1
having row_order = max(row_order)
and date=max(date)
and time=max(time)
Any help at all is appreciated!

Windowing functions are good at this:
SELECT ID1, ID2, Date, Time, Status, Price
FROM (
SELECT *,
row_number() OVER (PARTITION BY ID1 ORDER BY Status, Date DESC, Time DESC) rn
FROM MyTable
) t
WHERE rn = 1
See it work here:
https://dbfiddle.uk/uAvDz1Qn
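A minimal sketch of the same ROW_NUMBER() idea, run against an in-memory SQLite database from Python so it can be tried without a Redshift or PostgreSQL server. Two things here are assumptions rather than part of the answer above: the dates are stored as ISO text (reading the sample as day-first) so that they sort chronologically, and the status priority is spelled out with a CASE expression instead of relying on AAA..DDD happening to sort alphabetically.

```python
import sqlite3

# Hypothetical in-memory copy of the question's sample table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable (ID1 TEXT, ID2 TEXT, Date TEXT, Time TEXT, Status TEXT, Price INTEGER)"
)
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("01", "A", "2022-01-01", "10:41:47.000000", "DDD", 55),
        ("01", "B", "2022-01-02", "16:22:31.000000", "DDD", 53),
        ("02", "C", "2022-01-01", "08:54:03.000000", "AAA", 72),
        ("02", "D", "2022-01-03", "11:12:35.000000", "DDD", None),
        ("03", "E", "2022-01-01", "17:15:41.000000", "DDD", 67),
        ("03", "F", "2022-01-01", "19:27:22.000000", "DDD", 69),
        ("03", "G", "2022-01-02", "06:45:52.000000", "DDD", 78),
    ],
)

# Number the rows of each ID1 group: best status first, then newest date/time.
DEDUP_SQL = """
SELECT ID1, ID2, Date, Time, Status, Price
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY ID1
               ORDER BY CASE Status
                            WHEN 'AAA' THEN 1
                            WHEN 'BBB' THEN 2
                            WHEN 'CCC' THEN 3
                            WHEN 'DDD' THEN 4
                            ELSE 5
                        END,
                        Date DESC,
                        Time DESC
           ) AS rn
    FROM MyTable
) t
WHERE rn = 1
ORDER BY ID1
"""

deduped = conn.execute(DEDUP_SQL).fetchall()
for row in deduped:
    print(row)
```

The explicit CASE ranking is only needed if the real status codes do not sort in priority order; for the sample values, ORDER BY Status gives the same result.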

You can use ROW_NUMBER() like so:
with cte as (
select a.id1,
b.id2,
b.date,
b.time,
b.status,
b.price,
ROW_NUMBER() OVER (PARTITION BY a.id1 ORDER BY b.status ASC, b.date DESC, b.time DESC) RN
from table1 a
left join table2 b on a.id1=b.id1
)
select * from cte where rn = 1

This is a typical top-1-per-group problem. The canonical solution indeed involves window functions, as demonstrated by Joel Coehoorn and Aaron Dietz.
But Postgres has a specific extension, called DISTINCT ON, which is built exactly for solving top-1-per-group problems. The syntax is neater, and you benefit from built-in optimizations:
select distinct on (id1) t.*
from mytable t
order by id1, status, "Date" desc, "Time" desc
Here is a demo on DB Fiddle based on that of Joel Coehoorn.

Related

Detect changes for each ID

Suppose I have the following data
ID | year_month | Department
1233 | 2020-01-01 | A
1123 | 2020-02-01 | A
1123 | 2020-03-01 | NULL
1123 | 2020-04-01 | B
1123 | 2020-05-01 | B
1123 | 2020-06-01 | B
1123 | 2020-07-01 | NULL
9999 | 2020-01-01 | A
9999 | 2020-02-01 | A
9999 | 2020-03-01 | B
9999 | 2020-04-01 | B
9999 | 2020-05-01 | B
9999 | 2020-06-01 | A
9999 | 2020-07-01 | B
I want to identify the changes in department, including going to NA/NULL. The desired output is:
ID | Change_year_month | Old_Department | New_Department
1123 | 2020-03-01 | A | NULL
1123 | 2020-04-01 | NULL | B
1123 | 2020-07-01 | B | NULL
9999 | 2020-03-01 | A | B
9999 | 2020-06-01 | B | A
Ideas I've already tried to pursue:
with x as(
SELECT T1.ID, T1.Department, MIN(T1.year_month) AS Change_year_month FROM dbo.Source
GROUP BY T1.ID, Department),
y as (
SELECT ID, year_month,
rown = ROW_NUMBER() OVER (PARTITION BY ID ORDER BY year_month) FROM x
)
select y.ID, T2.Department, year_month AS Change_year_month FROM y
right join (SELECT T1.ID,
MAX(Department) as Old_Department,
Min(Department) AS New_Department
FROM dbo.Source
GROUP BY T1.ID HAVING COUNT(DISTINCT(Department)) >= 2) T2 on y.ID = T2.ID
where rown = 1
However, this does not yield the desired result. Whenever a NULL is involved, the query does not see the change. Whenever I change the NULL to something else (like: 'outside the scope'), then the ordering is wrong as the Old_department is never 'outside the scope', but the New_department always is. Also, I feel like the code is inefficient and not durable.
Does anyone have suggestions on how to proceed, or how to construct a durable query?
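Here is a pretty simple method using lag():
select s.id, s.year_month, s.prev_department, s.department
from (select s.*,
lag(year_month) over (partition by id order by year_month) as prev_ym,
lag(year_month) over (partition by id, department order by year_month) as prev_ym_dept,
lag(department) over (partition by id order by year_month) as prev_department
from dbo.source s
) s
where prev_ym is not null and
(prev_ym_dept is null or prev_ym_dept <> prev_ym);
This compares dates rather than department values, so NULL departments need no special handling. The prev_ym is not null condition skips the first row of each id, and the prev_ym_dept is null condition keeps rows where a department appears for the first time within an id (a plain <> would silently drop them, because comparing with NULL is never true).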
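A minimal sketch of the date-comparison idea, run against SQLite from Python with the question's sample rows. The table name source is a stand-in for dbo.Source, and the explicit NULL checks in the WHERE clause are my assumption about the intended behaviour: a row counts as a change when it has a previous row for the id, and that previous row is not the previous row of the same department.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, year_month TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [
        (1233, "2020-01-01", "A"),
        (1123, "2020-02-01", "A"),
        (1123, "2020-03-01", None),
        (1123, "2020-04-01", "B"),
        (1123, "2020-05-01", "B"),
        (1123, "2020-06-01", "B"),
        (1123, "2020-07-01", None),
        (9999, "2020-01-01", "A"),
        (9999, "2020-02-01", "A"),
        (9999, "2020-03-01", "B"),
        (9999, "2020-04-01", "B"),
        (9999, "2020-05-01", "B"),
        (9999, "2020-06-01", "A"),
        (9999, "2020-07-01", "B"),
    ],
)

# prev_ym:      previous month for the id, regardless of department
# prev_ym_dept: previous month for the same id AND the same department
# If the two differ (or the department is new for the id), the department changed.
CHANGES_SQL = """
SELECT id, year_month, prev_department, department
FROM (
    SELECT s.*,
           LAG(year_month) OVER (PARTITION BY id ORDER BY year_month) AS prev_ym,
           LAG(year_month) OVER (PARTITION BY id, department ORDER BY year_month) AS prev_ym_dept,
           LAG(department) OVER (PARTITION BY id ORDER BY year_month) AS prev_department
    FROM source s
) t
WHERE prev_ym IS NOT NULL
  AND (prev_ym_dept IS NULL OR prev_ym_dept <> prev_ym)
ORDER BY id, year_month
"""

changes = conn.execute(CHANGES_SQL).fetchall()
for row in changes:
    print(row)
```

Note that this also reports the 9999 change from A to B in 2020-07-01, which is a genuine change in the sample data even though the question's desired output does not list it.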
Of course, you can use more complicated comparisons:
select s.id, s.year_month, s.prev_department, s.department
from (select s.*,
lag(department) over (partition by id order by year_month) as prev_department,
min(year_month) over (partition by id) as min_year_month
from dbo.source s
) s
where prev_department <> department or
(department is null and
prev_department is not null
) or
(prev_department is null and
department is not null and
year_month <> min_year_month
)
But that is rather tricky to express. And that might even have a mistake in filtering out the first row.

Gather the max of a set of data by 2 columns in SQL?

I am trying to get the latest of a set of columns by a PersonID out of a set of YearIDs.
If I have a table like this:
| DataID | PersonID | YearID | Data1A | Data1B | Data2A | Data2B |
|--------|----------|--------|--------|--------|--------|--------|
| 1 | 888 | A100 | d | 0.00 | a | 1.00 |
| 2 | 888 | A101 | NULL | NULL | b | 2.00 |
| 3 | 888 | A102 | c | 3.00 | NULL | NULL |
| 4 | 333 | A100 | a | 3.40 | e | 4.00 |
| 5 | 333 | A101 | d | 0.00 | NULL | NULL |
| 6 | 333 | A102 | NULL | NULL | NULL | NULL |
How do I get the latest of column sets Data1A, Data1B and Data2A, Data2B sorted by YearID per PersonID?
This is given that Data1A and Data1B are related and Data2A and Data2B are related and can not be separated, and most recent year is A102. DataID is just an incremental PK column.
My resulting table should look like this, with Year being removed as it's no longer necessary. It should ignore NULLs but not 0's:
| DataID | PersonID | Data1A | Data1B | Data2A | Data2B |
|--------|----------|--------|--------|--------|--------|
| 1 | 888 | c | 3.00 | b | 2.00 |
| 2 | 333 | d | 0.00 | e | 4.00 |
This is what I have so far, but I don't know how to take into account the fact that I want the 'max'/latest of a set of Years by PersonID. Right now it gets the max of each column but I want the most recent valid data by latest year, and it also has Data1 and Data2 not being related at all but I need them to be.
SELECT DISTINCT
T1.SID,
GroupedT1.Data1,
GroupedT1.Data2,
FROM #Table1 T1
INNER JOIN
(SELECT SID,
MAX(Data1) AS Data1,
MAX(Data2) AS Data2,
FROM #Table1
GROUP BY PersonID) GroupedT1
ON T1.PersonID = GroupedT1.PersonID
Edit: thanks to Gordon for the previous answer. This is how I tried to solve my revised problem:
With this solution I'm trying to get the latest values for Data1 and Data2, skipping however many NULL values there are, and picking data from any YearID as long as it's the latest non-NULL one. So if Data1A is NULL in year A102, it should pick year A101's Data1A, and if Data2A is NULL for several years, it should pick the latest non-NULL one (in this case, year A100). At the moment it's close, but it only picks by row; it needs to pick by year, per column set, with any number of NULL values.
select t1.PersonID, t1.Data1A, t1.Data1B, t1.Data2A, t1.Data2B
from (select t1.*,
row_number() over (partition by SID order by
(case when Data1A is not null then 1 else 2 end),
(case when Data2A is not null then 1 else 2 end),
YearID desc) as seqnum
from #Table1 t1
) t1
where seqnum = 1
This answers the original question.
I think you want a simple filtering before applying logic such as row_number():
select t1.*
from (select t1.*,
row_number() over (partition by personid order by yearid desc) as seqnum
from #table1 t1
where data1 is not null and data2 is not null
) t1
where seqnum = 1;
EDIT:
To answer the revised question, you need to handle each column separately. You can do this using outer apply:
select p.personid, d1.data1, d2.data2, . . .
from (select distinct personid from #table1) p outer apply
(select top (1) t1.data1
from #table1 t1
where t1.personid = p.personid and t1.data1 is not null
order by t1.yearid desc
) d1 outer apply
(select top (1) t1.data2
from #table1 t1
where t1.personid = p.personid and t1.data2 is not null
order by t1.yearid desc
) d2 . . .
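OUTER APPLY is specific to SQL Server, so as a portable sketch of the same per-column-set idea, here is a ROW_NUMBER() variant run against SQLite from Python. It is a swapped-in technique, not the query above: each column set gets its own "latest non-NULL row per person" subquery, and the two are joined back to the list of persons. The table name Table1 stands in for #Table1, and treating a set as present when its A column is not NULL is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Table1 (DataID INTEGER PRIMARY KEY, PersonID INTEGER, YearID TEXT, "
    "Data1A TEXT, Data1B REAL, Data2A TEXT, Data2B REAL)"
)
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 888, "A100", "d", 0.00, "a", 1.00),
        (2, 888, "A101", None, None, "b", 2.00),
        (3, 888, "A102", "c", 3.00, None, None),
        (4, 333, "A100", "a", 3.40, "e", 4.00),
        (5, 333, "A101", "d", 0.00, None, None),
        (6, 333, "A102", None, None, None, None),
    ],
)

# d1 / d2 rank the non-NULL rows of each column set per person, newest year first.
# Keeping A and B in the same subquery guarantees they come from the same row.
LATEST_SQL = """
WITH d1 AS (
    SELECT PersonID, Data1A, Data1B,
           ROW_NUMBER() OVER (PARTITION BY PersonID ORDER BY YearID DESC) AS rn
    FROM Table1
    WHERE Data1A IS NOT NULL
),
d2 AS (
    SELECT PersonID, Data2A, Data2B,
           ROW_NUMBER() OVER (PARTITION BY PersonID ORDER BY YearID DESC) AS rn
    FROM Table1
    WHERE Data2A IS NOT NULL
)
SELECT p.PersonID, d1.Data1A, d1.Data1B, d2.Data2A, d2.Data2B
FROM (SELECT DISTINCT PersonID FROM Table1) p
LEFT JOIN d1 ON d1.PersonID = p.PersonID AND d1.rn = 1
LEFT JOIN d2 ON d2.PersonID = p.PersonID AND d2.rn = 1
ORDER BY p.PersonID
"""

latest = conn.execute(LATEST_SQL).fetchall()
for row in latest:
    print(row)
```

The LEFT JOINs keep a person in the result even when one of the column sets is NULL in every year, which mirrors what outer apply does.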
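You can use the not exists keyword. The subquery must correlate on PersonID (DataID is the primary key, so matching on it would only ever compare a row with itself):
SELECT DataID, PersonID, Data1, Data2
FROM #Table1 T1
where not exists(select 1 from #Table1 T2
where T1.PersonID = T2.PersonID and T2.YearID > T1.YearID)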

If rows have repeated names only return the row with the repeat

To elaborate, say I have this table:
NAME | ID | EMAIL | TYPE
------+----+-------------+------
Joe | 1 | NULL | 01
Joe | 1 | joe#email | 02
Henry | 2 | NULL | 01
Jane | 3 | jane#email | 01
Jane | 3 | jane#email | 02
Larry | 4 | larry#email | 01
Sue | 5 | NULL | 02
I want to return this:
Joe | 1 | joe#email | 02
Henry | 2 | NULL | 01
Jane | 3 | jane#email | 02
Larry | 4 | larry#email | 01
Sue | 5 | NULL | 02
I've tried Select Distinct but that returns the original table. I have not found anything else that seems to tackle what I'm asking since the rows aren't total repeats, just the first two columns.
Select *
From Table_Name
You seem to want the record from each person with the highest TYPE value. One straightforward approach uses ROW_NUMBER to identify the records you want to retain:
SELECT NAME, ID, EMAIL, TYPE
FROM
(
SELECT NAME, ID, EMAIL, TYPE,
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY TYPE DESC) rn
FROM yourTable
) t
WHERE rn = 1;
I think you can do what you want using aggregation:
select name, id, max(email) as email, max(type) as type
from tablename
group by name, id;
I would use GROUP BY and JOIN
select t1.*
from table_name t1
join (
select id, max(type) max_type
from table_name
group by id
) t2 on t1.id = t2.id and
t1.type = t2.max_type
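A minimal sketch of the GROUP BY and JOIN approach, run against SQLite from Python with the question's sample rows. The table and column names follow the answer above; storing TYPE as text ('01', '02') is an assumption based on how the sample is printed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (name TEXT, id INTEGER, email TEXT, type TEXT)")
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?, ?, ?)",
    [
        ("Joe", 1, None, "01"),
        ("Joe", 1, "joe#email", "02"),
        ("Henry", 2, None, "01"),
        ("Jane", 3, "jane#email", "01"),
        ("Jane", 3, "jane#email", "02"),
        ("Larry", 4, "larry#email", "01"),
        ("Sue", 5, None, "02"),
    ],
)

# Find the highest type per id, then join back to fetch the whole matching row.
TOP_TYPE_SQL = """
SELECT t1.name, t1.id, t1.email, t1.type
FROM table_name t1
JOIN (
    SELECT id, MAX(type) AS max_type
    FROM table_name
    GROUP BY id
) t2 ON t1.id = t2.id AND t1.type = t2.max_type
ORDER BY t1.id
"""

top_rows = conn.execute(TOP_TYPE_SQL).fetchall()
for row in top_rows:
    print(row)
```

Because the join returns the original row, email and type always come from the same record. If two rows of one id shared the same highest type, both would be returned.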

Have a column with the lowest possible next value (self-joining a table)

I am looking for a way to get the lowest next value in a sequence. Basically, I have a dataset of Dates and I want it to return the next day unless it's the latest date in the database, then I want it to return this instead.
My current query looks like this and almost works - of course up to the point where I want the latest possible value instead of the next one:
SELECT
a.date,
a.key,
a.description,
b.date NextDate
FROM
my_table a
CROSS APPLY (SELECT TOP 1
b.date
FROM
my_table b
WHERE
a.key = b.key AND
a.date < b.date) b
Sample data:
+----------+-----+-------------+
| date | key | description |
+----------+-----+-------------+
| 20170101 | atx | xxx |
| 20161228 | hfn | xxx |
| 20161222 | ktn | xxx |
| 20161214 | yqe | xxx |
| 20161204 | olp | xxx |
| 20161122 | bux | xxx |
+----------+-----+-------------+
What the result should look like:
+----------+-----+-------------+----------+
| date | key | description | NextDate |
+----------+-----+-------------+----------+
| 20170101 | atx | xxx | 20170101 |
| 20161228 | hfn | xxx | 20170101 |
| 20161222 | ktn | xxx | 20161228 |
| 20161214 | yqe | xxx | 20161222 |
| 20161204 | olp | xxx | 20161214 |
| 20161122 | bux | xxx | 20161204 |
+----------+-----+-------------+----------+
You can use a case expression to do this.
SELECT
a.date,
a.key,
a.description,
case when a.date = max(a.date) over() then a.date
else (select min(b.date) from my_table b where a.date < b.date) end as NextDate
FROM
my_table a
You can use lag on the date column, ordered by date descending, with the row's own date as the default for the latest row. Apply it to the table directly; wrapping the original CROSS APPLY query would drop every row that has no later date for the same key:
select t.*,
lag(date, 1, date) over (order by date desc) nextdate
from my_table t
I believe you want:
select a.*,
coalesce(lead(date) over (order by date),
max(date) over ()
) as NextDate
from my_table a;
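A minimal sketch of the lead() plus coalesce() idea, run against SQLite from Python with the question's sample rows. Storing the dates as 'YYYYMMDD' text and quoting the key column are assumptions made for the SQLite setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE my_table (date TEXT, "key" TEXT, description TEXT)')
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [
        ("20170101", "atx", "xxx"),
        ("20161228", "hfn", "xxx"),
        ("20161222", "ktn", "xxx"),
        ("20161214", "yqe", "xxx"),
        ("20161204", "olp", "xxx"),
        ("20161122", "bux", "xxx"),
    ],
)

# LEAD() returns the next date in ascending order; it is NULL only for the
# latest row, where COALESCE falls back to the overall maximum date.
NEXT_DATE_SQL = """
SELECT date, "key", description,
       COALESCE(LEAD(date) OVER (ORDER BY date),
                MAX(date) OVER ()) AS NextDate
FROM my_table
ORDER BY date DESC
"""

next_dates = conn.execute(NEXT_DATE_SQL).fetchall()
for row in next_dates:
    print(row)
```

Unlike adding one day to each date, this still works when the table has gaps between dates.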
If your table never has a missing date the following would work.
SELECT CONVERT(DATE,CONVERT(CHAR(10),a.date,120))
,a.key
,a.description
,CASE
WHEN (SELECT MAX(m.date) FROM my_table m) <> a.date
THEN DATEADD(DAY,1,CONVERT(DATE,CONVERT(CHAR(10),a.date,120)))--This could be a select statement
ELSE CONVERT(DATE,CONVERT(CHAR(10),a.date,120))
END AS NextDate
FROM my_table a
ORDER BY a.date DESC
Alternatively if there are missing dates then you could use a SQL statement in the CASE to get the next highest date.
SELECT MIN(Date) FROM my_table WHERE Date > a.Date
Not the most performant code, but seeing as we are talking about date tables it would work. I'm sure a CTE could be used to do this as well, if you need a bit more performance.
Using SQL 2008 without LEAD & LAG etc...
Try this
;with cte as
(
SELECT [DATE] = Cast([date] AS DATE),
[key],
[description],
Lag([date])OVER(ORDER BY Cast([date] AS DATE) DESC) AS prev_date
FROM ( VALUES ('20170101','atx','xxx'),
('20161228','hfn','xxx'),
('20161222','ktn','xxx'),
('20161214','yqe','xxx'),
('20161204','olp','xxx'),
('20161122','bux','xxx')) tc ([date], [key], [description])
)
SELECT [date],
[Key],
[Description],
NextDate = Iif([date] < prev_date, prev_date, [date])
FROM cte
Result :
+------------+-----+-------------+------------+
| date | Key | Description | NextDate |
+------------+-----+-------------+------------+
| 2017-01-01 | atx | xxx | 2017-01-01 |
| 2016-12-28 | hfn | xxx | 2017-01-01 |
| 2016-12-22 | ktn | xxx | 2016-12-28 |
| 2016-12-14 | yqe | xxx | 2016-12-22 |
| 2016-12-04 | olp | xxx | 2016-12-14 |
| 2016-11-22 | bux | xxx | 2016-12-04 |
+------------+-----+-------------+------------+

Getting the most recent record of a group

I'm trying to find the most recent record of a group after doing an inner join.
Say I have the following two tables:
dateCreated | id
2011-12-27 | 1
2011-12-15 | 2
2011-12-17 | 6
2011-12-26 | 15
2011-12-15 | 18
2011-12-07 | 22
2011-12-09 | 23
2011-12-27 | 24
code | id
EFG | 1
ABC | 2
BCD | 6
BCD | 15
ABC | 18
BCD | 22
EFG | 23
EFG | 24
I want to display only the most recent of the groupings:
So the result would be:
dateCreated | code
2011-12-27 | EFG
2011-12-15 | ABC
2011-12-26 | BCD
I know this can be achieved using the max and group by functions, but I can't seem to get the desired result.
I think this should get you there:
select max(a.dateCreated) as dateCreated
, b.code
from table1 a
join table2 b on a.id = b.id
group by b.code
Assuming your tables are called a and b, try this:
select max(a.dateCreated) as dateCreated, b.code
from a join b on a.id = b.id
group by b.code
You can use analytical functions for this. This way, you still get exactly one result for every code, even if two records share the same latest dateCreated (this may or may not be what you actually want as a result):
SELECT Code, dateCreated
FROM ( SELECT T2.Code, T1.dateCreated, ROW_NUMBER() OVER(PARTITION BY T2.Code ORDER BY T1.dateCreated DESC) Corr
FROM Table1 T1
INNER JOIN Table2 T2
ON T1.id = T2.id) A
WHERE Corr = 1
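A minimal sketch of the ROW_NUMBER() approach, run against SQLite from Python with the question's two sample tables. The second query, which swaps in RANK(), is my addition to illustrate the tie behaviour mentioned above; it is not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (dateCreated TEXT, id INTEGER)")
conn.execute("CREATE TABLE Table2 (code TEXT, id INTEGER)")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?)",
    [
        ("2011-12-27", 1), ("2011-12-15", 2), ("2011-12-17", 6), ("2011-12-26", 15),
        ("2011-12-15", 18), ("2011-12-07", 22), ("2011-12-09", 23), ("2011-12-27", 24),
    ],
)
conn.executemany(
    "INSERT INTO Table2 VALUES (?, ?)",
    [
        ("EFG", 1), ("ABC", 2), ("BCD", 6), ("BCD", 15),
        ("ABC", 18), ("BCD", 22), ("EFG", 23), ("EFG", 24),
    ],
)

# Template: {fn} is replaced by ROW_NUMBER or RANK to compare their tie handling.
LATEST_SQL = """
SELECT Code, dateCreated
FROM (
    SELECT T2.code AS Code, T1.dateCreated,
           {fn}() OVER (PARTITION BY T2.code ORDER BY T1.dateCreated DESC) AS Corr
    FROM Table1 T1
    INNER JOIN Table2 T2 ON T1.id = T2.id
) A
WHERE Corr = 1
ORDER BY Code
"""

# ROW_NUMBER gives exactly one row per code, even when the latest date is tied.
latest_per_code = conn.execute(LATEST_SQL.format(fn="ROW_NUMBER")).fetchall()
# RANK keeps every row that shares the latest date within a code.
latest_with_ties = conn.execute(LATEST_SQL.format(fn="RANK")).fetchall()

print(latest_per_code)
print(len(latest_with_ties))
```

In the sample data, ABC and EFG each have two records on their latest date, so the RANK() variant returns more rows than there are codes.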