Detect changes for each ID - sql

Suppose I have the following data
ID | year_month | Department
1233 | 2020-01-01 | A
1123 | 2020-02-01 | A
1123 | 2020-03-01 | NULL
1123 | 2020-04-01 | B
1123 | 2020-05-01 | B
1123 | 2020-06-01 | B
1123 | 2020-07-01 | NULL
9999 | 2020-01-01 | A
9999 | 2020-02-01 | A
9999 | 2020-03-01 | B
9999 | 2020-04-01 | B
9999 | 2020-05-01 | B
9999 | 2020-06-01 | A
9999 | 2020-07-01 | B
I want to identify the changes in department, including going to NA/NULL. The desired output is:
ID | Change_year_month | Old_Department | New_Department
1123 | 2020-03-01 | A | NULL
1123 | 2020-04-01 | NULL | B
1123 | 2020-07-01 | B | NULL
9999 | 2020-03-01 | A | B
9999 | 2020-06-01 | B | A
Ideas I've already tried to pursue:
with x as(
SELECT T1.ID, T1.Department, MIN(T1.year_month) AS Change_year_month FROM dbo.Source
GROUP BY T1.ID, Department),
y as (
SELECT ID, year_month,
rown = ROW_NUMBER() OVER (PARTITION BY ID ORDER BY year_month) FROM x
)
select y.ID, T2.Department, year_month AS Change_year_month FROM y
right join (SELECT T1.ID,
MAX(Department) as Old_Department,
Min(Department) AS New_Department
FROM dbo.Source
GROUP BY T1.ID HAVING COUNT(DISTINCT(Department)) >= 2) T2 on y.ID = T2.ID
where rown = 1
However, this does not yield the desired result. Whenever a NULL is involved, the query does not see the change. Whenever I change the NULL to something else (like: 'outside the scope'), then the ordering is wrong as the Old_department is never 'outside the scope', but the New_department always is. Also, I feel like the code is inefficient and not durable.
Does anyone have suggestions on how to proceed, or how to construct a durable query?

Here is a pretty simple method using lag():
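select s.id, s.year_month, s.prev_department, s.department
from (select s.*,
lag(year_month) over (partition by id order by year_month) as prev_ym,
lag(year_month) over (partition by id, department order by year_month) as prev_ym_dept,
lag(department) over (partition by id order by year_month) as prev_department
from dbo.source s
) s
where prev_ym is not null and
(prev_ym_dept is null or prev_ym_dept <> prev_ym);
This compares dates rather than department values, so NULL departments need no special handling. A row counts as a change when the previous row for the ID is not also the previous row for the same ID and department. The prev_ym_dept is null test covers the first time an ID appears in a department, and prev_ym is not null skips the first row of each ID.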
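To make the date comparison concrete, here is a minimal sketch that runs the same idea against the question's rows for ID 1123. It uses Python's built-in sqlite3 module purely as a stand-in engine (an assumption, since the question targets SQL Server), and the table name source is hypothetical. The filter treats a department's first appearance for an ID as a change and skips the first row of each ID.

```python
import sqlite3

# Sample rows from the question; None stands for a NULL department.
rows = [
    (1123, "2020-02-01", "A"), (1123, "2020-03-01", None),
    (1123, "2020-04-01", "B"), (1123, "2020-05-01", "B"),
    (1123, "2020-06-01", "B"), (1123, "2020-07-01", None),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, year_month TEXT, department TEXT)")
conn.executemany("INSERT INTO source VALUES (?, ?, ?)", rows)

query = """
SELECT id, year_month, prev_department, department
FROM (SELECT s.*,
             LAG(year_month) OVER (PARTITION BY id ORDER BY year_month) AS prev_ym,
             LAG(year_month) OVER (PARTITION BY id, department ORDER BY year_month) AS prev_ym_dept,
             LAG(department) OVER (PARTITION BY id ORDER BY year_month) AS prev_department
      FROM source s) AS s
WHERE prev_ym IS NOT NULL                                  -- skip the first row of each id
  AND (prev_ym_dept IS NULL OR prev_ym_dept <> prev_ym)    -- previous row had another department
ORDER BY id, year_month
"""
changes = conn.execute(query).fetchall()
for change in changes:
    print(change)
```

On these rows the sketch reports the three changes for 1123 that the question lists (to NULL in March, to B in April, to NULL in July).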
Of course, you can use more complicated comparisons:
select s.id, s.year_month, s.prev_department, s.department
from (select s.*,
lag(department) over (partition by id order by year_month) as prev_department,
min(year_month) over (partition by id) as min_year_month
from dbo.source s
) s
where prev_department <> department or
(department is null and
prev_department is not null
) or
(prev_department is null and
department is not null and
year_month <> min_year_month
);
But that is rather tricky to express, and it might even have a mistake in filtering out the first row.
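Where the engine offers a NULL-safe comparison, the three-branch condition can be collapsed into one test. The sketch below is an illustration under assumptions, not the answer's own method: it runs on SQLite through Python's sqlite3 module, where IS NOT between two expressions is NULL-safe (PostgreSQL and SQL Server 2022 spell this IS DISTINCT FROM), and it uses ROW_NUMBER() to drop the first row of each ID. The table name source is hypothetical.

```python
import sqlite3

# Sample rows from the question; None stands for a NULL department.
rows = [
    (1123, "2020-02-01", "A"), (1123, "2020-03-01", None),
    (1123, "2020-04-01", "B"), (1123, "2020-05-01", "B"),
    (1123, "2020-06-01", "B"), (1123, "2020-07-01", None),
    (9999, "2020-01-01", "A"), (9999, "2020-02-01", "A"),
    (9999, "2020-03-01", "B"), (9999, "2020-04-01", "B"),
    (9999, "2020-05-01", "B"), (9999, "2020-06-01", "A"),
    (9999, "2020-07-01", "B"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, year_month TEXT, department TEXT)")
conn.executemany("INSERT INTO source VALUES (?, ?, ?)", rows)

query = """
SELECT id, year_month, prev_department, department
FROM (SELECT s.*,
             LAG(department) OVER (PARTITION BY id ORDER BY year_month) AS prev_department,
             ROW_NUMBER()    OVER (PARTITION BY id ORDER BY year_month) AS rn
      FROM source s) AS s
WHERE rn > 1                               -- the first row of an id has nothing to compare with
  AND department IS NOT prev_department    -- NULL-safe inequality in SQLite
ORDER BY id, year_month
"""
changes = conn.execute(query).fetchall()
for change in changes:
    print(change)
```

On the sample data this also reports the switch of 9999 from A to B on 2020-07-01, which the question's expected output does not list.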

Related

Remove duplicates using multiple criteria in SQL

I've tried all afternoon to dedup a table that looks like this:
ID1 | ID2 | Date | Time |Status | Price
----+-----+------------+-----------------+--------+-------
01 | A | 01/01/2022 | 10:41:47.000000 | DDD | 55
01 | B | 02/01/2022 | 16:22:31.000000 | DDD | 53
02 | C | 01/01/2022 | 08:54:03.000000 | AAA | 72
02 | D | 03/01/2022 | 11:12:35.000000 | DDD |
03 | E | 01/01/2022 | 17:15:41.000000 | DDD | 67
03 | F | 01/01/2022 | 19:27:22.000000 | DDD | 69
03 | G | 02/01/2022 | 06:45:52.000000 | DDD | 78
Basically, I need to dedup based on two conditions:
Status: where AAA > BBB > CCC > DDD. So, pick the highest one.
When the Status is the same given the same ID1, pick the latest one based on Date and Time.
The final table should look like:
ID1 | ID2 | Date | Time |Status | Price
----+-----+------------+-----------------+--------+-------
01 | B | 02/01/2022 | 16:22:31.000000 | DDD | 53
02 | C | 01/01/2022 | 08:54:03.000000 | AAA | 72
03 | G | 02/01/2022 | 06:45:52.000000 | DDD | 78
Is there a way to do this in Redshift SQL / PostgreSQL?
I tried variations of this, but every time it fails because it demands that I add all columns to the GROUP BY, which defeats the purpose:
select a.id1,
b.id2,
b.date,
b.time,
b.status,
b.price,
case when (status = 'AAA') then 4
when (status = 'BBB') then 3
when (status= 'CCC') then 2
when (status = 'DDD') then 1
when (status = 'EEE') then 0
else null end as row_order
from table1 a
left join table2 b
on a.id1=b.id1
group by id1
having row_order = max(row_order)
and date=max(date)
and time=max(time)
Any help at all is appreciated!
Windowing functions are good at this:
SELECT ID1, ID2, Date, Time, Status, Price
FROM (
SELECT *,
row_number() OVER (PARTITION BY ID1 ORDER BY Status, Date DESC, Time DESC) rn
FROM MyTable
) t
WHERE rn = 1
See it work here:
https://dbfiddle.uk/uAvDz1Qn
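Ordering by Status directly works here only because AAA, BBB, CCC, DDD happen to sort alphabetically in priority order. As a hedged sketch of the same ROW_NUMBER() approach with the priority spelled out in a CASE expression, the snippet below loads the question's rows into SQLite through Python's sqlite3 module (a stand-in engine). The column names event_date and event_time are hypothetical, and the dates are rewritten in ISO form on the assumption that the question's format is day/month/year.

```python
import sqlite3

# Rows from the question; a missing price is stored as NULL (None).
rows = [
    ("01", "A", "2022-01-01", "10:41:47", "DDD", 55),
    ("01", "B", "2022-01-02", "16:22:31", "DDD", 53),
    ("02", "C", "2022-01-01", "08:54:03", "AAA", 72),
    ("02", "D", "2022-01-03", "11:12:35", "DDD", None),
    ("03", "E", "2022-01-01", "17:15:41", "DDD", 67),
    ("03", "F", "2022-01-01", "19:27:22", "DDD", 69),
    ("03", "G", "2022-01-02", "06:45:52", "DDD", 78),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id1 TEXT, id2 TEXT, event_date TEXT, "
    "event_time TEXT, status TEXT, price INTEGER)"
)
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?, ?, ?)", rows)

query = """
SELECT id1, id2, status, price
FROM (SELECT t.*,
             ROW_NUMBER() OVER (
                 PARTITION BY id1
                 ORDER BY CASE status WHEN 'AAA' THEN 1 WHEN 'BBB' THEN 2
                                      WHEN 'CCC' THEN 3 WHEN 'DDD' THEN 4
                                      ELSE 5 END,        -- explicit status priority
                          event_date DESC, event_time DESC
             ) AS rn
      FROM mytable t) AS ranked
WHERE rn = 1
ORDER BY id1
"""
kept = conn.execute(query).fetchall()
for row in kept:
    print(row)
```

The CASE expression keeps the query correct if status codes are later added that do not sort alphabetically by priority.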
You can use ROW_NUMBER() like so:
with cte as (
select a.id1,
b.id2,
b.date,
b.time,
b.status,
b.price,
ROW_NUMBER() OVER (PARTITION BY a.id1 ORDER BY b.status ASC, b.date DESC, b.time DESC) RN
from table1 a
left join table2 b on a.id1=b.id1
)
select * from cte where rn = 1
This is a typical top-1-per-group problem. The canonical solution indeed involves window functions, as demonstrated by Joel Coehoorn and Aaron Dietz.
But Postgres has a specific extension, called distinct on, which is built exactly for the purpose of solving top-1-per-group problems. The syntax is neater, and you benefit from built-in optimizations:
select distinct on (id1) t.*
from mytable t
order by id1, status, "Date" desc, "Time" desc
Here is a demo on DB Fiddle based on that of Joel Coehoorn.

Gather the max of a set of data by 2 columns in SQL?

I am trying to get the latest of a set of columns by a PersonID out of a set of YearIDs.
If I have a table like this:
| DataID | PersonID | YearID | Data1A | Data1B | Data2A | Data2B |
|--------|----------|--------|--------|--------|--------|--------|
| 1 | 888 | A100 | d | 0.00 | a | 1.00 |
| 2 | 888 | A101 | NULL | NULL | b | 2.00 |
| 3 | 888 | A102 | c | 3.00 | NULL | NULL |
| 4 | 333 | A100 | a | 3.40 | e | 4.00 |
| 5 | 333 | A101 | d | 0.00 | NULL | NULL |
| 6 | 333 | A102 | NULL | NULL | NULL | NULL |
How do I get the latest of column sets Data1A, Data1B and Data2A, Data2B sorted by YearID per PersonID?
This is given that Data1A and Data1B are related and Data2A and Data2B are related and can not be separated, and most recent year is A102. DataID is just an incremental PK column.
My resulting table should look like this, with Year being removed as it's no longer necessary. It should ignore NULLs but not 0's:
| DataID | PersonID | Data1A | Data1B | Data2A | Data2B |
|--------|----------|--------|--------|--------|--------|
| 1 | 888 | c | 3.00 | b | 2.00 |
| 2 | 333 | d | 0.00 | e | 4.00 |
This is what I have so far, but I don't know how to take into account the fact that I want the 'max'/latest of a set of Years by PersonID. Right now it gets the max of each column but I want the most recent valid data by latest year, and it also has Data1 and Data2 not being related at all but I need them to be.
SELECT DISTINCT
T1.SID,
GroupedT1.Data1,
GroupedT1.Data2,
FROM #Table1 T1
INNER JOIN
(SELECT SID,
MAX(Data1) AS Data1,
MAX(Data2) AS Data2,
FROM #Table1
GROUP BY PersonID) GroupedT1
ON T1.PersonID = GroupedT1.PersonID
Editing thanks to Gordon for the previous answer, this is how I tried to fix my new problem:
With this solution I'm trying to get the latest values for Data1 and Data2, ignoring however many NULL columns there are, and picking data from any YearID as long as it's the latest. So if Data1A is NULL in year A102, it should pick year A101's Data1A, and if Data2A is NULL for many years, it should pick the latest non-NULL one (in this case, year A100). At the moment it's close, but it only picks by row; it needs to pick by year and cope with any number of NULL values.
select t1.PersonID, t1.Data1A, t1.Data1B, t1.Data2A, t1.Data2B
from (select t1.*,
row_number() over (partition by SID order by
(case when Data1A is not null then 1 else 2 end),
(case when Data2A is not null then 1 else 2 end),
YearID desc) as seqnum
from #Table1 t1
) t1
where seqnum = 1
This answers the original question.
I think you want a simple filtering before applying logic such as row_number():
select t1.*
from (select t1.*,
row_number() over (partition by personid order by yearid desc) as seqnum
from #table1 t1
where data1 is not null and data2 is not null
) t1
where seqnum = 1;
EDIT:
To answer the revised question, you need to handle each column separately. You can do this using outer apply:
select p.personid, d1.data1, d2.data2, . . .
from (select distinct personid from #table1) p outer apply
(select top (1) t1.data1
from #table1 t1
where t1.personid = p.personid and t1.data1 is not null
order by t1.yearid desc
) d1 outer apply
(select top (1) t1.data2
from #table1 t1
where t1.personid = p.personid and t1.data2 is not null
order by t1.yearid desc
) d2 . . .
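For engines without outer apply, the same per-column-set lookup can be sketched with one ranked subquery per column pair. This is a swapped-in technique (ROW_NUMBER() plus LEFT JOIN), not the answer's own query, shown on SQLite through Python's sqlite3 module with the question's six rows. The table name table1 is hypothetical, and each pair is assumed to be NULL or non-NULL together, as in the sample.

```python
import sqlite3

# Rows from the question: (DataID, PersonID, YearID, Data1A, Data1B, Data2A, Data2B).
rows = [
    (1, 888, "A100", "d", 0.00, "a", 1.00),
    (2, 888, "A101", None, None, "b", 2.00),
    (3, 888, "A102", "c", 3.00, None, None),
    (4, 333, "A100", "a", 3.40, "e", 4.00),
    (5, 333, "A101", "d", 0.00, None, None),
    (6, 333, "A102", None, None, None, None),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (dataid INTEGER, personid INTEGER, yearid TEXT, "
    "data1a TEXT, data1b REAL, data2a TEXT, data2b REAL)"
)
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

query = """
WITH d1 AS (   -- latest year with a non-NULL Data1 pair, per person
    SELECT personid, data1a, data1b,
           ROW_NUMBER() OVER (PARTITION BY personid ORDER BY yearid DESC) AS rn
    FROM table1
    WHERE data1a IS NOT NULL
),
d2 AS (        -- latest year with a non-NULL Data2 pair, per person
    SELECT personid, data2a, data2b,
           ROW_NUMBER() OVER (PARTITION BY personid ORDER BY yearid DESC) AS rn
    FROM table1
    WHERE data2a IS NOT NULL
)
SELECT p.personid, d1.data1a, d1.data1b, d2.data2a, d2.data2b
FROM (SELECT DISTINCT personid FROM table1) AS p
LEFT JOIN d1 ON d1.personid = p.personid AND d1.rn = 1
LEFT JOIN d2 ON d2.personid = p.personid AND d2.rn = 1
ORDER BY p.personid
"""
latest = conn.execute(query).fetchall()
for row in latest:
    print(row)
```

The LEFT JOINs keep a person in the result even when one of the pairs has no non-NULL value in any year, which mirrors what outer apply does.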
You can use the not exists keyword to keep only the row with the latest YearID per person:
SELECT DataID, PersonID, Data1, Data2
FROM #Table1 T1
where not exists(select 1 from #Table1 T2
where T1.PersonID = T2.PersonID and T2.YearID > T1.YearID)

Retrieve the minimal create date with multiple rows

I have an issue with an SQL query that I am trying to write. I am trying to retrieve the row that has the minimal create_dt for each inst (see table) and amount (which isn't unique).
Unfortunately I can't use group by as the amount column isn't unique.
+--------------+--------+------+-------------+
| Company_Name | Amount | inst | Create Date |
+--------------+--------+------+-------------+
| Company A | 1000 | 4545 | 01/10/2018 |
| Company A | 400 | 4545 | 01/11/2018 |
| Company A | 200 | 4545 | 31/10/2018 |
| Company B | 2000 | 4893 | 01/10/2016 |
| Company B | 212 | 4893 | 04/10/2016 |
| Company B | 100 | 4893 | 10/10/2017 |
| Company B | 20 | 4893 | 04/10/2018 |
+--------------+--------+------+-------------+
In the above example I expect to see:
+--------------+--------+------+-------------+
| Company_Name | Amount | inst | Create Date |
+--------------+--------+------+-------------+
| Company A | 1000 | 4545 | 01/10/2018 |
| Company B | 2000 | 4893 | 01/10/2016 |
+--------------+--------+------+-------------+
Code:
SELECT
bill_company, bill_name, account_no
FROM
dbo.customer_information;
SELECT
balance_id, balance_id2, minus_balance,new_balance,
create_date, account_no
FROM
dbo.btr
SELECT
balance_id, balance_id2, expired_Date, amount, balance_type, account_no
FROM
dbo.btr_balance
SELECT
balance_ist, expired_date, account_no, balance_type
FROM
dbo.BALANCE_inst
Retrieve the minimal create data for a balance instance with the lowest balance for a balance inst.
(SELECT
bill_company,
bill_name,
account_no,
balance_ist,
amount,
MIN(create_date)
FROM
dbo.mtr btr
LEFT JOIN
btr_balance btrb ON btr.balance_id = btrb.balance_id
AND btr.balance_id2 = btrb.balance_id2
LEFT JOIN
balance_inst bali ON btr.account_no = bali.account_no
AND btrb.expired_date = bali.expired_date
GROUP BY
bill_company, bill_name, account_no,amount, balance_ist)
I have seen some solutions using a correlated query but can't seem to get my head around it.
A common table expression (CTE) with row_number() will help you:
;with cte as (
select *, row_number() over(partition by company_name order by create_date) rn
from dbo.myTable
)
select * from cte
where rn = 1;
Use row_number(). I assumed bill_company is your company name:
select * from
( SELECT bill_company,
bill_name,
account_no,
balance_ist,
amount,
create_date,
row_number() over(partition by bill_company order by create_date) rn
FROM dbo.mtr btr left join btr_balance btrb
on btr.balance_id = btrb.balance_id and btr.balance_id2 = btrb.balance_id2
left join balance_inst bali
on btr.account_no = bali.account_no and btrb.expired_date = bali.expired_date
) t where t.rn=1

SELECT based on multiple fields in MS-SQL

I have a table with 4 columns:
AcctNumb | PeriodEndingDate | WaterConsumption | ReadingType
There are multiple records for each AcctNumb, with the date that each record was recorded.
What I want to do is grab the most recent date, consumption reading, and reading type for each account.
I have tried using MAX(PeriodEndingDate) and GROUP BY AcctNumb, but I would need to aggregate all the other values, and none of the aggregate functions help me for the WaterConsumption, etc.
Can anyone point me in the right direction?
Thanks
EDIT
Here is a sample table
+----------+------------------+------------------+-------------+
| AcctNumb | PeriodEndingDate | WaterConsumption | ReadingType |
+----------+------------------+------------------+-------------+
| 1000 | 2018-03-31 | 122230 | A |
| 1001 | 2018-03-31 | 24850 | A |
| 1002 | 2018-03-31 | 88540 | A |
| 1000 | 2017-12-31 | 123800 | A |
| 1001 | 2017-12-31 | 3000 | E |
+----------+------------------+------------------+-------------+
The ReadingType is whether it's an actual (A) reading, or an estimate (E).
Try this
SELECT
AcctNumb,
PeriodEndingDate,
WaterConsumption,
ReadingType
FROM (SELECT
AcctNumb,
PeriodEndingDate,
WaterConsumption,
ReadingType,
ROW_NUMBER() OVER (PARTITION BY AcctNumb ORDER BY PeriodEndingDate DESC) AS MostrecentRecord
FROM <TableName>) dt
WHERE MostrecentRecord= 1
This can be done using ROW_NUMBER. It has been asked and answered thousands of times, but the query is easier to write than to find a duplicate.
select *
from
(
select *
, RowNum = ROW_NUMBER() over(partition by AcctNumb order by PeriodEndingDate desc)
from YourTable
) x
where x.RowNum = 1
SELECT DQ.* FROM
(SELECT *,
Row_Number() OVER (PARTITION BY AcctNumb ORDER BY PeriodEndingDate DESC) AS RN
FROM YourTable
) AS DQ
WHERE DQ.RN = 1

Have a column with the lowest possible next value (self-joining a table)

I am looking for a way to get the lowest next value in a sequence. Basically, I have a dataset of Dates and I want it to return the next day unless it's the latest date in the database, then I want it to return this instead.
My current query looks like this and almost works - of course up to the point where I want the latest possible value instead of the next one:
SELECT
a.date,
a.key,
a.description,
b.date NextDate
FROM
my_table a
CROSS APPLY (SELECT TOP 1
b.date
FROM
my_table b
WHERE
a.key = b.key AND
a.date < b.date) b
Sample data:
+----------+-----+-------------+
| date | key | description |
+----------+-----+-------------+
| 20170101 | atx | xxx |
| 20161228 | hfn | xxx |
| 20161222 | ktn | xxx |
| 20161214 | yqe | xxx |
| 20161204 | olp | xxx |
| 20161122 | bux | xxx |
+----------+-----+-------------+
What the result should look like:
+----------+-----+-------------+----------+
| date | key | description | NextDate |
+----------+-----+-------------+----------+
| 20170101 | atx | xxx | 20170101 |
| 20161228 | hfn | xxx | 20170101 |
| 20161222 | ktn | xxx | 20161228 |
| 20161214 | yqe | xxx | 20161222 |
| 20161204 | olp | xxx | 20161214 |
| 20161122 | bux | xxx | 20161204 |
+----------+-----+-------------+----------+
You can use a case expression to do this.
SELECT
a.date,
a.key,
a.description,
case when a.date = max(a.date) over() then a.date
else (select min(b.date) from my_table b where a.date < b.date) end as NextDate
FROM
my_table a
You can use lag on the date column, ordered descending, with the row's own date as the default for the latest row:
select t.*,
lag(date, 1, date) over (order by date desc) nextdate
from my_table t
I believe you want:
select a.*,
coalesce(lead(date) over (order by date),
max(date) over ()
) as NextDate
from my_table a;
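As a quick check of the lead() plus coalesce() idea, the sketch below runs it on the question's sample rows using SQLite through Python's sqlite3 module (a stand-in engine, since the question uses SQL Server syntax). The dates are stored as yyyymmdd text, which sorts chronologically, and the output column name next_date is an illustrative choice.

```python
import sqlite3

# Sample rows from the question, with dates kept as yyyymmdd text.
rows = [
    ("20170101", "atx", "xxx"), ("20161228", "hfn", "xxx"),
    ("20161222", "ktn", "xxx"), ("20161214", "yqe", "xxx"),
    ("20161204", "olp", "xxx"), ("20161122", "bux", "xxx"),
]

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE my_table ("date" TEXT, "key" TEXT, description TEXT)')
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?)", rows)

query = """
SELECT a."date", a."key",
       COALESCE(LEAD(a."date") OVER (ORDER BY a."date"),   -- next higher date, if any
                MAX(a."date") OVER ()) AS next_date        -- else the latest date itself
FROM my_table a
ORDER BY a."date" DESC
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

LEAD() returns NULL only for the latest row, so COALESCE() falls back to the overall maximum there and nowhere else.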
If your table never has a missing date the following would work.
SELECT CONVERT(DATE,CONVERT(CHAR(10),a.date,120))
,a.[key]
,a.description
,CASE
WHEN (SELECT MAX(m.date) FROM my_table m) <> a.date
THEN DATEADD(DAY,1,CONVERT(DATE,CONVERT(CHAR(10),a.date,120)))--This could be a select statement
ELSE CONVERT(DATE,CONVERT(CHAR(10),a.date,120))
END AS NextDate
FROM my_table a
ORDER BY a.date DESC
Alternatively if there are missing dates then you could use a SQL statement in the CASE to get the next highest date.
SELECT MIN(Date) FROM my_table WHERE Date > a.Date
Not the most performant code, but seeing as we are talking date tables it would work. I'm sure a CTE could be used to do this as well, if you need a bit more performance
Using SQL 2008 without LEAD & LAG etc...
Try this
;with cte as
(
SELECT [DATE] = Cast([date] AS DATE),
[key],
[description],
Lag([date])OVER(ORDER BY Cast([date] AS DATE) DESC) AS prev_date
FROM ( VALUES ('20170101','atx','xxx'),
('20161228','hfn','xxx'),
('20161222','ktn','xxx'),
('20161214','yqe','xxx'),
('20161204','olp','xxx'),
('20161122','bux','xxx')) tc ([date], [key], [description])
)
SELECT [date],
[Key],
[Description],
NextDate = Iif([date] < prev_date, prev_date, [date])
FROM cte
Result :
+------------+-----+-------------+------------+
| date | Key | Description | NextDate |
+------------+-----+-------------+------------+
| 2017-01-01 | atx | xxx | 2017-01-01 |
| 2016-12-28 | hfn | xxx | 2017-01-01 |
| 2016-12-22 | ktn | xxx | 2016-12-28 |
| 2016-12-14 | yqe | xxx | 2016-12-22 |
| 2016-12-04 | olp | xxx | 2016-12-14 |
| 2016-11-22 | bux | xxx | 2016-12-04 |
+------------+-----+-------------+------------+