Detect value changes beyond a threshold in time-series data in SQL

In PostgreSQL, I am trying to find subjects that have a sequence of values below 60 followed by two consecutive values above 60. I'm also interested in the length of time between the first recorded value below 60 and the second value above 60. This event can occur multiple times per subject.
I am struggling to work out how to search for an unlimited number of values < 60 followed by 2 values >= 60.
RowID SubjectID Value TimeStamp
1 1 65 2142-04-29 12:00:00
2 1 58 2142-04-30 03:00:00
3 1 55 2142-04-30 04:00:00
4 1 54 2142-04-30 05:00:00
5 1 55 2142-04-30 06:15:00
6 1 56 2142-04-30 06:45:00
7 1 65 2142-04-30 07:00:00
8 1 65 2142-04-30 08:00:00
9 2 48 2142-05-04 03:30:00
10 2 48 2142-05-04 04:00:00
11 2 50 2142-05-04 05:00:00
12 2 69 2142-05-04 06:00:00
13 2 68 2142-05-04 07:00:00
14 2 69 2142-05-04 08:00:00
15 2 50 2142-05-04 09:00:00
16 2 55 2142-05-04 10:00:00
17 2 50 2142-05-04 10:30:00
18 2 67 2142-05-04 11:00:00
19 2 67 2142-05-04 12:00:00
My current attempt uses the lag and lead functions, but I am unsure about how to use these functions when I am unsure how far I need to look ahead. This is an example of looking ahead one value and behind one value. My problem is I do not know how to partition by subjectID to look "t" time points ahead where "t" may be different for every subject.
select t.subjectId, t.didEventOccur,
       (next_timestamp - timestamp) as duration
from (select t.*,
             lag(t.value) over (partition by t.subjectid order by t.timestamp) as prev_value,
             lead(t.value) over (partition by t.subjectid order by t.timestamp) as next_value,
             lead(t.timestamp) over (partition by t.subjectid order by t.timestamp) as next_timestamp
      from t
     ) t
where value < 60 and next_value < 60 and
      (prev_value is null or prev_value >= 60);
I hope to get an output such as:
SubjectID DidEventOccur Duration
1 1 05:00:00
2 1 03:30:00
2 1 03:00:00

A pure SQL solution like you have been asking for:
SELECT subjectid, start_at, next_end_at - start_at AS duration
FROM (
   SELECT *
        , lead(end_at) OVER (PARTITION BY subjectid ORDER BY start_at) AS next_end_at
   FROM (
      SELECT subjectid, grp, big
           , min(ts) AS start_at
           , max(ts) FILTER (WHERE big AND big_rn = 2) AS end_at  -- 2nd timestamp
      FROM (
         SELECT subjectid, ts, grp, big
              , row_number() OVER (PARTITION BY subjectid, grp, big ORDER BY ts) AS big_rn
         FROM (
            SELECT subjectid, ts
                 , row_number() OVER (PARTITION BY subjectid ORDER BY ts)
                 - row_number() OVER (PARTITION BY subjectid, (value > 60) ORDER BY ts) AS grp
                 , (value > 60) AS big
            FROM tbl
            ) sub1
         ) sub2
      GROUP BY subjectid, grp, big
      ) sub3
   ) sub4
WHERE NOT big                  -- identifies blocks of values <= 60 ...
AND   next_end_at IS NOT NULL  -- ... followed by at least 2 values > 60
ORDER BY subjectid, start_at;
I omitted the useless column DidEventOccur and added start_at instead. Otherwise it is exactly your desired result.
db<>fiddle here
Consider a procedural solution in plpgsql (or any PL) instead; it should be faster. Simpler? I'd say yes, but that depends on who's judging. See (with an explanation of the technique and links to more):
How to number consecutive records per island?
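The island-numbering trick this answer relies on (the difference of two row_number() sequences is constant within each run of rows sharing the same flag) can be sanity-checked on a tiny fixture. This is a hypothetical miniature, illustrated with SQLite via Python's sqlite3 rather than PostgreSQL:

```python
import sqlite3

# Miniature of the sample data: a handful of rows for subject 1.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tbl (subjectid INT, ts TEXT, value INT);
INSERT INTO tbl VALUES
  (1, '2142-04-29 12:00:00', 65),
  (1, '2142-04-30 03:00:00', 58),
  (1, '2142-04-30 04:00:00', 55),
  (1, '2142-04-30 07:00:00', 65),
  (1, '2142-04-30 08:00:00', 65);
""")
# grp is constant within each island of consecutive equal (value > 60) flags.
rows = con.execute("""
SELECT subjectid, ts, (value > 60) AS big,
       ROW_NUMBER() OVER (PARTITION BY subjectid ORDER BY ts)
     - ROW_NUMBER() OVER (PARTITION BY subjectid, (value > 60) ORDER BY ts) AS grp
FROM tbl
ORDER BY ts
""").fetchall()
for r in rows:
    print(r)
```

The high/low/high runs come out as three distinct grp values, which is exactly what the outer aggregation groups on.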


Snowflake SQL - Count Distinct Users within descending time interval

I want to count the distinct number of users over the last 60 days, then the distinct number over the last 59 days, and so on and so forth.
Ideally, the output would look like this (TARGET OUTPUT)
Day Distinct Users
60 200
59 200
58 188
57 185
56 180
[...] [...]
where the 60-day window has the maximum possible number of distinct users, the 59-day window a little fewer, and so on and so forth.
My query looks like this:
select
count(distinct (case when datediff(day,DATE,current_date) <= 60 then USER_ID end)) as day_60,
count(distinct (case when datediff(day,DATE,current_date) <= 59 then USER_ID end)) as day_59,
count(distinct (case when datediff(day,DATE,current_date) <= 58 then USER_ID end)) as day_58
FROM Table
The issue with my query is that this outputs the data by column instead of by row (as shown below) AND, most importantly, I have to write out this logic 60 times, once for each of the 60 days.
Current Output:
Day_60 Day_59 Day_58
209 207 207
Is it possible to write the SQL in a way that creates the target as shown initially above?
Using the data below in CTE form -
with data_cte(dates,userid) as
(select * from values
('2022-05-01'::date,'UID1'),
('2022-05-01'::date,'UID2'),
('2022-05-02'::date,'UID1'),
('2022-05-02'::date,'UID2'),
('2022-05-03'::date,'UID1'),
('2022-05-03'::date,'UID2'),
('2022-05-03'::date,'UID3'),
('2022-05-04'::date,'UID1'),
('2022-05-04'::date,'UID1'),
('2022-05-04'::date,'UID2'),
('2022-05-04'::date,'UID3'),
('2022-05-04'::date,'UID4'),
('2022-05-05'::date,'UID1'),
('2022-05-06'::date,'UID1'),
('2022-05-07'::date,'UID1'),
('2022-05-07'::date,'UID2'),
('2022-05-08'::date,'UID1')
)
Query to get all dates with their count and distinct count -
select dates,count(userid) cnt, count(distinct userid) cnt_d
from data_cte
group by dates;
DATES       CNT  CNT_D
2022-05-01    2      2
2022-05-02    2      2
2022-05-03    3      3
2022-05-04    5      4
2022-05-05    1      1
2022-05-06    1      1
2022-05-08    1      1
2022-05-07    2      2
Query to get the difference of each date from the current date -
select dates,datediff(day,dates,current_date()) ddiff,
count(userid) cnt,
count(distinct userid) cnt_d
from data_cte
group by dates;
DATES       DDIFF  CNT  CNT_D
2022-05-01     45    2      2
2022-05-02     44    2      2
2022-05-03     43    3      3
2022-05-04     42    5      4
2022-05-05     41    1      1
2022-05-06     40    1      1
2022-05-08     38    1      1
2022-05-07     39    2      2
To get only records whose date difference falls within a certain range, include a HAVING clause -
select datediff(day,dates,current_date()) ddiff,
count(userid) cnt,
count(distinct userid) cnt_d
from data_cte
group by dates
having ddiff<=43;
DDIFF  CNT  CNT_D
   43    3      3
   42    5      4
   41    1      1
   39    2      2
   38    1      1
   40    1      1
If you need to prefix 'day' to each date-diff count, you can add an outer query over the previously fetched data set and prepend the prefix to the date-diff column, as follows. I am using CTE syntax, but you may use a sub-query if you are selecting from a table -
,cte_1 as (
select datediff(day,dates,current_date()) ddiff,
count(userid) cnt,
count(distinct userid) cnt_d
from data_cte
group by dates
having ddiff<=43)
select 'day_'||to_char(ddiff) days,
cnt,
cnt_d
from cte_1;
DAYS    CNT  CNT_D
day_43    3      3
day_42    5      4
day_41    1      1
day_39    2      2
day_38    1      1
day_40    1      1
Updated the answer to get the distinct user count over a range of day windows.
A clause can be added to the final query to limit the number of days as needed.
with data_cte(dates,userid) as
(select * from values
('2022-05-01'::date,'UID1'),
('2022-05-01'::date,'UID2'),
('2022-05-02'::date,'UID1'),
('2022-05-02'::date,'UID2'),
('2022-05-03'::date,'UID5'),
('2022-05-03'::date,'UID2'),
('2022-05-03'::date,'UID3'),
('2022-05-04'::date,'UID1'),
('2022-05-04'::date,'UID6'),
('2022-05-04'::date,'UID2'),
('2022-05-04'::date,'UID3'),
('2022-05-04'::date,'UID4'),
('2022-05-05'::date,'UID7'),
('2022-05-06'::date,'UID1'),
('2022-05-07'::date,'UID8'),
('2022-05-07'::date,'UID2'),
('2022-05-08'::date,'UID9')
),cte_1 as
(select datediff(day,dates,current_date()) ddiff,userid
from data_cte), cte_2 as
(select distinct ddiff from cte_1 )
select cte_2.ddiff,
(select count(distinct userid)
from cte_1 where cte_1.ddiff <= cte_2.ddiff) cnt
from cte_2
order by cte_2.ddiff desc
DDIFF  CNT
   47    9
   46    9
   45    9
   44    8
   43    5
   42    4
   41    3
   40    1
You can do an UNPIVOT after getting your current output. A sample:
select *
from (
    select
        209 Day_60,
        207 Day_59,
        207 Day_58
) unpivot (cnt for days in (Day_60, Day_59, Day_58));
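For engines without an UNPIVOT clause, the same columns-to-rows reshaping can be sketched with one SELECT per column stitched together with UNION ALL; this portable equivalent is demonstrated here with SQLite via Python's sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Portable UNPIVOT: one SELECT per source column, combined with UNION ALL.
rows = con.execute("""
WITH wide AS (SELECT 209 AS day_60, 207 AS day_59, 207 AS day_58)
SELECT 'Day_60' AS days, day_60 AS cnt FROM wide
UNION ALL
SELECT 'Day_59', day_59 FROM wide
UNION ALL
SELECT 'Day_58', day_58 FROM wide
ORDER BY days DESC
""").fetchall()
print(rows)
```

The trade-off is verbosity: one branch per column, versus UNPIVOT's single column list.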

How to create a start and end date with no gaps from one date column and to sum a value within the dates

I am new to SQL coding, using SQL Developer.
I have a table that has 4 columns: Patient ID (ptid), service date (dt), insurance payment amount (insr_amt), out of pocket payment amount (op_amt). (see table 1 below)
What I would like to do is (1) create two columns, "start_dt" and "end_dt", from the "dt" column: if there are no gaps in the dates for a patient ID, populate the start and end date with that patient's first and last date; if there is a gap in the service dates within a patient ID, create separate start/end date rows for that patient. I would also like to (2) sum the two payment amounts by patient ID within each start/end date range (see table 2 below).
What would be the way to run this using SQL code in SQL developer?
Thank you!
Table 1:
Ptid  dt        insr_amt  op_amt
A     1/1/2021        30      20
A     1/2/2021        30      10
A     1/3/2021        30      10
A     1/4/2021        30      30
B     1/6/2021        10      10
B     1/7/2021        20      10
C     2/1/2021        15      30
C     2/2/2021        15      30
C     2/6/2021        60      30
Table 2:
Ptid  start_dt  end_dt    total_insr_amt  total_op_amt
A     1/1/2021  1/4/2021             120            70
B     1/6/2021  1/7/2021              30            20
C     2/1/2021  2/2/2021              30            60
C     2/6/2021  2/6/2021              60            30
You didn't mention the specific database, so this solution is written for PostgreSQL. You can do:
select
ptid,
min(dt) as start_dt,
max(dt) as end_dt,
sum(insr_amt) as total_insr_amt,
sum(op_amt) as total_op_amt
from (
select *,
sum(inc) over(partition by ptid order by dt) as grp
from (
select *,
case when dt - interval '1 day' = lag(dt) over(partition by ptid order by dt)
then 0 else 1 end as inc
from t
) x
) y
group by ptid, grp
order by ptid, grp
Result:
ptid start_dt end_dt total_insr_amt total_op_amt
----- ---------- ---------- -------------- -----------
A 2021-01-01 2021-01-04 120 70
B 2021-01-06 2021-01-07 30 20
C 2021-02-01 2021-02-02 30 60
C 2021-02-06 2021-02-06 60 30
See running example at DB Fiddle 1.
EDIT for Oracle
As requested, the modified query that works in Oracle is:
select
ptid,
min(dt) as start_dt,
max(dt) as end_dt,
sum(insr_amt) as total_insr_amt,
sum(op_amt) as total_op_amt
from (
select x.*,
sum(inc) over(partition by ptid order by dt) as grp
from (
select t.*,
case when dt - 1 = lag(dt) over(partition by ptid order by dt)
then 0 else 1 end as inc
from t
) x
) y
group by ptid, grp
order by ptid, grp
See running example at db<>fiddle 2.
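The same gaps-and-islands idea can be exercised end to end on a subset of the sample rows. This sketch uses SQLite via Python's sqlite3 purely for illustration, with date(dt, '-1 day') standing in for the dt - interval '1 day' / dt - 1 arithmetic above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (ptid TEXT, dt TEXT, insr_amt INT, op_amt INT);
INSERT INTO t VALUES
  ('A', '2021-01-01', 30, 20), ('A', '2021-01-02', 30, 10),
  ('A', '2021-01-03', 30, 10), ('A', '2021-01-04', 30, 30),
  ('C', '2021-02-01', 15, 30), ('C', '2021-02-02', 15, 30),
  ('C', '2021-02-06', 60, 30);
""")
# inc = 1 starts a new island whenever the previous row is not exactly one day back;
# the running sum of inc is then a stable per-island group number.
rows = con.execute("""
SELECT ptid, MIN(dt) AS start_dt, MAX(dt) AS end_dt,
       SUM(insr_amt) AS total_insr_amt, SUM(op_amt) AS total_op_amt
FROM (
  SELECT *, SUM(inc) OVER (PARTITION BY ptid ORDER BY dt) AS grp
  FROM (
    SELECT *,
           CASE WHEN date(dt, '-1 day') =
                     LAG(dt) OVER (PARTITION BY ptid ORDER BY dt)
                THEN 0 ELSE 1 END AS inc
    FROM t
  ) x
) y
GROUP BY ptid, grp
ORDER BY ptid, grp
""").fetchall()
for r in rows:
    print(r)
```

Patient C's gap between 2/2 and 2/6 splits into two islands, as in Table 2.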

Project data and cumulative sum forward

I am trying to push the last value of a cumulative dataset forward to present time.
Initialise test data:
drop table if exists test_table;
create table test_table
as select data_date::date, floor(random() * 10) as data_value
from
generate_series('2021-08-25'::date, '2021-08-31'::date, '1 day') data_date;
The above test data produces something like this (with cumulative_value shown as a running sum of data_value):
data_date data_value cumulative_value
2021-08-25 1 1
2021-08-26 7 8
2021-08-27 8 16
2021-08-28 7 23
2021-08-29 2 25
2021-08-30 2 27
2021-08-31 7 34
What I wish to do, is push the last data value (2021-08-31 7) forward to present time. For example, say today's date was 2021-09-03, I would want the result to be something like:
data_date data_value cumulative_value
2021-08-25 1 1
2021-08-26 7 8
2021-08-27 8 16
2021-08-28 7 23
2021-08-29 2 25
2021-08-30 2 27
2021-08-31 7 34
2021-09-01 7 41
2021-09-02 7 48
2021-09-03 7 55
You need to get the value of the last date in the table. Common table expression is a good way to do that:
with cte as (
select data_value as last_val
from test_table
order by data_date desc
limit 1)
select
gen_date::date as data_date,
coalesce(data_value, last_val) as data_value,
sum(coalesce(data_value, last_val)) over (order by gen_date) as cumulative_sum
from generate_series('2021-08-25'::date, '2021-09-03', '1 day') as gen_date
left join test_table on gen_date = data_date
cross join cte
Test it in db<>fiddle.
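The same approach can be checked without generate_series by building the date series with a recursive CTE. This is an illustrative sketch with made-up dates, using SQLite via Python's sqlite3 in place of PostgreSQL:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE test_table (data_date TEXT, data_value INT);
INSERT INTO test_table VALUES
  ('2021-08-29', 2), ('2021-08-30', 2), ('2021-08-31', 7);
""")
# last_val holds the value at the latest date; dates generates the full series;
# COALESCE fills the projected days with that last value before the running sum.
rows = con.execute("""
WITH RECURSIVE
last_val AS (
  SELECT data_value AS lv FROM test_table ORDER BY data_date DESC LIMIT 1),
dates(d) AS (
  SELECT '2021-08-29'
  UNION ALL
  SELECT date(d, '+1 day') FROM dates WHERE d < '2021-09-02')
SELECT d AS data_date,
       COALESCE(t.data_value, lv) AS data_value,
       SUM(COALESCE(t.data_value, lv)) OVER (ORDER BY d) AS cumulative_sum
FROM dates
CROSS JOIN last_val
LEFT JOIN test_table t ON t.data_date = d
ORDER BY d
""").fetchall()
for r in rows:
    print(r)
```

The last stored value (7) repeats for each projected day and the cumulative sum keeps climbing, as in the desired output.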
You may use UNION ALL and a scalar subquery to find the latest value of data_value for the new rows. cumulative_value is re-evaluated.
select *, sum(data_value) over (rows between unbounded preceding and current row) as cumulative_value
from
(
select data_date, data_value from test_table
UNION all
select rd, (select data_value from test_table where data_date = '2021-08-31')
from generate_series('2021-09-01'::date, '2021-09-03', '1 day') rd
) t
order by data_date;
And here is a slightly smarter version without fixed date literals.
with cte(latest_date) as (select max(data_date) from test_table)
select *, sum(data_value) over (rows between unbounded preceding and current row) as cumulative_value
from
(
select data_date, data_value from test_table
UNION ALL
select rd::date, (select data_value from test_table, cte where data_date = latest_date)
from generate_series((select latest_date from cte) + 1, CURRENT_DATE, '1 day') rd
) t
order by data_date;
SQL Fiddle here.

Cumulative sum with missing month values in SQL

I have the input data below:
date amount
01-01-2020 10
01-02-2020 15
01-03-2020 10
01-05-2020 20
01-06-2020 30
01-08-2020 5
01-09-2020 6
01-10-2020 10
select sum(date),over(partition date) from table;
After adding the missing month values, I need the output below:
output
Date amount cum_sum
01-01-2020 10 10
01-02-2020 15 25
01-03-2020 10 35
01-04-2020 0 35
01-05-2020 20 55
01-06-2020 30 85
01-07-2020 0 85
01-08-2020 5 90
01-09-2020 6 96
01-10-2020 10 106
You would typically generate the dates with a recursive query, then use window functions.
You don't say which database you use. The exact syntax of recursive queries and date arithmetic varies across vendors, but here is what it would look like:
with recursive all_dates (dt, max_dt) as (
    select min(date) dt, max(date) max_dt from mytable
    union all
    select dt + interval '1' day, max_dt from all_dates where dt < max_dt
)
select d.dt, coalesce(t.amount, 0) as amount,
       sum(t.amount) over(order by d.dt) as cum_sum
from all_dates d
left join mytable t on t.date = d.dt
order by d.dt
You simply want a window function:
select t.*, sum(amount) over (order by date)
from table t

How to do a x-days grouped sum in redshift?

I have the following table, which shows how many items from different units entered the inventory on different dates.
ID Date Unit Quantity
---------------------------------
1 2017-08-01 A_red 05
2 2017-08-13 A_red 10
3 2017-09-20 A_red 20
4 2017-09-22 A_red 40
5 2017-10-05 A_red 40
6 2017-10-25 A_red 30
7 2017-10-24 A_blue 60
The problem is: entries within a time interval of 30 days of the same unit should be grouped.
So I want the following result:
ID Date Unit Quantity fst_entry30 Quantity30
-----------------------------------------------------
1 2017-08-01 A_red 05 T 15
2 2017-08-13 A_red 10 F 15
3 2017-09-20 A_red 20 T 100
4 2017-09-22 A_red 40 F 100
5 2017-10-05 A_red 40 F 100
6 2017-10-25 A_red 30 T 30
7 2017-10-24 A_blue 60 T 60
where fst_entry30 is a flag indicating whether the entry was the first for this unit in the last 30 days. Note that a different unit (A_blue instead of A_red) is not grouped together with A_red.
And quantity30 is the grouped sum of quantity.
For example, between 20 September and 5 October there are fewer than 30 days, so those entries were grouped.
Bear in mind that Redshift does not allow recursive common table expressions.
I already tried self-joins, but that turned out to be cumbersome.
You would just use lag() to define the groups:
select t.*,
       (case when date < lag(date) over (partition by unit order by date) + interval '30 day'
             then 0 else 1
        end) as grp_start
from t;
Then you can do a cumulative sum to assign a number to each group, and finally add the quantities up using a window function:

select t.*, sum(quantity) over (partition by unit, grp) as quantity30
from (select t.*,
             sum(grp_start) over (partition by unit order by date) as grp
      from (select t.*,
                   (case when date < lag(date) over (partition by unit order by date) + interval '30 day'
                         then 0 else 1
                    end) as grp_start
            from t
           ) t
     ) t
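As a quick check of the combined query (with the group-start condition written so that a gap of 30 or more days starts a new group), here is a sketch on a few of the sample rows, using SQLite via Python's sqlite3 in place of Redshift and date(..., '+30 day') in place of the interval arithmetic:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (id INT, date TEXT, unit TEXT, quantity INT);
INSERT INTO t VALUES
  (1, '2017-08-01', 'A_red',   5), (2, '2017-08-13', 'A_red',  10),
  (3, '2017-09-20', 'A_red',  20), (4, '2017-09-22', 'A_red',  40),
  (5, '2017-10-05', 'A_red',  40), (7, '2017-10-24', 'A_blue', 60);
""")
# grp_start = 1 when the previous entry of the same unit is 30+ days back (or absent);
# the running sum of grp_start numbers the groups, which the outer window sums over.
rows = con.execute("""
SELECT id, unit, SUM(quantity) OVER (PARTITION BY unit, grp) AS quantity30
FROM (
  SELECT *, SUM(grp_start) OVER (PARTITION BY unit ORDER BY date) AS grp
  FROM (
    SELECT *,
           CASE WHEN date < date(LAG(date) OVER
                                   (PARTITION BY unit ORDER BY date), '+30 day')
                THEN 0 ELSE 1 END AS grp_start
    FROM t
  ) a
) b
ORDER BY id
""").fetchall()
for r in rows:
    print(r)
```

Note this chains from the previous entry rather than from the group's first entry, so the two approaches can disagree on long runs of closely spaced entries.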