Counting running total on time series where condition is X

I have a SQL table of date entries with three columns: date, item, and status. The table appears like this:
date        item  status
2023-01-01  A     on
2023-01-01  B     on
2023-01-01  C     off
2023-01-02  A     on
2023-01-02  B     off
2023-01-02  C     off
2023-01-02  D     on
2023-01-03  A     on
2023-01-03  B     off
2023-01-03  C     off
2023-01-03  D     off
From the most recent entries, I need one row per item with the latest date, the current status, and a count of the consecutive entries for which the status has not changed. For example, the output I am looking for would be:
latest_date  item  current_status  number_of_days_on_current
2023-01-03   A     on              3
2023-01-03   B     off             2
2023-01-03   C     off             3
2023-01-03   D     off             1
How would I get the output I want in PostgreSQL 13.7?
This returns the latest date, item, and current status, but does not correctly count the number of days the item has been on the current status:
WITH CTE AS (
SELECT
item,
date,
status,
LAG(status) OVER (PARTITION BY item ORDER BY date) AS prev_status,
ROW_NUMBER() OVER (PARTITION BY item ORDER BY date DESC) AS rn
FROM
schema.table
)
SELECT
MAX(date) AS latest_date,
item,
status AS current_status,
SUM(CASE WHEN prev_status = status THEN 0 ELSE 1 END)
OVER (PARTITION BY item ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS number_of_days
FROM
CTE
WHERE
rn = 1
GROUP BY
item, status, prev_status, date
ORDER BY
item

Using a recursive CTE to build the runs of consecutive statuses:
with recursive cte(s_date, date, item, status, s_count, result) as (
   select e.date, e.date, e.item, e.status, 1, '[]'::jsonb
   from entries e
   left join entries e1 on e1.item = e.item and e.date - interval '1 day' = e1.date
   where e1.date is null
   union all
   select c.s_date, e.date, c.item, e.status,
      case when e.status = c.status then c.s_count + 1 else 1 end,
      case when e.status = c.status then c.result
           else c.result || jsonb_build_object('s', c.status, 'c', c.s_count) end
   from cte c
   join entries e on e.item = c.item and c.date + interval '1 day' = e.date
)
select date(t1.md), t1.item, e.status,
   (select max(((v -> 'c')#>>'{}')::int)
    from jsonb_array_elements(r::jsonb) v
    where (v -> 's')#>>'{}' = e.status)
from (select t.s_date, t.item, max(t.date) md, max(t.result::text) r
      from (select c.s_date, c.date, c.item,
                   c.result || jsonb_build_object('s', c.status, 'c', c.s_count) result
            from cte c) t
      group by t.s_date, t.item) t1
join entries e on e.item = t1.item and date(e.date) = date(t1.md)
See fiddle.

According to your comment, you want the maximum count of consecutive statuses where the status equals the last status value, which makes this a gaps-and-islands problem. It can be solved using the difference between two row_number() values together with the last_value() function, as follows:
with last_status as
(
select *,
last_value(status) over (partition by item order by date_
range between unbounded preceding and unbounded following) current_status,
max(date_) over (partition by item) latest_date,
row_number() over (partition by item order by date_) -
row_number() over (partition by item, status order by date_) grp
from table_name
),
consecutive_status_counts as
(
select latest_date, item, current_status, status, count(*) cnt
from last_status
where current_status = status
group by latest_date, item, current_status, status, grp
)
select latest_date,
item,
current_status,
max(cnt) number_of_days_on_current
from consecutive_status_counts
group by latest_date, item, current_status
order by item
See demo
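As a hedged variation on the same idea, the sketch below counts only the most recent unbroken run per item, which is what the question's expected output shows. It walks each item's rows from newest to oldest and keeps the rows seen before the first differing status. It runs against SQLite through Python's sqlite3 module rather than PostgreSQL 13.7, so treat the dialect as an assumption; the table name entries is borrowed from the first answer, and the window functions used are also available in PostgreSQL.

```python
import sqlite3

# Sample rows copied from the question: (date, item, status).
ROWS = [
    ("2023-01-01", "A", "on"), ("2023-01-01", "B", "on"), ("2023-01-01", "C", "off"),
    ("2023-01-02", "A", "on"), ("2023-01-02", "B", "off"), ("2023-01-02", "C", "off"),
    ("2023-01-02", "D", "on"),
    ("2023-01-03", "A", "on"), ("2023-01-03", "B", "off"), ("2023-01-03", "C", "off"),
    ("2023-01-03", "D", "off"),
]

QUERY = """
WITH marked AS (
    SELECT item, date, status,
           -- status on the item's newest row
           FIRST_VALUE(status) OVER (PARTITION BY item ORDER BY date DESC) AS current_status,
           MAX(date) OVER (PARTITION BY item) AS latest_date
    FROM entries
),
flagged AS (
    SELECT item, latest_date, current_status,
           -- walking newest to oldest: how many differing rows were seen so far
           SUM(CASE WHEN status <> current_status THEN 1 ELSE 0 END)
               OVER (PARTITION BY item ORDER BY date DESC) AS changes_seen
    FROM marked
)
SELECT latest_date, item, current_status, COUNT(*) AS number_of_days_on_current
FROM flagged
WHERE changes_seen = 0          -- rows belonging to the trailing run
GROUP BY latest_date, item, current_status
ORDER BY item
"""


def current_streaks(rows):
    """Load rows into an in-memory table and return one tuple per item."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE entries (date TEXT, item TEXT, status TEXT)")
        conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in current_streaks(ROWS):
        print(row)
```

On the sample data this gives the same numbers as the longest-run query above; the two would differ for an item that had an earlier, longer run of its current status.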

Related

How to find the time and step between status change

I'm trying to query a dataset of user status changes, and I want to find the time it takes for the status to change and the number of steps (rows) in between.
Example data:
user_id  Status  date
1        a       2001-01-01
1        a       2001-01-08
1        b       2001-01-15
1        b       2001-01-28
1        a       2001-01-31
1        b       2001-02-01
2        a       2001-01-08
2        a       2001-01-18
2        a       2001-01-28
3        b       2001-03-08
3        b       2001-03-18
3        b       2001-03-19
3        a       2001-03-20
Desired output:
user_id  From  to  days in between  Steps in between
1        a     b   14               2
1        b     a   16               2
1        a     b   1                1
3        b     a   12               3

You might consider the approach below.
WITH partitions AS (
SELECT *, COUNTIF(flag) OVER w AS part FROM (
SELECT *, ROW_NUMBER() OVER w AS rn, status <> LAG(status) OVER w AS flag,
FROM sample_data
WINDOW w AS (PARTITION BY user_id ORDER BY date)
) WINDOW w AS (PARTITION BY user_id ORDER BY date)
)
SELECT user_id,
LAG(ANY_VALUE(status)) OVER w AS `from`,
ANY_VALUE(status) AS `to`,
EXTRACT(DAY FROM MIN(date) - LAG(MIN(date)) OVER w) AS days_in_between,
MIN(rn) - LAG(MIN(rn)) OVER w AS steps_in_between
FROM partitions
GROUP BY user_id, part
QUALIFY `from` IS NOT NULL
WINDOW w AS (PARTITION BY user_id ORDER BY MIN(date));

with main as (
  select
    *,
    dense_rank() over(partition by user_id order by date) as rank_,
    row_number() over(partition by user_id, status order by date) as rank_2,
    -- constant within each run of consecutive identical statuses
    row_number() over(partition by user_id, status order by date) - dense_rank() over(partition by user_id order by date) as diff,
    row_number() over(partition by user_id order by date) as row_num,
    lag(status) over(partition by user_id order by date) as prev_status,
    concat(lag(status) over(partition by user_id order by date), ' to ', status) as status_change
  from table
),
new_rank as (
  select
    *,
    row_num - diff as row_num_diff,
    -- first date of the run that the row belongs to
    min(date) over(partition by user_id, status, diff) as min_date
  from main
),
prev_date as (
  select
    *,
    lag(min_date) over(partition by user_id order by date) as prev_min_date
  from new_rank
)
select
  user_id,
  prev_status as `from`,
  status as `to`,
  date_diff(min_date, prev_min_date, DAY) as days_in_between
from prev_date
where status != prev_status and prev_status is not null
Does this seem to work? It's very hard to solve without a fiddle. Also:
You may remove the extra steps/ranks that I have added; I left them in so you can see what they are doing.
I don't understand your steps logic, so it is missing from the code.
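For reference, here is a minimal sketch that produces both the days and the steps from the question's sample data. It keeps only the first row of each run of identical statuses and then compares each run start with the previous one. It runs on SQLite via Python's sqlite3 module, not on BigQuery, so julianday() stands in for DATE_DIFF; the table name sample_data comes from the first answer, and the output column names are assumptions.

```python
import sqlite3

# Sample rows copied from the question: (user_id, status, date).
SAMPLE = [
    (1, "a", "2001-01-01"), (1, "a", "2001-01-08"), (1, "b", "2001-01-15"),
    (1, "b", "2001-01-28"), (1, "a", "2001-01-31"), (1, "b", "2001-02-01"),
    (2, "a", "2001-01-08"), (2, "a", "2001-01-18"), (2, "a", "2001-01-28"),
    (3, "b", "2001-03-08"), (3, "b", "2001-03-18"), (3, "b", "2001-03-19"),
    (3, "a", "2001-03-20"),
]

QUERY = """
WITH flagged AS (
    SELECT user_id, status, date,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date) AS rn,
           LAG(status)  OVER (PARTITION BY user_id ORDER BY date) AS prev_status
    FROM sample_data
),
run_starts AS (
    -- keep only the first row of every run of identical statuses
    SELECT user_id, status, date, rn
    FROM flagged
    WHERE prev_status IS NULL OR prev_status <> status
),
paired AS (
    -- compare each run start with the previous run start of the same user
    SELECT user_id, rn,
           LAG(status) OVER (PARTITION BY user_id ORDER BY rn) AS from_status,
           status AS to_status,
           CAST(julianday(date)
                - julianday(LAG(date) OVER (PARTITION BY user_id ORDER BY rn))
                AS INTEGER) AS days_in_between,
           rn - LAG(rn) OVER (PARTITION BY user_id ORDER BY rn) AS steps_in_between
    FROM run_starts
)
SELECT user_id, from_status, to_status, days_in_between, steps_in_between
FROM paired
WHERE from_status IS NOT NULL
ORDER BY user_id, rn
"""


def status_changes(rows):
    """Return (user_id, from, to, days, steps) for every status change."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sample_data (user_id INTEGER, status TEXT, date TEXT)")
        conn.executemany("INSERT INTO sample_data VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in status_changes(SAMPLE):
        print(row)
```

User 2 never changes status, so it produces no rows, which agrees with the desired output in the question.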

How to get difference in value over a sliding time window?

I'm attempting to write a SQL query which returns every product where the most recent price on an order within the last 30 days is different than the most recent price in the previous 30 days, and that calculated variance. I'm currently using PostgreSQL 11.
Data Model
Right now, the data is structured into three tables: orders, products, and a pivot table, order_product. Here is the simplified version of the table structure:
Orders
id  order_date
1   2022-01-15
2   2022-02-15
3   2022-03-08
Products
id  name
1   Some product
2   Another product
3   Yet another product
Order_Product
order_id  product_id  unit_price
1         1           10
1         2           20
1         3           10
2         1           12
2         2           20
2         3           5
3         1           15
Desired Output
The desired output would be something like the following:
id  name                 order_date  latest_unit_price  previous_unit_price  variance
1   Some product         2022-03-08  15                 10                   5
3   Yet another product  2022-02-15  5                  10                   -5
What I've done so far
I've been able to write a join that combines the Orders and Products via the order_product table, within the 60-day window, which is seemingly the easy part:
SELECT
"products"."id",
"products"."name",
"order_product"."unit_price",
"orders"."order_date"
FROM
products
JOIN order_product ON products.id = order_product.product_id
JOIN orders ON order_product.order_id = orders.id
WHERE
order_date BETWEEN now() - INTERVAL '60 days'
AND now()
I've been trying to work with RANK() and LAG(); however, where I'm getting stuck is ranking the rows within each 30-day window and then calculating the variance between the two windows.
Any help would be much appreciated!
Update: Added solution
Building off of the answer by D-Shih, I had to tweak this to work based on the time window starting from the current date:
WITH CTE AS (
SELECT
"products"."id",
"products"."name",
"order_product"."unit_price",
"orders"."order_date"
FROM
products
JOIN order_product ON products.id = order_product.product_id
JOIN orders ON order_product.order_id = orders.id
WHERE
order_date BETWEEN now() - INTERVAL '60 days' AND now()
),
CTE2 AS (
SELECT
*,
EXTRACT(DAYS FROM now() - order_date :: timestamp) gap_days
FROM
CTE
),
CTE3 AS (
SELECT
*,
(CASE WHEN gap_days < 30 THEN 1 ELSE 0 END) grp
FROM
CTE2
)
SELECT
id,
name,
MAX(CASE WHEN grp = 1 THEN order_date END) order_date,
MAX(CASE WHEN grp = 1 THEN unit_price END) latest_unit_price,
MAX(CASE WHEN grp = 0 THEN unit_price END) previous_unit_price,
SUM(CASE WHEN grp = 1 THEN unit_price ELSE - unit_price END) variance
FROM
(
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY ID, grp ORDER BY order_date DESC) rn
FROM
CTE3
) t1
WHERE
rn = 1
GROUP BY
id,
name
HAVING
MAX(CASE WHEN grp = 1 THEN unit_price END) <> MAX(CASE WHEN grp = 0 THEN unit_price END)

You can use EXTRACT with the LAG window function to get the difference in days between each order_date and the previous order_date for each product id.
Then use a cumulative SUM window function inside a CASE expression to assign the group:
grp = 1 for rows more than 30 days after the product's first order in the window (the most recent period)
grp = 0 for the earlier rows (the previous period)
The query would look like this:
WITH CTE AS (
SELECT "products"."id",
"products"."name",
"order_product"."unit_price",
"orders"."order_date"
FROM
products
JOIN order_product ON products.id = order_product.product_id
JOIN orders ON order_product.order_id = orders.id
WHERE
order_date BETWEEN now() - INTERVAL '60 days'
AND now()
), CTE2 AS (
SELECT *,EXTRACT(DAYS FROM order_date - LAG(order_date,1,order_date) OVER(PARTITION BY id ORDER BY order_date)) gap_days
FROM CTE
), CTE3 AS (
SELECT *,(CASE WHEN SUM(gap_days) OVER(PARTITION BY id ORDER BY order_date) > 30 THEN 1 ELSE 0 END) grp
FROM CTE2
)
SELECT id,
name,
MAX(CASE WHEN grp = 1 THEN order_date END) order_date,
MAX(CASE WHEN grp = 1 THEN unit_price END) latest_unit_price,
MAX(CASE WHEN grp = 0 THEN unit_price END) previous_unit_price,
SUM(CASE WHEN grp = 1 THEN unit_price ELSE - unit_price END) variance
FROM (
SELECT *,ROW_NUMBER() OVER(PARTITION BY ID,grp ORDER BY order_date DESC) rn
FROM CTE3
) t1
WHERE rn = 1
GROUP BY id,
name
HAVING MAX(CASE WHEN grp = 1 THEN unit_price END) <> MAX(CASE WHEN grp = 0 THEN unit_price END)
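The two-window comparison can also be sketched with windows anchored on a reference date, as in the asker's update. The example below is a hedged illustration, not the accepted query: it runs on SQLite through Python's sqlite3 module instead of PostgreSQL 11, passes the reference date as a parameter named today (an assumption, fixed at 2022-03-10 so the sample data falls inside the 60-day range), and joins the newest row of each window instead of pivoting with MAX(CASE ...).

```python
import sqlite3

# Sample data copied from the question.
ORDERS = [(1, "2022-01-15"), (2, "2022-02-15"), (3, "2022-03-08")]
PRODUCTS = [(1, "Some product"), (2, "Another product"), (3, "Yet another product")]
ORDER_PRODUCT = [(1, 1, 10), (1, 2, 20), (1, 3, 10),
                 (2, 1, 12), (2, 2, 20), (2, 3, 5),
                 (3, 1, 15)]

QUERY = """
WITH priced AS (
    SELECT p.id, p.name, op.unit_price, o.order_date,
           -- 1 = inside the last 30 days, 0 = inside the 30 days before that
           CASE WHEN o.order_date > date(:today, '-30 days') THEN 1 ELSE 0 END AS is_recent
    FROM products p
    JOIN order_product op ON op.product_id = p.id
    JOIN orders o ON o.id = op.order_id
    WHERE o.order_date >  date(:today, '-60 days')
      AND o.order_date <= :today
),
ranked AS (
    -- rn = 1 is the newest order of a product inside each window
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id, is_recent ORDER BY order_date DESC) AS rn
    FROM priced
)
SELECT cur.id, cur.name, cur.order_date,
       cur.unit_price  AS latest_unit_price,
       prev.unit_price AS previous_unit_price,
       cur.unit_price - prev.unit_price AS variance
FROM ranked cur
JOIN ranked prev ON prev.id = cur.id AND prev.is_recent = 0 AND prev.rn = 1
WHERE cur.is_recent = 1 AND cur.rn = 1
  AND cur.unit_price <> prev.unit_price
ORDER BY cur.id
"""


def price_changes(today):
    """Return products whose latest price differs between the two windows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (id INTEGER, order_date TEXT)")
        conn.execute("CREATE TABLE products (id INTEGER, name TEXT)")
        conn.execute("CREATE TABLE order_product "
                     "(order_id INTEGER, product_id INTEGER, unit_price INTEGER)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", ORDERS)
        conn.executemany("INSERT INTO products VALUES (?, ?)", PRODUCTS)
        conn.executemany("INSERT INTO order_product VALUES (?, ?, ?)", ORDER_PRODUCT)
        return conn.execute(QUERY, {"today": today}).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in price_changes("2022-03-10"):
        print(row)
```

Products that have an order in only one of the two windows drop out because of the inner join, which matches the intent of comparing the two periods.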

Select data where sum for last 7 from max-date is greater than x

I have a data set as such:
Date Value Type
2020-06-01 103 B
2020-06-01 100 A
2020-06-01 133 A
2020-06-11 150 A
2020-07-01 1000 A
2020-07-21 104 A
2020-07-25 140 A
2020-07-28 1600 A
2020-08-01 100 A
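I want to flag each type whose summed value over the last 7 days, counted back from the max date, exceeds a threshold, like this: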
Type ISHIGH
A 1
B 0
Here's the query I tried:
select type, case when sum(value) > 10 then 1 else 0 end as total_usage
from table_a
where (select sum(value) as usage from tableA where date = max(date)-7)
group by type, date
This is clearly not right. What is a simple way to do this?

It is a simple GROUP BY, except that you need access to the max date before grouping:
select type
, max(date) as last_usage_date
, sum(value) as total_usage
, case when sum(case when date >= cutoff_date then value end) >= 1000 then 'y' end as [is high!]
from t
cross apply (
select dateadd(day, -6, max(date))
from t as x
where x.type = t.type
) as ca(cutoff_date)
group by type, cutoff_date
If you want just those two columns then a simpler approach is:
select t.type, case when sum(value) >= 1000 then 'y' end as [is high!]
from t
left join (
select type, dateadd(day, -6, max(date)) as cutoff_date
from t
group by type
) as a on t.type = a.type and t.date >= a.cutoff_date
group by t.type

Find the max date by type, then use it to find the last 7 days and sum() the value.
with
cte as
(
select [type], max([Date]) as MaxDate
from tableA
group by [type]
)
select c.[type], sum(a.Value),
case when SUM(a.Value) > 1000 then 1 else 0 end as ISHIGH
from cte c
inner join tableA a on a.[type] = c.[type]
and a.[Date] >= DATEADD(DAY, -7, c.MaxDate)
group by c.[type]

This can be done through a cumulative total as follows:
;With CTE As (
Select [type], [date],
SUM([value]) Over (Partition by [type] Order by [date] Desc) As Total,
Row_Number() Over (Partition by [type] Order by [date] Desc) As Row_Num
From Tbl)
Select Distinct CTE.[type], Case When C.[type] Is Not Null Then 1 Else 0 End As ISHIGH
From CTE Left Join CTE As C On (CTE.[type]=C.[type]
And DateDiff(dd,CTE.[date],C.[date])<=7
And C.Total>1000)
Where CTE.Row_Num=1

I think you are quite close with your initial attempt to solve this. Just a tiny edit:
select type, case when sum(value) > 1000 then 1 else 0 end as total_usage
from tableA
where date > (select max(date)-7 from tableA)
group by type
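One detail worth checking is whether the cutoff is taken per type or over the whole table: with a table-wide max date, type B has no rows in range and disappears from the result instead of showing 0. The sketch below is a hedged per-type version run on SQLite through Python's sqlite3 module (the answers above target SQL Server). The threshold of 1000 is taken from the answers, the 7-day window is assumed to include the max date itself (hence -6 days), and the table name table_a follows the question.

```python
import sqlite3

# Sample rows copied from the question: (date, value, type).
ROWS = [
    ("2020-06-01", 103, "B"), ("2020-06-01", 100, "A"), ("2020-06-01", 133, "A"),
    ("2020-06-11", 150, "A"), ("2020-07-01", 1000, "A"), ("2020-07-21", 104, "A"),
    ("2020-07-25", 140, "A"), ("2020-07-28", 1600, "A"), ("2020-08-01", 100, "A"),
]

QUERY = """
WITH cutoff AS (
    -- first day of the 7-day window, computed separately for each type
    SELECT type, date(MAX(date), '-6 days') AS cutoff_date
    FROM table_a
    GROUP BY type
)
SELECT a.type,
       CASE WHEN SUM(a.value) > :threshold THEN 1 ELSE 0 END AS ishigh
FROM table_a a
JOIN cutoff c ON c.type = a.type AND a.date >= c.cutoff_date
GROUP BY a.type
ORDER BY a.type
"""


def high_flags(rows, threshold=1000):
    """Return (type, ishigh) using each type's own last 7 days."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE table_a (date TEXT, value INTEGER, type TEXT)")
        conn.executemany("INSERT INTO table_a VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY, {"threshold": threshold}).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(high_flags(ROWS))
```

Because every type always has at least its own max-date row inside the window, each type appears in the result, which is how type B ends up with 0 rather than being omitted.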

How to cross join but using latest value in BIGQUERY

I have this table below
date        id  value
2021-01-01  1   3
2021-01-04  1   5
2021-01-05  1   10
I expect output like this, where the date column increases daily and the value column carries forward the last value for each id:
date        id  value
2021-01-01  1   3
2021-01-02  1   3
2021-01-03  1   3
2021-01-04  1   5
2021-01-05  1   10
2021-01-06  1   10
I think I could use a cross join, but I can't get my expected output, and I suspect there is special syntax or logic to solve this.

Consider the approach below:
select * from `project.dataset.table`
union all
select missing_date, prev_row.id, prev_row.value
from (
select *, lag(t) over(partition by id order by date) prev_row
from `project.dataset.table` t
), unnest(generate_date_array(prev_row.date + 1, date - 1)) missing_date

I would write this using:
select dte, t.id, t.value
from (select t.*,
lead(date, 1, date '2021-01-06') over (partition by id order by date) as next_date
from `table` t
) t cross join
unnest(generate_date_array(
date,
ifnull(
date_add(next_date, interval -1 day), -- generate missing date rows
(select max(date) from `table`) -- add last row
)
)) dte;
Note that this requires neither union all nor window function to fill in the values.

An alternative solution uses last_value. You may explore the following query and customize the logic that generates the days (if needed):
WITH
query AS (
SELECT
date,
id,
value
FROM
`mydataset.newtable`
ORDER BY
date ),
generated_days AS (
SELECT
day
FROM (
SELECT
MIN(date) min_dt,
MAX(date) max_dt
FROM
query),
UNNEST(GENERATE_DATE_ARRAY(min_dt, max_dt)) day )
SELECT
g.day,
LAST_VALUE(q.id IGNORE NULLS) OVER(ORDER BY g.day) id,
LAST_VALUE(q.value IGNORE NULLS) OVER(ORDER BY g.day) value,
FROM
generated_days g
LEFT OUTER JOIN
query q
ON
g.day = q.date
ORDER BY
g.day
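The same calendar-plus-last-value idea can be tried outside BigQuery. The sketch below is a hedged illustration on SQLite via Python's sqlite3 module: a recursive CTE plays the role of GENERATE_DATE_ARRAY, and a correlated subquery picks the latest reading on or before each day. The table name readings and the end_date parameter are assumptions; the end date is set to 2021-01-06 because the expected output runs one day past the last stored row.

```python
import sqlite3

# Sample rows copied from the question: (date, id, value).
ROWS = [("2021-01-01", 1, 3), ("2021-01-04", 1, 5), ("2021-01-05", 1, 10)]

QUERY = """
WITH RECURSIVE days(day) AS (
    -- calendar from the first stored date up to :end_date, one row per day
    SELECT MIN(date) FROM readings
    UNION ALL
    SELECT date(day, '+1 day') FROM days WHERE day < :end_date
)
SELECT d.day, i.id,
       -- latest value recorded for this id on or before the calendar day
       (SELECT r.value
        FROM readings r
        WHERE r.id = i.id AND r.date <= d.day
        ORDER BY r.date DESC
        LIMIT 1) AS value
FROM days d
CROSS JOIN (SELECT DISTINCT id FROM readings) i
ORDER BY i.id, d.day
"""


def fill_forward(rows, end_date):
    """Return one (day, id, value) row per day, carrying the last value forward."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE readings (date TEXT, id INTEGER, value INTEGER)")
        conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY, {"end_date": end_date}).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in fill_forward(ROWS, "2021-01-06"):
        print(row)
```

With several ids, an id whose first reading is later than the calendar start would get NULL values for the earlier days; whether to filter those rows out is a design choice the question does not settle.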

Additional condition within partition over

https://www.db-fiddle.com/f/rgLXTu3VysD3kRwBAQK3a4/3
My problem is that I want the ROW_NUMBER() OVER (PARTITION BY ...) window to start counting rows only from a certain time range.
In this example, if I add rn = 1 at the end, order_id = 5 is excluded from the results (because the partition is ordered by paid_date and order_id = 6 has an earlier date), but it shouldn't be, as I want the time range for the partition to start from '2019-01-10'.
With the condition rn = 1, the expected output is order_id 3, 5, 11, 15; right now it is only 3, 11, 15.
It should include only orders with is_paid = 0 that are the first within the given time range (if there is a preceding order with is_paid = 1, it shouldn't be counted).

Use a correlated subquery with NOT EXISTS:
SELECT order_id, customer_id, amount, is_paid, paid_date, rn FROM (
SELECT o.*,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY paid_date,order_id) rn
FROM orders o
WHERE paid_date between '2019-01-10'
and '2019-01-15'
) x where rn=1 and not exists (select 1 from orders o1 where x.order_id=o1.order_id
and is_paid=1)
OUTPUT:
order_id customer_id amount is_paid paid_date rn
3 101 30 0 10/01/2019 00:00:00 1
5 102 15 0 10/01/2019 00:00:00 1
11 104 31 0 10/01/2019 00:00:00 1
15 105 11 0 10/01/2019 00:00:00 1

If priority should be given to order_id, put it before paid_date in the ORDER BY clause of the window function; this will solve your issue.
SELECT order_id, customer_id, amount, is_paid, paid_date, rn FROM (
SELECT o.*,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY order_id,paid_date) rn
FROM orders o
) x WHERE is_paid = 0 and paid_date between
'2019-01-10' and '2019-01-15' and rn=1
Since you need paid_date to be ordered first, you need to apply the WHERE condition inside the subquery that is partitioned, so that dates outside the range do not interfere with the row numbering.
SELECT order_id, customer_id, amount, is_paid, paid_date, rn FROM (
SELECT o.*,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY paid_date, order_id) rn
FROM orders o
where paid_date between '2019-01-10' and '2019-01-15'
) x WHERE is_paid = 0 and rn=1
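To see why the position of the date filter matters, the sketch below runs both variants side by side on SQLite through Python's sqlite3 module. The rows are made up for illustration, since the original fiddle data is not reproduced here, and the parameter names range_start and range_end are assumptions. Customer 102 has an order before the range, and customer 103's first order in the range is already paid.

```python
import sqlite3

# Hypothetical rows: (order_id, customer_id, amount, is_paid, paid_date).
ORDERS = [
    (3, 101, 30, 0, "2019-01-10"),
    (4, 101, 20, 1, "2019-01-12"),
    (6, 102, 25, 1, "2019-01-05"),   # earlier than the requested range
    (5, 102, 15, 0, "2019-01-10"),
    (9, 103, 40, 1, "2019-01-10"),   # first order in range is already paid
    (10, 103, 12, 0, "2019-01-11"),
    (11, 104, 31, 0, "2019-01-10"),
]

# Rows are numbered over ALL orders; the date range is applied afterwards.
FILTER_AFTER = """
SELECT order_id FROM (
    SELECT o.*,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY paid_date, order_id) AS rn
    FROM orders o
) x
WHERE is_paid = 0 AND rn = 1
  AND paid_date BETWEEN :range_start AND :range_end
ORDER BY order_id
"""

# The date range is applied first, so numbering starts inside the range.
FILTER_BEFORE = """
SELECT order_id FROM (
    SELECT o.*,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY paid_date, order_id) AS rn
    FROM orders o
    WHERE paid_date BETWEEN :range_start AND :range_end
) x
WHERE is_paid = 0 AND rn = 1
ORDER BY order_id
"""


def first_unpaid_ids(query, start="2019-01-10", end="2019-01-15"):
    """Run one of the queries above and return the matching order ids."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, "
                     "amount INTEGER, is_paid INTEGER, paid_date TEXT)")
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", ORDERS)
        params = {"range_start": start, "range_end": end}
        return [row[0] for row in conn.execute(query, params)]
    finally:
        conn.close()


if __name__ == "__main__":
    print(first_unpaid_ids(FILTER_AFTER))
    print(first_unpaid_ids(FILTER_BEFORE))
```

In the first query, customer 102's out-of-range order takes rn = 1 and hides order 5; filtering inside the subquery restores it, while customer 103 stays excluded in both because its first in-range order is paid.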
