How to find the max time difference in PostgreSQL that a value stayed in the same state?

I am working on a university project and ran into the following problem:
I have a table like this:
I want to get the max duration that an actuator stayed in one state. For example, cool0 was in state false for 18 minutes.
The result table should look like this:
NAME    STATE   DURATION
cool0   false   18
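
Since the input table itself was posted as an image, here is a minimal sketch of what its structure would presumably look like, using the column names from the answer below (the types are guesses):
create table t (
    actuator      text,        -- actuator name, e.g. 'cool0'
    state         boolean,     -- the state the actuator was in
    actuator_time timestamp    -- when the reading was taken
);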

This is a gaps-and-islands problem. Your data is a bit hard to follow, but I think:
select actuator, state, min(actuator_time), max(actuator_time)
from (select t.*,
             row_number() over (partition by actuator order by actuator_time) as seqnum,
             row_number() over (partition by actuator, state order by actuator_time) as seqnum_s
      from t
     ) t
group by actuator, state, (seqnum - seqnum_s)
For the maximum per actuator, use distinct on:
select distinct on (actuator) actuator, state, min(actuator_time), max(actuator_time)
from (select t.*,
             row_number() over (partition by actuator order by actuator_time) as seqnum,
             row_number() over (partition by actuator, state order by actuator_time) as seqnum_s
      from t
     ) t
group by actuator, state, (seqnum - seqnum_s)
order by actuator, max(actuator_time) - min(actuator_time) desc;
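Since the expected output reports the duration in minutes, here is a sketch of the same query with the island length computed explicitly (assuming actuator_time is a timestamp):
select distinct on (actuator)
       actuator, state,
       round(extract(epoch from max(actuator_time) - min(actuator_time)) / 60) as duration
from (select t.*,
             row_number() over (partition by actuator order by actuator_time) as seqnum,
             row_number() over (partition by actuator, state order by actuator_time) as seqnum_s
      from t
     ) t
group by actuator, state, (seqnum - seqnum_s)
order by actuator, max(actuator_time) - min(actuator_time) desc;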

Related

Lag functions and SUM

I need to get the list of users that have been offline for at least 20 minutes every day. Here's my data:
I have this starting query, but I am stuck on how to sum the difference in offline_mins, i.e. I need to add something like "and sum(offline_mins) >= 20" to the where clause:
SELECT
    userid,
    connected,
    LAG(recordeddt) OVER (PARTITION BY userid ORDER BY recordeddt) AS offline_period,
    DATEDIFF(minute,
             LAG(recordeddt) OVER (PARTITION BY userid ORDER BY recordeddt),
             recordeddt) AS offline_mins
FROM device_data
WHERE connected = 0;
My expected results:
Thanks in advance.
This reads like a gaps-and-islands problem, where you want to group together adjacent rows having the same userid and connected status.
As a starter, here is a query that computes the islands:
select userid, connected, min(recordeddt) startdt, max(lead_recordeddt) enddt,
       datediff(minute, min(recordeddt), max(lead_recordeddt)) duration
from (
    select dd.*,
           row_number() over(partition by userid order by recordeddt) rn1,
           row_number() over(partition by userid, connected order by recordeddt) rn2,
           lead(recordeddt) over(partition by userid order by recordeddt) lead_recordeddt
    from device_data dd
) dd
group by userid, connected, rn1 - rn2
Now, say you want users that were offline for at least 20 minutes every day. You can break down the islands per day and use a having clause for filtering:
select userid
from (
    select recordedday, userid, connected,
           datediff(minute, min(recordeddt), max(lead_recordeddt)) duration
    from (
        select dd.*, v.*,
               row_number() over(partition by v.recordedday, userid order by recordeddt) rn1,
               row_number() over(partition by v.recordedday, userid, connected order by recordeddt) rn2,
               lead(recordeddt) over(partition by v.recordedday, userid order by recordeddt) lead_recordeddt
        from device_data dd
        cross apply (values (convert(date, recordeddt))) v(recordedday)
    ) dd
    group by recordedday, userid, connected, rn1 - rn2
) dd
group by userid
having count(distinct case when connected = 0 and duration >= 20 then recordedday end) = count(distinct recordedday)
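The having clause works by counting, per user, the distinct days that contain at least one offline island of 20+ minutes and comparing that against the total number of distinct recorded days; only users for whom the two counts match, i.e. who qualify on every day, are kept.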
As noted, this is a gaps-and-islands problem. This is my take on it: use a simple LAG function to create groups, filter out the connected rows, and then work on the date ranges.
CREATE TABLE #tmp(ID int, UserID int, dt datetime, connected int)
INSERT INTO #tmp VALUES
(1,1,'11/2/20 10:00:00',1),
(2,1,'11/2/20 10:05:00',0),
(3,1,'11/2/20 10:10:00',0),
(4,1,'11/2/20 10:15:00',0),
(5,1,'11/2/20 10:20:00',0),
(6,2,'11/2/20 10:00:00',1),
(7,2,'11/2/20 10:05:00',1),
(8,2,'11/2/20 10:10:00',0),
(9,2,'11/2/20 10:15:00',0),
(10,2,'11/2/20 10:20:00',0),
(11,2,'11/2/20 10:25:00',0),
(12,2,'11/2/20 10:30:00',0)
SELECT UserID, connected, DATEDIFF(minute, MIN(dt), MAX(dt)) AS OFFLINE_MINUTES
FROM
(
    SELECT *, SUM(CASE WHEN connected <> LG THEN 1 ELSE 0 END) OVER (ORDER BY UserID, dt) AS grp
    FROM
    (
        SELECT *, LAG(connected, 1, connected) OVER (PARTITION BY UserID ORDER BY dt) AS LG
        FROM #tmp
    ) x
) y
WHERE connected <> 1
GROUP BY UserID, grp, connected
HAVING DATEDIFF(minute, MIN(dt), MAX(dt)) >= 20
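With the sample data above, user 1's offline island runs from 10:05 to 10:20 (15 minutes, filtered out by the HAVING clause) and user 2's from 10:10 to 10:30, so the query returns:
UserID  connected  OFFLINE_MINUTES
2       0          20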

SQL: how can I get the first and last date when they are in two columns on different rows (islands problem)?

I think this problem is called "islands", and I've been looking on the net but not getting it.
I have a table where I need to get the start date and end date (different columns) of each range.
The table has 100,000 rows, and I want to group it down so the result will be:
I have created a fiddle: http://sqlfiddle.com/#!18/f4800/1
From what I found on the internet, I think I need to create row numbers, so I have this now:
But I'm stuck thinking over what my next step should be.
You need row_number() instead of dense_rank(), and use the difference of sequences:
select [CodeID], min([DATE_START]) as DATE_START,
       max([DATE_FINISH]) as DATE_FINISH, [STATE]
from (select [CodeID], [DATE_START], [DATE_FINISH], [STATE],
             row_number() over(partition by [CodeID] order by [DATE_START]) as seq1,
             row_number() over(partition by [CodeID], [STATE] order by [DATE_START]) as seq2
      from Row_State
      --where codeid = 'code1'
     ) t
group by [CodeID], [STATE], (seq1 - seq2)
order by [CodeID], DATE_START;
Here is db fiddle.
If you know that the final result will be tiled in time with no gaps, then you can also use lag() and lead() like this:
select code_id, state, date_start,
       dateadd(day, -1, lead(date_start) over (partition by code_id order by date_start)) as day_end
from (select rs.*,
             lag(state) over (partition by code_id order by date_start) as prev_state
      from row_state rs
     ) rs
where prev_state is null or prev_state <> state;
The only issue with this version is that it does not correctly calculate the end date of each code's last island. For that:
select code_id, state, date_start,
       coalesce(dateadd(day, -1, lead(date_start) over (partition by code_id order by date_start)),
                max_date_end) as day_end
from (select rs.*,
             lag(state) over (partition by code_id order by date_start) as prev_state,
             max(date_finish) over (partition by code_id) as max_date_end
      from row_state rs
     ) rs
where prev_state is null or prev_state <> state;
This could be faster than an approach that uses aggregation.

Calculate percent changes in contiguous ranges in PostgreSQL

I need to calculate the price percent change in contiguous ranges. For example, if the price starts moving up or down and I have a sequence of increasing or decreasing values, I need to grab the first and last value of that sequence and calculate the change.
I'm using the window lag function to calculate the direction; my problem is that I can't generate a unique rank for the sequences to calculate the percent changes.
I tried combinations of RANK, ROW_NUMBER, etc. with no luck.
Here's my query
WITH partitioned AS (
SELECT
*,
lag(price, 1) over(ORDER BY time) AS lag_price
FROM prices
),
sequenced AS (
SELECT
*,
CASE
WHEN price > lag_price THEN 'up'
WHEN price < lag_price THEN 'down'
ELSE 'equal'
END
AS direction
FROM partitioned
),
ranked AS (
SELECT
*,
-- Here is the problem:
-- I need to calculate a unique rnk value for each specific sequence
DENSE_RANK() OVER ( PARTITION BY direction ORDER BY time) + ROW_NUMBER() OVER ( ORDER BY time DESC) AS rnk
-- DENSE_RANK() OVER ( PARTITION BY seq ORDER BY time),
-- ROW_NUMBER() OVER ( ORDER BY seq, time DESC),
-- ROW_NUMBER() OVER ( ORDER BY seq),
-- RANK() OVER ( ORDER BY seq)
FROM sequenced
),
changed AS (
SELECT *,
FIRST_VALUE(price) OVER(PARTITION BY rnk ) first_price,
LAST_VALUE(price) OVER(PARTITION BY rnk ) last_price,
(LAST_VALUE(price) OVER(PARTITION BY rnk ) / FIRST_VALUE(price) OVER(PARTITION BY rnk ) - 1) * 100 AS percent_change
FROM ranked
)
SELECT
*
FROM changed
ORDER BY time DESC;
and a SQLFiddle with sample data.
If anyone is interested, here's a solution from another forum:
with ct1 as /* detecting direction: up, down, equal */
(
select
price, time,
case
when lag(price) over (order by time) < price then 'up'
when lag(price) over (order by time) > price then 'down'
else 'equal'
end as dir
from
prices
)
, ct2 as /* setting reset points */
(
select
price, time, dir,
case
when coalesce(lag(dir) over (order by time), 'none') <> dir
then 1 else 0
end as rst
from
ct1
)
, ct3 as /* making groups */
(
select
price, time, dir,
sum(rst) over (order by time) as grp
from
ct2
)
select /* calculates min, max price per group */
price, time, dir,
min(price) over (partition by grp) as min_price,
max(price) over (partition by grp) as max_price
from
ct3
order by
time desc;
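If you also want the percent change itself (the original goal), the final SELECT above can be extended along these lines; note that once the window has an ORDER BY, last_value needs an explicit frame to see the whole group:
select /* percent change per group */
price, time, dir,
first_value(price) over w as first_price,
last_value(price) over w as last_price,
-- assumes price is numeric; cast first if it is an integer column
(last_value(price) over w / first_value(price) over w - 1) * 100 as percent_change
from
ct3
window w as (partition by grp order by time
             rows between unbounded preceding and unbounded following)
order by
time desc;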

Optimizing sum() over(order by...) clause throwing 'resources exceeded' error

I'm computing a sessions table from event data from our website in BigQuery. The events table has around 12 million events (pretty small). After I add in the logic to create sessions, I want to sum over all sessions and assign a global_session_id. I'm doing that using a sum() over (order by ...) clause, which is throwing a resources exceeded error. I know that the order by clause causes all the data to be processed on a single node, and that is what exceeds the compute resources, but I'm not sure what changes I can make to my code to achieve the same result. Any workarounds, advice, or explanations are greatly appreciated.
with sessions_1 as ( /* Tie a visitor's last event and last campaign to current event. */
select visitor_id as session_user_id,
sent_at,
context_campaign_name,
event,
id,
LAG(sent_at,1) OVER (PARTITION BY visitor_id ORDER BY sent_at) as last_event,
LAG(context_campaign_name,1) OVER (PARTITION BY visitor_id ORDER BY sent_at) as last_event_campaign_name
from tracks_2
),
sessions_2 as ( /* Flag events that begin a new session. */
select *,
case
when context_campaign_name != last_event_campaign_name
or context_campaign_name is null and last_event_campaign_name is not null
or context_campaign_name is not null and last_event_campaign_name is null
then 1
when unix_seconds(sent_at)
- unix_seconds(last_event) >= (60 * 30)
or last_event is null
then 1
else 0
end as is_new_session
from sessions_1
),
sessions_3 as ( /* Assign events sessions numbers for total sessions and total user sessions. */
select id as event_id,
sum(is_new_session) over (order by session_user_id, sent_at) as global_session_id
#sum(is_new_session) over (partition by session_user_id order by sent_at) as user_session_id
from materialized_result_of_sessions_2_query
)
select * from sessions_3
It might help if you defined a CTE with just the sessions, rather than working at the event level. If this works:
select session_user_id, sent_at,
       row_number() over (order by session_user_id, sent_at) as global_session_id
from materialized_result_of_sessions_2_query
where is_new_session = 1
group by session_user_id, sent_at;
You can join this session-level result back to the original event-level data and then use a max() window function to assign the id to all events. Something like:
select e.*,
max(s.global_session_id) over (partition by e.session_user_id order by e.event_at) as global_session_id
from events e left join
(<above query>) s
on s.session_user_id = e.session_user_id and s.sent_at = e.event_at;
If that doesn't work, you can construct the global id explicitly:
select us.*, us.user_session_id + s.session_offset as global_session_id
from (select session_user_id, sent_at,
             row_number() over (partition by session_user_id order by sent_at) as user_session_id
      from materialized_result_of_sessions_2_query
      where is_new_session = 1
     ) us join
     (select session_user_id, count(*) as cnt,
             -- 'offset' is a reserved word in BigQuery, hence the alias
             sum(count(*)) over (order by session_user_id) - count(*) as session_offset
      from materialized_result_of_sessions_2_query
      where is_new_session = 1
      group by session_user_id
     ) s
     on us.session_user_id = s.session_user_id;
This might still fail if almost all users are unique and the sessions are short.
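For intuition on the offset construction: if user A starts 3 sessions and user B starts 2, then A gets offset 0 (global ids 1-3) and B gets offset 3 (global ids 4-5); the running sum of per-user session counts, minus the user's own count, shifts each user's local numbering into a global one.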

Max dates for each sequence within partitions

I would like to see if somebody has an idea how to get the max and min dates within each id, using the row_num column as an indicator of where a sequence starts and ends, in SQL Server 2016.
The screenshot below shows the desired output in the min_date and max_date columns.
Any help would be appreciated.
You could use windowed MIN/MAX:
WITH cte AS (
SELECT *,SUM(CASE WHEN row_num > 1 THEN 0 ELSE 1 END)
OVER(PARTITION BY id, cat ORDER BY date_col) AS grp
FROM tab
)
SELECT *, MIN(date_col) OVER(PARTITION BY id, cat, grp) AS min_date,
MAX(date_col) OVER(PARTITION BY id, cat, grp) AS max_date
FROM cte
ORDER BY id, date_col, cat;
Rextester Demo
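The running SUM(CASE ...) works because row_num resets to 1 exactly where a new sequence starts; adding 1 at those rows and 0 elsewhere turns the running total into a stable group id (grp) shared by every row of the sequence.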
Try something like:
SELECT
    Q1.id, Q1.cat,
    MIN(Q1.[date]) AS min_date,
    MAX(Q1.[date]) AS max_date
FROM
    (SELECT
         *,
         ROW_NUMBER() OVER (PARTITION BY id, cat ORDER BY [date]) AS r1,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY [date]) AS r2
     FROM tab   -- table name as in the answer above
    ) AS Q1
GROUP BY
    Q1.id, Q1.cat, Q1.r2 - Q1.r1
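The r2 - r1 trick works because r2 advances on every row of an id while r1 advances only within the same (id, cat) sequence; while a sequence is uninterrupted the two counters move in lockstep, so their difference stays constant, and it changes as soon as another cat breaks the run.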