Introduction
I have IoT devices that are constantly sending data to the server. The data consists of these fields:
state;
state_valid;
progress;
timestamp;
There is no guarantee that data will be received in correct time order; a device may send data captured in the past, which rules out analyzing and enriching the data at ingestion time.
Received data is stored in a BigQuery table; each device has a separate table. The table structure looks like this (a matching DDL sketch follows the list):
state: INTEGER, REQUIRED
state_valid: BOOLEAN, NULLABLE
progress: INTEGER, REQUIRED
timestamp: TIMESTAMP, REQUIRED
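For reference, a minimal DDL sketch matching that schema; the dataset and table names are hypothetical, and each per-device table would follow the same shape:

-- Hypothetical BigQuery DDL matching the schema above
CREATE TABLE device_data.device_123 (
  state INT64 NOT NULL,
  state_valid BOOL,
  progress INT64 NOT NULL,
  timestamp TIMESTAMP NOT NULL
);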
Requirements
After data collection, I need to analyze data adhering to those rules:
Device is in the received state value until a different state is received;
If a record's state_valid is false, its state value should be ignored and 0 used instead;
If a record's state_valid is NULL, the last received state_valid value should be used;
In the analysis phase, data should be viewed in one-minute intervals (see the sketch after this list);
For example, there shouldn't be a final record that starts at 20:51:07; the start timestamp should be 20:51:00.
The state that was active for most of a one-minute interval should be used for the whole minute.
For example, if the device had state 0 from 20:51:01 to 20:51:18 and state 2 from 20:51:18 to 20:52:12, the interval 20:51:00 to 20:51:59 should be marked as state 2.
The resulting data should group all consecutive intervals with the same state value and represent them as one record with start and end timestamps;
The grouped intervals of the same state should report the calculated progress difference (max_progress - min_progress).
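A minimal sketch of that interval rule in BigQuery, assuming a hypothetical per-second expansion by_second(ts, state) of the raw records (the full queries below build exactly such an expansion):

-- Sketch: majority state per one-minute bucket.
-- by_second(ts, state) is a hypothetical per-second expansion of the records.
SELECT
  ts_minute AS interval_start,
  ARRAY_AGG(state ORDER BY seconds DESC LIMIT 1)[OFFSET(0)] AS state
FROM (
  SELECT TIMESTAMP_TRUNC(ts, MINUTE) AS ts_minute, state, COUNT(*) AS seconds
  FROM by_second
  GROUP BY ts_minute, state
)
GROUP BY ts_minute;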
Example
Let's say I receive this data from device:
| state | state_valid | progress | timestamp |
| --- | --- | --- | --- |
| 2 | 1 | 2451 | 20:50:00 |
| 0 | 1 | 2451 | 20:50:20 |
| 2 | 1 | 2451 | 20:52:29 |
| 3 | 1 | 2451 | 20:53:51 |
| 3 | 1 | 2500 | 20:54:20 |
| 2 | 0 | 2500 | 20:55:09 |
Below I provide a visualization of that data on a timeline to better understand the next procedures (the image from the original post is omitted here).
So the received data should be processed in one-minute intervals, assigning each minute the state the device was in for the better part of that minute. The above data becomes:
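(The original post rendered the per-minute assignment as an image; the table below reconstructs it from the example data and the expected result.)

| interval start | state |
| --- | --- |
| 20:50:00 | 0 |
| 20:51:00 | 0 |
| 20:52:00 | 2 |
| 20:53:00 | 2 |
| 20:54:00 | 3 |
| 20:55:00 | 0 |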
Then, consecutive intervals with the same state value should be merged; the merged result is exactly the table shown in the Result section below.
Result
So, I need a query that, adhering to the requirements described in the Requirements section and given the data shown in the Example section, provides this result:
| group_id | state | progress | start_timestamp | end_timestamp | duration |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 20:50:00 | 20:52:00 | 120s |
| 1 | 2 | 0 | 20:52:00 | 20:54:00 | 120s |
| 2 | 3 | 49 | 20:54:00 | 20:55:00 | 60s |
| 3 | 0 | 0 | 20:55:00 | 20:56:00 | 60s |
Sample data
Consider these two data sets as sample data.
Sample data 1
Data:
WITH data as (
SELECT * FROM UNNEST([
STRUCT(NULL AS state, 0 AS state_valid, 0 as progress, CURRENT_TIMESTAMP() as timestamp),
(2, 1, 2451, TIMESTAMP('2022-07-01 20:50:00 UTC')),
(0, 1, 2451, TIMESTAMP('2022-07-01 20:50:20 UTC')),
(2, 1, 2451, TIMESTAMP('2022-07-01 20:52:29 UTC')),
(3, 1, 2451, TIMESTAMP('2022-07-01 20:53:51 UTC')),
(3, 1, 2500, TIMESTAMP('2022-07-01 20:54:20 UTC')),
(2, 0, 2500, TIMESTAMP('2022-07-01 20:55:09 UTC'))
])
WHERE state IS NOT NULL
)
Expected outcome:
| group_id | state | progress | start_timestamp | end_timestamp | duration |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 20:50:00 | 20:52:00 | 120s |
| 1 | 2 | 0 | 20:52:00 | 20:54:00 | 120s |
| 2 | 3 | 49 | 20:54:00 | 20:55:00 | 60s |
| 3 | 0 | 0 | 20:55:00 | current_timestamp | current_timestamp - 20:55:00 |
Sample data 2
Data:
WITH data as (
SELECT * FROM UNNEST([
STRUCT(NULL AS state, 0 AS state_valid, 0 as progress, CURRENT_TIMESTAMP() as timestamp),
(2, 1, 2451, TIMESTAMP('2022-07-01 20:50:00 UTC')),
(0, 1, 2451, TIMESTAMP('2022-07-01 20:50:20 UTC')),
(2, 1, 2451, TIMESTAMP('2022-07-01 20:52:29 UTC')),
(3, 1, 2451, TIMESTAMP('2022-07-01 20:53:51 UTC')),
(3, 1, 2500, TIMESTAMP('2022-07-01 20:54:20 UTC')),
(3, 1, 2580, TIMESTAMP('2022-07-01 20:55:09 UTC')),
(3, 1, 2600, TIMESTAMP('2022-07-01 20:59:09 UTC')),
(3, 1, 2700, TIMESTAMP('2022-07-01 21:20:09 UTC')),
(2, 0, 2700, TIMESTAMP('2022-07-01 22:11:09 UTC'))
])
WHERE state IS NOT NULL
)
Expected outcome:
| group_id | state | progress | start_timestamp | end_timestamp | duration |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 20:50:00 | 20:52:00 | 120s |
| 1 | 2 | 0 | 20:52:00 | 20:54:00 | 120s |
| 2 | 3 | 249 | 20:54:00 | 22:11:00 | 4620s |
| 3 | 0 | 0 | 22:11:00 | current_timestamp | current_timestamp - 22:11:00 |
Consider the approach below:
with by_second as (
  -- expand each record into one row per second, up to (but not including)
  -- the next record; state_valid = 0 maps the state to 0
  select if(state_valid = 0, 0, state) state, progress, ts, timestamp_trunc(ts, minute) ts_minute
  from (
    select *, timestamp_sub(lead(timestamp) over(order by timestamp), interval 1 second) as next_timestamp
    from your_table
  ), unnest(generate_timestamp_array(
    timestamp, ifnull(next_timestamp, timestamp_trunc(timestamp_add(timestamp, interval 60 second), minute)), interval 1 second
  )) ts
), by_minute as (
  -- per minute, keep the (state, progress) pair that covers the most seconds;
  -- having sum(weight) > 59 keeps only fully covered minutes
  select ts_minute, array_agg(struct(state, progress) order by weight desc limit 1)[offset(0)].*
  from (
    select state, progress, ts_minute, count(*) weight
    from by_second
    group by state, progress, ts_minute
  )
  group by ts_minute
  having sum(weight) > 59
)
-- collapse consecutive minutes with the same (state, progress) into groups
select group_id, any_value(state) state, max(progress) progress,
  min(ts_minute) start_timestamp,
  timestamp_add(max(ts_minute), interval 1 minute) end_timestamp,
  60 * count(*) duration
from (
  select countif(new_group) over(order by ts_minute) group_id, state, progress, ts_minute
  from (
    select ts_minute, state, progress - lag(progress) over(order by ts_minute) as progress,
      ifnull((state, progress) != lag((state, progress)) over(order by ts_minute), true) new_group,
    from by_minute
  )
)
group by group_id
If applied to dummy data as in your question, the output is as shown in the original answer's screenshot (omitted here).
For some reason I feel that updating the existing answer would be confusing, so see the fixed solution here. There are two fixes, in two lines of the very final select statement; they are commented so you can easily locate them.
with by_second as (
select if(state_valid = 0, 0, state) state, progress, ts, timestamp_trunc(ts, minute) ts_minute
from (
select *, timestamp_sub(lead(timestamp) over(order by timestamp), interval 1 second) as next_timestamp
from your_table
), unnest(generate_timestamp_array(
timestamp, ifnull(next_timestamp, timestamp_trunc(timestamp_add(timestamp, interval 60 second), minute)), interval 1 second
)) ts
), by_minute as (
select ts_minute, array_agg(struct(state, progress) order by weight desc limit 1)[offset(0)].*
from (
select state, progress, ts_minute, count(*) weight
from by_second
group by state, progress, ts_minute
)
group by ts_minute
having sum(weight) > 59
)
select group_id, any_value(state) state, sum(progress) progress,
# here changed max(progress) to sum(progress)
min(ts_minute) start_timestamp,
timestamp_add(max(ts_minute), interval 1 minute) end_timestamp,
60 * count(*) duration
from (
select countif(new_group) over(order by ts_minute) group_id, state, progress, ts_minute
from (
select ts_minute, state, progress - lag(progress) over(order by ts_minute) as progress,
-- ifnull((state, progress) != lag((state, progress)) over(order by ts_minute), true) new_group,
# fixed this line with below one
ifnull((state) != lag(state) over(order by ts_minute), true) new_group,
from by_minute
)
)
group by group_id
Yet another approach:
WITH preprocessing AS (
SELECT IF (LAST_VALUE(state_valid IGNORE NULLS) OVER (ORDER BY ts) = 0, 0, state) AS state,
LAST_VALUE(state_valid IGNORE NULLS) OVER (ORDER BY ts) AS state_valid,
progress, ts
FROM sample
),
intervals_added AS (
( SELECT *, 0 src FROM preprocessing UNION ALL
SELECT null, null, null, ts, 1
FROM (SELECT MIN(ts) min_ts FROM sample), (SELECT MAX(ts) max_ts FROM sample),
UNNEST (GENERATE_TIMESTAMP_ARRAY(min_ts, max_ts + INTERVAL 1 MINUTE, INTERVAL 1 MINUTE)) ts
) EXCEPT DISTINCT
SELECT null, null, null, ts, 1 FROM (SELECT ts FROM preprocessing)
),
analysis AS (
SELECT *, SUM(grp) OVER (ORDER BY ts) AS group_id FROM (
SELECT * EXCEPT(progress),
TIMESTAMP_TRUNC(ts, MINUTE) AS start_timestamp,
progress - LAST_VALUE(progress IGNORE NULLS) OVER w AS progress,
IF (LAST_VALUE(state IGNORE NULLS) OVER w <> state, 1, 0) AS grp,
TIMESTAMP_DIFF(LEAD(ts) OVER (ORDER BY ts, src), ts, SECOND) AS diff,
FROM intervals_added
WINDOW w AS (ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
) QUALIFY MAX(diff) OVER (PARTITION BY TIMESTAMP_TRUNC(ts, MINUTE)) = diff
)
SELECT group_id, MIN(state) AS state, SUM(progress) AS progress,
MIN(start_timestamp) AS start_timestamp,
MIN(start_timestamp) + INTERVAL COUNT(1) MINUTE AS end_timestamp,
60 * COUNT(1) AS duration,
FROM analysis GROUP BY 1 ORDER BY 1;
Output (screenshot omitted).
Related
Description
I have a PostgreSQL table that looks like this:
| identifier | state | card_presence | progress | timestamp |
| --- | --- | --- | --- | --- |
| V000000000000123 | 0 | true | 1000 | 2022-12-01 12:45:02 |
| V000000000000123 | 2 | true | 1022 | 2022-12-01 12:45:03 |
| V000000000000123 | 3 | true | 1024 | 2022-12-01 12:48:03 |
| V000000000000124 | 2 | true | 974 | 2022-12-01 12:43:00 |
| V000000000000124 | 6 | true | 982 | 2022-12-01 12:55:00 |
I have to analyze this data quite frequently, at roughly 60-second intervals. The first stage of analysis is a complex query which processes the data in multiple steps. At the moment I execute the query for each identifier individually.
Basically, the query does roughly what is described in Time intervals analysis in BigQuery.
The query looks like:
with real_data as (
SELECT
(CASE WHEN card_presence != false THEN state ELSE -1 END) as state,
progress,
lead(timestamp) over(order by timestamp) - interval '1 second' as next_timestamp,
timestamp
FROM telemetry_tacho
WHERE driver_identifier = 'V100000165676000' AND state IS NOT NULL AND timestamp >= CURRENT_TIMESTAMP - INTERVAL '2 weeks'
), sample_by_second as (
SELECT
state,
progress,
ts,
date_trunc('minute', ts) ts_minute
FROM
real_data,
generate_series(
timestamp,
coalesce(
next_timestamp,
date_trunc('minute', timestamp + interval '60 seconds')
),
interval '1 second'
) ts
), sample_by_second_with_weight as (
SELECT
state,
MIN(progress) as min_progress,
MAX(progress) as max_progress,
ts_minute,
count(*) weight
FROM sample_by_second
GROUP BY state, ts_minute
), sample_by_minute as (
SELECT
ts_minute,
(array_agg(state ORDER BY weight DESC))[1] as state,
MIN(min_progress) as min_progress,
MAX(max_progress) as max_progress
FROM sample_by_second_with_weight
GROUP BY ts_minute
), add_previous_state as (
SELECT
ts_minute,
state,
min_progress,
max_progress,
lag(state) OVER (ORDER BY ts_minute) as prev_state
FROM sample_by_minute
), add_group_indication as (
SELECT
ts_minute,
state,
min_progress,
max_progress,
SUM(CASE
WHEN state = 0 AND prev_state = -1 THEN 0
WHEN state = -1 AND prev_state = 0 THEN 0
WHEN state != prev_state THEN 1
ELSE 0
END) over (order by ts_minute) as group_id
FROM add_previous_state
), computed as (
select
group_id,
min(ts_minute) as ts_minute_min,
max(ts_minute) as ts_minute_max,
min(state) as state,
MIN(min_progress) as min_progress,
MAX(max_progress) as max_progress,
min(ts_minute) as start_timestamp,
max(ts_minute) + interval '1 minute' end_timestamp,
60 * count(*) as duration
from add_group_indication
group by group_id
), include_surrounding_states as (
select
*,
lag(state) over(order by start_timestamp) prev_state,
lead(state) over(order by start_timestamp) next_state
from computed
), filter_out_invalid_states as (
select
state,
min_progress,
max_progress,
start_timestamp,
end_timestamp,
lag(state) over(order by start_timestamp) prev_state,
lead(state) over(order by start_timestamp) next_state
from include_surrounding_states
where not (state = 2 AND prev_state = 3 AND next_state = 3 AND duration = 60)
), recalculate_group_id as (
select
SUM(CASE WHEN state != prev_state THEN 1 ELSE 0 END) over (order by start_timestamp) as group_id,
state,
min_progress,
max_progress,
COALESCE(start_timestamp, CURRENT_TIMESTAMP - INTERVAL '2 weeks') as start_timestamp, -- Add period start timestamp for the first entry
COALESCE(end_timestamp, CURRENT_TIMESTAMP) as end_timestamp
from filter_out_invalid_states
), final_data as (
SELECT
MAX(state) AS state,
MIN(min_progress) AS min_progress,
MAX(max_progress) AS max_progress,
MAX(max_progress) - MIN(min_progress) AS progress_diff,
EXTRACT('epoch' FROM min(start_timestamp))::integer AS start_timestamp,
EXTRACT('epoch' FROM max(end_timestamp))::integer AS end_timestamp,
EXTRACT('epoch' FROM (max(end_timestamp) - min(start_timestamp))::interval)::integer AS duration
FROM recalculate_group_id
GROUP BY group_id
ORDER BY start_timestamp ASC
)
select * from final_data;
Sample data
Input
"identifier","card_presence","state","progress","timestamp"
"0000000000000123",TRUE,0,100000,"2022-12-01 00:00:00+00"
"0000000000000123",TRUE,0,100000,"2022-12-01 10:00:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-01 10:05:00+00"
"0000000000000123",TRUE,0,100000,"2022-12-01 15:00:02+00"
"0000000000000123",TRUE,3,100000,"2022-12-01 15:45:00+00"
"0000000000000123",TRUE,0,100000,"2022-12-01 20:15:00+00"
"0000000000000123",TRUE,0,100000,"2022-12-01 20:15:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-02 05:14:45+00"
"0000000000000123",TRUE,3,100000,"2022-12-02 05:15:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-02 05:15:01+00"
"0000000000000123",TRUE,3,100000,"2022-12-02 06:10:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-02 07:11:20+00"
"0000000000000123",TRUE,3,100000,"2022-12-02 07:11:28+00"
"0000000000000123",TRUE,3,100000,"2022-12-02 07:13:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-02 08:01:06+00"
"0000000000000123",TRUE,0,100000,"2022-12-02 08:30:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-02 08:30:10+00"
"0000000000000123",TRUE,0,100000,"2022-12-02 09:45:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-02 10:30:00+00"
"0000000000000123",TRUE,0,100000,"2022-12-02 15:00:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-02 15:45:00+00"
"0000000000000123",TRUE,0,100000,"2022-12-02 16:45:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-03 01:45:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-03 02:25:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-03 05:18:00+00"
"0000000000000123",TRUE,0,100000,"2022-12-03 06:15:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-03 07:00:00+00"
"0000000000000123",TRUE,0,100000,"2022-12-03 11:30:00+00"
"0000000000000123",TRUE,3,100000,"2022-12-03 12:15:00+00"
"0000000000000123",TRUE,0,100000,"2022-12-03 13:15:00+00"
Output
"state","min_progress","max_progress","progress_diff","start_timestamp","end_timestamp","duration"
0,100000,100000,0,1669852800,1669889100,36300
3,100000,100000,0,1669889100,1669906800,17700
0,100000,100000,0,1669906800,1669909500,2700
3,100000,100000,0,1669909500,1669925700,16200
0,100000,100000,0,1669925700,1669958100,32400
3,100000,100000,0,1669958100,1669974300,16200
0,100000,100000,0,1669974300,1669977000,2700
3,100000,100000,0,1669977000,1669993200,16200
0,100000,100000,0,1669993200,1669995900,2700
3,100000,100000,0,1669995900,1669999500,3600
0,100000,100000,0,1669999500,1670031900,32400
3,100000,100000,0,1670031900,1670048100,16200
0,100000,100000,0,1670048100,1670050800,2700
3,100000,100000,0,1670050800,1670067000,16200
0,100000,100000,0,1670067000,1670069700,2700
3,100000,100000,0,1670069700,1670073300,3600
0,100000,100000,0,1670073300,1670073420,120
Question
The query usually takes some time to process for each device, and constantly querying and analysing that data for each identifier separately is quite time consuming, so I thought it might be possible to pre-process that data for all devices periodically and store the analysed results in a separate table or materialized view.
Running the query periodically and saving the results to a separate table or a materialized view isn't that hard, but is it possible to do that for all identifier values that exist in the table at once?
I believe the query could be updated to do that, but I fail to grasp how.
Without delving into your analysis logic, I may suggest this:
extract the list of distinct driver_identifiers, or have it stored in a materialized view too;
lateral join this list with your query.
Your query must be changed a bit too: replace driver_identifier = 'V100000165676000' with driver_identifier = dil.drid to correlate it with the identifier list.
with driver_identifier_list(drid) as
(
select distinct driver_identifier from telemetry_tacho
)
select l.*
from driver_identifier_list as dil
cross join lateral
(
-- your query (where driver_identifier = dil.drid) here
) as l;
Effectively this is a loop that runs your query for every driver_identifier value. However, the view(s) would have to be refreshed on every telemetry_tacho mutation, which makes the effectiveness of the materialized-view approach questionable.
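A minimal sketch of the persistence step, assuming the combined query above is wrapped in a materialized view (the view name is illustrative, and the inner placeholder is kept as above):

-- Sketch only: persist the combined per-identifier analysis.
-- telemetry_analysis is a hypothetical name.
CREATE MATERIALIZED VIEW telemetry_analysis AS
WITH driver_identifier_list(drid) AS (
    SELECT DISTINCT driver_identifier FROM telemetry_tacho
)
SELECT l.*
FROM driver_identifier_list AS dil
CROSS JOIN LATERAL (
    -- your query (where driver_identifier = dil.drid) here
) AS l;

-- Then refresh on a schedule (e.g. every 60 s via cron or pg_cron)
-- rather than on every mutation:
REFRESH MATERIALIZED VIEW telemetry_analysis;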
I have a History table with the columns date, person and status, and I need to know the total amount of time spent from when an item is created until it reaches the Finished status (the Finished status can occur multiple times). I need the datediff from the first time it is created until the first time its status is Finished; afterwards I need the next date where it is not Finished, and again the datediff to the next Finished date, and so on. Another condition is to perform this calculation only where the person who changed the status is not null. After that, I need to sum all the times to get the total.
I tried the LEAD and LAG functions but was not getting the results that I need.
First let's talk about providing demo data. Here's a good way to do it: create a table variable similar to your actual object(s) and then populate it:
DECLARE @statusTable TABLE (Date DATETIME, Person INT, Status NVARCHAR(10), KeyID NVARCHAR(7))
INSERT INTO @statusTable (Date, Person, Status, KeyID) VALUES
('2022-10-07 07:01:17.463', 1, 'Start', 'AAA-111'),
('2022-10-07 07:01:17.463', 1, 'Waiting', 'AAA-111'),
('2022-10-11 14:01:44.463', 1, 'Waiting', 'AAA-111'),
('2022-10-14 10:04:17.463', 1, 'Waiting', 'AAA-111'),
('2022-10-14 10:04:17.463', 1, 'Finished','AAA-111'),
('2022-10-14 10:04:17.463', 1, 'Waiting', 'AAA-111'),
('2022-10-17 17:01:17.463', 1, 'Waiting', 'AAA-111'),
('2022-10-21 11:03:17.463', 1, 'Waiting', 'AAA-111'),
('2022-10-21 11:03:17.463', 1, 'Finished','AAA-111'),
('2022-10-21 11:03:17.463', 1, 'Waiting', 'AAA-111'),
('2022-10-21 11:04:17.463', NULL, 'Waiting', 'AAA-111'),
('2022-10-21 11:05:17.463', 1, 'Finished','AAA-111')
Your problem is recursive, so we can use a rCTE to resolve it.
;WITH base AS (
SELECT *, CASE WHEN LAG(Status,1) OVER (PARTITION BY KeyID ORDER BY Date) <> 'Waiting' AND Status = 'Waiting' THEN 1 END AS isStart, ROW_NUMBER() OVER (PARTITION BY KeyID ORDER BY Date) AS rn
FROM @statusTable
), rCTE AS (
SELECT date AS startDate, date, Person, Status, KeyID, IsStart, rn
FROM base
WHERE isStart = 1
UNION ALL
SELECT a.startDate, r.date, r.Person, r.Status, a.KeyID, r.IsStart, r.rn
FROM rCTE a
INNER JOIN base r
ON a.rn+1 = r.rn
AND a.KeyID = r.KeyID
AND r.IsStart IS NULL
)
SELECT StartDate, MAX(date) AS FinishDate, KeyID, DATEDIFF(MINUTE,StartDate,MAX(Date)) AS Minutes
FROM rCTE
GROUP BY rCTE.startDate, KeyID
HAVING COUNT(Person) = COUNT(KeyID)
StartDate FinishDate KeyID Minutes
---------------------------------------------------------------
2022-10-07 07:01:17.463 2022-10-14 10:04:17.463 AAA-111 10263
2022-10-14 10:04:17.463 2022-10-21 11:03:17.463 AAA-111 10139
What we're doing here is finding and marking the starts. Since a Start row's timestamp matches the first Waiting row's timestamp, and there isn't always a Start row, we use the first Waiting row as the start marker.
Then, we go through and find the next Finish row for that KeyID.
Using this we can now group on the StartDate, take the MAX(Date) as FinishDate, and use DATEDIFF to calculate the difference.
Finally, we compare the count of KeyIDs to the count of Persons. If there is a NULL value for Person, the counts will not match and we discard the group.
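A quick self-contained illustration of that COUNT trick: COUNT(column) skips NULLs, so a single NULL Person makes the two counts diverge.

-- COUNT(Person) ignores the NULL row, COUNT(KeyID) counts it
SELECT COUNT(Person) AS person_count, COUNT(KeyID) AS key_count
FROM (VALUES (1, 'AAA-111'), (NULL, 'AAA-111')) AS v(Person, KeyID);
-- person_count = 1, key_count = 2, so the HAVING filter discards the group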
select min(date) as start
,max(date) as finish
,datediff(millisecond, min(date), max(date)) as diff_in_millisecond
,sum(datediff(millisecond, min(date), max(date))) over() as total_diff_in_millisecond
from
(
select *
,count(case when Status = 'Finished' then 1 end) over(order by date desc, status desc) as grp
,case when person is null then 0 else 1 end as flg
from t
) t
group by grp
having min(flg) = 1
order by start
| start | finish | diff_in_millisecond | total_diff_in_millisecond |
| --- | --- | --- | --- |
| 2022-10-07 07:01:17.4630000 | 2022-10-14 10:04:28.4730000 | 615791010 | 1242093518 |
| 2022-10-14 10:04:28.4730000 | 2022-10-21 11:03:06.7170000 | 608318244 | 1242093518 |
| 2022-10-26 12:46:14.7730000 | 2022-10-26 17:45:59.0370000 | 17984264 | 1242093518 |
Fiddle
There are many similar questions and answers already posted, but I could not find one with these differences: 1) the count of NULLs starts over, and 2) a math function is applied to the replaced value.
An event either takes place or not (1 or NULL), by date, by customer. You can assume that a customer has one and only one row for every date.
I want to replace the NULLs with a decay function based on the number of consecutive NULLs (time since the event). A customer can have the event every day, skip a day, or skip multiple days. But once the event takes place, the decay starts over. Currently my decay is divide-by-2, but that is just an example.
| DT | CUSTOMER | EVENT | DESIRED |
| --- | --- | --- | --- |
| 2022-01-01 | a | 1 | 1 |
| 2022-01-02 | a | 1 | 1 |
| 2022-01-03 | a | 1 | 1 |
| 2022-01-04 | a | 1 | 1 |
| 2022-01-05 | a | 1 | 1 |
| 2022-01-01 | b | 1 | 1 |
| 2022-01-02 | b | NULL | 0.5 |
| 2022-01-03 | b | NULL | 0.25 |
| 2022-01-04 | b | 1 | 1 |
| 2022-01-05 | b | NULL | 0.5 |
I can produce the desired result, but it is very unwieldy. I'm looking for a better way. This will also need to be extended to multiple event columns.
create or replace temporary table the_data (
dt date,
customer char(10),
event int,
desired float)
;
insert into the_data values ('2022-01-01', 'a', 1, 1);
insert into the_data values ('2022-01-02', 'a', 1, 1);
insert into the_data values ('2022-01-03', 'a', 1, 1);
insert into the_data values ('2022-01-04', 'a', 1, 1);
insert into the_data values ('2022-01-05', 'a', 1, 1);
insert into the_data values ('2022-01-01', 'b', 1, 1);
insert into the_data values ('2022-01-02', 'b', NULL, 0.5);
insert into the_data values ('2022-01-03', 'b', NULL, 0.25);
insert into the_data values ('2022-01-04', 'b', 1, 1);
insert into the_data values ('2022-01-05', 'b', NULL, 0.5);
with
base as (
select * from the_data
),
find_nan as (
select *, case when event is null then 1 else 0 end as event_is_nan from base
),
find_nan_diff as (
select *, event_is_nan - coalesce(lag(event_is_nan) over (partition by customer order by dt), 0) as event_is_nan_diff from find_nan
),
find_nan_group as (
select *, sum(case when event_is_nan_diff = -1 then 1 else 0 end) over (partition by customer order by dt) as nan_group from find_nan_diff
),
consec_nans as (
select *, sum(event_is_nan) over (partition by customer, nan_group order by dt) as n_consec_nans from find_nan_group
),
decay as (
select *, case when n_consec_nans > 0 then 0.5 / n_consec_nans else 1 end as decay_factor from consec_nans
),
ffill as (
select *, first_value(event) over (partition by customer order by dt) as ffill_value from decay
),
final as (
select *, ffill_value * decay_factor as the_answer from ffill
)
select * from final
order by customer, dt
;
Thanks
The query could be simplified by using CONDITIONAL_CHANGE_EVENT to generate a subgrp helper column:
WITH cte AS (
SELECT *, CONDITIONAL_CHANGE_EVENT(event IS NULL) OVER(PARTITION BY CUSTOMER
ORDER BY DT) AS subgrp
FROM the_data
)
SELECT *, COALESCE(EVENT, 0.5 / ROW_NUMBER() OVER(PARTITION BY CUSTOMER, SUBGRP
ORDER BY DT)) AS computed_decay
FROM cte
ORDER BY CUSTOMER, DT;
Output (screenshot omitted).
EDIT:
Without using CONDITIONAL_CHANGE_EVENT:
WITH cte AS (
SELECT *,
CASE WHEN
event = LAG(event,1, event) OVER(PARTITION BY customer ORDER BY dt)
OR (event IS NULL AND LAG(event) OVER(PARTITION BY customer ORDER BY dt) IS NULL)
THEN 0 ELSE 1 END AS l
FROM the_data
), cte2 AS (
SELECT *, SUM(l) OVER(PARTITION BY customer ORDER BY dt) AS SUBGRP
FROM cte
)
SELECT *, COALESCE(EVENT, 0.5 / ROW_NUMBER() OVER(PARTITION BY CUSTOMER, SUBGRP
ORDER BY DT)) AS computed_decay
FROM cte2
ORDER BY CUSTOMER, DT;
db<>fiddle demo
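One caveat: the sample DESIRED values (0.5, 0.25) are consistent with both 0.5 / n and repeated halving 0.5^n; the two only diverge from the third consecutive NULL onward (0.1667 vs 0.125). If repeated halving is the intent, the decay expression can be swapped; a sketch against the cte above:

-- Alternative decay: halve on every consecutive NULL (0.5, 0.25, 0.125, ...)
SELECT *, COALESCE(EVENT, POWER(0.5, ROW_NUMBER() OVER(PARTITION BY CUSTOMER, SUBGRP
                                                       ORDER BY DT))) AS computed_decay
FROM cte
ORDER BY CUSTOMER, DT;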
Sample data:
| Vehicle | Time | Latitude Start | Longitude Start | Latitude End | Longitude End |
| --- | --- | --- | --- | --- | --- |
| Motorcycle 1 | 13:12 | 2.28079 | 77.70193 | 2.23239 | 33.72323 |
| Motorcycle 1 | 14:40 | 2.23239 | 33.72323 | 9.23079 | 78.4289 |
| Motorcycle 2 | 08:34 | 9.23079 | 78.4289 | 8.13433 | 12.70871 |
| Motorcycle 2 | 18:20 | 8.13433 | 12.70871 | 7.23578 | 99.00093 |
| Motorcycle 3 | 06:18 | 7.23578 | 99.00093 | 2.34079 | 75.44866 |
| Motorcycle 3 | 10:00 | 2.34079 | 75.44866 | 1.25459 | 17.23253 |
| Motorcycle 3 | 17:54 | 1.25459 | 17.23253 | 8.78088 | 99.00123 |
Essentially, I want to work out where the motorcycles spend most of their time parked (stationary). So I want to rank the coordinates each motorcycle is stationary at, by length (time between trips = time parked/stationary).
Stationary time is BETWEEN trips. The end coordinates of one trip are where the motorcycle was stationary; it remains there until the next trip begins.
No idea how to query this using SQL. Any ideas? Not really sure where to start; I'm very much a beginner with SQL.
So there are two steps. The first is to separate "has moved" from "stationary" rows (to be honest, I am not 100% sure that is what your data presents), so we will start with a filter to get the stationary rows (within a threshold):
SELECT name, duration, start_lat, start_lon
FROM table
/* where the end is within 100m of the start, i.e. stationary */
WHERE haversine(start_lat, start_lon, end_lat, end_lon) < 0.1
and now that we have our "stationary" rows, we want to rank the rows per name to find the longest for each, which can be done via a row_number:
SELECT name, duration, start_lat, start_lon
FROM table
/* where the end is within 100m of the start, i.e. stationary */
WHERE haversine(start_lat, start_lon, end_lat, end_lon) < 0.1
QUALIFY row_number() over( partition by name order by duration desc) = 1
which is the same as:
SELECT name, duration, start_lat, start_lon
FROM (
SELECT name, duration, start_lat, start_lon
,row_number() over( partition by name order by duration desc) as rn
FROM table
/* where the end is within 100m of the start, i.e. stationary */
WHERE haversine(start_lat, start_lon, end_lat, end_lon) < 0.1
)
WHERE rn = 1
New facts: if each row is actually the "moving section", then the details of the stationary periods between rows can be derived as follows.
SELECT vehicle_id,
lag(time) over (partition by vehicle_id order by time) as stat_start_time,
time as stat_end_time,
datediff('seconds', stat_start_time, stat_end_time) as stat_duration_s,
lag(end_lat) over (partition by vehicle_id order by time) as stat_start_lat,
lag(end_lon) over (partition by vehicle_id order by time) as stat_start_lon,
start_lat as stat_end_lat,
start_lon as stat_end_lon
FROM table
now we have a table of "stationary" entries, you can do what you wish with it.
Thus the longest stationary period per vehicle becomes:
WITH data (vehicle_id, time, start_lat, start_lon, end_lat, end_lon) AS (
SELECT column1, column2::time, column3, column4, column5, column6 FROM VALUES
('Motorcycle 1', '13:12', 2.28079, 77.70193, 2.23239, 33.72323),
('Motorcycle 1', '14:40', 2.23239, 33.72323, 9.23079, 78.42890),
('Motorcycle 2', '08:34', 9.23079, 78.42890, 8.13433, 12.70871),
('Motorcycle 2', '18:20', 8.13433, 12.70871, 7.23578, 99.00093),
('Motorcycle 3', '06:18', 7.23578, 99.00093, 2.34079, 75.44866),
('Motorcycle 3', '10:00', 2.34079, 75.44866, 1.25459, 17.23253),
('Motorcycle 3', '17:54', 1.25459, 17.23253, 8.78088, 99.00123)
)
SELECT vehicle_id,
stat_duration_s,
stat_start_lon,
stat_start_lat
FROM (
SELECT vehicle_id,
lag(time) over (partition by vehicle_id order by time) as stat_start_time,
time as stat_end_time,
datediff('seconds', stat_start_time, stat_end_time) as stat_duration_s,
lag(end_lat) over (partition by vehicle_id order by time) as stat_start_lat,
lag(end_lon) over (partition by vehicle_id order by time) as stat_start_lon,
start_lat as stat_end_lat,
start_lon as stat_end_lon
FROM data
)
QUALIFY row_number() over( partition by vehicle_id order by stat_duration_s desc nulls last) = 1
gives:
| VEHICLE_ID | STAT_DURATION_S | STAT_START_LON | STAT_START_LAT |
| --- | --- | --- | --- |
| Motorcycle 1 | 5280 | 33.72323 | 2.23239 |
| Motorcycle 2 | 35160 | 12.70871 | 8.13433 |
| Motorcycle 3 | 28440 | 17.23253 | 1.25459 |
I need your help to create a view in SQL Server (v12.0.6024.0). One of my customers has a table in which some time slots are saved in this format:
| ID | ID_EVENT | Time Slot |
| --- | --- | --- |
| 1000 | 24 | 08:30:00.0000 |
| 1000 | 24 | 09:00:00.0000 |
| 1000 | 24 | 09:30:00.0000 |
Every time slot lasts 30 minutes. The example above means that the event with ID 24 (saved in another table) lasted from 8:30 to 10:00 (the third slot started at 9:30 and lasted 30 minutes, so it finished at 10:00). The problem is that in some cases the time values are not consecutive and there may be a pause in the middle, so I may have something like this:
| ID | ID_EVENT | Time Slot |
| --- | --- | --- |
| 1000 | 24 | 08:30:00.0000 |
| 1000 | 24 | 09:00:00.0000 |
| 1000 | 24 | 09:30:00.0000 |
| 1000 | 24 | 11:30:00.0000 |
| 1000 | 24 | 12:00:00.0000 |
| 1000 | 24 | 12:30:00.0000 |
In this case the event with ID 24 lasted from 8:30 to 10:00, stopped, then ran again from 11:30 to 13:00. I have been asked to prepare a view for an external developer in which I report not only the time the event started (in my example, 8:30) and the time it stopped for good (in my example, 13:00), but also the time the pause started (in my example, 10:00) and the time the pause finished (in my example, 11:30).
I have no problem with the first two values, but I don't know how to extract the other two. I think we can consider a pause to happen when two time slots are not consecutive, and there cannot be more than two periods for the same event. I suppose I need a procedure but find it difficult to write; I need a view that says:
| ID | ID_EVENT | Time1 | Time2 | Time3 | Time4 |
| --- | --- | --- | --- | --- | --- |
| 1000 | 24 | 08:30:00.0000 | 10:00:00.0000 | 11:30:00.0000 | 13:00:00.0000 |
Any help?
declare @t table(ID int, ID_EVENT int, TimeSlot time)
insert into @t
values
(1000, 24, '08:30:00.0000'),
(1000, 24, '09:00:00.0000'),
(1000, 24, '09:30:00.0000'),
--
(1000, 24, '11:30:00.0000'),
(1000, 24, '12:00:00.0000'),
(1000, 24, '12:30:00.0000'),
--
(1000, 24, '15:00:00.0000'),
(1000, 24, '15:30:00.0000'),
(1000, 24, '16:00:00.0000'),
--
(1000, 25, '15:30:00.0000'),
(1000, 25, '16:30:00.0000');
select Id, ID_EVENT,
min(TimeSlot) as StartTimeSlot,
dateadd(minute, 30, max(TimeSlot)) as EndTimeSlot
from
(
select *,
datediff(minute, '00:00:00', Timeslot)/30 - row_number() over(partition by Id, ID_EVENT order by TimeSlot) as grpid
from @t
) as t
group by Id, ID_EVENT, grpid;
--first two groups per event&id row
select Id, ID_EVENT,
--1
min(case when grpordinal = 1 then TimeSlot end) as StartSlot1,
dateadd(minute, 30, max(case when grpordinal = 1 then TimeSlot end)) as EndSlot1,
--2
min(case when grpordinal = 2 then TimeSlot end) as StartSlot2,
dateadd(minute, 30, max(case when grpordinal = 2 then TimeSlot end)) as EndSlot2
from
(
select Id, ID_EVENT, TimeSlot,
dense_rank() over(partition by Id, ID_EVENT order by grpid) as grpordinal
from
(
select *,
datediff(minute, '00:00:00', Timeslot)/30 - row_number() over(partition by Id, ID_EVENT order by TimeSlot) as grpid
from @t
) as t
) as src
--where grpordinal <= 2 --not really needed
group by Id, ID_EVENT;
--!!!!only when there are max two groups/periods
--if there could be more than 2 periods this will not work
select Id, ID_EVENT,
--1
min(case when grpid = 0 then TimeSlot end) as StartSlot1,
dateadd(minute, 30, max(case when grpid = 0 then TimeSlot end)) as EndSlot1,
--2
min(case when grpid <> 0 then TimeSlot end) as StartSlot2,
dateadd(minute, 30, max(case when grpid <> 0 then TimeSlot end)) as EndSlot2
from
(
select *,
/*
1
+ datediff(minute, '00:00:00', Timeslot)/30 - row_number() over(partition by Id, ID_EVENT order by TimeSlot)
- datediff(minute, '00:00:00', min(Timeslot) over(partition by Id, ID_EVENT)) /30
*/
1
+ datediff(minute, min(Timeslot) over(partition by Id, ID_EVENT), TimeSlot)/30
- row_number() over(partition by Id, ID_EVENT order by TimeSlot)
as grpid --1st groupid is always 0
from @t
) as t
group by Id, ID_EVENT;
This reads like a gaps-and-islands problem, where you want to identify and group together "adjacent" time slots.
I would suggest putting the ranges in rows rather than in columns. For this, you can use window functions like this:
select id, id_event,
min(timeslot) as timeslot_start, max(timeslot) as timeslot_end
from (
select t.*,
row_number() over(partition by id, id_event order by timeslot) rn
from mytable t
) t
group by id, id_event, dateadd(minute, - rn * 30, timeslot)
If we want to see only the first two ranges per event, both on the same row of the resultset, we can use conditional aggregation on top of that query:
select id, id_event,
max(case when rn = 1 then timeslot_start end) as timeslot_start_1,
max(case when rn = 1 then timeslot_end end) as timeslot_end_1,
max(case when rn = 2 then timeslot_start end) as timeslot_start_2,
max(case when rn = 2 then timeslot_end end) as timeslot_end_2
from (
select id, id_event,
min(timeslot) as timeslot_start, max(timeslot) as timeslot_end,
row_number() over(partition by id, id_event order by min(timeslot)) rn
from (
select t.*,
row_number() over(partition by id, id_event order by timeslot) rn
from mytable t
) t
group by id, id_event, dateadd(minute, - rn * 30, timeslot)
) t
where rn <= 2
group by id, id_event