Calculating concurrency from a set of ranges - sql

I have a set of rows containing a start timestamp and a duration. I want to perform various summaries using the overlap or concurrency.
For example: peak daily concurrency, peak concurrency grouped on another column.
Example data:
timestamp,duration
2016-01-01 12:00:00,300
2016-01-01 12:01:00,300
2016-01-01 12:06:00,300
I would like to know that the peak for the period was 12:01:00-12:05:00, at 2 concurrent.
Any ideas on how to achieve this using BigQuery or, less exciting, a Map/Reduce job?

For a per-minute resolution, with session lengths of up to 255 minutes:
SELECT session_minute, COUNT(*) c
FROM (
  SELECT start, DATE_ADD(start, i, 'MINUTE') session_minute
  FROM (
    SELECT * FROM
      (SELECT TIMESTAMP("2015-04-30 10:14") start, 7 minutes),
      (SELECT TIMESTAMP("2015-04-30 10:15") start, 12 minutes),
      (SELECT TIMESTAMP("2015-04-30 10:15") start, 12 minutes),
      (SELECT TIMESTAMP("2015-04-30 10:18") start, 12 minutes),
      (SELECT TIMESTAMP("2015-04-30 10:23") start, 3 minutes)
  ) a
  CROSS JOIN [fh-bigquery:public_dump.numbers_255] b
  WHERE a.minutes > b.i
)
GROUP BY 1
ORDER BY 1
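The query above is legacy BigQuery SQL (the comma between parenthesized subselects means UNION ALL, and [fh-bigquery:public_dump.numbers_255] is a public table of small integers i). A minimal sketch of the same idea in Standard SQL, with GENERATE_ARRAY replacing the numbers table (which also lifts the 255-minute cap):
#standardSQL
WITH sessions AS (
  SELECT TIMESTAMP("2015-04-30 10:14") AS start, 7 AS minutes UNION ALL
  SELECT TIMESTAMP("2015-04-30 10:15"), 12 UNION ALL
  SELECT TIMESTAMP("2015-04-30 10:15"), 12 UNION ALL
  SELECT TIMESTAMP("2015-04-30 10:18"), 12 UNION ALL
  SELECT TIMESTAMP("2015-04-30 10:23"), 3
)
-- one row per (session, minute of that session), then count rows per minute
SELECT TIMESTAMP_ADD(start, INTERVAL i MINUTE) AS session_minute, COUNT(*) AS c
FROM sessions, UNNEST(GENERATE_ARRAY(0, minutes - 1)) AS i
GROUP BY session_minute
ORDER BY session_minute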

STEP 1 - First you need to find all periods (start and end) with
their respective concurrent entries
SELECT ts AS start, LEAD(ts) OVER(ORDER BY ts) AS finish,
  SUM(entry) OVER(ORDER BY ts) AS concurrent_entries
FROM (
  SELECT ts, SUM(entry) AS entry
  FROM
    (SELECT ts, 1 AS entry FROM yourTable),
    (SELECT DATE_ADD(ts, duration, 'second') AS ts, -1 AS entry FROM yourTable)
  GROUP BY ts
  HAVING entry != 0
)
ORDER BY ts
Assuming input as below
(SELECT TIMESTAMP('2016-01-01 12:00:00') AS ts, 300 AS duration),
(SELECT TIMESTAMP('2016-01-01 12:01:00') AS ts, 300 AS duration),
(SELECT TIMESTAMP('2016-01-01 12:06:00') AS ts, 300 AS duration),
(SELECT TIMESTAMP('2016-01-01 12:07:00') AS ts, 300 AS duration),
(SELECT TIMESTAMP('2016-01-01 12:10:00') AS ts, 300 AS duration),
(SELECT TIMESTAMP('2016-01-01 12:11:00') AS ts, 300 AS duration)
the output of the above query will look something like this:
start finish concurrent_entries
2016-01-01 12:00:00 UTC 2016-01-01 12:01:00 UTC 1
2016-01-01 12:01:00 UTC 2016-01-01 12:05:00 UTC 2
2016-01-01 12:05:00 UTC 2016-01-01 12:07:00 UTC 1
2016-01-01 12:07:00 UTC 2016-01-01 12:10:00 UTC 2
2016-01-01 12:10:00 UTC 2016-01-01 12:12:00 UTC 3
2016-01-01 12:12:00 UTC 2016-01-01 12:15:00 UTC 2
2016-01-01 12:15:00 UTC 2016-01-01 12:16:00 UTC 1
2016-01-01 12:16:00 UTC null 0
You might still want to polish the above query a little, but it mainly does what you need.
STEP 2 - Now you can do any stats off of the above result.
For example, the peak over the whole period:
SELECT
  start, finish, concurrent_entries, RANK() OVER(ORDER BY concurrent_entries DESC) AS peak
FROM (
  SELECT ts AS start, LEAD(ts) OVER(ORDER BY ts) AS finish,
    SUM(entry) OVER(ORDER BY ts) AS concurrent_entries
  FROM (
    SELECT ts, SUM(entry) AS entry
    FROM
      (SELECT ts, 1 AS entry FROM yourTable),
      (SELECT DATE_ADD(ts, duration, 'second') AS ts, -1 AS entry FROM yourTable)
    GROUP BY ts
    HAVING entry != 0
  )
)
ORDER BY peak
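Both steps above are written in legacy BigQuery SQL (the comma between subselects means UNION ALL, and DATE_ADD takes the unit as a string). A minimal sketch of STEP 1 in Standard SQL, assuming the same yourTable with ts and duration (seconds) columns:
#standardSQL
WITH events AS (
  SELECT ts, 1 AS entry FROM yourTable
  UNION ALL
  SELECT TIMESTAMP_ADD(ts, INTERVAL duration SECOND), -1 FROM yourTable
)
SELECT ts AS start,
  LEAD(ts) OVER(ORDER BY ts) AS finish,
  SUM(entry) OVER(ORDER BY ts) AS concurrent_entries
FROM (
  SELECT ts, SUM(entry) AS entry
  FROM events
  GROUP BY ts
  HAVING entry != 0
)
ORDER BY ts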

Related

Find max overlapping intervals in periods of time

I want to find the maximum number of overlapping intervals within a given period of time. Each row has a start and an end timestamp. For a given period (e.g. hourly) I want to get how many unique rows in total fell within that period and, a bit more troublesome, the maximum number of concurrent ones within it.
Sample data:
id start               end
1  2011-12-19 06:00:00 2011-12-19 08:45:00
2  2011-12-19 06:15:00 2011-12-19 06:30:00
3  2011-12-19 06:30:00 2011-12-19 06:45:00
4  2011-12-19 06:40:00 2011-12-19 07:15:00
5  2011-12-19 07:15:00 2011-12-19 08:45:00
6  2011-12-19 07:30:00 2011-12-19 07:50:00
7  2011-12-19 08:00:00 2011-12-19 08:30:00
8  2011-12-19 08:00:00 2011-12-19 08:15:00
9  2011-12-19 08:30:00 2011-12-19 08:45:00
For this data, the hourly result would look like:
id period                                    max total
1  2011-12-19 06:00:00 - 2011-12-19 07:00:00 3   4
2  2011-12-19 07:00:00 - 2011-12-19 08:00:00 3   4
3  2011-12-19 08:00:00 - 2011-12-19 09:00:00 4   5
Where max (max concurrent) would be:
2011-12-19 06:00:00 - Concurrent sessions: (2,1), (3,1,4) Total: 1,2,3,4
2011-12-19 07:00:00 - Concurrent sessions: (1,4), (5,1,6) Total: 1,4,5,6
2011-12-19 08:00:00 - Concurrent sessions: (1,5,7,8), (9,1,5) Total: 1,5,7,8,9
Any ideas how I could achieve something like this using SQL (BigQuery)?
This is a little complicated, but here is a query:
with t as (
select 1 as id, timestamp('2011-12-19 06:00:00') as startt, timestamp('2011-12-19 08:45:00') as endt union all
select 2 as id, timestamp('2011-12-19 06:15:00') as startt, timestamp('2011-12-19 06:30:00') as endt union all
select 3 as id, timestamp('2011-12-19 06:30:00') as startt, timestamp('2011-12-19 06:45:00') as endt union all
select 4 as id, timestamp('2011-12-19 06:40:00') as startt, timestamp('2011-12-19 07:15:00') as endt union all
select 5 as id, timestamp('2011-12-19 07:15:00') as startt, timestamp('2011-12-19 08:45:00') as endt union all
select 6 as id, timestamp('2011-12-19 07:30:00') as startt, timestamp('2011-12-19 07:50:00') as endt union all
select 7 as id, timestamp('2011-12-19 08:00:00') as startt, timestamp('2011-12-19 08:30:00') as endt union all
select 8 as id, timestamp('2011-12-19 08:00:00') as startt, timestamp('2011-12-19 08:15:00') as endt union all
select 9 as id, timestamp('2011-12-19 08:30:00') as startt, timestamp('2011-12-19 08:45:00') as endt
),
se as (
  select id, startt as ts, 1 as inc
  from t
  union all
  select id, endt as ts, -1 as inc
  from t
  union all
  select null, ts, 0
  from unnest(generate_timestamp_array(timestamp('2011-12-19 06:00:00'),
                                       timestamp('2011-12-19 08:00:00'),
                                       interval 1 hour)
             ) ts
),
p as (
  select ts, (inc = 0) as col, sum(inc) as value_at,
         countif(inc = 1) as num_starts,
         sum(sum(inc)) over (order by ts, max(inc = 0) desc) as active_at,
         sum(countif(inc = 0)) over (order by ts, max(inc = 0) desc) as period_grp
  from se
  group by 1, 2
)
select period_grp, min(ts) as period,
       max(active_at) as max_in_period,
       (array_agg(active_at order by ts limit 1)[ordinal(1)] +
        sum(num_starts)
       ) as total
from p
group by period_grp;
The key idea is to split the starts and stops into separate rows with an "increment" of +1 or -1. This is then augmented with the hourly breaks that you want.
The code then does the following:
Calculates the cumulative sum of the increment to get the number of concurrent ids at each timestamp.
Calculates the "period" for each timestamp by taking a cumulative sum of the generated rows.
Then the two calculations you want are:
The max is simply the max of the concurrent count in a group by.
The total is the concurrent count at the beginning of the time period (not including any that start exactly at the beginning of the period) plus any starts during the period.
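To see just the sweep-line core in isolation - the +1/-1 increments and their running sum - here is a minimal sketch, with the t CTE abbreviated to the first three rows of the sample data:
with t as (
  select 1 as id, timestamp('2011-12-19 06:00:00') as startt, timestamp('2011-12-19 08:45:00') as endt union all
  select 2, timestamp('2011-12-19 06:15:00'), timestamp('2011-12-19 06:30:00') union all
  select 3, timestamp('2011-12-19 06:30:00'), timestamp('2011-12-19 06:45:00')
)
-- each start contributes +1, each end -1; the running sum is the concurrency
select ts, sum(sum(inc)) over (order by ts) as concurrent
from (
  select startt as ts, 1 as inc from t
  union all
  select endt, -1 from t
)
group by ts
order by ts;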
Let's start with a resultset containing all the distinct timestamps in your ev (event) table. (UNION strips duplicates.)
SELECT start t FROM ev
UNION
SELECT end t FROM ev
Next let's figure out how many sessions are active at each of these points in time. We can do that by using a JOIN to check whether each session is active at the point in time.
SELECT COUNT(*) concurrent, t.t
FROM ev
JOIN (
SELECT start t FROM ev
UNION
SELECT end t FROM ev
) t ON ev.start <= t.t AND ev.end > t.t
GROUP BY t.t
If you have many many sessions, this query can do a lot of heavy lifting. You'd be smart, in production, to restrict it by date range, and to put a compound index on (start, end).
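For example, a sketch of that compound index (the index name start_end is arbitrary; the backticks are just defensive quoting):
-- speeds up the range predicates ev.start <= t.t AND ev.end > t.t
CREATE INDEX start_end ON ev (`start`, `end`);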
Finally, group that result set by hour and take the maximum concurrency.
SELECT DATE_FORMAT(t, '%Y-%m-%d %H:00') hour_beginning,
MAX(concurrent) concurrent
FROM (
SELECT COUNT(*) concurrent, t.t
FROM ev
JOIN (
SELECT start t FROM ev
UNION
SELECT end t FROM ev
) t ON ev.start <= t.t AND ev.end > t.t
GROUP BY t.t
) q
GROUP BY DATE_FORMAT(t, '%Y-%m-%d %H:00')
Notice a couple of things.
The expression DATE_FORMAT(t, '%Y-%m-%d %H:00') gets you a timestamp that's the beginning of the hour of t.
To work perfectly, this assumes the end column in your table records the first moment the session became inactive, not the last moment the session was active. (There are two kinds of hard problems in computer science: naming things, caching things, and off-by-one errors. :-)
This is tested on MySQL. BigQuery may vary in its syntax.
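For what it's worth, here is a hedged translation of the same join approach to BigQuery Standard SQL (a sketch; it assumes the WITH t AS (...) CTE with startt/endt columns from the first answer is prepended):
-- assuming the `WITH t AS (...)` CTE from the answer above is prepended here
SELECT TIMESTAMP_TRUNC(ts, HOUR) AS hour_beginning,
  MAX(concurrent) AS concurrent
FROM (
  SELECT pts.ts, COUNT(*) AS concurrent
  FROM t
  JOIN (
    SELECT startt AS ts FROM t
    UNION DISTINCT
    SELECT endt FROM t
  ) pts
  ON t.startt <= pts.ts AND t.endt > pts.ts
  GROUP BY pts.ts
)
GROUP BY hour_beginning
ORDER BY hour_beginning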
Consider the approach below - it seems to me the simplest and least verbose:
select
timestamp_trunc(ts, hour) hour,
max(concurrent) `max`,
hll_count.merge(ids) total
from (
select ts, count(distinct id) concurrent, hll_count.init(id) ids
from `project.dataset.table`,
unnest(generate_timestamp_array(start, `end`, interval 1 minute)) ts
group by ts
)
group by hour
If applied to the sample data in your question, the output matches the expected hourly result shown above.
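Note that hll_count.init/merge gives approximate distinct counts (exact at this tiny scale, approximate on big data). If you need exact totals, a sketch of the same shape using plain arrays instead of HLL sketches:
select
  timestamp_trunc(ts, hour) hour,
  max(concurrent) `max`,
  -- flatten the per-minute id arrays for the hour and count distinct ids
  (select count(distinct id) from unnest(array_concat_agg(ids)) id) total
from (
  select ts, count(distinct id) concurrent, array_agg(distinct id) ids
  from `project.dataset.table`,
  unnest(generate_timestamp_array(start, `end`, interval 1 minute)) ts
  group by ts
)
group by hour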

Select continuous time intervals in SQL

I have a table with datetimes and I need to select continuous time intervals.
My table:
Id Time
1  2021-01-01 10:00:00
1  2021-01-01 10:01:00
1  2021-01-01 10:02:00
1  2021-01-01 10:04:00
2  2021-01-01 10:03:00
2  2021-01-01 10:04:00
2  2021-01-01 10:06:00
2  2021-01-01 10:07:00
Result I need:
id date_from           date_to
1  2021-01-01 10:00:00 2021-01-01 10:02:00
1  2021-01-01 10:04:00 2021-01-01 10:04:00
2  2021-01-01 10:03:00 2021-01-01 10:04:00
2  2021-01-01 10:06:00 2021-01-01 10:07:00
I tried it like this, but can't get it right:
select id,
min(date_from) over
(partition by id, date_to
order by id)
as date_from,
max(date_to) over
(partition by id, date_from
order by id)
as date_to
from (
select id,
MIN(time) over
(PARTITION by id,
diff2 between 0 and 60
ORDER BY id, time)
as date_from,
max(MINUTE) over
(PARTITION by id,
diff between 0 and 60
ORDER BY id, time)
as date_to
from (
select *,
unix_timestamp(date_lead) - unix_timestamp(time)
as diff,
unix_timestamp(time) - unix_timestamp(date_lag)
as diff2
from (
select id, time,
NVL(LEAD(time) over
(PARTITION by id
ORDER BY id, time), time)
as date_lead,
NVL(LAG(time) over
(PARTITION by id
ORDER BY id, time), time)
as date_lag
from my_table)
)
)
select id, MIN(time), max(time) from
  (SELECT *,
          interval '1 minute' * -1 * DENSE_RANK() OVER (PARTITION BY id ORDER by time)
            + TO_TIMESTAMP(time, 'YYYY-MM-DD HH24:MI:SS') as drank
   from event) t1
GROUP by drank, id
order by id
Assuming your time stamps are precise (no seconds or fractions of a second), you can subtract an enumerated number of minutes from the time column. This is constant for "adjacent" rows:
select id, min(time), max(time)
from (select t.*,
row_number() over (partition by id order by time) as seqnum
from t
) t
group by id, time - seqnum * interval '1 minute';
If you have seconds and fractional seconds, then you might want to adjust the logic using date_trunc(). If this is an issue, I would suggest that you ask a new question with appropriate sample data and desired results.
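For instance, a self-contained sketch (PostgreSQL) applying this to the sample data above. For id 1, the rows at 10:00, 10:01, 10:02 get seqnum 1, 2, 3, so time - seqnum * interval '1 minute' is 09:59 for all three; the row at 10:04 (seqnum 4) evaluates to 10:00 and starts a new group:
with t (id, "time") as (
  values
    (1, timestamp '2021-01-01 10:00:00'),
    (1, timestamp '2021-01-01 10:01:00'),
    (1, timestamp '2021-01-01 10:02:00'),
    (1, timestamp '2021-01-01 10:04:00'),
    (2, timestamp '2021-01-01 10:03:00'),
    (2, timestamp '2021-01-01 10:04:00'),
    (2, timestamp '2021-01-01 10:06:00'),
    (2, timestamp '2021-01-01 10:07:00')
)
select id, min("time") as date_from, max("time") as date_to
from (select t.*,
             row_number() over (partition by id order by "time") as seqnum
      from t
     ) t
-- the subtraction is constant within each run of consecutive minutes
group by id, "time" - seqnum * interval '1 minute'
order by id, date_from;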

How to fill the time gap after grouping date record for months in postgres

I have table records as -
date n_count
2020-02-19 00:00:00 4
2020-07-14 00:00:00 1
2020-07-17 00:00:00 1
2020-07-30 00:00:00 2
2020-08-03 00:00:00 1
2020-08-04 00:00:00 2
2020-08-25 00:00:00 2
2020-09-23 00:00:00 2
2020-09-30 00:00:00 3
2020-10-01 00:00:00 11
2020-10-05 00:00:00 12
2020-10-19 00:00:00 1
2020-10-20 00:00:00 1
2020-10-22 00:00:00 1
2020-11-02 00:00:00 376
2020-11-04 00:00:00 72
2020-11-11 00:00:00 1
I want to group all the records into months to find a monthly total count, which is working, but months with no records are missing from the output. How do I fill this gap?
time month_count
"2020-02-01" 4
"2020-07-01" 4
"2020-08-01" 5
"2020-09-01" 5
"2020-10-01" 26
"2020-11-01" 449
This is what I have tried.
SELECT (date_trunc('month', date))::date AS time,
sum(n_count) as month_count
FROM table1
group by time
order by time asc
You can use generate_series() to generate all starts of months between the earliest and latest date available in the table, then bring in the table with a left join:
select d.dt, coalesce(sum(t.n_count), 0) as month_count
from (
select generate_series(date_trunc('month', min(date)), date_trunc('month', max(date)), '1 month') as dt
from table1
) as d(dt)
left join table1 t on t.date >= d.dt and t.date < d.dt + interval '1 month'
group by d.dt
order by d.dt
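For reference, a quick sketch of what the inner derived table generates for your sample data - one row per month start from February through November 2020:
-- generate_series over timestamps, cast to date per row
select generate_series(
  timestamp '2020-02-01',
  timestamp '2020-11-01',
  interval '1 month'
)::date as dt;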
I would simply UNION a date series, generated from MIN and MAX date:
WITH cte AS ( -- 1
SELECT
*,
date_trunc('month', date)::date AS time
FROM
t
)
SELECT
time,
SUM(n_count) as month_count --3
FROM (
SELECT
time,
n_count
FROM cte
UNION
SELECT -- 2
generate_series(
(SELECT MIN(time) FROM cte),
(SELECT MAX(time) FROM cte),
interval '1 month'
)::date,
0
) s
GROUP BY time
ORDER BY time
Use a CTE to calculate date_trunc only once (1). It could be left out if you'd rather call your table twice in the UNION below.
Generate a monthly date series from the MIN to the MAX date, with an n_count value of 0, and add it to the table (2).
Do your calculation (3).

BigQuery: Computing the timestamp diff in time ordered rows in a group

Given a table like this, I would like to compute the time duration of each state before changing to a different state:
id state timestamp
1 1 2018-08-17 10:40:00
1 2 2018-08-17 12:40:00
1 1 2018-08-17 14:40:00
2 1 2018-08-17 09:00:00
2 2 2018-08-17 12:00:00
The output I want is:
id state date duration
1 1 2018-08-17 2 hours
1 2 2018-08-17 2 hours
1 1 2018-08-17 9 hours 20 minutes (until the end of the day in this case)
2 1 2018-08-17 3 hours
2 2 2018-08-17 12 hours (until the end of the day in this case)
I am not so sure whether this is doable in SQL. I feel like I have to write a UDF against aggregated state and timestamp (grouped by id and ordered by ts) which outputs an array of struct (id, state, date, and duration). This array can be flattened.
Below is for BigQuery Standard SQL
#standardSQL
SELECT id, state,
IFNULL(
TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY id ORDER BY ts), ts, MINUTE),
24*60 - TIMESTAMP_DIFF(ts, TIMESTAMP_TRUNC(ts, DAY), MINUTE)
) AS duration_minutes
FROM `project.dataset.table`
You can test and play with the above using the dummy data from your question:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, 1 state, TIMESTAMP('2018-08-17 10:40:00') ts UNION ALL
SELECT 1, 2, '2018-08-17 12:40:00' UNION ALL
SELECT 1, 1, '2018-08-17 14:40:00' UNION ALL
SELECT 2, 1, '2018-08-17 09:00:00' UNION ALL
SELECT 2, 2, '2018-08-17 12:00:00'
)
SELECT id, state,
IFNULL(
TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY id ORDER BY ts), ts, MINUTE),
24*60 - TIMESTAMP_DIFF(ts, TIMESTAMP_TRUNC(ts, DAY), MINUTE)
) AS duration_minutes
FROM `project.dataset.table`
-- ORDER BY id, ts
with the result as below:
Row id state duration_minutes
1 1 1 120
2 1 2 120
3 1 1 560
4 2 1 180
5 2 2 720
If you need your output formatted exactly the way you showed in the question, use the below:
#standardSQL
SELECT id, state, ts, duration_minutes,
FORMAT('%i hours %i minutes', DIV(duration_minutes, 60), MOD(duration_minutes, 60)) duration
FROM (
SELECT id, state, ts,
IFNULL(
TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY id ORDER BY ts), ts, MINUTE),
24*60 - TIMESTAMP_DIFF(ts, TIMESTAMP_TRUNC(ts, DAY), MINUTE)
) AS duration_minutes
FROM `project.dataset.table`
)
In this case your output will look like below:
Row id state ts duration_minutes duration
1 1 1 2018-08-17 10:40:00 UTC 120 2 hours 0 minutes
2 1 2 2018-08-17 12:40:00 UTC 120 2 hours 0 minutes
3 1 1 2018-08-17 14:40:00 UTC 560 9 hours 20 minutes
4 2 1 2018-08-17 09:00:00 UTC 180 3 hours 0 minutes
5 2 2 2018-08-17 12:00:00 UTC 720 12 hours 0 minutes
Sure, you will most likely still need to adjust the above to your particular case, but I think you've got a good start.
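And if you also want the date column from your desired output, a small sketch extending the same query (still assuming, as the 24*60 fallback above does, that no session crosses midnight):
#standardSQL
SELECT id, state, DATE(ts) AS date,
  IFNULL(
    TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY id ORDER BY ts), ts, MINUTE),
    24*60 - TIMESTAMP_DIFF(ts, TIMESTAMP_TRUNC(ts, DAY), MINUTE)
  ) AS duration_minutes
FROM `project.dataset.table`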

BigQuery: how to do semi left join?

I couldn't come up with a good title for this question. Sorry about that.
I have two tables A and B. Both have timestamps and shares a common ID between them. Here are schemas of both tables:
Table A:
========
a_id int,
common_id int,
ts timestamp
...
Table B:
========
b_id int,
common_id int,
ts timestamp,
temperature int
Table A is more like device data, with a row recorded whenever a device changes its status. Table B is more like IoT data, containing a temperature reading for a device every minute or so.
What I want to do is to create a Table C from these two tables. Table C would be, in essence, Table A plus the temperature at the closest time from Table B.
How can I do this purely in BigQuery SQL? The temperature info doesn't need to be precise.
The below option (for BigQuery Standard SQL) assumes that, in addition to the temperature from table b, you still need all the rest of the values from the respective row.
#standardSQL
SELECT
ARRAY_AGG(
STRUCT(a_id, a.common_id, a.ts, b_id, b.ts AS b_ts, temperature)
ORDER BY ABS(TIMESTAMP_DIFF(a.ts, b.ts, SECOND))
LIMIT 1
)[SAFE_OFFSET(0)].*
FROM `project.dataset.table_a` a
LEFT JOIN `project.dataset.table_b` b
ON a.common_id = b.common_id
AND ABS(TIMESTAMP_DIFF(a.ts, b.ts, MINUTE)) < 30
GROUP BY TO_JSON_STRING(a)
GROUP BY TO_JSON_STRING(a) groups by the entire row of table a serialized to JSON, so each row of a keeps only its single closest match from b. I smoke-tested it with the below generated dummy data:
#standardSQL
WITH `project.dataset.table_a` AS (
SELECT CAST(1000000 * RAND() AS INT64) a_id, common_id, ts
FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2018-01-01 00:00:00', '2018-01-01 23:59:59', INTERVAL 45*60 + 27 SECOND)) ts
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 10)) common_id
), `project.dataset.table_b` AS (
SELECT CAST(1000000 * RAND() AS INT64) b_id, common_id, ts, CAST(60 + 40 * RAND() AS INT64) temperature
FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2018-01-01 00:00:00', '2018-01-01 23:59:59', INTERVAL 1 MINUTE)) ts
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 10)) common_id
)
SELECT
ARRAY_AGG(
STRUCT(a_id, a.common_id, a.ts, b_id, b.ts AS b_ts, temperature)
ORDER BY ABS(TIMESTAMP_DIFF(a.ts, b.ts, SECOND))
LIMIT 1
)[SAFE_OFFSET(0)].*
FROM `project.dataset.table_a` a
LEFT JOIN `project.dataset.table_b` b
ON a.common_id = b.common_id
AND ABS(TIMESTAMP_DIFF(a.ts, b.ts, MINUTE)) < 30
GROUP BY TO_JSON_STRING(a)
with an example of a few rows from the output:
Row a_id common_id ts b_id b_ts temperature
1 276623 1 2018-01-01 00:00:00 UTC 166995 2018-01-01 00:00:00 UTC 74
2 218354 1 2018-01-01 00:45:27 UTC 464901 2018-01-01 00:45:00 UTC 87
3 265634 1 2018-01-01 01:30:54 UTC 565385 2018-01-01 01:31:00 UTC 87
4 758075 1 2018-01-01 02:16:21 UTC 55894 2018-01-01 02:16:00 UTC 84
5 306355 1 2018-01-01 03:01:48 UTC 844429 2018-01-01 03:02:00 UTC 92
6 348502 1 2018-01-01 03:47:15 UTC 375859 2018-01-01 03:47:00 UTC 90
7 774920 1 2018-01-01 04:32:42 UTC 438164 2018-01-01 04:33:00 UTC 61
Here I set table_b to have a temperature for each minute for 10 devices during the whole day of '2018-01-01', and in table_a I set the status to change every 45 min 27 sec for the same 10 devices during the same day. a_id and b_id are just random numbers between 0 and 999999.
Note: the ABS(TIMESTAMP_DIFF(a.ts, b.ts, MINUTE)) < 30 clause in the JOIN controls the period within which it is OK to look for the closest ts (in case some IoT entries are absent from table_b).
Measure the closest time by TIMESTAMP_DIFF(a.ts, b.ts, SECOND), using its absolute value to get the closest in either direction:
WITH a AS (
SELECT 1 id, TIMESTAMP('2018-01-01 11:01:00') ts
UNION ALL SELECT 1, ('2018-01-02 10:00:00')
UNION ALL SELECT 2, ('2018-01-02 10:00:00')
)
, b AS (
SELECT 1 id, TIMESTAMP('2018-01-01 12:01:00') ts, 43 temp
UNION ALL SELECT 1, TIMESTAMP('2018-01-01 12:06:00'), 47
)
SELECT *,
(SELECT temp
FROM b
WHERE a.id=b.id
ORDER BY ABS(TIMESTAMP_DIFF(a.ts,b.ts, SECOND))
LIMIT 1) temp
FROM a